mashing-pumpkins : m(in|ax)hash

Flexible-yet-pretty-fast minhash/maxhash-related library for Python >= 3.5.


This package is a library to perform top and bottom sketches, or MinHash / MaxHash sketches initially for sequences of bytes, using fixed-length subsets as the candidate components of the sketch (also called “k-minimum values” variant of MinHash).

The design of this package aims at making experimentations with such sketches easy to perform while conserving a reasonable performance profile. At the time of writing it has a very competitive performance profile both in runtime and memory usage when compared to alternatives.

Why Minhash sketches ?

Bottom-sketches (Minhash sketches) are samples of the elements present in a set. They have been extensively used for text document matching or retrieval, which can extend to the context of genomics where strings are DNA or RNA sequences. There the set of k-mers present in a genome (called “k-shingles” in MinHash-related litterature), or in the reads from a sequencing assay, and they have been shown to be useful to measure similarity between genomes[1].

Sampling subsequences from genome sequences or sequence assays has also been demonstrated to be a very efficient approach to identify DNA sequences of unknown origin[2], both in terms of accuracy and in terms of usage of bandwidth.

This is making such sketches interesting tools for the analysis of NGS data, with several implementations already available[1,3]

  1. MASH (Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.)
  2. Tapir/DNAsnout ([Gautier, Laurent, and Ole Lund. “Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads.” PloS one 8.12 (2013): e83784.](
  3. sourmash

Why this implementation ?

The purpose of this implementation is to provide a library design that is combining flexibility and expressivity with performance (speed and memory usage).


The design is allowing us to implement with a relatively short code base:

  • the use different hash functions (MurmurHash3, XXHash), and with user-specified seeds
  • Minhash and Maxhash sketches
  • “Count sketches”
  • Demonstrate quickly the comparative efficiency of alternative hashing strategies for double-stranded genomes (see -


The implementation also happens to be pretty fast, making it a reasonable option as a building block for minhash-related research and prototypes. At the time of writing it is able to build a minhash sketch (k=31, size=1000) for a FASTQ file with ~21M reads (700MB when gzip-compressed) on a laptop[1] in under 1‘30”.

$ python -m mashingpumpkins.demo.cmdline --parser=fastqandfurious --ncpu=3 DRR065801.fastq.gz
Processing DRR065801.fastq.gz as a FASTQ file...
    20853697 records in 1m20s (9.43 MB/s)
  1. ASUS ultrabook, dual-core with hyperthreading, running Linux - adding more cores to the task on more powerful hardware should make it faster


The package released under an MIT license. It contains code for the following hashing functions:

  • MurmurHash3 (public domain - author: Austin Appleby)
  • XXHash (BSD-2 license - author: Yann Collet)

Released versions are on the Python package index (pypi) and can installed with pip.

pip install mashing-pumpkins

To install the master on github:

# master on github
pip install git+https://


A C/C++ compiler as well the Python development headers will be required. The C compiler should accept the flag -std=c99.

The installation can be tested with:

python -m pytest --pyargs mashingpumpkins


python -m pytest --cov=mashingpumpkins ---cov-report term


Jupyter notebooks are available in the doc/notebooks (in the source tree).

While this is primarily a Python libray, there is also demo command line:

python -m mashingpumpkins.demo.cmdline


This requires the Python package sourmash (its dev version in master)


A bottom-sketch of maximum size maxsize is a sample of the elements in a set such as only the maxsize smallest hash values for all elements are selected (or maxsize highest hash values if a top sketch). Here the elements are the sub-sequences of length nsize in one or a collection of input sequences.

For example, building a bottom sketch just as done by MASH and sourmash can be written in few lines:

# import parts needed
from mashingpumpkins.minhashsketch import MinSketch
from mashingpumpkins.sourmash import mash_hashfun, DEFAULT_SEED

# create a min-sketch
nsize = 31
maxsize = 1000
mhs = MinSketch(nsize,

# add a sequence

# add an other sequence

If after a top-sketch, there is a class for this.

# import parts needed
from mashingpumpkins.minhashsketch import MaxSketch

Sketches of arbritary objects

While the library was designed to work on sequences of bytes for which sketches are the MinHash or MaxHash sets for all k-shingles of fixed window length, it can also work with input sets of arbitrary objects, although this requires to defined adapter hashing functions ignore parameters associated with windowing, and if the hashing function returns a non-numerical only use MaxSketch.

import hashlib
mhs = (mashingpumpkins.minhashsketch
           None,  # size is unspecified
           None,  # hashing function
           None  # seed

 x = object()

An other example is when all elements in the input set are bytes and the common hashing function SHA1 is wanted. In that case the hashing function would look like follows. Note that hashing function can return non-numerical values and the hashing function will plainly ignore the :param:`size` and :param:`buffer` as no sliding window is wanted.

def hashing_sha1(obj, size, buffer=None, seed=None):
    return (hashlib.sha1(obj).digest(), )

import hashlib
mhs = (mashingpumpkins.minhashsketch
           None,  # size is unspecified
           None  # seed

values = (hashlib.sha1(x).digest()
          for x in (b'This is a sequence of bytes',
                    b'This is an other sequence of bytes'))


Top and bottom sketches have interesting composability properties, which we represent by letting them be “added” to one another. That property can be useful for parallelization (See Parallelization utilities).

# sketch for sequencea
mhs_a = MinSketch(nsize,


# sketch for sequence_a
mhs_b = MinSketch(nsize,


# "add" the two sketches
mhs_ab = mhs_a + mhs_b

# remember our `mhs` above to which sequence_a and sequence_b
# where added ? mhs_ab contains the same sketch.

The methods are:

  • mashingpumpkins.minhashsketch.MaxSketch.__add__()
  • mashingpumpkins.minhashsketch.MaxSketch.__iadd__()
  • mashingpumpkins.minhashsketch.MaxSketch.update()
  • mashingpumpkins.minhashsketch.MinSketch.__add__()
  • mashingpumpkins.minhashsketch.MinSketch.__iadd__()
  • mashingpumpkins.minhashsketch.MinSketch.update()


The classes mashingpumpkins.minhashsketch.MaxSketch and mashingpumpkins.minhashsketch.MinSketch are designed to build sketches, and to do so efficiently they contain data structures than may be not be necessary for further use. To highlight this usage pattern we have a method freeze (mashingpumpkins.minhashsketch.MaxSketch.freeze() and mashingpumpkins.minhashsketch.MinSketch.freeze respectively) that returns an instance of a class wrapping frozenset.

“Frozen” sets have methods to compute similarity measure (e.g., two Jaccard’s index measures - see the class mashingpumpkins.minhashsketch.FrozenSketch for a full list).


Hashing functions

The package first entry point to customization is to allow the use of different hashing function. Several proposed hashing function are included, but any Python function satisfying the required interface can be used.

Hashing functions included

Extending the base classes

The sketch is a set of hash values. This means that each hash value is represented only once. However, it might be intersting to count the number of times a hash value is seen in the input data. This can be achieved relatively easily by extending one of the base sketch classes. More precisely, by overriding the methods __init__(), _anynew(), _replace(), freeze(), and update(). Each one of these method contain very little code, and the design made to allow this function to only implement additional operations (the methods call the parent class’ method, and the added part concern the update of a collections.Counter to keep track of the number of times a hash values has been seen so far).

The handling of counting is gathered in the class mashingpumpkins.minhashsketch.CountTrait and is used by mashingpumpkins.minhashsketch.MaxCountSketch and mashingpumpkins.minhashsketch.MinCountSketch.

The expected benefit is too allow exploration in child classes or other additional structures while keeping the core classes relatively lean.

Parallelization utilities

This module suggests primitives to write code performing parallel computation.

For example using multiprocessing to build in parallel a sketch for a list of large sequences (e.g., the chromosomes in a genome):

import multiprocessing
from mashingpumpkins.minhashsketch import MinSketch
from mashingpumpkins.sourmash import mash_hashfun, DEFAULT_SEED
from mashingpumpkins import parallel

ncpu = 4

ksize = 31
maxsize = 1000
# create child processes
p = multiprocessing.Pool(ncpu,
                         initargs=(MinSketch, ksize, maxsize,
                                   mash_hashfun, DEFAULT_SEED))
# map the list of sequences
result = p.imap_unordered(Sketch.map_sequence,
# reduce into a result sketch
mhs = reduce(Sketch.reduce, result,
             MinHash(ksize, maxsize, mash_hashfun, DEFAULT_SEED))
# finalize the child processes

Misc. utilities

Indices and tables