mash-dist - estimate the distance of query sequences to references
mash dist [options] <reference> <query> [<query>] ...
Estimate the distance of each query sequence to the reference.
Both the reference and queries can be fasta or fastq, gzipped or not, or
Mash sketch files (.msh) with matching k-mer sizes. Query files can also be
files of file names (see -l). Whole files are compared by default
(see -i). The output fields are [reference-ID, query-ID, distance,
p-value, shared-hashes].
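
As an illustration of the five output fields listed above, the following is a minimal Python sketch for reading one line of output. It assumes the fields are tab-separated and keeps the shared-hashes field as a raw string; the record type, the function name, and the sample line are hypothetical and not taken from Mash itself.

```python
from typing import NamedTuple


class DistRecord(NamedTuple):
    """One line of output, in the documented field order."""
    reference_id: str
    query_id: str
    distance: float
    p_value: float
    shared_hashes: str  # kept as raw text; its exact format is not assumed


def parse_dist_line(line: str) -> DistRecord:
    """Split one output line into its five fields (tab-separated is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    ref, query, dist, pval, shared = fields
    return DistRecord(ref, query, float(dist), float(pval), shared)


# Made-up sample line, for illustration only.
sample = "ref.fna\tquery.fna\t0.0222766\t0\t456/1000"
record = parse_dist_line(sample)
```

A caller could then filter records on `record.distance` or `record.p_value`, much as -d and -v do at the command line.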
-h
Help
-p <int>
Parallelism. This many threads will be spawned for
processing. [1]
-l
List input. Each query file contains a list of sequence
files, one per line. The reference file is not affected.
-t
Table output (will not report p-values, but fields will
be blank if they do not meet the p-value threshold).
-v <num>
Maximum p-value to report. (0-1) [1.0]
-d <num>
Maximum distance to report. (0-1) [1.0]
-k <int>
K-mer size. Hashes will be based on strings of this many
nucleotides. Canonical nucleotides are used by default (see -n, -a, -z, and
-Z below). (1-32) [21]
-s <int>
Sketch size. Each sketch will have at most this many
non-redundant min-hashes. [1000]
-i
Sketch individual sequences, rather than whole
files.
-w <num>
Probability threshold for warning about low k-mer size.
(0-1) [0.01]
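
To make the roles of -k and -s concrete, here is a small Python sketch of a bottom-s min-hash sketch: hash every k-mer, drop duplicates, and keep at most the s smallest hash values. This is a simplified illustration under stated assumptions, not Mash's implementation: the hash function (truncated SHA-1) is a stand-in chosen for portability, strand is preserved (no canonical k-mers), and all function names are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Stand-in hash (assumption): first 8 bytes of SHA-1 as an integer."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k, s):
    """Return at most s distinct hash values, smallest first."""
    distinct = {hash_kmer(km) for km in kmers(seq.upper(), k)}
    return sorted(distinct)[:s]


small = sketch("ACGTACGTAC", k=4, s=3)     # capped at 3 hashes
full = sketch("ACGTACGTAC", k=4, s=1000)   # fewer distinct k-mers than s
```

When the input has fewer distinct k-mers than s, the sketch is simply shorter, which matches the "at most this many" wording of -s.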
-r
Input is a read set. See -b, -m, -c, and -g below.
Incompatible with -i.
-b <size>
Use a Bloom filter of this size (raw bytes or with
K/M/G/T) to filter out unique k-mers. This is useful if exact filtering with
-m uses too much memory. However, some unique k-mers may pass
erroneously, and copies cannot be counted beyond 2. Implies -r.
-m <int>
Minimum copies of each k-mer required to pass noise
filter for reads. Implies -r. [1]
-c <num>
Target coverage. Sketching will stop early if this coverage
(estimated by average k-mer multiplicity) is reached before the end of the
input file. Implies -r.
-g <size>
Genome size. If specified, will be used for p-value
calculation instead of an estimated size from k-mer content. Implies
-r.
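
The exact noise filter described for -m can be pictured as counting every k-mer across the reads and keeping only those seen at least the minimum number of times. The Python sketch below illustrates that idea only; it is not Mash's code, it does not canonicalize k-mers, and the function name is hypothetical.

```python
from collections import Counter


def filter_kmers(reads, k, min_copies):
    """Return the set of k-mers that occur at least min_copies times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_copies}


reads = ["ACGTA", "ACGTT"]
kept_all = filter_kmers(reads, k=4, min_copies=1)     # no filtering
kept_twice = filter_kmers(reads, k=4, min_copies=2)   # drops k-mers seen once
```

Holding an exact count for every k-mer is what makes this approach memory-hungry on large read sets, which is the situation the -b Bloom filter is described as addressing.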
-n
Preserve strand (by default, strand is ignored by using
canonical DNA k-mers, which are alphabetical minima of forward-reverse pairs).
Implied if an alphabet is specified with -a or -z.
-a
Use amino acid alphabet (A-Z, except BJOUXZ). Implies
-n, -k 9.
-z <text>
Alphabet to base hashes on (case ignored by default; see
-Z). K-mers with other characters will be ignored. Implies
-n.
-Z
Preserve case in k-mers and alphabet (case is ignored by
default). Sequence letters whose case is not in the current alphabet will be
skipped when sketching.
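
The canonical DNA k-mers mentioned under -n can be illustrated with a short Python sketch: for each k-mer, take the alphabetical minimum of the k-mer and its reverse complement, so both strands map to the same string. The function names are hypothetical, and only uppercase A, C, G, and T are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Alphabetical minimum of the forward and reverse-complement k-mer."""
    return min(kmer, reverse_complement(kmer))


forward = canonical("TTGA")
reverse = canonical(reverse_complement("TTGA"))
```

Both calls return the same string, which is why strand is ignored unless -n is given.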