MASH-DIST(1)   MASH-DIST(1)

mash-dist - estimate the distance of query sequences to references

mash dist [options] <reference> <query> [<query>] ...

Estimate the distance of each query sequence to the reference. Both the reference and the queries can be FASTA or FASTQ files, gzipped or not, or Mash sketch files (.msh) with matching k-mer sizes. Query files can also be files of file names (see -l). Whole files are compared by default (see -i). The output fields are [reference-ID, query-ID, distance, p-value, shared-hashes].
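The five output fields can be read into a small record for downstream filtering. The following is a minimal sketch, not part of Mash: it assumes the fields are tab-separated and that shared-hashes is written as "matching/total". The sample line and the names DistRecord and parse_dist_line are hypothetical.

```python
from typing import NamedTuple


class DistRecord(NamedTuple):
    reference_id: str
    query_id: str
    distance: float
    p_value: float
    shared: int   # hashes shared by both sketches (assumed "shared/total" format)
    total: int    # hashes compared


def parse_dist_line(line: str) -> DistRecord:
    """Parse one line of output, assuming five tab-separated fields."""
    ref, query, dist, pval, hashes = line.rstrip("\n").split("\t")
    shared, total = hashes.split("/")
    return DistRecord(ref, query, float(dist), float(pval), int(shared), int(total))


# Hypothetical sample line, for illustration only.
sample = "ref.fna\tquery.fna\t0.0222766\t0\t456/1000"
rec = parse_dist_line(sample)
print(rec.query_id, rec.distance, rec.shared, rec.total)
```

A record like this makes it easy to apply a distance or p-value cutoff after the fact, similar in spirit to -d and -v.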

-h

Help

-p <int>

Parallelism. This many threads will be spawned for processing. [1]

-l

List input. Each query file contains a list of sequence files, one per line. The reference file is not affected.

-t

Table output. P-values are not reported, but fields will be blank if they do not meet the p-value threshold.

-v <num>

Maximum p-value to report. (0-1) [1.0]

-d <num>

Maximum distance to report. (0-1) [1.0]

-k <int>

K-mer size. Hashes will be based on strings of this many nucleotides. Canonical nucleotides are used by default (see Alphabet options below). (1-32) [21]

-s <int>

Sketch size. Each sketch will have at most this many non-redundant min-hashes. [1000]
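The idea of a sketch holding "at most this many non-redundant min-hashes" can be illustrated as follows. This is a simplified sketch under stated assumptions, not Mash's implementation: it uses BLAKE2b from the Python standard library as a stand-in hash (Mash uses a different hash function), ignores canonical k-mers, and the function names are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Stand-in 64-bit hash; not the hash function Mash uses."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Return the s smallest distinct hash values (fewer if the input is small)."""
    distinct = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(distinct)[:s]


small = sketch("ACGTACGTTGCA", k=4, s=5)
full = sketch("ACGTACGTTGCA", k=4, s=1000)
print(len(small), len(full))
```

The second call shows why the size is an upper bound: a short input has fewer distinct k-mers than the requested sketch size.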

-i

Sketch individual sequences, rather than whole files.

-w <num>

Probability threshold for warning about low k-mer size. (0-1) [0.01]

-r

Input is a read set. See Reads options below. Incompatible with -i.

-b <size>

Use a Bloom filter of this size (raw bytes or with K/M/G/T) to filter out unique k-mers. This is useful if exact filtering with -m uses too much memory. However, some unique k-mers may pass erroneously, and copies cannot be counted beyond 2. Implies -r.

-m <int>

Minimum copies of each k-mer required to pass noise filter for reads. Implies -r. [1]
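The noise filter can be pictured as exact k-mer counting followed by a threshold. The following is an illustrative sketch only, assuming plain in-memory counting; the function name filter_kmers and the example reads are hypothetical and do not reflect how Mash stores counts internally.

```python
from collections import Counter


def filter_kmers(reads, k, min_copies):
    """Keep k-mers seen at least min_copies times across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_copies}


# Two hypothetical reads that share the k-mer ACGT.
reads = ["ACGTA", "ACGTT"]
kept_all = filter_kmers(reads, k=4, min_copies=1)
kept_repeated = filter_kmers(reads, k=4, min_copies=2)
print(sorted(kept_all), sorted(kept_repeated))
```

With a threshold of 2, k-mers that occur only once (often sequencing errors in real read sets) are dropped before sketching.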

-c <num>

Target coverage. Sketching will conclude if this coverage (estimated by average k-mer multiplicity) is reached before the end of the input file. Implies -r.

-g <size>

Genome size. If specified, this will be used for p-value calculation instead of a size estimated from k-mer content. Implies -r.

-n

Preserve strand (by default, strand is ignored by using canonical DNA k-mers, which are alphabetical minima of forward-reverse pairs). Implied if an alphabet is specified with -a or -z.
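The "alphabetical minima of forward-reverse pairs" can be shown with a short sketch. It assumes an uppercase ACGT alphabet; the helper names are hypothetical and this is not Mash's code.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the alphabetically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# A k-mer and its reverse complement map to the same canonical form.
print(canonical("TTGA"), canonical("TCAA"))
```

Because both strands yield the same canonical k-mer, the resulting hashes do not depend on which strand was sequenced. With -n, the k-mer would be used as read instead.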

-a

Use amino acid alphabet (A-Z, except BJOUXZ). Implies -n, -k 9.

-z <text>

Alphabet to base hashes on (case ignored by default; see -Z). K-mers with other characters will be ignored. Implies -n.

-Z

Preserve case in k-mers and alphabet (case is ignored by default). Sequence letters whose case is not in the current alphabet will be skipped when sketching.
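The alphabet options can be sketched as a filter on k-mers. This is an illustration under assumptions, not Mash's implementation: it drops any k-mer containing a character outside the alphabet, folds case unless told otherwise, and uses hypothetical names (AMINO_ACIDS, kmers_in_alphabet, preserve_case).

```python
import string

# The amino acid alphabet described for -a: A-Z except B, J, O, U, X, Z.
AMINO_ACIDS = "".join(c for c in string.ascii_uppercase if c not in "BJOUXZ")


def kmers_in_alphabet(seq, k, alphabet, preserve_case=False):
    """Return k-mers made only of alphabet characters, in input order."""
    if not preserve_case:
        # Case is ignored by default, so fold both sides to uppercase.
        seq = seq.upper()
        alphabet = alphabet.upper()
    allowed = set(alphabet)
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= allowed
    ]


folded = kmers_in_alphabet("acgNtac", 2, "ACGT")
strict = kmers_in_alphabet("acgNtac", 2, "ACGT", preserve_case=True)
print(AMINO_ACIDS, folded, strict)
```

With case preserved, the lowercase letters in the example are not in the uppercase alphabet, so no k-mers survive.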

mash(1)

2019-12-13