minimap - fast mapping between long DNA sequences
minimap [-lSOV] [-k kmer] [-w winSize] [-I batchSize] [-d dumpFile] [-f occThres] [-r bandWidth] [-m minShared] [-c minCount] [-L minMatch] [-g maxGap] [-T dustThres] [-t nThreads] [-x preset] target.fa query.fa > output.paf
Minimap is a tool to efficiently find multiple approximate mapping
positions between two sets of long sequences, such as between reads and
reference genomes, between genomes and between long noisy reads. Minimap has
an indexing and a mapping phase. In the indexing phase, it collects all
minimizers of a large batch of target sequences in a hash table; in the
mapping phase, it identifies good clusters of colinear minimizer hits.
Minimap does not generate detailed alignments between the target and the
query sequences. It only outputs the approximate start and end
coordinates of these clusters.
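The indexing phase described above can be illustrated with a minimal sketch. It assumes that k-mers are compared as plain strings and that only the forward strand is considered; the actual ordering function and strand handling in minimap are not described here and may differ. The names `minimizers` and `build_index` are hypothetical, and the default `w=10` corresponds to 2/3 of the default k-mer length of 15.

```python
def minimizers(seq, k=15, w=10):
    """Return a sorted list of (position, k-mer) minimizers of seq.

    A minimizer is the smallest k-mer in a window of w consecutive k-mers.
    Assumption: "smallest" means lexicographically smallest string.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; ties go to the leftmost position.
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def build_index(targets, k=15, w=10):
    """Map each minimizer k-mer to the (target name, position) pairs where it occurs."""
    index = {}
    for name, seq in targets.items():
        for pos, kmer in minimizers(seq, k, w):
            index.setdefault(kmer, []).append((name, pos))
    return index
```

In the mapping phase, looking up the minimizers of a query in such a table yields the minimizer hits that are later clustered.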
- -k INT
- Minimizer k-mer length [15]
- -w INT
- Minimizer window size [2/3 of k-mer length]. A minimizer is the smallest
k-mer in a window of w consecutive k-mers.
- -I NUM
- Load at most NUM target bases into RAM for indexing [4G]. If there
are more than NUM bases in target.fa, minimap needs to read
query.fa multiple times to map it against each batch of target
sequences. NUM may end with k/K/m/M/g/G.
- -d FILE
- Dump minimizer index to FILE [no dump]
- -l
- Indicate that target.fa is in fact a minimizer index generated by
option -d, not a FASTA or FASTQ file.
- -f FLOAT
- Ignore the top FLOAT fraction of the most frequently occurring minimizers [0.001]
- -r INT
- Approximate bandwidth for initial minimizer hits clustering [500]. A
minimizer hit is a minimizer present in both the target and query
sequences. A minimizer hit cluster is a group of potentially
colinear minimizer hits between a target and a query sequence.
- -m FLOAT
- Merge initial minimizer hit clusters if a fraction of FLOAT or more of
their minimizers is shared between the clusters [0.5]
- -c INT
- Retain a minimizer hit cluster if it contains INT or more minimizer
hits [4]
- -L INT
- Discard a minimizer hit cluster if, after colinearization, the number of
matching bases is below INT [40]. This option mainly reduces the
size of output. It has little effect on the speed and peak memory.
- -g INT
- Split a minimizer hit cluster at a gap INT-bp or longer that does
not contain any minimizer hits [10000]
- -T INT
- Mask regions on query sequences with SDUST score threshold INT; 0
to disable [0]. SDUST is an algorithm to identify low-complexity
subsequences. It is not enabled by default. If SDUST is preferred, a value
between 20 and 25 is recommended. A higher threshold masks less sequence.
- -S
- Perform all-vs-all mapping. In this mode, if the query sequence name is
lexicographically larger than the target sequence name, the hits between
them will be suppressed; if the query sequence name is the same as the
target name, diagonal minimizer hits will also be suppressed.
- -O
- Drop a minimizer hit if it is far away from other hits (EXPERIMENTAL).
This option is useful for mapping long chromosomes from two diverged
species.
- -x STR
- Change multiple settings based on STR [not set]. It is
recommended to apply this option before other options, such that the
following options may override the multiple settings modified by this
option.
- ava10k
- for PacBio or Oxford Nanopore all-vs-all read mapping (-Sw5 -L100
-m0).
- -t INT
- Number of threads [3]. Minimap uses at most three threads when collecting
minimizers on target sequences, and uses up to INT+1 threads when
mapping (the extra thread is for I/O, which is frequently idle and takes
little CPU time).
- -V
- Print version number to stdout
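The suppression rule of option -S can be sketched as a small predicate. This is only an illustration under stated assumptions: names are compared as plain strings, and a "diagonal" hit is taken to mean a hit at the same coordinate on the same sequence. The function name `keep_hit` is hypothetical and is not part of minimap.

```python
def keep_hit(query_name, target_name, query_pos, target_pos):
    """Decide whether a minimizer hit survives in all-vs-all mode (-S).

    Assumptions: names are compared lexicographically as strings, and a
    diagonal hit is one with identical query and target coordinates.
    """
    if query_name > target_name:
        # Each pair of sequences is reported once, not in both directions.
        return False
    if query_name == target_name and query_pos == target_pos:
        # A sequence trivially matches itself along the diagonal.
        return False
    return True
```

With this rule, the pair (read1, read2) is kept while (read2, read1) is dropped, so every pair of sequences appears at most once in the output.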
Minimap outputs mapping positions in the Pairwise mApping Format
(PAF). PAF is a TAB-delimited text format with each line consisting of at
least 12 fields, as described in the following table:
Col | Type | Description |
1 | string | Query sequence name |
2 | int | Query sequence length |
3 | int | Query start coordinate (0-based) |
4 | int | Query end coordinate (0-based) |
5 | char | `+' if query and target on the same strand; `-' if opposite |
6 | string | Target sequence name |
7 | int | Target sequence length |
8 | int | Target start coordinate on the original strand |
9 | int | Target end coordinate on the original strand |
10 | int | Number of matching bases in the mapping |
11 | int | Number of bases, including gaps, in the mapping |
12 | int | Mapping quality (0-255 with 255 for missing) |
When the alignment is available, column 11 gives the total number
of sequence matches, mismatches and gaps in the alignment; column 10 divided
by column 11 gives the alignment identity. As minimap does not generate
detailed alignment, these two columns are approximate. PAF may optionally
have additional fields in the SAM-like typed key-value format. Minimap
writes the number of minimizer hits in a cluster to the cm tag.
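A PAF line can be read with a few lines of Python. The sketch below assumes that optional fields follow the SAM-style `key:type:value` layout (for example `cm:i:60`); the function name `parse_paf_line`, the dictionary keys, and the sample line are made up for illustration and are not part of minimap.

```python
def parse_paf_line(line):
    """Parse one PAF line into a dict; optional tags go under the 'tags' key."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 TAB-delimited fields")
    tags = {}
    for item in fields[12:]:
        # Assumed SAM-like layout: key:type:value, e.g. cm:i:60
        key, typ, value = item.split(":", 2)
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        tags[key] = value
    matches, block_len = int(fields[9]), int(fields[10])
    return {
        "query": fields[0], "query_len": int(fields[1]),
        "query_start": int(fields[2]), "query_end": int(fields[3]),
        "strand": fields[4],
        "target": fields[5], "target_len": int(fields[6]),
        "target_start": int(fields[7]), "target_end": int(fields[8]),
        "matches": matches, "block_len": block_len,
        "mapq": int(fields[11]),
        # Column 10 divided by column 11; approximate for minimap output.
        "identity": matches / block_len if block_len else 0.0,
        "tags": tags,
    }


# Made-up sample line, for illustration only.
sample = "read1\t1000\t10\t910\t+\tchr1\t50000\t2000\t2900\t450\t900\t255\tcm:i:60"
record = parse_paf_line(sample)
```

Here `record["identity"]` is column 10 divided by column 11, and `record["tags"]["cm"]` holds the number of minimizer hits in the cluster.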