MMseqs2 - MMseqs2 (Many against Many sequence searching): fast,
parallelized protein sequence searches and clustering of huge protein
sequence data sets.
MMseqs2 (Many-against-Many sequence searching) is a software suite
to search and cluster huge protein/nucleotide sequence sets. MMseqs2 is
open source GPL-licensed software implemented in C++ for Linux, macOS, and
(as a beta version, via Cygwin) Windows. The software is designed to run on
multiple cores and servers and exhibits very good scalability. MMseqs2 can
run 10000 times faster than BLAST. At 100 times its speed it achieves almost
the same sensitivity. It can perform profile searches with the same
sensitivity as PSI-BLAST at over 400 times its speed.
The following lists the different <module> values that can
be used.
Easy workflows (for non-experts)
An example for running a command using easy-* modules would be
mmseqs easy-search <DB> <targetDB>
- easy-search
- Search with a query fasta against target fasta (or database) and return a
BLAST-compatible result in a single step
- easy-linsearch
- Linear time search with a query fasta against target fasta (or database)
and return a BLAST-compatible result in a single step
- easy-linclust
- Compute clustering of a fasta/fastq database in linear time. The workflow
outputs the representative sequences, a cluster tsv and a fasta-like
format containing all sequences.
- easy-cluster
- Compute clustering of a fasta database. The workflow outputs the
representative sequences, a cluster tsv and a fasta-like format containing
all sequences.
- easy-taxonomy
- Compute taxonomy and lowest common ancestor for each sequence. The
workflow outputs a taxonomic classification for sequences and a
hierarchical summary report.
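The easy-* workflows listed above take FASTA files directly. As a minimal sketch, assuming the usual positional form (input files, then an output name, then a temporary directory), two of them could be wrapped as shell functions. The function names and the MMSEQS override variable are hypothetical and exist only so that the calls can be dry-run with MMSEQS=echo.

```shell
#!/bin/sh
# Hypothetical wrappers around two easy-* workflows.
# MMSEQS is an assumed override variable; it defaults to the mmseqs binary.

# Search a query FASTA against a target FASTA and write BLAST-tab hits.
# args: query.fasta target.fasta result.m8 tmpdir
easy_search() {
    "${MMSEQS:-mmseqs}" easy-search "$1" "$2" "$3" "$4"
}

# Cluster a FASTA file; the workflow writes its outputs under the given prefix.
# args: input.fasta output_prefix tmpdir
easy_cluster() {
    "${MMSEQS:-mmseqs}" easy-cluster "$1" "$2" "$3"
}
```

A call such as `easy_search query.fasta target.fasta result.m8 tmp` would then run the whole search in one step. Setting `MMSEQS=echo` first prints the command line instead of executing it.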
Main tools (for non-experts)
- createdb
- Convert protein sequence set in a FASTA file to MMseqs sequence DB
format
- search
- Search with query sequence or profile DB (iteratively) through target
sequence DB
- linsearch
- Search with query sequence DB through target sequence DB
- map
- Fast ungapped mapping of query sequences to target sequences.
- cluster
- Compute clustering of a sequence DB (quadratic time)
- linclust
- Cluster sequences of >30% sequence identity *in linear time*
- createindex
- Precompute index table of sequence DB for faster searches
- createlinindex
- Precompute index for linsearch
- enrich
- Enrich a query set by searching iteratively through a profile sequence
set.
- rbh
- Find reciprocal best hits between query and target
- clusterupdate
- Update clustering of old sequence DB to clustering of new sequence DB
Utility tools for format conversions
- createtsv
- Create tab-separated flat file from prefilter DB, alignment DB, cluster
DB, or taxa DB
- convertalis
- Convert alignment DB to BLAST-tab format or specified custom-column output
format
- convertprofiledb
- Convert ffindex DB of HMM files to profile DB
- convert2fasta
- Convert sequence DB to FASTA format
- result2flat
- Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster
DB
- createseqfiledb
- Create DB of unaligned FASTA files (1 per cluster) from sequence DB and
cluster DB
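The main tools and the conversion utilities above are typically chained. The following sketch assumes the common positional argument order for createdb, search, convertalis, cluster and createtsv. The function names, the intermediate DB names (queryDB, targetDB, resultDB, seqDB, cluDB) and the MMSEQS override variable are hypothetical choices made for illustration.

```shell
#!/bin/sh
# Hypothetical step-by-step pipelines built from the modules listed above.
# MMSEQS is an assumed override variable; it defaults to the mmseqs binary.

# FASTA vs. FASTA search, converted to BLAST-tab format.
# args: query.fasta target.fasta out.m8 tmpdir
search_to_m8() {
    mm="${MMSEQS:-mmseqs}"
    "$mm" createdb "$1" queryDB &&                      # query FASTA -> sequence DB
    "$mm" createdb "$2" targetDB &&                     # target FASTA -> sequence DB
    "$mm" search queryDB targetDB resultDB "$4" &&      # alignment DB
    "$mm" convertalis queryDB targetDB resultDB "$3"    # alignment DB -> BLAST-tab
}

# Cluster a FASTA file and export the clustering as a TSV file.
# args: input.fasta out.tsv tmpdir
cluster_to_tsv() {
    mm="${MMSEQS:-mmseqs}"
    "$mm" createdb "$1" seqDB &&                        # FASTA -> sequence DB
    "$mm" cluster seqDB cluDB "$3" &&                   # cluster DB
    "$mm" createtsv seqDB seqDB cluDB "$2"              # cluster DB -> TSV
}
```

Each step stops the chain on failure via `&&`. Running with `MMSEQS=echo` prints the individual commands, which is a convenient way to check the argument order before touching real data.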
Taxonomy tools
- taxonomy
- Compute taxonomy and lowest common ancestor for each sequence.
- createtaxdb
- Annotates a sequence database with NCBI taxonomy information
- addtaxonomy
- Add taxonomy information to result database.
- lca
- Compute the lowest common ancestor from a set of taxa.
- taxonomyreport
- Create Kraken-style taxonomy report.
- filtertaxdb
- Filter taxonomy database.
Multi-hit search tools
- multihitdb
- Create sequence database and associated metadata for multi hit
searches
- multihitsearch
- Search with a grouped set of sequences against another grouped set
- besthitperset
- For each set of sequences compute the best element and update the
p-value
- combinepvalperset
- For each set compute the combined p-value
- summerizeresultsbyset
- For each set compute summary statistics, such as spread-pvalue etc.
- resultsbyset
- For each set compute the combined p-value
- mergeresultsbyset
- Merge results from multiple ORFs back to their respective contig
Utility tools for clustering
- mergeclusters
- Merge multiple cluster DBs into single cluster DB
Core tools (for advanced users)
- prefilter
- Search with query sequence / profile DB through target DB (k-mer matching
+ ungapped alignment)
- ungappedprefilter
- Search with query sequence / profile DB through target DB and compute
optimal ungapped alignment score
- align
- Compute Smith-Waterman alignments for previous results (e.g. prefilter DB,
cluster DB)
- alignall
- Compute all-against-all Smith-Waterman alignments for a result DB (e.g.
prefilter DB, cluster DB)
- transitivealign
- Transfers alignments by transitivity via a center star alignment
- clust
- Cluster sequence DB from alignment DB (e.g. created by searching DB against
itself)
- kmermatcher
- Finds exact $k$-mer matches between sequences
- kmersearch
- Search with query sequence through target DB. (k-mer matching)
- kmerindexdb
- Finds exact $k$-mer matches between sequences and stores them as
index
- clusthash
- Cluster sequences of same length and >90% sequence identity *in linear
time*
Utility tools to manipulate DBs
- compress
- Compresses a database.
- decompress
- Decompresses a database.
- apply
- Passes each input database entry to stdin of the specified program, executes
it and writes its stdout to the output database.
- Extract open reading frames from all six frames from nucleotide sequence
DB
- Extract reading frames from a nucleotide sequence DB
- orftocontig
- Obtain location information of extracted ORFs with respect to their
contigs in alignment format
- reverseseq
- Reverse each sequence in a DB
- touchdb
- Memory map database
- translatenucs
- Translate nucleotide sequence DB into protein sequence DB
- translateaa
- Translate protein sequence into nucleotide sequence DB
- swapresults
- Reformat prefilter or alignment DB as if target DB had been searched
through query DB
- swapdb
- Create a DB where the key is from the first column of the input result
DB
- mergedbs
- Merge multiple DBs into a single DB, based on IDs (names) of entries
- splitdb
- Split an MMseqs DB into multiple DBs
- splitsequence
- Split sequences by length
- subtractdbs
- Generate a DB with entries of first DB not occurring in second DB
- filterdb
- Filter a DB by conditioning (regex, numerical, ...) on one of its
whitespace-separated columns
- createsubdb
- Create a subset of a DB from a file of IDs of entries
- view
- Prints entries to console
- rmdb
- Removes the database
- mvdb
- Move the database
- result2profile
- Compute profile and consensus DB from a prefilter, alignment or cluster
DB
- result2pp
- Merge the query profiles with target profiles according to search results
and outputs an enriched profile DB
- result2rbh
- Filter a merged result DB to retain only reciprocal best hits
- result2msa
- Generate MSAs for queries by locally aligning their matched targets in
prefilter/alignment/cluster DB
- convertmsa
- Turns an MSA file into an MSA database.
- msa2profile
- Turns an MSA database into an MMseqs profile database.
- profile2pssm
- Converts a profile database into a human readable tab-separated PSSM
file.
- profile2cs
- Converts a profile database into a column state sequence.
- result2stats
- Compute statistics for each entry in a sequence, prefilter, alignment or
cluster DB
- proteinaln2nucl
- Map protein alignment to nucleotide alignment
- tsv2db
- Turns a TSV file into an MMseqs database
- result2repseq
- Get representative sequences for a result database
Special-purpose utilities
- rescorediagonal
- Compute sequence identity for diagonal
- alignbykmer
- Predict sequence identity, score, alignment start and end by kmer
alignment
- diffseqdbs
- Find IDs of sequences kept, added and removed between two versions of
sequence DB
- concatdbs
- Concatenate two DBs, giving new IDs to entries from second input DB
- sortresult
- Sort a result database in the same order as prefilter or align would.
- summarizealis
- Summarize alignment results into a single line showing unique coverage,
coverage and average sequence identity
- summarizeresult
- Extract annotations from alignment DB
- summarizetabs
- Extract annotations from HHblits BLAST-tab-formatted results
- gff2db
- Turn a gff3 (generic feature format) file into a gff3 DB
- masksequence
- Soft-mask sequences using tantan: low-complexity regions in lower case, the
rest in upper case
- maskbygff
- X out sequence regions in a sequence DB by features in a gff3 file
- prefixid
- For each entry in a DB prepend the entry ID to the entry itself
- suffixid
- For each entry in a DB append the entry ID to the entry itself
- convertkb
- Convert UniProt knowledge base files into MMseqs2 database format for the
selected column types
- Return a new summarized header DB from the UniProt headers of a cluster
DB
- Extract aligned sequence region from query
- extractdomains
- Extract highest scoring alignment region for each sequence from BLAST-tab
file
- convertca3m
- Converts a cA3M database into an MMseqs2 result database.
- expandaln
- Expands an alignment result based on another.
- countkmer
- Simple k-mer counter that prints the numeric representation, the
alphanumeric representation and the k-mer count