mkvtree - construct index for sequence
The program mkvtree constructs an index for a given set of
sequences. These are given as a list of input files. The sequences are
referred to as database sequences. They can be over any given alphabet. The
alphabet can be the DNA alphabet, or the protein alphabet, or any other
alphabet consisting of printable characters. An alphabet is specified by a
file storing a symbol mapping. The index consists of several files, the
index files. Each such file stores a different table. The user specifies
which tables (i.e. which parts of the index) are written to file, using one
of eight output options, or a single option specifying that all tables are
written to file.
We support the following formats for the input files. They are
recognized according to the first non-whitespace symbol in the file.
• multiple FASTA format: If the file begins with
the symbol ">", then this file is considered to be a file in
multiple FASTA format (i.e. it contains one or more sequences). Each line
starting with the symbol ">" contains the description of the
sequence following it. Each line not starting with the symbol ">"
contains the sequence. Empty lines are allowed and ignored when reading the
input.
• multiple EMBL/SWISSPROT format: If the file
begins with the string "ID", then this file is considered to be a
file in multiple EMBL format (i.e. containing one or more sequences, each in
EMBL format). The information contained in the "ID" and
"DE" lines is taken as the description of the corresponding
sequence. The EMBL format is identical to the SWISSPROT format (w.r.t. the
information we need to extract from such entries). So one can also use files
in multiple SWISSPROT format as input.
• multiple GENBANK format: If the file begins with
the string "LOCUS", then this file is considered to be a file in
multiple GENBANK format (i.e. containing one or more entries in GENBANK
format). The information contained in the "LOCUS" and the
"DEFINITION" lines is taken as the description of the corresponding
sequence.
• plain format: If the file does not begin with the
symbol ">" or the strings "ID" or "LOCUS",
then the file is taken verbatim. That is, the entire file is considered to be
the input sequence (whitespace is not ignored).
No special option is necessary to tell the program the
sequence format. It automatically detects the appropriate format, according
to the rules given above. If none of the above rules apply, then the program
cannot recognize the input format and exits with error code 1. In such a
case, please check whether your input files conform to the input
formats above. Another good solution is to use a more versatile sequence
format transformation program (e.g. readseq) to first generate
multiple FASTA files and then feed these into mkvtree.
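The detection rules above can be illustrated with a short Python sketch. This is not the code used by mkvtree; the function names detect_sequence_format and parse_multiple_fasta are hypothetical and serve only to restate the rules (first non-whitespace characters decide the format; in multiple FASTA, lines starting with ">" are descriptions and empty lines are ignored).

```python
def detect_sequence_format(text):
    """Classify input by its first non-whitespace characters (illustrative only)."""
    stripped = text.lstrip()
    if stripped.startswith(">"):
        return "fasta"
    if stripped.startswith("ID"):
        return "embl"      # also covers SWISSPROT, which uses the same line codes
    if stripped.startswith("LOCUS"):
        return "genbank"
    return "plain"         # the whole file is taken verbatim as one sequence


def parse_multiple_fasta(text):
    """Return a list of (description, sequence) pairs from multiple FASTA text."""
    records = []
    description = None
    parts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # empty lines are allowed and ignored
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(parts)))
            description = line[1:].strip()
            parts = []  # anything collected before the first ">" is discarded
        else:
            parts.append(line.strip())
    if description is not None:
        records.append((description, "".join(parts)))
    return records
```

A file can be checked this way before indexing, for example to decide whether it should first be converted to multiple FASTA.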
Today many sequence files are distributed in compressed form, produced
by the program gzip. To simplify the use of these files,
mkvtree also accepts gzipped input files. These files must have the
ending ".gz". Gzipped files are gunzipped internally
and then processed like any other file.
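The same convention (decide by the ".gz" ending, then read the content as usual) can be mimicked when preparing or inspecting input files. The following is a minimal Python sketch, not the mechanism mkvtree uses internally; the helper name read_sequence_file and the UTF-8 encoding are assumptions made for illustration.

```python
import gzip


def read_sequence_file(path):
    """Return the text of a sequence file, gunzipping it if the name ends in .gz."""
    if str(path).endswith(".gz"):
        # Decompress transparently while reading in text mode.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return handle.read()
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Note that the decision is based on the file name only, so a gzipped file without the ".gz" ending would be read as raw bytes and misinterpreted.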
-db <file>
Specify database files (mandatory).
-smap <file>
Specify file containing a symbol mapping. This describes
the grouping of symbols. It is possible to set the environment variable
MKVTREESMAPDIR to the path where these files can be found.
-dna
Input is DNA sequence.
-protein
Input is protein sequence.
-indexname <string>
Specify name for index to be generated.
-pl <length>
Specify prefix length for bucket sort. Recommendation:
use this option without an argument, so that a reasonable prefix length
is determined automatically.
-tis
Output transformed input sequences (tistab) to
file.
-ois
Output original input sequences (oistab) to file.
-suf
Output suffix array (suftab) to file.
-sti1
Output reduced inverse suffix array (sti1tab) to
file.
-bwt
Output Burrows-Wheeler Transformation (bwttab) to
file.
-bck
Output bucket boundaries (bcktab) to file.
-skp
Output skip values (skptab) to file.
-lcp
Output longest common prefix lengths (lcptab) to
file.
-allout
Output all index tables to files.
-maxdepth <len>
Restrict the sorting to prefixes of the given
length.
-v
Verbose mode.
-version
Show the version of the Vmatch package.
-help
Show help.
If an error occurs, the program exits with error code 1.
Otherwise, the exit code is 0.
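As an illustration of how the options and the exit code fit together, the following Python sketch assembles an mkvtree command line and checks the exit status. The helper names build_mkvtree_command and run_mkvtree are hypothetical, as are the example file and index names. The sketch assumes that several database files may follow a single -db, and it follows the recommendation to pass -pl without an argument.

```python
import subprocess

# The eight output options listed above.
OUTPUT_OPTIONS = ("-tis", "-ois", "-suf", "-sti1", "-bwt", "-bck", "-skp", "-lcp")


def build_mkvtree_command(db_files, indexname, alphabet="-dna", tables=None, verbose=False):
    """Assemble an argument list for mkvtree (illustrative helper, not part of Vmatch)."""
    if alphabet not in ("-dna", "-protein"):
        raise ValueError("alphabet must be '-dna' or '-protein' in this sketch")
    if not db_files:
        raise ValueError("-db is mandatory: at least one database file is required")
    # Assumption: all database files are listed after one -db option.
    cmd = ["mkvtree", "-db", *db_files, alphabet, "-indexname", indexname, "-pl"]
    if tables is None:
        cmd.append("-allout")  # write all index tables
    else:
        for option in tables:
            if option not in OUTPUT_OPTIONS:
                raise ValueError("unknown output option: " + option)
        cmd.extend(tables)
    if verbose:
        cmd.append("-v")
    return cmd


def run_mkvtree(cmd):
    """Run the command; True means exit code 0, False means an error (exit code 1)."""
    return subprocess.run(cmd).returncode == 0
```

For example, build_mkvtree_command(["seqs.fna"], "myindex", tables=["-tis", "-suf", "-lcp"]) yields an argument list that requests only three tables, and run_mkvtree on that list reports whether index construction succeeded.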