The program vmatch allows one to solve a multitude of
different matching tasks over an index constructed by mkvtree. Each
matching task is solved by a combination of options specifying
• the input,
• the kind of matches sought,
• additional constraints on the matches,
• the direction of the matches (in case of DNA),
• the kind of postprocessing to be done,
• the output mode and output format.
Additionally, if there is more than one algorithm to solve a
certain matching task, vmatch allows one to specify which algorithm
is to be used. vmatch supports computing the following kinds of
matches:
1. Match all substrings of the database sequences against
   themselves. The matches can be one of the following kinds:
   1. branching tandem repeats, i.e. repeats where the two
      instances of the repeat occur at consecutive positions
   2. maximal repeats, i.e. pairs of maximal substrings
      occurring more than once in the database sequences
   3. supermaximal repeats, i.e. pairs of maximal substrings
      occurring more than once in the database sequences, but not in any other
      maximal repeat
2. Match a set of query sequences (given in an extra
   query file) against the index. The matches can be one of the following kinds:
   1. maximal substring matches, i.e. the substrings of the
      query sequences matching substrings of the database sequences. All matches
      exceeding some minimum length, extended maximally to the left and to the right,
      are reported.
   2. maximal unique matches, i.e. the substrings of the
      query sequences matching substrings of the database sequences. A match is
      reported if it is unique in the database sequences as well as in the query
      sequences.
   3. complete matches, i.e. a query sequence must
      completely match (i.e. from the first character to the last character) a
      substring of the database sequences.
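To make the notion of a maximal unique match concrete, the following brute-force Python sketch may help. It is an illustration only and assumes nothing about how vmatch computes these matches over its index; the function names and the quadratic scan are hypothetical choices made for readability.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_unique_matches(db, query, min_len):
    """Return (db_pos, query_pos, length) for each maximal unique match."""
    mums = []
    for i in range(len(db)):
        for j in range(len(query)):
            # Skip start pairs that could still be extended to the left.
            if i > 0 and j > 0 and db[i - 1] == query[j - 1]:
                continue
            # Extend to the right as long as the characters agree.
            length = 0
            while (i + length < len(db) and j + length < len(query)
                   and db[i + length] == query[j + length]):
                length += 1
            if length < min_len:
                continue
            match = db[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(db, match) == 1 and count_occurrences(query, match) == 1:
                mums.append((i, j, length))
    return mums


print(maximal_unique_matches("GATTACA", "CATTAG", 3))
```

Dropping the two uniqueness checks turns the same loop into a sketch of maximal substring matches.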
For all these match kinds, the matches themselves can be direct or
palindromic (i.e. on the reverse strand, in case of DNA sequences). If
required, DNA sequences are translated into six reading frames and the
matches are computed on the protein level, and reported on the DNA level.
Besides exact matches, degenerate matches with a bounded number of
errors (insertions, deletions, and mismatches) are also supported. Moreover,
degenerate matches can be derived from exact matches by extending these
using a greedy extension strategy. This does not apply to complete matches.
For all different match kinds, the matches delivered by vmatch can be
selected according to their E-value, their identity value, or their match
score.
In the default case, a match is reported as a formatted row of
numbers, containing its lengths, the positions where it occurs, the E-value,
the number of errors it contains, the match score, and the identity value.
Optionally, an alignment of the sequences that are involved in the match can
be reported. An important feature of vmatch is the capability of
directly postprocessing the matches found in the following ways:
1. inverse output, i.e. report substrings of the database
   sequences or the query sequences not covered by a match
2. masking substrings of the database sequences or the
   query sequences covered by a match
3. clustering of a set of database sequences according to
   the matches found between these sequences. The output of this option can be a
   representation of the clusters, or a set of sequences each being
   representative for a cluster.
4. chaining of a set of matches, i.e. finding optimal
   subsets of all matches which do not cross
5. clustering of matches according to the pairwise
   similarities on the sequences involved in the match
6. clustering of matches according to the positions where
   they occur
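The first two postprocessing modes, inverse output and masking, can be pictured with a small Python sketch. It assumes matches are given as half-open (start, end) intervals on a single sequence and uses "N" as a hypothetical mask symbol; neither assumption is taken from vmatch itself.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_matches(seq, matches, mask_char="N"):
    """Masking: replace every position covered by a match with mask_char."""
    chars = list(seq)
    for start, end in merge_intervals(matches):
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


def uncovered_substrings(seq, matches):
    """Inverse output: the substrings of seq not covered by any match."""
    pieces, pos = [], 0
    for start, end in merge_intervals(matches):
        if start > pos:
            pieces.append(seq[pos:start])
        pos = end
    if pos < len(seq):
        pieces.append(seq[pos:])
    return pieces


seq = "ACGTACGTAC"
matches = [(2, 5), (4, 7)]
print(mask_matches(seq, matches))
print(uncovered_substrings(seq, matches))
```

Merging the intervals first keeps both functions simple when matches overlap.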
Finally, to accommodate many more kinds of user-defined
postprocessing tasks, vmatch provides the concept of selection functions.
These provide an open interface that allows arbitrary on-the-fly
postprocessing of the matches without first writing out and parsing them. For
more details on this concept, see the manual.
-q <file>
Specify files containing queries to be matched.
-dnavsprot <table>
Perform six frame translation. Specify codon translation
table by a number in the range [1,23] except for 7, 8, 17, 18, 19 and 20
(default is 1):
• 1 Standard
• 2 Vertebrate Mitochondrial
• 3 Yeast Mitochondrial
• 4 Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial;
Mycoplasma; Spiroplasma
• 5 Invertebrate Mitochondrial
• 6 Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
• 9 Echinoderm Mitochondrial
• 10 Euplotid Nuclear
• 11 Bacterial
• 12 Alternative Yeast Nuclear
• 13 Ascidian Mitochondrial
• 14 Flatworm Mitochondrial
• 15 Blepharisma Macronuclear
• 16 Chlorophycean Mitochondrial
• 21 Trematode Mitochondrial
• 22 Scenedesmus Obliquus Mitochondrial
• 23 Thraustochytrium Mitochondrial
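As an illustration of what a six frame translation produces, here is a minimal Python sketch using the standard genetic code (table 1). The frame labels "+1" to "-3", the function names, and the use of "X" for codons containing characters other than A, C, G, T are assumptions made for this example, not vmatch conventions.

```python
BASES = "TCAG"
# Standard genetic code (table 1); codons are ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base and reverse the sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate all complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[p:p + 3], "X")
                   for p in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Supporting another table from the list above would only require a different 64-character amino acid string.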
-tandem
Compute right branching tandem repeats.
-supermax
Compute supermaximal matches.
-mum
Compute maximal unique matches.
-complete
Specify that query sequences must match completely.
-dbnomatch <arg>
Show all database substrings not containing a match; optional
argument:
• keepleft means to not mask the left instance of a
match
• keepright means to not mask the right instance of
a match
• keepleftifsamesequence means to not mask the left
instance of the match if the right instance occurs in the same sequence
• keeprightifsamesequence means to not mask the
right instance of the match if the left instance occurs in the same
sequence
-qnomatch
Show all query substrings not containing a match.
-dbmaskmatch <arg>
Mask all database substrings containing a match; optional
argument:
• keepleft means to not mask the left instance of a
match
• keepright means to not mask the right instance of
a match
• keepleftifsamesequence means to not mask the left
instance of the match if the right instance occurs in the same sequence
• keeprightifsamesequence means to not mask the
right instance of the match if the left instance occurs in the same
sequence
-qmaskmatch
Mask all query substrings containing a match.
-pp
Generic postprocessing of matches.
-online
Run algorithms online without using the index.
-qspeedup <level>
Specify speedup level when matching queries (0: fast, 2:
faster; default is 2). Beware of the time/space tradeoff.
-d
Compute direct matches (default).
-p
Compute palindromic (i.e. reverse complemented)
matches.
-h <dist>
Specify the allowed Hamming distance > 0. In
combination with option -complete one can switch on the percentage
search mode or the best search mode. For the percentage search mode, use an
argument of the form ip (where i is a positive integer). This means that up to
i*m/100 mismatches (i.e. i percent of m) are allowed in a match of a query of
length m. For the best search mode, use an argument of the form ib (where i is
a positive integer). This means that in a first phase the minimum threshold q
is determined such that there is still a match with q mismatches. q is in the
range 0 to i*m/100.
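The two search modes for complete matches can be sketched in Python as follows. The sketch assumes that the threshold is i percent of the query length m, rounded down, and scans the database naively; the rounding rule and the function names are assumptions for illustration, not documented vmatch behaviour.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def complete_matches_hamming(db, query, percent):
    """Percentage search mode: all windows within the mismatch threshold."""
    m = len(query)
    max_mismatches = percent * m // 100  # assumption: i percent of m, rounded down
    hits = []
    for pos in range(len(db) - m + 1):
        distance = hamming_distance(db[pos:pos + m], query)
        if distance <= max_mismatches:
            hits.append((pos, distance))
    return hits


def best_complete_matches_hamming(db, query, percent):
    """Best search mode: keep only hits with the smallest distance q."""
    hits = complete_matches_hamming(db, query, percent)
    if not hits:
        return []
    q = min(distance for _, distance in hits)
    return [(pos, distance) for pos, distance in hits if distance == q]


db = "ACGTTCGTACGA"
print(complete_matches_hamming(db, "ACGT", 25))
print(best_complete_matches_hamming(db, "ACGT", 25))
```

With a query of length 4 and 25 percent, one mismatch is allowed, so the percentage mode reports three positions while the best mode keeps only the exact occurrence.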
-e <dist>
Specify the allowed edit distance > 0. In combination
with option -complete one can switch on the percentage search mode or
the best search mode. For the percentage search mode, use an argument of the
form ip (where i is a positive integer). This means that up to i*m/100
differences (i.e. i percent of m) are allowed in a match of a query of length
m. For the best search mode, use an argument of the form ib (where i is a
positive integer). This means that in a first phase the minimum threshold q is
determined such that there is still a match with q differences. q is in the
range 0 to i*m/100.
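The first phase of the best search mode under edit distance can be illustrated with a textbook dynamic-programming scan in which a match may start and end anywhere in the database. This is a sketch of the idea only: it assumes the limit is i percent of the query length rounded down, and it is not claimed to be the algorithm vmatch uses.

```python
def best_edit_threshold(query, db, percent):
    """Return the smallest number q of differences (insertions, deletions,
    mismatches) with which the complete query matches some substring of db,
    or None if q would exceed the allowed limit."""
    limit = percent * len(query) // 100  # assumption: i percent of m, rounded down
    # Row for the empty query prefix: a match may start anywhere at no cost.
    prev = [0] * (len(db) + 1)
    for i, qc in enumerate(query, 1):
        cur = [i]  # aligning i query characters against nothing costs i
        for j, dc in enumerate(db, 1):
            cur.append(min(prev[j - 1] + (qc != dc),  # match or mismatch
                           prev[j] + 1,               # query character unmatched
                           cur[j - 1] + 1))           # database character unmatched
        prev = cur
    best = min(prev)  # a match may end anywhere in the database
    return best if best <= limit else None


print(best_edit_threshold("ACGT", "TTACCTGG", 50))
print(best_edit_threshold("ACGT", "TTACCTGG", 10))
```

A second phase would then report all matches with exactly that many differences.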
-allmax
Show all maximal matches in the order of their
computation.
-seedlength <length>
Specify the seed length.
-hxdrop <value>
Specify the xdrop value for hamming distance
extension.
-exdrop <value>
Specify the xdrop value for edit distance
extension.
-i
Give information about number of different matches.
-dbcluster <args>
Cluster the database sequences.
• first argument is percentage of shorter string to
be included in match,
• second argument is percentage of larger string to
be included in match,
• third optional argument is filenameprefix,
• fourth optional argument is (minclustersize,
maxclustersize)
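One way to read the two percentage arguments is as a linking criterion: two sequences are joined when a match covers at least the given share of the shorter and of the longer sequence. The Python sketch below applies that reading with a union-find structure. Both the single-linkage strategy and the input format (sequence lengths plus (seq_a, seq_b, match_length) triples) are assumptions for illustration.

```python
def cluster_sequences(lengths, matches, pct_short, pct_long):
    """Group sequence numbers into clusters.

    lengths[k] is the length of sequence k; matches holds triples
    (seq_a, seq_b, match_length)."""
    parent = list(range(len(lengths)))

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, match_len in matches:
        shorter, longer = sorted((lengths[a], lengths[b]))
        # Link the two sequences only if the match covers enough of both.
        if (100 * match_len >= pct_short * shorter
                and 100 * match_len >= pct_long * longer):
            parent[find(a)] = find(b)

    clusters = {}
    for k in range(len(lengths)):
        clusters.setdefault(find(k), []).append(k)
    return sorted(clusters.values())


lengths = [100, 110, 300, 95]
matches = [(0, 1, 98), (1, 2, 100), (0, 3, 50)]
print(cluster_sequences(lengths, matches, 90, 80))
```

Picking one member per cluster, for example the longest sequence, would give a set of representatives in the spirit of a non-redundant output.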
-nonredundant
Generate file with non-redundant set of sequences; only
works together with option -dbcluster.
-selfun <file>
Specify shared object file containing selection
function.
-l <length>
Specify that match must have the given length, optionally
specify minimum and maximum size of gaps between repeat instances.
-leastscore <score>
Specify the minimum score of a match.
-evalue <value>
Specify the maximum E-value of a match.
-identity <value>
Specify minimum identity of match in range
[1..100%].
-sort <mode>
Sort the matches, additional argument is mode:
• la: ascending order of length
• ld: descending order of length
• ia: ascending order of first position
• id: descending order of first position
• ja: ascending order of second position
• jd: descending order of second position
• ea: ascending order of E-value
• ed: descending order of E-value
• sa: ascending order of score
• sd: descending order of score
• ida: ascending order of identity
• idd: descending order of identity
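The mode names follow a simple pattern: a key prefix (l, i, j, e, s, id) followed by a for ascending or d for descending. The Python sketch below decodes a mode on that basis and sorts a list of match records. The Match record and its field names are hypothetical and do not reflect the vmatch output format.

```python
from collections import namedtuple

# Hypothetical match record; the field names are illustrative only.
Match = namedtuple("Match", "length pos1 pos2 evalue score identity")

# Key prefix of a sort mode -> record field used as sort key.
SORT_FIELDS = {
    "l": "length",
    "i": "pos1",
    "j": "pos2",
    "e": "evalue",
    "s": "score",
    "id": "identity",
}


def sort_matches(matches, mode):
    """Sort matches by a mode such as 'la', 'id' or 'idd'."""
    prefix, direction = mode[:-1], mode[-1]
    if prefix not in SORT_FIELDS or direction not in "ad":
        raise ValueError(f"unknown sort mode: {mode}")
    field = SORT_FIELDS[prefix]
    return sorted(matches, key=lambda m: getattr(m, field),
                  reverse=(direction == "d"))


matches = [
    Match(10, 5, 7, 1e-3, 20, 90.0),
    Match(12, 1, 9, 1e-5, 24, 100.0),
]
for mode in ("ld", "id", "ida"):
    print(mode, sort_matches(matches, mode))
```

Splitting off the last character resolves the one ambiguous-looking case: id means first position, descending, while ida and idd refer to identity.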
-best <n>
Show the best matches (those with smallest E-values),
default is best 50.
-s
Show the alignment of matching sequences.
-showdesc
Show sequence description of match.
-f
Show filename where match occurs.
-absolute
Show absolute positions.
-nodist
Do not show distance of match.
-noevalue
Do not show E-value of match.
-noscore
Do not show score of match.
-noidentity
Do not show identity of match.
-v
Verbose mode.
-version
Show the version of the Vmatch package.
-help
Show basic options.
-help+
Show all options.