cmsearch - search covariance model(s) against a sequence
database
cmsearch [options] <cmfile>
<seqdb>
cmsearch is used to search one or more covariance models
(CMs) against a sequence database. For each CM in <cmfile>, use
that query CM to search the target database of sequences in
<seqdb>, and output ranked lists of the sequences with the most
significant matches to the CM. To build CMs from multiple alignments, see
cmbuild.
The query <cmfile> must have been calibrated for
E-values with cmcalibrate. As a special exception, any models in
<cmfile> that have zero basepairs need not be calibrated. For
these models, profile HMM search algorithms will be used instead of CM ones,
as discussed further below.
The query <cmfile> may be '-' (a dash character), in
which case the query CM input will be read from a <stdin> pipe instead
of from a file. The <seqdb> may not be '-' because the current
implementation needs to be able to rewind the database, which is not
possible with stdin input.
The output format is designed to be human-readable, but is often
so voluminous that reading it is impractical, and parsing it is a pain. The
--tblout option saves output in a simple tabular format that is
concise and easier to parse. The -o option allows redirecting the
main output, including throwing it away in /dev/null.
cmsearch reexamines the 5' and 3' termini of target
sequences using specialized algorithms for detection of truncated
hits, in which part of the 5' and/or 3' end of the actual full length
homologous sequence is missing in the target sequence file. These types of
hits will be most common in sequence files consisting of unassembled
sequencing reads. By default, any 5' truncated hit is required to include
the first residue of the target sequence it derives from in
<seqdb>, and any 3' truncated hit is required to include the
final residue of the target sequence it derives from. Any 5' and 3'
truncated hit must include the first and final residue of the target
sequence it derives from. The --anytrunc option will relax the
requirements for hit inclusion of sequence endpoints, and truncated hits are
allowed to start and stop at any positions of target sequences. Importantly
though, with --anytrunc, hit E-values will be less accurate because
model calibration does not consider the possibility of truncated hits, so
use it with caution. The --notrunc option can be used to turn off
truncated hit detection. --notrunc will reduce the running time of
cmsearch, most significantly for target <seqdb> files
that include many short sequences.
Truncated hit detection is automatically turned off when the
--max, --nohmm, --qdb, or --nonbanded options
are used because it relies on the use of an accelerated HMM banded alignment
strategy that is turned off by any of those options.
- -h
- Help; print a brief reminder of command line usage and all available
options.
- -g
- Turn on the glocal alignment algorithm, global with respect to the
query model and local with respect to the target database. By default, the
local alignment algorithm is used which is local with respect to both the
target sequence and the model. In local mode, the alignment to span two or
more subsequences if necessary (e.g. if the structures of the query model
and target sequence are only partially shared), allowing certain large
insertions and deletions in the structure to be penalized differently than
normal indels. Local mode performs better on empirical benchmarks and is
significantly more sensitive for remote homology detection. Empirically,
glocal searches return many fewer hits than local searches, so glocal may
be desired for some applications. With -g, all models must be
calibrated, even those with zero basepairs.
- -Z <x>
- Calculate E-values as if the search space size was <x>
megabases (Mb). Without the use of this option, the search space size is
defined as the total number of nucleotides in <seqdb> times
2, because both strands of each target sequence will be searched.
- --devhelp
- Print help, as with -h , but also include expert options that are
not displayed with -h . These expert options are not expected to be
relevant for the vast majority of users and so are not described in the
manual page. The only resources for understanding what they actually do
are the brief one-line descriptions output when --devhelp is
enabled, and the source code.
- -o <f>
- Direct the main human-readable output to a file <f> instead
of the default stdout.
- -A <f>
- Save a multiple alignment of all significant hits (those satisfying
inclusion thresholds) to the file <f>.
- --tblout
<f>
- Save a simple tabular (space-delimited) file summarizing the hits found,
with one data line per hit. The format of this file is described in
section 6 of the Infernal user guide.
- --acc
- Use accessions instead of names in the main output, where available for
profiles and/or sequences.
- --noali
- Omit the alignment section from the main output. This can greatly reduce
the output volume.
- --notextw
- Unlimit the length of each line in the main output. The default is a limit
of 120 characters per line, which helps in displaying the output cleanly
on terminals and in editors, but can truncate target profile description
lines.
- --textw
<n>
- Set the main output's line length limit to <n> characters per
line. The default is 120.
- --verbose
- Include extra search pipeline statistics in the main output, including
filter survival statistics for truncated hit detection and number of
envelopes discarded due to matrix size overflows.
Reporting thresholds control which hits are reported in output
files (the main output and --tblout) Hits are ranked by statistical
significance (E-value). By default, all hits with an E-value <= 10 are
reported. The following options allow you to change the default E-value
reporting thresholds, or to use bit score thresholds instead.
- -E <x>
- In the per-target output, report target sequences with an E-value of <=
<x>. The default is 10.0, meaning that on average, about 10
false positives will be reported per query, so you can see the top of the
noise and decide for yourself if it's really noise.
- -T <x>
- Instead of thresholding per-CM output on E-value, report target sequences
with a bit score of >= <x>.
Inclusion thresholds are stricter than reporting thresholds.
Inclusion thresholds control which hits are considered to be reliable enough
to be included in an output alignment or in a possible subsequent search
round, or marked as significant ("!") as opposed to questionable
("?") in hit output.
- --incE
<x>
- Use an E-value of <= <x> as the hit inclusion threshold.
The default is 0.01, meaning that on average, about 1 false positive would
be expected in every 100 searches with different query sequences.
- --incT
<x>
- Instead of using E-values for setting the inclusion threshold, instead use
a bit score of >= <x> as the hit inclusion threshold. By
default this option is unset.
Curated CM databases may define specific bit score thresholds for
each CM, superseding any thresholding based on statistical significance
alone.
To use these options, the profile must contain the appropriate
(GA, TC, and/or NC) optional score threshold annotation; this is picked up
by cmbuild from Stockholm format alignment files. Each thresholding
option has a score of <x> bits, and acts as if -T
<x> --incT <x> has been applied specifically
using each model's curated thresholds.
- --cut_ga
- Use the GA (gathering) bit scores in the model to set hit reporting and
inclusion thresholds. GA thresholds are generally considered to be the
reliable curated thresholds defining family membership; for example, in
Rfam, these thresholds define what gets included in Rfam Full alignments
based on searches with Rfam Seed models.
- --cut_nc
- Use the NC (noise cutoff) bit score thresholds in the model to set hit
reporting and inclusion thresholds. NC thresholds are generally considered
to be the score of the highest-scoring known false positive.
- --cut_tc
- Use the TC (trusted cutoff) bit score thresholds in the model to set hit
reporting and inclusion thresholds. TC thresholds are generally considered
to be the score of the lowest-scoring known true positive that is above
all known false positives.
Infernal 1.1 searches are accelerated in a six-stage filter
pipeline. The first five stages use a profile HMM to define envelopes that
are passed to the stage six CM CYK filter. Any envelopes that survive all
filters are assigned final scores using the the CM Inside algorithm. (See
the user guide for more information.)
The profile HMM filter is built by the cmbuild program and
is stored in <cmfile>.
Each successive filter is slower than the previous one, but better
than it at disciminating between subsequences that may contain high-scoring
CM hits and those that do not. The first three HMM filter stages are the
same as those used in HMMER3. Stage 1 (F1) is the local HMM SSV filter
modified for long sequences. Stage 2 (F2) is the local HMM Viterbi filter.
Stage 3 (F3) is the local HMM Forward filter. Each of the first three stages
uses the profile HMM in local mode, which allows a target subsequence to
align to any region of the HMM. Stage 4 (F4) is a glocal HMM filter, which
requires a target subsequence to align to the full-length profile HMM. Stage
5 (F5) is the glocal HMM envelope definition filter, which uses HMMER3's
domain identification heursitics to define envelope boundaries. After each
stage from 2 to 5 a bias filter step (F2b, F3b, F4b, and F5b) is used to
remove sequences that appear to have passed the filter due to biased
composition alone. Any envelopes that survive stages F1 through F5b are then
passed with the local CM CYK filter. The CYK filter uses constraints (bands)
derived from an HMM alignment of the envelope to reduce the number of
required calculations and save time. Any envelopes that pass CYK are scored
with the local CM Inside algorithm, again using HMM bands for
acceleration.
The default filter thresholds that define the minimum score
required for a subsequence to survive each stage are defined based on the
size of the database in <seqdb> (or the size <x>
in megabases (Mb) specified by the -Z <x> or
--FZ <x> options). For larger databases, the filters are
more strict leading to more acceleration but potentially a greater loss of
sensitivity. The rationale is that for larger databases, hits must have
higher scores to achieve statistical significance, so stricter filtering
that removes lower scoring insignificant hits is acceptable.
The P-value thresholds for all possible search space sizes and all
filter stages are listed next. (A P-value threshold of 0.01 means that
roughly 1% of the highest scoring nonhomologous subsequence are expected to
pass the filter.) Z is defined as the number of nucleotides in the complete
target sequence file times 2 because both strands will be searched with each
model.
If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b,
F4, F4b and F5 are 0.02; F6 is 0.0001.
If Z is between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off;
F3, F3b, F4, F4b and F5 are 0.005; F6 is 0.0001.
If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15;
F3, F3b, F4, F4b and F5 are 0.003; F6 is 0.0001.
If Z is between 200 Mb and 2 Gb: F1 is 0.15; F2 and F2b are 0.15;
F3, F3b, F4, F4b, F5, and F5b are 0.0008; and F6 is 0.0001.
If Z is between 2 Gb and 20 Gb: F1 is 0.15; F2 and F2b are 0.15;
F3, F3b, F4, F4b, F5, and F5b are 0.0002; and F6 is 0.0001.
If Z is more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b,
F4, F4b, F5, and F5b are 0.0002; and F6 is 0.0001.
These thresholds were chosen based on performance on an internal
benchmark testing many different possible settings.
There are five options for controlling the general filtering
level. These options are, in order from least strict (slowest but most
sensitive) to most strict (fastest but least sensitive): --max,
--nohmm, --mid, --default, (this is the default
setting), --rfam. and --hmmonly. With --default the
filter thresholds will be database-size dependent. See the explanation of
each of these individual options below for more information.
Additionally, an expert user can precisely control each filter
stage score threshold with the --F1, --F1b, --F2,
--F2b, --F3, --F3b, --F4, --F4b,
--F5, --F5b, and --F6 options. As well as turn each
stage on or off with the --noF1, --doF1b, --noF2,
--noF2b, --noF3, --noF3b, --noF4,
--noF4b, --noF5, and --noF6. options. These options are
only displayed if the --devhelp option is used to keep the number of
displayed options with -h reasonable, and because they are only
expected to be useful to a small minority of users.
As a special case, for any models in <cmfile> which
have zero basepairs, profile HMM searches are run instead of CM searches.
HMM algorithms are more efficient than CM algorithms, and the benefit of CM
algorithms is lost for models with no secondary structure (zero basepairs).
These profile HMM searches will run significantly faster than the CM
searches. You can force HMM-only searches with the --hmmonly option.
For more information on HMM-only searches see the description of the
--hmmonly option below, and the user guide.
- --max
- Turn off all filters, and run non-banded Inside on every full-length
target sequence. This increases sensitivity somewhat, at an extremely
large cost in speed.
- --nohmm
- Turn off all HMM filter stages (F1 through F5b). The CYK filter, using
QDBs, will be run on every full-length target sequence and will enforce a
P-value threshold of 0.0001. Each subsequence that survives CYK will be
passed to Inside, which will also use QDBs (but a looser set). This
increases sensitivity somewhat, at a very large cost in speed.
- --mid
- Turn off the HMM SSV and Viterbi filter stages (F1 through F2b). Set
remaining HMM filter thresholds (F3 through F5b) to 0.02 by default, but
changeable to <x> with --Fmid <x>
sequence. This may increase sensitivity, at a significant cost in speed.
- --default
- Use the default filtering strategy. This option is on by default. The
filter thresholds are determined based on the database size.
- --rfam
- Use a strict filtering strategy devised for large databases (more than 20
Gb). This will accelerate the search at a potential cost to sensitivity.
It will have no effect if the database is larger than 20 Gb.
- --hmmonly
- Only use the filter profile HMM for searches, do not use the CM. Only
filter stages F1 through F3 will be executed, using strict P-value
thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3). Additionally a
bias composition filter is used after the F1 stage (with P=0.02 survival
threshold). Any hit that survives all stages and has an HMM E-value or bit
score above the reporting threshold will be output. The user can change
the HMM-only filter thresholds and options with --hmmF1,
--hmmF2, --hmmF3, --hmmnobias, --hmmnonull2,
and --hmmmax. By default, searches for any model with zero
basepairs will be run in HMM-only mode. This can be turned off, forcing CM
searches for these models with the --nohmmonly option. These
options are only displayed if the --devhelp option is used.
- --FZ
<x>
- Set filter thresholds as the defaults used if the database were
<x> megabases (Mb). If used with <x> greater
than 20000 (20 Gb) this option has the same effect as --rfam.
- --Fmid
<x>
- With the --mid option set the HMM filter thresholds (F3 through
F5b) to <x>. By default, <x> is 0.02.
- --notrunc
- Turn off truncated hit detection.
- --anytrunc
- Allow truncated hits to begin and end at any position in a target
sequence. By default, 5' truncated hits must include the first residue of
their target sequence and 3' truncated hits must include the final residue
of their target sequence. With this option you may observe fewer full
length hits that extend to the beginning and end of the query CM.
- --nonull3
- Turn off the null3 CM score corrections for biased composition. This
correction is not used during the HMM filter stages.
- --mxsize
<x>
- Set the maximum allowable CM DP matrix size to <x> megabytes.
By default this size is 128 Mb. This should be large enough for the vast
majority of searches, especially with smaller models. If cmsearch
encounters an envelope in the CYK or Inside stage that requires a larger
matrix, the envelope will be discounted from consideration. This behavior
is like an additional filter that prevents expensive (slow) CM DP
calculations, but at a potential cost to sensitivity. Note that if
cmsearch is being run in <n> multiple threads on a
multicore machine then each thread may have an allocated matrix of up to
size <x> Mb at any given time.
- --smxsize
<x>
- Set the maximum allowable CM search DP matrix size to <x>
megabytes. By default this size is 128 Mb. This option is only relevant if
the CM will not use HMM banded matrices, i.e. if the --max,
--nohmm, --qdb, --fqdb, --nonbanded, or
--fnonbanded options are also used. Note that if cmsearch is
being run in <n> multiple threads on a multicore machine then
each thread may have an allocated matrix of up to size <x> Mb
at any given time.
- --cyk
- Use the CYK algorithm, not Inside, to determine the final score of all
hits.
- --acyk
- Use the CYK algorithm to align hits. By default, the Durbin/Holmes optimal
accuracy algorithm is used, which finds the alignment that maximizes the
expected accuracy of all aligned residues.
- --wcx
<x>
- For each CM, set the W parameter, the expected maximum length of a hit, to
<x> times the consensus length of the model. By default, the
W parameter is read from the CM file and was calculated based on the
transition probabilities of the model by cmbuild. You can find out
what the default W is for a model using cmstat. This option should
be used with caution as it impacts the filtering pipeline at several
different stages in nonobvious ways. It is only recommended for expert
users searching for hits that are much longer than any of the homologs
used to build the model in cmbuild, e.g. ones with large introns or
other large insertions. This option cannot be used in combination with the
--nohmm, --fqdb or --qdb options because in those
cases W is limited by query-dependent bands.
- --toponly
- Only search the top (Watson) strand of target sequences in
<seqdb>. By default, both strands are searched. This will
halve the database size (Z).
- --bottomonly
- Only search the bottom (Crick) strand of target sequences in
<seqdb>. By default, both strands are searched. This will
halve the database size (Z).
- --tformat
<s>
- Assert that the target sequence database file is in format
<s>. Accepted formats include fasta, embl,
genbank, ddbj, stockholm, pfam, a2m,
afa, clustal, and phylip The default is to autodetect
the format of the file.
- --cpu
<n>
- Set the number of parallel worker threads to <n>. By default,
Infernal sets this to the number of CPU cores it detects in your machine -
that is, it tries to maximize the use of your available processor cores.
Setting <n> higher than the number of available cores is of
little if any value, but you may want to set it to something less. You can
also control this number by setting an environment variable,
INFERNAL_NCPU. This option is only available if Infernal was
compiled with POSIX threads support. This is the default, but it may have
been turned off at compile-time for your site or machine for some reason.
- --stall
- For debugging the MPI master/worker version: pause after start, to enable
the developer to attach debuggers to the running master and worker(s)
processes. Send SIGCONT signal to release the pause. (Under gdb: (gdb)
signal SIGCONT) (Only available if optional MPI support was enabled at
compile-time.)
- --mpi
- Run in MPI master/worker mode, using mpirun. To use --mpi,
the sequence file must have first been 'indexed' using the
esl-sfetch program, which is included with Infernal, in the
easel/miniapps/ subdirectory. (Only available if optional MPI
support was enabled at compile-time.)
See infernal(1) for a master man page with a list of all
the individual man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your
Infernal distribution (Userguide.pdf); or see the Infernal web page
(http://eddylab.org/infernal/).
Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.
For additional information on copyright and licensing, see the
file called COPYRIGHT in your Infernal source distribution, or see the
Infernal web page (http://eddylab.org/infernal/).