cmalign(1) | Infernal Manual | cmalign(1) |
cmalign - align sequences to a covariance model
cmalign aligns the RNA sequences in <seqfile> to the covariance model (CM) in <cmfile>. The new alignment is output to stdout in Stockholm format, but can be redirected to a file <f> with the -o <f> option.
Either <cmfile> or <seqfile> (but not both) may be '-' (dash), which means reading this input from stdin rather than a file.
The sequence file <seqfile> must be in FASTA or Genbank format.
cmalign uses an HMM banding technique to accelerate alignment by default as described below for the --hbanded option. HMM banding can be turned off with the --nonbanded option.
By default, cmalign computes the alignment with maximum expected accuracy that is consistent with constraints (bands) derived from an HMM, using a banded version of the Durbin/Holmes optimal accuracy algorithm. This behavior can be changed with the --cyk or --sample options.
cmalign takes special care to correctly align truncated sequences, where some nucleotides from the beginning (5') and/or end (3') of the actual full length biological sequence are not present in the input sequence (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243, 2009). This behavior is on by default, but can be turned off with --notrunc. In previous versions of cmalign the --sub option was required to appropriately handle truncated sequences. The --sub option is still available in this version, but the new default method for handling truncated sequences should be as good or superior to the sub method in nearly all cases.
The --mapali <s> option allows inclusion of the fixed training alignment used to build the CM from file <s> within the output alignment of cmalign.
It is possible to merge two or more alignments created by the same CM using the Easel miniapp esl-alimerge (included in the easel/miniapps/ subdirectory of Infernal). Previous versions of cmalign included options to merge alignments but they were deprecated upon development of esl-alimerge, which is significantly more memory efficient.
By default, cmalign will output the alignment to stdout. The alignment can be redirected to an output file <f> with the -o <f> option. With -o, information on each aligned sequence, including score and model alignment boundaries will be printed to stdout (more on this below).
The output alignment will be in Stockholm format by default. This can be changed to Pfam, aligned FASTA (AFA), A2M, Clustal, or Phylip format using the --outformat <s> option, where <s> is the name of the desired format. As a special case, if the output alignment is large (more than 10,000 sequences or more than 10,000,000 total nucleotides) than the output format will be Pfam format, with each sequence appearing on a single line, for reasons of memory efficiency. For alignments larger than this, using --ileaved will force interleaved Stockholm format, but the user should be aware that this may require a lot of memory. --ileaved will only work for alignments up to 100,000 sequences or 100,000,000 total nucleotides.
If the output alignment format is Stockholm or Pfam, the output alignment will be annotated with posterior probabilities which estimate the confidence level of each aligned nucleotide. This annotation appears as lines beginning with "#=GR <seq name> PP", one per sequence, each immediately below the corresponding aligned sequence "<seq name>". Characters in PP lines have 12 possible values: "0-9", "*", or ".". If ".", the position corresponds to a gap in the sequence. A value of "0" indicates a posterior probability of between 0.0 and 0.05, "1" indicates between 0.05 and 0.15, "2" indicates between 0.15 and 0.25 and so on up to "9" which indicates between 0.85 and 0.95. A value of "*" indicates a posterior probability of between 0.95 and 1.0. Higher posterior probabilities correspond to greater confidence that the aligned nucleotide belongs where it appears in the alignment. With --nonbanded, the calculation of the posterior probabilities considers all possible alignments of the target sequence to the CM. Without --nonbanded (i.e. in default mode), the calculation considers only possible alignments within the HMM bands. Further, the posterior probabilities are conditional on the truncation mode of the alignment. For example, if the sequence alignment is truncated 5', a PP value of "9" indicates between 0.85 and 0.95 of all 5' truncated alignments include the given nucleotide at the given position. The posterior annotation can be turned off with the --noprob option. If --small is enabled, posterior annotation must also be turned off using --noprob.
The tabular output that is printed to stdout if the -o option is used includes one line per sequence and twelve fields per line: "idx": the index of the sequence in the input file, "seq name": the sequence name; "length": the length of the sequence; "cm from" and "cm to": the model start and end positions of the alignment; "trunc": "no" if the sequence is not truncated, "5'" if the beginning of the sequence truncated 5', "3'" if the end of the sequence is truncated, and "5'&3'" if both the beginning and the end are truncated; "bit sc": the bit score of the alignment, "avg pp" the average posterior probability of all aligned nucleotides in the alignment; "band calc", "alignment" and "total": the time in seconds required for calculating HMM bands, computing the alignment, and complete processing of the sequence, respectively; "mem (Mb)": the size in Mb of all dynamic programming matrices required for aligning the sequence. This tabular data can be saved to file <f> with the --sfile <f> option.
Importantly, HMM banding sacrifices the guarantee of determining the optimally accurarte or optimal alignment, which will be missed if it lies outside the bands. The tau parameter is the amount of probability mass considered negligible during HMM band calculation; lower values of tau yield greater speedups but also a greater chance of missing the optimal alignment. The default tau is 1E-7, determined empirically as a good tradeoff between sensitivity and speed, though this value can be changed with the --tau <x> option. The level of acceleration increases with both the length and primary sequence conservation level of the family. For example, with the default tau of 1E-7, tRNA models (low primary sequence conservation with length of about 75 nucleotides) show about 10X acceleration, and SSU bacterial rRNA models (high primary sequence conservation with length of about 1500 nucleotides) show about 700X. HMM banding can be turned off with the --nonbanded option.
See infernal(1) for a master man page with a list of all the individual man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf); or see the Infernal web page (http://eddylab.org/infernal/).
Copyright (C) 2020 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.
For additional information on copyright and licensing, see the file called COPYRIGHT in your Infernal source distribution, or see the Infernal web page (http://eddylab.org/infernal/).
http://eddylab.org
Dec 2020 | Infernal 1.1.4 |