hmm2build [options] hmmfile alignfile
hmm2build reads a multiple sequence alignment file
alignfile, builds a new profile HMM, and saves the HMM in
hmmfile.
alignfile may be in ClustalW, GCG MSF, SELEX, Stockholm, or
aligned FASTA alignment format. The format is automatically detected.
By default, the model is configured to find one or more
nonoverlapping alignments to the complete model: multiple global alignments
with respect to the model, and local with respect to the sequence. This is
analogous to the behavior of the hmmls program of HMMER 1. To
configure the model for multiple local alignments with respect to the
model and local with respect to the sequence, a la the old program
hmmfs, use the -f (fragment) option. More rarely, you may want
to configure the model for a single global alignment (global with respect to
both model and sequence) using the -g option, or for a single
local/local alignment (a la standard Smith/Waterman, or the old hmmsw
program) using the -s option.
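As a quick reference, the four alignment configurations described above and their flags can be captured in a small Python helper that assembles an argument list for the command. This is only a sketch: the function name, the mode labels, and the file names are hypothetical, and the argument order simply follows the synopsis (hmm2build [options] hmmfile alignfile).

```python
# Hypothetical helper: maps the alignment configurations described above
# to the hmm2build flag that selects each one. The mode labels are
# illustrative names, not HMMER terminology.
MODE_FLAGS = {
    "multi_global": None,   # default: multiple hits, global w.r.t. the model
    "multi_local": "-f",    # multiple hits, local w.r.t. the model (fragments)
    "single_global": "-g",  # one hit, global w.r.t. model and sequence
    "single_local": "-s",   # one hit, local/local (Smith/Waterman style)
}


def build_command(hmmfile, alignfile, mode="multi_global"):
    """Return an argv list following `hmm2build [options] hmmfile alignfile`."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    argv = ["hmm2build"]
    flag = MODE_FLAGS[mode]
    if flag is not None:
        argv.append(flag)
    argv += [hmmfile, alignfile]
    return argv


if __name__ == "__main__":
    # File names are placeholders, not files shipped with HMMER.
    print(" ".join(build_command("globin.hmm", "globins.sto", "multi_local")))
```

The returned list can be passed to subprocess.run() on a system where hmm2build is installed; the helper itself only builds the list and does not run anything.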
- -f
- Configure the model for finding multiple domains per sequence, where each
domain can be a local (fragmentary) alignment. This is analogous to the
old hmmfs program of HMMER 1.
- -g
- Configure the model for finding a single global alignment to a target
sequence, analogous to the old hmms program of HMMER 1.
- -h
- Print brief help; includes version number and summary of all options,
including expert options.
- -n <s>
- Name this HMM <s>. <s> can be any string of
non-whitespace characters (i.e. one "word"). There is no length
limit (at least not one imposed by HMMER; your shell will complain about
command line lengths first).
- -o <f>
- Re-save the starting alignment to <f>, in Stockholm format.
The columns which were assigned to match states will be marked with x's in
a #=GC RF annotation line. If either the --hand or --fast
construction options were chosen, the alignment may have been slightly
altered to be compatible with Plan 7 transitions, so saving the final
alignment and comparing to the starting alignment can let you view these
alterations. See the User's Guide for more information on this arcane side
effect.
- -s
- Configure the model for finding a single local alignment per target
sequence. This is analogous to the standard Smith/Waterman algorithm or
the hmmsw program of HMMER 1.
- -A
- Append this model to an existing hmmfile rather than creating
hmmfile. Useful for building HMM libraries (like Pfam).
- -F
- Force overwriting of an existing hmmfile. Otherwise HMMER will
refuse to clobber your existing HMM files, for safety's sake.
- --amino
- Force the sequence alignment to be interpreted as amino acid sequences.
Normally HMMER autodetects whether the alignment is protein or DNA, but
sometimes alignments are so small that autodetection is ambiguous. See
--nucleic.
- --archpri <x>
- Set the "architecture prior" used by MAP architecture
construction to <x>, where <x> is a probability
between 0 and 1. This parameter governs a geometric prior distribution
over model lengths. As <x> increases, longer models are
favored a priori. As <x> decreases, it takes more residue
conservation in a column to make a column a "consensus" match
column in the model architecture. The 0.85 default has been chosen
empirically as a reasonable setting.
- --binary
- Write the HMM to hmmfile in HMMER binary format instead of readable
ASCII text.
- --cfile <f>
- Save the observed emission and transition counts to <f> after
the architecture has been determined (i.e. after residues/gaps have been
assigned to match, delete, and insert states). This option is used in
HMMER development for generating data files useful for training new
Dirichlet priors. The format of count files is documented in the User's
Guide.
- --fast
- Quickly and heuristically determine the architecture of the model by
assigning all columns with more than a certain fraction of gap characters
to insert states. By default this fraction is 0.5, and it can be changed
using the --gapmax option. The default construction algorithm is a
maximum a posteriori (MAP) algorithm, which is slower.
- --gapmax <x>
- Controls the --fast model construction algorithm; it has no effect unless
--fast is in use. If a column has more than a fraction <x> of gap symbols
in it, it gets assigned to an insert column. <x> is a frequency from 0 to
1, and by default is set to 0.5. Higher values of <x> mean more columns
get assigned to consensus, and models get longer; smaller values of <x>
mean fewer columns get assigned to consensus, and models get shorter.
- --hand
- Specify the architecture of the model by hand: the alignment file must be
in SELEX or Stockholm format, and the reference annotation line (#=RF in
SELEX, #=GC RF in Stockholm) is used to specify the architecture. Any
column marked with a non-gap symbol (such as an 'x', for instance) is
assigned as a consensus (match) column in the model.
- --idlevel <x>
- Controls both the determination of effective sequence number and the
behavior of the --wblosum weighting option. The sequence alignment
is clustered by percent identity, and the number of clusters at a cutoff
threshold of <x> is used to determine the effective sequence
number. Higher values of <x> give more clusters and higher
effective sequence numbers; lower values of <x> give fewer
clusters and lower effective sequence numbers. <x> is a
fraction from 0 to 1, and by default is set to 0.62 (corresponding to the
clustering level used in constructing the BLOSUM62 substitution matrix).
- --informat <s>
- Assert that the input alignfile is in format <s>; do not
run Babelfish format autodetection. This increases the reliability of the
program somewhat, because the Babelfish can make mistakes; particularly
recommended for unattended, high-throughput runs of HMMER. Valid format
strings include FASTA, GENBANK, EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,
CLUSTAL, and PHYLIP. See the User's Guide for a complete list.
- --noeff
- Turn off the effective sequence number calculation, and use the true
number of sequences instead. This will usually reduce the sensitivity of
the final model (so don't do it without good reason!).
- --nucleic
- Force the alignment to be interpreted as nucleic acid sequence, either RNA
or DNA. Normally HMMER autodetects whether the alignment is protein or
DNA, but sometimes alignments are so small that autodetection is
ambiguous. See --amino.
- --null <f>
- Read a null model from <f>. The default for protein is to use
average amino acid frequencies from Swissprot 34 and p1 = 350/351; for
nucleic acid, the default is to use 0.25 for each base and p1 = 1000/1001.
For documentation of the format of the null model file and further
explanation of how the null model is used, see the User's Guide.
- --pam <f>
- Apply a heuristic PAM- (substitution matrix-) based prior on match
emission probabilities instead of the default mixture Dirichlet. The
substitution matrix is read from <f>. See --pamwgt.
The default Dirichlet state transition prior and insert
emission prior are unaffected. Therefore in principle you could combine
--prior with --pam, but this isn't recommended, as it
hasn't been tested. (--pam itself hasn't been tested much!)
- --pamwgt <x>
- Controls the weight on a PAM-based prior. Only has an effect if the --pam
option is also in use. <x> is a positive real number, 20.0 by
default. <x> is the number of "pseudocounts"
contributed by the heuristic prior. Very high values of <x>
can force a scoring system that is entirely driven by the substitution
matrix, making HMMER somewhat approximate Gribskov profiles.
- --pbswitch <n>
- For alignments with a very large number of sequences, the GSC, BLOSUM, and
Voronoi weighting schemes are slow; they're O(N^2) for N sequences.
Henikoff position-based weights (PB weights) are more efficient. At or
above a certain threshold sequence number <n>, hmm2build will switch from
GSC, BLOSUM, or Voronoi weights to PB weights. To disable this switching
behavior (at the cost of compute time), set <n> to be something larger
than the number of sequences in your alignment. <n> is a positive integer;
the default is 1000.
- --prior <f>
- Read a Dirichlet prior from <f>, replacing the default
mixture Dirichlet. The format of prior files is documented in the User's
Guide, and an example is given in the Demos directory of the HMMER
distribution.
- --swentry <x>
- Controls the total probability that is distributed to local entries into
the model, versus starting at the beginning of the model as in a global
alignment. <x> is a probability from 0 to 1, and by default
is set to 0.5. Higher values of <x> mean that hits that are
fragments on their left (N or 5'-terminal) side will be penalized less,
but complete global alignments will be penalized more. Lower values of
<x> mean that fragments on the left will be penalized more,
and global alignments on this side will be favored. This option only
affects the configurations that allow local alignments, e.g. -s and
-f; unless one of these options is also activated, this option has
no effect. You have independent control over local/global alignment
behavior for the N/C (5'/3') termini of your target sequences using
--swentry and --swexit.
- --swexit <x>
- Controls the total probability that is distributed to local exits from the
model, versus ending an alignment at the end of the model as in a global
alignment. <x> is a probability from 0 to 1, and by default
is set to 0.5. Higher values of <x> mean that hits that are
fragments on their right (C or 3'-terminal) side will be penalized less,
but complete global alignments will be penalized more. Lower values of
<x> mean that fragments on the right will be penalized more,
and global alignments on this side will be favored. This option only
affects the configurations that allow local alignments, e.g. -s and
-f; unless one of these options is also activated, this option has
no effect. You have independent control over local/global alignment
behavior for the N/C (5'/3') termini of your target sequences using
--swentry and --swexit.
- --verbose
- Print more possibly useful stuff, such as the individual scores for each
sequence in the alignment.
- --wblosum
- Use the BLOSUM filtering algorithm to weight the sequences, instead of the
default. Cluster the sequences at a given percentage identity (see
--idlevel); assign each cluster a total weight of 1.0, distributed
equally amongst the members of that cluster.
- --wgsc
- Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting algorithm.
This is already the default, so this option has no effect (unless it
follows another option in the --w family, in which case it overrides it).
- --wme
- Use the Krogh/Mitchison maximum entropy algorithm to "weight"
the sequences. This supersedes the Eddy/Mitchison/Durbin maximum
discrimination algorithm, which gives almost identical weights but is less
robust. ME weighting seems to give a marginal increase in sensitivity over
the default GSC weights, but takes a fair amount of time.
- --wnone
- Turn off all sequence weighting.
- --wpb
- Use the Henikoff position-based weighting scheme.
- --wvoronoi
- Use the Sibbald/Argos Voronoi sequence weighting algorithm in place of the
default GSC weighting.
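The clustering-based weighting described under --wblosum and --idlevel can be illustrated with a short Python sketch. This is not HMMER's implementation: it assumes single-linkage clustering and a simple identity definition (identical residues divided by the columns where both sequences have a residue), and all function names are hypothetical.

```python
def pairwise_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns where both rows have a residue."""
    aligned = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not aligned:
        return 0.0
    return sum(x.upper() == y.upper() for x, y in aligned) / len(aligned)


def cluster_weights(rows, idlevel=0.62):
    """Cluster aligned rows at `idlevel` (single linkage, an assumption here).

    Returns (weights, n_clusters). Each cluster has total weight 1.0,
    shared equally among its members.
    """
    n = len(rows)
    parent = list(range(n))  # union-find forest over row indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair whose identity reaches the threshold (O(N^2) pairs).
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(rows[i], rows[j]) >= idlevel:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1
    weights = [1.0 / sizes[r] for r in roots]
    return weights, len(sizes)


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    rows = ["ACDEFG", "ACDEFG", "ACDEYG", "WWWWWW"]
    print(cluster_weights(rows, idlevel=0.62))
    print(cluster_weights(rows, idlevel=0.90))
```

In this sketch the cluster count plays the role that the --idlevel description assigns to it, and raising the threshold splits the toy alignment into more clusters. The nested loop over all pairs also shows where a quadratic cost of the kind mentioned under --pbswitch comes from in a pairwise scheme.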
Master man page, with full list of and guide to the individual man
pages: see hmmer2(1).
For complete documentation, see the user guide
(ftp://selab.janelia.org/pub/software/hmmer/2.3.2/Userguide.pdf); or see the
HMMER web page, http://hmmer.janelia.org/.
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
Freely distributed under the GNU General Public License (GPL).
See the file COPYING in your distribution for details on
redistribution conditions.
Sean Eddy
HHMI/Dept. of Genetics
Washington Univ. School of Medicine
4566 Scott Ave.
St Louis, MO 63110 USA
http://www.genetics.wustl.edu/eddy/