phastCons - Identify conserved elements or produce conservation
scores, given
The alignment file can be in any of several file formats (see
--msa-format). The phylogenetic models must be in the .mod format
produced by the phyloFit program.
Identify conserved elements or produce conservation scores, given
a multiple alignment and a phylo-HMM. By default, a phylo-HMM consisting of
two states is assumed: a "conserved" state and a
"non-conserved" state. Separate phylogenetic models can be
specified for these two states, e.g.,
- phastCons myfile.ss cons.mod,noncons.mod > scores.wig
or a single model can be given for the non-conserved state,
e.g.,
- phastCons myfile.ss --rho 0.5 noncons.mod > scores.wig
in which case the model for the conserved state will be obtained
by multiplying all branch lengths by the scaling parameter rho (0 < rho
< 1). If the --rho option is not used, rho will be set to its
default value of 0.3.
By default, the phylogenetic models will be left unaltered, but if
the --estimate-trees option is used, e.g.,
- phastCons myfile.ss init.mod --estimate-trees newtree >
scores.wig
then the phylogenetic models for the two states will be estimated
from the data, and the given tree model (there must be only one in this
case) will be used for initialization only. It is also possible to estimate
only the scale factor --rho, using the --estimate-rho option.
The transition probabilities for the HMM can either be specified at the
command line or estimated from the data using an EM algorithm. To specify
them at the command line, use either the --transitions option or the
--target-coverage and --expected-length options. The
recommended method is to use --target-coverage and
--expected-length, e.g.,
- phastCons --target-coverage 0.25 --expected-length 12
myfile.ss cons.mod,noncons.mod > scores.wig
The primary output, sent to stdout in fixed-step WIG format
(http://genome.ucsc.edu/goldenPath/help/wiggle.html), is a set of
base-by-base conservation scores. The score at each base is equal to the
posterior probability that that base was "generated" by the
conserved state of the phylo-HMM. The scores are reported in the coordinate
frame of a designated reference sequence (see --refidx), which is by
default the first sequence in the alignment. They can be suppressed with the
--no-post-probs option. The secondary type of output, activated with
the --most-conserved (aka --viterbi) option, is a set of
discrete conserved elements. These elements are output in either BED or GFF
format, also in the coordinate system of the reference sequence (see
--most-conserved). They can be assigned log-odds scores using the
--score option.
Other uses are also supported, but will not be described in detail
here. For example, it is possible to produce conservation scores and
conserved elements using a k-state phylo-HMM of the kind described by
Felsenstein and Churchill (1996) (see --FC), and it is possible to
produce a "coding potential" score instead of a conservation score
(see --coding-potential). It is also possible to give the program a
custom HMM and to specify any subset of its states to use for prediction
(see --hmm and --states).
See the phastCons HOWTO for additional details.
1. Given phylogenetic models for conserved and nonconserved
regions and HMM transition parameters, compute a set of conservation
scores.
- phastCons --transitions 0.01,0.01 mydata.ss cons.mod,noncons.mod
> scores.wig
2. Similar to (1), but define the conserved model as a scaled
version of the nonconserved model, with rho=0.4 as the scaling parameter.
Also predict conserved elements as well as conservation scores, and assign
log-odds scores to predictions.
- phastCons --transitions 0.01,0.01 --most-conserved
mostcons.bed --score --rho 0.4 mydata.ss noncons.mod >
scores.wig
(if output file were "mostcons.gff," then output would
be in GFF instead of BED format)
3. This time, estimate the parameter rho from the data. Suppress
both the scores and the conserved elements. Specify the transition
probabilities using --target-coverage and --expected-length
instead of --transitions.
- phastCons --target-coverage 0.25 --expected-length 12
--estimate-rho newtree --no-post-probs mydata.ss
noncons.mod
4. This time estimate all free parameters of the tree models.
- phastCons --target-coverage 0.25 --expected-length 12
--estimate-trees newtree --no-post-probs mydata.ss
noncons.mod
5. Estimate the state-transition parameters but not the tree
models. Output the conservation scores but not the conserved elements.
- phastCons mydata.ss cons.mod,noncons.mod > scores.wig
6. Estimate just the expected-length parameter and also estimate
rho.
- phastCons --target-coverage 0.25 --estimate-rho newtree
mydata.ss noncons.mod > scores.wig
--rho, -R <rho>
- Set the *scale* (overall evolutionary rate) of the model for the conserved
state to be <rho> times that of the model for the non-conserved
state (0 < <rho> < 1; default 0.3). If used with
--estimate-trees or --estimate-rho, the specified value will
be used for initialization only (rho will be estimated). This option is
ignored if two tree models are given.
--estimate-trees, -T <fname_root> Estimate
free parameters of tree models and write new models to
<fname_root>.cons.mod and <fname_root>.noncons.mod.
--estimate-rho, -O <fname_root>
- Like --estimate-trees, but estimate only the parameter rho.
--gc, -G <val> (Optionally use with
--estimate-trees or --estimate-rho) Assume a background
nucleotide distribution consistent with the given average G+C content (0
< <val> < 1) when estimating tree models. (The frequencies of G
and C will be set to <val>/2 and the frequencies of A and T will be
set to (1-<val>)/2.) This option overrides the default behavior of
estimating the background distribution from the data (if
--estimate-trees) or obtaining them from the input model (if
--estimate-rho).
--nrates, -k <nrates> |
<nrates_conserved,nrates_nonconserved> (Optionally use with a
discrete-gamma model and --estimate-trees) Assume the specified
number of rate categories, instead of the number given in the *.mod file.
The shape parameter 'alpha' will be as given in the *.mod file. In the case
of the default two-state HMM, two values can be specified, for the numbers
of rates for the conserved and the nonconserved states, resp.
--transitions, -t [~]<mu>,<nu>
- Fix the transition probabilities of the two-state HMM as specified, rather
than estimating them by maximum likelihood. Alternatively, if first
character of argument is '~', estimate parameters, but initialize to
specified values. The argument <mu> is the probability of
transitioning from the conserved to the non-conserved state, and
<nu> is the probability of the reverse transition. The probabilities
of self transitions are thus 1-<mu> and 1-<nu> and the
expected lengths of conserved and nonconserved elements are 1/<mu>
and 1/<nu>, respectively.
--target-coverage, -C <gamma>
- (Alternative to --transitions) Constrain transition parameters such
that the expected fraction of sites in conserved elements is <gamma>
(0 < <gamma> < 1). This is a *prior* rather than *posterior*
expectation and assumes stationarity of the state-transition process.
Adding this constraint causes the ratio mu/nu to be fixed at
(1-<gamma>)/<gamma>. If used with --expected-length,
the transition probabilities will be completely fixed; otherwise the
expected-length parameter <omega> will be estimated by maximum
likelihood. --expected-length, -E [~]<omega>
{--expected-lengths also allowed, for backward compatibility}
- (For use with --target-coverage, alternative to
--transitions) Set transition probabilities such that the expected
length of a conserved element is <omega>. Specifically, the
parameter mu is set to 1/<omega>. If preceded by '~', <omega>
will be estimated, but will be initialized to the specified value.
--msa-format, -i PHYLIP|FASTA|MPM|SS|MAF
- Alignment file
format.
- Default is to guess format based on
- file contents.
- Note that the msa_view program can be used to
- convert between formats.
--viterbi [alternatively --most-conserved],
-V <fname> Predict discrete elements using the Viterbi
algorithm and write to specified file. Output is in BED format, unless
<fname> has suffix ".gff", in which case output is in
GFF.
--score, -s (Optionally use with
--viterbi) Assign a log-odds score to each prediction.
--lnl, -L <fname>
- Compute total log likelihood using the forward algorithm and write to
specified file.
--no-post-probs, -n Suppress output of posterior
probabilities. Useful if only discrete elements or likelihood is of
interest.
--log, -g <log_fname>
- (Optionally use when estimating free parameters) Write log of optimization
procedure to specified file.
--refidx, -r <refseq_idx> Use coordinate
frame of specified sequence in output. Default
- value is 1, first sequence in alignment; 0 indicates coordinate frame of
entire multiple alignment.
--seqname, -N <name> (Optionally use with
--viterbi) Use specified string for 'seqname' (GFF) or 'chrom' field
in output file. Default is obtained from input file name (double filename
root, e.g., "chr22" if input file is "chr22.35.ss").
--idpref, -P <name>
- (Optionally use with --viterbi) Use specified string as prefix of
generated ids in output file. Can be used to ensure ids are unique.
Default is obtained from input file name (single filename root, e.g.,
"chr22.35" if input file is "chr22.35.ss").
--quiet, -q Proceed quietly (without updates to
stderr).
--help, -h
- Print this help message. (Indels) [experimental]
--indels, -I
- Expand HMM state space to model indels as described in Siepel &
Haussler (2004).
--max-micro-indel, -Y <length> (Optionally
use with --indels) Maximum length of an alignment gap to be
considered a "micro-indel" and therefore addressed by the indel
model. Gaps longer than this threshold will be treated as missing data.
Default value is 20.
--indel-params, -D
[~]<alpha_0,beta_0,tau_0,alpha_1,beta_1,tau_1>
- (Optionally use with --indels and default two-state HMM) Fix the
indel parameters at (alpha_0, beta_0, tau_0) for the conserved state and
at (alpha_1, beta_1, tau_1) for the non-conserved state, rather than
estimating them by maximum likelihood. Alternatively, if first character
of argument is '~', estimate parameters, but initialize with specified
values. Alpha_j is the rate of insertion events per substitution per site
in state j (typically ~0.05), beta_j is the rate of deletion events per
substitution per site in state j (typically ~0.05), and tau_j is
approximately the inverse of the expected indel length in state j
(typically 0.2-0.5).
--indels-only, -J Like --indels but force
the use of a single-state HMM. This option allows the effect of the indel
model in isolation to be observed. Implies --no-post-probs. Use with
--lnl. (Felsenstein/Churchill model) [rarely used]
--FC, -X
- (Alternative to --hmm; specify only one *.mod file with this
option) Use an HMM with a state for every rate category in the given
phylogenetic model, and transition probabilities defined by an
autocorrelation parameter lambda (as described by Felsenstein and
Churchill, 1996). A rate constant for each state (rate category) will be
multiplied by the branch lengths of the phylogenetic model, to create a
"scaled" version of the model for that state. If the
phylogenetic model was estimated using Yang's discrete gamma method
(-k option to phyloFit), then the rate constants will be defined
according to the estimated shape parameter 'alpha', as described by Yang
(1994). Otherwise, a nonparameteric model of rate variation must have been
used (-K option to phyloFit), and the rate constants will be as
defined (explicitly) in the *.mod file. By default, the parameter lambda
will be estimated by maximum likelihood (see --lambda).
--lambda, -l [~]<lambda>
- (Optionally use with --FC) Fix lambda at the specified value rather
than estimating it by maximum likelihood. Alternatively, if first
character is '~', estimate but initialize at specified value. Allowable
range is 0-1. With k rate categories, the transition probability between
state i and state j will be lambda * I(i == j) + (1-lambda)/k, where I is
the indicator function. Thus, lambda = 0 implies no autocorrelation and
lambda = 1 implies perfect autocorrelation. (Coding potential)
[experimental]
--coding-potential, -p
- Use parameter settings that cause output to be interpretable as a coding
potential score. By default, a simplified version of exoniphy's phylo-HMM
is used, with a noncoding (background) state, a conserved non-coding (CNS)
state, and states for the three codon positions. This option implies
--catmap "NCATS=4; CNS 1; CDS 2-4" --hmm
<default-HMM-file> --states CDS --reflect-strand
background,CNS and a set of default *.mod files (all of which can be
overridden). This option can be used with or without --indels.
--extrapolate, -e <phylog.nh> | default
- Extrapolate to a larger set of species based on the given phylogeny
(Newick-format). The trees in the given tree models (*.mod files) must be
subtrees of the larger phylogeny. For each tree model M, a copy will be
created of the larger phylogeny, then scaled such that the total branch
length of the subtree corresponding to M's tree equals the total branch
length of M's tree; this new version will then be used in place of M's
tree. (Any species name present in this tree but not in the data will be
ignored.) If the string "default" is given instead of a
filename, then a phylogeny for 25 vertebrate species, estimated from
sequence data for Target 1 (CFTR) of the NISC Comparative Sequencing
Program (Thomas et al., 2003), will be assumed.
--alias, -A <alias_def>
- Alias names in input alignment according to given definition, e.g.,
"hg17=human; mm5=mouse; rn3=rat". Useful with default *.mod
files, e.g., with --coding-potential. (Default models use generic
common names such as "human", "mouse", and
"rat". This option allows a mapping to be established between
the leaves of trees in these files and the sequences of an alignment that
uses an alternative naming convention.)
--hmm, -H <hmm_fname>
- Name of HMM file explicitly defining the probabilities of all state
transitions. States in the file must correspond in number and order to
phylogenetic models in <mod_fname_list>. Expected file format is as
produced by 'hmm_train.'
--catmap, -c <fname>|<string>
(Optionally use with --hmm) Mapping of feature types to category
numbers. Can give either a filename or an "inline" description of
a simple category map, e.g., --catmap "NCATS = 3 ; CDS
1-3".
--states, -S <state_list>
- States of interest in the phylo-HMM, specified by number (indexing starts
with 0), or if --catmap, by category name. Default value is 1.
Choosing --states "0,1,2" will cause output of the sum of
the posterior probabilities for states 0, 1, and 2, and/or of regions in
which the Viterbi path coincides with (any of) states 0, 1, or 2 (see
--viterbi).
--reflect-strand, -U <pivot_states>
- (Optionally use with --hmm) Given an HMM describing the forward
strand, create a larger HMM that allows for features on both strands by
"reflecting" the original HMM about the specified
"pivot" states. The new HMM will be used for prediction on both
strands. States can be specified by number (indexing starts with 0), or if
--catmap, by category name.
--require-informative, -M <states> Require
"informative" columns (i.e., columns with more than two
non-missing-data characters, excluding sequences specified by
--not-informative) in specified HMM states, to help eliminate false
positive predictions. States can be specified by number (indexing starts
with 0) or, if --catmap is used, by category name. Non-informative
columns will be given emission probabilities of zero. By default, this
option is active, with <states> equal to the set of states of interest
for prediction (as specified by --states). Use "none" to
disable completely.
--not-informative, -F <list>
- Do not consider the specified sequences (listed by name) when deciding
whether a column is informative. This option may be useful when sequences
are present that are very close to the reference sequence and thus do not
contribute much in the way of phylogenetic information. E.g., one might
use "--not-informative chimp" with a human-referenced multiple
alignment including chimp sequence, to avoid false-positive predictions
based only on human/chimp alignments (can be a problem, e.g., with
--coding-potential).
--ignore-missing, -z
- (For use when estimating transition probabilities) Ignore regions of
missing data in all sequences but the reference sequence (excluding
sequences specified by --not-informative) when estimating
transition probabilities. Can help avoid too-low estimates of <mu>
and <nu> or too-high estimates of <lambda>. Warning: this
option should not be used with --viterbi because coordinates in
output will be unrecognizable.
J. Felsenstein and G. Churchill. 1996. A hidden Markov model
approach to variation among sites in rate of evolution. Mol. Biol. Evol.,
13:93-104. A. Siepel, G. Bejerano, J. S. Pedersen, et al. 2005.
Evolutionarily conserved elements in vertebrate, insect, worm, and
yeast genomes. Genome Res. (in press) A. Siepel and D. Haussler. 2004.
Computational identification of evolutionarily conserved exons. Proc. 8th
Annual Int'l Conf. on Research in Computational Biology (RECOMB '04), pp.
177-186.
J. Thomas et al. 2003. Comparative analyses of multi-species
sequences from targeted genomic regions. Nature 424:788-793.
Z. Yang. 1994. Maximum likelihood phylogenetic estimation from DNA
sequences with variable rates over sites: approximate methods. J. Mol.
Evol., 39:306-314.