(If you're like me, you want some basic examples first, and a list
of all options later.)
1. Compute the distance between two aligned sequences (in FASTA
file pair.fa) under the REV model.
- phyloFit pair.fa
(output is to phyloFit.mod; distance in substitutions per site
appears in the TREE line in the output file)
2. Fit a phylogenetic model to an alignment of human, chimp,
mouse, and rat sequences. Use the HKY85 substitution model. Write output to
files with prefix "myfile".
- phyloFit --tree "((human,chimp),(mouse,rat))"
--subst-mod HKY85 --out-root myfile primate-rodent.fa
3. As above, but use the discrete-gamma model for rate variation,
with 4 rate categories.
- phyloFit --tree "((human,chimp),(mouse,rat))"
--subst-mod HKY85 --out-root myfile --nrates 4
primate-rodent.fa
4. As above, but use genome-wide data, stored in the compact
"sufficient-statistics" format (can be produced with
"msa_view -o SS").
- phyloFit --tree "((human,chimp),(mouse,rat))"
--subst-mod HKY85 --out-root myfile --nrates 4
--msa-format SS primate-rodent.ss
5. Fit a context-dependent phylogenetic model (U2S) to an
alignment of human, mouse, and rat sequences. Use an EM algorithm for
parameter optimization and relax the convergence criteria a bit (recommended
with context-dependent models). Write a log file for the optimization
procedure. Consider only non-overlapping pairs of sites.
- phyloFit --tree "(human,(mouse,rat))" --subst-mod
U2S --EM --precision MED --non-overlapping
--log u2s.log --out-root hmr-u2s hmr.fa
6. As above, but allow overlapping pairs of sites, and compute
likelihoods by assuming Markov-dependence of columns (see Siepel &
Haussler, 2004). The EM algorithm can no longer be used (optimization will
be much slower).
- phyloFit --tree "(human,(mouse,rat))" --subst-mod
U2S --precision MED --log u2s-markov.log --markov
hmr.fa
7. Compute a likelihood using parameter estimates obtained in (5)
and an assumption of Markov dependence. This provides a lower bound on the
likelihood of the Markov-dependent model.
- phyloFit --init-model hmr-u2s.mod --lnl --markov
hmr.fa
8. Given an alignment of several mammalian sequences (mammals.fa),
a tree topology (tree.nh), and a set of gene annotations in GFF (genes.gff),
fit separate models to sites in 1st, 2nd, and 3rd codon positions. Use the
REV substitution model. Assume coding regions have feature type 'CDS'.
- phyloFit --tree tree.nh --features genes.gff
--out-root mammals-rev --catmap "NCATS = 3; CDS
1-3" --do-cats 1,2,3 mammals.fa
(output will be to mammals-rev.cds-1.mod, mammals-rev.cds-2.mod,
and mammals-rev.cds-3.mod)
--tree, -t
<tree_fname>|<tree_string>
- (Required if more than three species, or more than two species and a
non-reversible substitution model, e.g., UNREST, U2, U3) Name of file or
literal string defining tree topology. Tree must be in Newick format, with
the label at each leaf equal to the index or name of the corresponding
sequence in the alignment (indexing begins with 1). Examples:
--tree "(1,(2,3))", --tree
"(human,(mouse,rat))". Currently, the topology must be rooted.
When a reversible substitution model is used, the root is ignored during
the optimization procedure.
--subst-mod, -s
JC69|F81|HKY85|HKY85+Gap|REV|SSREV|UNREST|R2|R2S|U2|U2S|R3|R3S|U3|U3S
- (default REV).
- Nucleotide substitution model. JC69, F81, HKY85
- REV, and UNREST have the usual meanings (see, e.g., Yang, Goldman, and
Friday, 1994). SSREV is a strand-symmetric version of REV. HKY85+Gap is an
adaptation of HKY that treats gaps as a fifth character (courtesy of James
Taylor). The others, all considered "context-dependent", are as
defined in Siepel and Haussler, 2004. The options --EM and
--precision MED are recommended with context-dependent models (see
below).
--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS (default
is to guess format from file contents) Alignment format. FASTA is as usual.
PHYLIP is compatible with the formats used in the PHYLIP and PAML packages.
MPM is the format used by the MultiPipMaker aligner and some other of Webb
Miller's older tools. MAF ("Multiple Alignment Format") is used by
MULTIZ/TBA and the UCSC Genome Browser. SS is a simple format describing the
sufficient statistics for phylogenetic inference (distinct columns or tuple
of columns and their counts). Note that the program "msa_view" can
be used for file conversion.
--out-root, -o <output_fname_root> (default
"phyloFit"). Use specified string as root filename for all files
created.
--min-informative, -I <ninf_sites>
- Require at least <ninf_sites> "informative" sites
-- i.e., sites at which at least two non-gap and non-missing-data
('N' or '*') characters are present. Default is 50.
--gaps-as-bases, -G Treat alignment gap
characters ('-') like ordinary bases. By
- default, they are treated as missing data.
--ignore-branches, -b <branches>
- Ignore specified branches in likelihood computations and parameter
estimation, and treat the induced subtrees as independent. Can be useful
for likelihood ratio tests. The argument <branches> should be a
comma-separated list of nodes in the tree, indicating the branches above
these nodes, e.g., human-chimp,cow-dog. (See tree_doctor
--name-ancestors regarding names for ancestral nodes.) This option
does not currently work with --EM.
--quiet, -q Proceed quietly.
--help, -h
- Print this help message. (Options for controlling and monitoring the
optimization procedure)
--lnl, -L
- (for use with --init-model) Simply evaluate the log likelihood of
the specified tree model, without performing any further optimization. Can
be used with --post-probs, --expected-subs, and
--expected-total-subs.
--EM, -E
- Fit model(s) using EM rather than the BFGS quasi-Newton algorithm (the
default).
--precision, -p HIGH|MED|LOW
- (default HIGH) Level of precision to use in estimating model parameters.
Affects convergence criteria for iterative algorithms: higher precision
means more iterations and longer execution time.
--log, -l <log_fname>
- Write log to <log_fname> describing details of the optimization
procedure.
--init-model, -M <mod_fname> Initialize
with specified tree model. By choosing good
- starting values for parameters, it is possible to reduce execution time
dramatically. If this option is chosen, --tree is not allowed. The
substitution model used in the given model will be used unless
--subst-mod is also specified. Note: currently only one mod_fname
may be specified; it will be used for all categories.
--init-random, -r Initialize parameters randomly.
Can be used multiple times to test
- whether the m.l.e. is real.
--seed, -D <seed>
- Provide a random number seed for choosing initial parameter values
(usually with --init-random, though random values are used in some
other cases as well). Should be an integer >=1. If not provided, seed
is chosen based on current time.
--init-parsimony, -y Initialize branch lengths
using parsimony counts for given data. Only currently implemented for models
with single character state (ie, not di- or tri-nucleotides). Other
--init options such as --init-random or --init-model
can be used in conjunction to initialize substitution matrix parameters.
--print-parsimony, -Y <filename>
- Print parsimony score to
given file, and quit.
- (Does not optimize
- or report likelihoods).
--clock, -z Assume a molecular clock in
estimation. Causes the distances to all descendant leaves to be equal for
each ancestral node and cuts the number of free branch-length parameters
roughly in half.
--scale-only, -B
- (for use with --init-model) Estimate only the scale of the tree,
rather than individual branch lengths (branch proportions fixed).
Equilibrium frequencies and rate-matrix parameters will still be estimated
unless --no-freqs and --no-rates are used.
--scale-subtree, -S <node_name> (for use
with --scale-only) Estimate separate scale factors for subtree
beneath identified node and rest of tree. The branch leading to the subtree
is included with the subtree. If ":loss" or ":gain" is
appended to <node_name>, subtree scale is constrained to be greater
than or less than (respectively) scale for rest of tree.
--estimate-freqs, -F
- Estimate equilibrium frequencies by maximum likelihood, rather than
approximating them by the relative frequencies in the data. If using the
SSREV model, this option implies --sym-freqs.
--sym-freqs, -W
- Estimate equilibrium frequencies, assuming freq(A)=freq(T) and
freq(C)=freq(G). This only works for an alphabet ACGT (and possibly gap).
This option implies --estimate-freqs.
--no-freqs, -f
- (for use with --init-model) Do not estimate equilibrium
frequencies; just use the ones from the given tree model.
--no-rates, -n
- (for use with --init-model) Do not estimate rate-matrix parameters;
just use the ones from the given tree model.
--ancestor, -A <seqname> Treat specified
sequence as the root of the tree. The tree topology must define this
sequence to be a child of the root (in practice, the branch from the root to
the specified sequence will be retained, but will be constrained to have
length zero).
--error, -e <filename>
- For each parameter, report estimate, variance, and 95%% confidence
interval, printed to given filename, one parameter per line.
--no-opt, -O <param_list> Hold parameters
listed in comma-separated param_list constant at initial values. This
applies only to the "main" model, and not to any models defined
with the --alt-mod option. Param list can contain values such as
"branches" to hold branch lengths constant,
"ratematrix", "backgd", or "ratevar" to hold
entire rate matrix, equilibrium frequencies, or rate variation parameters
constant (respectively). There are also substitution model-specific
parameters such as "kappa" (transition/transversion rate ratio).
Note: to hold certain branches constant, but optimize others, put an
exclamation point in the newick-formatted tree after the branch lengths that
should be held constant. This can be useful for enforcing a star-phylogeny.
However, note that the two branches coming from root of tree are treated as
one. So they should both be held constant, or not held constant. This option
does *not* work with --scale-only or --clock.
--bound <param_name[lower_bound,upper_bound]> Set
boundaries for parameter. lower_bound or upper_bound may be empty string to
keep default. For example --bound gc_param[1,] will
- set the lower bound for gc_param to 1 (keeping upper bound at infinity),
for a GC model. Only applies to parameters for model in the
"main" tree, but similar syntax can be used within the
--alt-mod arguments. Can be used multiple times to set boundaries
for different parameters.
--selection <selection_param>
- Use selection in the model (is also implied if --init-model is used
and contains selection parameter). Selection scales rate matrix entries by
selection_param/(1-exp(-selection-param)); this is done after rate
matrix is scaled to set expected number of substitutions per unit time to
1. If using codon models selection acts only on nonysynonymous mutations.
(Options for modeling rate variation)
--nrates, -k <nratecats> (default 1).
Number of rate categories to use. Specifying a
- value of greater than one causes the discrete gamma model for rate
variation to be used (Yang, 1994).
--alpha, -a <alpha> (for use with
--nrates). Initial value for alpha, the shape parameter of the gamma
distribution. Default is 1.
--rate-constants, -K <rate_consts>
- Use a non-parameteric mixture model for rates, instead of assuming a gamma
distribution. The argument <rate_consts> must be a comma-delimited
list explicitly defining the rate constants to be used. The
"weight" (mixing proportion) associated with each rate constant
will be estimated by EM (this option implies --EM). If
--alpha is used with this option, then the mixing proportions will
be initialized to reflect a gamma distribution with the specified shape
parameter. (Options for separate handling of sites in different annotation
categories)
--features, -g <fname>
- Annotations file (GFF or BED format) describing features on one or more
sequences in the alignment. Together with a category map (see
--catmap), will be taken to define site categories, and a separate
model will be estimated for each category. If no category map is
specified, a category will be assumed for each type of feature, and they
will be numbered in the order of appearance of the features. Features are
assumed to use the coordinate frame of the first sequence in the alignment
and should be non-overlapping (see 'refeature --unique').
--catmap, -c <fname>|<string>
- (optionally use with --features) Mapping of feature types to
category numbers. Can either give a filename or an "inline"
description of a simple category map, e.g., --catmap "NCATS =
3 ; CDS 1-3" or --catmap "NCATS = 1 ; UTR 1". Note
that category 0 is reserved for "background" (everything that is
not described by a defined feature type).
--do-cats, -C <cat_list>
- (optionally use with --features) Estimate models for only the
specified categories (comma-delimited list categories, by name or
numbera). Default is to fit a model for every category.
--reverse-groups, -R <tag>
- (optionally use with --features) Group features by <tag>
(e.g., "transcript_id" or "exon_id") and reverse
complement segments of the alignment corresponding to groups on the
reverse strand. Groups must be non-overlapping (see refeature
--unique). Useful with categories corresponding to strand-specific
phenomena (e.g., codon positions). (Options for context-dependent
substitution models)
--markov, -N
- (for use with context-dependent substitutions models and not available
with --EM.) Assume Markov dependence of alignment columns, and
compute the conditional probability of each column given its N-1
predecessors using the two-pass algorithm described by Siepel and Haussler
(2004). (Here, N is the "order" of the model, as defined by
--subst-mod; e.g., N=1 for REV, N=2 for U2S, N=3 for U3S.) The
alternative (the default) is simply to work with joint probabilities of
tuples of columns. (You can ensure that these tuples are non-overlapping
with the --non-overlapping option.) The use of joint probabilities
during parameter estimation allows the use of the --EM option and
can be much faster; in addition, it appears to produce nearly equivalent
estimates. If desired, parameters can be estimated without
--markov, and then the likelihood can be evaluated using
--lnl and --markov together. This gives a lower bound on the
likelihood of the Markov-dependent model.
--non-overlapping, -V (for use with
context-dependent substitution models; not compatible with --markov,
--features, or --msa-format SS) Avoid using overlapping tuples
of sites in parameter estimation. If a dinucleotide model is selected, every
other tuple will be considered, and if a nucleotide triplet model is
selected, every third tuple will be considered. This option cannot be used
with an alignment represented only by unordered sufficient statistics.
(Option for lineage-specific models)
--label-branches branch1,branch2,branch3...:label
- Create a group of branches by giving a set of branches a single label. The
label should be a word which does not contain special characters (in
particular, no spaces, brackets, parentheses, pound signs, commas, or
colons). The label is for use with --alt-model option below, so
that an alternate model can be defined for a set of branches. A branch is
specified by the name of the node which is a descendant of that
branch.
- For example, --label-branches hg18,chimp,hg18-chimp:HC will apply
the label "HC" to the hg18,chimp,and hg18-chimp branches in the
following tree: (((hg18,chimp)hg18-chimp, (mouse,rat)mouse-rat)
- The same label could be defined without using --label-branches by
specifying the tree (either on the command-line or within init-model) as
follows: (((hg18 # HC, chimp #HC)#HC, (mouse,rat))
--label-subtree node[+]:label Similar to label-branches,
except labels the entire subtree of the named node. If the node name is
followed by a "+" sign, then includes the branch leading up to the
node in the subtree.
--alt-model, -d <label:(model|param_list)>
Create a lineage-specific substitution model on a group of branches. The
group is defined by a label, which can be specified within the tree string
(using the # sign notation), or by using the --label-branches or
--label-subtree arguments. If the alt-model applies to only a single
branch, labelling is not necessary and the name of the node descending from
the branch can be used instead. See --label-branches above for more
details on labelling groups of branches. If a name of a substitution model
(HKY85, REV, UNREST, etc) is given after the colon, then this model will be
used for the group of branches, and parameters relevant to the model will be
estimated separately in this group. This model may be different (or the
same) as the model used in the rest of the tree, but it must have the same
number of states and be of the same order as the model used for the rest of
the tree.
- Alternately, a list of parameter names can be given after the colon. In
this case, the same substitution model will be used for the entire tree,
but the parameters listed will be estimated separately in the specified
group of branches.
- The parameter names are model-specific, and include "kappa" for
HKY models, "alpha" for GC models, "ratematrix" to
specify all rate-matrix parameters in general models, and
"backgd" for the equilibrium background frequencies. The
parameter names may optionally be followed by boundaries in square
brackets to specify parameter bounds, as described in --bound
option.
- The --alt-model argument may be used multiple times, if one wishes
to (for example) optimize a parameter independently on several different
groups of branches. Example: phyloFit align.fa --subst-mod HKY85 \
--tree "(human, (mouse#MR, rat#MR)#MR, cow)"\
--alt-model "MR:kappa[0, 1]"
- will estimate the HKY85 parameter kappa separately on the mouse/rat
subtree, and constrain kappa between 0 and 1. The quotes are often
necessary to prevent the square brakcets from being parsed by the shell.
The same model could be achieved with:
- phyloFit align.fa --subst-mod HKY85 \ --tree "(human,
(mouse,rat)mouse-rat, cow)"\ --label-branches
mouse,rat,mouse-rat:MR \ --alt-model "MR:kappa[0,1]"
or
- phyloFit align.fa --subst-mod HKY85 \ --tree "(human,
(mouse,rat)mouse-rat, cow)" \ --label-subtree
"mouse-rat+:MR" \ --alt-model
"MR:kappa[0,1]"
- (Options for posterior probabilities)
--post-probs, -P
- Output posterior probabilities of all bases at all ancestral nodes. Output
will be to auxiliary file(s) with suffix ".postprobs".
--expected-subs, -X
- Output posterior expected number of substitutions on each branch at each
site, summed across all types of substitutions. Output will be to
auxiliary file(s) with suffix ".expsub".
--expected-subs-col, -J
- Output posterior expected number of substitutions of each type on each
branch, for each site. Output will be to auxiliary file(s) with suffix
.expcolsub
--expected-total-subs, -Z
- Output posterior expected number of substitutions of each type on each
branch, summed across all sites. Output will be to auxiliary file(s) with
suffix ".exptotsub".
--column-probs, -U
- (for use with -init-model; implies --lnl)
- Output a separate log
- probability for
each type of column in the input.
- Output will
- be to a file with suffix
".colprobs".
- Values are log base 2.
- (Options for estimation in sliding window)
--windows, -w <size,shift> Apply a sliding
window to the alignment, and fit a separate tree to each window. Arguments
specify size of window and amount by which to shift it on each iteration,
both in bases of the first sequence in the alignment (assumed to be the
reference sequence). Separate versions of all output files will be created
for each window.
--windows-explicit, -v <window_coord_list>
Like --windows, except that all start and end coordinates must be
explicitly specified. Each successive pair of numbers is interpreted as
defining the start and end of a window. Can be used with a two-column file
and the '*' operator, e.g., --windows-explicit '*mycoords'.