PREQUEL(1) | User Commands | PREQUEL(1) |
prequel - Compute marginal probability distributions for bases at ancestral nodes in a phylogenetic tree
Compute marginal probability distributions for bases at ancestral nodes in a phylogenetic tree, using the tree model defined in tree.mod (may be produced with phyloFit). These distributions are computed using the sum-product algorithm, assuming independence of sites.
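To make the sum-product computation described above concrete, here is a minimal sketch for a single alignment column. It is not prequel's implementation: the four-species tree, the branch lengths, the node names "hc" and "mr", and the Jukes-Cantor substitution model are all assumptions chosen for illustration, whereas prequel reads the actual model from tree.mod.

```python
import math

BASES = "ACGT"

# Hypothetical tree and branch lengths (not taken from any real tree.mod).
CHILDREN = {"root": ("hc", "mr"), "hc": ("human", "chimp"), "mr": ("mouse", "rat")}
BRANCH = {"hc": 0.05, "mr": 0.2, "human": 0.01, "chimp": 0.01, "mouse": 0.08, "rat": 0.09}


def jc_matrix(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def message(child, up):
    """Likelihood of the data below `child`, as a function of the parent's base."""
    p = jc_matrix(BRANCH[child])
    return [sum(p[a][b] * up[child][b] for b in range(4)) for a in range(4)]


def upward(node, column, up):
    """Post-order pass: up[node][a] = P(leaves below node | node has base a)."""
    if node not in CHILDREN:
        obs = column[node]
        # A gap or missing-data symbol is compatible with every base.
        up[node] = [1.0 if obs not in BASES or obs == BASES[a] else 0.0 for a in range(4)]
        return
    up[node] = [1.0] * 4
    for child in CHILDREN[node]:
        upward(child, column, up)
        msg = message(child, up)
        up[node] = [up[node][a] * msg[a] for a in range(4)]


def downward(node, down, up):
    """Pre-order pass: down[c][b] = P(data outside the subtree of c, c has base b)."""
    if node not in CHILDREN:
        return
    left, right = CHILDREN[node]
    for child, sibling in ((left, right), (right, left)):
        sib = message(sibling, up)
        p = jc_matrix(BRANCH[child])
        down[child] = [
            sum(down[node][a] * sib[a] * p[a][b] for a in range(4)) for b in range(4)
        ]
        downward(child, down, up)


def ancestral_marginals(column):
    """Return {ancestral node: [P(A), P(C), P(G), P(T)]} for one column."""
    up, down = {}, {"root": [0.25] * 4}  # uniform base frequencies at the root
    upward("root", column, up)
    downward("root", down, up)
    post = {}
    for node in CHILDREN:
        weights = [down[node][a] * up[node][a] for a in range(4)]
        total = sum(weights)
        post[node] = [w / total for w in weights]
    return post


column = {"human": "G", "chimp": "G", "mouse": "G", "rat": "A"}
for node, probs in ancestral_marginals(column).items():
    print(node, " ".join(f"{p:.6f}" for p in probs))
```

Because sites are assumed independent, the same two passes are simply repeated for every column of the alignment.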
Currently, indels are not treated probabilistically (hence the "largely") but are reconstructed by parsimony, also assuming site independence. Specifically, each base is assumed to have been inserted on the branch leading to the last common ancestor (LCA) of all species that have actual bases (as opposed to alignment gaps or missing data) at a given site; gaps in descendant species are assumed to have arisen (parsimoniously) from deletions. When this LCA is either the left or right child of the root, insertions on one branch cannot be distinguished from deletions on the other. We conservatively assume that the base was present at the root and was subsequently deleted. (Note that this will produce an upward bias in the length of the sequence at the root.)
Output is written to files of the form outroot.XXX.probs, where XXX is the name of an ancestral node in the tree. These nodes may be named explicitly in tree.mod. Any ancestral node that is left unnamed will be given a name that is a concatenation of two names, belonging to arbitrarily selected leaves of each subtree beneath the node (see below).
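The parsimony rule for indels described above can be sketched as follows. This is one reading of the rule, not prequel's code: the five-species tree, the node names, and the set of symbols treated as gaps or missing data are assumptions, and the sketch further assumes that an ancestral node below the LCA carries the base only if at least one of its descendant leaves does.

```python
# Hypothetical five-species tree, stored as child -> parent.
PARENT = {
    "human": "hc", "chimp": "hc", "hc": "primate", "macaque": "primate",
    "mouse": "mr", "rat": "mr", "primate": "root", "mr": "root",
}
ROOT = "root"
INTERNAL = set(PARENT.values())
GAP_OR_MISSING = set("-*N")  # assumed symbols for alignment gaps and missing data


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def ancestors_with_base(column):
    """Return the ancestral nodes inferred to have a base at this column."""
    present = [leaf for leaf, ch in column.items() if ch not in GAP_OR_MISSING]
    if not present:
        return set()
    # The LCA is the first node shared by all leaf-to-root paths.
    common = path_to_root(present[0])
    for leaf in present[1:]:
        on_path = set(path_to_root(leaf))
        common = [n for n in common if n in on_path]
    lca = common[0]
    # Ancestors between each base-carrying leaf and the LCA keep the base.
    keep = set()
    for leaf in present:
        for node in path_to_root(leaf):
            if node in INTERNAL:
                keep.add(node)
            if node == lca:
                break
    # Conservative rule: if the LCA is a child of the root, assume the base
    # was present at the root and deleted on the other branch.
    if PARENT.get(lca) == ROOT:
        keep.add(ROOT)
    return keep


column = {"human": "A", "chimp": "-", "macaque": "G", "mouse": "-", "rat": "-"}
print(sorted(ancestors_with_base(column)))
```

In this column the LCA of the species with bases is "primate", a child of the root, so the root is also assigned a base. That is the source of the upward bias in root sequence length noted above.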
Given a multiple alignment in a file called "mammals.fa" and a tree model called "mytree.mod" (see phyloFit), reconstruct all ancestral sequences:
If the TREE definition in mytree.mod has labeled ancestral nodes, e.g.,
then output will be to files named "anc.primate.probs", "anc.rodent.probs", and "anc.mammal.probs". (For the Newick tree format, see http://evolution.genetics.washington.edu/phylip/newicktree.html) If instead the ancestral nodes are unlabeled, e.g.,
then names will be created by concatenating leaf names, e.g., "anc.human-chimp.probs", "anc.mouse-rat.probs", and "anc.human-mouse.probs".
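The naming scheme above can be illustrated with a short sketch that derives the output file names from a Newick string. The two tree strings are reconstructed from the node and leaf names mentioned in the text and are assumptions, as is the choice of the leftmost leaf of each subtree; prequel only promises that the leaves are selected arbitrarily.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, children) tuples.

    Branch lengths (":0.1") are skipped; unlabeled nodes get the name "".
    """
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        if pos < len(text) and text[pos] == ":":  # skip a branch length
            pos += 1
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
        return name, children

    return node()


def first_leaf(tree):
    """Return the name of the leftmost leaf beneath a node."""
    name, children = tree
    return name if not children else first_leaf(children[0])


def output_files(newick, outroot):
    """List the .probs file names expected for each ancestral node."""
    files = []

    def visit(tree):
        name, children = tree
        if not children:
            return
        for child in children:
            visit(child)
        # Assumption: an unnamed node is named after the leftmost leaf of each subtree.
        label = name or "-".join(first_leaf(c) for c in children)
        files.append(f"{outroot}.{label}.probs")

    visit(parse_newick(newick))
    return files


print(output_files("((human,chimp)primate,(mouse,rat)rodent)mammal;", "anc"))
print(output_files("((human,chimp),(mouse,rat))", "anc"))
```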
Each output file will consist of a row for each position in the sequence and a column for each base, with the (i,j)th value giving the probability of base j at position i. For example,
By default, no row is reported for bases inferred not to have been present at an ancestral node, so the number of rows will generally be smaller than the number of columns in the input alignment. However, if you wish to maintain a correspondence between row number and alignment column, you can use the --keep-gaps option, which will cause "padding" rows to be included in the output, e.g.,
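For downstream processing, a .probs file can be read with a few lines of code. The sketch below rests on assumptions about details the text does not spell out: that lines beginning with "#" are comments, that the columns are ordered A, C, G, T, and that a padding row produced by --keep-gaps is any row that does not parse as numbers. The sample values are invented for illustration.

```python
def read_probs(lines):
    """Return one tuple of probabilities per row, or None for a padding row."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # assumed comment or blank line
        try:
            rows.append(tuple(float(x) for x in fields))
        except ValueError:
            rows.append(None)  # assumed --keep-gaps padding row
    return rows


def most_probable_bases(rows, order="ACGT"):
    """Collapse each row to its most probable base; padding rows become '-'."""
    return "".join(
        "-" if row is None else order[max(range(len(row)), key=row.__getitem__)]
        for row in rows
    )


# Hypothetical file contents, not real prequel output.
sample = """#p(A) p(C) p(G) p(T)
0.01 0.02 0.95 0.02
- - - -
0.90 0.05 0.03 0.02
""".splitlines()

rows = read_probs(sample)
print(most_probable_bases(rows))
```

With --keep-gaps, the index of a row in the returned list corresponds to the alignment column; without it, only the positions inferred to be present at the ancestral node appear.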
The output files produced by prequel can get quite large. They can be encoded in a compact binary form using pbsEncode, e.g.,
although this encoding results in some loss of precision. Encoded files can be decoded using pbsDecode, e.g.,
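The loss of precision mentioned above is what one would expect from any scheme that replaces each probability vector with an index into a finite code. The sketch below shows the general idea only. It does not reproduce the file format or algorithm of pbsEncode and pbsDecode; the five-entry code and the squared-distance criterion are assumptions chosen for illustration.

```python
# Hypothetical five-entry code: one vector near each base plus the uniform vector.
CODEBOOK = [
    (0.97, 0.01, 0.01, 0.01),
    (0.01, 0.97, 0.01, 0.01),
    (0.01, 0.01, 0.97, 0.01),
    (0.01, 0.01, 0.01, 0.97),
    (0.25, 0.25, 0.25, 0.25),
]


def sq_dist(u, v):
    """Squared Euclidean distance between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def encode(rows, codebook=CODEBOOK):
    """Store each row as one byte: the index of its nearest code vector."""
    return bytes(
        min(range(len(codebook)), key=lambda k: sq_dist(row, codebook[k]))
        for row in rows
    )


def decode(data, codebook=CODEBOOK):
    """Map each byte back to its code vector."""
    return [codebook[k] for k in data]


rows = [(0.01, 0.02, 0.95, 0.02), (0.30, 0.20, 0.30, 0.20)]
packed = encode(rows)
restored = decode(packed)
print(len(packed), "bytes;", restored)
```

Each row shrinks to a single byte, but the restored vectors are the code vectors rather than the original probabilities, so a round trip is only approximate.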
For maximum efficiency, encode ancestral reconstructions on the fly using the --encode option to prequel, e.g.,
Prequel can also be useful in optimizing a code based on training data. The --suff-stats option produces a more compact output file, which can then be fed to pbsTrain, e.g.,
--seqs, -s <seqlist>
--exclude, -x (for use with --seqs) Exclude rather than include specified sequences.
--keep-gaps, -k
--no-probs, -n
--suff-stats, -S
--encode, -e <code_file> Encode probabilities using the given code and output as binary files. Output files will have suffix ".bin" rather than ".probs".
--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS
--refseq, -r <fname>
--gibbs, -G <nsamples>
--help, -h
May 2016 | prequel 1.4 |