PHYLOP(1) | User Commands | PHYLOP(1) |
phyloP - Compute conservation or acceleration p-values based on an alignment and a model of neutral evolution.
The phylogenetic model must be in the .mod format produced by the phyloFit program. The alignment file can be in any of several file formats (see --msa-format). No alignment is required with the --null option.
Compute conservation or acceleration p-values based on an alignment and a model of neutral evolution. Will also compute p-values of conservation/acceleration in a subtree and in its complementary supertree given the whole tree (see --subtree). P-values can be produced for entire input alignments (the default), pre-specified intervals within an alignment (see --features), or individual sites (see --wig-scores and --base-by-base).
The default behavior is to compute a null distribution for the total number of substitutions from the tree model, an estimate of the number of substitutions that have actually occurred, and the p-value of this estimate with respect to the null distribution. These computations are performed as described by Siepel, Pollard, and Haussler (2006). In addition to the SPH method, phyloP can compute p-values or conservation/acceleration scores using a likelihood ratio test (--method LRT), a score-based test (--method SCORE), or a procedure similar to that used by GERP (Cooper et al., 2005) (--method GERP). These alternative methods are currently supported only with --base-by-base, --wig-scores, or --features.
The main advantage of the SPH method is that it can provide a complete and exact description of distributions over numbers of substitutions. However, simulation experiments suggest that the LRT and SCORE methods have somewhat better power than SPH for identifying selection, especially when the expected number of substitutions is small (e.g., with short branch lengths and/or short intervals/individual sites). These two methods are also faster. They are generally similar to one another in power, but in many cases SCORE is considerably faster than LRT. On the other hand, SCORE appears to have slightly less power than LRT at low false positive rates, i.e., for cases of extreme selection. Thus, when using --base-by-base, --wig-scores, or --features, LRT is recommended for most purposes, but SCORE is a good alternative if speed is an issue.
When computing p-values with the SPH method, the default is to use the posterior expected number of substitutions as an estimate of the actual number. This is a conservative estimate, because it is biased toward the mean of the null distribution by the prior. These p-values can be made less conservative with --fit-model and more conservative with --confidence-interval (see below).
1. Using the SPH method, compute and report p-values of conservation and acceleration for a given alignment with respect to a neutral model of evolution. Estimated numbers of substitutions are also reported.
The file neutral.mod could be produced by running phyloFit on data from ancestral repeats or fourfold degenerate sites with an appropriate tree topology and substitution model.
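A minimal sketch of how example 1 might be invoked. The names alignment.fa and report.txt are assumptions chosen for illustration; the argument order (model file, then alignment) is also assumed, following the description above. If phyloP or the input files are not present, the command is only printed.

```shell
# Assumed (hypothetical) file names; no options are needed for the default SPH run.
MODEL=neutral.mod
ALN=alignment.fa

# Build the argument list once so it can be either run or printed.
set -- phyloP "$MODEL" "$ALN"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$ALN" ]; then
    "$@" > report.txt    # whole-alignment p-values, default SPH method
else
    echo "would run: $* > report.txt"
fi
```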
2. Compute and report p-values of conservation and acceleration for a particular subtree of interest (using SPH).
Here human-mouse_lemur denotes the most recent common ancestor of human and mouse_lemur, which is the node that defines the primate clade in this phylogeny. The tree_doctor program with the --name-ancestors option can be used to assign names to ancestral nodes of the tree.
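Example 2 could be run along the following lines. This is a sketch only: alignment.fa and report.txt are assumed file names, and the node name is taken from the text above. The command is printed rather than executed when phyloP or the inputs are missing.

```shell
# Assumed (hypothetical) file names; --subtree takes an ancestral node name.
MODEL=neutral.mod
ALN=alignment.fa
NODE=human-mouse_lemur

set -- phyloP --subtree "$NODE" "$MODEL" "$ALN"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$ALN" ]; then
    "$@" > report.txt    # subtree and supertree p-values (SPH)
else
    echo "would run: $* > report.txt"
fi
```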
3. Describe the complete null distribution over the number of substitutions for a 10bp alignment given the specified neutral model (using SPH).
A two-column table is produced with numbers of substitutions and their probabilities, up to an appropriate upper limit.
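A possible invocation for example 3, assuming the output is redirected to a file named null.txt (a hypothetical name). No alignment argument is given, since --null ignores it.

```shell
# Assumed (hypothetical) output name; 10 is the number of sites from the example text.
MODEL=neutral.mod
NSITES=10

set -- phyloP --null "$NSITES" "$MODEL"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ]; then
    "$@" > null.txt      # two-column table: substitution count, probability
else
    echo "would run: $* > null.txt"
fi
```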
4. Describe the complete posterior distribution over the number of substitutions in a given alignment (using SPH).
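Example 4 might look like the sketch below; alignment.fa and posterior.txt are assumed names, not ones given in this page.

```shell
# Assumed (hypothetical) file names.
MODEL=neutral.mod
ALN=alignment.fa

set -- phyloP --posterior "$MODEL" "$ALN"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$ALN" ]; then
    "$@" > posterior.txt # posterior distribution over substitution counts
else
    echo "would run: $* > posterior.txt"
fi
```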
5. Compute conservation scores (-log10 p-values) for each site in an alignment and output them in the fixed-step wig format (see http://genome.ucsc.edu/goldenPath/help/wiggle.html). Use the likelihood ratio test (LRT) method.
The --mode option can be used instead to produce acceleration scores (ACC), scores of nonneutrality (NNEUT), or scores that summarize conservation and acceleration (CONACC). The --base-by-base option can be used to output additional statistics of interest (estimated scale factors, log10 likelihood ratios, etc.). As discussed above, several arguments to --method are possible.
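For example 5, a command of roughly this shape would apply. The file names alignment.fa and scores.wig are assumptions; the flags are those listed under the options in this page.

```shell
# Assumed (hypothetical) file names.
MODEL=neutral.mod
ALN=alignment.fa

# Per-site conservation scores with the likelihood ratio test.
set -- phyloP --wig-scores --method LRT "$MODEL" "$ALN"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$ALN" ]; then
    "$@" > scores.wig    # fixed-step wig output of -log10 p-values
else
    echo "would run: $* > scores.wig"
fi
```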
6. Similarly, compute scores describing lineage-specific conservation in primates.
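Example 6 presumably combines the per-site options of example 5 with --subtree. The sketch below assumes the same hypothetical file names and the primate node name used in example 2.

```shell
# Assumed (hypothetical) file names; node name as in example 2.
MODEL=neutral.mod
ALN=alignment.fa
NODE=human-mouse_lemur

set -- phyloP --wig-scores --method LRT --subtree "$NODE" "$MODEL" "$ALN"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$ALN" ]; then
    "$@" > scores.wig    # lineage-specific per-site scores
else
    echo "would run: $* > scores.wig"
fi
```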
7. Compute conservation p-values and associated statistics for each element in a BED file. This time use a score test and allow for acceleration as well as conservation, flagging elements under acceleration by making their p-values negative (CONACC mode).
The --features option can also be used with --subtree. The --gff-scores option can be used to output the original features in GFF format with scores equal to -log10 p. Note that the input file can be in GFF instead of BED format.
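A sketch of example 7, assuming a feature file named elements.bed and an output file named element-scores.txt (both hypothetical). The flags are those documented in this page.

```shell
# Assumed (hypothetical) file names.
MODEL=neutral.mod
ALN=alignment.fa
FEATS=elements.bed

# Score test per BED element; CONACC reports acceleration as negative p-values.
set -- phyloP --features "$FEATS" --method SCORE --mode CONACC "$MODEL" "$ALN"

if command -v phyloP >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$ALN" ] && [ -f "$FEATS" ]; then
    "$@" > element-scores.txt
else
    echo "would run: $* > element-scores.txt"
fi
```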
--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS
--method, -m SPH|LRT|SCORE|GERP
--wig-scores, -w
--base-by-base, -b
--refidx, -r <refseq_idx>
--mode, -o CON|ACC|NNEUT|CONACC
--features, -f <file>
--gff-scores, -g
--subtree, -s <node-name>
--branch, -B <node-name(s)>
--chrom, -N <name>
--log, -l <fname>
--seed, -d <seed>
--no-prune, -P
--help, -h
--null, -n <nsites> Compute just the null (prior) distribution of the number of substitutions, as defined by the tree model and the given number of sites, and output as a table. The 'alignment' argument will be ignored. If used with --subtree, the joint distribution over the number of substitutions in the specified supertree and subtree will be output instead.
--posterior, -p Compute just the posterior distribution of the number of substitutions, given the alignment and the model, and output as a table. If used with --subtree, the joint distribution over the number of substitutions in the specified supertree and subtree will be output instead.
--fit-model, -F
--epsilon, -e <val>
--confidence-interval, -c <val>
--quantiles, -q
Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005 15(7):901-13.
Siepel A, Pollard KS, Haussler D. New methods for detecting lineage-specific selection. In Proceedings of the 10th International Conference on Research in Computational Molecular Biology (RECOMB 2006), pp. 190-205.
May 2016 | phyloP 1.4 |