NAME

exoniphy - Prediction of evolutionarily conserved protein-coding exons using a phylogenetic hidden Markov model (phylo-HMM)

DESCRIPTION

Prediction of evolutionarily conserved protein-coding exons using a phylogenetic hidden Markov model (phylo-HMM). By default, a model definition and model parameters are used that are appropriate for exon prediction in human DNA, based on human/mouse/rat alignments and a 60-state HMM. Using the --hmm, --tree-models, and --catmap options, however, it is possible to define alternative phylo-HMMs, e.g., for different sets of species and different phylogenies, or for prediction of exon pairs or complete gene structures.

Required argument <msa_fname> must be a multiple alignment file, in one of several possible formats (see --msa-format).
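As a rough illustration of how an invocation might be assembled, the sketch below builds an argument list from options documented on this page. The helper name exoniphy_argv is hypothetical, and placing the options before the alignment file is an assumption rather than a documented requirement. The command is only printed, not run.

```python
import shlex


def exoniphy_argv(msa_fname, msa_format=None, score=False, seqname=None):
    """Assemble an exoniphy argument list (hypothetical helper).

    Only options named in this page are used; option-before-file
    ordering is an assumption.
    """
    argv = ["exoniphy"]
    if msa_format is not None:
        argv += ["--msa-format", msa_format]
    if score:
        argv.append("--score")
    if seqname is not None:
        argv += ["--seqname", seqname]
    argv.append(msa_fname)  # the required alignment file
    return argv


cmd = exoniphy_argv("chr22.35.ss", msa_format="SS", score=True)
print(shlex.join(cmd))
```

Building a list instead of a single string avoids shell quoting problems if the command is later passed to a process launcher.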

OPTIONS

: (Model definition and model parameters)

--hmm, -H <fname>

: Name of HMM file defining states and transition probabilities. By default, the 60-state HMM described in Siepel & Haussler (2004) is used, with transition probabilities appropriate for mammalian genomes (estimated as described in that paper).

--tree-models, -m <fname_list>

: List of tree model (*.mod) files, one for each state in the HMM. Order of models must correspond to order of states in HMM file. By default, a set of models appropriate for human, mouse, and rat is used (estimated as described in Siepel & Haussler, 2004).

--catmap, -c <fname>|<string>

: Mapping of feature types to category numbers. Can give either a filename or an "inline" description of a simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3". By default, a category map is used that is appropriate for the 60-state HMM mentioned above.

--extrapolate, -e <phylog.nh> | default

: Extrapolate to a larger set of species based on the given phylogeny (Newick-format). The trees in the given tree models (*.mod files) must be subtrees of the larger phylogeny. For each tree model M, a copy will be created of the larger phylogeny, then scaled such that the total branch length of the subtree corresponding to M's tree equals the total branch length of M's tree; this new version will then be used in place of M's tree. (Any species name present in this tree but not in the data will be ignored.) If the string "default" is given instead of a filename, then a phylogeny for 25 vertebrate species, estimated from sequence data for Target 1 (CFTR) of the NISC Comparative Sequencing Program (Thomas et al., 2003), will be assumed.
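The scaling step described for --extrapolate can be sketched as follows. This is a simplified illustration, not the program's code: it assumes the branches of the larger phylogeny that make up the subtree are already known, and it stores branch lengths in a plain dictionary. The function name, branch names, and lengths are all hypothetical.

```python
def scale_to_match(big_branches, subtree_branch_ids, model_total):
    """Scale every branch of the larger phylogeny by one common factor so
    that the branches forming the subtree sum to model_total."""
    subtree_total = sum(big_branches[b] for b in subtree_branch_ids)
    factor = model_total / subtree_total
    return {b: length * factor for b, length in big_branches.items()}


# Hypothetical branch lengths keyed by branch name.
big = {"human": 0.10, "mouse": 0.30, "rat": 0.32,
       "rodent_anc": 0.08, "chicken": 0.60}
# Branches assumed to form the human/mouse/rat subtree.
sub = ["human", "mouse", "rat", "rodent_anc"]

# Suppose tree model M has a total branch length of 0.40.
scaled = scale_to_match(big, sub, model_total=0.40)
```

Because a single factor is applied to the whole tree, relative branch lengths in the larger phylogeny are preserved while the subtree total matches that of M's tree.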

--data-path, -D <path>

: Path to the directory with phast data. Exoniphy default models should be in <path>/exoniphy/. Default is set at compile time.
: (Input and output)

--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

: File format of input alignment. Default is to guess alignment format from file contents.

--score, -S

: Report log-odds scores for predictions, equal to their log total probability under an exon model minus their log total probability under a background model. The exon model can be altered using --cds-types and --signal-types and the background model can be altered using --backgd-types (see below).

--seqname, -s <name>

: Use specified string as "seqname" field in GFF output. Default is obtained from input file name (double filename root, e.g., "chr22" if input file is "chr22.35.ss").

--idpref, -p <name>

: Use specified string as prefix of generated ids in GFF output. Can be used to ensure ids are unique. Default is obtained from input file name (single filename root, e.g., "chr22.35" if input file is "chr22.35.ss").
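The "single" and "double" filename roots used as defaults for --idpref and --seqname can be illustrated with a short sketch. The helper name filename_root is hypothetical, and stripping the directory part first is an assumption; only the two examples given above ("chr22.35" and "chr22" for "chr22.35.ss") come from this page.

```python
import os


def filename_root(path, times=1):
    """Remove the last `times` dot-separated extensions from a file name.

    Dropping the directory component is an assumption made for this sketch.
    """
    name = os.path.basename(path)
    for _ in range(times):
        name = os.path.splitext(name)[0]
    return name


default_idpref = filename_root("chr22.35.ss", times=1)   # single root
default_seqname = filename_root("chr22.35.ss", times=2)  # double root
```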

--grouptag, -g <tag>

: Use specified string as the tag denoting groups in GFF output (default is "transcript_id").

--alias, -A <alias_def>

: Alias names in input alignment according to given definition, e.g., "hg17=human; mm5=mouse; rn3=rat". Useful with default tree models and with --extrapolate. (Default models use generic common names such as "human", "mouse", and "rat". This option allows a mapping to be established between the leaves of trees in these files and the sequences of an alignment that uses an alternative naming convention.)
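To make the shape of an alias definition concrete, the sketch below splits a string such as the one above into a mapping from alignment names to model names. It only illustrates the format shown in the example; the function name is hypothetical and the program's own parsing rules may differ.

```python
def parse_alias_def(alias_def):
    """Turn 'old=new; old=new' into a dict (illustrative parser only)."""
    aliases = {}
    for item in alias_def.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        old, new = item.split("=", 1)
        aliases[old.strip()] = new.strip()
    return aliases


aliases = parse_alias_def("hg17=human; mm5=mouse; rn3=rat")
```

Here the keys are the names used in the alignment and the values are the generic names used by the default tree models.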
: (Altering the states and transition probabilities of the HMM)

--no-cns, -x

: Eliminate the state/category for conserved noncoding sequence from the default HMM and category map. Ignored if non-default HMM and category map are selected.

--reflect-strand, -U

: Given an HMM describing the forward strand, create a larger HMM that allows for features on both strands by "reflecting" the HMM about all states associated with background categories (see --backgd-types). The new HMM will be used for predictions on both strands. If the default HMM is used, then this option will be used automatically.

--bias, -b <val>

: Set "coding bias" equal to the specified value (default -3.33 if default HMM is used, 0 otherwise). The coding bias is added to the log probabilities of transitions from background states to non-background states (see --backgd-types), then all transition probabilities are renormalized. If the coding bias is positive, then more predictions will tend to be made and sensitivity will tend to improve, at some cost to specificity; if it is negative, then fewer predictions will tend to be made, and specificity will tend to improve, at some cost to sensitivity.
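The effect of the coding bias on a transition matrix can be sketched on a toy two-state model. This is an illustration under stated assumptions, not the program's implementation: natural logarithms are assumed, renormalization is assumed to be per source state, and the state names and probabilities are made up.

```python
import math


def apply_coding_bias(trans, background, bias):
    """Add `bias` to the log probability of each transition from a
    background state to a non-background state, then renormalize each row.

    trans: dict mapping state -> dict mapping next state -> probability.
    """
    out = {}
    for src, row in trans.items():
        new_row = {}
        for dst, p in row.items():
            if src in background and dst not in background and p > 0:
                p = math.exp(math.log(p) + bias)
            new_row[dst] = p
        total = sum(new_row.values())
        out[src] = {dst: p / total for dst, p in new_row.items()}
    return out


# Toy model with hypothetical probabilities.
trans = {
    "background": {"background": 0.99, "CDS": 0.01},
    "CDS": {"background": 0.1, "CDS": 0.9},
}
biased = apply_coding_bias(trans, background={"background"}, bias=-3.33)
```

With a negative bias the background-to-CDS probability shrinks, so fewer exons are entered; a positive bias has the opposite effect.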

--sens-spec, -Y <fname-root>

: Make predictions for a range of different coding biases (see --bias), and write results to files with given filename root. This allows the sensitivity/specificity tradeoff to be examined. The range is fixed at -20 to 10, and 10 different sets of predictions are produced.

: (Feature types)

--backgd-types, -B <list>

: Feature types to be considered "background" (default value: "background,CNS"). Affects --reflect-strand, --score, and --bias.

--cds-types, -C <list>

: (for use with --score) Feature types that represent protein-coding regions (default value: "CDS").

--signal-types, -L <list>

: (for use with --score) Types of features to be considered "signals" during scoring (default value: "start_codon,stop_codon,5'splice,3'splice,prestart,cds5'ss,cds3'ss"). One score is produced for a CDS feature (as defined by --cds-types) and the adjacent signal features; the score is then assigned to the CDS feature.

: (Indels)

--indels, -I

: Use the indel model described in Siepel & Haussler (2004).

--no-gaps, -W <list>

: Prohibit gaps in sites of the specified categories (gaps result in emission probabilities of zero). If the default category map is used (see --catmap), then gaps are prohibited in start and stop codons and at the canonical GT and AG positions of splice sites (with or without --indels). In all other cases, the default behavior is to treat gaps as missing data, or to address them with the indel model (--indels).

--require-informative, -N <list>

: Require "informative" columns (i.e., columns with more than two non-missing-data characters, excluding sequences specified by --not-informative) in the given categories (list by name or number). Non-informative columns will be given emission probabilities of zero. If the default category map is used (see --catmap), then this option applies automatically to CDSs, start and stop codons, and the canonical GT and AG positions of splice sites. Note that alignment gaps *are* considered informative; the way they are handled is defined by --indels and --no-gaps.
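The informativeness test can be sketched for a single alignment column as below. The set of characters treated as missing data ("N", "n", "*") is an assumption made for this illustration and is not specified on this page; gaps ("-") are counted as informative, as stated above. The function name and species are hypothetical.

```python
MISSING = {"N", "n", "*"}  # assumed missing-data characters


def is_informative(column, not_informative=()):
    """Return True if the column has more than two non-missing characters,
    ignoring sequences listed in not_informative.

    column: dict mapping sequence name -> aligned character.
    """
    count = sum(
        1
        for name, ch in column.items()
        if name not in not_informative and ch not in MISSING
    )
    return count > 2


col = {"human": "A", "chimp": "A", "mouse": "N", "rat": "G"}
with_chimp = is_informative(col)
without_chimp = is_informative(col, not_informative=("chimp",))
```

In this column, excluding chimp leaves only human and rat as non-missing, so the column is no longer informative.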

--not-informative, -n <list>

: Do not consider the specified sequences (listed by name) when deciding whether a column is informative. This option can be useful when sequences are present that are very close to the reference sequence and thus do not contribute much in the way of phylogenetic information. E.g., one might use "--not-informative chimp" with a human-referenced multiple alignment including chimp sequence.
: (Other)

--quiet, -q

: Proceed quietly (without messages to stderr).

--help, -h

: Print this help message.

REFERENCES

A. Siepel and D. Haussler. 2004. Computational identification of evolutionarily conserved exons. Proc. 8th Annual Int'l Conf. on Research in Computational Biology (RECOMB '04), pp. 177-186.

J. Thomas et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424:788-793.

May 2016

exoniphy 1.4