phastMotif - Predicts motifs from a set of multiple alignments.
Predicts motifs from a set of multiple alignments. Uses an EM
algorithm similar to that of MEME, but a motif is defined by phylogenetic
models rather than multinomial distributions. The specified multiple
alignments may actually be single sequences (see -m). Various
parameters control the strategy for initialization (see below). Currently,
the F81 substitution model is assumed.
phastMotif [-t <treefile>] [OPTIONS] <msa_list>
-t <file> (Required unless -m or -p)
Use specified tree topology for all phylogenetic models (Newick format).
-i <fmt>
Input format for alignment. May be FASTA, PHYLIP, MPM, SS, or MAF (default
FASTA).
-b <file>
Read background model from specified file (.mod format). By default, the
background model is estimated in a preprocessing step, by pooling all data.
-s
Estimate a separate background model for each multiple alignment. (Not yet
implemented.)
-k <size>
Learn motifs of the specified size (default is 10).
-B <n>
Report best <n> motifs (default 3).
-m
MEME mode. Use multinomial rather than phylogenetic models. Causes
multiple alignments to be ignored -- any gaps are discarded and all
sequences are assumed independent.
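As a rough illustration of what -m implies for the input, the Python sketch
below flattens alignments into independent, gap-free sequences. The function
name, the dict-based alignment representation, and the use of '-' as the gap
character are assumptions made for this example; they are not taken from
phastMotif itself.

```python
def flatten_alignments(alignments, gap_chars="-"):
    """Return every aligned row as an independent, ungapped sequence.

    `alignments` is a list of dicts mapping sequence name -> aligned string,
    a hypothetical in-memory form of the files named in msa_list.
    """
    sequences = []
    for msa in alignments:
        for name, row in msa.items():
            # Discard gap characters; alignment columns are no longer used.
            ungapped = "".join(ch for ch in row if ch not in gap_chars)
            if ungapped:  # skip rows that consisted only of gaps
                sequences.append((name, ungapped))
    return sequences
```

Each returned sequence would then be treated as its own training example,
with no relationship assumed between rows of the same alignment.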
-d <+lst>
Use the discriminative training method of Segal et al. (RECOMB'02), rather
than EM. The specified list should contain the filenames from msa_list that
are to be considered *positive* examples (containing the desired motif); all
others will be considered negative examples. Can be used with or without -m.
-p
Use "profile" models rather than phylogenetic models (characters in each
alignment column assumed independent). The resulting model is a hybrid of the
full model and MEME's model. Essentially, it uses the multiple alignments but
not the phylogeny. NOT YET IMPLEMENTED.
-n <n>
Perform <n> random restarts and report the motif with highest likelihood.
Default number is 10. Ignored with -I, -P, and -R unless -S is specified
(see below).
-I <mlst>
Run the algorithm after a "soft" initialization with each of the consensus
sequences in the specified list. At each position, <pc> pseudocounts (see -c)
are given to the consensus base and 1 pseudocount to all other bases. Each
string must have length at most equal to the size of the motif. If shorter,
it is used as a "seed" for a motif, with flanking positions treated as
wildcards.
-P <x,y>
Initialize with the x most prevalent y-tuples. A soft initialization is
performed, as above. If y is less than the motif size, y-tuples are used as a
"seed" for a motif, as above.
-R <x,y>
Initialize with a random sample of x y-tuples. A soft initialization is
performed, as above. If y is less than the motif size, y-tuples are used as a
"seed" for a motif, as above.
-w <n> (for use with -I, -P, -R)
Winnow initialization sequences to the top <n> based on the unmaximized
likelihood.
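The soft initialization and y-tuple selection described for -I and -P can be
sketched in Python as follows. This is one plausible reading for the simple
multinomial case, not the program's actual code: the function names are
hypothetical, pseudocounts are simply normalized into per-position base
frequencies, and a seed shorter than the motif is assumed to be centered (the
help text does not say where the seed is placed).

```python
from collections import Counter

BASES = "ACGT"

def soft_init(consensus, motif_size, pc=5.0):
    """Build a motif_size x 4 matrix of base frequencies from a consensus.

    The consensus base at each position receives `pc` pseudocounts and every
    other base receives 1. Positions outside the seed are wildcards (uniform).
    ASSUMPTION: a seed shorter than the motif is centered within it.
    """
    if len(consensus) > motif_size:
        raise ValueError("consensus must not be longer than the motif size")
    offset = (motif_size - len(consensus)) // 2
    matrix = []
    for pos in range(motif_size):
        i = pos - offset
        if 0 <= i < len(consensus):
            counts = [pc if b == consensus[i] else 1.0 for b in BASES]
        else:
            counts = [1.0] * len(BASES)  # flanking wildcard position
        total = sum(counts)
        matrix.append([c / total for c in counts])
    return matrix

def most_prevalent_tuples(sequences, x, y):
    """Return the x most frequent substrings of length y (compare -P <x,y>)."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - y + 1):
            counts[seq[start:start + y]] += 1
    return [tup for tup, _ in counts.most_common(x)]
```

With the default of 5 pseudocounts, a consensus base starts with frequency
5/8 and each other base with 1/8, so the starting point favors the consensus
without excluding alternatives.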
-c <pc> (for use with -I, -P, -R)
Number of pseudocounts for consensus bases (default 5).
-S (for use with -I, -P, -R)
Instead of doing a deterministic initialization based on a consensus
sequence, sample parameters from a Dirichlet distribution defined by the
pseudocounts (see -c). In this case, random restarts are performed, as
specified by -n.
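Sampling from a Dirichlet distribution defined by pseudocounts, as -S
describes, can be illustrated with the standard construction from independent
gamma draws. The sketch below uses only the Python standard library; the
function names and the per-position layout (pc for the consensus base, 1 for
the others) are assumptions for illustration and may differ from what
phastMotif does internally.

```python
import random

def sample_dirichlet(pseudocounts, rng=random):
    """Draw one probability vector from Dirichlet(pseudocounts).

    Uses the standard construction: independent Gamma(alpha, 1) draws,
    normalized so that they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for alpha in pseudocounts]
    total = sum(draws)
    return [d / total for d in draws]

def sample_position(consensus_base, pc=5.0, bases="ACGT", rng=random):
    """Sample base frequencies for one motif position.

    ASSUMPTION: the consensus base gets `pc` pseudocounts, all others get 1.
    """
    alphas = [pc if b == consensus_base else 1.0 for b in bases]
    return sample_dirichlet(alphas, rng)
```

Each call yields a different starting point centered on the consensus, which
is why random restarts become meaningful again when -S is given.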
-o <pref>
Use the specified prefix for all output files (default "phastm").
-H
Produce HTML-formatted output, in addition to ordinary output. One file is
produced per predicted motif, as well as a single HTML-formatted summary file.
-D
Produce a BED file with predicted motifs, for use in the UCSC browser.
Currently, sequence names must be formatted like
"chr10:102553847-102554897+", with the final '+' or '-' indicating strand.
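To check sequence names before running with -D, a small validator along the
following lines may help. The regular expression is inferred from the single
example above and is therefore an assumption about the accepted format; the
function name is hypothetical, and the conversion of these coordinates into
BED records is not shown.

```python
import re

# Pattern inferred from the example "chr10:102553847-102554897+".
NAME_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$"
)

def parse_seq_name(name):
    """Split a sequence name into (chrom, start, end, strand).

    Raises ValueError if the name does not match the expected layout.
    """
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("unexpected sequence name: %r" % name)
    return (
        match.group("chrom"),
        int(match.group("start")),
        int(match.group("end")),
        match.group("strand"),
    )
```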
-x (For use with -H or -D)
Suppress ordinary output to stdout.
-h
Print this help message.