phybin - binning/clustering Newick trees by topology
phybin [OPTION...] files or directories...
PhyBin takes Newick tree files as input. Paths of Newick files can
be passed directly on the command line; if directories are provided, all
files in those directories are read. Taxa are named based on the files
containing them. If a file contains multiple trees, all are read by phybin,
and the taxa name then includes a suffix indicating the position in the
file:
- e.g. FILENAME_0, FILENAME_1, etc.
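The naming rule above can be sketched in a few lines of Python. This is only an illustration, not phybin's own code: the function name name_trees is hypothetical, trees are split naively on ';' (which would break on quoted labels containing a semicolon), and it is an assumption that the file's base name is used unchanged and that a single-tree file gets no suffix.

```python
from pathlib import Path


def name_trees(filename, contents):
    """Return (name, newick) pairs for every tree found in one file.

    Assumption: trees are terminated by ';' and no label contains ';'.
    """
    base = Path(filename).name
    # Keep only non-empty chunks and restore the terminating semicolon.
    trees = [chunk.strip() + ";" for chunk in contents.split(";") if chunk.strip()]
    if len(trees) == 1:
        # Assumption: a file with a single tree keeps the plain file name.
        return [(base, trees[0])]
    # Several trees: append the zero-based position, as in FILENAME_0, FILENAME_1.
    return [(f"{base}_{i}", tree) for i, tree in enumerate(trees)]
```

For a hypothetical file genes.tr holding two trees, this yields the names genes.tr_0 and genes.tr_1.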
When clustering trees, PhyBin computes a complete all-to-all
Robinson-Foulds distance matrix. If a threshold distance (tree edit
distance) is given, then a flat set of clusters will be produced in files
clusterXX_YY.tr. Otherwise it produces a full dendrogram.
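To make the distance matrix concrete, here is a minimal Python sketch of the textbook Robinson-Foulds distance: the number of non-trivial leaf bipartitions found in one tree but not the other. It is not phybin's implementation (which by default uses a HashRF variant). The nested-tuple tree representation and the names bipartitions and rf_matrix are assumptions made for the example, all trees are assumed to share the same taxa, and the distance is counted without halving.

```python
from itertools import combinations


def leaves(tree):
    """Collect the leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial leaf splits induced by the tree's internal edges."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # each split is stored as the side without this leaf
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits that separate zero or one leaf from the rest.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_matrix(trees):
    """All-to-all matrix of symmetric-difference sizes between split sets."""
    n = len(trees)
    splits = [bipartitions(t) for t in trees]
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = len(splits[i] ^ splits[j])
    return matrix
```

Two four-taxon trees that group the leaves differently, such as (("A","B"),("C","D")) and (("A","C"),("B","D")), are at distance 2 under this convention, while reordering children leaves the distance at 0.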
Binning mode provides an especially quick-and-dirty form of
clustering. When running with the --bin option, only exactly equal
trees are put in the same cluster. Tree pre-processing, such as
collapsing short branches, still applies.
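Binning can be pictured as grouping trees by a normalized form. The Python sketch below is a hedged illustration only: it treats trees as rooted nested tuples and normalizes them by sorting children, which is an assumption and may differ from the normalization phybin actually performs. The names canonical and bin_trees are hypothetical.

```python
from collections import defaultdict


def canonical(tree):
    """Normalize a nested-tuple tree so that child order no longer matters."""
    if isinstance(tree, str):
        return tree
    # repr gives a total order over the mix of strings and tuples.
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def bin_trees(named_trees):
    """Group tree names whose normalized topologies are exactly equal."""
    bins = defaultdict(list)
    for name, tree in named_trees.items():
        bins[canonical(tree)].append(name)
    # Largest bins first; ties keep their first-seen order.
    return sorted(bins.values(), key=len, reverse=True)
```

No distance matrix is needed here: each tree is normalized once and used as a dictionary key, which is why this is the cheapest mode.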
* Currently phybin ignores input trees with the wrong number of
taxa.
* If given a directory as input, phybin assumes all contained
files are Newick trees.
- -v --verbose
- print WARNINGS and other information (recommended at first)
- -V --version
- show version number
- -o DIR --output=DIR
- set directory to contain all output files (default
"./phybin_out/")
- --selftest
- run internal unit tests
- --bin
- Use simple binning, the cheapest form of 'clustering'
- --single
- Use single-linkage clustering (nearest neighbor)
- --complete
- Use complete-linkage clustering (furthest neighbor)
- --UPGMA
- Use Unweighted Pair Group Method (average linkage) - DEFAULT mode
- --editdist=DIST
- Combine all clusters separated by DIST or less. Report a flat list of
clusters. Irrespective of whether this is activated, a hierarchical
clustering (dendogram.pdf) is produced.
- --hashrf
- (default) use a variant of the HashRF algorithm for the distance
matrix
- --tolerant
- use a slower, modified RF metric that tolerates missing taxa
- -g --graphbins
- use graphviz to produce .dot and .pdf output files
- -d --drawbins
- like -g, but open GUI windows to show each bin's tree
- -w --view
- for convenience, "view mode" simply displays input Newick files
without binning
- --showtrees
- Print (textual) tree topology inside the nodes of the dendrogram
- --highlight=FILE
- Highlight nodes in the tree-of-trees (dendrogram) consistent with the
given tree file. Multiple highlights are permitted and use different
colors.
- --interior
- Show the consensus trees for interior nodes in the dendrogram, rather than
just points.
- -p NUM --nameprefix=NUM
- Leaf names in the input Newick trees can be gene names, not taxa. Then it
is typical to extract taxa names from genes. This option extracts a prefix
of NUM characters to serve as the taxa name.
- -s STR --namesep=STR
- An alternative to --nameprefix. STR provides a set of delimiter
characters, for example '-' or '0123456789'. The taxa name is then a
variable-length prefix of each gene name up to but not including any
character in STR.
- -m FILE --namemap=FILE
- Even once prefixes are extracted, it may be necessary to use a lookup table
to compute taxa names, e.g. if multiple genes/plasmids map onto one taxon.
This option specifies a text file with find/replace entries of the form
"<string> <taxaname>", which are applied AFTER -s and -p.
- --rfdist
- print a Robinson-Foulds distance matrix for the input trees
- --setdiff
- for convenience, print the set difference between cluster*.txt files
- --print
- simply print out a concise form of each input tree
- --printnorms
- simply print out a concise and NORMALIZED form of each input tree
- --consensus
- print a strict consensus tree for the inputs, then exit
- --matching
- print a list of tree names that match any --highlight argument
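The three name-handling options (--nameprefix, --namesep, --namemap) can be read as a small pipeline from gene name to taxa name. The Python sketch below follows the descriptions above but is not phybin's code: the function names are hypothetical, and it is an assumption that map entries match the whole extracted name rather than a substring.

```python
def parse_name_map(text):
    """Read "<string> <taxaname>" lines into a lookup table."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # silently skip blank or malformed lines
            mapping[parts[0]] = parts[1]
    return mapping


def taxon_name(gene, prefix_len=None, separators=None, name_map=None):
    """Derive a taxa name from a gene name.

    separators mimics --namesep, prefix_len mimics --nameprefix, and
    name_map mimics --namemap, which is applied after the other two.
    """
    name = gene
    if separators:
        # Keep the prefix up to, but not including, the first delimiter.
        for i, ch in enumerate(name):
            if ch in separators:
                name = name[:i]
                break
    if prefix_len is not None:
        name = name[:prefix_len]
    if name_map:
        # Assumption: entries replace the whole extracted name.
        name = name_map.get(name, name)
    return name
```

With these definitions, taxon_name("RC0123", separators="0123456789") gives "RC", and a map entry "pRC RC" would fold a plasmid prefix pRC onto the same taxon.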
This manpage was written by Andreas Tille for the Debian
distribution and can be used for any other usage of the program.