mptp — single-locus species delimitation
Maximum-likelihood species delimitation:
mptp --ml (--single | --multi) --tree_file
newickfile --output_file outputfile [options]
Species delimitation with support values:
mptp --mcmc positive integer (--single |
--multi) (--mcmc_startnull | --mcmc_startrandom | --mcmc_startml) --mcmc_log
positive integer --tree_file newickfile --output_file
outputfile [options]
Species is one of the fundamental units of comparison in virtually
all subfields of biology, from systematics to anatomy, development, ecology,
evolution, genetics and molecular biology. The aim of mptp is to
offer an open source tool to infer species boundaries on a a given
phylogenetic tree based on the Poisson Tree Process (PTP) and the Multiple
Poisson Tree Process (mPTP) models.
mptp offers two methods for inferring species delimitation.
First, a maximum-likelihood based method that uses a dynamic programming
approach to infer an ML estimate. Second, an mcmc approach for sampling the
space of possible delimitations providing the user with support values on
the tree clades. Both approaches are available in two flavours: the PTP and
the mPTP model. The PTP model is specified by using the single switch
and the mPTP by using multi.
The input for mptp is a newick file that contains one
phylogenetic tree, i.e., branches express the expected number of
substitutions per alignment site.
mptp parses a large number of command-line options. For
easier navigation, options are grouped below by theme.
General options:
- --help
- Display help text and exit.
- --version
- Output version information and exit.
- --quiet
- Supress all output to stdout except for warnings and fatal error
messages.
- --tree_file filename
- Input newick file that contains a phylogenetic tree. Can be rooted or
unrooted.
- --output_file filename
- Specifies the prefix used for generating output files. For
maximum-likelihood species delimitation two files will be created. First,
filename.txt that contains the actual delimitation and
filename.svg that contains an SVG figure of the computed
delimitation. For mcmc analyses, a file filename.txt is created
that contains the newick tree with supports values.
- --outgroup comma-separated
list of taxa
- All computations for species delimitation are carried out on rooted trees.
This option is used only (and is required) In case an unrooted tree was
specified with the --tree_file option. mptp roots the unrooted tree
by splitting the branch leading to the most recent common ancestor (MRCA)
of the comma-separated list of taxa into two branches of equal size and
introducing a new node (the root of the new rooted tree) that connects
these two branches.
- --outgroup_crop
- Crops taxa specified with the --outgroup option from the the tree.
- --min_br real
- Any branch lengths in the input tree smaller or equal than real are
excluded (ignored) from the computations. In addition, for mcmc analyses,
subtrees that exclusively consist of branch lengths smaller or equal to
real are completely ignored from the proposals (support values for
those clades are set to 0). (default: 0.0001)
- --precision positive
integer
- Specifies the precision of the decimal part of floating point numbers on
output (default: 7)
- --minbr_auto filename
- Automatically detects the minimum branch length from the p-distances of
the FASTA file filename.
- --tree_show
- Show an ASCII version of the processed input tree (i.e. after it is rooted
by, potentially cropping, the outgroup).
Maximum-likelihood estimations:
Estimating the maximum-likelihood delimitation is
triggered by the switch --ml followed by --single (the PTP model) or --ml
--multi (the mPTP model). Note that these two methods affect how options
--output_file behaves and can be controlled using the --min_br switch. Both
methods require a rooted phylogenetic tree, however an unrooted tree may be
specified in conjuction with the option --outgroup. In this case,
mptp
roots it at that outgroup (see General options, --outgroup for more info).
Note that both methods output an SVG depiction of the ML delimitation. See
Visualization for more information on adjusting and fine-tuning the SVG
output.
Both methods ignore discard branch lengths of size smaller than
the size specified using the --min_br option. The PTP model then attempts to
find a connected subgraph of the rooted tree that (a) contains the root, and
(b) the sum of likelihoods of fitting the edges of that subgraph in one
exponential distribution and the remaining edges in another (exponential
distribution) is maximized. With likelihood we mean the sums of the
probability density function with the mean defined as the reciprocal of the
average of edge lengths in the particular distribution.
- --ml --single
- Triggers the algorithm for computing an ML estimate of the delimitation
using the PTP model.
- --ml --multi
- Triggers the algorithm for computing an ML estimate of the delimitation
using the mPTP model.
- --pvalue
real
- Only used with the PTP model (specified with --single). Sets the p-value
for performing a likelihood ratio test. Note that, there is no likelihood
ratio test for the mPTP model this test is not done. (default: 0.001)
MCMC method:
The MCMC method is triggered with the --mcmc switch
combined with either --single (the PTP model) or --multi (the mPTP model).
Some more stuff to write
- --mcmc positive
integer --single
- Triggers the algorithm for computing support values by taking the
specified number of MCMC samples (delimitations) using the PTP model.
- --mcmc positive
integer --multi
- Triggers the algorithm for computing support values by taking the
specified number of MCMC samples (delimitations) using the mPTP
model.
- --mcmc_sample
positive integer
- Sample only every n-th MCMC step.
- --mcmc_log
- Log the scores (log-likelihood) for each MCMC sample in a file and create
an SVG plot.
- --mcmc_burnin
positive integer
- Ignore all MCMC samples generated before the specified step. (default:
1)
- --mcmc_runs
positive integer
- Perform multiple MCMC runs. If more than 1 run is specified, mptp will
generate one seed for each run based on the provided seed using the --seed
switch. Output files will be generated for each run (default: 1)
- --mcmc_credible
real
- Specify the probability (0.0 to 1.0) for which to generate the credible
interval i.e., the probability the true number of species will fall within
the credible interval given the observed data. (default: 0.95)
- --mcmc_startnull
- Start MCMC sampling from the null-model.
- --mcmc_startrandom
- Start MCMC sampling from a random delimitation.
- --mcmc_startrandom
- Start MCMC sampling from the ML delimitation.
- --seed positive
integer
- Specifies the seed for the pseudo-random number generator. (default:
randomly generated based on system time)
SVG Output:
The ML method generates one SVG file that visualizes the
processed input tree (i.e. after it is rooted by, potentially cropping, the
outgroup) and marks the subtrees corresponding to coalescent processes (the
detected species groups) with red color, while the speciation process is
colored green.
The MCMC method generates one SVG file per run visualizing the
processed tree, and indicates the support value for each node, i.e., the
percentage of MCMC samples (delimitations) in which the particular node was
part of the speciation process. A value of 1 means it was always in the
speciation process while a value of 0 means it was always in a coalescent
process. The tree branches are colored according to the support values of
descendant nodes; a support of value of 0 is colored with red, 1 with black,
and values in between are gradients of the two colors. Only support values
above 0.5 are shown to avoid packed numbers in dense branching events. In
addition, if --mcmc_log is specified, an additional SVG image of
log-likelihoods plots for each sampled delimitation is created.
- --svg_width
positive integer
- Sets the total width (including margins) of the SVG in pixels. (default:
1920)
- --svg_fontsize
positive integer
- Size of font in SVG image. (default: 12)
- --svg_tipspacing
positive integer
- Vertical space in pixels between taxa in SVG tree. (default: 20)
- --svg_legend_ratio
real
- Ratio (value between 0.0 and 1.0) of total tree length to be displayed as
legend line. (default: 0.1)
- --svg_nolengend
- Hide legend.
- --svg_marginleft
positive integer
- Left margin in pixels. (default: 20)
- --svg_marginright
positive integer
- Right margin in pixels. (default: 20)
- --svg_margintop
positive integer
- Top margin in pixels. (default: 20)
- --svg_marginbottom
positive integer
- Top margin in pixels. (default: 20)
- --svg_inner_radius
positive integer
- Radius of inner nodes in pixels. (default: 0)
Compute the maximum likelihood estimate using the mPTP model by
discarding all branches with length below or equal to 0.0001
mptp --ml --multi --min_br 0.0001 --tree_file
newick.txt --output_file out
Run an MCMC analysis of 100 million steps with the mPTP model,
that logs every one million-th step, ignores the first 2 million steps and
discards all branches with lengths smaller or equal to 0.0001. Use 777 as
seed. The chain will start from the ML delimitation (default).
mptp --mcmc 100000000 --multi --min_br 0.0001
--tree_file newick.txt --output_file out --mcmc_log 1000000
--mcmc_burnin 2000000 -seed 777
Perform an MCMC analysis of 5 runs, each of 100 million steps with
the mPTP model, log every one million-th step, ignore the first 2 million
steps, and detect the minimum branch length by specifying the FASTA file
alignment.fa that contains the alignment. Use 777 as seed. Start each run
from a random delimitation.
mptp --mcmc 100000000 --multi ---mcmc_runs 5
--mcmc_log 1000000 --minbr_auto alignment.fa --tree_file
newick.txt --output_file out --mcmc_burnin 2000000 -seed 777
--mcmc_startrandom
Implementation by Tomas Flouri, Sarah Lutteropp and Paschalia
Kapli. Additional PTP and mPTP model authors include Kassian Kobert, Jiajie
Zhang, Pavlos Pavlidis, and Alexandros Stamatakis.
Submit suggestions and bug-reports at
<https://github.com/Pas-Kapli/mptp/issues>, or e-mail Tomas Flouri
<Tomas.Flouri@h-its.org>.
Source code and binaries are available at
<https://github.com/Pas-Kapli/mptp>.
Copyright (C) 2015-2017, Tomas Flouri, Sarah Lutteropp, Paschalia
Kapli
All rights reserved.
Contact: Tomas Flouri <Tomas.Flouri@h-its.org>, Scientific
Computing, Heidelberg Insititute for Theoretical Studies, 69118 Heidelberg,
Germany
This software is licensed under the terms of the GNU Affero
General Public License version 3.
GNU Affero General Public License version 3
This program is free software: you can redistribute it and/or
modify it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the License,
or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero
General Public License for more details.
You should have received a copy of the GNU Affero General Public
License along with this program. If not, see
<http://www.gnu.org/licenses/>.
New features and important modifications of mptp (short
lived or minor bug releases may not be mentioned):
- v0.1.0 released
June 27th, 2016
- First public release.
- v0.1.1 released
July 15th, 2016
- Bug fix (now LRT test is not printed in output file when using
--multi)
- v.0.2.0 released
September 27th, 2016
- Fixed floating point exception error when constructing random trees,
caused from dividing by zero. Changed allocation from malloc to calloc, as
it caused unititialized variables when converting unrooted trees to rooted
when using the MCMC method. Fixed sample size for the AIC with a
correction for finite sample sizes.
- v.0.2.1 released
October 18th, 2016
- Updated ASV to consider only coalescent roots of ML delimitation. Removed
assertion stopping mptp when using random starting delimitations for the
MCMC method.
- v0.2.2 released
January 31st, 2017
- Fixed regular expressions to allow scientific notation for branch lengths
when parsing trees. Improved the accuracy of ASV score by also taking into
account tips forming coalescent roots. Fixed memory leaks that occur when
parsing incorrectly formatted trees.
- v0.2.3 released
July 25th, 2017
- Replaced hsearch() with custom hashtable. Fixed minor output error
messages.
- v0.2.4 released
May 14th, 2018
- If we do not manage to generate a random starting delimitation with the
wanted number of species (randomly chosen), we use the currently generated
delimitation instead.