tpot - Automated Machine Learning tool
usage: tpot [-h] [-is INPUT_SEPARATOR] [-target TARGET_NAME]
[-mode {classification,regression}] [-o OUTPUT_FILE] [-g GENERATIONS] [-p
POPULATION_SIZE] [-os OFFSPRING_SIZE] [-mr MUTATION_RATE] [-xr
CROSSOVER_RATE] [-scoring SCORING_FN] [-cv NUM_CV_FOLDS] [-sub SUBSAMPLE]
[-njobs NUM_JOBS] [-maxtime MAX_TIME_MINS] [-maxeval MAX_EVAL_MINS] [-s
RANDOM_STATE] [-config CONFIG_FILE] [-template TEMPLATE] [-memory MEMORY]
[-cf CHECKPOINT_FOLDER] [-es EARLY_STOP] [-v {0,1,2,3}] [-log LOG]
[--version] INPUT_FILE
A Python tool that automatically creates and optimizes machine
learning pipelines using genetic programming.
- INPUT_FILE
- Data file to use in the TPOT optimization process. Ensure that the class
label column is labeled as "class".
- -h, --help
- Show this help message and exit.
- -is INPUT_SEPARATOR
- Character used to separate columns in the input file.
- -target TARGET_NAME
- Name of the target column in the input file.
- -mode {classification,regression}
- Whether TPOT is being used for a supervised classification or regression
problem.
- -o OUTPUT_FILE
- File to export the code for the final optimized pipeline.
- -g GENERATIONS
- Number of iterations to run the pipeline optimization process. It must be
a positive number or None. If None, the parameter max_time_mins must be
defined as the runtime limit. Generally, TPOT will work better when you
give it more generations (and therefore time) to optimize the pipeline.
TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE
pipelines in total.
- -p POPULATION_SIZE
- Number of individuals to retain in the GP population every generation.
Generally, TPOT will work better when you give it more individuals (and
therefore time) to optimize the pipeline. TPOT will evaluate
POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.
- -os OFFSPRING_SIZE
- Number of offspring to produce in each GP generation. By
default, OFFSPRING_SIZE = POPULATION_SIZE.
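As a rough way to budget a run, the evaluation count quoted under -g and -p can be computed directly. The sketch below only restates that formula; the function name total_pipelines is hypothetical and not part of TPOT.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Return POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE."""
    if offspring_size is None:
        # Per the -os description, OFFSPRING_SIZE defaults to POPULATION_SIZE.
        offspring_size = population_size
    return population_size + generations * offspring_size


# 50 initial individuals plus 10 generations of 50 offspring each.
print(total_pipelines(population_size=50, generations=10))  # 550
```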
- -mr MUTATION_RATE
- GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how
many pipelines to apply random changes to every generation. We recommend
using the default parameter unless you understand how the mutation rate
affects GP algorithms.
- -xr CROSSOVER_RATE
- GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how
many pipelines to "breed" every generation. We recommend using
the default parameter unless you understand how the crossover rate affects
GP algorithms.
- -scoring SCORING_FN
- Function used to evaluate the quality of a given pipeline for the problem.
By default, accuracy is used for classification problems and mean squared
error (mse) is used for regression problems. Note: If you wrote your own
function, set this argument to mymodule.myfunction and TPOT will import
your module and take the function from there. TPOT will assume the module
can be imported from the current workdir. TPOT assumes that any function
with "error" or "loss" in the name is meant to be
minimized, whereas any other function will be maximized. Offers the same
options as cross_val_score: accuracy, adjusted_rand_score,
average_precision, f1, f1_macro, f1_micro, f1_samples, f1_weighted,
neg_log_loss, neg_mean_absolute_error, neg_mean_squared_error,
neg_median_absolute_error, precision, precision_macro, precision_micro,
precision_samples, precision_weighted, r2, recall, recall_macro,
recall_micro, recall_samples, recall_weighted, roc_auc
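To make the mymodule.myfunction convention concrete, here is a minimal sketch of how such a dotted name could be resolved and how the "error"/"loss" naming rule could be read. Both helpers (resolve_scoring, is_minimized) are hypothetical illustrations, not TPOT's actual loader, and TPOT's real check may differ.

```python
import importlib


def resolve_scoring(dotted_path):
    """Import 'module.function' and return the function object."""
    module_name, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)  # must be importable from the workdir
    return getattr(module, func_name)


def is_minimized(func_name):
    """Naive reading of the rule: names containing 'error' or 'loss' are minimized."""
    return "error" in func_name or "loss" in func_name


# A standard-library function stands in for a user-written scoring function.
sqrt = resolve_scoring("math.sqrt")
print(sqrt(9.0))
print(is_minimized("my_custom_loss"), is_minimized("accuracy"))
```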
- -cv NUM_CV_FOLDS
- Number of folds to evaluate each pipeline over in stratified k-fold
cross-validation during the TPOT optimization process.
- -sub SUBSAMPLE
- Subsample ratio of the training instances. Setting it to 0.5 means that
TPOT will use a random subsample of half of the training data for the
pipeline optimization process.
- -njobs NUM_JOBS
- Number of CPUs for evaluating pipelines in parallel during the TPOT
optimization process. Assigning this to -1 will use as many cores
as available on the computer. For n_jobs below -1, (n_cpus + 1 +
n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are
used.
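The negative-value rule for -njobs can be checked with a few lines of arithmetic. This sketch assumes the formula exactly as stated above; the helper name effective_workers and the clamp to at least one worker are assumptions made for illustration.

```python
def effective_workers(n_jobs, n_cpus):
    """Number of CPUs used for a given n_jobs on a machine with n_cpus cores."""
    if n_jobs == 0:
        raise ValueError("n_jobs must be non-zero")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on (n_cpus + 1 + n_jobs).
        # Clamping to 1 is an assumption for very negative values.
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


for n_jobs in (-1, -2, 4):
    print(n_jobs, effective_workers(n_jobs, n_cpus=8))
```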
- -maxtime MAX_TIME_MINS
- How many minutes TPOT has to optimize the pipeline. If not None, this
setting will allow TPOT to run until max_time_mins minutes have elapsed and
then stop. TPOT will stop earlier if generations is set and all generations
are already evaluated.
- -maxeval MAX_EVAL_MINS
- How many minutes TPOT has to evaluate a single pipeline. Setting this
parameter to higher values will allow TPOT to explore more complex
pipelines but will also allow TPOT to run longer.
- -s RANDOM_STATE
- Random number generator seed for reproducibility. Set this seed if you
want your TPOT run to be reproducible with the same seed and data set in
the future.
- -config CONFIG_FILE
- Configuration file for customizing the operators and parameters that TPOT
uses in the optimization process. Must be a Python module containing a
dict export named "tpot_config" or the name of a built-in
configuration.
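A configuration module for -config might look like the sketch below. It assumes the layout used by TPOT's custom configuration dictionaries, where each key is an operator's import path and each value maps parameter names to lists of candidate values; the operators and value grids chosen here are illustrative only.

```python
# my_tpot_config.py (hypothetical file name)
# The dict must be exported under the name "tpot_config".
tpot_config = {
    # An empty dict means the operator is used with its default parameters.
    "sklearn.naive_bayes.GaussianNB": {},
    # Each parameter maps to the list of values the search may choose from.
    "sklearn.naive_bayes.BernoulliNB": {
        "alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0],
        "fit_prior": [True, False],
    },
}
```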
- -template TEMPLATE
- Template of predefined pipeline structure. The option is for specifying a
desired structure for the machine learning pipeline evaluated in TPOT. So
far this option only supports linear pipeline structures. Each step in the
pipeline should be a main class of operators (Selector, Transformer,
Classifier or Regressor) or a specific operator (e.g. SelectPercentile)
defined in the TPOT operator configuration. If one step is a main class, TPOT
will randomly assign all subclass operators (subclasses of
SelectorMixin, TransformerMixin, ClassifierMixin or RegressorMixin in
scikit-learn) to that step. Steps in the template are delimited by
"-", e.g. "SelectPercentile-Transformer-Classifier". By default, the
template is None and TPOT generates tree-based pipelines randomly.
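The template syntax described under -template is simple enough to illustrate with a short parser. This is only a sketch of how the "-"-delimited string breaks down into steps; parse_template is a hypothetical helper and does not reflect how TPOT itself validates templates.

```python
# The four main operator classes named in the -template description.
MAIN_CLASSES = {"Selector", "Transformer", "Classifier", "Regressor"}


def parse_template(template):
    """Split a template on '-' and label each step."""
    steps = []
    for step in template.split("-"):
        kind = "main class" if step in MAIN_CLASSES else "specific operator"
        steps.append((step, kind))
    return steps


for step, kind in parse_template("SelectPercentile-Transformer-Classifier"):
    print(step, "->", kind)
```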
- -memory MEMORY
- Path of a directory for pipeline caching or "auto" for using a
temporary caching directory during the optimization process. If supplied,
pipelines will cache each transformer after fitting it. This feature is
used to avoid repeated computation by transformers within a pipeline if
the parameters and input data are identical to those of another fitted
pipeline during the optimization process.
- -cf CHECKPOINT_FOLDER
- If supplied, a folder in which TPOT will periodically save the best
pipeline so far while optimizing. This is useful in multiple cases: sudden
death before TPOT could save an optimized pipeline, progress tracking,
grabbing a pipeline while it's still optimizing, etc.
- -es EARLY_STOP
- Number of consecutive generations without improvement that TPOT
tolerates. The optimization process ends if there is no improvement
within the set number of generations.
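The early-stopping rule for -es can be pictured as a counter of generations without improvement. The sketch below is an illustration under stated assumptions: higher scores are better, best_scores is a hypothetical list of the best score seen in each generation, and stop_generation is not a TPOT function.

```python
def stop_generation(best_scores, early_stop):
    """Return the 1-based generation at which the run would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive generations without improvement
    for generation, score in enumerate(best_scores, start=1):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= early_stop:
            return generation
    return None


# The score improves once, then stays flat for three generations.
print(stop_generation([0.80, 0.82, 0.82, 0.82, 0.82], early_stop=3))  # 5
```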
- -v {0,1,2,3}
- How much information TPOT communicates while it is running: 0 = none, 1 =
minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress
bar during the optimization procedure.
- -log LOG
- Save progress content to a file.
- --version
- Show the TPOT version number and exit.
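Putting several of the options together, a typical invocation can be assembled as an argument list. The sketch below only builds and prints the command line; it uses flags documented above, while the file names data.csv and pipeline.py and all option values are hypothetical.

```python
import shlex

args = [
    "tpot",
    "-is", ",",                 # columns are comma-separated
    "-target", "class",         # name of the target column
    "-mode", "classification",
    "-o", "pipeline.py",        # where to export the final pipeline code
    "-g", "5",                  # generations
    "-p", "20",                 # population size
    "-cv", "5",                 # cross-validation folds
    "-s", "42",                 # random seed for reproducibility
    "-v", "2",                  # verbosity with progress bar
    "data.csv",                 # INPUT_FILE comes last
]

# Print a shell-safe version of the command (shlex.join needs Python 3.8+).
print(shlex.join(args))
```

To actually launch the run, the same list could be passed to subprocess.run(args, check=True) on a machine where the tpot command is installed.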