DOKK Library

Training Scikit-learn

Authors Enthought Inc.

License CC-BY-NC-ND-4.0

1   Classification: Predict Categorical Data

Predict the class, or label (t), of a sample based on its features (x). Examples: recognize hand-written
digits, or mark email as spam. In scikit-learn, labels are represented as integers and get expanded
internally into matrices of binary choices between unique integer labels.
Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from
one class than others). Training data has N samples and D features.

Logistic Regression O(ND²)
When to use it: When you need to understand the contributions of features using a method that is fast
to train and easy to interpret.
How it works: Fits an s-shaped function (the logistic function), which is continuous but has a steep
transition between the two classes, and assigns the class based on the sign.
Tips: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver)
• penalty='l1' to use the estimator for feature selection
• solver='liblinear' for small datasets or the L1 penalty
• 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets
• 'sag' for very large datasets
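Example (a minimal sketch on illustrative synthetic data, not from the original card):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative data: 200 samples, 5 features, 2 classes
    X, t = make_classification(n_samples=200, n_features=5, random_state=0)

    # Scale inputs, then fit; class_weight='balanced' handles unbalanced labels
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=1.0, solver='liblinear',
                                             class_weight='balanced'))
    model.fit(X, t)
    print(model.predict(X[:5]))   # predicted integer labels
    print(model.score(X, t))      # mean accuracy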
Decision Tree O(ND log(N))
When to use it: When you need to understand prediction decisions, when data has both continuous and
categorical features, and when no scaling is needed.
How it works: Chains binary decisions on increasingly smaller subsets of the data. Deeper trees have
more complex decision rules and a better fit.

[Illustration: a small example tree for an incoming call. "known number?" yes: ask "is grandma?"
(yes: pick up; no: ignore); no: ignore.]

Tips: Very often overfits. Consider doing dimensionality reduction beforehand. N must double with each
extra level.
Code: tree.DecisionTreeClassifier(max_depth). Start with max_depth=3, then increase. Use
tree.export_graphviz to visualize the tree.
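Example (a minimal sketch; the iris dataset and depth are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, t = load_iris(return_X_y=True)

    # Start shallow, then increase max_depth while cross-validating
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, t)

    # Text rendering of the decision rules (export_graphviz draws a diagram)
    print(export_text(tree))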

Ensemble Methods
When to use them: When no single estimator gives satisfying results.
How they work: Combine the predictions of multiple weak, biased estimators to create a better one.
There are two types: averaging methods build many estimators and average their predictions; in
boosting methods, each new estimator tries to improve the previous one.
Tips: Hard to generate the perfect mix of estimators.
Code: All in the ensemble module.
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
Start with these, but always cross-validate:
• max_features=sqrt(n_features)
• max_depth=None
• min_samples_split=1
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
All:
• Parallelize with n_jobs=-1
• Increasing n_estimators is better, but slower
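Example (a minimal averaging-ensemble sketch; data and settings are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, t = make_classification(n_samples=500, n_features=20, random_state=0)

    # Many randomized trees; their predictions are averaged
    forest = RandomForestClassifier(n_estimators=200,
                                    max_features='sqrt',
                                    max_depth=None,
                                    n_jobs=-1,       # parallelize
                                    random_state=0)
    print(cross_val_score(forest, X, t, cv=5).mean())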

Support Vector Classifier O(ND²) to O(ND³)
When to use it: When you have a large number of features, or slightly more features than samples.
How it works: Maximizes the distance between classes in high-dimensional space, i.e., a "maximum
margin classifier."
Tips: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if there are lots of noisy samples. If accuracy is
important, set kernel='rbf'. If fast training is important, use svm.LinearSVC().
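Example (a minimal sketch with illustrative data and default-ish parameters):

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Many features relative to the number of samples
    X, t = make_classification(n_samples=300, n_features=50, random_state=0)

    # Scale the data, then fit a maximum-margin classifier with an RBF kernel
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
    clf.fit(X, t)
    print(clf.score(X, t))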

Neighbor Classifiers O(D log N) to O(DN)
When to use them: When you have large datasets or a very irregular decision boundary.
How they work: Predict the class by majority vote from nearby data, e.g., from the K nearest samples
(K=3, K=5, ...).
Tips: Efficiency comes at the cost of also having high variance.
Code: neighbors.KNeighborsClassifier(n_neighbors)
• Use RadiusNeighborsClassifier() for unbalanced data and a D that is not too large.
• Try weights='uniform' and 'distance'.
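Example (a minimal sketch; the two-moons data illustrates an irregular boundary):

    from sklearn.datasets import make_moons
    from sklearn.neighbors import KNeighborsClassifier

    # Irregular decision boundary: two interleaving half-moons
    X, t = make_moons(n_samples=400, noise=0.2, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
    knn.fit(X, t)
    print(knn.score(X, t))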




Stochastic Gradient Descent (SGD) Classifier
When to use it: When you have a very large N and D.
How it works: "Online" method, learns the weights in batches.
Tips: Data must be scaled.
Code: linear_model.SGDClassifier(loss, alpha, n_iter) and the partial_fit() method. Use
n_iter=np.ceil(10**6/n_samples). loss='hinge' gives an SVC, 'log' gives logistic regression.
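Example (a minimal batch-learning sketch; data, batch count, and alpha are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    X, t = make_classification(n_samples=10_000, n_features=20, random_state=0)
    X = StandardScaler().fit_transform(X)           # data must be scaled

    clf = SGDClassifier(loss='hinge', alpha=1e-4)   # 'hinge' behaves like a linear SVC
    classes = np.unique(t)
    for batch in np.array_split(np.arange(len(X)), 10):   # learn the weights in batches
        clf.partial_fit(X[batch], t[batch], classes=classes)
    print(clf.score(X, t))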



Performance Metrics in sklearn.metrics
They take the targets, t, and the predicted classes, y, as arguments. There's more than one way to be
wrong: a fire alarm that always goes off is annoying, one that never goes off is costly.

confusion_matrix: Explore how the model confuses classes. Visualize with seaborn.heatmap.
accuracy_score (default for model.score): Fraction correctly predicted. Meaningless if samples are
unbalanced. (TP + TN) / Total
recall_score: Fraction of actual fires that are predicted as fire. TP / (TP + FN)
precision_score: Fraction of predicted fires that are actually fire. TP / (TP + FP)
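Example (a minimal sketch on made-up labels, with 1 meaning "fire"):

    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 recall_score, precision_score)

    t = [1, 1, 1, 0, 0, 0, 0, 0]     # targets
    y = [1, 1, 0, 0, 0, 0, 1, 0]     # predictions

    print(confusion_matrix(t, y))    # rows: true class, columns: predicted class
    print(accuracy_score(t, y))      # (TP + TN) / Total -> 0.75
    print(recall_score(t, y))        # TP / (TP + FN)    -> 0.67
    print(precision_score(t, y))     # TP / (TP + FP)    -> 0.67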




2   Clustering: Unsupervised Learning

Predict the underlying structure in features, without the use of targets or labels. Split samples into
groups called "clusters." With no targets, models are trained by minimizing some definition of
"distance" within a cluster. Data has N samples, D features, and the model discovers k clusters.
Models can be used for prediction or for transformation, by reducing D features into one feature with
k unique values.

Some models expect geometries that are "flat," or roughly spherical. Clusters with complicated shapes
like rings or lines are not flat and will not work in those models.

K-Means O(kN)
When to use it: When you need something that scales well and has a small number of flat clusters. For
large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to the nearest of the k cluster centers, then moves the centers to
minimize the average distance between centers and samples.
Tips: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the centers.
Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters). n_jobs=-1 to parallelize.
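Example (a minimal sketch on illustrative blob data):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Flat, roughly spherical clusters
    X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

    km = KMeans(n_clusters=4, random_state=0)
    labels = km.fit_predict(X)      # cluster index for each sample
    print(km.cluster_centers_)      # learned cluster centers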
Mean Shift O(N log N)
When to use it: When you have non-flat geometries, an unknown number of clusters, and need to
guarantee convergence.
How it works: Finds local maxima of the data density, given a window size.
Tips: Accuracy is strongly tied to selecting the correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to a small value for large datasets.
Estimating it is O(N²) and can be the bottleneck.

Affinity Propagation O(N²)
When to use it: When you have an unknown number of clusters and need to specify your own similarity
metric (affinity argument).
How it works: Finds data points which maximize similarity within a cluster while minimizing similarity
with data outside of the cluster.
Tips: O(N²) memory use. Accuracy tied to damping.
Code: cluster.AffinityPropagation(preference, damping)
• preference: Negative. Controls the number of clusters. Explore on a log scale.
• damping: 0.5 to 1.
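Example (a minimal sketch of both estimators on illustrative blob data):

    from sklearn.cluster import AffinityPropagation, MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Mean Shift: estimating the bandwidth is O(N²), so set it manually on large data
    bw = estimate_bandwidth(X, quantile=0.2)
    ms_labels = MeanShift(bandwidth=bw).fit_predict(X)

    # Affinity Propagation: cluster count is steered by preference and damping
    ap_labels = AffinityPropagation(damping=0.7, random_state=0).fit_predict(X)
    print(len(set(ms_labels)), len(set(ap_labels)))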

DBSCAN O(N²)
When to use it: When you have very non-flat geometries or very uneven clusters.
How it works: Clusters are contiguous areas with high data density. The bounds of the clusters are
found using graph connectivity.
Tips: O(N²) memory use. Not deterministic at cluster boundaries.
Code: cluster.DBSCAN(min_samples, eps, metric). Higher min_samples or lower eps requires a higher
density to form a cluster.
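Example (a minimal sketch; the two-moons data illustrates a non-flat geometry):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5)
    labels = db.fit_predict(X)      # label -1 marks noise samples
    print(set(labels))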
Agglomerative Clustering O(N² log N)
When to use it: When you need a flexible definition of distance (e.g., Levenshtein).
How it works: Defines all observations as unique clusters, then merges the closest ones iteratively.
Tips: Worst time complexity.
Code: cluster.AgglomerativeClustering(linkage, affinity, connectivity). Set the linkage criterion for
merging:
• 'ward': minimize the sum of squared differences. Minimizes variance. Gives the most regular cluster
  sizes.
• 'complete': minimize the maximum distance between sample pairs.
• 'average': minimize the average distance between all sample pairs. Yields uneven cluster sizes.
affinity: defines the type of distance. 'l1' for sparse features, e.g., text; 'cosine' is invariant to
scaling.
connectivity: provides extra constraints about which nodes can be merged, e.g.,
neighbors.kneighbors_graph.
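Example (a minimal sketch with an optional connectivity constraint; data is illustrative):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import kneighbors_graph

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Only points that are graph neighbors may be merged
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    agg = AgglomerativeClustering(n_clusters=3, linkage='ward',
                                  connectivity=connectivity)
    labels = agg.fit_predict(X)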




BIRCH O(kN)
When to use it: When you have a large number of observations and a small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of the raw
data.
Tips: Performs poorly with a large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)
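Example (a minimal sketch; sample size and parameters are illustrative):

    from sklearn.cluster import Birch
    from sklearn.datasets import make_blobs

    # Many observations, few features
    X, _ = make_blobs(n_samples=20_000, centers=5, random_state=0)

    birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
    labels = birch.fit_predict(X)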
Performance Metrics in sklearn.metrics
The metrics do not take into account the exact class values, only their separation. The score is based
on ground truth (targets), if available, or on a measure of similarity within a cluster and difference
across clusters.

Needs ground truth:
• adjusted_rand_score: -1 to 1 (best). 0 is random classes. Measures similarity. Related to accuracy
  (% correct).
• adjusted_mutual_info_score: 0 to 1 (best). 0 is random classes. 10x slower than adjusted_rand_score.
  Measures agreement.
• homogeneity_completeness_v_measure: 0 to 1 (best). homogeneity: each cluster only contains members
  of one class; completeness: all members of a class are in the same cluster; v_measure_score: the
  harmonic mean of both. Not normalized for random labeling.

Doesn't need ground truth:
• silhouette_score: -1 to 1 (best). 0 means overlapping clusters. Based on the distance to samples in
  the same cluster and the distance to the next nearest cluster.
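Example (a minimal sketch comparing one metric of each kind; data is illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, truth = make_blobs(n_samples=500, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

    print(adjusted_rand_score(truth, labels))   # needs ground truth
    print(silhouette_score(X, labels))          # needs only the data and the labels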




3   Regression: Predict Continuous Data

Predict how a dependent variable (output, t) changes when any of the independent variables (inputs, or
features, x) change. For example, how house prices change as a function of neighborhood and size, or
how time spent on a web page varies as a function of the number of ads and content type. Training data
has N samples and D features.

Linear Model O(ND²)
Solves problems of the form:
    y = w0 + w1·x1 + ... + wD·xD
with predicted value y, features x, and fitted weights w. Solved by minimizing the "least squares
error", ED, the sum of the squared differences between predictions y and targets t over the training
samples.
On fitted models, access w as model.coef_ and w0 as model.intercept_.
Tips: Features must be uncorrelated; use decomposition.PCA().
Code: linear_model.LinearRegression() if less than 100,000 samples, or see SGD.
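Example (a minimal sketch; the synthetic data and true weights are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Illustrative data: t = 2·x1 - x2 + 3 + noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    t = 2 * X[:, 0] - X[:, 1] + 3 + rng.normal(scale=0.1, size=200)

    model = LinearRegression().fit(X, t)
    print(model.coef_)        # fitted weights w, close to [2, -1]
    print(model.intercept_)   # w0, close to 3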
Ridge O(ND²)
When to use it: When you have less than 100,000 samples or noisy outputs.
How it works: Linear model that limits the size of the weights. Prevents overfitting by increasing
bias. Minimizes E instead of ED, where the second term is called the "L2 norm" (the sum of the squared
weights, scaled by alpha):
    E = ED + alpha · Σ w²
Code: linear_model.Ridge(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if there are
noisy samples.

Lasso O(ND²)
When to use it: When you have less than 100,000 samples, and only some features should be important.
How it works: Linear model that forces small weights to be zero. Minimizes E instead of ED, where the
second term is called the "L1 norm" (the sum of the absolute weights, scaled by alpha):
    E = ED + alpha · Σ |w|
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features with
non-zero weights.
Code: linear_model.Lasso(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if there are
noisy samples.

Ridge vs. Lasso — Shape of Ew
With Ridge and Lasso, the error to minimize, E, has an extra component Ew (the L2 or L1 penalty
above). Lasso produces sparse models because small weights are forced to zero.
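Example (a minimal sketch contrasting the two penalties; data and alpha values are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    t = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)   # only the first feature matters

    ridge = Ridge(alpha=1.0).fit(X, t)    # shrinks all weights
    lasso = Lasso(alpha=0.1).fit(X, t)    # forces small weights to exactly zero
    print(ridge.coef_.round(2))
    print(lasso.coef_.round(2))           # sparse: most entries are 0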


Nonlinear Transformations
When to use them: When a "straight line" is not sufficient, like predicting temperature as a function
of time of day.
How it works: "Reword" a nonlinear model in linear terms using nonlinear basis functions, φj(x), so we
can use the linear model machinery to solve nonlinear problems. The linear model becomes:
    y = w0 + w1·φ1(x) + ... + wP·φP(x)
Polynomial Expansion of Order P: A 2nd order polynomial two-feature model becomes a model with these
six basis functions: 1, x1, x2, x1·x2, x1², x2².
Tips: The same feature affects many different coefficients, so an outlier can have a big global
effect. The number of basis functions grows very quickly, O((P+1)(D+1)).
Code: poly = preprocessing.PolynomialFeatures(degree)
      x_poly = poly.fit_transform(x)
Radial Basis Functions (RBF): Local, Gaussian-shaped functions, defined by centers and a width. Turns
one feature into P features.
Code: metrics.pairwise.rbf_kernel(x, centers, gamma)
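Example (a minimal polynomial-expansion sketch; the temperature-like data is illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Nonlinear target, e.g., temperature as a function of time of day
    rng = np.random.default_rng(0)
    x = np.linspace(0, 24, 200).reshape(-1, 1)
    t = 15 + 8 * np.sin((x.ravel() - 9) * np.pi / 12) + rng.normal(scale=0.5, size=200)

    # Expand x into polynomial basis functions, then fit a linear model on them
    model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
    model.fit(x, t)
    print(model.score(x, t))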




Support Vector Regressor ~O(N²D)
When to use it: When you have many important features, more features than samples, or a nonlinear
problem.
How it works: Finds a function such that training points fit within a "tube" of acceptable error,
with some tolerance towards points that are outside the tube.
Tips: Must scale inputs; see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1). Make C smaller if there are lots of noisy observations
(C = 1/α, so a small C means more regularization). If LinearSVR doesn't work, use
svm.SVR(kernel='rbf', gamma).
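Example (a minimal sketch; the data is illustrative and LinearSVR may warn about convergence):

    from sklearn.datasets import make_regression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR, LinearSVR

    # More features than is comfortable for plain least squares
    X, t = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

    # Start with the linear variant; always scale inputs first
    linear = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.1, C=1.0))
    linear.fit(X, t)

    # If that underfits, move to the kernelized version
    rbf = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, gamma='scale'))
    rbf.fit(X, t)
    print(linear.score(X, t), rbf.score(X, t))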

Stochastic Gradient Descent (SGD) Regressor
When to use it: When the fit is too slow with other estimators.
How it works: "Online" method, learns the weights in batches, with a subset of the data each time.
Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and the partial_fit() method.
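Example (a minimal batch-learning sketch; sample size and batch count are illustrative):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler

    X, t = make_regression(n_samples=100_000, n_features=20, noise=1.0, random_state=0)
    X = StandardScaler().fit_transform(X)          # data must be scaled

    reg = SGDRegressor()
    for batch in np.array_split(np.arange(len(X)), 100):   # learn the weights in batches
        reg.partial_fit(X[batch], t[batch])
    print(reg.score(X, t))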
Performance Metrics in sklearn.metrics
mean_squared_error: Smaller is better. Puts a large weight on outliers.
r2_score: Coefficient of determination. Best score is 1.0. Proportion of explained variance. Default
for model.score(x, t).
mean_absolute_error: Smaller is better. Uses the same scale as the data.
median_absolute_error: Robust to outliers.
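Example (a minimal sketch on made-up targets and predictions, with one outlier):

    from sklearn.metrics import (mean_squared_error, r2_score,
                                 mean_absolute_error, median_absolute_error)

    t = [3.0, 5.0, 7.5, 10.0]     # targets
    y = [2.8, 5.3, 7.0, 14.0]     # predictions; the last one is an outlier

    print(mean_squared_error(t, y))      # dominated by the outlier
    print(mean_absolute_error(t, y))     # same scale as the data
    print(median_absolute_error(t, y))   # robust to the outlier
    print(r2_score(t, y))                # 1.0 is a perfect fit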




Take your machine learning skills to the next level!
Register at enthought.com/python-for-data-science-training
©2022 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit creativecommons.org/licenses/by-nc-nd/4.0/
www.enthought.com