DOKK Library

Training Scikit-learn

Authors Enthought Inc.

License CC-BY-NC-ND-4.0

1   Classification: Predict Categorical Data

Predict the class, or label (t), of a sample based on its features (x). Examples: recognize hand-written
digits, or mark email as spam. In scikit-learn, labels are represented as integers and get expanded
internally into matrices of binary choices between unique integer labels.
Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from
one class than others). Training data has N samples and D features.

Logistic Regression O(ND²)
When to use it: When you need to understand the contributions of features using a method that is fast
to train and easy to interpret.
How it works: Fits an s-shaped function (the logistic function), which is continuous but has a steep
transition between the two classes, and assigns the class based on the sign.
Tips: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver)
• penalty='l1' to use the estimator for feature selection
• solver='liblinear' for small datasets or the L1 penalty
• 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets
• 'sag' for very large datasets
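Example (a minimal sketch on illustrative synthetic data, not from the original card):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative data: 200 samples, 5 features, 2 classes
    X, t = make_classification(n_samples=200, n_features=5, random_state=0)

    # Scale inputs, then fit; class_weight='balanced' handles unbalanced labels
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=1.0, solver='liblinear',
                                             class_weight='balanced'))
    model.fit(X, t)
    print(model.predict(X[:5]))   # predicted integer labels
    print(model.score(X, t))      # mean accuracy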
Decision Tree O(ND log(N))
When to use it: When you need to understand prediction decisions, when data has both continuous and
categorical features, and when no scaling is needed.
How it works: Chains binary decisions on increasingly smaller subsets of the data. Deeper trees have
more complex decision rules and a better fit.

[Illustration: a small example tree for an incoming call. "known number?" yes: ask "is grandma?"
(yes: pick up; no: ignore); no: ignore.]

Tips: Very often overfits. Consider doing dimensionality reduction beforehand. N must double with each
extra level.
Code: tree.DecisionTreeClassifier(max_depth). Start with max_depth=3, then increase. Use
tree.export_graphviz to visualize the tree.
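Example (a minimal sketch; the iris dataset and depth are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, t = load_iris(return_X_y=True)

    # Start shallow, then increase max_depth while cross-validating
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, t)

    # Text rendering of the decision rules (export_graphviz draws a diagram)
    print(export_text(tree))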

Ensemble Methods
When to use them: When no single estimator gives satisfying results.
How they work: Combine the predictions of multiple weak, biased estimators to create a better one.
There are two types: averaging methods build many estimators and average their predictions; in
boosting methods, each new estimator tries to improve the previous one.
Tips: Hard to generate the perfect mix of estimators.
Code: All in the ensemble module.
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
Start with these, but always cross-validate:
• max_features=sqrt(n_features)
• max_depth=None
• min_samples_split=1
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
All:
• Parallelize with n_jobs=-1
• Increasing n_estimators is better, but slower
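Example (a minimal averaging-ensemble sketch; data and settings are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, t = make_classification(n_samples=500, n_features=20, random_state=0)

    # Many randomized trees; their predictions are averaged
    forest = RandomForestClassifier(n_estimators=200,
                                    max_features='sqrt',
                                    max_depth=None,
                                    n_jobs=-1,       # parallelize
                                    random_state=0)
    print(cross_val_score(forest, X, t, cv=5).mean())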

Support Vector Classifier O(ND²) to O(ND³)
When to use it: When you have a large number of features, or slightly more features than samples.
How it works: Maximizes the distance between classes in high-dimensional space, i.e., a "maximum
margin classifier."
Tips: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if there are lots of noisy samples. If accuracy is
important, set kernel='rbf'. If fast training is important, use svm.LinearSVC().
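Example (a minimal sketch with illustrative data and default-ish parameters):

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Many features relative to the number of samples
    X, t = make_classification(n_samples=300, n_features=50, random_state=0)

    # Scale the data, then fit a maximum-margin classifier with an RBF kernel
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
    clf.fit(X, t)
    print(clf.score(X, t))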

Neighbor Classifiers O(D log N) to O(DN)
When to use them: When you have large datasets or a very irregular decision boundary.
How they work: Predict the class by majority vote from nearby data, e.g., from the K nearest samples
(K=3, K=5, ...).
Tips: Efficiency comes at the cost of also having high variance.
Code: neighbors.KNeighborsClassifier(n_neighbors)
• Use RadiusNeighborsClassifier() for unbalanced data and a D that is not too large.
• Try weights='uniform' and 'distance'.
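Example (a minimal sketch; the two-moons data illustrates an irregular boundary):

    from sklearn.datasets import make_moons
    from sklearn.neighbors import KNeighborsClassifier

    # Irregular decision boundary: two interleaving half-moons
    X, t = make_moons(n_samples=400, noise=0.2, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
    knn.fit(X, t)
    print(knn.score(X, t))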




Stochastic Gradient Descent (SGD) Classifier
When to use it: When you have a very large N and D.
How it works: "Online" method, learns the weights in batches.
Tips: Data must be scaled.
Code: linear_model.SGDClassifier(loss, alpha, n_iter) and the partial_fit() method. Use
n_iter=np.ceil(10**6/n_samples). loss='hinge' gives an SVC, 'log' gives logistic regression.
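Example (a minimal batch-learning sketch; data, batch count, and alpha are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    X, t = make_classification(n_samples=10_000, n_features=20, random_state=0)
    X = StandardScaler().fit_transform(X)           # data must be scaled

    clf = SGDClassifier(loss='hinge', alpha=1e-4)   # 'hinge' behaves like a linear SVC
    classes = np.unique(t)
    for batch in np.array_split(np.arange(len(X)), 10):   # learn the weights in batches
        clf.partial_fit(X[batch], t[batch], classes=classes)
    print(clf.score(X, t))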



Performance Metrics in sklearn.metrics
They take the targets, t, and the predicted classes, y, as arguments. There's more than one way to be
wrong: a fire alarm that always goes off is annoying, one that never goes off is costly.

confusion_matrix: Explore how the model confuses classes. Visualize with seaborn.heatmap.
accuracy_score (default for model.score): Fraction correctly predicted. Meaningless if samples are
unbalanced. (TP + TN) / Total
recall_score: Fraction of actual fires that are predicted as fire. TP / (TP + FN)
precision_score: Fraction of predicted fires that are actually fire. TP / (TP + FP)
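Example (a minimal sketch on made-up labels, with 1 meaning "fire"):

    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 recall_score, precision_score)

    t = [1, 1, 1, 0, 0, 0, 0, 0]     # targets
    y = [1, 1, 0, 0, 0, 0, 1, 0]     # predictions

    print(confusion_matrix(t, y))    # rows: true class, columns: predicted class
    print(accuracy_score(t, y))      # (TP + TN) / Total -> 0.75
    print(recall_score(t, y))        # TP / (TP + FN)    -> 0.67
    print(precision_score(t, y))     # TP / (TP + FP)    -> 0.67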




2   Clustering: Unsupervised Learning

Predict the underlying structure in features, without the use of targets or labels. Split samples into
groups called "clusters." With no targets, models are trained by minimizing some definition of
"distance" within a cluster. Data has N samples, D features, and the model discovers k clusters.
Models can be used for prediction or for transformation, by reducing D features into one feature with
k unique values.

Some models expect geometries that are "flat," or roughly spherical. Clusters with complicated shapes
like rings or lines are not flat and will not work in those models.

K-Means O(kN)
When to use it: When you need something that scales well and has a small number of flat clusters. For
large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to the nearest of the k cluster centers, then moves the centers to
minimize the average distance between centers and samples.
Tips: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the centers.
Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters). n_jobs=-1 to parallelize.
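Example (a minimal sketch on illustrative blob data):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Flat, roughly spherical clusters
    X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

    km = KMeans(n_clusters=4, random_state=0)
    labels = km.fit_predict(X)      # cluster index for each sample
    print(km.cluster_centers_)      # learned cluster centers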
Mean Shift O(N log N)
When to use it: When you have non-flat geometries, an unknown number of clusters, and need to
guarantee convergence.
How it works: Finds local maxima of the data density, given a window size.
Tips: Accuracy is strongly tied to selecting the correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to a small value for large datasets.
Estimating it is O(N²) and can be the bottleneck.

Affinity Propagation O(N²)
When to use it: When you have an unknown number of clusters and need to specify your own similarity
metric (affinity argument).
How it works: Finds data points which maximize similarity within a cluster while minimizing similarity
with data outside of the cluster.
Tips: O(N²) memory use. Accuracy tied to damping.
Code: cluster.AffinityPropagation(preference, damping)
• preference: Negative. Controls the number of clusters. Explore on a log scale.
• damping: 0.5 to 1.
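Example (a minimal sketch of both estimators on illustrative blob data):

    from sklearn.cluster import AffinityPropagation, MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Mean Shift: estimating the bandwidth is O(N²), so set it manually on large data
    bw = estimate_bandwidth(X, quantile=0.2)
    ms_labels = MeanShift(bandwidth=bw).fit_predict(X)

    # Affinity Propagation: cluster count is steered by preference and damping
    ap_labels = AffinityPropagation(damping=0.7, random_state=0).fit_predict(X)
    print(len(set(ms_labels)), len(set(ap_labels)))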

DBSCAN O(N²)
When to use it: When you have very non-flat geometries or very uneven clusters.
How it works: Clusters are contiguous areas with high data density. The bounds of the clusters are
found using graph connectivity.
Tips: O(N²) memory use. Not deterministic at cluster boundaries.
Code: cluster.DBSCAN(min_samples, eps, metric). Higher min_samples or lower eps requires a higher
density to form a cluster.
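Example (a minimal sketch; the two-moons data illustrates a non-flat geometry):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5)
    labels = db.fit_predict(X)      # label -1 marks noise samples
    print(set(labels))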
Agglomerative Clustering O(N² log N)
When to use it: When you need a flexible definition of distance (e.g., Levenshtein).
How it works: Defines all observations as unique clusters, then merges the closest ones iteratively.
Tips: Worst time complexity.
Code: cluster.AgglomerativeClustering(linkage, affinity, connectivity). Set the linkage criterion for
merging:
• 'ward': minimize the sum of squared differences. Minimizes variance. Gives the most regular cluster
  sizes.
• 'complete': minimize the maximum distance between sample pairs.
• 'average': minimize the average distance between all sample pairs. Yields uneven cluster sizes.
affinity: defines the type of distance. 'l1' for sparse features, e.g., text; 'cosine' is invariant to
scaling.
connectivity: provides extra constraints about which nodes can be merged, e.g.,
neighbors.kneighbors_graph.
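Example (a minimal sketch with an optional connectivity constraint; data is illustrative):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import kneighbors_graph

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Only points that are graph neighbors may be merged
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    agg = AgglomerativeClustering(n_clusters=3, linkage='ward',
                                  connectivity=connectivity)
    labels = agg.fit_predict(X)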




BIRCH O(kN)
When to use it: When you have a large number of observations and a small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of the raw
data.
Tips: Performs poorly with a large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)
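Example (a minimal sketch; sample size and parameters are illustrative):

    from sklearn.cluster import Birch
    from sklearn.datasets import make_blobs

    # Many observations, few features
    X, _ = make_blobs(n_samples=20_000, centers=5, random_state=0)

    birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
    labels = birch.fit_predict(X)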
Performance Metrics in sklearn.metrics
The metrics do not take into account the exact class values, only their separation. The score is based
on ground truth (targets), if available, or on a measure of similarity within a cluster and difference
across clusters.

Needs ground truth:
• adjusted_rand_score: -1 to 1 (best). 0 is random classes. Measures similarity. Related to accuracy
  (% correct).
• adjusted_mutual_info_score: 0 to 1 (best). 0 is random classes. 10x slower than adjusted_rand_score.
  Measures agreement.
• homogeneity_completeness_v_measure: 0 to 1 (best). homogeneity: each cluster only contains members
  of one class; completeness: all members of a class are in the same cluster; v_measure_score: the
  harmonic mean of both. Not normalized for random labeling.

Doesn't need ground truth:
• silhouette_score: -1 to 1 (best). 0 means overlapping clusters. Based on the distance to samples in
  the same cluster and the distance to the next nearest cluster.
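Example (a minimal sketch comparing one metric of each kind; data is illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, truth = make_blobs(n_samples=500, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

    print(adjusted_rand_score(truth, labels))   # needs ground truth
    print(silhouette_score(X, labels))          # needs only the data and the labels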




3   Regression: Predict Continuous Data

Predict how a dependent variable (output, t) changes when any of the independent variables (inputs, or
features, x) change. For example, how house prices change as a function of neighborhood and size, or
how time spent on a web page varies as a function of the number of ads and content type. Training data
has N samples and D features.

Linear Model O(ND²)
Solves problems of the form:
    y = w0 + w1·x1 + ... + wD·xD
with predicted value y, features x, and fitted weights w. Solved by minimizing the "least squares
error", ED, the sum of the squared differences between predictions y and targets t over the training
samples.
On fitted models, access w as model.coef_ and w0 as model.intercept_.
Tips: Features must be uncorrelated; use decomposition.PCA().
Code: linear_model.LinearRegression() if less than 100,000 samples, or see SGD.
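Example (a minimal sketch; the synthetic data and true weights are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Illustrative data: t = 2·x1 - x2 + 3 + noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    t = 2 * X[:, 0] - X[:, 1] + 3 + rng.normal(scale=0.1, size=200)

    model = LinearRegression().fit(X, t)
    print(model.coef_)        # fitted weights w, close to [2, -1]
    print(model.intercept_)   # w0, close to 3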
Ridge O(ND²)
When to use it: When you have less than 100,000 samples or noisy outputs.
How it works: Linear model that limits the size of the weights. Prevents overfitting by increasing
bias. Minimizes E instead of ED, where the second term is called the "L2 norm" (the sum of the squared
weights, scaled by alpha):
    E = ED + alpha · Σ w²
Code: linear_model.Ridge(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if there are
noisy samples.

Lasso O(ND²)
When to use it: When you have less than 100,000 samples, and only some features should be important.
How it works: Linear model that forces small weights to be zero. Minimizes E instead of ED, where the
second term is called the "L1 norm" (the sum of the absolute weights, scaled by alpha):
    E = ED + alpha · Σ |w|
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features with
non-zero weights.
Code: linear_model.Lasso(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if there are
noisy samples.

Ridge vs. Lasso — Shape of Ew
With Ridge and Lasso, the error to minimize, E, has an extra component Ew (the L2 or L1 penalty
above). Lasso produces sparse models because small weights are forced to zero.
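Example (a minimal sketch contrasting the two penalties; data and alpha values are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    t = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)   # only the first feature matters

    ridge = Ridge(alpha=1.0).fit(X, t)    # shrinks all weights
    lasso = Lasso(alpha=0.1).fit(X, t)    # forces small weights to exactly zero
    print(ridge.coef_.round(2))
    print(lasso.coef_.round(2))           # sparse: most entries are 0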


Nonlinear Transformations
When to use them: When a "straight line" is not sufficient, like predicting temperature as a function
of time of day.
How it works: "Reword" a nonlinear model in linear terms using nonlinear basis functions, φj(x), so we
can use the linear model machinery to solve nonlinear problems. The linear model becomes:
    y = w0 + w1·φ1(x) + ... + wP·φP(x)
Polynomial Expansion of Order P: A 2nd order polynomial two-feature model becomes a model with these
six basis functions: 1, x1, x2, x1·x2, x1², x2².
Tips: The same feature affects many different coefficients, so an outlier can have a big global
effect. The number of basis functions grows very quickly, O((P+1)(D+1)).
Code: poly = preprocessing.PolynomialFeatures(degree)
      x_poly = poly.fit_transform(x)
Radial Basis Functions (RBF): Local, Gaussian-shaped functions, defined by centers and a width. Turns
one feature into P features.
Code: metrics.pairwise.rbf_kernel(x, centers, gamma)
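Example (a minimal polynomial-expansion sketch; the temperature-like data is illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Nonlinear target, e.g., temperature as a function of time of day
    rng = np.random.default_rng(0)
    x = np.linspace(0, 24, 200).reshape(-1, 1)
    t = 15 + 8 * np.sin((x.ravel() - 9) * np.pi / 12) + rng.normal(scale=0.5, size=200)

    # Expand x into polynomial basis functions, then fit a linear model on them
    model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
    model.fit(x, t)
    print(model.score(x, t))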




Support Vector Regressor ~O(N²D)
When to use it: When you have many important features, more features than samples, or a nonlinear
problem.
How it works: Finds a function such that training points fit within a "tube" of acceptable error,
with some tolerance towards points that are outside the tube.
Tips: Must scale inputs; see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1). Make C smaller if there are lots of noisy observations
(C = 1/α, so a small C means more regularization). If LinearSVR doesn't work, use
svm.SVR(kernel='rbf', gamma).
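Example (a minimal sketch; the data is illustrative and LinearSVR may warn about convergence):

    from sklearn.datasets import make_regression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR, LinearSVR

    # More features than is comfortable for plain least squares
    X, t = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

    # Start with the linear variant; always scale inputs first
    linear = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.1, C=1.0))
    linear.fit(X, t)

    # If that underfits, move to the kernelized version
    rbf = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, gamma='scale'))
    rbf.fit(X, t)
    print(linear.score(X, t), rbf.score(X, t))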

Stochastic Gradient Descent (SGD) Regressor
When to use it: When the fit is too slow with other estimators.
How it works: "Online" method, learns the weights in batches, with a subset of the data each time.
Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and the partial_fit() method.
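Example (a minimal batch-learning sketch; sample size and batch count are illustrative):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler

    X, t = make_regression(n_samples=100_000, n_features=20, noise=1.0, random_state=0)
    X = StandardScaler().fit_transform(X)          # data must be scaled

    reg = SGDRegressor()
    for batch in np.array_split(np.arange(len(X)), 100):   # learn the weights in batches
        reg.partial_fit(X[batch], t[batch])
    print(reg.score(X, t))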
Performance Metrics in sklearn.metrics
mean_squared_error: Smaller is better. Puts a large weight on outliers.
r2_score: Coefficient of determination. Best score is 1.0. Proportion of explained variance. Default
for model.score(x, t).
mean_absolute_error: Smaller is better. Uses the same scale as the data.
median_absolute_error: Robust to outliers.
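Example (a minimal sketch on made-up targets and predictions, with one outlier):

    from sklearn.metrics import (mean_squared_error, r2_score,
                                 mean_absolute_error, median_absolute_error)

    t = [3.0, 5.0, 7.5, 10.0]     # targets
    y = [2.8, 5.3, 7.0, 14.0]     # predictions; the last one is an outlier

    print(mean_squared_error(t, y))      # dominated by the outlier
    print(mean_absolute_error(t, y))     # same scale as the data
    print(median_absolute_error(t, y))   # robust to the outlier
    print(r2_score(t, y))                # 1.0 is a perfect fit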




Take your machine learning skills to the next level!
Register at enthought.com/python-for-data-science-training
©2022 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit creativecommons.org/licenses/by-nc-nd/4.0/
www.enthought.com