SciKit Cheat Sheets

Authors Enthought Inc.

1 Classification:
Predict Categorical Data
Predict the class, or label (t), of a sample based on its features (x). Examples: Recognize hand-written digits, or mark email
as spam. In scikit-learn, labels are represented as integers and get expanded internally into matrices of binary choices
between unique integer labels.
Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from one class
than others). Training data has N samples and D features.
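
Example: a minimal sketch of what class_weight='balanced' does. It computes the weights scikit-learn derives for a made-up 90/10 label split and passes the same option to a classifier. The toy labels and features are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical unbalanced labels: 90 samples of class 0, 10 of class 1.
t = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=t)
print(weights)  # class 0 gets about 0.56, class 1 gets 5.0

# Toy features (N=100, D=2): class 1 is shifted away from class 0.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2)) + t[:, None] * 1.5

# The same weighting is applied inside the model when fitting.
model = LogisticRegression(class_weight="balanced").fit(x, t)
```

The rare class gets the larger weight, so its errors count more during training.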

Logistic Regression O(ND^2)
When to use it: When you need to understand the contributions of features using a method that’s
fast to train and easy to interpret.
How it works: Fits an s-shaped function (logistic function), which is continuous but has a steep
transition between the two classes, and assigns class based on sign.
Tips: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver)
• penalty='l1' to use estimator for feature selection
• solver='liblinear' for small datasets or L1 penalty
• 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets
• 'sag' for very large datasets

Decision Tree O(NDlog(N))
When to use it: When you need to understand prediction decisions, when data has both continuous
and categorical features, and when no scaling is needed.
How it works: Chain binary decisions on increasingly smaller subsets of data. Deeper trees have
more complex decision rules and a better fit.
[Figure: example decision tree. Known number? No: ignore. Yes: is grandma? Yes: pick up. No: ignore.]
Tips: Very often overfits. Consider doing dimensionality reduction beforehand. N must double with
each extra level.
Code: tree.DecisionTreeClassifier(max_depth). Start with max_depth=3, then increase. Use
tree.export_graphviz to visualize tree.

Support Vector Classifier O(ND^2) to O(ND^3)
When to use it: When you have a large number of features or slightly more features than samples.
How it works: Maximize distance between classes in high-dimensional space, i.e., “maximum margin
classifier.”
Tips: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if lots of noisy samples. If accuracy is important, set
kernel='rbf'. If fast training is important, use svm.LinearSVC().

Neighbor Classifiers O(DlogN) to O(DN)
When to use them: When you have large datasets or a very irregular decision boundary.
How it works: Predict class by majority vote from nearby data.
Tips: Efficiency comes at the cost of high variance.
Code: neighbors.KNeighborsClassifier(n_neighbors).
• Use RadiusNeighborsClassifier() for unbalanced data and a D that is not too large.
• Try weights='uniform' and 'distance'.

Ensemble Methods
When to use them: When no single estimator gives satisfying results.
How they work: Combine predictions of multiple weak, biased estimators to create a better one.
There are two types. Averaging methods build many estimators and average their predictions.
In boosting methods, each new estimator tries to improve on the previous one.
Tips: Hard to generate the perfect mix of estimators.
Code: All in ensemble module.
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
Start with these, but always cross-validate:
• max_features='sqrt' (square root of n_features)
• max_depth=None
• min_samples_split=2
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
All:
• Parallelize with n_jobs=-1 (averaging estimators only; the two boosting estimators do not
accept n_jobs)
• Increasing n_estimators is better, but slower
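
Example: a minimal sketch of the ensemble advice above, comparing one averaging and one boosting estimator by cross-validation. The synthetic data, n_estimators=100, and 5 folds are assumptions chosen for illustration, not recommendations from the sheet.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): N=300 samples, D=10 features.
x, t = make_classification(n_samples=300, n_features=10, random_state=0)

# Averaging estimator: many independent trees, predictions averaged.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", max_depth=None,
    min_samples_split=2, n_jobs=-1, random_state=0,
)
# Boosting estimator: each new tree tries to improve on the previous ones.
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
forest_scores = cross_val_score(forest, x, t, cv=5)
boost_scores = cross_val_score(boost, x, t, cv=5)
print(forest_scores.mean(), boost_scores.mean())
```

Which of the two scores higher depends on the data, which is why the sheet says to always cross-validate.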

Take your machine learning skills to the next level!
Register at enthought.com/python-for-data-science-training www.enthought.com
Stochastic Gradient Descent (SGD) Classifier
When to use it: When you have a very large N and D.
How it works: “Online” method, learns the weights in batches.
Tips: Data must be scaled.
Code: linear_model.SGDClassifier(loss, alpha, max_iter) and partial_fit() method (max_iter was
named n_iter in older releases). Use max_iter=np.ceil(10**6/n_samples).
loss='hinge' gives SVC, 'log_loss' ('log' in older releases) gives logistic regression.

Performance Metrics in sklearn.metrics
They take targets, t, and predicted classes, y, as arguments. There's more than one way to
be wrong. A fire alarm that always goes off is annoying, one that never goes off is costly.
[Figure: confusion matrix for classes C1, C2, C3.]
confusion_matrix: Explore how model confuses classes. Visualize with seaborn.heatmap.

accuracy_score (default for model.score): Fraction correctly predicted. Misleading if classes
are unbalanced. (TP + TN) / Total

recall_score: Fraction of actual fires that are predicted as fire. TP / (TP + FN)

precision_score: Fraction of correctly predicted fire of all cases where fire is predicted.
TP / (TP + FP)

TP counts positives (P) predicted as P, FP negatives predicted as P, FN positives predicted as
negative, and TN negatives predicted as negative.
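
Example: a minimal sketch of the metrics above on a made-up fire alarm record (1 = fire, 0 = no fire). The ten targets and predictions are invented for illustration.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
)

# Made-up fire alarm data: 1 = fire, 0 = no fire.
t = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # targets
y = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # predicted classes

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(t, y)
tn, fp, fn, tp = cm.ravel()
print(cm)  # rows: [4 2] and [1 3]

accuracy = accuracy_score(t, y)    # (TP + TN) / Total = 7 / 10
recall = recall_score(t, y)        # TP / (TP + FN) = 3 / 4
precision = precision_score(t, y)  # TP / (TP + FP) = 3 / 5
print(accuracy, recall, precision)
```

Here the alarm misses one of four fires (recall) and two of its five alarms are false (precision), which accuracy alone would hide.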

©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit creativecommons.org/licenses/by-nc-nd/4.0/
2 Clustering:
Unsupervised Learning
Predict the underlying structure in features, without the use of targets or labels. Split samples into groups called
“clusters.” With no targets, models are trained by minimizing some definition of “distance” within a cluster. Data has
N samples, D features, and the model discovers k clusters. Models can be used for prediction or for transformation,
by reducing D features into one with k unique values.
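
Example: a minimal sketch of the two uses mentioned above, prediction and transformation, using KMeans as the model. The synthetic blobs and k=3 are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): N=150 samples, D=2 features.
x, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

k = 3
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)

# Prediction: D features become one label with k unique values.
labels = model.predict(x)

# Transformation: distances from each sample to the k cluster centers.
distances = model.transform(x)

print(labels[:10], distances.shape)
```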

Some models expect geometries that are “flat,” or roughly spherical. Clusters with complicated
shapes like rings or lines are not flat and will not work in those models.

K-Means O(kN)
When to use it: When you need something that scales well and has a small number of flat clusters.
For large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to nearest of k cluster centers, then moves the centers to minimize
the average distance between centers and samples.
Tips: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the
centers. Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters). n_jobs=-1 to parallelize in older releases; newer releases
removed the parameter and parallelize automatically.

Mean Shift O(NlogN)
When to use it: When you have non-flat geometries, an unknown number of clusters, and need to
guarantee convergence.
How it works: Finds local maxima given a window size.
Tips: Accuracy strongly tied to selecting correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to small value for large dataset.
Estimating it is O(N^2) and can be the bottleneck.

Agglomerative Clustering O(N^2 logN)
When to use it: When you need a flexible definition of distance (e.g. Levenshtein).
How it works: Defines all observations as unique clusters, then merges the closest ones iteratively.
Tips: Worst time complexity.
Code: cluster.AgglomerativeClustering(linkage, affinity, connectivity). Set linkage criteria
for merging:
• 'ward': minimize sum of square differences. Minimizes variance. Gives most regular cluster size.
• 'complete': minimize max distance between sample pairs.
• 'average': minimize average distance between all sample pairs. Yields uneven cluster sizes.
affinity (named metric in newer releases): defines type of distances. 'l1' for sparse features,
e.g., text; 'cosine' is invariant to scaling.
connectivity: provides extra constraints about which nodes can be merged, e.g.,
neighbors.kneighbors_graph.

DBSCAN O(N^2)
When to use it: When you have very non-flat geometries or very uneven clusters.
How it works: Clusters are contiguous areas with high data density. Bounds of clusters are found
using graph connectivity.
Tips: O(N^2) memory use. Not deterministic at cluster boundaries.
Code: cluster.DBSCAN(min_samples, eps, metric)
Higher min_samples or lower eps requires higher density to form a cluster.
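
Example: a minimal sketch of the flat versus non-flat distinction above, running K-Means and DBSCAN on two concentric rings. The synthetic rings and the hand-picked eps=0.2 and min_samples=5 are assumptions for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings, a non-flat geometry (synthetic data, an assumption).
x, t = make_circles(n_samples=400, factor=0.4, noise=0.04, random_state=0)

# K-Means expects flat clusters and cannot separate one ring from the other.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)

# DBSCAN follows dense, connected regions; eps and min_samples were
# picked by hand for this data set.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(x)

# Agreement with the true ring membership (1.0 is a perfect match).
ari_kmeans = adjusted_rand_score(t, kmeans_labels)
ari_dbscan = adjusted_rand_score(t, dbscan_labels)
print(ari_kmeans, ari_dbscan)
```

On this data DBSCAN should recover the rings while K-Means stays close to chance level.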

Affinity Propagation O(N^2)
When to use it: When you have an unknown number of clusters
and need to specify own similarity metric (affinity argument).
How it works: Finds data points which maximize similarity within
cluster while minimizing similarity with data outside of cluster.
Tips: O(N^2) memory use. Accuracy tied to damping.
Code: cluster.AffinityPropagation(preference, damping)
• preference: Negative. Controls the number of clusters. Explore
on log scale.
• damping: 0.5 to 1.

BIRCH O(kN)
When to use it: When you have a large number of observations and small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of
the raw data.
Tips: Performs poorly with large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)

Performance Metrics in sklearn.metrics
The metrics do not take into account the exact class values, only their separation. Score is
based on ground truth (targets), if available, or on a measure of similarity within class, and
difference across classes.
Needs ground truth:
• adjusted_rand_score: -1 to 1 (best). 0 is random classes. Measures similarity. Related to
accuracy (% correct).
• adjusted_mutual_info_score: 0 to 1 (best). 0 is random classes. 10x slower than
adjusted_rand_score. Measures agreement.
• homogeneity_completeness_v_measure: 0 to 1 (best). homogeneity: each cluster only contains
members of one class; completeness: all members of a class are in the same cluster; and
v_measure_score: the harmonic mean of both. Not normalized for random labeling.
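
Example: a minimal sketch of the ground-truth metrics above on hand-made labels. The three label lists are invented for illustration.

```python
from sklearn.metrics import (
    adjusted_mutual_info_score, adjusted_rand_score,
    homogeneity_completeness_v_measure,
)

t = [0, 0, 0, 1, 1, 1]        # ground truth classes
renamed = [1, 1, 1, 0, 0, 0]  # same grouping, label values swapped
split = [0, 0, 1, 2, 2, 2]    # class 0 split across two clusters

# Only the grouping matters, not the label values: both scores
# reach their best value for the renamed labels.
ari = adjusted_rand_score(t, renamed)
ami = adjusted_mutual_info_score(t, renamed)
print(ari, ami)

# Every cluster is pure (full homogeneity), but class 0 is not kept
# together, so completeness drops below its best value.
h, c, v = homogeneity_completeness_v_measure(t, split)
print(h, c, v)
```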

Doesn't need ground truth:
• silhouette_score: -1 to 1 (best). 0 means overlapping clusters.
Based on distance to samples in same cluster and distance to next
nearest cluster.

3 Regression:
Predict Continuous Data
Predict how a dependent variable (output, t) changes when any of the independent variables (inputs, or features, x)
change. For example, how house prices change as a function of neighborhood and size, or how time spent on a web
page varies as a function of the number of ads and content type. Training data has N samples and D features.

Linear Model O(ND^2)
Solves problems of the form:
y = w0 + w1*x1 + ... + wD*xD
with predicted value y, features x, and fitted weights w.
Solved by minimizing “least square error”, ED, the sum of squared differences between targets t
and predicted values y.
On fitted models, access w as model.coef_ and w0 as model.intercept_.
Tips: Features must be uncorrelated, use decomposition.PCA().
Code: linear_model.LinearRegression() if less than 100,000 samples, or see SGD.

Ridge O(ND^2)
When to use it: When you have less than 100,000 samples or noisy outputs.
How it works: Linear model that limits the size of the weights. Prevents overfitting by
increasing bias. Minimizes E = ED + Ew instead of ED, where the second term, proportional to
alpha times the sum of squared weights, is called the “L2 norm”.
Code: linear_model.Ridge(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if
noisy samples.

Lasso O(ND^2)
When to use it: When you have less than 100,000 samples, and only some features should be important.
How it works: Linear model that forces small weights to be zero. Minimizes E = ED + Ew instead
of ED, where the second term, proportional to alpha times the sum of absolute weights, is called
the “L1 norm”.
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features
with non-zero weights.
Code: linear_model.Lasso(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if
noisy samples.

Ridge vs. Lasso: Shape of Ew
With Ridge and Lasso, the error to minimize E has an extra component Ew:
E = ED + Ew
[Figure: shape of Ew for Ridge and for Lasso.]
Lasso produces sparse models because small weights are forced to zero.

Nonlinear Transformations
When to use them: When a “straight line” is not sufficient, like predicting temperature as a
function of time of day.
How it works: “Reword” a nonlinear model in linear terms using nonlinear basis functions, Φj(x),
so we can use linear model machinery to solve nonlinear problems. The linear model becomes:
y = w0 + sum over j of wj*Φj(x)
Polynomial Expansion of Order P: A 2nd order polynomial two-feature model:
y = w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x1*x2 + w5*x2^2
Becomes a model with these six basis functions:
1, x1, x2, x1^2, x1*x2, x2^2
Tips: The same feature affects many different coefficients, so an outlier can have a big global
effect. Number of basis functions grows very quickly, O((P+1)(D+1)).
Code: poly = preprocessing.PolynomialFeatures(degree)
x_poly = poly.fit_transform(x)
Radial Basis Functions (RBF): Local, Gaussian-shaped functions, defined by centers and width.
Turns one feature into P features.
Code: metrics.pairwise.rbf_kernel(x, centers, gamma)
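
Example: a minimal sketch of the polynomial expansion described in the Nonlinear Transformations entry. It expands one made-up two-feature sample, then fits a linear model to a curve. The sample values and the toy target t = x^2 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: x1 = 2, x2 = 3.
x = np.array([[2.0, 3.0]])

# A 2nd order expansion yields six basis functions:
# 1, x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
print(x_poly)  # [[1. 2. 3. 4. 6. 9.]]

# A linear model on the expanded features can fit a curve.
# Toy target (an assumption): t = x^2 on a single feature.
x_curve = np.linspace(-1, 1, 20).reshape(-1, 1)
t_curve = x_curve.ravel() ** 2
curve_poly = PolynomialFeatures(degree=2).fit_transform(x_curve)
model = LinearRegression().fit(curve_poly, t_curve)
print(model.score(curve_poly, t_curve))
```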

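Example: a minimal sketch of the Ridge vs. Lasso note on this page, counting how many fitted weights each model sets exactly to zero. The synthetic data (20 features, 3 of them informative) and alpha=1.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (an assumption): D=20 features, only 3 of
# which actually influence the target.
x, t = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=5.0,
    random_state=0,
)

ridge = Ridge(alpha=1.0).fit(x, t)
lasso = Lasso(alpha=1.0).fit(x, t)

# Ridge shrinks weights but keeps them non-zero; Lasso sets many of
# them exactly to zero.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("zero weights, Ridge:", ridge_zeros)
print("zero weights, Lasso:", lasso_zeros)
```

The non-zero Lasso weights indicate which features the model kept, which is what feature_selection.SelectFromModel relies on.
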
Support Vector Regressor ~O(N^2D)
When to use it: When you have many important features, more features than samples, or a
nonlinear problem.
How it works: Find a function such that training points fit within a “tube” of acceptable error,
with some tolerance towards points that are outside the tube.
Tips: Must scale inputs, see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1). Make C smaller if lots of noisy observations
(C = 1/α, small C means more regularization). If LinearSVR doesn’t work, use
svm.SVR(kernel='rbf', gamma).

Stochastic Gradient Descent (SGD) Regressor
When to use it: When the fit is too slow with other estimators.
How it works: “Online” method, learns the weights in batches, with a subset of the data each
time. Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and partial_fit() method.

Performance Metrics in sklearn.metrics
mean_squared_error: Smaller is better. Puts large weight on outliers.
r2_score: Coefficient of determination. Best score is 1.0. Proportion of explained variance.
Default for model.score(x, t).
mean_absolute_error: Smaller is better. Uses same scale as the data.
median_absolute_error: Robust to outliers.
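
Example: a minimal sketch that trains SGDRegressor in batches with partial_fit() and scores it with two of the regression metrics from sklearn.metrics. The synthetic linear data, the batch size of 100, and the 5 passes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): N=2000, D=5, linear target.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 5))
t = x @ np.array([1.5, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=2000)

x = StandardScaler().fit_transform(x)  # scale inputs before SGD
model = SGDRegressor(random_state=0)

# Online learning: feed the data in batches of 100 samples, 5 passes.
for epoch in range(5):
    for start in range(0, len(x), 100):
        batch = slice(start, start + 100)
        model.partial_fit(x[batch], t[batch])

y = model.predict(x)
r2 = r2_score(t, y)
mae = mean_absolute_error(t, y)
print(r2, mae)
```

With data too large for memory, each batch would be read from disk instead of sliced from an array.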