1 Classification: Predict Categorical Data

Predict the class, or label (t), of a sample based on its features (x). Examples: recognize hand-written digits, or mark email as spam. In scikit-learn, labels are represented as integers and get expanded internally into matrices of binary choices between unique integer labels. Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from one class than others). Training data has N samples and D features.

Logistic Regression O(ND²)
When to use it: When you need to understand the contributions of features using a method that is fast to train and easy to interpret.
How it works: Fits an s-shaped function (the logistic function), which is continuous but has a steep transition between the two classes, and assigns class based on sign.
Tips: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver) (see the pipeline sketch at the end of this page).
• penalty='l1' to use the estimator for feature selection
• solver='liblinear' for small datasets or L1 penalty
• 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets
• 'sag' for very large datasets

Decision Tree O(ND log N)
When to use it: When you need to understand prediction decisions, when data has both continuous and categorical features, and when no scaling is needed.
How it works: Chains binary decisions on increasingly smaller subsets of data. Deeper trees have more complex decision rules and a better fit.
[Figure: example tree for answering the phone: "known number?"; if yes, "is grandma?" then pick up, otherwise ignore; if no, ignore.]
Tips: Very often overfits. Consider doing dimensionality reduction beforehand. N must double with each extra level.
Code: tree.DecisionTreeClassifier(max_depth). Start with max_depth=3, then increase. Use tree.export_graphviz to visualize the tree.

Ensemble Methods
When to use them: When no single estimator gives satisfying results.
How they work: Combine the predictions of multiple weak, biased estimators to create a better one. There are two types: averaging methods build many estimators and average their predictions; in boosting methods, each new estimator tries to improve the previous one.
Tips: Hard to generate the perfect mix of estimators.
Code: All in the ensemble module (see the cross-validation sketch at the end of this page).
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
Start with these, but always cross-validate:
• max_features=sqrt(n_features)
• max_depth=None
• min_samples_split=2 (the minimum allowed value)
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
All:
• Parallelize with n_jobs=-1
• Increasing n_estimators is better, but slower

Support Vector Classifier O(ND²) to O(ND³)
When to use it: When you have a large number of features or slightly more features than samples.
How it works: Maximizes the distance between classes in high-dimensional space, i.e., a "maximum margin classifier."
Tips: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if there are lots of noisy samples. If accuracy is important, set kernel='rbf'. If fast training is important, use svm.LinearSVC().

Neighbor Classifiers O(D log N) to O(DN)
When to use them: When you have large datasets or a very irregular decision boundary.
How they work: Predict class by majority vote from nearby data (the vote can change between, say, K=3 and K=5 neighbors).
Tips: Efficiency comes at the cost of also having high variance.
Code: neighbors.KNeighborsClassifier(n_neighbors).
• Use RadiusNeighborsClassifier() for unbalanced data and a D that is not too large.
• Try weights='uniform' and 'distance'.
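As an illustration (not part of the original sheet), here is a minimal sketch of a scaled logistic-regression fit; the iris dataset and the parameter values are placeholders chosen only so the example runs:

# A minimal sketch: scale the inputs, then fit an L2-regularized logistic
# regression. Dataset and parameter values are placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                       # inputs must be scaled
    LogisticRegression(C=1.0, solver='lbfgs', class_weight='balanced'),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))          # accuracy on held-out data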
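Similarly, a hedged sketch of the random-forest starting point suggested under Ensemble Methods, checked with 5-fold cross-validation; the dataset and n_estimators value are assumptions, not recommendations from the sheet:

# A minimal sketch of the suggested random-forest starting parameters,
# evaluated with cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=200,            # more estimators is better, but slower
    max_features='sqrt',         # sqrt(n_features)
    max_depth=None,
    min_samples_split=2,
    n_jobs=-1,                   # parallelize
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())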
Stochastic Gradient Descent (SGD) Classifier
When to use it: When you have a very large N and D.
How it works: "Online" method, learns the weights in batches.
Tips: Data must be scaled.
Code: linear_model.SGDClassifier(loss, alpha, max_iter) and the partial_fit() method. Use max_iter=int(np.ceil(10**6 / n_samples)) (this parameter was called n_iter in old releases). loss='hinge' gives an SVC; loss='log_loss' ('log' in older releases) gives logistic regression.

Performance Metrics in sklearn.metrics
They take targets, t, and predicted classes, y, as arguments. There is more than one way to be wrong: a fire alarm that always goes off is annoying, one that never goes off is costly.
confusion_matrix: Explore how the model confuses classes. Visualize with seaborn.heatmap.
accuracy_score (default for model.score): Fraction correctly predicted, (TP + TN) / Total. Meaningless if samples are unbalanced.
recall_score: Fraction of actual fires that were predicted as fire, TP / (TP + FN).
precision_score: Fraction of predicted fires that were actually fires, TP / (TP + FP).
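To make the difference between these metrics concrete, a small hand-made example (the arrays below are invented, with class 1 playing the role of "fire"):

# Invented predictions for an unbalanced two-class problem, to show how
# accuracy, recall, and precision can tell different stories.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

t = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # targets: two real fires
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])   # predictions: one false alarm, one miss

print(confusion_matrix(t, y))       # rows are true classes, columns are predictions
print(accuracy_score(t, y))         # 0.8, despite missing half the fires
print(recall_score(t, y))           # TP / (TP + FN) = 0.5
print(precision_score(t, y))        # TP / (TP + FP) = 0.5

2 Clustering: Unsupervised Learning

Predict the underlying structure in features, without the use of targets or labels. Split samples into groups called "clusters." With no targets, models are trained by minimizing some definition of "distance" within a cluster. Data has N samples, D features, and the model discovers k clusters. Models can be used for prediction or for transformation, by reducing D features into one with k unique values.

Some models expect geometries that are "flat," or roughly spherical. Clusters with complicated shapes, like rings or lines, are not flat and will not work in those models.

K-Means O(kN)
When to use it: When you need something that scales well and has a small number of flat clusters. For large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to the nearest of k cluster centers, then moves the centers to minimize the average distance between centers and samples.
Tips: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the centers. Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters) (see the sketch at the end of this page). Older releases accepted n_jobs=-1 to parallelize; recent releases parallelize automatically.

DBSCAN O(N²)
When to use it: When you have very non-flat geometries or very uneven clusters.
How it works: Clusters are contiguous areas with high data density. Bounds of clusters are found using graph connectivity.
Tips: O(N²) memory use. Not deterministic at cluster boundaries.
Code: cluster.DBSCAN(min_samples, eps, metric). Higher min_samples or lower eps requires higher density to form a cluster.

Agglomerative Clustering O(N² log N)
When to use it: When you need a flexible definition of distance (e.g., Levenshtein).
How it works: Defines all observations as unique clusters, then iteratively merges the closest ones.
Tips: Worst time complexity of the methods listed here.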
Code: cluster.AgglomerativeClustering(linkage, affinity, connectivity).
linkage: criterion for merging clusters:
• 'ward': minimize the sum of squared differences (minimizes variance). Gives the most regular cluster sizes.
• 'complete': minimize the maximum distance between sample pairs.
• 'average': minimize the average distance between all sample pairs. Yields uneven cluster sizes.
affinity: defines the type of distance. 'l1' for sparse features, e.g., text; 'cosine' is invariant to scaling.
connectivity: provides extra constraints about which nodes can be merged, e.g., neighbors.kneighbors_graph.

Mean Shift O(N log N)
When to use it: When you have non-flat geometries, an unknown number of clusters, and need to guarantee convergence.
How it works: Finds local maxima of the data density, given a window size.
Tips: Accuracy strongly tied to selecting the correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to a small value for large datasets. Estimating it is O(N²) and can be the bottleneck.

Affinity Propagation O(N²)
When to use it: When you have an unknown number of clusters and need to specify your own similarity metric (affinity argument).
How it works: Finds data points which maximize similarity within a cluster while minimizing similarity with data outside the cluster.
Tips: O(N²) memory use. Accuracy tied to damping.
Code: cluster.AffinityPropagation(preference, damping)
• preference: Negative. Controls the number of clusters. Explore on a log scale.
• damping: 0.5 to 1.

BIRCH O(kN)
When to use it: When you have a large number of observations and a small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of the raw data.
Tips: Performs poorly with a large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)

Performance Metrics in sklearn.metrics
The metrics do not take into account the exact class values, only their separation. Scores are based on ground truth (targets), if available, or on a measure of similarity within a cluster and difference across clusters.
Needs ground truth:
• adjusted_rand_score: -1 to 1 (best). 0 is random classes. Measures similarity. Related to accuracy (% correct).
• adjusted_mutual_info_score: 0 to 1 (best). 0 is random classes. 10x slower than adjusted_rand_score. Measures agreement.
• homogeneity_completeness_v_measure: 0 to 1 (best). homogeneity: each cluster only contains members of one class; completeness: all members of a class are in the same cluster; v_measure_score: the harmonic mean of both. Not normalized for random labeling.
Doesn't need ground truth:
• silhouette_score: -1 to 1 (best). 0 means overlapping clusters. Based on distance to samples in the same cluster and distance to the next nearest cluster.
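A minimal sketch tying the clustering estimators to the metrics above; the synthetic blobs and the eps/min_samples values are assumptions chosen only so the example runs:

# Synthetic blobs: K-Means scored with silhouette_score (no ground truth needed),
# and DBSCAN reported by number of clusters found.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)          # distance-based methods need scaling

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, kmeans_labels))      # near 1 for well-separated blobs

dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_found = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(n_found)                                 # clusters found; label -1 marks noise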
3 Regression: Predict Continuous Data

Predict how a dependent variable (output, t) changes when any of the independent variables (inputs, or features, x) change. For example, how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. Training data has N samples and D features.

Linear Model O(ND²)
Solves problems of the form y = w0 + w1·x1 + … + wD·xD, with predicted value y, features x, and fitted weights w. Solved by minimizing the "least square error" ED = ½ Σn (yn − tn)².
On fitted models, access w as model.coef_ and w0 as model.intercept_.
Tips: Features must be uncorrelated; use decomposition.PCA().
Code: linear_model.LinearRegression() if less than 100,000 samples, or see SGD.

Ridge vs. Lasso: Shape of Ew
With Ridge and Lasso, the error to minimize, E = ED + Ew, has an extra component Ew. Lasso produces sparse models because small weights are forced to zero.

Ridge O(ND²)
When to use it: When you have less than 100,000 samples or noisy outputs.
How it works: Linear model that limits the size of the weights. Prevents overfitting by increasing bias. Minimizes E instead of ED, where the second term, Ew = (α/2) Σj wj², is called the "L2 norm."
Code: linear_model.Ridge(alpha)
• alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if noisy samples.

Lasso O(ND²)
When to use it: When you have less than 100,000 samples, and only some features should be important.
How it works: Linear model that forces small weights to zero. Minimizes E instead of ED, where the second term, Ew = α Σj |wj|, is called the "L1 norm."
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features with non-zero weights (see the selection sketch at the end of this section).
Code: linear_model.Lasso(alpha)
• alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if noisy samples.

Nonlinear Transformations
When to use them: When a "straight line" is not sufficient, like predicting temperature as a function of time of day.
How it works: "Reword" a nonlinear model in linear terms using nonlinear basis functions, φj(x), so we can use the linear-model machinery to solve nonlinear problems. The linear model becomes y = w0 + Σj wj·φj(x).
Polynomial Expansion of Order P: A 2nd-order polynomial model with two features becomes a linear model with these six basis functions: 1, x1, x2, x1², x1·x2, x2².
Tips: The same feature affects many different coefficients, so an outlier can have a big global effect. The number of basis functions grows very quickly, as (D+P choose P).
Code: poly = preprocessing.PolynomialFeatures(degree); x_poly = poly.fit_transform(x)
Radial Basis Functions (RBF): Local, Gaussian-shaped functions, defined by centers and width. Turn one feature into P features (one per center).
Code: metrics.pairwise.rbf_kernel(x, centers, gamma)
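A sketch of the basis-function idea in code: PolynomialFeatures feeding a scaled Ridge fit. The synthetic "hour of day" data and the degree/alpha values are assumptions, not recommendations from the sheet:

# Assumed synthetic data: a polynomial basis expansion feeding a scaled Ridge fit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = np.linspace(0, 24, 200).reshape(-1, 1)                  # hour of day
t = np.sin(2 * np.pi * x.ravel() / 24) + 0.1 * rng.standard_normal(200)

model = make_pipeline(PolynomialFeatures(degree=5),         # basis functions phi_j(x)
                      StandardScaler(),
                      Ridge(alpha=1.0))
model.fit(x, t)
print(model.score(x, t))        # R^2 of the nonlinear fit
print(model[-1].coef_)          # fitted weights w on the expanded features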
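And a sketch of the Lasso feature-selection tip via SelectFromModel; the synthetic dataset and alpha are placeholders, and the default threshold simply keeps the features whose fitted weights are (essentially) non-zero:

# Assumed synthetic data: Lasso inside SelectFromModel as a transformation stage.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, t = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, t)
X_selected = selector.transform(X)             # columns with non-zero weights
print(X_selected.shape)
print(selector.get_support())                  # boolean mask of kept features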
Support Vector Regressor ~O(N²D)
When to use it: When you have many important features, more features than samples, or a nonlinear problem.
How it works: Finds a function such that training points fit within a "tube" of acceptable error, with some tolerance towards points that are outside the tube.
Tips: Must scale inputs; see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1) (see the sketch at the end of this page). Make C smaller if there are lots of noisy observations (C = 1/α, so small C means more regularization). If LinearSVR doesn't work, use svm.SVR(kernel='rbf', gamma).

Stochastic Gradient Descent (SGD) Regressor
When to use it: When the fit is too slow with other estimators.
How it works: "Online" method, learns the weights in batches, with a subset of the data each time. Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and the partial_fit() method.

Performance Metrics in sklearn.metrics
mean_squared_error: Smaller is better. Puts a large weight on outliers.
mean_absolute_error: Smaller is better. Uses the same scale as the data.
median_absolute_error: Robust to outliers.
r2_score: Coefficient of determination. Best score is 1.0. Proportion of explained variance. Default for model.score(x, t).
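Finally, a minimal sketch of a scaled LinearSVR fit evaluated with the regression metrics above; the synthetic data and the epsilon/C values are assumptions only:

# Assumed synthetic data: scale inputs, fit LinearSVR, report regression metrics.
from sklearn.datasets import make_regression
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

X, t = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

model = make_pipeline(StandardScaler(),
                      LinearSVR(epsilon=0.0, C=1.0, max_iter=10000))
model.fit(X_train, t_train)
y = model.predict(X_test)

print(mean_squared_error(t_test, y))
print(mean_absolute_error(t_test, y))
print(median_absolute_error(t_test, y))
print(r2_score(t_test, y))      # same value as model.score(X_test, t_test)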