1 Classification: Predict Categorical Data

Predict the class, or label (t), of a sample based on its features (x). Examples: recognize hand-written digits, or mark email as spam. In scikit-learn, labels are represented as integers and are expanded internally into matrices of binary choices between the unique integer labels. Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from one class than from others). Training data has N samples and D features.

Logistic Regression O(ND^2)
When to use it: When you need to understand the contributions of features using a method that is fast to train and easy to interpret.
How it works: Fits an s-shaped function (the logistic function), which is continuous but has a steep transition between the two classes, and assigns the class based on sign.
Tips: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver)
• penalty='l1' to use the estimator for feature selection
• solver='liblinear' for small datasets or L1 penalty
• 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets
• 'sag' for very large datasets

Support Vector Classifier O(N^2 D) to O(N^3 D)
When to use it: When you have a large number of features or slightly more features than samples.
How it works: Maximizes the distance between classes in high-dimensional space, i.e., a “maximum margin classifier.”
Tips: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if there are lots of noisy samples. If accuracy is important, set kernel='rbf'. If fast training is important, use svm.LinearSVC().

Decision Tree O(ND log(N))
When to use it: When you need to understand prediction decisions, when data has both continuous and categorical features, and when no scaling is needed.
How it works: Chains binary decisions on increasingly smaller subsets of data. Deeper trees have more complex decision rules and a better fit.
[Figure: example decision tree with decision nodes “known number?” and “is grandma?” and leaves “pick up” and “ignore”.]
Tips: Very often overfits. Consider doing dimensionality reduction beforehand. The number of samples needed to populate the tree doubles with each extra level.
Code: tree.DecisionTreeClassifier(max_depth). Start with max_depth=3, then increase. Use tree.export_graphviz to visualize the tree.

Neighbor Classifiers O(D log N) to O(DN)
When to use them: When you have large datasets or a very irregular decision boundary.
How it works: Predicts the class by majority vote from nearby data.
Tips: Efficiency comes at the cost of also having high variance.
Code: neighbors.KNeighborsClassifier(n_neighbors).
• Use RadiusNeighborsClassifier() for unbalanced data and a D that is not too large.
• Try weights='uniform' and 'distance'.

Ensemble Methods
When to use them: When no single estimator gives satisfying results.
How they work: Combine the predictions of multiple weak, biased estimators to create a better one. There are two types. Averaging methods build many estimators and average their predictions. In boosting methods, each new estimator tries to improve on the previous one.
Tips: Hard to generate the perfect mix of estimators.
Code: All in the ensemble module.
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
Start with these, but always cross-validate:
• max_features='sqrt' (i.e., sqrt(n_features))
• max_depth=None
• min_samples_split=2 (the smallest allowed value)
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
All:
• Parallelize with n_jobs=-1 (available on the averaging estimators; the two boosting estimators have no n_jobs parameter)
• Increasing n_estimators is better, but slower

Take your machine learning skills to the next level! Register at enthought.com/python-for-data-science-training
www.enthought.com

Stochastic Gradient Descent (SGD) Classifier
When to use it: When you have a very large N and D.
How it works: “Online” method, learns the weights in batches.
Tips: Data must be scaled.
Code: linear_model.SGDClassifier(loss, alpha, max_iter) and the partial_fit() method. Use max_iter=int(np.ceil(10**6 / n_samples)). loss='hinge' gives a linear SVC, 'log_loss' gives logistic regression. (Older scikit-learn releases call these n_iter and 'log'.)

Performance Metrics in sklearn.metrics
They take targets, t, and predicted classes, y, as arguments. There's more than one way to be wrong. A fire alarm that always goes off is annoying; one that never goes off is costly.
[Figure: confusion matrix for three classes, C1, C2, C3.]
confusion_matrix: Explore how the model confuses classes. Visualize with seaborn.heatmap.
accuracy_score (default for model.score): Fraction correctly predicted. Misleading if classes are unbalanced. (TP + TN) / Total
recall_score: Fraction of actual fires that are predicted as fire. TP / (TP + FN)
precision_score: Fraction of correctly predicted fire of all cases where fire is predicted. TP / (TP + FP)

©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0 International License. To view a copy of this license, visit creativecommons.org/licenses/by-nc-nd/4.0/

2 Clustering: Unsupervised Learning

Predict the underlying structure in features, without the use of targets or labels. Split samples into groups called “clusters.” With no targets, models are trained by minimizing some definition of “distance” within a cluster. Data has N samples, D features, and the model discovers k clusters. Models can be used for prediction or for transformation, by reducing D features into one with k unique values.

Some models expect geometries that are “flat,” or roughly spherical. Clusters with complicated shapes like rings or lines are not flat and will not work in those models.

K-Means O(kN)
When to use it: When you need something that scales well and you have a small number of flat clusters. For large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to the nearest of k cluster centers, then moves the centers to minimize the average distance between centers and samples.
Tips: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the centers. Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters). In releases before scikit-learn 1.0, n_jobs=-1 parallelizes; the parameter was removed in 1.0.

DBSCAN O(N^2)
When to use it: When you have very non-flat geometries or very uneven clusters.
How it works: Clusters are contiguous areas with high data density. Bounds of clusters are found using graph connectivity.
Tips: O(N^2) memory use. Assignment of points on cluster boundaries is not deterministic (it depends on the order of the data).
Code: cluster.DBSCAN(min_samples, eps, metric). Higher min_samples or lower eps requires higher density to form a cluster.

Agglomerative Clustering O(N^2 log N)
When to use it: When you need a flexible definition of distance (e.g. Levenshtein).
How it works: Defines all observations as unique clusters, then merges the closest ones iteratively.
Tips: Worst time complexity of the methods listed here.
Code: cluster.AgglomerativeClustering(linkage, metric, connectivity). The metric parameter was named affinity before scikit-learn 1.2.
Set linkage criteria for merging:
• 'ward': minimize the sum of square differences. Minimizes variance. Gives the most regular cluster sizes.
• 'complete': minimize the max distance between sample pairs.
• 'average': minimize the average distance between all sample pairs. Yields uneven cluster sizes.
metric: defines the type of distances. 'l1' for sparse features, e.g., text; 'cosine' is invariant to scaling.
connectivity: provides extra constraints about which nodes can be merged, e.g., neighbors.kneighbors_graph.

Mean Shift O(N log N)
When to use it: When you have non-flat geometries, an unknown number of clusters, and need to guarantee convergence.
How it works: Finds local maxima of the data density given a window size (the bandwidth).
Tips: Accuracy is strongly tied to selecting the correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to a small value for a large dataset. Estimating it is O(N^2) and can be the bottleneck.
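
The contrast between flat and non-flat geometries can be sketched on a toy dataset. The example below is an illustration, not part of the original sheet: the two-moons data from datasets.make_moons, the eps and min_samples values, and the use of adjusted_rand_score to compare against the known labels are all assumptions chosen for the demonstration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a non-flat geometry (assumed toy data).
x, t = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means assumes flat, roughly spherical clusters around k centers.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)

# DBSCAN follows contiguous dense regions; eps and min_samples are
# illustrative values picked for this dataset, not general defaults.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(x)

# Compare each clustering with the known moon membership.
kmeans_ari = adjusted_rand_score(t, kmeans_labels)
dbscan_ari = adjusted_rand_score(t, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f}")
```

On data like this, DBSCAN would be expected to recover the two moons while K-Means cuts straight across them. On real data, eps usually needs tuning after scaling the features.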

Affinity Propagation O(N^2)
When to use it: When you have an unknown number of clusters and need to specify your own similarity metric (affinity argument).
How it works: Finds data points which maximize similarity within the cluster while minimizing similarity with data outside of the cluster.
Tips: O(N^2) memory use. Accuracy tied to damping.
Code: cluster.AffinityPropagation(preference, damping)
• preference: Negative. Controls the number of clusters. Explore on log scale.
• damping: 0.5 to 1.

BIRCH O(kN)
When to use it: When you have a large number of observations and a small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of the raw data.
Tips: Performs poorly with a large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)

Performance Metrics in sklearn.metrics
The metrics do not take into account the exact class values, only their separation. The score is based on ground truth (targets), if available, or on a measure of similarity within a class and difference across classes.
Needs ground truth:
• adjusted_rand_score: -1 to 1 (best). 0 is random classes. Measures similarity. Related to accuracy (% correct).
• adjusted_mutual_info_score: 0 to 1 (best). 0 is random classes. 10x slower than adjusted_rand_score. Measures agreement.
• homogeneity_completeness_v_measure: 0 to 1 (best). homogeneity: each cluster only contains members of one class; completeness: all members of a class are in the same cluster; and v_measure_score: the harmonic mean of both. Not normalized for random labeling.
Doesn't need ground truth:
• silhouette_score: -1 to 1 (best). 0 means overlapping clusters. Based on distance to samples in the same cluster and distance to the next nearest cluster.
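
The two kinds of clustering metric can be sketched together. The example below is a minimal illustration under stated assumptions: the three blob centers, the range of candidate k values, and the idea of picking k by the highest silhouette_score are choices made for the demonstration, not a recommendation from the original sheet.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated blobs (assumed toy data with known labels t).
centers = [[-5, -5], [0, 5], [5, -5]]
x, t = make_blobs(n_samples=300, centers=centers, cluster_std=1.0,
                  random_state=0)

# silhouette_score needs no ground truth, so it can guide the choice of k.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)

# adjusted_rand_score needs ground truth; it ignores the label values
# themselves and only scores how samples are grouped.
best_labels = KMeans(n_clusters=best_k, n_init=10,
                     random_state=0).fit_predict(x)
ari = adjusted_rand_score(t, best_labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.2f}")
print(f"best k={best_k}, ARI against true labels={ari:.2f}")
```

With clusters this cleanly separated, both metrics should agree. On overlapping or non-flat clusters the silhouette can favor a different k than the ground truth suggests.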

3 Regression: Predict Continuous Data

Predict how a dependent variable (output, t) changes when any of the independent variables (inputs, or features, x) change. For example, how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. Training data has N samples and D features.

Linear Model O(ND^2)
Solves problems of the form:
$y(x, w) = w_0 + w_1 x_1 + \dots + w_D x_D$
with predicted value y, features x, and fitted weights w. Solved by minimizing the “least square error”, E_D:
$E_D(w) = \sum_{n=1}^{N} \left( t_n - y(x_n, w) \right)^2$
On fitted models, access w as model.coef_ and w_0 as model.intercept_.
Tips: Features must be uncorrelated, use decomposition.PCA().
Code: linear_model.LinearRegression() if fewer than 100,000 samples, or see SGD.

Ridge O(ND^2)
When to use it: When you have fewer than 100,000 samples or noisy outputs.
How it works: Linear model that limits the size of the weights. Prevents overfitting by increasing bias. Minimizes E instead of E_D, where the second term is called the “L2 norm”:
$E(w) = E_D(w) + \alpha \sum_{j=1}^{D} w_j^2$
Code: linear_model.Ridge(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if samples are noisy.

Lasso O(ND^2)
When to use it: When you have fewer than 100,000 samples, and only some features should be important.
How it works: Linear model that forces small weights to be zero. Minimizes E instead of E_D, where the second term is called the “L1 norm”:
$E(w) = E_D(w) + \alpha \sum_{j=1}^{D} |w_j|$
(scikit-learn's Lasso additionally scales E_D by 1/(2N).)
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features with non-zero weights.
Code: linear_model.Lasso(alpha)
alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if samples are noisy.

Ridge vs. Lasso: Shape of E_w
With Ridge and Lasso, the error to minimize, E, has an extra component E_w:
$E = E_D + E_w$
Lasso produces sparse models because small weights are forced to zero.

Nonlinear Transformations
When to use them: When a “straight line” is not sufficient, like predicting temperature as a function of time of day.
How it works: “Reword” a nonlinear model in linear terms using nonlinear basis functions, Φ_j(x), so we can use linear model machinery to solve nonlinear problems. The linear model becomes:
$y(x, w) = \sum_{j} w_j \Phi_j(x)$, with $\Phi_0(x) = 1$ so that w_0 remains the intercept.

Polynomial Expansion of Order P: A 2nd order polynomial two-feature model:
$y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2$
Becomes a model with these six basis functions:
$\Phi_0 = 1,\ \Phi_1 = x_1,\ \Phi_2 = x_2,\ \Phi_3 = x_1^2,\ \Phi_4 = x_1 x_2,\ \Phi_5 = x_2^2$
Tips: The same feature affects many different coefficients, so an outlier can have a big global effect. The number of basis functions grows very quickly: a full expansion of order P in D features has (D+P)! / (D! P!) terms.
Code: poly = preprocessing.PolynomialFeatures(degree)
x_poly = poly.fit_transform(x)

Radial Basis Functions (RBF): Local, Gaussian-shaped functions, defined by centers and width. Turns one feature into P features.
Code: metrics.pairwise.rbf_kernel(x, centers, gamma)

Support Vector Regressor ~O(N^2 D)
When to use it: When you have many important features, more features than samples, or a nonlinear problem.
How it works: Finds a function such that training points fit within a “tube” of acceptable error, with some tolerance towards points that are outside the tube.
Tips: Must scale inputs, see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1). Make C smaller if there are lots of noisy observations (C = 1/α, small C means more regularization). If LinearSVR doesn’t work, use svm.SVR(kernel='rbf', gamma).

Stochastic Gradient Descent (SGD) Regressor
When to use it: When the fit is too slow with other estimators.
How it works: “Online” method, learns the weights in batches, with a subset of the data each time. Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and the partial_fit() method.

Performance Metrics in sklearn.metrics
mean_squared_error: Smaller is better. Puts large weight on outliers.
$\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} (t_n - y_n)^2$
r2_score: Coefficient of determination. Best score is 1.0. Proportion of explained variance. Default for model.score(x, t).
$R^2 = 1 - \frac{\sum_{n} (t_n - y_n)^2}{\sum_{n} (t_n - \bar{t})^2}$
mean_absolute_error: Smaller is better. Uses the same scale as the data.
$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} |t_n - y_n|$
median_absolute_error: Robust to outliers.
$\mathrm{MedAE} = \mathrm{median}_n \, |t_n - y_n|$
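
The claims that Lasso produces sparse models while Ridge only shrinks weights, and that a 2nd order expansion of two features has six basis functions, can be checked with a short sketch. Everything about the data here is an assumption made for illustration: the synthetic features, the three "important" weights, the noise level, and the alpha values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed): 10 features, only the first 3 matter.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [3.0, -2.0, 1.5]
t = x @ true_w + 0.5 + rng.normal(scale=0.1, size=200)

x_train, x_test, t_train, t_test = train_test_split(x, t, random_state=0)

# alpha values are illustrative; cross-validate them on real data.
ridge = Ridge(alpha=1.0).fit(x_train, t_train)
lasso = Lasso(alpha=0.1).fit(x_train, t_train)

# Ridge shrinks every weight; Lasso sets the unimportant ones to zero.
n_ridge_nonzero = np.count_nonzero(ridge.coef_)
n_lasso_nonzero = np.count_nonzero(lasso.coef_)
ridge_r2 = r2_score(t_test, ridge.predict(x_test))
lasso_r2 = r2_score(t_test, lasso.predict(x_test))
print(f"Ridge: {n_ridge_nonzero} non-zero weights, R^2={ridge_r2:.3f}")
print(f"Lasso: {n_lasso_nonzero} non-zero weights, R^2={lasso_r2:.3f}")

# A 2nd order expansion of two features: 1, x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x[:, :2])
print(f"Basis functions for D=2, P=2: {x_poly.shape[1]}")
```

The surviving Lasso weights can then be passed to feature_selection.SelectFromModel, as the Lasso tip suggests.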