
Shapelet based time series machine learning

Shapelets are subseries of time series, taken from the training data, that are useful for time series machine learning. They were first proposed as a primitive for machine learning [1] and were embedded in a decision tree for classification. The Shapelet Transform Classifier (STC) [2] is a pipeline classifier which searches the training data for shapelets selected by information gain, transforms each series into a vector of distances to those shapelets, then builds a classifier on this feature space.
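
The distance from a series to a shapelet is commonly taken to be the smallest distance between the shapelet and any window of the same length in the series. The sketch below illustrates that idea with z-normalised windows and Euclidean distance. The function names `z_normalise` and `shapelet_distance` are hypothetical, and this is not necessarily the exact distance aeon computes.

```python
import numpy as np


def z_normalise(x):
    """Rescale a 1D array to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    # A constant window has no shape; map it to zeros to avoid dividing by 0.
    if std < 1e-8:
        return np.zeros_like(x)
    return (x - x.mean()) / std


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window."""
    series = np.asarray(series, dtype=float)
    shapelet = z_normalise(shapelet)
    length = len(shapelet)
    best = np.inf
    for start in range(len(series) - length + 1):
        window = z_normalise(series[start : start + length])
        best = min(best, np.linalg.norm(window - shapelet))
    return best


shapelet = [0.0, 1.0, 0.0]
contains_shape = [5.0, 5.0, 7.0, 5.0, 5.0, 5.0]  # same bump, scaled and offset
flat_then_ramp = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]  # no bump anywhere

d_match = shapelet_distance(contains_shape, shapelet)
d_other = shapelet_distance(flat_then_ramp, shapelet)
print(d_match, d_other)
```

Because each window is normalised before comparison, the series containing a scaled and offset copy of the bump has a distance of effectively zero, while the series without the bump does not.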

Finding shapelets involves selecting candidate shapelets and evaluating them. The original shapelet tree and STC performed a full enumeration of all possible shapelets before keeping the best ones. This is computationally inefficient and can lead to overfitting. More recent shapelet-based machine learning algorithms randomise the search.
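
To give a sense of scale, the sketch below counts the candidates a full enumeration would visit. It assumes a training set of 40 series with 6 channels and 100 time points each, and a minimum shapelet length of 3. The function `count_candidates` and these values are illustrative assumptions, not part of aeon.

```python
def count_candidates(n_cases, n_channels, series_length, min_length=3):
    """Count every subseries of every length, channel and training case."""
    per_channel = sum(
        series_length - length + 1
        for length in range(min_length, series_length + 1)
    )
    return n_cases * n_channels * per_channel


# Assumed dimensions: 40 cases, 6 channels, 100 time points per channel.
full = count_candidates(n_cases=40, n_channels=6, series_length=100)
sampled = 100  # a random search evaluates only a fixed number of candidates
print(full, sampled)  # 1164240 100
```

Each candidate must also be compared against every training series to be evaluated, so the gap in total work between enumeration and random sampling is larger still.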

[1]:
import warnings

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from aeon.datasets import load_basic_motions
from aeon.registry import all_estimators
from aeon.transformations.collection.shapelet_based import RandomShapeletTransform

warnings.filterwarnings("ignore")
all_estimators("classifier", filter_tags={"algorithm_type": "shapelet"})
[1]:
[('MrSQMClassifier',
  aeon.classification.shapelet_based._mrsqm.MrSQMClassifier),
 ('RDSTClassifier', aeon.classification.shapelet_based._rdst.RDSTClassifier),
 ('ShapeletTransformClassifier',
  aeon.classification.shapelet_based._stc.ShapeletTransformClassifier)]

Shapelet Transform for Classification

The RandomShapeletTransform transformer takes a set of labelled training time series in its fit function, randomly samples n_shapelet_samples candidate shapelets, and keeps at most the best max_shapelets. The retained shapelets are used in the transform function to create a new tabular dataset, where each row represents a time series instance and each column stores the distance from a time series to a shapelet. The resulting tabular data can be used by any scikit-learn compatible classifier. In this notebook we will explain these terms and describe how the algorithm works. But first we show it in action.

We will use the BasicMotions data as an example. This data set contains time series of motion traces for the activities “running”, “walking”, “standing” and “badminton”. The learning problem is to predict the activity given the time series. Each time series has six channels: x, y, z position and x, y, z accelerometer of the wrist. Data was recorded on a smart watch.

[2]:
# Load the training split and fit the transform on it
X, y = load_basic_motions(split="train")
rst = RandomShapeletTransform(n_shapelet_samples=100, max_shapelets=10, random_state=42)
st = rst.fit_transform(X, y)
print(" Shape of transformed data = ", st.shape)
print(" Distance of second series to third shapelet = ", st[1][2])
# Transform the test split with the shapelets found on the training data
testX, testy = load_basic_motions(split="test")
tr_test = rst.transform(testX)
# Any scikit-learn classifier can be trained on the distance features
rf = RandomForestClassifier(random_state=10)
rf.fit(st, y)
preds = rf.predict(tr_test)
print(" Shapelets + random forest acc = ", accuracy_score(testy, preds))
 Shape of transformed data =  (40, 8)
 Distance of second series to third shapelet =  1.302772121165026
 Shapelets + random forest acc =  0.95
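
The tabular layout produced by the transform can be illustrated on synthetic data. The following is a simplified sketch, not aeon's implementation: it samples shapelets at random from a toy univariate collection and builds the distance matrix, but omits the quality evaluation and selection step. The helper names and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def z_normalise(x):
    """Zero mean, unit standard deviation; constant input maps to zeros."""
    std = x.std()
    return (x - x.mean()) / std if std > 1e-8 else np.zeros_like(x)


def min_distance(series, shapelet):
    """Smallest distance between a z-normalised shapelet and any window."""
    n = len(shapelet)
    return min(
        np.linalg.norm(z_normalise(series[i : i + n]) - shapelet)
        for i in range(len(series) - n + 1)
    )


# Toy univariate collection: 6 series of length 30 (synthetic noise).
X_toy = rng.normal(size=(6, 30))

# Randomly sample 4 candidate shapelets: pick a case, a length and a start.
shapelets = []
for _ in range(4):
    case = rng.integers(0, len(X_toy))
    length = rng.integers(3, 11)  # lengths 3 to 10 inclusive
    start = rng.integers(0, X_toy.shape[1] - length + 1)
    shapelets.append(z_normalise(X_toy[case, start : start + length]))

# One row per series, one column per shapelet.
features = np.array([[min_distance(x, s) for s in shapelets] for x in X_toy])
print(features.shape)  # (6, 4)
```

Every column contains a zero, because the series a shapelet was taken from matches it exactly at the sampled position.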

Visualising Shapelets

The first column of the transformed data represents the distance from the first shapelet to each time series. The shapelets are sorted, so the first shapelet is the one we estimate is the best (using the calculation described below). You can recover the shapelets from the transform. Each shapelet is a 7-tuple, storing the following information:

[3]:
running_shapelet = rst.shapelets[0]
print("Quality = ", running_shapelet[0])
print("Length = ", running_shapelet[1])
print("position = ", running_shapelet[2])
print("Channel = ", running_shapelet[3])
print("Origin Instance Index = ", running_shapelet[4])
print("Class label = ", running_shapelet[5])
print("Shapelet = ", running_shapelet[6])
Quality =  0.81127812
Length =  39
position =  55
Channel =  0
Origin Instance Index =  11
Class label =  running
Shapelet =  [-0.85667017 -1.88711152 -0.8751295   0.80633757  1.10838333  0.69810992
  0.85713394  1.23190921  0.01801365 -1.29683966 -1.94694259 -0.37487726
 -0.37487726  1.39471462  0.74922685  0.74922685  0.22343376  0.22343376
 -0.7730703  -1.37591995 -0.80376393  1.32758071  0.99778845  0.6013481
  0.83711118  0.93684593  0.93684593 -1.30429475 -1.64522057 -0.56312308
  0.96855713  0.56796251  0.35714242  0.62066541  0.65135287 -0.80531237
 -1.49170075 -1.18512797  0.69685753]
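
The quality value printed above is consistent with a one-vs-rest information gain. As a hedged illustration, the sketch below assumes that 10 of the 40 training cases belong to the target class and that a distance threshold separates them perfectly from the rest. These are assumptions about the data and the scoring, and the functions `entropy` and `information_gain` are hypothetical helpers rather than aeon's own code.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    n = sum(parent)
    weighted = sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
    return entropy(parent) - weighted


# Counts are [target class, all other classes].
# Assumed: 10 target cases among 40, perfectly separated by a threshold.
gain = information_gain(parent=[10, 30], left=[10, 0], right=[0, 30])
print(round(gain, 8))  # 0.81127812
```

Under these assumptions a perfect split recovers the full entropy of the 10-versus-30 labelling, so this is the largest gain such a shapelet could achieve.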

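We can directly extract shapelets and inspect them. These are the two shapelets that are best at discriminating badminton and running against other activities. The class label (index 5) and channel (index 3) stored in each tuple identify what a shapelet represents; which index holds the best shapelet for a given class depends on the random seed and the aeon version. All shapelets are normalised to provide scale invariance.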

[4]:
import matplotlib.pyplot as plt

badminton_shapelet = rst.shapelets[4]
print(" Badminton shapelet from channel 0 (x-dimension)", badminton_shapelet)
plt.title("Best shapelets for running and badminton")
plt.plot(badminton_shapelet[6], label="Badminton")
plt.plot(running_shapelet[6], label="Running")
plt.legend()
 Badminton shapelet from channel 0 (x-dimension) (0.65194393, 74, 7, 1, 1, 'standing', array([-5.27667376, -0.94911454,  0.90433173,  1.26316864,  2.34760078,
        1.84408   ,  0.9192852 ,  0.9192852 , -1.29868372, -1.29868372,
       -1.5476774 , -1.03000413,  0.27593674, -0.70184658,  0.37460295,
        1.27398121,  1.02881837,  0.64543662, -0.0669839 , -0.54373096,
       -0.55716134, -0.56605101, -0.08611633,  0.31270572,  0.25642625,
        0.5512744 ,  0.78929504,  0.73385326,  0.73385326, -0.26777726,
       -0.63967737, -0.63967737, -0.5539071 , -0.5539071 ,  0.3867047 ,
        0.3867047 ,  0.88832979,  0.85074214,  0.46901267,  0.0925433 ,
       -0.34444436, -0.72498936, -0.83763127, -0.53034818, -0.05869122,
        0.46600593,  1.02537238,  0.81800526,  0.51709059,  0.17497366,
       -0.31072836, -0.64876695, -0.89102368, -0.60834799, -0.0627886 ,
        0.42532723,  0.95696668,  0.91077086,  0.77491818,  0.14283377,
        0.14283377, -1.08722874, -1.08722874, -0.65706914, -0.65706914,
        0.28210933,  0.74159654,  0.8064869 ,  0.8064869 ,  0.19889294,
       -0.16601048, -0.78706337, -0.76364317, -0.63789726]))
[4]:
<matplotlib.legend.Legend at 0x2ce6adcd2e0>
[Figure: Best shapelets for running and badminton]

Both shapelets are in the x-axis, so represent side to side motion. Badminton is characterised by a single large peak in one direction, capturing the drawing of the hand back before quickly hitting the shuttlecock. Running is characterised by a longer repetition of side to side motions, with a sharper peak representing bringing the arm forward across the body in a running motion.

Performance on the UCR univariate datasets

Our current shapelet classifiers are in the directory aeon.classification.shapelet_based. We currently only have reference results for three of these.

[6]:

from aeon.benchmarking import get_estimator_results_as_array
from aeon.datasets.tsc_datasets import univariate

all_shapelet_classifiers = [
    "MrSQMClassifier",
    "ShapeletTransformClassifier",
    "RDSTClassifier",
    "SASTClassifier",
    "RSASTClassifier",
    "LearningShapeletClassifier",
]
est = ["MrSQMClassifier", "RDSTClassifier", "ShapeletTransformClassifier"]
names = [t.replace("Classifier", "") for t in est]
results, present_names = get_estimator_results_as_array(
    names, univariate, include_missing=False
)
results.shape
[6]:
(112, 3)
[7]:
from aeon.visualisation import plot_boxplot_median, plot_critical_difference

plot_critical_difference(results, names)
[7]:
(<Figure size 600x220 with 1 Axes>, <Axes: >)
[Figure: critical difference diagram for MrSQM, RDST and ShapeletTransform]
[8]:
plot_boxplot_median(results, names)
[8]:
(<Figure size 1000x600 with 1 Axes>, <Axes: >)
[Figure: box plot of results for MrSQM, RDST and ShapeletTransform]
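
A critical difference diagram orders estimators by their average rank over the datasets. The sketch below shows how such average ranks can be computed from a results array of the same layout (datasets in rows, estimators in columns). The accuracies are made up for illustration and are not reference results, and the statistical tests that group estimators in the diagram are not reproduced here.

```python
import numpy as np

# Made-up accuracies for illustration only: rows are datasets, columns are
# three hypothetical classifiers A, B and C. These are not real results.
toy_results = np.array(
    [
        [0.90, 0.85, 0.80],
        [0.70, 0.75, 0.60],
        [0.95, 0.90, 0.92],
        [0.60, 0.55, 0.50],
    ]
)

# Rank within each dataset: 1 = most accurate. The double argsort assumes
# no ties; tied values would need averaged ranks instead.
order = np.argsort(-toy_results, axis=1)
ranks = np.argsort(order, axis=1) + 1

average_ranks = ranks.mean(axis=0)
print(average_ranks)  # [1.25 2.   2.75]
```

A lower average rank means the estimator was more often among the most accurate on each dataset.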

References

[1] L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. In Proc. 15th ACM SIGKDD, 2009.

[2] J. Lines et al. A shapelet transform for time series classification. In Proc. 18th ACM SIGKDD, 2012.

