Deep dive into distance profiles
In this notebook, we will talk more about the theory behind distance profiles, how they are computed, and how they can be optimized. For practical experiments on the speedups implemented in aeon, refer to the Analysis of the speedups provided by the similarity search module notebook.
What are distance profiles?
In the context of similarity search, we have as input a time series $X = (x_1, \ldots, x_m)$ and a query $Q = (q_1, \ldots, q_l)$, with $l \leq m$.

Given a distance or dissimilarity function $dist$, such as the Euclidean distance, the distance profile $P(X, Q)$ is the vector of distances between $Q$ and every subsequence of length $l$ in $X$:

$$P(X, Q) = [dist(W_1, Q), \ldots, dist(W_{m-l+1}, Q)]$$

where $W_i = (x_i, \ldots, x_{i+l-1})$ is the subsequence of $X$ starting at index $i$.

We can then find the "best match" between $Q$ and $X$ by taking the index of the minimum value of the distance profile, $\arg\min_i P(X, Q)_i$.
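As an illustration, here is a minimal brute-force sketch of this definition in NumPy. This is not aeon's implementation, and the function name is hypothetical; it simply computes the distance of the query to every subsequence and takes the minimum.

```python
import numpy as np

def naive_distance_profile(X, Q):
    """Naive Euclidean distance profile: distance between Q and every
    subsequence of X of length len(Q), in O(l * (m - l + 1))."""
    l = len(Q)
    n_candidates = len(X) - l + 1
    profile = np.empty(n_candidates)
    for i in range(n_candidates):
        profile[i] = np.sqrt(np.sum((X[i:i + l] - Q) ** 2))
    return profile

rng = np.random.default_rng(0)
X = rng.normal(size=100)
Q = X[20:30]  # a query extracted from X itself
profile = naive_distance_profile(X, Q)
best = np.argmin(profile)
print(best)  # 20: the query matches itself exactly
```

Note that because the query was extracted from $X$ at index 20, the best match is the query itself, which motivates the notion of trivial matches discussed below.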
Trivial matches
One should be careful of what are called "trivial matches" in this situation. If $Q$ was extracted from $X$ at index $i$ (this is the role of the q_index parameter in the similarity search predict methods), the best match will trivially be $W_i$ itself, with a distance of zero. To avoid this, an exclusion zone is defined around q_index, inside which matches are forbidden. The exclusion_factor parameter is used to define the neighbors of q_index that belong to this zone: for a subsequence length $l$ and a query extracted at index $i$, the exclusion zone is the interval [i - l//exclusion_factor, i + l//exclusion_factor]. The values of the distance profile inside this interval will be set to $\infty$, so that they can never be selected as best matches.
The same reasoning can also be applied to the best matches returned by the algorithm: the apply_exclusion_to_result boolean parameter in predict allows you to apply an exclusion zone of the same size, [i - l//exclusion_factor, i + l//exclusion_factor], around each match index $i$ in the output of the algorithm, so that the returned best matches do not overlap each other.
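The exclusion mechanism can be sketched as follows. This is a simplified illustration, not aeon's implementation; the function name and default value are assumptions for the example.

```python
import numpy as np

def apply_exclusion_zone(profile, q_index, l, exclusion_factor=2.0):
    """Set the distance profile to +inf inside
    [q_index - l//exclusion_factor, q_index + l//exclusion_factor],
    so the query cannot match itself or its immediate neighbors."""
    radius = int(l // exclusion_factor)
    start = max(0, q_index - radius)
    end = min(len(profile), q_index + radius + 1)
    out = profile.copy()
    out[start:end] = np.inf
    return out

# Toy profile: the query comes from index 2, so profile[2] == 0.
profile = np.array([3.0, 1.0, 0.0, 1.0, 3.0, 2.5, 0.5])
masked = apply_exclusion_zone(profile, q_index=2, l=4, exclusion_factor=2.0)
print(np.argmin(masked))  # 6: the best non-trivial match
```

Without the exclusion zone, the trivial match at index 2 would always win; with it, the closest subsequence outside the zone is returned instead.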
Optimizing the distance profile computation
The main idea behind optimizing the distance profile computation, in the case where we want to obtain exact results (i.e. make no approximation), is to avoid recomputation as much as possible. For example, when computing the mean and standard deviation of each subsequence $W_i$ of $X$ (which are needed for the normalized distance), note that $W_i$ and $W_{i+1}$ share all their values except one point at each end.

Consider the case of the mean of $W_i$, which is computed as $\frac{1}{l}\sum_{j=0}^{l-1} x_{i+j}$.

Let $S_i = \sum_{j=0}^{l-1} x_{i+j}$ be the sum of the subsequence $W_i$. Instead of recomputing this sum from scratch for each $i$, we can use the rolling update $S_{i+1} = S_i - x_i + x_{i+l}$, which turns an $O(m \times l)$ computation into an $O(m)$ one.
We can apply the same reasoning for computing the standard deviation by keeping a rolling squared sum. For a code example, you can check the sliding_mean_std_one_series function of aeon.
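The rolling sums can be sketched with cumulative sums in NumPy. This is a minimal sketch, not the aeon function mentioned above, and it uses the identity $\mathrm{Var}[x] = E[x^2] - E[x]^2$ for the standard deviation.

```python
import numpy as np

def sliding_mean_std(X, l):
    """Means and standard deviations of all length-l subsequences of X,
    using cumulative sums: O(m) instead of O(m * l)."""
    X = np.asarray(X, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(X)))
    csum_sq = np.concatenate(([0.0], np.cumsum(X ** 2)))
    window_sum = csum[l:] - csum[:-l]          # rolling sums S_i
    window_sum_sq = csum_sq[l:] - csum_sq[:-l]  # rolling squared sums
    means = window_sum / l
    # Var = E[x^2] - E[x]^2; clip tiny negatives due to floating point
    variances = np.clip(window_sum_sq / l - means ** 2, 0.0, None)
    return means, np.sqrt(variances)

X = np.arange(10.0)
means, stds = sliding_mean_std(X, l=4)
print(means)  # [1.5 2.5 3.5 4.5 5.5 6.5 7.5]
```

In practice, the $E[x^2] - E[x]^2$ formulation can be numerically unstable for series with large values, which is why production implementations may use more careful update schemes.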
Optimization for the (squared) Euclidean distance
In this section, we consider the case of the squared Euclidean distance (i.e. no square root), but the following also holds for the Euclidean distance, which is obtained by taking the square root of the result.
Applying the same reasoning as for the computation of the mean, we can minimize recomputation in the case of the Euclidean distance profile. If we consider the squared Euclidean distance between a subsequence $W_i$ and the query $Q$, it can be expanded as:

$$dist(W_i, Q) = \sum_{j=0}^{l-1} (x_{i+j} - q_j)^2 = \sum_{j=0}^{l-1} x_{i+j}^2 + \sum_{j=0}^{l-1} q_j^2 - 2\sum_{j=0}^{l-1} x_{i+j} q_j$$

Seeing the Euclidean distance as the second equation allows us to:

- Compute the sum $\sum_j q_j^2$ only once for the whole profile.
- Compute the sums $\sum_j x_{i+j}^2$ with a rolling sum, as we did for the mean.
- Only the cross-product term $\sum_j x_{i+j} q_j$ remains to be computed for each $i$; it can be obtained for all $i$ at once as a cross-correlation between $X$ and $Q$.
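The three terms of this decomposition can be sketched as follows. This is an illustrative sketch, not aeon's implementation; the cross-products are obtained here with `np.correlate`.

```python
import numpy as np

def squared_euclidean_profile(X, Q):
    """Squared Euclidean distance profile via the expansion
    d^2(W_i, Q) = sum(w^2) + sum(q^2) - 2 * <W_i, Q>:
    one constant term, one rolling sum, one cross-correlation."""
    X = np.asarray(X, dtype=float)
    Q = np.asarray(Q, dtype=float)
    l = len(Q)
    q_sq = np.sum(Q ** 2)                      # computed once
    csum_sq = np.concatenate(([0.0], np.cumsum(X ** 2)))
    w_sq = csum_sq[l:] - csum_sq[:-l]           # rolling squared sums
    cross = np.correlate(X, Q, mode="valid")    # <W_i, Q> for all i
    return w_sq + q_sq - 2.0 * cross

rng = np.random.default_rng(0)
X = rng.normal(size=200)
Q = rng.normal(size=20)
fast = squared_euclidean_profile(X, Q)
naive = np.array([np.sum((X[i:i + 20] - Q) ** 2) for i in range(181)])
print(np.allclose(fast, naive))  # True
```

For long queries, the cross-correlation itself can be accelerated with FFT-based convolution, which is one of the ideas behind MASS-style algorithms.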
What about the normalized (squared) Euclidean distance?
The trick we use for the non-normalized Euclidean distance cannot be applied to the normalized distance, as we cannot keep a rolling sum between subsequences with different means and standard deviations. However, we can compute the normalized Euclidean distance through the Pearson correlation coefficient, as shown in the paper Matrix Profile I: All Pairs Similarity Joins for Time Series.
Consider the mean $\mu_Q$ and standard deviation $\sigma_Q$ of the query $Q$, and the means $\mu_i$ and standard deviations $\sigma_i$ of the subsequences $W_i$. Noting $QW_i = \sum_{j=0}^{l-1} x_{i+j} q_j$ the cross-product term from the previous section, the normalized squared Euclidean distance can be written as:

$$dist(W_i, Q) = 2l\left(1 - \frac{QW_i - l\mu_i\mu_Q}{l\sigma_i\sigma_Q}\right)$$

where the fraction is the Pearson correlation coefficient between $W_i$ and $Q$. All the terms it involves (means, standard deviations, and cross-products) can be computed with the rolling techniques presented above.
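Putting the pieces together, the normalized profile can be sketched by combining rolling means and standard deviations with the cross-correlation term. This is an illustrative sketch under the Pearson-correlation formulation, not aeon's implementation, and it assumes no subsequence is constant (so all $\sigma_i > 0$).

```python
import numpy as np

def normalized_squared_profile(X, Q):
    """Z-normalized squared Euclidean distance profile via the Pearson
    correlation: d^2(W_i, Q) = 2 * l * (1 - r_i)."""
    X = np.asarray(X, dtype=float)
    Q = np.asarray(Q, dtype=float)
    l = len(Q)
    mu_q, sigma_q = Q.mean(), Q.std()
    # Rolling means and stds of all subsequences, via cumulative sums
    csum = np.concatenate(([0.0], np.cumsum(X)))
    csum_sq = np.concatenate(([0.0], np.cumsum(X ** 2)))
    mu = (csum[l:] - csum[:-l]) / l
    var = np.clip((csum_sq[l:] - csum_sq[:-l]) / l - mu ** 2, 0.0, None)
    sigma = np.sqrt(var)
    # Cross-product term QW_i for all i, then the Pearson correlation r_i
    cross = np.correlate(X, Q, mode="valid")
    r = (cross - l * mu * mu_q) / (l * sigma * sigma_q)
    return 2.0 * l * (1.0 - r)

rng = np.random.default_rng(1)
X = rng.normal(size=150)
Q = rng.normal(size=15)
fast = normalized_squared_profile(X, Q)

def znorm(a):
    return (a - a.mean()) / a.std()

naive = np.array([np.sum((znorm(X[i:i + 15]) - znorm(Q)) ** 2)
                  for i in range(136)])
print(np.allclose(fast, naive))  # True
```

The check against the brute-force z-normalized distance confirms the identity: normalizing each subsequence and computing the squared Euclidean distance gives the same result as the correlation-based formula.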
Generated using nbsphinx. The Jupyter notebook can be found here.