
Deep dive into distance profiles

In this notebook, we will cover the theory behind distance profiles, how they are computed, and how they can be optimized. For practical experiments on the speedups implemented in aeon, refer to the notebook on the analysis of the speedups provided by the similarity search module.

What are distance profiles?

In the context of similarity search, where we have as input a time series $X = \{x_1, \dots, x_m\}$ and a query $Q = \{q_1, \dots, q_l\}$, a distance profile is defined as a vector containing the similarity of $Q$ to every subsequence of size $l$ in $X$, with the $i$-th subsequence denoted by $W_i = \{x_i, \dots, x_{i+(l-1)}\}$.

Given a distance or dissimilarity function $dist$, such as the Euclidean distance, the distance profile $P(X,Q)$ is expressed as:

$P(X,Q) = \{dist(W_1, Q), \dots, dist(W_{m-(l-1)}, Q)\}$

We can then find the "best match" between $Q$ and $X$ by locating the minimum value of the distance profile and extracting the subsequence $W_{\arg\min P(X,Q)}$ as the best match.
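As a sketch of this definition, the following brute-force implementation computes the distance profile with a plain loop and extracts the best match. The function name `naive_distance_profile` is illustrative, not part of aeon's API:

```python
import numpy as np

def naive_distance_profile(X, Q):
    """Brute-force euclidean distance profile of query Q against series X."""
    l = len(Q)
    n = len(X) - l + 1
    P = np.empty(n)
    for i in range(n):
        # W_i = X[i : i + l] is the i-th subsequence of length l
        P[i] = np.sqrt(np.sum((X[i : i + l] - Q) ** 2))
    return P

X = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0])
Q = np.array([1.0, 2.0, 3.0])
P = naive_distance_profile(X, Q)
best_match_index = np.argmin(P)  # position of the best match W_i in X
```

This costs $O(ml)$ operations; the optimizations discussed below aim to reduce the redundant work inside the loop.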

Trivial matches

One should be careful of what are called "trivial matches" in this situation. If $Q$ is extracted from $X$, it is extremely likely that it will match with itself, as $dist(Q,Q) = 0$. To avoid this, it is common to set the parts of the distance profile that are neighbors of $Q$ to $\infty$. This is the role of the q_index parameter in the similarity search predict methods. The exclusion_factor parameter is used to define the neighbors of $Q$ that will also be set to $\infty$.

For example, if $Q$ was extracted at index $i$ in $X$ (i.e. $Q = \{x_i, \dots, x_{i+(l-1)}\}$), then all points in the interval [i - l//exclusion_factor, i + l//exclusion_factor] will be set to $\infty$ in the distance profile to avoid a trivial match.

The same reasoning can also be applied to the best matches of $Q$. It is highly likely that the two best matches will be neighbors, since $W_i$ and $W_{i+1}$ share $l-1$ values. The apply_exclusion_to_result boolean parameter in predict allows you to apply the exclusion zone defined by [i - l//exclusion_factor, i + l//exclusion_factor] to the output of the algorithm.
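A minimal sketch of such an exclusion zone is shown below. The function name `apply_exclusion_zone` and the exact boundary handling are illustrative assumptions, not aeon's implementation:

```python
import numpy as np

def apply_exclusion_zone(P, q_index, l, exclusion_factor=2.0):
    """Set the neighbourhood of q_index in a distance profile P to np.inf."""
    P = P.copy()
    exclusion_size = int(l // exclusion_factor)
    # interval [q_index - l//exclusion_factor, q_index + l//exclusion_factor]
    start = max(0, q_index - exclusion_size)
    end = min(len(P), q_index + exclusion_size + 1)
    P[start:end] = np.inf
    return P

P = np.zeros(10)
masked = apply_exclusion_zone(P, q_index=4, l=4)
```

Here, with $l = 4$ and an exclusion factor of 2, the indices 2 to 6 of the profile are excluded, so a query taken at index 4 cannot trivially match itself or its immediate neighbors.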

Optimizing the distance profile computation

The main idea behind optimizing the distance profile computation, in the case where we want to obtain the exact results (i.e. make no approximation), is to avoid recomputations as much as possible. For example, when computing the mean and standard deviation of each subsequence $W_1, \dots, W_{m-l+1}$, instead of computing these statistics independently for each subsequence $W_i$, we can exploit the fact that these subsequences are extracted from the same time series.

Consider the case of the mean of $W_i$, expressed as $\mu_i = \frac{1}{l}\sum_{j=0}^{l-1} x_{i+j}$. Instead of completely recomputing the mean of $W_{i+1}$, we can keep a rolling sum $S$ and update it as we go from $\mu_1$ to $\mu_{m-l+1}$.

Let $S_1 = \sum_{j=0}^{l-1} x_{1+j}$; we compute the mean of $W_1$ as $\mu_1 = \frac{1}{l} S_1$. Then we can compute $S_2$ as $S_2 = S_1 - x_1 + x_{2+l-1}$ and compute $\mu_2 = \frac{1}{l} S_2$. This can be generalized as $S_i = S_{i-1} - x_{i-1} + x_{i+l-1}$.

We can apply the same reasoning for computing the standard deviation by keeping a rolling squared sum. For a code example, you can check the sliding_mean_std_one_series function of aeon.
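The rolling-sum idea above can be sketched as follows. This is a simplified illustration of the technique, not aeon's sliding_mean_std_one_series function itself:

```python
import numpy as np

def sliding_mean_std(X, l):
    """Mean and std of every length-l subsequence using rolling (squared) sums."""
    n = len(X) - l + 1
    means, stds = np.empty(n), np.empty(n)
    S = np.sum(X[:l])           # rolling sum S_1
    S_sq = np.sum(X[:l] ** 2)   # rolling squared sum
    for i in range(n):
        if i > 0:
            # S_i = S_{i-1} - x_{i-1} + x_{i+l-1}, same update for the squared sum
            S += X[i + l - 1] - X[i - 1]
            S_sq += X[i + l - 1] ** 2 - X[i - 1] ** 2
        means[i] = S / l
        # Var[W_i] = E[x^2] - E[x]^2, clipped at 0 for numerical safety
        stds[i] = np.sqrt(max(S_sq / l - means[i] ** 2, 0.0))
    return means, stds

X = np.sin(np.arange(50) * 0.3)
means, stds = sliding_mean_std(X, 7)
```

Each subsequence costs $O(1)$ instead of $O(l)$, bringing the total cost of the statistics from $O(ml)$ down to $O(m)$.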

Optimization for the (squared) euclidean distance

In this section, we consider the case of the squared euclidean distance (i.e. no square root), but the following also holds for the euclidean distance by using $\sqrt{d(W_i, Q)}$.

Applying the same reasoning as for the computation of the mean, we can minimize recomputation in the case of the euclidean distance profile. If we consider the euclidean distance between a subsequence $W_i = \{x_i, \dots, x_{i+(l-1)}\}$ extracted from a time series $X = \{x_1, \dots, x_m\}$, and a query $Q = \{q_1, \dots, q_l\}$, we have:

$d(W_i, Q) = \sum_{j=1}^{l} (q_j - x_{i+j-1})^2$

$d(W_i, Q) = \sum_{j=1}^{l} q_j^2 + \sum_{j=1}^{l} x_{i+j-1}^2 - 2 \times \sum_{j=1}^{l} q_j \times x_{i+j-1}$

Seeing the euclidean distance as the second equation allows us to:

- Compute the sum $\sum_{j=1}^{l} q_j^2$ only once for all $W_i$.
- Keep a rolling sum to update $\sum_{j=1}^{l} x_{i+j-1}^2$ as we go from $W_1$ to $W_{m-l+1}$.
- Use a cross-correlation operation to efficiently compute $\sum_{j=1}^{l} q_j \times x_{i+j-1}$ as $X \star Q$. For large inputs, we can also compute this in the frequency domain with a convolution (see the scipy convolve function), in which case we need to flip the query $Q$ before computing the convolution.
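The three points above can be combined into a vectorized sketch of the squared euclidean distance profile. The cumulative-sum trick for the rolling sum of $x^2$ and the use of numpy's correlate are illustrative choices, not necessarily what aeon does internally:

```python
import numpy as np

def squared_euclidean_profile(X, Q):
    """Squared euclidean distance profile using the expanded formulation."""
    l = len(Q)
    q_sq = np.sum(Q ** 2)  # computed once for all W_i
    # rolling sums of x^2 over each window, obtained from a cumulative sum
    csum = np.concatenate(([0.0], np.cumsum(X ** 2)))
    x_sq = csum[l:] - csum[:-l]
    # cross-correlation X * Q gives the dot products sum_j q_j * x_{i+j-1}
    qx = np.correlate(X, Q, mode="valid")
    return q_sq + x_sq - 2.0 * qx

X = np.sin(np.arange(40) * 0.3)
Q = X[12:18].copy()
P = squared_euclidean_profile(X, Q)
```

Since $Q$ was extracted at index 12 of $X$, the profile is (numerically) zero there, which is exactly the trivial match discussed earlier.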

What about the normalized (squared) euclidean distance?

The trick we used for the non-normalized euclidean distance cannot be applied to the normalized distance, as we cannot keep the rolling sum between subsequences with different means and standard deviations. However, we can compute the normalized euclidean distance through the Pearson correlation coefficient, as shown in the paper Matrix Profile I: All Pairs Similarity Joins for Time Series.

Consider the mean $\mu_i$ and standard deviation $\sigma_i$ of a subsequence $W_i$ (computed with rolling sums) and the mean $\mu_q$ and standard deviation $\sigma_q$ of the query. Given $QW_i$, the dot product obtained from a cross-correlation or a convolution operation between $X$ and $Q$, we can compute the normalized squared euclidean distance as:

$d(W_i, Q) = 2l \left(1 - \frac{QW_i - l \mu_q \mu_i}{l \sigma_q \sigma_i}\right)$
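A sketch of this formulation is given below. For brevity it computes the sliding means and stds with numpy's sliding_window_view rather than rolling sums; the function name is hypothetical and it assumes no subsequence is constant (i.e. $\sigma_i > 0$):

```python
import numpy as np

def normalized_squared_distance_profile(X, Q):
    """Z-normalized squared euclidean distance profile via the Pearson correlation."""
    l = len(Q)
    mu_q, sigma_q = Q.mean(), Q.std()
    # sliding means and stds of all subsequences (rolling sums could be used instead)
    windows = np.lib.stride_tricks.sliding_window_view(X, l)
    mu = windows.mean(axis=1)
    sigma = windows.std(axis=1)
    # dot products QW_i via cross-correlation
    QW = np.correlate(X, Q, mode="valid")
    # d(W_i, Q) = 2l (1 - (QW_i - l mu_q mu_i) / (l sigma_q sigma_i))
    return 2 * l * (1.0 - (QW - l * mu_q * mu) / (l * sigma_q * sigma))

X = np.sin(np.arange(30) * 0.5)
Q = np.array([0.2, 1.0, 0.3, -0.5])
P_norm = normalized_squared_distance_profile(X, Q)
```

The expensive per-subsequence z-normalization disappears: only the single dot-product pass (cross-correlation) and the $O(m)$ statistics remain.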


