Question: Anytime Algorithm Variant for multivariate timeseries #274

sim-san · 2020-10-27T09:15:05Z

sim-san
Oct 27, 2020

If I understand this correctly, the mstump function uses the algorithm mSTOMP from [1].
Is it planned to implement the algorithm mSTAMP with the anytime property as well?

[1] https://www.cs.ucr.edu/~eamonn/Motif_Discovery_ICDM.pdf

sim-san · 2020-10-27T09:56:06Z

sim-san
Oct 27, 2020
Author

This is the algorithm I refer to:
STAMP Based mSTAMP Implemented as an Anytime Algorithm


seanlaw · 2020-10-27T11:01:03Z

seanlaw
Oct 27, 2020
Maintainer

> If I understand this correctly, the mstump function uses the algorithm mSTOMP from [1].

Yes, stumpy.mstump is based on mSTOMP.

> Is it planned to implement the algorithm mSTAMP with the anytime property as well?

No, we do not have an mSTAMP implementation at this time. Do you have a particular use case in mind? Would you be interested in contributing a PR?


sim-san · 2020-11-02T11:08:40Z

sim-san
Nov 2, 2020
Author

> Would you be interested in contributing a PR?

I will take a look at the code and see if I can manage that.

> Do you have a particular use case in mind?

I have a CSV file with CAN data (information and sensor readings from a car) recorded during a 1.5-hour drive. The data is sampled at 200 Hz and contains 10 time series. Each time series has about one million data points.
My goal is to identify patterns in the multivariate time series and group the patterns hierarchically into classes to segment the ride into different parts like acceleration, braking, or line maneuvers.


seanlaw · 2020-11-02T14:05:40Z

seanlaw
Nov 2, 2020
Maintainer

Perhaps you are already fully aware of this, so feel free to ignore this comment, but I want to reiterate (in case others are reading this and the paper wasn't clear on this point) that the multi-dimensional matrix profile is not the same as computing an individual matrix profile for each dimension of your time series and stacking them together.
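To make the difference concrete, here is a minimal brute-force sketch (not STUMPY's implementation). It assumes the definition used by mSTAMP/mSTOMP: for each pair of subsequences, the per-dimension z-normalized distances are sorted, and row k of the result holds the average of the k+1 smallest. The function name and the exclusion zone of ceil(m / 4) are assumptions made for illustration.

```python
import numpy as np

def znorm(x):
    """Z-normalize a 1-D window (assumes a non-constant window)."""
    return (x - x.mean()) / x.std()

def naive_mdim_matrix_profile(T, m):
    """Brute-force sketch. T has shape (d, n); m is the window length.

    Returns (P_multi, P_stacked), both of shape (d, n - m + 1):
      P_multi[k]   -- (k+1)-dimensional matrix profile
      P_stacked[k] -- ordinary matrix profile of dimension k alone
    """
    d, n = T.shape
    l = n - m + 1
    excl = int(np.ceil(m / 4))  # assumed exclusion zone around trivial matches
    P_multi = np.empty((d, l))
    P_stacked = np.empty((d, l))
    for i in range(l):
        # Per-dimension distances from subsequence i to every subsequence j
        D = np.empty((d, l))
        for k in range(d):
            q = znorm(T[k, i:i + m])
            for j in range(l):
                D[k, j] = np.linalg.norm(q - znorm(T[k, j:j + m]))
        D[:, max(0, i - excl):i + excl + 1] = np.inf
        # Stacking: each dimension picks its own nearest neighbor
        P_stacked[:, i] = D.min(axis=1)
        # Multi-dimensional: one shared neighbor j, best k+1 dimensions averaged
        D_sorted = np.sort(D, axis=0)
        D_avg = np.cumsum(D_sorted, axis=0) / np.arange(1, d + 1)[:, None]
        P_multi[:, i] = D_avg.min(axis=1)
    return P_multi, P_stacked

rng = np.random.default_rng(0)
T_demo = rng.standard_normal((3, 60))
P_multi, P_stacked = naive_mdim_matrix_profile(T_demo, 8)
print(P_multi.shape, P_stacked.shape)
```

The key point is that the multi-dimensional profile requires all chosen dimensions to agree on the same neighbor j, whereas stacking lets every dimension pick a different neighbor.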

> I have a CSV file with CAN data (information and sensor readings from a car) recorded during a 1.5-hour drive. The data is sampled at 200 Hz and contains 10 time series. Each time series has about one million data points.
> My goal is to identify patterns in the multivariate time series and group the patterns hierarchically into classes to segment the ride into different parts like acceleration, braking, or line maneuvers.

Thank you for the context! For something of this size (10 time series x 1 million data points), it should take about a day (or less) to compute the multi-dimensional matrix profile on a 2-core machine. So, depending on the hardware available to you, you could have computed the matrix profile in the time between our comment exchanges. This is not meant to be snarky; I often find myself overthinking things before trying them out, so this is a reminder that it may be beneficial to be pragmatic. Of course, I'm guessing that you will have more data to analyze after this?

I'm curious as to why you need/want the anytime algorithm. This implies that you don't necessarily want to compute the full multi-dimensional matrix profile and only need an approximate one? If you do want the full matrix profile, then the current STUMPY mstump implementation will be the fastest option (it is faster than mSTAMP).


sim-san · 2020-11-02T14:30:50Z

sim-san
Nov 2, 2020
Author

> I'm curious as to why you need/want the anytime algorithm. This implies that you don't necessarily want to compute the full multi-dimensional matrix profile and only need an approximate one? If you do want the full matrix profile, then the current STUMPY mstump implementation will be the fastest option (it is faster than mSTAMP).

I want to use the anytime algorithm because I expect that it will be much faster than the ordered algorithm. Why is the STUMPY mstump (ordered algorithm) faster than an anytime algorithm?


seanlaw · 2020-11-02T14:57:08Z

seanlaw
Nov 2, 2020
Maintainer

> I want to use the anytime algorithm because I expect that it will be much faster than the ordered algorithm. Why is the STUMPY mstump (ordered algorithm) faster than an anytime algorithm?

Good question! To make sure that we are on the same page, let's set aside STUMPY mstump (which corresponds to mSTOMP) and focus only on mSTAMP (the anytime algorithm) and mSTOMP (the ordered algorithm).

When computing the full multi-dimensional matrix profile, mSTAMP uses the MASS algorithm to compute each distance profile (i.e., comparing a single subsequence to the rest of the time series). MASS has a computational complexity of n * log n, where n is the length of your time series. So, computing the full distance matrix for a single time series costs n^2 * log n (this is basically STAMP). Taking this a step further, if instead of a single time series we had d time series (this is mSTAMP), then the total computational complexity would be d * n^2 * log n.
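For readers unfamiliar with MASS, the following is a rough NumPy sketch of the idea, assuming the commonly described formulation: an FFT-based sliding dot product (the n * log n step) followed by a closed-form conversion to z-normalized Euclidean distance. The function names are illustrative and are not STUMPY's API.

```python
import numpy as np

def sliding_dot_product(Q, T):
    """dot(Q, T[j:j+m]) for every j, via FFT convolution in O(n log n)."""
    n, m = len(T), len(Q)
    size = n + m - 1  # length of the full linear convolution
    conv = np.fft.irfft(np.fft.rfft(T, size) * np.fft.rfft(Q[::-1], size), size)
    return conv[m - 1:n]

def rolling_mean_std(T, m):
    """Mean and standard deviation of every length-m window, via cumulative sums."""
    c1 = np.cumsum(np.insert(T, 0, 0.0))
    c2 = np.cumsum(np.insert(T * T, 0, 0.0))
    mean = (c1[m:] - c1[:-m]) / m
    var = (c2[m:] - c2[:-m]) / m - mean ** 2
    return mean, np.sqrt(np.maximum(var, 0.0))

def mass(Q, T):
    """Z-normalized distance profile of query Q against all windows of T."""
    m = len(Q)
    QT = sliding_dot_product(Q, T)
    mu_T, sigma_T = rolling_mean_std(T, m)
    # Pearson correlation between Q and each window, then distance
    corr = (QT - m * Q.mean() * mu_T) / (m * Q.std() * sigma_T)
    return np.sqrt(np.clip(2 * m * (1 - corr), 0.0, None))

rng = np.random.default_rng(1)
T_demo = rng.standard_normal(200)
D_demo = mass(T_demo[50:66], T_demo)
print(D_demo.shape, int(D_demo.argmin()))
```

Calling a routine like this once per subsequence, for each of the d dimensions, is where the d * n^2 * log n total comes from.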

Conversely, the STOMP algorithm for a single time series reuses the sliding dot products from the previous subsequence, so each entry of a distance profile is updated in constant time, each distance profile costs n time, and the full distance matrix costs n^2 time. Therefore, for d time series, the mSTOMP algorithm can compute the multi-dimensional matrix profile in roughly d * n^2 time. Essentially, when you want to compute the full matrix profile, mSTOMP knocks off the log n factor.
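The saving comes from a recurrence on the sliding dot products. A small sketch of that recurrence for a single time series, checked here only against direct dot products, might look like the following (the generator name is hypothetical, and the conversion from dot products to distances is omitted):

```python
import numpy as np

def stomp_dot_products(T, m):
    """Yield (i, QT_i) where QT_i[j] = dot(T[i:i+m], T[j:j+m]).

    Row i is derived from row i-1 in O(n) using
        QT[i, j] = QT[i-1, j-1] - T[i-1] * T[j-1] + T[i+m-1] * T[j+m-1]
    which is why the rows must be processed in order.
    """
    n = len(T)
    l = n - m + 1
    # Only the first row is computed from scratch
    QT_first = np.array([np.dot(T[:m], T[j:j + m]) for j in range(l)])
    QT = QT_first.copy()
    yield 0, QT.copy()
    for i in range(1, l):
        # Update entries j = 1 .. l-1 from the previous row's entries j-1
        QT[1:] = QT[:-1] - T[i - 1] * T[:l - 1] + T[i + m - 1] * T[m:n]
        # Entry j = 0 comes from symmetry: dot(T[i:i+m], T[0:m])
        QT[0] = QT_first[i]
        yield i, QT.copy()

rng = np.random.default_rng(2)
T_demo = rng.standard_normal(50)
rows = [QT for _, QT in stomp_dot_products(T_demo, 5)]
print(len(rows), rows[0].shape)
```

For n around one million, log2(n) is roughly 20, so the dropped factor is not negligible, although constant factors and hardware also affect real timings.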

Now, the main benefit of an anytime mSTAMP algorithm is that you aren't forced to compute the full matrix profile, since the distance profile for each subsequence can be computed independently of the others. This means that you can compute an approximate matrix profile (and allow for false positives/false negatives). However, in my opinion, this is rarely what people want (and see my comment below about subsampling). On the other hand, the ordered mSTOMP algorithm is ordered for a reason. Thanks to some nice algebra (see the Matrix Profile II paper), the ordered algorithm allows redundant computations to be eliminated (which can't be done with mSTAMP), with the caveat that the computation is essentially all or nothing (i.e., you must compute the full matrix profile).
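As an illustration of the anytime property, here is a simplified single-dimension sketch. It uses brute-force distance profiles rather than MASS, visits subsequences in random order, and stops after a chosen fraction of them. The function name, the `fraction` parameter, and the ceil(m / 4) exclusion zone are assumptions for the example.

```python
import numpy as np

def anytime_matrix_profile(T, m, fraction, seed=0):
    """Approximate matrix profile after processing `fraction` of the rows.

    Each distance profile is independent, so rows can be visited in any
    order and the loop can stop early. The result is an upper bound on
    the exact profile and becomes exact when fraction == 1.0.
    """
    l = len(T) - m + 1
    excl = int(np.ceil(m / 4))  # assumed exclusion zone around trivial matches
    windows = np.array([T[j:j + m] for j in range(l)])
    Z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
    P = np.full(l, np.inf)
    order = np.random.default_rng(seed).permutation(l)
    n_rows = int(np.ceil(fraction * l))
    for i in order[:n_rows]:
        D = np.linalg.norm(Z - Z[i], axis=1)  # distance profile of subsequence i
        D[max(0, i - excl):i + excl + 1] = np.inf
        # By symmetry, D[j] is also the distance from j to i
        P = np.minimum(P, D)
    return P

rng = np.random.default_rng(3)
T_demo = rng.standard_normal(120)
P_part = anytime_matrix_profile(T_demo, 10, 0.25)
P_full = anytime_matrix_profile(T_demo, 10, 1.0)
print(float(np.max(P_part - P_full)))
```

Stopping early trades accuracy for time: the partial profile can only overestimate distances, which is where missed motifs (false negatives) come from.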

In our development of STUMPY, we had to decide whether it was more important for users to compute the full multi-dimensional matrix profile (in which case, we should provide the fastest algorithm, which is mSTOMP) or whether an approximate multi-dimensional matrix profile was preferred (in which case, we should implement mSTAMP). Ultimately, given our rough benchmark timing estimates, we opted to compute the full matrix profile with the mSTOMP algorithm. This was partially chosen because mSTOMP wasn't well documented, but we had enough experience with STOMP to understand what needed to be done and, hopefully, to help prevent others from making the same mistakes. In fact, we found and helped fix a small bug in the original author's open-sourced mSTOMP implementation. This is not a criticism of the original author's wonderful work but a point to stress that some of these algorithms are nuanced and require attention to detail in order to get right (we get things wrong all the time too). That said, we make sure that all of our implementations are well tested, as evidenced by our commitment to 100% unit test coverage (where we test against naive implementations of our algorithms in addition to static and random inputs).

Have you already tried mstump? You may also consider subsampling your data to, say, 1% (so ~10,000 data points for each dimension of your time series) and then you can get an approximate multi-dimensional matrix profile with mstump in roughly 20 minutes.
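A minimal sketch of that subsampling step, assuming the data is held as a NumPy array of shape (d, n) and that plain decimation (keeping every 100th sample, with no low-pass filtering) is acceptable for the patterns of interest. The helper name and the window length of 2000 samples are hypothetical.

```python
import numpy as np

def subsample(T, step):
    """Keep every `step`-th sample of each dimension; T has shape (d, n)."""
    return T[:, ::step]

# Sizes taken from the thread: 1.5 hours sampled at 200 Hz
rate_hz = 200
n_samples = int(1.5 * 3600 * rate_hz)
step = 100                                    # keep 1% of the points
n_small = len(range(0, n_samples, step))
print(n_samples, n_small, rate_hz / step)     # full length, reduced length, new rate in Hz

# A window of m samples at full rate covers about m / step samples afterwards
m_full = 2000                                 # hypothetical: 10 seconds at 200 Hz
m_small = max(3, round(m_full / step))

# Small stand-in array, so the example runs without the real CSV
rng = np.random.default_rng(4)
T_demo = rng.standard_normal((10, 100_000))
T_small = subsample(T_demo, step)
print(T_small.shape, m_small)
```

The reduced array and the shortened window could then be passed to stumpy.mstump. Note that any pattern shorter than the new sampling interval (0.5 seconds here) will no longer be visible.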

Let me know if that helps or if you have any further questions.


sim-san · 2020-11-05T16:34:16Z

sim-san
Nov 5, 2020
Author

Many thanks for the large amount of new and helpful information.

For my task of grouping the motifs hierarchically into classes, I think I will read the MPdist paper and follow the ideas from your comment #269 (comment).


seanlaw · 2020-11-23T00:07:11Z

seanlaw
Nov 23, 2020
Maintainer

@sim-san I'm closing this for now but feel free to re-open (or start a new issue) if you have more questions
