A dataset for STUMPY #662

NimaSarajpoor · 2022-08-26T16:39:26Z

NimaSarajpoor
Aug 26, 2022
Maintainer

I have been working on a Kaggle competition lately, American Express - Default Prediction. I didn't get a chance to improve my work and submit my predictions because I started just a few days before the deadline, but I think it might be a good dataset for STUMPY and I want to share it with the community. If you think it is not appropriate or not worth it, please feel free to close this.

The data is tabular; however, it is 3D! Each row corresponds to one person (one observation), and each column corresponds to one feature. However, each feature is itself a time series (the depth of the table). While the depth is the same across all features for a given person, it can differ from one person to another.
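To make the layout concrete, here is a minimal sketch of the structure described above. The person IDs, feature names, and values are invented for illustration and are not taken from the competition data; the nested-dict representation is only one possible way to hold it.

```python
# Hypothetical toy data: person -> feature -> time series.
# All names and values are invented for illustration only.
data = {
    "person_a": {"feature_1": [0.2, 0.4, 0.1], "feature_2": [1.0, 0.9, 0.7]},
    "person_b": {"feature_1": [0.5], "feature_2": [0.3]},
}


def depth(features):
    """Return the series length shared by all features of one person."""
    lengths = {len(series) for series in features.values()}
    if len(lengths) != 1:
        raise ValueError("all features of one person must share a length")
    return lengths.pop()


# The depth is constant within a person but may differ between people.
depths = {person: depth(features) for person, features in data.items()}
print(depths)
```

Because the depth varies per person, the data cannot be stored as one dense 3D array without padding or truncation.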

I used a combination of shapelet discovery and multi-dimensional matching to predict the label. I did not train a model on the distances between each pattern and the shapelets. Instead, I just used KNN with a voting policy. Long story short, the model performs well on observations with true label 1. However, on observations with true label 0, the performance was only about 50%.
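One way to report the per-label numbers mentioned above is to compute accuracy separately for each true label. This is a small sketch with invented labels and predictions; it is not the evaluation code used for the results in this post.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of correct predictions, computed separately per true label."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Invented example values, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]
scores = per_class_accuracy(y_true, y_pred)
print(scores)
```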

I do not have enough time to continue this work. However, it might be a good problem for those who are interested in using STUMPY and exploring its power in different domains.

seanlaw · 2022-08-26T17:37:43Z

seanlaw
Aug 26, 2022
Maintainer

@NimaSarajpoor Thanks for sharing, this is interesting. We may be able to add this as a "Bonus Section" to our Shapelet Discovery tutorial? The key question is whether this performance is significant. I don't know. Additionally, the dataset may be hard for people to follow (it took me a minute to understand that it is essentially many, many sets of time series for many people).

2 replies

NimaSarajpoor Aug 26, 2022
Maintainer Author

> The key question is whether this performance is significant. I don't know

To be honest, I didn't check it for all cases. The minimum time series length is 1, and the maximum is 13. For the observations with true label 1 and time series of length 9-13, I got 80-90% accuracy using a simple KNN with n_neighbors=11 and majority voting. (The neighbors of a query are discovered with stumpy.match using multi-dimensional input.)
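The voting step can be sketched roughly as follows. This is not the code behind the numbers above: the distance here is a plain-Python stand-in (the mean, over features, of the z-normalized Euclidean distance between equal-length series) used in place of the stumpy.match call, and `znorm`, `distance`, `knn_predict`, and the toy data are all hypothetical names and values.

```python
import math
from collections import Counter


def znorm(series):
    """Z-normalize a series; a constant series maps to all zeros."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def distance(a, b):
    """Stand-in distance: mean over features of z-normalized Euclidean distance."""
    per_feature = [math.dist(znorm(x), znorm(y)) for x, y in zip(a, b)]
    return sum(per_feature) / len(per_feature)


def knn_predict(query, train, labels, n_neighbors=11):
    """Predict a label by majority vote among the nearest training observations."""
    ranked = sorted(range(len(train)), key=lambda i: distance(query, train[i]))
    votes = Counter(labels[i] for i in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Invented toy data: each observation has 2 features of length 4.
train = [
    [[1, 2, 3, 4], [0, 1, 2, 3]],
    [[2, 4, 6, 8], [1, 2, 3, 5]],
    [[0, 1, 1, 2], [5, 6, 7, 8]],
    [[4, 3, 2, 1], [3, 2, 1, 0]],
    [[8, 6, 4, 2], [5, 3, 2, 1]],
    [[2, 1, 1, 0], [8, 7, 6, 5]],
]
labels = [1, 1, 1, 0, 0, 0]

query = [[10, 20, 30, 40], [1, 2, 3, 4]]
prediction = knn_predict(query, train, labels, n_neighbors=3)
print(prediction)
```

The toy set uses n_neighbors=3 only because it has six observations. An odd number of neighbors avoids ties with two labels.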

So, we cannot use it for observations whose features have time series of length 1. Furthermore, the data is not clean, and I only applied STUMPY to the features that remained after preprocessing. Some features were binary, and I excluded them. (By binary, I mean the values in the time series were just 0 and 1.)
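A rough sketch of that kind of filtering is shown below. It is not the actual preprocessing used here: the function names, the `min_length` threshold, and the toy data are assumptions, and a feature is treated as binary only if every value across all observations is 0 or 1.

```python
def binary_features(data):
    """Return names of features whose values are only 0 or 1 in every observation."""
    names = set()
    for features in data.values():
        names.update(features)
    return {
        name
        for name in names
        if all(set(obs.get(name, [])) <= {0, 1} for obs in data.values())
    }


def clean(data, min_length=2):
    """Drop binary features and observations whose series are shorter than min_length."""
    drop = binary_features(data)
    cleaned = {}
    for person, features in data.items():
        kept = {name: s for name, s in features.items() if name not in drop}
        if kept and all(len(s) >= min_length for s in kept.values()):
            cleaned[person] = kept
    return cleaned


# Invented toy data, for illustration only.
data = {
    "person_a": {"f1": [0.2, 0.4, 0.1], "flag": [0, 1, 1]},
    "person_b": {"f1": [0.5], "flag": [1]},
    "person_c": {"f1": [0.3, 0.9], "flag": [0, 0]},
}
cleaned = clean(data)
print(cleaned)
```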

> Additionally, the dataset may be hard for people to follow (it took me a minute to understand that it is essentially many, many sets of time series for many people)

Right... it took me a lot of time to visualize it in my head.

For the bonus section, the problem needs to be simplified, as the raw dataset itself is complicated. I just opened the discussion to see if someone is interested in exploring this problem further and sharing their results later :)

NimaSarajpoor Aug 26, 2022
Maintainer Author

I MIGHT get a chance to revisit it in the future. If I find something, I will share it for sure. I leave it to your discretion whether to close this discussion for now or leave it open :)
