
how to mitigate the impact of abnormal points in historical data on training? #3

Open
lalafanchuan opened this issue Jul 14, 2020 · 2 comments

Comments

@lalafanchuan

Hi, Zeyan Li. I have a question about applying the model to my datasets:

Bagel and Donut assume that the historical data follow a normal pattern.
However, when the amount of historical data is not very large,
the impact of abnormal points cannot be ignored. So I want to ask:
how can we mitigate the impact of abnormal points in historical data on training?

I have tried introducing the labels of the data into the model training, but it has not improved much.

I would appreciate it if you could help me solve this problem. Thank you.

@lizeyan
Member

lizeyan commented Jul 14, 2020

Well, in our experiments there are abnormal points in the historical data.
We actually assume that abnormal points are much rarer than normal points, so that our model learns the normal pattern from a contaminated dataset.
If this assumption does not hold because there are too many abnormal points, Bagel and Donut will not work.
We have not studied such cases, since in practice normal points are always much more prevalent than abnormal ones.
Maybe we can give you more suggestions if you describe your data in more detail.

BTW, as for the influence of labels, you can refer to Fig. 7 in the Donut paper (Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications).

@lalafanchuan
Author

Hi Zeyan, thank you for your answer!
Our dataset is business KPI data, and it differs from the dataset described in the Bagel paper in the following ways:
(1) Our dataset has an interval of 1 hour between two observations.
(2) Considering that the data vary a lot from one month ago to today, we only use one month of data to train the model. We retrain the model every hour and use the trained model to detect anomalies in the following hour.
(3) Holidays and some specific events like 618 affect our business a lot; the data pattern during these days looks different from the other days.

Therefore, our training dataset is not very large, and sometimes there is a series of abnormal points in the historical data which affects model performance.
Actually, to solve the above problem, we have introduced a prediction model that replaces abnormal points with predicted values during the training process. This technique has mitigated the impact somewhat, but we do not think it is a perfect solution (the prediction model may introduce prediction error), so we are trying to solve it in another way.
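For reference, a much simpler stand-in for a learned prediction model is to linearly interpolate over the labeled anomalous spans before training. This avoids the prediction-error issue at the cost of flattening whatever happened inside the span. A minimal numpy sketch (the function name and arguments are illustrative, not from Bagel's code):

```python
import numpy as np

def fill_anomalies(values, labels):
    """Replace points labeled anomalous with linearly interpolated values.

    values: 1-D array of KPI observations
    labels: 1-D array, 1 (True) where the point is a labeled anomaly
    Returns a copy of `values` with anomalous points interpolated
    from the surrounding normal points.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    filled = values.copy()
    normal_idx = np.flatnonzero(~labels)   # x-coordinates of normal points
    anomal_idx = np.flatnonzero(labels)    # x-coordinates to fill in
    filled[anomal_idx] = np.interp(anomal_idx, normal_idx, values[normal_idx])
    return filled

# Example: the labeled spike at index 2 is replaced by the midpoint of its neighbors.
cleaned = fill_anomalies([1.0, 2.0, 100.0, 4.0], [0, 0, 1, 0])
# cleaned → [1.0, 2.0, 3.0, 4.0]
```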

BTW, in my last question I said that 'I have tried to introduce the labels of the data into the model training, but it has not improved much.'
This means I introduced the labels in the training dataset to make the M-ELBO work.
However, training with the labels or without them (all labels set to zero) does not make much difference on our datasets.
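For anyone else reading the thread: the M-ELBO from the Donut paper excludes labeled (anomalous or missing) points from the reconstruction term and scales the prior term by the fraction of normal points in the window. A rough numpy sketch of the per-sample objective (names and shapes are illustrative):

```python
import numpy as np

def m_elbo(log_px, log_pz, log_qz, normal_mask):
    """Modified ELBO (M-ELBO) from the Donut paper, as a numpy sketch.

    log_px:      per-point reconstruction log-likelihoods, shape (window,)
    log_pz:      scalar log-density of z under the prior p(z)
    log_qz:      scalar log-density of z under the posterior q(z|x)
    normal_mask: shape (window,), 1 for normal points, 0 for labeled
                 anomalies or missing points
    """
    beta = normal_mask.mean()              # fraction of normal points in the window
    recon = (log_px * normal_mask).sum()   # drop anomalous points' contribution
    return recon + beta * log_pz - log_qz
```

With an all-ones mask this reduces to the standard ELBO, which is consistent with the observation above: when all labels are zero (nothing marked anomalous), M-ELBO and ELBO coincide, so little difference is expected unless the labels actually flag many points.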
