Performance degradation on PCA
I was getting unexpectedly poor performance from PCA on the Exathlon data (using the VUS-PR metric):
| Method | TSB-AD - latest commit | Values reported in the paper | Difference |
|---|---|---|---|
| TranAD | 0.95 | 0.10 | 0.86 |
| OFA | 0.85 | 0.58 | 0.28 |
| CNN | 0.95 | 0.68 | 0.27 |
| LSTMAD | 0.96 | 0.82 | 0.14 |
| OmniAnomaly | 0.97 | 0.84 | 0.13 |
| USAD | 0.97 | 0.84 | 0.13 |
| RobustPCA | 0.81 | 0.77 | 0.04 |
| AnomalyTransformer | 0.14 | 0.10 | 0.04 |
| AutoEncoder | 0.91 | 0.91 | 0.00 |
| IForest | 0.32 | 0.35 | -0.04 |
| PCA | 0.53 | 0.95 | -0.42 |
The first column shows the values I obtained with the latest commit of TSB-AD, the second shows the values from the publication, and the third is the first minus the second.
The improvement in most methods can probably be ascribed to the fixes made since publication (e.g. use of correct hyperparameters). However, there might be a bug affecting PCA and possibly other methods as well:
Bug (in PCA?)
At least part of the issue might be normalization introduced in a79f315:

    # models/PCA.py
    X = Window(window = self.slidingWindow).convert(X)
    if self.normalize:
        if n_features == 1:
            X = zscore(X, axis=0, ddof=0)
        else:
            X = zscore(X, axis=1, ddof=1)  # <--- 2nd issue

    # validate inputs X and y (optional)
    X = check_array(X)
    self._set_n_classes(y)

    # PCA is recommended to use on the standardized data (zero mean and
    # unit variance).
    if self.standardization:  # <--- 1st issue
        X, self.scaler_ = standardizer(X, keep_scalar=True)

In the PCA example above:
1. Normalization is applied twice: once for the normalization flag (`self.normalize`) and once for the standardization flag (`self.standardization`).
2. `X = zscore(X, axis=1, ddof=1)` seems to apply normalization independently to each window. Each window contains multiple features and has the form `[feat0_t0, feat0_t1, ..., feat0_tw, feat1_t0, feat1_t1, ..., feat1_tw, ...]`.
It is not intuitive to me what point 2 is trying to achieve, but it does not seem correct for PCA. Applying the z-score independently to each time window destroys, for example, the information about the absolute magnitude of features across time windows, and overall seems only to remove information from the data.
Also, I think normalization might not be needed for IsolationForest at all? For the other methods, I can't tell for sure without looking into them more.
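To make the concern about the per-window z-score concrete, here is a minimal sketch using NumPy only. It is not code from TSB-AD: `rowwise_zscore` is a hypothetical helper that is meant to mirror `zscore(X, axis=1, ddof=1)`, and the two toy windows are made up for illustration.

```python
import numpy as np


def rowwise_zscore(X, ddof=1):
    """Z-score every row (window) independently, like zscore(X, axis=1, ddof=1)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=ddof, keepdims=True)
    return (X - mean) / std


# Two hypothetical flattened windows with the layout
# [feat0_t0, feat0_t1, feat1_t0, feat1_t1].
# The second window is the first one scaled by 100, e.g. a large spike.
normal_window = np.array([1.0, 2.0, 10.0, 20.0])
spike_window = 100.0 * normal_window
X = np.vstack([normal_window, spike_window])

Z = rowwise_zscore(X)

# After the row-wise z-score both windows are numerically identical,
# so the 100x difference in magnitude is no longer visible downstream.
windows_indistinguishable = bool(np.allclose(Z[0], Z[1]))
print(windows_indistinguishable)
```

In this toy case the spike window becomes indistinguishable from the normal one, and the two features are also centered and scaled by a shared per-window mean and standard deviation rather than per feature.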
Fix
Two parts to this:
1. In my opinion, PCA should not have both "normalization" and "standardization". I think removing the "normalization" code and renaming "standardization" to "normalization" would be appropriate, to be consistent with the other methods, even though I think "standardization" is actually the better term here.
2. I think `X = zscore(X, axis=1, ddof=1)` is a bug, at least for some methods (PCA, IForest). Naively, I believe the changes from a79f315 could be replaced by the StandardScaler in `standardizer`, applied along the columns, for all the methods involved.
I can prepare a PR for either 1. or 1. and 2., if that is welcome.
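For comparison, a rough sketch of what column-wise standardization (point 2) does to windowed data. This is an assumption-laden illustration, not the proposed patch: `fit_column_scaler` and `apply_column_scaler` are hypothetical helpers written in plain NumPy that approximate zero-mean, unit-variance scaling per column, and the toy windows are invented.

```python
import numpy as np


def fit_column_scaler(X):
    """Per-column mean and population std (ddof=0), computed across all windows."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: constant columns are left unscaled to avoid division by zero.
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def apply_column_scaler(X, mean, std):
    """Standardize each column with statistics obtained from fit_column_scaler."""
    return (X - mean) / std


# Hypothetical flattened windows [feat0_t0, feat0_t1, feat1_t0, feat1_t1]:
# three similar windows and one window that is 100x larger.
normal_window = np.array([1.0, 2.0, 10.0, 20.0])
X = np.vstack([
    normal_window,
    1.1 * normal_window,
    0.9 * normal_window,
    100.0 * normal_window,
])

mean, std = fit_column_scaler(X)
Z = apply_column_scaler(X, mean, std)

# Each column is centered across windows, and the large window still
# stands out in every column after scaling.
spike_stands_out = bool(np.all(Z[3] > Z[:3].max(axis=0)))
print(spike_stands_out)
```

Because the statistics are computed per column over all windows, differences in magnitude between windows survive the scaling, which is the property the per-window z-score loses.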
PS: thank you for an awesome project of such a large scope <3