Performance degradation caused by normalization

## Performance degradation on PCA
I was getting unexpectedly poor performance from PCA on the Exathlon data (using the VUS-PR metric):

|                    | TSB-AD - latest commit | Values reported in the [paper](https://openreview.net/pdf?id=R6kJtWsTGy) | Difference |
| ------------------ | ---------- | ------------- | ---------- |
| TranAD             | 0.95       | 0.10          | 0.86       |
| OFA                | 0.85       | 0.58          | 0.28       |
| CNN                | 0.95       | 0.68          | 0.27       |
| LSTMAD             | 0.96       | 0.82          | 0.14       |
| OmniAnomaly        | 0.97       | 0.84          | 0.13       |
| USAD               | 0.97       | 0.84          | 0.13       |
| RobustPCA          | 0.81       | 0.77          | 0.04       |
| AnomalyTransformer | 0.14       | 0.10          | 0.04       |
| AutoEncoder        | 0.91       | 0.91          | 0.00       |
| IForest            | 0.32       | 0.35          | -0.04      |
| PCA                | **0.53**       | **0.95**          | **-0.42**      |

The first column shows the values I obtained with the latest commit of TSB-AD, the second shows the values reported in the publication, and the third is the difference between the two.

The improvement in most methods can probably be attributed to fixes made since publication (e.g. use of correct hyperparameters). However, there might be a bug affecting PCA and possibly other methods as well.

## Bug (in PCA?)
At least part of the issue might be normalization introduced in a79f3154b6dad1eca5aa83c2c6f1e3956ee711ac:

```python
# models/PCA.py
X = Window(window = self.slidingWindow).convert(X)
if self.normalize:
    if n_features == 1:
        X = zscore(X, axis=0, ddof=0)
    else:
        X = zscore(X, axis=1, ddof=1) #<--- 2nd issue

# validate inputs X and y (optional)
X = check_array(X)
self._set_n_classes(y)

# PCA is recommended to use on the standardized data (zero mean and
# unit variance).
if self.standardization: #<--- 1st issue
    X, self.scaler_ = standardizer(X, keep_scalar=True)
```

In the example of `PCA` above:
1. Normalization is applied twice, once for the `normalize` flag and once for the `standardization` flag.
2. `X = zscore(X, axis=1, ddof=1)` seems to apply normalization independently to each window. Each window contains multiple features and has the form `[feat0_t0, feat0_t1, ..., feat0_tw, feat1_t0, feat1_t1, ..., feat1_tw, ...]`.
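To make the window layout in point 2 concrete, here is a minimal NumPy sketch. `flatten_windows` is a hypothetical stand-in that I wrote to reproduce the layout described above; it is not TSB-AD's `Window` class, and the toy data is made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def flatten_windows(X, window):
    """Hypothetical stand-in for the windowing step (not TSB-AD's `Window` class).

    X has shape (n_timesteps, n_features); each output row is one window,
    laid out feature by feature: [feat0_t0, feat0_t1, ..., feat1_t0, ...].
    """
    # sliding_window_view appends the window axis last:
    # shape (n_windows, n_features, window)
    views = sliding_window_view(X, window_shape=window, axis=0)
    return views.reshape(views.shape[0], -1)

# Toy series: feature 0 lives on a scale of ~1, feature 1 on a scale of ~100
X = np.column_stack([np.arange(5.0), 100.0 * np.arange(5.0)])
windows = flatten_windows(X, window=3)

print(windows[0])  # [  0.   1.   2.   0. 100. 200.]

# A row-wise (axis=1) statistic averages over BOTH features at once
row_means = windows.mean(axis=1)
```

With this layout, any statistic computed along `axis=1` pools values from different features, so the per-window mean and standard deviation are dominated by whichever feature has the largest scale.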

It is not clear to me what point 2 is trying to achieve, but it does not seem correct for `PCA`. Applying the z-score independently to each time window destroys, for example, the information about the absolute magnitude of features across time windows, and overall it seems to only remove information from the data.

Also, I think normalization might not be needed for `IsolationForest` at all? For the other methods, I can't tell for sure without looking into them more.
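A small sketch of the magnitude problem, using made-up numbers. `rowwise_zscore` is a hypothetical helper that I believe mirrors `scipy.stats.zscore(X, axis=1, ddof=1)`; it is written in plain NumPy so the example is self-contained.

```python
import numpy as np

def rowwise_zscore(X, ddof=1):
    """Z-score each row (window) independently.

    Hypothetical helper; intended to mirror scipy.stats.zscore(X, axis=1, ddof=ddof).
    """
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=ddof, keepdims=True)
    return (X - mean) / std

# Two flattened windows (2 features x 3 time steps each).
# The second has the same shape as the first but 100x the magnitude,
# which is the kind of change an anomaly detector should be able to see.
normal_window = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
spiking_window = 100.0 * normal_window

X = np.vstack([normal_window, spiking_window])
Z = rowwise_zscore(X)

# After per-window z-scoring the two windows become indistinguishable
windows_identical = bool(np.allclose(Z[0], Z[1]))
print(windows_identical)  # True
```

Because the z-score is invariant to shifting and positive rescaling of its input, any window that differs from another only by an offset or a scale factor maps to the same row, so PCA never sees the difference.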

## Fix
Two parts to this:
1. In my opinion, PCA should not have both "normalization" and "standardization". I think removing the "normalization" code and renaming "standardization" to "normalization" would be appropriate, for consistency with the other methods, even though I think "standardization" is actually the better term here.
2. I think `X = zscore(X, axis=1, ddof=1)` is a bug, at least for some methods (PCA, IForest). Naively, I believe the changes from a79f3154b6dad1eca5aa83c2c6f1e3956ee711ac could be replaced with the `StandardScaler` used in `standardizer`, applied along the columns, for all the methods involved.

I can prepare a PR for either 1. alone or for both 1. and 2., if that is welcome.
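A rough sketch of what column-wise standardization (point 2) would look like. This is plain NumPy and not the project's code: `fit_column_scaler` and `apply_column_scaler` are hypothetical names, and the intent is only to approximate what scikit-learn's `StandardScaler` does (per-column mean and population standard deviation, fitted once and then reused).

```python
import numpy as np

def fit_column_scaler(X_train):
    """Hypothetical helper: learn per-column mean and std from the training windows."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # ddof=0, i.e. population std
    std[std == 0] = 1.0         # leave constant columns unscaled instead of dividing by zero
    return mean, std

def apply_column_scaler(X, mean, std):
    """Hypothetical helper: apply the statistics learned on the training windows."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))   # 200 made-up windows, 6 columns each
outlier = np.full((1, 6), 100.0)      # a window with a much larger magnitude

mean, std = fit_column_scaler(X_train)
X_train_scaled = apply_column_scaler(X_train, mean, std)
outlier_scaled = apply_column_scaler(outlier, mean, std)

# Unlike per-window z-scoring, the outlier stays far from the bulk of the data
print(np.abs(outlier_scaled).min() > 10)
```

Each column is scaled with statistics shared by all windows, so differences in magnitude between windows survive the transformation.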

PS: thank you for an awesome project of such a large scope <3






