Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outliers #47

Open
stweb75 opened this issue Feb 16, 2025 · 1 comment
Open

Outliers #47

stweb75 opened this issue Feb 16, 2025 · 1 comment

Comments

@stweb75
Copy link

stweb75 commented Feb 16, 2025

Rbeast::beast can accommodate outliers with:

hasOutlier [default FALSE]
boolean; if true, fit a model with an outlier component that refers to potential spikes or dips at isolated data points: Y = trend + outlier + error if season='none',and Y = trend + season + outlier + error if season ~= 'none'.

ocp.minmax [default c(0,10)]
a vector of 2 integers (>=0); the min and max numbers of outlier-type changepoints (ocp) allowed in the time seriestrend component. Ocp refers to spikes or dips at isolated times that can't be modeled as trends or seasonal terms.

There is also the undocumented:

prior$outlierSigFactor = 2.5 # above which datapoints considered outliers

In my experience:

  • outliers 'absorb' variance and can result in overfitting of the trend and season components.
  • there often are many more outliers than there appear to be in a plot of the time series; often ncp_mode = ncp_max = ocp.minmax[2].

Below are three examples comparing Rbeast:beast and forecast::tsoutliers. The latter function is explained at Detecting time series in outliers. It is similarly based on a decomposition but identifies fewer outliers than does the former. I have also tried increasing outlierSigFactor.

I don't know how Rbeast::beast identifies outliers but suspect that prior$outlierSigFactor acts similarly to the 3IQR(r) tolerance interval in forecast::tsoutliers. For a symmetric distribution:
Q3(r) = Q2(r) + 0.5IQR(r) ~ mean(r) + 0.5IQR(r); IQR(r)~1.349sd(r); which gives a tolerance interval +/-1.349(0.5 + 3)IQR(r) ~ 4.7sd(r). That's a far more strict criteria than is 2.5sd(r). A further concern is that IQR(r) or mad(r) are robust estimators of scale and sd(r) is not.

Anyhow, my current strategy to avoid overfitting outliers is to set ocp.minmax[2] = length(tsoutliers(data)$index) or, being generous, 1.5length(tsoutliers(data)$index).

library(forecast)
library(Rbeast)
## Example 1
Data(Nile)
# Rbeast::beast(outlierSigFactor = 2.5)
# mode 7 outliers and Pr(ncp=7)=0.54
# 4 ocp with cpPr > 0.95
mod <- beast(Nile, hasOutlier=TRUE)
plot(mod)
mod
# no outliers by forecast::tsoutliers()
tsoutliers(Nile)
# Rbeast::beast(outlierSigFactor = 4.7)
# mode 1 outliers and and Pr(ncp=1)=0.95
# 1 ocp with cpPr > 0.95
mod <- beast123(Y = as.numeric(Nile),
  season="none",
  metadata = list(time = time(Nile),
    deltaTime = 1,
    hasOutlier = TRUE),
  prior = list(outlierSigFactor = 4.7),
  extra = list(dumpInputData=TRUE,
    computeTrendOrder=TRUE,
    computeTrendSlope = TRUE))
plot(mod)
mod
## Example 2
# Rbeast::beast(outlierSigFactor = 2.5)
# mode 10 outliers and Pr(ncp=10)=0.83
# 7 ocp with cpPr > 0.95
mod <- beast(googletrend_beach, hasOutlier=TRUE)
plot(mod)
mod
# five outliers by forecast::tsoutliers()
out <- tsoutliers(googletrend_beach)
# these match the top 5 outliers detected by Rbeast::beast()
out
time(googletrend_beach)[out$index]
# the ts is complex and the outliers are not obvious by visual inspection
plot(googletrend_beach)
points(time(googletrend_beach)[out$index], googletrend_beach[out$index], col="red")
# Rbeast::beast(outlierSigFactor = 4.7)
# mode 2 outliers and and Pr(ncp=2)=0.94
# 1 ocp with cpPr > 0.95
mod <- beast123(Y = googletrend_beach,
  season="harmonic",
  metadata = list(time = time(googletrend_beach),
    deltaTime = 1/12,
    period = 1.0,    
    hasOutlier = TRUE),
  prior = list(outlierSigFactor = 4.7),
  extra = list(dumpInputData=TRUE,
    computeTrendOrder=TRUE,
    computeTrendSlope = TRUE))
plot(mod)
mod
# Rbeast::beast(ocp.minmax=c(0,5))
# mode 5 outliers and and Pr(ncp=5)=0.84
# 3 ocp with cpPr > 0.95
mod <- beast(googletrend_beach, hasOutlier=TRUE, ocp.minmax=c(0,5))
plot(mod)
mod
## Example 3
data(co2)
# mode 10 outliers by Rbeast::beast() however Pr(ncp=10)=0.30
# zero ocp with cpPr > 0.95
mod <- beast(co2, hasOutlier=TRUE)
plot(mod)
mod
# no outliers by forecast::tsoutliers()
tsoutliers(co2)
# Rbeast::beast(outlierSigFactor = 4.7)
# mode 0 outliers and and Pr(ncp=0)=0.579
# zero ocp with cpPr > 0.95
mod <- beast123(Y = co2,
  season="harmonic",
  metadata = list(time = time(co2),
    deltaTime = 1/12,
    period = 1.0,    
    hasOutlier = TRUE),
  prior = list(outlierSigFactor = 4.7),
  extra = list(dumpInputData=TRUE,
    computeTrendOrder=TRUE,
    computeTrendSlope = TRUE))
plot(mod)
mod
@stweb75
Copy link
Author

stweb75 commented Feb 17, 2025

I extended the example to compare fixed ncp_max or larger outlierSigFactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant