This project provides an efficient implementation of the Hampel filter for outlier detection, leveraging Numba’s JIT compilation to improve performance. It supports both serial and parallel computation modes for median and median absolute deviation (MAD) calculations, making it ideal for processing large datasets.
Make sure Numba, NumPy, and pandas are installed:
pip install numba numpy pandas
Then install the package:
pip install hampel_filter
This package contains functions for outlier detection in time-series or sequence data using the Hampel filter. The main function, hampel(), computes the median and the median absolute deviation (MAD) within a moving window of a specified size, then flags as outliers the values whose deviation from the window median exceeds a threshold multiple of the MAD.
def hampel(arr, window_size=5, n=3, parallel=False, return_indices=True)
- arr (np.ndarray, pd.Series, or pd.DataFrame): The input data array for outlier detection.
- window_size (int, default=5): The half-size of the moving window for median calculation.
- n (int, default=3): The threshold factor; outliers are values beyond n times the MAD.
- parallel (bool, default=False): Whether to use parallel computation. When True, it leverages multi-core processing.
- return_indices (bool, default=True): If True, returns the indices of outliers; otherwise, returns a boolean array indicating outliers.
- If return_indices is True, returns a tuple of arrays with outlier indices.
- If return_indices is False, returns a boolean array where True indicates outliers.
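The two return forms carry the same information, and plain NumPy can convert between them. The sketch below uses a made-up mask and does not call the package. The tuple-of-arrays form resembles what np.nonzero produces, but the source does not say which call the package uses, so treat that as an assumption.

```python
import numpy as np

# Hypothetical boolean mask, shaped like the return_indices=False result
# described above (True marks an outlier).
mask = np.array([False, False, False, False, True, False, False, False, False])

# Boolean mask -> tuple of index arrays (one array per dimension).
indices = np.nonzero(mask)

# Tuple of index arrays -> boolean mask of the original shape.
rebuilt = np.zeros(mask.shape, dtype=bool)
rebuilt[indices] = True

print(indices)                        # (array([4]),)
print(np.array_equal(rebuilt, mask))  # True
```

The index form is convenient for reporting positions; the boolean form is convenient for filtering, for example data[~mask] to drop flagged values.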
import numpy as np
from hampel_filter import hampel # Assuming the script is named hampel_filter.py
# Sample data with outliers
data = np.array([1, 1, 2, 2, 100, 2, 2, 1, 1])
# Detect outliers
outlier_indices = hampel(data, window_size=2, n=3, parallel=True)
print("Outlier indices:", outlier_indices)
The implementation includes the following helper functions:
- calc_medians(window_size, arr, medians): Calculates the moving median over a specified window.
- calc_medians_std(window_size, arr, medians, medians_diff): Computes the median absolute deviation (MAD) for outlier detection.
- calc_medians_parallel and calc_medians_std_parallel: Parallel versions of the above functions using prange for improved performance.
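As a rough illustration of what the two stages compute, here is a slow plain-NumPy sketch of a Hampel-style check. It is not the package's code. The name hampel_reference is hypothetical, the window is assumed to be truncated at the array edges, and the raw MAD is used without any scale constant; the package may differ on all of these points.

```python
import numpy as np

def hampel_reference(arr, window_size=5, n=3):
    """Plain-NumPy sketch: flag points that sit far from their local median."""
    arr = np.asarray(arr, dtype=float)
    size = arr.shape[0]
    mask = np.zeros(size, dtype=bool)
    for i in range(size):
        # window_size is treated as a half-width; edges are truncated (assumption).
        lo = max(0, i - window_size)
        hi = min(size, i + window_size + 1)
        window = arr[lo:hi]
        med = np.median(window)                # moving median
        mad = np.median(np.abs(window - med))  # median absolute deviation
        mask[i] = np.abs(arr[i] - med) > n * mad
    return mask

data = np.array([1, 1, 2, 2, 100, 2, 2, 1, 1])
flagged = hampel_reference(data, window_size=2, n=3)
print(np.flatnonzero(flagged))  # [4]
```

A reference like this is mainly useful for sanity-checking results on small inputs. Note that when the MAD of a window is zero, any nonzero deviation from the median is flagged.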
Setting parallel=True makes the Hampel filter calculation use Numba's parallel processing capabilities, which can speed up computation on large arrays. The actual gain depends on the number of CPU cores available and on the current system load.
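Because the gain varies by machine, it is worth measuring on your own data. The helper below is a minimal timing sketch; best_time and stand_in are hypothetical names, not part of the package. It makes one untimed warm-up call first, since a Numba JIT-compiled function compiles on its first call and that cost would otherwise distort the comparison.

```python
import time

def best_time(fn, *args, repeats=5, **kwargs):
    """Return the fastest wall-clock time of fn(*args, **kwargs) over `repeats` runs."""
    fn(*args, **kwargs)  # warm-up call: lets a JIT-compiled function compile first
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload so the sketch runs without the package installed.
def stand_in(values, parallel=False):
    return sorted(values)

serial = best_time(stand_in, list(range(10_000, 0, -1)), parallel=False)
print(f"best of 5: {serial:.6f} s")
```

With the package installed, the same helper could be called as best_time(hampel, data, window_size=2, n=3, parallel=False) and again with parallel=True to compare the two modes on a large array.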
This project is licensed under the MIT License.