This document describes the detection and sanitization algorithms used by Polez, their parameters, and operational characteristics.
Polez uses 10 watermark detection methods, statistical AI probability scoring, AI-specific watermark detection, and metadata scanning.
The WatermarkDetector runs up to 10 independent detection methods. Each returns a confidence score (0.0-1.0) and a detected/not-detected flag.
Analyzes high-frequency energy consistency above 15 kHz using STFT. Computes the ratio of standard deviation to mean power across high-frequency bins. A consistency score above 0.7 suggests embedded spread-spectrum watermarks. Also checks for suspicious energy at carrier frequencies (18, 19, 20, 21 kHz) where power exceeds mean + 3 standard deviations.
False positives: Audio with naturally strong high-frequency content (cymbals, synthesizers).
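
The two statistics described above can be sketched as follows. This is a minimal illustration, not Polez's actual code: the function names are hypothetical, the input is assumed to be a slice of per-bin power values already extracted from the STFT, and the mapping from the std/mean ratio to the final consistency score is not specified in this document, so it is left out.

```rust
/// Mean and population standard deviation of a slice (assumes non-empty input).
fn mean_std(values: &[f64]) -> (f64, f64) {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// Ratio of standard deviation to mean power (hypothetical helper).
fn coefficient_of_variation(powers: &[f64]) -> f64 {
    if powers.is_empty() {
        return 0.0;
    }
    let (mean, std) = mean_std(powers);
    if mean == 0.0 { 0.0 } else { std / mean }
}

/// Indices of bins whose power exceeds mean + 3 standard deviations.
fn suspicious_bins(powers: &[f64]) -> Vec<usize> {
    if powers.is_empty() {
        return Vec::new();
    }
    let (mean, std) = mean_std(powers);
    let threshold = mean + 3.0 * std;
    powers
        .iter()
        .enumerate()
        .filter(|&(_, &p)| p > threshold)
        .map(|(i, _)| i)
        .collect()
}

/// Nearest FFT bin for a frequency, e.g. one of the 18-21 kHz carriers.
fn bin_for_frequency(freq_hz: f64, sample_rate: f64, fft_size: usize) -> usize {
    (freq_hz * fft_size as f64 / sample_rate).round() as usize
}
```

Note that a 3-sigma rule needs a reasonable number of bins to be meaningful: with only a handful of values, a single outlier inflates the standard deviation enough that it can never exceed the threshold.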
Performs autocorrelation on the first 50 ms of signal to find echo patterns. Tests delays in the 1-50 ms range with a minimum strength ratio of 0.1. Requires two or more consistent echo delays for detection. Measures interval consistency; triggers if consistency exceeds 0.8.
False positives: Audio recorded in small reflective rooms or with intentional delay effects.
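
As a rough sketch of the autocorrelation step, the code below computes the normalized autocorrelation at each candidate lag and keeps lags whose strength reaches a minimum ratio (0.1 in the description above). The function names are hypothetical, lags are expressed in samples rather than milliseconds, and the later interval-consistency check is omitted.

```rust
/// Normalized autocorrelation r[lag] / r[0] of `signal`.
fn normalized_autocorr(signal: &[f64], lag: usize) -> f64 {
    if lag >= signal.len() {
        return 0.0;
    }
    let energy: f64 = signal.iter().map(|x| x * x).sum();
    if energy == 0.0 {
        return 0.0;
    }
    let cross: f64 = signal[..signal.len() - lag]
        .iter()
        .zip(&signal[lag..])
        .map(|(a, b)| a * b)
        .sum();
    cross / energy
}

/// Lags in [min_lag, max_lag] (samples) whose strength ratio is at least `min_strength`.
fn candidate_echo_lags(
    signal: &[f64],
    min_lag: usize,
    max_lag: usize,
    min_strength: f64,
) -> Vec<usize> {
    (min_lag..=max_lag)
        .filter(|&lag| normalized_autocorr(signal, lag) >= min_strength)
        .collect()
}
```

To convert the 1-50 ms range into sample lags, multiply by the sample rate: at 44.1 kHz, 1 ms is about 44 samples and 50 ms is 2205 samples.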
Checks four statistical properties:
- Entropy < 6.0 (confidence 0.7) - low entropy suggests structured data
- Kurtosis deviation from 3.0 > 2.0 (confidence 0.6) - non-Gaussian distribution
- Skewness > 0.5 (confidence 0.5) - asymmetric distribution
- Spectral entropy < 8.0 (confidence 0.5) - unusually structured spectrum
False negatives: Well-designed watermarks that preserve statistical properties.
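
The moment-based properties can be illustrated with a short sketch. It assumes population moments (so a Gaussian has kurtosis 3.0, matching the reference value above) and a histogram-based Shannon entropy over samples in [-1, 1]. The bin count is a free parameter here because the document does not state which one Polez uses, and the function names are hypothetical.

```rust
/// Returns (skewness, kurtosis) from population moments; None for empty or constant input.
fn skew_kurtosis(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let moment = |k: i32| samples.iter().map(|x| (x - mean).powi(k)).sum::<f64>() / n;
    let (m2, m3, m4) = (moment(2), moment(3), moment(4));
    if m2 == 0.0 {
        return None;
    }
    Some((m3 / m2.powf(1.5), m4 / (m2 * m2)))
}

/// Shannon entropy in bits of a histogram with `bins` equal-width bins over [-1, 1].
fn histogram_entropy_bits(samples: &[f64], bins: usize) -> f64 {
    if samples.is_empty() || bins == 0 {
        return 0.0;
    }
    let mut counts = vec![0usize; bins];
    for &x in samples {
        let pos = ((x.clamp(-1.0, 1.0) + 1.0) / 2.0 * bins as f64) as usize;
        counts[pos.min(bins - 1)] += 1;
    }
    let n = samples.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

The maximum possible entropy is log2(bins), so a threshold of 6.0 bits only makes sense with more than 64 bins.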
Uses STFT (2048-sample window, 75% overlap) to track phase evolution across frames. For each frequency bin, measures the standard deviation of frame-to-frame phase differences. High consistency (score > 0.7) in phase evolution suggests embedded phase modulation patterns.
False positives: Tonal audio with stable pitch.
Computes the signal envelope via Hilbert transform, then analyzes the modulation spectrum (1-100 Hz). Detection triggers when more than 5 peaks appear in the modulation spectrum, suggesting embedded AM patterns.
False positives: Tremolo effects, amplitude-modulated synthesis.
Tests multiple STFT window sizes (512, 1024, 2048, 4096). For each window size, computes the spectral flatness of every frame. Flags if average flatness exceeds 0.3 or peak consistency across frames exceeds 0.8.
False positives: White noise or broadband signals.
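
Spectral flatness is conventionally defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below uses that standard definition; whether Polez applies exactly this form, and the small epsilon used to avoid `ln(0)`, are assumptions.

```rust
/// Spectral flatness of one frame: geometric mean / arithmetic mean of bin powers.
/// Returns a value near 1.0 for a flat (noise-like) spectrum and near 0.0 for a tonal one.
fn spectral_flatness(power: &[f64]) -> f64 {
    if power.is_empty() {
        return 0.0;
    }
    let eps = 1e-12; // assumed guard against ln(0) on empty bins
    let n = power.len() as f64;
    let log_mean = power.iter().map(|p| (p + eps).ln()).sum::<f64>() / n;
    let arith_mean = power.iter().sum::<f64>() / n + eps;
    log_mean.exp() / arith_mean
}
```

Averaging this value over all frames for a given window size gives the quantity compared against the 0.3 threshold, which also shows why white noise triggers the detector.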
Four tests on least-significant bits:
- Bias test: Checks if LSB ones-ratio deviates from 0.5 by more than 0.02
- Chi-squared test: Analyzes LSB pair distribution against uniform expectation (threshold p=0.05, df=3)
- Periodicity test: Tests LSB autocorrelation at common embedding lags (128, 256, 441, 512, 576, 1024, 1152, 2048, 2304, 4096, 4608)
- Runs test: Z-score of run length distribution; flags if z > 2.58
False positives: Dithered audio or audio with low bit depth.
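
The bias and chi-squared tests can be sketched as below. The pairing scheme (non-overlapping consecutive samples) and the function names are assumptions; 7.815 is the standard chi-squared critical value for p = 0.05 with 3 degrees of freedom.

```rust
/// Critical value of the chi-squared distribution for p = 0.05, df = 3.
const CHI2_CRIT_P05_DF3: f64 = 7.815;

/// Fraction of samples whose least-significant bit is 1.
fn lsb_ones_ratio(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return 0.5;
    }
    let ones = samples.iter().filter(|&&s| (s & 1) == 1).count();
    ones as f64 / samples.len() as f64
}

/// Chi-squared statistic of LSB pairs (00, 01, 10, 11) against a uniform expectation.
/// Pairs are taken from non-overlapping consecutive samples (an assumption).
fn lsb_pair_chi_squared(samples: &[i16]) -> f64 {
    let mut counts = [0usize; 4];
    for pair in samples.chunks_exact(2) {
        let idx = (((pair[0] & 1) << 1) | (pair[1] & 1)) as usize;
        counts[idx] += 1;
    }
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let expected = total as f64 / 4.0;
    counts
        .iter()
        .map(|&c| {
            let d = c as f64 - expected;
            d * d / expected
        })
        .sum()
}
```

The two tests are complementary: a strictly alternating LSB sequence has a perfect 0.5 ones-ratio and passes the bias test, yet every pair is `01`, so the chi-squared statistic far exceeds the critical value.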
Three tests for re-encoding signatures:
- MP3 frame boundaries: Tests 1152-sample periodicity; flags if energy discontinuity at boundaries exceeds 1.3x the interior average
- Frequency cutoff: Fits log-linear model to 15-20 kHz rolloff; matches against known codec profiles (MP3 128/320 kbps, AAC at various bitrates)
- Spectral band replication (SBR): Correlates lower vs upper frequency bands; flags if correlation exceeds 0.5
For stereo audio, analyzes inter-channel phase difference stability across four frequency bands (100 Hz-1 kHz, 1-4 kHz, 4-8 kHz, 8-16 kHz). Flags if more than 30% of bins have phase difference standard deviation below 0.1 (unnaturally phase-locked).
For mono audio, analyzes temporal phase consistency across 8-sample segments. Flags if average coherence exceeds 0.7 with standard deviation below 0.1.
M/S (mid-side) analysis on stereo audio. Computes mid/side ratios across four bands (low, mid, high, ultrasonic). Flags if ratio standard deviation is below 0.02 with average ratio between 0.01 and 0.5 (unnaturally stable stereo image).
Also checks inter-channel correlation above 12 kHz; localized high correlation suggests embedded watermarks.
The StatisticalAnalyzer combines AI-specific indicators (60% weight) with classic statistical features (40% weight) to estimate the probability that audio was AI-generated.
| Indicator | Weight | What It Measures |
|---|---|---|
| Spectral continuity | 20% | Frame-to-frame spectral difference consistency; AI audio has unnaturally smooth transitions |
| Micro-silence patterns | 15% | Periodic energy dips at 5-50 ms intervals; AI models often produce regular micro-silences |
| Harmonic regularity | 15% | Consistency of harmonic ratios across frames; AI has unnaturally stable overtones |
| Onset precision | 10% | Rise time consistency across detected onsets; AI produces very regular attack times |
| Feature | Weight | Natural Range |
|---|---|---|
| Entropy | 10% | 6.0-10.0 |
| Kurtosis | 10% | 1.5-6.0 |
| Skewness | 10% | -0.5 to 0.5 |
| Spectral flatness | 10% | Varies by content |
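
The weighting in the two tables amounts to a single weighted sum. The sketch below assumes each indicator has already been converted to a score in [0, 1]; how Polez derives those per-feature scores (for example, from distance outside the natural range) is not described here, and the function name is hypothetical.

```rust
/// Weights for the AI-specific indicators, in table order (sum = 0.60).
const AI_WEIGHTS: [f64; 4] = [0.20, 0.15, 0.15, 0.10];
/// Weights for the classic statistical features, in table order (sum = 0.40).
const CLASSIC_WEIGHTS: [f64; 4] = [0.10, 0.10, 0.10, 0.10];

/// Combines per-indicator scores (each assumed to be in 0.0..=1.0) into one probability.
/// `ai`: spectral continuity, micro-silence, harmonic regularity, onset precision.
/// `classic`: entropy, kurtosis, skewness, spectral flatness.
fn ai_probability(ai: [f64; 4], classic: [f64; 4]) -> f64 {
    let weighted = |scores: &[f64; 4], weights: &[f64; 4]| -> f64 {
        scores.iter().zip(weights).map(|(s, w)| s * w).sum()
    };
    (weighted(&ai, &AI_WEIGHTS) + weighted(&classic, &CLASSIC_WEIGHTS)).clamp(0.0, 1.0)
}
```

Because the AI-specific group carries 60% of the total, a file that scores high on all four AI indicators but looks statistically natural still lands at 0.6.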
The PolezDetector targets AI-service-specific watermarks using three weighted signals:
| Signal | Weight | Method |
|---|---|---|
| Ultrasonic energy | 45% | DFT energy ratio in 23-24 kHz vs 15-20 kHz reference band. Requires 48 kHz+ sample rate. AI watermark ratio > 0.1; human audio < 0.02. |
| Bit plane bias | 35% | Analyzes 8 bit planes of 16-bit PCM. AI watermarks bias 6-8 planes; human audio biases 0-2. Threshold: deviation from 50% > 1%. |
| Autocorrelation | 20% | Tests LSB autocorrelation at periods 2-1024. AI watermarks show correlation > 0.05 at period 2. |
Confidence is based on signal agreement (lower variance between scores = higher confidence). Final probability uses sigmoid calibration to push values away from 0.5.
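
One way to read this scoring scheme is sketched below. Only the three weights come from the table; the confidence formula (one minus variance, normalized by the largest variance three scores in [0, 1] can have) and the sigmoid steepness `k = 10` are illustrative assumptions, since the document does not give the exact calibration.

```rust
/// Returns (calibrated probability, confidence) from three per-signal scores in 0.0..=1.0.
fn combine_signals(ultrasonic: f64, bit_plane: f64, autocorr: f64) -> (f64, f64) {
    // Weighted sum using the weights from the table above.
    let raw = 0.45 * ultrasonic + 0.35 * bit_plane + 0.20 * autocorr;

    // Agreement: lower variance between the scores means higher confidence.
    let scores = [ultrasonic, bit_plane, autocorr];
    let mean = scores.iter().sum::<f64>() / 3.0;
    let variance = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / 3.0;
    let max_variance = 2.0 / 9.0; // reached by scores such as (0, 0, 1)
    let confidence = (1.0 - variance / max_variance).clamp(0.0, 1.0);

    // Sigmoid calibration centered on 0.5; k is a hypothetical steepness.
    let k = 10.0;
    let calibrated = 1.0 / (1.0 + (-k * (raw - 0.5)).exp());
    (calibrated, confidence)
}
```

With a steepness above 4, the sigmoid's slope at 0.5 is greater than 1, which is what pushes mid-range values toward the extremes; values already very close to 0 or 1 are compressed slightly instead.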
The MetadataScanner checks two areas, tag contents (keys and values) and raw binary patterns:
Tag keys flagged: unique, fingerprint, identifier, tracking, license, isrc, barcode, upc, ean, catalog, txxx
Tag values flagged: suno, udio, audiocraft, musicgen, stable audio, ai-generated, generated by
Binary patterns: Searches raw bytes for AI service markers (SUNO, UDIO, AudioCraft, MusicGen, Stable Audio) and tag headers (APETAGEX, ID3).
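
A minimal sketch of both checks follows. The keyword lists are copied from the lines above; case-insensitive substring matching and the function names are assumptions about how the comparison might work, not a description of the actual MetadataScanner.

```rust
const FLAGGED_KEYS: [&str; 11] = [
    "unique", "fingerprint", "identifier", "tracking", "license", "isrc",
    "barcode", "upc", "ean", "catalog", "txxx",
];
const FLAGGED_VALUES: [&str; 7] = [
    "suno", "udio", "audiocraft", "musicgen", "stable audio", "ai-generated",
    "generated by",
];

/// True if the tag key or value contains any flagged keyword (case-insensitive).
fn is_suspicious_tag(key: &str, value: &str) -> bool {
    let key = key.to_lowercase();
    let value = value.to_lowercase();
    FLAGGED_KEYS.iter().any(|k| key.contains(*k))
        || FLAGGED_VALUES.iter().any(|v| value.contains(*v))
}

/// True if `needle` occurs anywhere in the raw file bytes.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}
```

Plain substring matching over-matches: `udio` also occurs inside "studio" and "audio", and `ean` inside "ocean". A real scanner would likely need token or word-boundary matching to keep false positives down.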
| Mode | Operations | Use Case |
|---|---|---|
| Fast | Metadata stripping only | Quick metadata removal without audio modification |
| Standard | Metadata + spectral cleaning + fingerprint removal | Balanced cleaning for most files |
| Preserving | Standard + all stealth DSP operations | Maximum cleaning with quality preservation focus |
| Aggressive | All operations with stronger parameters | Maximum disruption when quality is secondary |
For files exceeding ~5.3M samples, processing is chunked (~30 s chunks with 1 s overlap) and parallelized with rayon.
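
The chunk layout can be sketched as a list of overlapping sample ranges. This is an illustration only: the function name is hypothetical, and how Polez blends the overlapping regions when reassembling the output is not covered here.

```rust
/// Splits `total` samples into half-open ranges of at most `chunk_len` samples,
/// where each chunk starts `overlap` samples before the previous one ends.
fn chunk_ranges(total: usize, chunk_len: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(chunk_len > overlap, "chunk length must exceed the overlap");
    let mut ranges = Vec::new();
    if total == 0 {
        return ranges;
    }
    let mut start = 0;
    loop {
        let end = (start + chunk_len).min(total);
        ranges.push((start, end));
        if end == total {
            break;
        }
        start += chunk_len - overlap;
    }
    ranges
}
```

At 44.1 kHz, for example, a 30 s chunk with 1 s overlap corresponds to `chunk_ranges(total, 1_323_000, 44_100)`. Each range is independent, so the resulting vector is a natural input for a rayon parallel iterator.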
The SpectralCleaner operates in the STFT domain targeting known watermark frequency bands:
- 18-18.5 kHz
- 19-19.5 kHz
- 20-20.5 kHz
- 21-21.5 kHz
Four operations run in a single STFT pass:
- Periodic disruption: Phase randomization above 15 kHz (normal: +/-0.02 rad, aggressive: +/-0.05 rad)
- Spectral smoothing: 5-bin moving average on magnitude above 15 kHz
- Spread-spectrum attenuation: 0.8x magnitude scaling in high-frequency bins
- Adaptive noise shaping: High-pass filtered noise addition (8 kHz cutoff, normal: 9e-9, aggressive: 1.8e-8)
All operations respect psychoacoustic masking thresholds to avoid audible artifacts.
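
Two of the four operations, spectral smoothing and spread-spectrum attenuation, are simple enough to sketch on one frame of STFT magnitudes. The sketch assumes both apply to the same bins (everything from `first_bin` upward), that the averaging window is truncated at the spectrum edges, and it ignores the masking-threshold check. The function name is hypothetical.

```rust
/// Applies a 5-bin moving average and then a gain to all bins from `first_bin` upward.
/// Bins below `first_bin` are returned unchanged.
fn smooth_and_attenuate(mags: &[f64], first_bin: usize, gain: f64) -> Vec<f64> {
    let mut out = mags.to_vec();
    for i in first_bin..mags.len() {
        // Window of up to 5 bins centered on i, truncated at the edges.
        let lo = i.saturating_sub(2);
        let hi = (i + 2).min(mags.len() - 1);
        let window = &mags[lo..=hi];
        let avg = window.iter().sum::<f64>() / window.len() as f64;
        out[i] = gain * avg;
    }
    out
}
```

For the 15 kHz boundary, `first_bin` would be roughly `15000 * fft_size / sample_rate`, and `gain` would be 0.8. The average always reads from the original magnitudes, so the result does not depend on processing order.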
Adaptive notch pass (post-processing): Scans 15-20 kHz for ultrasonic peaks exceeding 1.3x the predicted rolloff. Applies cascaded lowpass and notch filters (Q=3-5) to surgically remove detected peaks.
The FingerprintRemover applies five configurable techniques:
- Statistical normalization: Adjusts kurtosis toward the 1.5-4.0 human range using cubic soft expansion or compression (strength 0.01)
- Temporal randomization: Interpolated sample jitter (normal: sigma=0.1, aggressive: sigma=0.15 samples)
- Phase randomization: FFT-domain phase perturbation (normal: sigma=0.01 rad, aggressive: sigma=0.015 rad)
- Micro-timing perturbation: Circular shift by 0-1 ms to break timing synchronization
- Human imperfections: Velocity variation (sigma=0.002), drift with decay (sigma=0.0001), and soft even-harmonic distortion (0.0001 * x * |x|, mimicking analog saturation)
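
Of the five techniques, micro-timing perturbation is the easiest to illustrate. The sketch below takes the shift as a parameter so that it stays deterministic; in practice the value would be drawn at random from the 0-1 ms range. The function name and signature are hypothetical.

```rust
/// Circularly shifts `samples` to the right by `shift_ms` milliseconds (clamped to 0..=1 ms).
fn micro_shift(samples: &mut [f32], sample_rate: u32, shift_ms: f64) {
    if samples.is_empty() {
        return;
    }
    let shift_ms = shift_ms.clamp(0.0, 1.0);
    let shift = ((shift_ms / 1000.0) * sample_rate as f64).round() as usize % samples.len();
    // rotate_right wraps the tail of the buffer around to the front.
    samples.rotate_right(shift);
}
```

At 44.1 kHz the maximum shift is 44 samples, which is inaudible as a delay but enough to move any sample-aligned synchronization marker.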
Twenty flag-gated operations are applied in Preserving and Aggressive modes. Each has a normal and a paranoid (aggressive) parameter level.
| Operation | Description | Normal | Paranoid |
|---|---|---|---|
| Phase dither | Sub-block (512-sample) FFT phase perturbation | +/-0.02 rad | +/-0.04 rad |
| Comb mask | Dynamic notch filters above 15 kHz at random harmonics | 3 notches, Q=10 | 5 notches, Q=10 |
| Transient shift | Sub-sample onset shifts detected via energy rise | +/-0.08 ms | +/-0.1 ms |
| Resample nudge | Subtle sample rate warping | +/-0.035% | +/-0.06% |
| Phase noise | FFT-domain full-spectrum phase perturbation | sigma=0.05 rad | sigma=0.08 rad |
| Phase swirl | Allpass cascade at 2 kHz and 5 kHz | alpha=[0.012, -0.01] | alpha=[0.016, -0.014] |
| Masked HF phase | Phase noise restricted to > 15.5 kHz | +/-0.10 rad | +/-0.15 rad |
| Gated resample nudge | RMS-gated sample rate warping (> 0.01 RMS) | 0.025% | 0.04% |
| Micro EQ flutter | Time-varying peaking EQ at 3 kHz | +/-0.01 dB, 0.3 Hz | +/-0.015 dB |
| HF decorrelate | Phase randomization in a high-frequency band | 13-17 kHz | 12-16 kHz |
| Refined transient | Gaussian-distributed onset shifts with crossfade | +/-0.08 ms | +/-0.12 ms |
| Adaptive transient | Onset-strength-gated shifting | 0.10 ms max | 0.15 ms max |
| Adaptive notch | Scan-based ultrasonic peak removal with cascaded filters | 2 cascades | 3 cascades |
| Spectral phase noise | Band-limited phase noise below 10 kHz | +/-0.08 rad | +/-0.12 rad |
| HF noise and dither | High-pass (10 kHz) noise + TPDF dither | 9e-8 + 2e-6 | 1.8e-7 + 4e-6 |
| Humanization | Wow/flutter simulation (tape machine imperfections) | 0.15 Hz, 5 samples | 0.21 Hz, 8 samples |
| Micro resample warp | Resample up then down for timing distortion | +/-0.15% | +/-0.22% |
| Analog warmth | tanh() soft saturation | 1.04x drive | 1.07x drive |
| Band-limiting | Butterworth lowpass with zero-phase filtering | 20 kHz | 19 kHz |
| Room tone | Gaussian noise simulating ambient room noise | sigma=1.5e-7 | sigma=3e-7 |
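
As one concrete example from the table, the analog warmth operation can be sketched as a plain `tanh` waveshaper. The exact form used by Polez (for instance, whether the output is rescaled after saturation) is not given in this document, so treat this as an assumption; the function name is hypothetical.

```rust
/// Soft saturation: y = tanh(drive * x). With drive near 1.0 the effect is very subtle.
fn analog_warmth(samples: &mut [f32], drive: f32) {
    for s in samples.iter_mut() {
        *s = (*s * drive).tanh();
    }
}
```

With the normal drive of 1.04, a sample at 0.5 maps to about 0.478. The curve compresses peaks slightly more than quiet passages, and the small overall level change would be compensated by the RMS normalization applied at the end of processing.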
All sanitization modes that modify audio apply:
- RMS normalization: Restores original RMS level after processing
- Soft clipping: Clamps samples to +/-0.99 to prevent digital overs
- Psychoacoustic masking: Spectral modifications only apply below the masking threshold to avoid audible artifacts
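
The first two steps can be sketched as follows. The order (RMS restoration first, then clamping) and the function names are assumptions; the masking step is omitted because it operates in the spectral domain.

```rust
/// Root-mean-square level of a buffer (0.0 for an empty buffer).
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt()
}

/// Rescales `samples` to `target_rms`, then clamps every sample to +/-0.99.
fn finalize(samples: &mut [f32], target_rms: f32) {
    let current = rms(samples);
    if current > 0.0 {
        let gain = target_rms / current;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
    for s in samples.iter_mut() {
        *s = (*s).clamp(-0.99, 0.99);
    }
}
```

In use, `target_rms` would be measured on the input before any processing, for example `let original = rms(&input);`, and passed to `finalize` once all other stages have run. If clamping actually engages, the final RMS ends up slightly below the target.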