A production-grade research implementation investigating privacy-preserving mechanisms for semantic vector retrieval systems. The project implements and evaluates three Local Differential Privacy (LDP) mechanisms to protect document collections from Membership Inference Attacks while maintaining retrieval quality.
- System Overview
- Installation
- Repository Architecture
- Data Specifications
- Privacy Mechanisms
- Evaluation Framework
- Output Specifications
- Experiment Execution
- Performance Benchmarks
- Development Guidelines
The system implements three distinct privacy mechanisms for vector retrieval:
- Document-side Local Differential Privacy (Doc-LDP): Injects calibrated Gaussian noise into document embeddings before index construction
- Query-side Local Differential Privacy (Query-LDP): Applies noise to query vectors during inference
- Score-side Local Differential Privacy (Score-LDP): Perturbs similarity scores before top-k selection
Key capabilities:
- Automated privacy budget calibration based on attack success thresholds
- Comprehensive privacy-utility tradeoff analysis
- Reproducible experimental pipeline with fixed random seeds
- Membership Inference Attack auditing with feature-based classifiers
- Python 3.11 or higher
- RAM: Minimum 16GB (32GB recommended for full experiments)
- Storage: 10GB free space for data and results
- Optional: CUDA 11.7+ compatible GPU for accelerated embedding generation
The project uses Conda for environment management with pinned dependencies to ensure reproducibility.
```bash
# Repository setup
git clone https://github.com/yourusername/mute-vector.git
cd mute-vector

# Environment creation
conda env create -f environment.yml
conda activate mute-vectors

# Development dependencies (optional)
pip install -r requirements-dev.txt

# Package installation
pip install -e .

# Data initialization
python scripts/download_data.py
python scripts/generate_queries.py
```

The source directory contains the core implementation organized into specialized modules:
Data Module (/src/data/): Handles data ingestion and preprocessing
- Loads 20 Newsgroups dataset with configurable category selection
- Implements stratified train/validation/test splitting (70/15/15)
- Performs text normalization and filtering
Embeddings Module (/src/embeddings/): Manages embedding generation
- Supports multiple sentence-transformer models
- Implements caching for computational efficiency
- Produces L2-normalized vectors for retrieval
Privacy Module (/src/privacy/): Core privacy mechanism implementations
- Calibrates noise parameters from privacy budgets (epsilon values)
- Implements Gaussian mechanism with sensitivity analysis
- Provides automated budget tuning for target attack thresholds
Retrieval Module (/src/retrieval/): Search infrastructure
- Builds and manages FAISS indices with inner product similarity
- Generates keyphrase queries using RAKE algorithm
- Implements top-k retrieval with configurable k values
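For intuition, here is a minimal numpy sketch of the exact inner-product top-k search that FAISS `IndexFlatIP` performs over L2-normalized vectors (so inner product equals cosine similarity). This is an illustration only, not the project's `indexer.py`/`searcher.py` API; the corpus and query are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Toy corpus: 100 documents with 8-dim embeddings, L2-normalized so that
# inner product == cosine similarity (the IndexFlatIP convention).
docs = rng.normal(size=(100, 8)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 42.
query = docs[42] + 0.01 * rng.normal(size=8).astype(np.float32)
query /= np.linalg.norm(query)

scores = docs @ query            # inner products against the whole corpus
k = 5
topk = np.argsort(-scores)[:k]   # indices of the k highest-scoring documents
print(topk[0])                   # document 42 should rank first
```

A real FAISS index replaces the brute-force matrix product with `index.search(query, k)`, but the ranking semantics are the same.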
Attacks Module (/src/attacks/): Privacy evaluation
- Implements feature-based Membership Inference Attacks
- Extracts statistical features from retrieval patterns
- Trains shadow models for attack calibration
Evaluation Module (/src/evaluation/): Metrics and analysis
- Computes retrieval metrics (Recall@k)
- Calculates privacy metrics (AUC, TPR@FPR)
- Generates statistical significance tests
Orchestration scripts for running systematic experiments:
- Baseline evaluation: Establishes non-private performance bounds
- Grid search: Exhaustive evaluation across privacy parameters
- Ablation studies: Isolates impact of individual components
- Robustness testing: Validates results on held-out data
YAML-based configuration system with hierarchical overrides:
- Default parameters in base configuration
- Mechanism-specific parameter sets
- Model specifications for embedding generation
Models (/models): Persistent model storage
- Trained MIA classifiers
- Shadow models for attack calibration
- Embedding model checkpoints
Results (/results): Experimental outputs
- Individual run results with full metrics
- Aggregated summaries with statistical analysis
- Publication-ready figures and tables
Logs (/logs): Execution tracking
- Detailed experiment logs with timestamps
- Error logs for debugging
- Performance profiling data
```
mute-vector/
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── loader.py            # 20newsgroups: 8 categories (comp.graphics, rec.autos, sci.med, talk.politics.guns,
│   │   │                        #   alt.atheism, sci.space, rec.sport.hockey, misc.forsale)
│   │   │                        #   Train: 70% (~3,500 docs), Val: 15% (~750 docs), Test: 15% (~750 docs)
│   │   ├── preprocessor.py      # Lowercasing, stopword removal (NLTK), min_doc_length=50 tokens
│   │   └── splitter.py          # Stratified splits maintaining category distributions
│   ├── embeddings/
│   │   ├── __init__.py
│   │   ├── encoder.py           # all-MiniLM-L6-v2 (384 dims) as primary, all-distilroberta-v1 (768 dims) for robustness
│   │   ├── models.py            # Model registry and lazy loading
│   │   └── cache.py             # Embedding cache manager (HDF5 format)
│   ├── privacy/
│   │   ├── __init__.py
│   │   ├── mechanisms.py        # Gaussian noise: N(0, σ²), σ = Δf/ε · √(2 ln(1.25/δ)), δ=1e-5
│   │   ├── doc_ldp.py           # Document-side noise before indexing
│   │   ├── query_ldp.py         # Query-side noise at inference
│   │   ├── score_ldp.py         # Score-space noise before top-k selection
│   │   ├── calibration.py       # ε ∈ {∞, 20, 10, 5, 2.5} → σ mapping
│   │   └── budget_tuner.py      # Binary search for ε given TPR@FPR target
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── indexer.py           # FAISS IndexFlatIP (inner product), L2-normalized vectors
│   │   ├── searcher.py          # Top-k ∈ {1, 5, 10} retrieval
│   │   ├── queries.py           # RAKE: 3 keyphrases/doc, 3-5 tokens each
│   │   └── scorer.py            # Cosine similarity scoring pipeline
│   ├── attacks/
│   │   ├── __init__.py
│   │   ├── mia.py               # Logistic regression MIA, 80/20 train/test split
│   │   ├── features.py          # 7 features: max_score, mean_score, std_score, hits_in_topk,
│   │   │                        #   mean_rank_when_hit, max_rank_hit, score_gini
│   │   └── shadow_models.py     # Shadow model training for calibration
│   ├── evaluation/
│   │   ├── __init__.py
│   │   ├── metrics.py           # Recall@{1,5,10}, MIA-AUC, TPR@{0.1%, 1%}FPR, latency_ms
│   │   ├── evaluator.py         # Full evaluation pipeline orchestration
│   │   └── statistical_tests.py # Bootstrap confidence intervals (n=1000)
│   └── utils/
│       ├── __init__.py
│       ├── config.py            # YAML parsing with schema validation
│       ├── logging.py           # Structured logging (JSON format)
│       ├── reproducibility.py   # Seed management (global seed: 2025)
│       └── io.py                # Unified I/O for results
├── experiments/
│   ├── __init__.py
│   ├── run_baseline.py          # No-DP baseline: full corpus, all queries
│   ├── run_grid_search.py       # Full factorial: mechanism × ε × k
│   ├── run_ablations.py         # Ablation A: k ∈ {1,5,10}, Ablation B: queries_per_doc ∈ {1,3,5}
│   ├── run_robustness.py        # Test set evaluation, model swapping
│   └── run_combined_mechanisms.py # Two-layer combinations (Doc+Query LDP)
├── scripts/
│   ├── setup_environment.sh
│   ├── download_data.py         # Fetches 20newsgroups via sklearn
│   ├── generate_queries.py      # Pre-generates all query sets
│   ├── build_indices.py         # Pre-builds FAISS indices for all ε values
│   └── visualize_results.py     # Matplotlib/Seaborn plots
├── configs/
│   ├── default.yaml             # Base configuration (overrideable)
│   ├── experiments/
│   │   ├── baseline.yaml        # mechanism: none, ε: ∞
│   │   ├── doc_ldp.yaml         # mechanism: doc, ε: [2.5, 5, 10, 20]
│   │   ├── query_ldp.yaml       # mechanism: query, ε: [2.5, 5, 10, 20]
│   │   ├── score_ldp.yaml       # mechanism: score, ε: [2.5, 5, 10, 20]
│   │   └── combined.yaml        # Two-layer combinations
│   └── models/
│       ├── minilm.yaml          # all-MiniLM-L6-v2 config
│       └── distilroberta.yaml   # all-distilroberta-v1 config
├── tests/
│   ├── __init__.py
│   ├── unit/
│   │   ├── test_privacy_mechanisms.py # Noise distribution tests
│   │   ├── test_retrieval.py    # Index consistency tests
│   │   └── test_metrics.py      # Metric computation tests
│   └── integration/
│       ├── test_pipeline.py     # End-to-end pipeline
│       └── test_reproducibility.py # Determinism checks
├── models/                      # Trained models storage
│   ├── mia/                     # MIA classifier checkpoints
│   ├── shadow/                  # Shadow models for calibration
│   └── embedders/               # Fine-tuned embedders (if applicable)
├── results/
│   ├── runs/                    # Individual run outputs (CSV)
│   │   └── archive/             # Historical runs
│   ├── aggregated/              # Aggregated results across runs
│   │   ├── summary.csv          # Main results table
│   │   └── statistical_analysis.csv # With confidence intervals
│   ├── figures/
│   │   ├── privacy_utility/     # Frontier plots
│   │   ├── ablations/           # Ablation study plots
│   │   └── comparison/          # Mechanism comparison plots
│   ├── tables/
│   │   ├── latex/               # LaTeX-formatted tables
│   │   └── csv/                 # CSV tables
│   └── checkpoints/             # Intermediate experiment states
├── data/
│   ├── raw/
│   │   └── 20newsgroups/        # Original sklearn fetch
│   ├── processed/
│   │   ├── train/               # 70% split (~3,500 docs)
│   │   ├── val/                 # 15% split (~750 docs)
│   │   └── test/                # 15% split (~750 docs)
│   ├── queries/
│   │   ├── train_queries.csv    # ~10,500 queries (3 per doc)
│   │   ├── val_queries.csv      # ~2,250 queries
│   │   └── test_queries.csv     # ~2,250 queries
│   └── indices/
│       ├── baseline/            # No-noise indices
│       └── private/             # Per-mechanism, per-ε indices
├── logs/
│   ├── experiments/             # Experiment execution logs
│   ├── errors/                  # Error tracking
│   └── performance/             # Timing and resource usage
├── notebooks/
│   └── exploratory/             # Development notebooks (not for production)
├── docs/
│   ├── REPRODUCIBILITY.md
│   ├── EXPERIMENTS.md
│   ├── METRICS.md               # Detailed metric definitions
│   └── OUTPUT_SCHEMA.md         # Unified output structure specification
├── .github/
│   └── workflows/
│       └── tests.yml            # CI/CD test runner
├── environment.yml
├── requirements.txt
├── requirements-dev.txt
├── pyproject.toml
├── setup.py
├── .gitignore
├── LICENSE
└── README.md
```
20 Newsgroups Corpus Selection:
- 8 diverse categories for balanced representation:
  - `comp.graphics`: Computer graphics discussions
  - `rec.autos`: Automotive topics
  - `sci.med`: Medical science
  - `talk.politics.guns`: Political discussions
  - `alt.atheism`: Religious debates
  - `sci.space`: Space exploration
  - `rec.sport.hockey`: Sports content
  - `misc.forsale`: Commerce/sales
Data Splits:
- Training set: 70% (~3,500 documents)
- Validation set: 15% (~750 documents)
- Test set: 15% (~750 documents)
- Stratified splitting maintains category distributions
Keyphrase Extraction Parameters:
- Algorithm: Rapid Automatic Keyword Extraction (RAKE)
- Queries per document: 3
- Keyphrase length: 3-5 tokens
- Total queries: ~15,000 across all splits
- HTML tag removal and text extraction
- Lowercasing and Unicode normalization
- Stopword removal using NLTK English stopwords
- Document filtering (minimum 50 tokens)
- Metadata preservation (category labels, document IDs)
Gaussian Mechanism Parameters:
- Noise distribution: N(0, σ²)
- Sensitivity calculation: Δf = 2 (for L2-normalized vectors)
- Standard deviation: σ = (Δf/ε) · √(2 ln(1.25/δ))
- Fixed δ = 10⁻⁵ for all experiments
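The ε → σ mapping above is a one-liner; a minimal sketch (the function name is illustrative, not the `calibration.py` API):

```python
import math

def gaussian_sigma(epsilon: float, delta: float = 1e-5,
                   sensitivity: float = 2.0) -> float:
    """Gaussian-mechanism noise scale: sigma = (Δf/ε) · sqrt(2 ln(1.25/δ))."""
    return (sensitivity / epsilon) * math.sqrt(2 * math.log(1.25 / delta))

# Lower ε → larger σ: stronger privacy, more distortion of the embeddings.
for eps in (20, 10, 5, 2.5):
    print(f"ε={eps:>4}: σ={gaussian_sigma(eps):.3f}")
```

With δ = 10⁻⁵ and Δf = 2, this gives σ ≈ 0.97 at ε = 10 and roughly four times that at ε = 2.5.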
Privacy Budget Grid:
- ε ∈ {∞ (baseline), 20, 10, 5, 2.5}
- Lower Ξ΅ values provide stronger privacy guarantees
Doc-LDP: Adds noise to document embeddings before index construction
- Applied once during preprocessing
- Affects all subsequent retrievals
- Most computationally efficient
Query-LDP: Adds noise to query embeddings at search time
- Applied per-query during inference
- Allows dynamic privacy adjustment
- No index reconstruction required
Score-LDP: Adds noise to similarity scores before ranking
- Applied post-retrieval
- Finest granularity of control
- Can be combined with other mechanisms
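The three mechanisms differ only in *where* the same Gaussian noise is injected. A compact numpy sketch of all three insertion points (re-normalizing after vector noise is an illustrative choice here, not a claim about the repository's implementation; `sigma` would come from the ε → σ calibration):

```python
import numpy as np

rng = np.random.default_rng(2025)
sigma = 0.5  # illustrative noise scale; normally derived from ε

def perturb(vecs: np.ndarray) -> np.ndarray:
    """Add N(0, σ²) noise to vectors, then re-normalize to unit length."""
    noisy = vecs + rng.normal(scale=sigma, size=vecs.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)

docs = rng.normal(size=(100, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0].copy()

# Doc-LDP: perturb once, before the index is built.
noisy_docs = perturb(docs)

# Query-LDP: perturb each query at search time; the index is untouched.
noisy_query = perturb(query)

# Score-LDP: compute clean similarities, then perturb the scores
# themselves before top-k selection.
scores = docs @ query
noisy_scores = scores + rng.normal(scale=sigma, size=scores.shape)
```

Because Score-LDP acts after retrieval, it composes naturally with either vector-side mechanism, as the combined configurations exploit.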
Automated epsilon selection based on privacy requirements:
- Target metric: TPR@0.1%FPR ≤ threshold
- Binary search over epsilon grid
- Returns minimal-utility-loss configuration
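The tuner's logic can be sketched as a standard binary search, assuming attack TPR grows monotonically with ε (less noise → easier attack). The function and the stand-in evaluation curve below are hypothetical, not the `budget_tuner.py` interface:

```python
def tune_epsilon(evaluate_tpr, target=0.05, lo=0.1, hi=20.0, tol=0.1):
    """Binary-search the largest ε (least noise, best utility) whose
    attack TPR@0.1%FPR stays at or below `target`."""
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if evaluate_tpr(mid) <= target:
            best, lo = mid, mid   # privacy target met: try weaker noise
        else:
            hi = mid              # attack too successful: shrink ε
    return best

# Stand-in for a real MIA evaluation: a made-up linear TPR curve.
eps = tune_epsilon(lambda e: 0.01 * e, target=0.05)
print(eps)  # converges just below 5.0 for this toy curve
```

In the real pipeline, `evaluate_tpr` would run the full attack audit at each candidate ε, so each probe is expensive and the coarse tolerance matters.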
Membership Inference Attack (MIA):
- Attack model: Logistic regression classifier
- Features: Statistical patterns from retrieval scores
- Training: 50/50 member/non-member split
- Evaluation metrics:
- AUC (Area Under ROC Curve)
- TPR@0.1%FPR (True Positive Rate at 0.1% False Positive Rate)
- TPR@1%FPR (True Positive Rate at 1% False Positive Rate)
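The TPR@FPR metrics reward attacks that are confident on a few members rather than weakly right on average. A simplified threshold-attack sketch of the computation (the project's attack is a feature-based logistic regression; here synthetic attack scores stand in for its outputs):

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, fpr=0.001):
    """TPR of a thresholded attack at a fixed false-positive rate."""
    # Threshold = the (1 - fpr) quantile of non-member scores, so only
    # a `fpr` fraction of non-members is (falsely) flagged as members.
    thresh = np.quantile(nonmember_scores, 1.0 - fpr)
    return float(np.mean(member_scores > thresh))

rng = np.random.default_rng(2025)
members = rng.normal(loc=1.0, size=10_000)     # attack scores, true members
nonmembers = rng.normal(loc=0.0, size=10_000)  # attack scores, non-members
t = tpr_at_fpr(members, nonmembers, fpr=0.01)
print(t)
```

Note how the same score distributions can have a respectable AUC yet a tiny TPR@0.1%FPR, which is why both are reported.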
Retrieval Quality:
- Recall@1: Fraction of queries retrieving correct document at rank 1
- Recall@5: Fraction of queries retrieving correct document in top 5
- Recall@10: Fraction of queries retrieving correct document in top 10
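Since each query is derived from a single source document, Recall@k reduces to checking whether that document appears in the top-k list. A minimal sketch with toy data (not the `metrics.py` API):

```python
import numpy as np

def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of queries whose source document appears in the top-k results."""
    hits = [gold in ids[:k] for ids, gold in zip(retrieved_ids, gold_ids)]
    return float(np.mean(hits))

retrieved = [[3, 7, 1], [5, 2, 9], [4, 4, 8]]  # ranked doc ids per query
gold      = [7, 9, 0]                          # source doc per query
print(recall_at_k(retrieved, gold, k=1))  # 0.0  (no gold doc at rank 1)
print(recall_at_k(retrieved, gold, k=3))  # 0.667 (2 of 3 queries hit)
```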
Performance Metrics:
- Query latency (milliseconds)
- Index construction time (seconds)
- Memory footprint (GB)
- Bootstrap confidence intervals (n=1000 iterations)
- Paired significance tests for mechanism comparison
- Effect size calculation (Cohen's d)
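A percentile-bootstrap sketch of how such confidence intervals can be obtained from per-query metric values (illustrative, assuming a simple percentile bootstrap rather than the exact procedure in `statistical_tests.py`):

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=2025):
    """Percentile-bootstrap (1 - alpha) CI for the mean of per-query values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return float(np.quantile(means, alpha / 2)), \
           float(np.quantile(means, 1 - alpha / 2))

# e.g. per-query hit indicators for Recall@5 (synthetic, ~80% hit rate)
hits = (np.random.default_rng(0).random(500) < 0.8).astype(float)
lo, hi = bootstrap_ci(hits)
print(lo, hi)
```

Resampling whole queries (rather than assuming normality) keeps the intervals honest for bounded, skewed metrics like recall and AUC.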
All experiments produce standardized CSV outputs with the following columns:
Experiment Metadata:
- `run_id`: Unique identifier (UUID)
- `timestamp`: ISO 8601 execution time
- `git_commit`: Repository version
- `config_hash`: Configuration fingerprint
Configuration Parameters:
- `mechanism`: {none, doc_ldp, query_ldp, score_ldp, combined}
- `epsilon_doc`: Document-side privacy budget
- `epsilon_query`: Query-side privacy budget
- `epsilon_score`: Score-side privacy budget
- `embedding_model`: Model identifier
- `dataset_split`: {train, val, test}
- `num_documents`: Corpus size
- `num_queries`: Total queries evaluated
Privacy Metrics:
- `mia_auc`: Attack AUC score [0,1]
- `tpr_0.001_fpr`: TPR at 0.1% FPR [0,1]
- `tpr_0.01_fpr`: TPR at 1% FPR [0,1]
- `avg_membership_score`: Mean attack confidence
Utility Metrics:
- `recall_at_1`: Recall@1 [0,1]
- `recall_at_5`: Recall@5 [0,1]
- `recall_at_10`: Recall@10 [0,1]
- `mrr`: Mean Reciprocal Rank [0,1]
Performance Metrics:
- `latency_mean_ms`: Average query time
- `latency_std_ms`: Query time standard deviation
- `latency_p99_ms`: 99th percentile latency
- `index_build_time_s`: Index construction duration
- `peak_memory_gb`: Maximum memory usage
Statistical Measures:
- `recall_at_1_ci_lower`: 95% CI lower bound
- `recall_at_1_ci_upper`: 95% CI upper bound
- `mia_auc_ci_lower`: 95% CI lower bound
- `mia_auc_ci_upper`: 95% CI upper bound
Summary tables combining multiple runs:
- Privacy-utility frontiers per mechanism
- Statistical comparisons across mechanisms
- Best configurations per privacy target
1. Environment Validation
   - Verify dependencies and data availability
   - Check reproducibility settings (seeds)
2. Baseline Establishment
   - Run non-private configuration
   - Establish upper bounds for utility metrics
3. Mechanism Evaluation
   - Execute grid search across epsilon values
   - Generate per-mechanism results
4. Comparative Analysis
   - Produce privacy-utility frontiers
   - Identify Pareto-optimal configurations
5. Robustness Validation
   - Test on held-out data split
   - Verify stability across random seeds
Experiments are executed via command-line scripts with YAML configurations:
- Single experiment: Specify configuration file
- Batch execution: Use experiment orchestrator
- Parallel runs: Configure worker processes
- Fixed global seed: 2025
- Deterministic operations enforced
- Configuration versioning via Git
- Complete environment specification
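The seed-management pattern behind `reproducibility.py` can be sketched as follows (the helper name is illustrative; only the global seed value 2025 comes from the project):

```python
import random

import numpy as np

GLOBAL_SEED = 2025  # the project's fixed global seed

def set_seed(seed: int = GLOBAL_SEED) -> None:
    """Seed every RNG the pipeline touches so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    # torch.manual_seed(seed) would follow the same pattern if torch is used

set_seed()
a = np.random.rand(3)
set_seed()
b = np.random.rand(3)
assert np.array_equal(a, b)  # identical draws after re-seeding
```

Re-seeding at the start of every experiment (rather than once per process) is what makes individual runs independently reproducible.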
Computation Time (per configuration):
- Embedding generation: ~5 minutes (CPU), ~1 minute (GPU)
- Index construction: ~30 seconds
- MIA evaluation: ~2 minutes
- Full grid search: ~4 hours
Memory Requirements:
- Embedding matrix: ~5GB for 5000 documents
- FAISS index: ~2GB
- Peak usage during evaluation: ~12GB
- Document corpus: Tested up to 10,000 documents
- Query load: Evaluated with 30,000 queries
- Embedding dimensions: 384 (MiniLM), 768 (DistilRoBERTa)
- Type hints required for all function signatures
- Docstrings following NumPy style guide
- Maximum line length: 100 characters
- Import ordering: standard library, third-party, local
- Unit test coverage minimum: 80%
- Integration tests for complete pipelines
- Reproducibility tests with fixed seeds
- Performance regression tests
- Feature branches for development
- Semantic versioning for releases
- Comprehensive commit messages
- PR reviews required for main branch
- Module-level documentation required
- Inline comments for complex algorithms
- Update README for interface changes
- Maintain experiment logs
For technical questions, implementation details, or bug reports, please open an issue on the GitHub repository with appropriate labels.
MIT License - See LICENSE file for complete terms.
This research implementation follows privacy-preserving machine learning best practices and builds upon established differential privacy frameworks.