yza-cmu/mute-vector

Mute the Vectors: Two-Sided Local Differential Privacy for Private Vector Retrieval

Executive Summary

A production-grade research implementation of privacy-preserving mechanisms for semantic vector retrieval systems. The project implements and evaluates three Local Differential Privacy (LDP) mechanisms that protect document collections from Membership Inference Attacks while maintaining retrieval quality.

Table of Contents

  1. System Overview
  2. Installation
  3. Repository Architecture
  4. Data Specifications
  5. Privacy Mechanisms
  6. Evaluation Framework
  7. Output Specifications
  8. Experiment Execution
  9. Performance Benchmarks
  10. Development Guidelines

System Overview

Core Objectives

The system implements three distinct privacy mechanisms for vector retrieval:

  • Document-side Local Differential Privacy (Doc-LDP): Injects calibrated Gaussian noise into document embeddings before index construction
  • Query-side Local Differential Privacy (Query-LDP): Applies noise to query vectors during inference
  • Score-side Local Differential Privacy (Score-LDP): Perturbs similarity scores before top-k selection

Key Capabilities

  • Automated privacy budget calibration based on attack success thresholds
  • Comprehensive privacy-utility tradeoff analysis
  • Reproducible experimental pipeline with fixed random seeds
  • Membership Inference Attack auditing with feature-based classifiers

Installation

System Requirements

  • Python 3.11 or higher
  • RAM: Minimum 16GB (32GB recommended for full experiments)
  • Storage: 10GB free space for data and results
  • Optional: CUDA 11.7+ compatible GPU for accelerated embedding generation

Environment Setup

The project uses Conda for environment management with pinned dependencies to ensure reproducibility.

# Repository setup
git clone https://github.com/yza-cmu/mute-vector.git
cd mute-vector

# Environment creation
conda env create -f environment.yml
conda activate mute-vectors

# Development dependencies (optional)
pip install -r requirements-dev.txt

# Package installation
pip install -e .

# Data initialization
python scripts/download_data.py
python scripts/generate_queries.py

Repository Architecture

Source Code (/src)

The source directory contains the core implementation organized into specialized modules:

Data Module (/src/data/): Handles data ingestion and preprocessing

  • Loads 20 Newsgroups dataset with configurable category selection
  • Implements stratified train/validation/test splitting (70/15/15)
  • Performs text normalization and filtering

Embeddings Module (/src/embeddings/): Manages embedding generation

  • Supports multiple sentence-transformer models
  • Implements caching for computational efficiency
  • Produces L2-normalized vectors for retrieval

Privacy Module (/src/privacy/): Core privacy mechanism implementations

  • Calibrates noise parameters from privacy budgets (epsilon values)
  • Implements Gaussian mechanism with sensitivity analysis
  • Provides automated budget tuning for target attack thresholds

Retrieval Module (/src/retrieval/): Search infrastructure

  • Builds and manages FAISS indices with inner product similarity
  • Generates keyphrase queries using RAKE algorithm
  • Implements top-k retrieval with configurable k values
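With L2-normalized vectors, FAISS's `IndexFlatIP` (exact inner-product search) is equivalent to exact cosine-similarity search. As an illustration only (not the repository's `indexer.py`/`searcher.py`), the same result can be computed with plain NumPy:

```python
import numpy as np

def top_k_inner_product(index_vecs: np.ndarray, query_vecs: np.ndarray, k: int):
    """Exact inner-product top-k; equals cosine similarity for L2-normalized vectors."""
    scores = query_vecs @ index_vecs.T              # (num_queries, num_docs)
    order = np.argsort(-scores, axis=1)[:, :k]      # best-first doc ids per query
    return np.take_along_axis(scores, order, axis=1), order

rng = np.random.default_rng(2025)
docs = rng.normal(size=(100, 384))                  # 384 dims, as with all-MiniLM-L6-v2
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries = docs[:5] + 0.01 * rng.normal(size=(5, 384))  # slightly perturbed copies
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
scores, ids = top_k_inner_product(docs, queries, k=10)
```

Each perturbed query should retrieve its source document at rank 1; FAISS performs this same computation with optimized kernels.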

Attacks Module (/src/attacks/): Privacy evaluation

  • Implements feature-based Membership Inference Attacks
  • Extracts statistical features from retrieval patterns
  • Trains shadow models for attack calibration

Evaluation Module (/src/evaluation/): Metrics and analysis

  • Computes retrieval metrics (Recall@k)
  • Calculates privacy metrics (AUC, TPR@FPR)
  • Generates statistical significance tests

Experiment Scripts (/experiments)

Orchestration scripts for running systematic experiments:

  • Baseline evaluation: Establishes non-private performance bounds
  • Grid search: Exhaustive evaluation across privacy parameters
  • Ablation studies: Isolates impact of individual components
  • Robustness testing: Validates results on held-out data

Configuration (/configs)

YAML-based configuration system with hierarchical overrides:

  • Default parameters in base configuration
  • Mechanism-specific parameter sets
  • Model specifications for embedding generation

Output Directories

Models (/models): Persistent model storage

  • Trained MIA classifiers
  • Shadow models for attack calibration
  • Embedding model checkpoints

Results (/results): Experimental outputs

  • Individual run results with full metrics
  • Aggregated summaries with statistical analysis
  • Publication-ready figures and tables

Logs (/logs): Execution tracking

  • Detailed experiment logs with timestamps
  • Error logs for debugging
  • Performance profiling data

Directory Tree

mute-vector/
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── loader.py            # 20newsgroups: 8 categories (comp.graphics, rec.autos, sci.med, talk.politics.guns,
│   │   │                        #   alt.atheism, sci.space, rec.sport.hockey, misc.forsale)
│   │   │                        # Train: 70% (~3,500 docs), Val: 15% (~750 docs), Test: 15% (~750 docs)
│   │   ├── preprocessor.py      # Lowercasing, stopword removal (NLTK), min_doc_length=50 tokens
│   │   └── splitter.py          # Stratified splits maintaining category distributions
│   ├── embeddings/
│   │   ├── __init__.py
│   │   ├── encoder.py           # all-MiniLM-L6-v2 (384 dims) as primary, all-distilroberta-v1 (768 dims) for robustness
│   │   ├── models.py            # Model registry and lazy loading
│   │   └── cache.py             # Embedding cache manager (HDF5 format)
│   ├── privacy/
│   │   ├── __init__.py
│   │   ├── mechanisms.py        # Gaussian noise: N(0, σ²), σ = Δf/ε · √(2·ln(1.25/δ)), δ = 1e-5
│   │   ├── doc_ldp.py           # Document-side noise before indexing
│   │   ├── query_ldp.py         # Query-side noise at inference
│   │   ├── score_ldp.py         # Score-space noise before top-k selection
│   │   ├── calibration.py       # ε ∈ {∞, 20, 10, 5, 2.5} → σ mapping
│   │   └── budget_tuner.py      # Binary search for ε given TPR@FPR target
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── indexer.py           # FAISS IndexFlatIP (inner product), L2-normalized vectors
│   │   ├── searcher.py          # Top-k ∈ {1, 5, 10} retrieval
│   │   ├── queries.py           # RAKE: 3 keyphrases/doc, 3-5 tokens each
│   │   └── scorer.py            # Cosine similarity scoring pipeline
│   ├── attacks/
│   │   ├── __init__.py
│   │   ├── mia.py               # Logistic regression MIA, 80/20 train/test split
│   │   ├── features.py          # 7 features: max_score, mean_score, std_score, hits_in_topk,
│   │   │                        #   mean_rank_when_hit, max_rank_hit, score_gini
│   │   └── shadow_models.py     # Shadow model training for calibration
│   ├── evaluation/
│   │   ├── __init__.py
│   │   ├── metrics.py           # Recall@{1,5,10}, MIA-AUC, TPR@{0.1%, 1%}FPR, latency_ms
│   │   ├── evaluator.py         # Full evaluation pipeline orchestration
│   │   └── statistical_tests.py # Bootstrap confidence intervals (n=1000)
│   └── utils/
│       ├── __init__.py
│       ├── config.py            # YAML parsing with schema validation
│       ├── logging.py           # Structured logging (JSON format)
│       ├── reproducibility.py   # Seed management (global seed: 2025)
│       └── io.py                # Unified I/O for results
├── experiments/
│   ├── __init__.py
│   ├── run_baseline.py          # No-DP baseline: full corpus, all queries
│   ├── run_grid_search.py       # Full factorial: mechanism × ε × k
│   ├── run_ablations.py         # Ablation A: k ∈ {1,5,10}; Ablation B: queries_per_doc ∈ {1,3,5}
│   ├── run_robustness.py        # Test set evaluation, model swapping
│   └── run_combined_mechanisms.py # Two-layer combinations (Doc+Query LDP)
├── scripts/
│   ├── setup_environment.sh
│   ├── download_data.py         # Fetches 20newsgroups via sklearn
│   ├── generate_queries.py      # Pre-generates all query sets
│   ├── build_indices.py         # Pre-builds FAISS indices for all ε values
│   └── visualize_results.py     # Matplotlib/Seaborn plots
├── configs/
│   ├── default.yaml             # Base configuration (overrideable)
│   ├── experiments/
│   │   ├── baseline.yaml        # mechanism: none, ε: ∞
│   │   ├── doc_ldp.yaml         # mechanism: doc, ε: [2.5, 5, 10, 20]
│   │   ├── query_ldp.yaml       # mechanism: query, ε: [2.5, 5, 10, 20]
│   │   ├── score_ldp.yaml       # mechanism: score, ε: [2.5, 5, 10, 20]
│   │   └── combined.yaml        # Two-layer combinations
│   └── models/
│       ├── minilm.yaml          # all-MiniLM-L6-v2 config
│       └── distilroberta.yaml   # all-distilroberta-v1 config
├── tests/
│   ├── __init__.py
│   ├── unit/
│   │   ├── test_privacy_mechanisms.py # Noise distribution tests
│   │   ├── test_retrieval.py          # Index consistency tests
│   │   └── test_metrics.py            # Metric computation tests
│   └── integration/
│       ├── test_pipeline.py           # End-to-end pipeline
│       └── test_reproducibility.py    # Determinism checks
├── models/                      # Trained models storage
│   ├── mia/                     # MIA classifier checkpoints
│   ├── shadow/                  # Shadow models for calibration
│   └── embedders/               # Fine-tuned embedders (if applicable)
├── results/
│   ├── runs/                    # Individual run outputs (CSV)
│   │   └── archive/             # Historical runs
│   ├── aggregated/              # Aggregated results across runs
│   │   ├── summary.csv          # Main results table
│   │   └── statistical_analysis.csv # With confidence intervals
│   ├── figures/
│   │   ├── privacy_utility/     # Frontier plots
│   │   ├── ablations/           # Ablation study plots
│   │   └── comparison/          # Mechanism comparison plots
│   ├── tables/
│   │   ├── latex/               # LaTeX-formatted tables
│   │   └── csv/                 # CSV tables
│   └── checkpoints/             # Intermediate experiment states
├── data/
│   ├── raw/
│   │   └── 20newsgroups/        # Original sklearn fetch
│   ├── processed/
│   │   ├── train/               # 70% split (~3,500 docs)
│   │   ├── val/                 # 15% split (~750 docs)
│   │   └── test/                # 15% split (~750 docs)
│   ├── queries/
│   │   ├── train_queries.csv    # ~10,500 queries (3 per doc)
│   │   ├── val_queries.csv      # ~2,250 queries
│   │   └── test_queries.csv     # ~2,250 queries
│   └── indices/
│       ├── baseline/            # No-noise indices
│       └── private/             # Per-mechanism, per-ε indices
├── logs/
│   ├── experiments/             # Experiment execution logs
│   ├── errors/                  # Error tracking
│   └── performance/             # Timing and resource usage
├── notebooks/
│   └── exploratory/             # Development notebooks (not for production)
├── docs/
│   ├── REPRODUCIBILITY.md
│   ├── EXPERIMENTS.md
│   ├── METRICS.md               # Detailed metric definitions
│   └── OUTPUT_SCHEMA.md         # Unified output structure specification
├── .github/
│   └── workflows/
│       └── tests.yml            # CI/CD test runner
├── environment.yml
├── requirements.txt
├── requirements-dev.txt
├── pyproject.toml
├── setup.py
├── .gitignore
├── LICENSE
└── README.md

Data Specifications

Dataset Configuration

20 Newsgroups Corpus Selection:

  • 8 diverse categories for balanced representation:
    • comp.graphics - Computer graphics discussions
    • rec.autos - Automotive topics
    • sci.med - Medical science
    • talk.politics.guns - Political discussions
    • alt.atheism - Religious debates
    • sci.space - Space exploration
    • rec.sport.hockey - Sports content
    • misc.forsale - Commerce/sales

Data Splits:

  • Training set: 70% (~3,500 documents)
  • Validation set: 15% (~750 documents)
  • Test set: 15% (~750 documents)
  • Stratified splitting maintains category distributions

Query Generation

Keyphrase Extraction Parameters:

  • Algorithm: Rapid Automatic Keyword Extraction (RAKE)
  • Queries per document: 3
  • Keyphrase length: 3-5 tokens
  • Total queries: ~15,000 across all splits

Preprocessing Pipeline

  1. HTML tag removal and text extraction
  2. Lowercasing and Unicode normalization
  3. Stopword removal using NLTK English stopwords
  4. Document filtering (minimum 50 tokens)
  5. Metadata preservation (category labels, document IDs)
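The steps above can be sketched in a few lines. This is an illustration, not the repository's `preprocessor.py`; in particular, the inline `STOPWORDS` set is a tiny stand-in for NLTK's English stopword list.

```python
import re
import unicodedata

# Tiny stand-in stopword set; the actual pipeline uses NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}

def preprocess(text: str, min_tokens: int = 50):
    """Strip tags, normalize, drop stopwords; return tokens, or None if the doc is too short."""
    text = re.sub(r"<[^>]+>", " ", text)                  # 1. HTML tag removal
    text = unicodedata.normalize("NFKC", text).lower()    # 2. Unicode normalization + lowercasing
    tokens = [t for t in re.findall(r"[a-z0-9']+", text)
              if t not in STOPWORDS]                      # 3. tokenization + stopword removal
    return tokens if len(tokens) >= min_tokens else None  # 4. minimum-length filter

short_doc = preprocess("<p>The car is fast.</p>")         # too short: filtered out (None)
```

Category labels and document IDs (step 5) would be carried alongside the token lists rather than embedded in them.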

Privacy Mechanisms

Noise Calibration

Gaussian Mechanism Parameters:

  • Noise distribution: N(0, σ²)
  • Sensitivity calculation: Ξ”f = 2 (for L2-normalized vectors)
  • Standard deviation: Οƒ = Ξ”f/Ξ΅ Γ— √(2ln(1.25/Ξ΄))
  • Fixed Ξ΄ = 10⁻⁡ for all experiments

Privacy Budget Grid:

  • Ξ΅ ∈ {∞ (baseline), 20, 10, 5, 2.5}
  • Lower Ξ΅ values provide stronger privacy guarantees

Mechanism Implementations

Doc-LDP: Adds noise to document embeddings before index construction

  • Applied once during preprocessing
  • Affects all subsequent retrievals
  • Most computationally efficient

Query-LDP: Adds noise to query embeddings at search time

  • Applied per-query during inference
  • Allows dynamic privacy adjustment
  • No index reconstruction required

Score-LDP: Adds noise to similarity scores before ranking

  • Applied post-retrieval
  • Finest granularity of control
  • Can be combined with other mechanisms
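As a minimal sketch (not the repository's `doc_ldp.py`), Doc-LDP amounts to one Gaussian noise draw per document embedding before the index is built:

```python
import numpy as np

def apply_doc_ldp(embeddings: np.ndarray, sigma: float, seed: int = 2025) -> np.ndarray:
    """Add i.i.d. Gaussian noise N(0, σ²) to every coordinate of every document embedding."""
    rng = np.random.default_rng(seed)
    return embeddings + rng.normal(0.0, sigma, size=embeddings.shape)

docs = np.random.default_rng(0).normal(size=(4, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize, as the pipeline does
noisy_docs = apply_doc_ldp(docs, sigma=0.97)          # σ from the ε = 10 calibration
```

Query-LDP and Score-LDP apply the same perturbation to query vectors and to similarity scores, respectively; whether noisy embeddings are re-normalized before indexing is a design choice this README does not specify.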

Budget Tuning

Automated epsilon selection based on privacy requirements:

  • Target metric: TPR@0.1%FPR ≤ threshold
  • Binary search over epsilon grid
  • Returns the configuration that meets the target with minimal utility loss
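A sketch of the search, assuming (as the tuner must) that attack TPR increases monotonically in ε: the largest ε satisfying the constraint adds the least noise. `attack_tpr` is a stand-in for the full MIA evaluation, and `mock_tpr` is a toy monotone curve, not measured data.

```python
def tune_epsilon(attack_tpr, threshold: float, lo: float = 0.1, hi: float = 20.0,
                 iters: int = 30) -> float:
    """Largest ε whose attack TPR stays at or below the threshold (binary search)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if attack_tpr(mid) <= threshold:
            lo = mid          # constraint satisfied: try a weaker privacy setting
        else:
            hi = mid          # attack too strong: tighten the budget
    return lo

# Toy stand-in for TPR at 0.1% FPR as a function of ε:
mock_tpr = lambda eps: eps / (eps + 10.0)
eps_star = tune_epsilon(mock_tpr, threshold=0.25)
```

Each probe of `attack_tpr` in the real pipeline is expensive (it requires building the perturbed index and running the MIA), which is why a low-iteration binary search over the ε grid is used rather than a dense sweep.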

Evaluation Framework

Privacy Metrics

Membership Inference Attack (MIA):

  • Attack model: Logistic regression classifier
  • Features: Statistical patterns from retrieval scores
  • Training: 50/50 member/non-member split
  • Evaluation metrics:
    • AUC (Area Under ROC Curve)
    • TPR@0.1%FPR (True Positive Rate at 0.1% False Positive Rate)
    • TPR@1%FPR (True Positive Rate at 1% False Positive Rate)
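TPR at a fixed low FPR can be computed by thresholding attack scores at the corresponding quantile of the non-member score distribution. This is a simplified sketch, not the repository's `metrics.py`; the toy Gaussians stand in for real attack scores.

```python
import numpy as np

def tpr_at_fpr(member_scores: np.ndarray, nonmember_scores: np.ndarray,
               target_fpr: float) -> float:
    """TPR at the threshold whose false-positive rate is at most target_fpr."""
    # Threshold = (1 - target_fpr) quantile of the non-member scores.
    thresh = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float(np.mean(member_scores > thresh))

rng = np.random.default_rng(2025)
members = rng.normal(1.0, 1.0, 5000)      # attack scores for true members (toy)
nonmembers = rng.normal(0.0, 1.0, 5000)   # attack scores for non-members (toy)
tpr_low = tpr_at_fpr(members, nonmembers, 0.001)   # TPR at 0.1% FPR
tpr_high = tpr_at_fpr(members, nonmembers, 0.01)   # TPR at 1% FPR
```

Low-FPR operating points matter because an attacker who can confidently identify even a few members is already a privacy failure, regardless of average-case AUC.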

Utility Metrics

Retrieval Quality:

  • Recall@1: Fraction of queries retrieving correct document at rank 1
  • Recall@5: Fraction of queries retrieving correct document in top 5
  • Recall@10: Fraction of queries retrieving correct document in top 10
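Since each query is generated from a single source document, Recall@k reduces to checking whether that document appears in the top-k results. A minimal sketch (hypothetical helper, not the repository's `metrics.py`):

```python
import numpy as np

def recall_at_k(ranked_ids: np.ndarray, relevant_ids: np.ndarray, k: int) -> float:
    """Fraction of queries whose source document appears in the top-k results.

    ranked_ids:   (num_queries, max_k) retrieved doc ids per query, best first.
    relevant_ids: (num_queries,) the doc each query was generated from.
    """
    hits = (ranked_ids[:, :k] == relevant_ids[:, None]).any(axis=1)
    return float(hits.mean())

ranked = np.array([[3, 1, 4], [2, 0, 5], [7, 8, 9]])
truth = np.array([1, 2, 9])          # source doc at ranks 2, 1, 3 respectively
r1 = recall_at_k(ranked, truth, 1)   # only the second query hits at rank 1
r3 = recall_at_k(ranked, truth, 3)   # all three hit within the top 3
```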

Performance Metrics:

  • Query latency (milliseconds)
  • Index construction time (seconds)
  • Memory footprint (GB)

Statistical Analysis

  • Bootstrap confidence intervals (n=1000 iterations)
  • Paired significance tests for mechanism comparison
  • Effect size calculation (Cohen's d)
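The bootstrap CIs can be sketched as follows; this percentile-bootstrap helper is an illustration under the README's n=1000 setting, not the repository's `statistical_tests.py`.

```python
import numpy as np

def bootstrap_ci(values: np.ndarray, n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 2025) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-query metric values."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))  # resample with replacement
    means = values[idx].mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))

# Toy per-query hit indicators (e.g. Recall@1 successes), not measured results:
per_query_recall = np.random.default_rng(0).binomial(1, 0.8, size=500).astype(float)
ci_lo, ci_hi = bootstrap_ci(per_query_recall)
```

The same resampling indices can be reused across mechanisms to obtain paired comparisons.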

Output Specifications

Unified Output Schema

All experiments produce standardized CSV outputs with the following columns:

Experiment Metadata:

  • run_id: Unique identifier (UUID)
  • timestamp: ISO 8601 execution time
  • git_commit: Repository version
  • config_hash: Configuration fingerprint

Configuration Parameters:

  • mechanism: {none, doc_ldp, query_ldp, score_ldp, combined}
  • epsilon_doc: Document-side privacy budget
  • epsilon_query: Query-side privacy budget
  • epsilon_score: Score-side privacy budget
  • embedding_model: Model identifier
  • dataset_split: {train, val, test}
  • num_documents: Corpus size
  • num_queries: Total queries evaluated

Privacy Metrics:

  • mia_auc: Attack AUC score [0,1]
  • tpr_0.001_fpr: TPR at 0.1% FPR [0,1]
  • tpr_0.01_fpr: TPR at 1% FPR [0,1]
  • avg_membership_score: Mean attack confidence

Utility Metrics:

  • recall_at_1: Recall@1 [0,1]
  • recall_at_5: Recall@5 [0,1]
  • recall_at_10: Recall@10 [0,1]
  • mrr: Mean Reciprocal Rank [0,1]

Performance Metrics:

  • latency_mean_ms: Average query time
  • latency_std_ms: Query time standard deviation
  • latency_p99_ms: 99th percentile latency
  • index_build_time_s: Index construction duration
  • peak_memory_gb: Maximum memory usage

Statistical Measures:

  • recall_at_1_ci_lower: 95% CI lower bound
  • recall_at_1_ci_upper: 95% CI upper bound
  • mia_auc_ci_lower: 95% CI lower bound
  • mia_auc_ci_upper: 95% CI upper bound

Aggregated Outputs

Summary tables combining multiple runs:

  • Privacy-utility frontiers per mechanism
  • Statistical comparisons across mechanisms
  • Best configurations per privacy target

Experiment Execution

Standard Workflow

  1. Environment Validation

    • Verify dependencies and data availability
    • Check reproducibility settings (seeds)
  2. Baseline Establishment

    • Run non-private configuration
    • Establish upper bounds for utility metrics
  3. Mechanism Evaluation

    • Execute grid search across epsilon values
    • Generate per-mechanism results
  4. Comparative Analysis

    • Produce privacy-utility frontiers
    • Identify Pareto-optimal configurations
  5. Robustness Validation

    • Test on held-out data split
    • Verify stability across random seeds

Execution Commands

Experiments are executed via command-line scripts with YAML configurations:

  • Single experiment: Specify configuration file
  • Batch execution: Use experiment orchestrator
  • Parallel runs: Configure worker processes

Reproducibility Protocol

  • Fixed global seed: 2025
  • Deterministic operations enforced
  • Configuration versioning via Git
  • Complete environment specification

Performance Benchmarks

Expected Resource Usage

Computation Time (per configuration):

  • Embedding generation: ~5 minutes (CPU), ~1 minute (GPU)
  • Index construction: ~30 seconds
  • MIA evaluation: ~2 minutes
  • Full grid search: ~4 hours

Memory Requirements:

  • Embedding matrix: ~5GB for 5000 documents
  • FAISS index: ~2GB
  • Peak usage during evaluation: ~12GB

Scalability Considerations

  • Document corpus: Tested up to 10,000 documents
  • Query load: Evaluated with 30,000 queries
  • Embedding dimensions: 384 (MiniLM), 768 (DistilRoBERTa)

Development Guidelines

Code Standards

  • Type hints required for all function signatures
  • Docstrings following NumPy style guide
  • Maximum line length: 100 characters
  • Import ordering: standard library, third-party, local

Testing Requirements

  • Unit test coverage minimum: 80%
  • Integration tests for complete pipelines
  • Reproducibility tests with fixed seeds
  • Performance regression tests

Version Control

  • Feature branches for development
  • Semantic versioning for releases
  • Comprehensive commit messages
  • PR reviews required for main branch

Documentation Standards

  • Module-level documentation required
  • Inline comments for complex algorithms
  • Update README for interface changes
  • Maintain experiment logs

Support

For technical questions, implementation details, or bug reports, please open an issue on the GitHub repository with appropriate labels.

License

MIT License - See LICENSE file for complete terms.

Acknowledgments

This research implementation follows privacy-preserving machine learning best practices and builds upon established differential privacy frameworks.
