CADLP is a research-grade, enterprise-ready proxy layer that intercepts prompts sent to large language model (LLM) APIs and detects, classifies, and redacts sensitive data before it leaves the organization. It addresses the Shadow AI problem: employees pasting credentials, PII, proprietary code, or internal operational context into unapproved consumer AI tools.
The system is described in full in the accompanying IEEE-style paper: "Contextual Sensitivity Classification and Utility-Preserving Redaction for Enterprise LLM Deployments."
```
Prompt Input
     │
     ▼
┌────────────────────────────────────────────────┐
│    Contextual Sensitivity Classifier (CSC)     │
│                                                │
│  Stage 1: Regex + Entropy (fast-path)          │
│  Stage 2: NER + Disambiguation                 │
│  Stage 3: AST Code IP Fingerprinting           │
│  Stage 4: Semantic / KB Similarity             │
└──────────────────────┬─────────────────────────┘
                       │ SensitivityMap
                       ▼
              ┌─────────────────┐
              │  Policy Engine  │  ALLOW / REDACT / BLOCK / QUARANTINE
              └────────┬────────┘
                       │
           ┌───────────┴────────────┐
           │                        │
           ▼                        ▼
  ┌─────────────────┐     ┌──────────────────┐
  │  UPR Redaction  │     │   Audit Logger   │
  │ (typed placeh.) │     │ (zero-retention) │
  └─────────────────┘     └──────────────────┘
           │
           ▼
  Sanitised Prompt → LLM API
```
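The staged, short-circuiting flow above can be sketched as follows. This is an illustrative pattern only, not the CADLP API: the stage functions, the `(span, entity_type, confidence)` tuples, and the 0.95 cutoff are assumptions made for the sketch.

```python
import re

# Illustrative sketch of a staged detection pipeline (not the CADLP API).
# Each stage returns a list of (span, entity_type, confidence) findings.

def stage1_fastpath(prompt):
    # Cheap regex checks run on every prompt (fast-path).
    findings = []
    for match in re.finditer(r"sk-[A-Za-z0-9]{10,}", prompt):
        findings.append((match.group(), "API_KEY", 0.97))
    return findings

def stage2_ner(prompt):
    # Placeholder for the costlier heuristic-NER stage.
    return []

def classify(prompt, stages=(stage1_fastpath, stage2_ner)):
    sensitivity_map = []
    for stage in stages:
        sensitivity_map.extend(stage(prompt))
        # A high-confidence fast-path hit lets the proxy skip the
        # costlier later stages and move straight to the policy decision.
        if any(conf >= 0.95 for _, _, conf in sensitivity_map):
            break
    return sensitivity_map

print(classify("My key is sk-abc123XYZdemo0"))
```

The design point is that most clean prompts exit after the cheap Stage 1 scan, so the expensive stages only run when nothing conclusive has been found yet.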
- Four-stage detection pipeline combining fast regex/entropy, heuristic NER, AST-based code IP fingerprinting, and TF-IDF semantic similarity.
- Operational vs. exemplary disambiguation in NER: entities appearing in hypothetical examples are suppressed; entities in action-verb contexts are escalated.
- Utility-Preserving Redaction (UPR): typed, indexed placeholders (`[EMAIL_1]`, `[API_KEY_1]`, `[PERSON_A]`) replace sensitive spans while preserving prompt structure and downstream LLM utility.
- Session-level redaction map: the same real-world entity always maps to the same placeholder within a multi-turn session, maintaining referential consistency.
- Zero-retention audit logging: only metadata (entity types, action taken, timestamp) is logged; raw prompt content is never persisted.
- Policy engine with six default rules, each configurable independently.
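The session-level redaction map can be illustrated with a minimal sketch. The `SessionRedactor` class, its email-only regex, and the placeholder-numbering scheme here are assumptions for illustration, not CADLP's implementation.

```python
import re

# Minimal sketch of utility-preserving redaction with a session-level map:
# the same entity always receives the same typed, indexed placeholder.
class SessionRedactor:
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def __init__(self):
        self._map = {}  # real value -> placeholder

    def _placeholder(self, value, etype):
        if value not in self._map:
            # Next free index for this entity type within the session.
            index = sum(1 for p in self._map.values()
                        if p.startswith(f"[{etype}_")) + 1
            self._map[value] = f"[{etype}_{index}]"
        return self._map[value]

    def redact(self, prompt):
        return self.EMAIL_RE.sub(
            lambda m: self._placeholder(m.group(), "EMAIL"), prompt
        )

r = SessionRedactor()
print(r.redact("Email alice@company.com and bob@company.com"))
# Email [EMAIL_1] and [EMAIL_2]
print(r.redact("Follow up with alice@company.com"))
# alice@company.com maps to [EMAIL_1] again in the same session
```

Because the map persists across calls, a multi-turn conversation keeps referring to "[EMAIL_1]" consistently, which is what preserves the prompt's utility for the downstream LLM.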
```bash
# Basic install (no optional ML dependencies)
pip install cadlp

# Full install with semantic similarity support
pip install "cadlp[full]"

# Development install
git clone https://github.com/sugentyala/cadlp
cd cadlp
pip install -e ".[dev]"
```

```python
from cadlp import ContextualSensitivityClassifier, UtilityPreservingRedactor
from cadlp.policy.engine import PolicyEngine, Action

csc = ContextualSensitivityClassifier()
upr = UtilityPreservingRedactor()
policy = PolicyEngine()

prompt = "Please reset the password for alice@company.com. Her SSN is 234-56-7890."

smap = csc.classify(prompt)
decision = policy.evaluate(smap)

if decision.action == Action.REDACT:
    result = upr.redact(smap)
    print(result.redacted_prompt)
    # "Please reset the password for [EMAIL_1]. Her SSN is [SSN_1]."
elif decision.action == Action.BLOCK:
    print(f"Prompt blocked: {decision.triggered_rule}")
```

```bash
# Scan a prompt from stdin
echo "My API key is sk-abc123XYZ..." | cadlp scan

# Run the built-in demo
cadlp demo

# Run evaluation on the SEPAD-10K sample
cadlp eval --dataset data/sepad10k/sample.json

# Run via Docker Compose
docker compose up cadlp
```

| Entity Type | Stage | Method | Confidence |
|---|---|---|---|
| API Keys (OpenAI, etc.) | 1 | Regex | 0.95-0.99 |
| Private Keys | 1 | Regex | 1.00 |
| JWT Tokens | 1 | Regex | 0.95 |
| Email Addresses | 1 | Regex | 0.85 |
| SSN / Credit Cards | 1 | Regex | 0.92-0.95 |
| Internal Hostnames | 1 | Regex | 0.82 |
| High-entropy Secrets | 1 | Shannon Entropy | variable |
| Person Names | 2 | NER + Disambiguation | 0.60-0.95 |
| Org / Project Codes | 2 | NER + Disambiguation | 0.60-0.95 |
| Ticket / Employee IDs | 2 | NER + Disambiguation | 0.60-0.95 |
| Proprietary Code | 3 | AST + Import Analysis | variable |
| Internal Terminology | 4 | TF-IDF Similarity | variable |
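The Shannon-entropy fast-path behind the "High-entropy Secrets" row can be approximated as below. The length and entropy thresholds are illustrative assumptions, not CADLP's actual tuning.

```python
import math

def shannon_entropy(token: str) -> float:
    """Bits per character of the token's empirical character distribution."""
    if not token:
        return 0.0
    counts = {}
    for ch in token:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(token)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, min_len: int = 16,
                      threshold: float = 3.5) -> bool:
    # Long, high-entropy strings are flagged as candidate secrets;
    # the 3.5 bits/char threshold here is illustrative only.
    return len(token) >= min_len and shannon_entropy(token) >= threshold

print(looks_like_secret("aB3xQ9tZ7kLm2NpR"))   # True  (random-looking)
print(looks_like_secret("password_password"))  # False (repetitive)
```

This is why the table lists the confidence for entropy hits as "variable": unlike a regex match, the score depends on how random the token actually looks.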
```bash
# Full suite (90 tests)
pytest tests/ -v

# Unit tests only
pytest tests/unit/ -v --cov=cadlp --cov-report=term-missing

# Integration tests only
pytest tests/integration/ -v
```

The `data/sepad10k/sample.json` file contains 10 annotated representative prompts from the SEPAD-10K synthetic enterprise prompt benchmark. Each entry includes:

- `id`: unique sample identifier
- `prompt`: raw prompt text
- `label`: `SENSITIVE` or `CLEAN`
- `category`: sensitivity category (CREDENTIAL, PII, CODE_IP, MIXED, EXEMPLARY_CONTEXT, etc.)
- `expected_action`: expected policy decision
- `annotations`: entity types, detection stage, confidence bounds
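For orientation, an entry has the shape below. This is a hypothetical record built from the field list above, not copied from the actual sample file.

```python
# Hypothetical SEPAD-10K-style record (illustrative, not real data).
sample = {
    "id": "sepad-0001",
    "prompt": "Rotate the key for svc-payments and email ops@corp.internal.",
    "label": "SENSITIVE",
    "category": "MIXED",
    "expected_action": "REDACT",
    "annotations": [
        {"entity_type": "EMAIL", "stage": 1, "confidence": [0.85, 0.85]},
        {"entity_type": "HOSTNAME", "stage": 1, "confidence": [0.82, 0.82]},
    ],
}
```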
```
cadlp/
├── cadlp/
│   ├── csc/
│   │   ├── stage1_fastpath.py   # Regex + entropy detection
│   │   ├── stage2_ner.py        # NER with disambiguation
│   │   ├── stage3_ast.py        # Code IP fingerprinting
│   │   ├── stage4_semantic.py   # Semantic / KB similarity
│   │   └── pipeline.py          # CSC orchestrator
│   ├── upr/
│   │   └── redaction.py         # Utility-preserving redactor
│   ├── policy/
│   │   └── engine.py            # Policy rule engine
│   ├── audit/
│   │   └── logger.py            # Zero-retention audit logger
│   ├── eval/
│   │   └── metrics.py           # LPR / FPR / F1 metrics
│   └── cli.py                   # Click CLI entrypoint
├── tests/
│   ├── unit/                    # 75 unit tests
│   └── integration/             # 15 integration tests
├── data/
│   └── sepad10k/
│       └── sample.json          # SEPAD-10K sample (10 prompts)
├── .github/workflows/ci.yml
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml
```
Apache License 2.0. See LICENSE for details.
If you use CADLP in your research, please cite:
```bibtex
@article{cadlp2025,
  title  = {Contextual Sensitivity Classification and Utility-Preserving Redaction
            for Enterprise LLM Deployments},
  author = {CADLP Research Team},
  year   = {2025},
  url    = {https://github.com/sunilgentyala/cadlp}
}
```