CADLP: Context-Aware Data Loss Prevention Proxy for LLMs

CADLP is a research-grade, enterprise-ready proxy layer that intercepts prompts sent to large language model (LLM) APIs and detects, classifies, and redacts sensitive data before it leaves the organization. It addresses the Shadow AI problem: employees who use unapproved consumer AI tools and inadvertently expose credentials, PII, proprietary code, or internal operational context.

The system is described in full in the accompanying IEEE-style paper: "Contextual Sensitivity Classification and Utility-Preserving Redaction for Enterprise LLM Deployments."

Architecture Overview

Prompt Input
     │
     ▼
┌────────────────────────────────────────────────┐
│     Contextual Sensitivity Classifier (CSC)    │
│                                                │
│  Stage 1: Regex + Entropy  (fast-path)         │
│  Stage 2: NER + Disambiguation                 │
│  Stage 3: AST Code IP Fingerprinting           │
│  Stage 4: Semantic / KB Similarity             │
└──────────────────────┬─────────────────────────┘
                       │ SensitivityMap
                       ▼
             ┌─────────────────┐
             │  Policy Engine  │  ALLOW / REDACT / BLOCK / QUARANTINE
             └────────┬────────┘
                      │
          ┌───────────┴────────────┐
          │                        │
          ▼                        ▼
 ┌─────────────────┐     ┌──────────────────┐
 │ UPR Redaction   │     │  Audit Logger    │
 │ (typed placeh.) │     │  (zero-retention)│
 └─────────────────┘     └──────────────────┘
          │
          ▼
   Sanitised Prompt → LLM API
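
The decision step in the diagram can be illustrated with a small, self-contained sketch. This is not CADLP's PolicyEngine: the `Finding` type, the `BLOCK_TYPES` set, the `QUARANTINE_BELOW` threshold, and the rule ordering are all assumptions made for illustration, and the local `Action` enum only stands in for the four actions named above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Stand-in for the four actions shown in the diagram.
    ALLOW = "ALLOW"
    REDACT = "REDACT"
    BLOCK = "BLOCK"
    QUARANTINE = "QUARANTINE"


@dataclass(frozen=True)
class Finding:
    # Hypothetical record for one detected entity.
    entity_type: str   # e.g. "EMAIL", "PRIVATE_KEY"
    confidence: float  # 0.0 to 1.0


# Assumed values, chosen only to make the example concrete.
BLOCK_TYPES = {"PRIVATE_KEY"}
QUARANTINE_BELOW = 0.70


def decide(findings):
    """Return one action for a list of findings, checking the strictest rule first."""
    if not findings:
        return Action.ALLOW
    if any(f.entity_type in BLOCK_TYPES for f in findings):
        return Action.BLOCK
    if any(f.confidence < QUARANTINE_BELOW for f in findings):
        # Assumption: uncertain detections are held for review.
        return Action.QUARANTINE
    return Action.REDACT


print(decide([Finding("EMAIL", 0.85)]))  # Action.REDACT
```

Checking the strictest rule first means a single high-severity finding decides the outcome regardless of what else the prompt contains.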

Key Features

Four-stage detection pipeline combining fast regex/entropy, heuristic NER, AST-based code IP fingerprinting, and TF-IDF semantic similarity.
Operational vs. exemplary disambiguation in NER: entities appearing in hypothetical examples are suppressed; entities in action-verb contexts are escalated.
Utility-Preserving Redaction (UPR): typed, indexed placeholders ([EMAIL_1], [API_KEY_1], [PERSON_A]) replace sensitive spans while preserving prompt structure and downstream LLM utility.
Session-level redaction map: the same real-world entity always maps to the same placeholder within a multi-turn session, maintaining referential consistency.
Zero-retention audit logging: only metadata (entity types, action taken, timestamp) is logged; raw prompt content is never persisted.
Policy engine with six default rules, each configurable independently.
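
As an illustration of the zero-retention idea, the sketch below builds an audit record from metadata only. The function name `build_audit_record` and the record layout are hypothetical and are not taken from CADLP's audit logger; the point is that the prompt text is never passed in, so it cannot end up in the log.

```python
import json
from datetime import datetime, timezone


def build_audit_record(entity_types, action):
    """Build a metadata-only audit record; raw prompt text is not an input."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "entity_types": sorted(set(entity_types)),  # types only, never values
        "action": action,
    }


record = build_audit_record(["EMAIL", "SSN", "EMAIL"], "REDACT")
print(json.dumps(record))
```

Keeping the prompt out of the function signature makes the retention guarantee a property of the interface rather than of a later filtering step.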

Installation

# Basic install (no optional ML dependencies)
pip install cadlp

# Full install with semantic similarity support
pip install "cadlp[full]"

# Development install
git clone https://github.com/sunilgentyala/cadlp
cd cadlp
pip install -e ".[dev]"

Quick Start

Python API

from cadlp import ContextualSensitivityClassifier, UtilityPreservingRedactor
from cadlp.policy.engine import PolicyEngine, Action

csc    = ContextualSensitivityClassifier()
upr    = UtilityPreservingRedactor()
policy = PolicyEngine()

prompt = "Please reset the password for alice@company.com. Her SSN is 234-56-7890."

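# Stages 1-4 produce a SensitivityMap; the policy engine turns it into a decision.
smap     = csc.classify(prompt)
decision = policy.evaluate(smap)

if decision.action == Action.REDACT:
    # Replace each sensitive span with a typed, indexed placeholder.
    result = upr.redact(smap)
    print(result.redacted_prompt)
    # "Please reset the password for [EMAIL_1]. Her SSN is [SSN_1]."
elif decision.action == Action.BLOCK:
    # The prompt must not be forwarded; report which rule fired.
    print(f"Prompt blocked: {decision.triggered_rule}")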
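
To show how typed, indexed placeholders can stay consistent across turns, here is a minimal sketch of a session-level map. `SessionRedactionMap` and its methods are hypothetical names, not CADLP's UtilityPreservingRedactor API; spans are assumed to arrive as `(start, end, entity_type)` tuples, and only numeric indices are modeled (lettered placeholders such as [PERSON_A] are not).

```python
from collections import defaultdict


class SessionRedactionMap:
    """Hypothetical helper: one stable placeholder per (type, value) pair."""

    def __init__(self):
        self._placeholders = {}            # (entity_type, value) -> placeholder
        self._counters = defaultdict(int)  # entity_type -> last index used

    def placeholder_for(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._placeholders:
            self._counters[entity_type] += 1
            index = self._counters[entity_type]
            self._placeholders[key] = f"[{entity_type}_{index}]"
        return self._placeholders[key]

    def redact(self, text, spans):
        """Replace (start, end, entity_type) spans with placeholders."""
        ordered = sorted(spans)
        # Assign placeholders in reading order so indices follow the text.
        labels = {s: self.placeholder_for(s[2], text[s[0]:s[1]]) for s in ordered}
        # Substitute right to left so earlier offsets stay valid.
        for span in reversed(ordered):
            start, end, _ = span
            text = text[:start] + labels[span] + text[end:]
        return text


session = SessionRedactionMap()
turn = "Please reset the password for alice@company.com."
start = turn.index("alice@company.com")
print(session.redact(turn, [(start, start + len("alice@company.com"), "EMAIL")]))
# Please reset the password for [EMAIL_1].
```

Reusing the same `session` object for later turns returns [EMAIL_1] for the same address again, which is what keeps references consistent in a multi-turn conversation.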

CLI

# Scan a prompt from stdin
echo "My API key is sk-abc123XYZ..." | cadlp scan

# Run the built-in demo
cadlp demo

# Run evaluation on the SEPAD-10K sample
cadlp eval --dataset data/sepad10k/sample.json

Docker

docker compose up cadlp

Detection Capabilities

Entity Type	Stage	Method	Confidence
API Keys (OpenAI, etc.)	1	Regex	0.95-0.99
Private Keys	1	Regex	1.00
JWT Tokens	1	Regex	0.95
Email Addresses	1	Regex	0.85
SSN / Credit Cards	1	Regex	0.92-0.95
Internal Hostnames	1	Regex	0.82
High-entropy Secrets	1	Shannon Entropy	variable
Person Names	2	NER + Disambiguation	0.60-0.95
Org / Project Codes	2	NER + Disambiguation	0.60-0.95
Ticket / Employee IDs	2	NER + Disambiguation	0.60-0.95
Proprietary Code	3	AST + Import Analysis	variable
Internal Terminology	4	TF-IDF Similarity	variable
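
The "Shannon Entropy" method in the table can be sketched as follows. This is a generic illustration of entropy-based secret detection, not CADLP's Stage 1 code; the `min_length` and `threshold` defaults are assumed values that a real deployment would need to tune.

```python
import math
from collections import Counter


def shannon_entropy(token):
    """Shannon entropy of the character distribution, in bits per character."""
    if not token:
        return 0.0
    counts = Counter(token)
    total = len(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token, min_length=20, threshold=4.0):
    """Flag long tokens whose characters are close to uniformly distributed."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold


print(shannon_entropy("abcd"))      # 2.0
print(looks_like_secret("a" * 40))  # False
```

Random keys use many distinct characters with similar frequency, so their entropy is high; natural-language words and repeated characters score lower, which is why the confidence for this method is listed as variable.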

Running Tests

# Full suite (90 tests)
pytest tests/ -v

# Unit tests only
pytest tests/unit/ -v --cov=cadlp --cov-report=term-missing

# Integration tests only
pytest tests/integration/ -v

Dataset: SEPAD-10K

The data/sepad10k/sample.json file contains 10 annotated representative prompts from the SEPAD-10K synthetic enterprise prompt benchmark. Each entry includes:

id: unique sample identifier
prompt: raw prompt text
label: SENSITIVE or CLEAN
category: sensitivity category (CREDENTIAL, PII, CODE_IP, MIXED, EXEMPLARY_CONTEXT, etc.)
expected_action: expected policy decision
annotations: entity types, detection stage, confidence bounds
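
Given the label field above, detection quality can be scored with standard binary-classification metrics. The sketch below computes the false positive rate and F1 from per-sample labels and a boolean "flagged" outcome. The function name `fpr_and_f1` and the toy inputs are illustrative assumptions, not data from SEPAD-10K and not the repository's own metrics code.

```python
def fpr_and_f1(labels, flagged):
    """labels: "SENSITIVE" or "CLEAN" per sample.
    flagged: True where the proxy redacted or blocked the sample."""
    tp = fp = tn = fn = 0
    for label, hit in zip(labels, flagged):
        sensitive = label == "SENSITIVE"
        if sensitive and hit:
            tp += 1
        elif sensitive and not hit:
            fn += 1
        elif hit:
            fp += 1
        else:
            tn += 1
    # Guard against empty denominators on degenerate inputs.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return fpr, f1


# Toy inputs for illustration only.
labels = ["SENSITIVE", "SENSITIVE", "CLEAN", "CLEAN"]
flagged = [True, False, True, False]
print(fpr_and_f1(labels, flagged))  # (0.5, 0.5)
```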

Project Structure

cadlp/
├── cadlp/
│   ├── csc/
│   │   ├── stage1_fastpath.py   # Regex + entropy detection
│   │   ├── stage2_ner.py        # NER with disambiguation
│   │   ├── stage3_ast.py        # Code IP fingerprinting
│   │   ├── stage4_semantic.py   # Semantic / KB similarity
│   │   └── pipeline.py          # CSC orchestrator
│   ├── upr/
│   │   └── redaction.py         # Utility-preserving redactor
│   ├── policy/
│   │   └── engine.py            # Policy rule engine
│   ├── audit/
│   │   └── logger.py            # Zero-retention audit logger
│   ├── eval/
│   │   └── metrics.py           # LPR / FPR / F1 metrics
│   └── cli.py                   # Click CLI entrypoint
├── tests/
│   ├── unit/                    # 75 unit tests
│   └── integration/             # 15 integration tests
├── data/
│   └── sepad10k/
│       └── sample.json          # SEPAD-10K sample (10 prompts)
├── .github/workflows/ci.yml
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml

License

Apache License 2.0. See LICENSE for details.

Citation

If you use CADLP in your research, please cite:

@article{cadlp2025,
  title   = {Contextual Sensitivity Classification and Utility-Preserving Redaction
             for Enterprise LLM Deployments},
  author  = {CADLP Research Team},
  year    = {2025},
  url     = {https://github.com/sunilgentyala/cadlp}
}
