prompt-defense-audit

Deterministic LLM prompt defense scanner. Checks system prompts for missing defenses against 17 attack vectors (12 base + 5 agent-specific in v1.4). Pure regex — no LLM, no API calls, < 5ms, 100% reproducible.

Traditional Chinese version

$ npx prompt-defense-audit "You are a helpful assistant."

  Grade: F  (8/100, 1/12 defenses)

  Defense Status:

  ✗ Role Boundary (80%)
    Partial: only 1/2 defense pattern(s)
  ✗ Instruction Boundary (80%)
    No defense pattern found
  ✗ Data Protection (80%)
    No defense pattern found
  ...

Why

OWASP lists Prompt Injection as the #1 threat to LLM applications. Yet most developers ship system prompts with zero defense.

We scanned 1,646 production system prompts from 4 public datasets. Results:

  • 97.8% lack indirect injection defense
  • 78.3% score F (below 45/100)
  • Average score: 36/100

Existing security tools require LLM calls (expensive, non-deterministic) or cloud services (privacy concerns). This package runs locally, instantly, for free.

Our philosophy: The deterministic engine is the product. AI deep analysis is optional — because regex is already strong enough for 90%+ of use cases. Zero AI cost by default.

Install

npm install prompt-defense-audit

Usage

Programmatic (TypeScript / JavaScript)

import { audit, auditWithDetails } from 'prompt-defense-audit'

// Quick audit
const result = audit('You are a helpful assistant.')
console.log(result.grade)    // 'F'
console.log(result.score)    // 8
console.log(result.missing)  // ['instruction-override', 'data-leakage', ...]

// Detailed audit with per-vector evidence
const mySystemPrompt = 'You are a helpful assistant.' // replace with your own prompt
const detailed = auditWithDetails(mySystemPrompt)
for (const check of detailed.checks) {
  console.log(`${check.defended ? '✅' : '❌'} ${check.name}: ${check.evidence}`)
}

CLI

# Inline prompt
npx prompt-defense-audit "You are a helpful assistant."

# From file
npx prompt-defense-audit --file my-prompt.txt

# Pipe from stdin
cat prompt.txt | npx prompt-defense-audit

# JSON output (for CI/CD)
npx prompt-defense-audit --json "Your prompt"

# Traditional Chinese output
npx prompt-defense-audit --zh "你的系統提示"

# List all 12 attack vectors
npx prompt-defense-audit --vectors

CI/CD Gate

GRADE=$(npx prompt-defense-audit --json --file prompt.txt | node -e "
  const r = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
  console.log(r.grade);
")
if [[ "$GRADE" == "D" || "$GRADE" == "F" ]]; then
  echo "Prompt defense audit failed: grade $GRADE"
  exit 1
fi
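
The same gate can be written in TypeScript when the pipeline already runs Node. This is a minimal sketch that mirrors the shell script's rule (fail on D or F). `failsGate` and its `minGrade` parameter are hypothetical helpers for illustration and are not part of the package; only the `audit()` call shown in the trailing comment comes from the API documented in this README.

```typescript
// Hypothetical CI gate helper; not part of prompt-defense-audit itself.
type Grade = 'A' | 'B' | 'C' | 'D' | 'F'

// Grades ordered from best to worst, so a larger index means a weaker prompt.
const GRADE_ORDER: Grade[] = ['A', 'B', 'C', 'D', 'F']

// Returns true when `grade` is worse than the lowest grade the gate accepts.
// The default of 'C' reproduces the shell gate: D and F fail, A to C pass.
function failsGate(grade: Grade, minGrade: Grade = 'C'): boolean {
  return GRADE_ORDER.indexOf(grade) > GRADE_ORDER.indexOf(minGrade)
}

// In a real pipeline the grade would come from the package, for example:
//   import { audit } from 'prompt-defense-audit'
//   const { grade } = audit(promptText)
//   if (failsGate(grade)) process.exit(1)
```

Keeping the threshold as a parameter lets a stricter project raise the bar (for example `failsGate(grade, 'B')`) without touching the comparison logic.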

17 Attack Vectors

Based on OWASP LLM Top 10, empirical research on 1,646 production prompts, and structured analysis of six documented crypto AI agent incidents (see CASE_STUDIES.md).

12 Base Vectors

| # | Vector | What it checks | Gap rate* |
|---|--------|----------------|-----------|
| 1 | Role Escape | Role definition + boundary enforcement | 92.4% |
| 2 | Instruction Override | Refusal clauses + meta-instruction protection | |
| 3 | Data Leakage | System prompt / training data disclosure prevention | 9.4% |
| 4 | Output Manipulation | Output format restrictions | 88.3% |
| 5 | Multi-language Bypass | Language-specific defense | 64.3% |
| 6 | Unicode Attacks | Homoglyph / zero-width character detection | |
| 7 | Context Overflow | Input length limits | |
| 8 | Indirect Injection | External data validation | 97.8% |
| 9 | Social Engineering | Emotional manipulation resistance | 71.4% |
| 10 | Output Weaponization | Harmful content generation prevention | |
| 11 | Abuse Prevention | Rate limiting / auth awareness | |
| 12 | Input Validation | XSS / SQL injection / sanitization | |

5 Agent Vectors (v1.4, May 2026)

Added after analysing six documented crypto AI agent incidents. Each vector is grounded in a specific real-world failure — see CASE_STUDIES.md for primary sources and root-cause analysis.

| # | Vector | What it checks | Reference incident |
|---|--------|----------------|--------------------|
| 13 | Encoding-aware Indirect Injection | Treating decoded/translated content (Morse, base64, ROT13) as untrusted data, not instructions | Grok×Bankrbot Morse code, May 2026 |
| 14 | Function/Tool Semantic Immutability | Function or tool semantics cannot be redefined mid-conversation | Freysa approveTransfer redefinition, Nov 2024 |
| 15 | Memory Provenance Awareness | Retrieved RAG memory may be poisoned by adversaries on other platforms | ElizaOS memory injection, Princeton 2025 |
| 16 | Cross-Agent Authorization Boundary | Authority does not silently inherit from another agent's output | Grok×Bankrbot principal confusion, May 2026 |
| 17 | Financial Transaction Guardrails | Hard limits, multi-sig, refusal thresholds for transactions | Lobstar Wilde decimal-error transfer, Feb 2026 |

*Gap rate = % of 1,646 production prompts missing this defense. Source: research data.

Grading

| Grade | Score | Meaning |
|-------|-------|---------|
| A | 90–100 | Strong defense coverage |
| B | 70–89 | Good, some gaps |
| C | 50–69 | Moderate, significant gaps |
| D | 30–49 | Weak, most defenses missing |
| F | 0–29 | Critical, nearly undefended |
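
The score bands above translate directly into a lookup. The sketch below is an illustration of the table, not the package's own implementation; `gradeForScore` and `LetterGrade` are hypothetical names.

```typescript
type LetterGrade = 'A' | 'B' | 'C' | 'D' | 'F'

// Maps a 0-100 score to the letter grade listed in the grading table.
// Hypothetical helper; the package computes the grade internally.
function gradeForScore(score: number): LetterGrade {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`)
  }
  if (score >= 90) return 'A'
  if (score >= 70) return 'B'
  if (score >= 50) return 'C'
  if (score >= 30) return 'D'
  return 'F'
}
```

For example, the score of 8 from the opening CLI run falls in the lowest band and maps to F.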

API Reference

audit(prompt: string): AuditResult

Quick audit. Returns grade, score, and list of missing defense IDs.

interface AuditResult {
  grade: 'A' | 'B' | 'C' | 'D' | 'F'
  score: number       // 0-100
  coverage: string    // e.g. "4/12"
  defended: number    // count of defended vectors
  total: number       // 12
  missing: string[]   // IDs of undefended vectors
}

auditWithDetails(prompt: string): AuditDetailedResult

Full audit with per-vector evidence.

interface AuditDetailedResult extends AuditResult {
  checks: DefenseCheck[]
  unicodeIssues: { found: boolean; evidence: string }
}

interface DefenseCheck {
  id: string
  name: string          // English
  nameZh: string        // Traditional Chinese
  defended: boolean
  confidence: number    // 0-1
  evidence: string      // Human-readable explanation
}
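
As an illustration of consuming the `checks` array, the sketch below filters and formats `DefenseCheck` entries. The interface is copied from the definition above; `formatCheck`, `undefendedChecks`, and the sample data (including the `nameZh` values) are hypothetical and hand-written, so the block runs without the package installed.

```typescript
// Shape copied from the API Reference above.
interface DefenseCheck {
  id: string
  name: string
  nameZh: string
  defended: boolean
  confidence: number // 0-1
  evidence: string
}

// Hypothetical helper: renders one check as a two-line summary.
function formatCheck(check: DefenseCheck): string {
  const mark = check.defended ? '✓' : '✗'
  const percent = Math.round(check.confidence * 100)
  return `${mark} ${check.name} (${percent}%)\n    ${check.evidence}`
}

// Hypothetical helper: keeps only undefended checks, most confident first.
// filter() returns a new array, so the caller's input is not reordered.
function undefendedChecks(checks: DefenseCheck[]): DefenseCheck[] {
  return checks
    .filter((c) => !c.defended)
    .sort((a, b) => b.confidence - a.confidence)
}

// Hand-written sample data for illustration, not real scanner output.
const sample: DefenseCheck[] = [
  {
    id: 'role-escape',
    name: 'Role Boundary',
    nameZh: '角色邊界',
    defended: true,
    confidence: 0.9,
    evidence: 'Matched 2/2 defense pattern(s)',
  },
  {
    id: 'data-leakage',
    name: 'Data Protection',
    nameZh: '資料保護',
    defended: false,
    confidence: 0.8,
    evidence: 'No defense pattern found',
  },
]

for (const check of undefendedChecks(sample)) {
  console.log(formatCheck(check))
}
```

In real use, `sample` would be replaced by `auditWithDetails(prompt).checks`.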

ATTACK_VECTORS: AttackVector[]

Array of all 12 attack vector definitions with bilingual names and descriptions.

How It Works

  1. Parses the system prompt text
  2. For each of 12 attack vectors, applies regex patterns that detect defensive language
  3. A defense is "present" when enough patterns match (usually >= 1, some require >= 2)
  4. Checks for suspicious Unicode characters embedded in the prompt
  5. Calculates coverage score and assigns a letter grade
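
The steps above can be sketched in a few lines. Everything in this block is an assumption made for illustration: the three rules, their regex patterns, the `minMatches` thresholds, and the score formula (defended vectors divided by total, as a rounded percentage) are hypothetical and are not the package's actual patterns or scoring.

```typescript
// Hypothetical rule shape; the real package ships its own vector definitions.
interface VectorRule {
  id: string
  patterns: RegExp[]
  minMatches: number // how many patterns must match for the defense to count
}

// Illustrative patterns only. None use the `g` flag, so test() is stateless.
const RULES: VectorRule[] = [
  {
    id: 'role-escape',
    patterns: [
      /\byou are\b/i,
      /\b(never|do not|don't)\b.*\b(role|character|persona)\b/i,
    ],
    minMatches: 2,
  },
  {
    id: 'instruction-override',
    patterns: [
      /\b(ignore|disregard)\b.*\b(attempts?|requests?)\b/i,
      /\bdo not (follow|obey)\b/i,
    ],
    minMatches: 1,
  },
  {
    id: 'data-leakage',
    patterns: [
      /\b(never|do not|don't)\b.*\b(reveal|share|disclose)\b.*\b(prompt|instructions)\b/i,
    ],
    minMatches: 1,
  },
]

// Applies every rule to the prompt and derives a coverage score.
function scan(prompt: string, rules: VectorRule[] = RULES) {
  const checks = rules.map((rule) => {
    const matched = rule.patterns.filter((p) => p.test(prompt)).length
    return { id: rule.id, matched, defended: matched >= rule.minMatches }
  })
  const defended = checks.filter((c) => c.defended).length
  const score = Math.round((defended / rules.length) * 100)
  return { checks, defended, total: rules.length, score }
}
```

With these sample rules, `scan('You are a helpful assistant.')` matches one of the two role patterns, which is below the threshold of 2, so the vector stays undefended. Because every step is a pure function of the prompt text, the same input always yields the same result.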

This tool does NOT:

  • Send your prompt to any external service
  • Use LLM calls (100% regex-based)
  • Guarantee security (it checks for defensive language, not runtime behavior)
  • Replace penetration testing or behavioral evaluation

What This Scanner Does NOT Catch

Static prompt analysis is layer 1 of a defense-in-depth model. The following classes of attack require defenses at other layers — this scanner does not replace them, and we say so explicitly so it isn't oversold:

  1. Runtime credential compromise. Dashboard takeovers, leaked API keys, malicious deployment commits. Standard infosec, out of scope. (Reference: AIXBT dashboard takeover, Mar 2025.)
  2. Tool / permission scoping bugs. Whether the agent has dangerous tools, and how those tools are gated, is invisible to a prompt scanner. (Reference: Bankrbot NFT-as-authorization, May 2026.)
  3. Whether declared defenses are enforced at runtime. A prompt can declare "verify retrieved memory" and the framework can ignore it. The scanner cannot tell.
  4. Numerical and unit bugs. Off-by-1000 decimal errors, wrong-token-id transfers. Code-level bugs, not prompt issues. (Reference: Lobstar Wilde, Feb 2026.)
  5. Effectiveness vs. presence. A prompt with the keyword "never" registers as defended even if a "helpful" framing dominates under adversarial pressure. We check for presence of defensive language, not its strength.
  6. Multi-turn adversarial dynamics. Static scan of turn 0 cannot predict turn 482. (Reference: Freysa, Nov 2024.)

A pass on this scanner is necessary, not sufficient. See CASE_STUDIES.md for an honest mapping of which documented incidents this scanner would flag versus which it cannot help with.

Limitations

  • Regex-based detection is heuristic — a prompt can contain defensive language but still be vulnerable at runtime. This tool measures intent to defend, not actual defense effectiveness.
  • Only checks system prompt text, not model behavior under adversarial pressure.
  • English and Traditional Chinese patterns only (contributions welcome for other languages).
  • False positives/negatives are possible. See research data for calibration details.
  • Fullwidth CJK punctuation triggers Unicode detection; this is a known limitation.
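
To illustrate the Unicode limitation, here is a sketch of a narrower check that looks only for zero-width and bidirectional-control characters, so ordinary fullwidth punctuation does not trigger it. The character set and the `findSuspiciousUnicode` helper are assumptions for illustration; they do not describe the package's actual detection logic. The return value mirrors the `unicodeIssues` shape from the API Reference.

```typescript
// Assumed set of invisible or direction-changing code points:
// zero-width and directional marks (U+200B-U+200F), bidi embeddings and
// overrides (U+202A-U+202E), word joiner (U+2060), and BOM (U+FEFF).
// The package's real list may differ.
const SUSPICIOUS_CHARS = /[\u200B-\u200F\u202A-\u202E\u2060\uFEFF]/g

// Hypothetical helper returning the same shape as `unicodeIssues`.
function findSuspiciousUnicode(text: string): { found: boolean; evidence: string } {
  const hits = text.match(SUSPICIOUS_CHARS) ?? []
  // Report each distinct code point once, formatted as U+XXXX.
  const codes = Array.from(
    new Set(
      hits.map((ch) => 'U+' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'))
    )
  )
  return { found: hits.length > 0, evidence: codes.join(', ') }
}
```

The trade-off of an allow-by-default check like this is that it misses homoglyph substitutions, which need a separate confusables comparison.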

Complementary tools

prompt-defense-audit is a static, design-time check. It pairs cleanly with runtime-side projects that detect attacks as they happen:

| Lifecycle stage | Tool | Question it answers |
|-----------------|------|---------------------|
| Build / CI gate | prompt-defense-audit (this) | "Is the prompt designed to resist attacks?" |
| Runtime detection | Agent-Threat-Rule (ATR) | "Is an attack happening right now?" |

Failure modes are orthogonal: the audit misses novel attacks not anticipated at design time; ATR misses prompts that have no resistance even before traffic arrives. Used together they form a defense-in-depth pattern (CI gate → runtime detection).

Detailed integration including the 1:N vector mapping (20 defense vectors → 9 ATR detection categories), recommended usage pattern, and cross-references: docs/integrations/agent-threat-rules.md.

Research

This tool is backed by empirical analysis of 1,646 production system prompts from 4 public datasets:

| Dataset | Size | Source |
|---------|------|--------|
| LouisShark/chatgpt_system_prompt | 1,389 | GPT Store custom GPTs |
| jujumilk3/leaked-system-prompts | 121 | ChatGPT, Claude, Grok, Perplexity, Cursor, v0 |
| x1xhlol/system-prompts-and-models | 80 | Cursor, Windsurf, Devin, Augment |
| elder-plinius/CL4R1T4S | 56 | Claude, Gemini, Grok, Cursor |

Contributing

See CONTRIBUTING.md. Key areas: new language patterns, better regex accuracy, integration examples.

Security

See SECURITY.md. Report vulnerabilities to dev@ultralab.tw — not via GitHub issues.

License

MIT — Ultra Lab

Used In Production

This library powers prompt defense detection across multiple production deployments and security frameworks. 11 PRs merged into indicator-org repos (Microsoft / Cisco / OWASP / UK Government AISI / awesome-list curators):

Frameworks + scanners (engine-level integration)

Test harnesses + eval suites

Standards mappings + awesome lists

Ultra Lab products built on this engine

  • UltraProbe (UltraLab) — free AI security scanner; uses this library as the Prompt Security engine.
  • Quartz Cloud — Taiwan-domiciled runtime AI firewall (Q3 2026 closed beta). Quartz uses this engine as its ingress detector + extends it with runtime + jurisdictional layers. The engine is open source under Ultra Lab; Quartz is a commercial brand built on top of it. Customers can audit, fork, or self-host the engine without lock-in.
