Aux LLM pass to enforce content policies on public output #374

@sentry-junior

Current behavior

Content policies (e.g. "don't publish PII") are enforced only by the primary LLM's system prompt and per-skill instructions. There is no independent verification layer before content is emitted to public-facing surfaces.

Gap

System-prompt-level policy compliance is probabilistic — the primary model can miss edge cases, especially when juggling complex multi-step tasks. When output becomes public information (Slack messages in broad channels, GitHub issues, PR descriptions, canvases), a single-pass approach has no safety net for policy violations.

Related: #11 covers PII scrubbing for Sentry telemetry via field allowlists. This issue proposes a more general mechanism that can enforce arbitrary content policies across all output types.

Proposed approach

Add an auxiliary LLM call that acts as a second-pass content policy agent:

  • Runs after the primary model generates content destined for a public surface
  • Receives the draft content + a set of encoded policies (PII suppression, sensitive data handling, internal-only context stripping, etc.)
  • Returns either an approval or a redacted/flagged version
  • Policies are defined as a reviewable, checked-in spec — not just prompt text
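
The flow above can be sketched in Python, assuming the aux model is wrapped as a plain callable so it can be swapped or stubbed. Every name here (`Policy`, `Verdict`, `second_pass`, `stub_reviewer`) is hypothetical and not taken from the codebase. The regex-based stub only stands in for the real aux LLM call so the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, Literal, Sequence, Tuple


@dataclass(frozen=True)
class Policy:
    """One entry of the checked-in policy spec (hypothetical shape)."""
    id: str
    description: str


@dataclass(frozen=True)
class Verdict:
    """Result of the second pass: approved as-is, redacted, or flagged."""
    status: Literal["approved", "redacted", "flagged"]
    content: str
    violations: Tuple[str, ...] = ()


# The aux model is injected as a callable: (draft, policies) -> Verdict.
Reviewer = Callable[[str, Sequence[Policy]], Verdict]

POLICIES = (
    Policy("no-pii", "Do not publish personal data such as email addresses."),
    Policy("no-internal-urls", "Do not leak internal URLs."),
)

# Assumption: a deliberately simple email pattern, used only by the stub.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def stub_reviewer(draft: str, policies: Sequence[Policy]) -> Verdict:
    """Stand-in for the aux LLM call; it only knows how to redact emails."""
    enabled = {p.id for p in policies}
    if "no-pii" in enabled and EMAIL.search(draft):
        return Verdict("redacted", EMAIL.sub("[redacted]", draft), ("no-pii",))
    return Verdict("approved", draft)


def second_pass(draft: str, policies: Sequence[Policy], reviewer: Reviewer) -> Verdict:
    """Run the draft through the reviewer before it reaches a public surface."""
    return reviewer(draft, policies)


verdict = second_pass("Contact jane@example.com for access", POLICIES, stub_reviewer)
print(verdict.status, "|", verdict.content)
```

Keeping the reviewer behind a callable means the policy list can be unit-tested against fixed drafts without calling a model, which is one way to make the spec "reviewable" in practice.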

Key design considerations:

  • Scope trigger: define which output actions route through the second pass (e.g. channel posts, issue creation, canvas writes) vs. which are exempt (e.g. ephemeral thread replies in private channels)
  • Latency budget: the aux call adds latency; may need a fast model or an async check-and-retract pattern
  • Policy spec format: structured enough to be testable, flexible enough to cover "don't leak internal URLs," "strip customer names," "no PII," etc.
  • Failure mode: what happens when the aux model is unavailable or disagrees with the primary model's draft (block, warn, or log-and-emit)
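
The scope trigger and failure mode could be combined in one small routing function. The sketch below is illustrative only: the action names, `FailureMode`, `Outcome`, and `emit` are all hypothetical. This issue does not define what "warn" means; here it is assumed to mean publishing the draft with a warning attached. Any exception raised by the check is treated as "aux model unavailable".

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

log = logging.getLogger("content_policy")


class FailureMode(Enum):
    BLOCK = "block"                # publish nothing
    WARN = "warn"                  # publish, but attach a warning (assumed meaning)
    LOG_AND_EMIT = "log_and_emit"  # publish and only log the failure


# Hypothetical action names for public surfaces that route through the second pass.
REVIEWED_ACTIONS = frozenset(
    {"channel_post", "issue_create", "pr_description", "canvas_write"}
)


@dataclass(frozen=True)
class Outcome:
    text: Optional[str]            # None means nothing is published
    reviewed: bool                 # True only if the aux check actually ran
    warning: Optional[str] = None


def emit(
    action: str,
    draft: str,
    check: Callable[[str], str],
    on_failure: FailureMode = FailureMode.BLOCK,
) -> Outcome:
    """Route a draft through the aux check when the action is in scope."""
    if action not in REVIEWED_ACTIONS:
        # Exempt surface, e.g. an ephemeral reply in a private thread.
        return Outcome(draft, reviewed=False)
    try:
        return Outcome(check(draft), reviewed=True)
    except Exception as exc:
        log.warning("aux policy check failed for %s: %s", action, exc)
        if on_failure is FailureMode.BLOCK:
            return Outcome(None, reviewed=False, warning=str(exc))
        if on_failure is FailureMode.WARN:
            return Outcome(draft, reviewed=False, warning=str(exc))
        return Outcome(draft, reviewed=False)
```

Defaulting `on_failure` to `BLOCK` is a fail-closed choice made for this sketch; the issue leaves the actual default open.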

Action taken on behalf of David Cramer.
