
feat(rlvr): environment-grounded vaccine verification layer #169

Closed
VoidChecksum wants to merge 20 commits into feat/web-dashboard from feat/rlvr-only

@VoidChecksum
Collaborator

Stacked on #168 — merges cleanly once the web PR lands. Targeting feat/web-dashboard so the diff shows only the RLVR Python changes.

Summary

Replaces the LLM-judged vaccine loop outcome with an environment-grounded verification pipeline. The defender agent can no longer self-report a successful defense — the sandbox itself decides.

  • Pre-defense baseline gate — PoC is run before defenses are applied. If the exploit never triggers, the finding is INVALID_SPEC → reward 0.0. Defenses can no longer earn reward for blocking something that was never exploitable.
  • N-run consensus — ExploitSpec.runs (1–10) + min_success_rate (0.0–1.0). The exploit must succeed on at least min_success_rate of the runs to count as working. A flaky trigger that fires on only one run in three yields PARTIAL, not the full 1.0 reward.
  • Impact-pattern confirmation — separate impact_patterns field (e.g. uid=0, root@, data-exfil regexes). Confirms actual exploitation impact, not just that the payload fired. Unconfirmed impact applies a 0.7 confidence multiplier.
  • CVSS 3.1 base score — auto-derived from TargetCheck types and impact patterns using the full CVSS 3.1 base-score formula. Emitted on RLVRReward.cvss_score.
  • Fingerprint deduplication — SHA-256(poc_command + success_patterns + target_host) keyed in rlvr/dedup.jsonl. Duplicate specs yield ERROR reward, preventing the same finding from inflating signal.
  • Inconclusive detection — contradicting TargetChecks (port closed + service reachable for same host) → ERROR instead of forced BLOCKED/PASSED.
  • ZFP (zero-false-positive) demotion — if the negative control command matches the success patterns, the entire result is demoted to ERROR. This check already existed; it now runs after N-run consensus.

New schemas

| Type | Purpose |
| --- | --- |
| PoCRunResult | Single PoC run result with per-run signal matching |
| PoCConsensus | N-run aggregate: success_rate, agreed_signals, zfp_demoted |
| BaselineEvidence | Pre-defense PoC confirming spec validity |
| CVSSEstimate | CVSS 3.1 components + base_score + vector_string |

ExploitSpec new fields

runs: int = 1                    # consensus run count (recommend 3)
min_success_rate: float = 1.0   # success fraction (0.67 for 2-of-3)
impact_patterns: list[str] = [] # actual-impact evidence regexes
target_host: str | None = None  # dedup fingerprint anchor
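
A rough sketch of how the documented ranges for these fields could be enforced. It uses a plain dataclass instead of the project's real schema class; `ExploitSpecSketch` and its `__post_init__` checks are illustrative assumptions, and only the field names, defaults, and ranges come from this PR.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExploitSpecSketch:
    """Hypothetical stand-in for ExploitSpec; not the project's real class."""

    poc_command: str
    success_patterns: list[str]
    runs: int = 1                      # consensus run count (1-10)
    min_success_rate: float = 1.0      # required success fraction (0.0-1.0)
    impact_patterns: list[str] = field(default_factory=list)
    target_host: Optional[str] = None  # dedup fingerprint anchor

    def __post_init__(self) -> None:
        if not self.success_patterns:
            raise ValueError("success_patterns needs at least one regex")
        if not 1 <= self.runs <= 10:
            raise ValueError("runs must be between 1 and 10")
        if not 0.0 <= self.min_success_rate <= 1.0:
            raise ValueError("min_success_rate must be between 0.0 and 1.0")
        # Fail early on patterns that would not compile at verification time.
        for pattern in [*self.success_patterns, *self.impact_patterns]:
            re.compile(pattern)


# Recommended 2-of-3 configuration from the comments above.
spec = ExploitSpecSketch(
    poc_command="curl -s http://target.example/vuln",  # placeholder command
    success_patterns=[r"uid=\d+"],
    runs=3,
    min_success_rate=0.67,
    impact_patterns=[r"uid=0"],
    target_host="target.example",
)
```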

Reward tiers

| Outcome | Reward | When |
| --- | --- | --- |
| BLOCKED | 1.0 | Exploit fails, no success signals |
| PARTIAL | 0.5 | Success rate below threshold, or some env checks flipped |
| PASSED | 0.0 | Exploit still works post-defense |
| ERROR | 0.0 | ZFP demotion / invalid spec / inconclusive / duplicate |
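
The tiers above can be read as a small decision function. The sketch below is one possible reading, not the code in `env_verifier.py`: the names `Outcome`, `REWARDS`, and `decide_outcome` are hypothetical, "no success signals" is interpreted as a success rate of zero, and the precedence of ERROR over the other tiers is an assumption.

```python
from enum import Enum


class Outcome(str, Enum):
    BLOCKED = "BLOCKED"
    PARTIAL = "PARTIAL"
    PASSED = "PASSED"
    ERROR = "ERROR"


# Reward values copied from the reward-tier table.
REWARDS = {
    Outcome.BLOCKED: 1.0,
    Outcome.PARTIAL: 0.5,
    Outcome.PASSED: 0.0,
    Outcome.ERROR: 0.0,
}


def decide_outcome(
    *,
    success_rate: float,
    min_success_rate: float,
    some_checks_flipped: bool = False,
    zfp_demoted: bool = False,
    invalid_spec: bool = False,
    inconclusive: bool = False,
    duplicate: bool = False,
) -> Outcome:
    """Map post-defense signals to an outcome tier."""
    # Assumption: any error condition overrides the other tiers.
    if zfp_demoted or invalid_spec or inconclusive or duplicate:
        return Outcome.ERROR
    # Assumption: "no success signals" means no run succeeded.
    if success_rate == 0.0:
        return Outcome.BLOCKED
    if success_rate < min_success_rate or some_checks_flipped:
        return Outcome.PARTIAL
    return Outcome.PASSED
```

Under this reading, a 2-of-3 result is PARTIAL against the default threshold of 1.0 and PASSED against a threshold of 0.5, which matches the two consensus cases listed in the test coverage for this PR.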

Files changed (Python only)

  • decepticon/schemas/exploit_spec.py — new fields
  • decepticon/schemas/env_verification.py — new schema types
  • decepticon/core/env_verifier.py — full verification pipeline
  • decepticon/tools/research/exploit_spec_writer.py — @tool exploit_spec_register
  • decepticon/orchestrator.py — Phase 4 env-grounded path + legacy fallback
  • decepticon/core/engagement_loop.py — PRE_DEFENSE snapshot + env-grounded path
  • decepticon/agents/{exploit,recon,scanner}.py — tool registration
  • decepticon/agents/prompts/{exploit,recon,scanner}.md — agent guidance
  • tests/unit/core/test_env_verifier.py — 13 tests (8 original + 5 new)

Test plan

  • uv run pytest tests/unit/core/test_env_verifier.py -v — 13/13 pass
  • uv run ruff check decepticon/ — clean
  • uv run basedpyright decepticon/ — 0 errors
  • Integration: run engagement with VACCINE_USE_ENV_VERIFIER=1, confirm workspace/rlvr/rewards.jsonl populated after vaccine phase

…ts, container hotswap

Web UI:
- Add engagement timeline page with live event stream
- Add command palette (cmd+k) for keyboard navigation
- Add streaming agent detail panel with real-time tool call inspection
- Add live activity feed component for engagement monitoring
- Add Opplan live overlay for plan visualization
- Add health API endpoint for container readiness checks
- Add engagement export API endpoint
- Add engagement threads API endpoint
- Add engagement timeline API endpoint
- Refactor web terminal component with improved PTY handling
- Refactor dashboard pages (engagements, graph, settings, main)
- Add keyboard shortcut in sidebar

Containers:
- Refactor web.Dockerfile for streamlined build layers
- Refactor web-entrypoint.sh with healthcheck awareness
- Add web-hotswap.sh for zero-downtime container swap

Backend:
- Refactor Docker sandbox backend for resource lifecycle

Skills:
- Add stealth-infra shared skill

Build:
- Update Makefile targets
- Update benchmark validation config
…b switch

Two bugs fixed:

1. Terminal reset / task stop on tab switch:
   - Created EngagementContext + EngagementProvider — persists observer state
     across Next.js route navigation within an engagement
   - Lifted useRunObserver from LivePage into engagement layout so
     events, isRunning, and activeRunId survive tab switches
   - WebTerminal now rendered at layout level with CSS width control:
     35% width on /live, 0 width (but still mounted) on other tabs
   - PTY connection stays alive; observer continues collecting events
     even while user is on Findings/Graph/Timeline tabs

2. Stuck "Processing" indicator in AgentDetailPanel:
   - Added STALENESS_THRESHOLD_MS (15s) staleness detection
   - deriveStatus now checks event.elapsed — if the most recent event
     is older than 15s and not followed by subagent_end, status
     degrades to "idle" instead of stuck "processing"

Architecture: engagement/[id]/layout.tsx now fetches engagement data
+ plan-docs, runs the persistent observer, and hosts the terminal.
LivePage consumes from context — only renders activity feed + graph.

Two-part fix for objectives stuck showing "Running" indefinitely:

1. Startup recovery — _recover_stale_objectives() scans the OPPLAN on
   engagement loop startup and resets any IN_PROGRESS objectives back
   to PENDING. Covers the crash/restart scenario: the loop marks an
   objective IN_PROGRESS, invokes the agent, then the process dies
   before writing COMPLETED/BLOCKED. Without recovery,
   _next_pending_objective() only considers PENDING objectives, so the
   orphan is never retried.

2. Exception safety — the agent invocation (_invoke_agent) is now
   wrapped in try/except. If the agent crashes (API error, timeout,
   unhandled exception), the objective transitions to BLOCKED instead
   of being left at IN_PROGRESS. An IterationResult is synthesized
   so the iteration history stays consistent. KeyboardInterrupt is
   still propagated for clean shutdown.
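
The two parts of this fix can be sketched as follows. This is a simplified model, not the engagement loop itself: `Objective`, `Status`, `recover_stale_objectives`, and `run_objective` are hypothetical names, the lowercase values for completed and blocked are assumed, and the real loop also synthesizes an IterationResult, which is omitted here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"
    BLOCKED = "blocked"


@dataclass
class Objective:
    id: str
    status: Status = Status.PENDING


def recover_stale_objectives(objectives: list[Objective]) -> int:
    """Reset orphaned IN_PROGRESS objectives to PENDING; return the count."""
    recovered = 0
    for obj in objectives:
        if obj.status is Status.IN_PROGRESS:
            obj.status = Status.PENDING
            recovered += 1
    return recovered


def run_objective(obj: Objective, invoke_agent: Callable[[Objective], None]) -> Status:
    """Run one objective without ever leaving it at IN_PROGRESS on a crash."""
    obj.status = Status.IN_PROGRESS
    try:
        invoke_agent(obj)
    except Exception:
        # KeyboardInterrupt derives from BaseException, so it is not caught
        # here and still propagates for a clean shutdown.
        obj.status = Status.BLOCKED
    else:
        # Simplification: a normal return is treated as completion.
        obj.status = Status.COMPLETED
    return obj.status
```

If the process is interrupted mid-run, the objective stays at IN_PROGRESS on disk, which is exactly the case the startup recovery pass picks up on the next launch.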
… API

The engagement loop fix (bbce42d) recovers stale objectives on startup,
but only if the loop actually starts. If the engagement is complete or
the loop is never restarted, the stale "in-progress" status persists in
opplan.json and the UI shows "Running" indefinitely.

This adds a read-time staleness check in the /api/engagements/[id]/opplan
route: if opplan.json hasn't been modified in 10 minutes, any objectives
still marked "in-progress" are downgraded to "pending" in the response.
The file on disk is NOT mutated — the loop owns writes; the API only
sanitizes the display.
…ploitation

LLM prompt injection testing (OBJ-008) and similar iterative web
exploitation workflows require many graph steps:
- schema discovery → payload crafting → request → response analysis
- each iteration burns ~3-5 steps

At 400 steps the exploit agent exhausts its budget before completing
legitimate multi-step exploitation objectives.

Raise exploit agent recursion_limit to 1000 to accommodate:
- prompt injection fuzzing
- multi-step web exploitation
- protocol abuse testing

Fixes #127
… responses)

Multiple int() calls in the research/reporting/middleware stack lacked
ValueError/TypeError guards when parsing externally-sourced data:

- kg_ingest_ffuf (tools.py): int(row.get('status') or 0) — the 'or 0'
  pattern is NOT a safety net; a non-empty HTML string is truthy, so
  int() receives the raw string and crashes.
- rank_candidates (scanner_tools.py): int(hit.get('line', 0))
- _top_chains (executive.py): int(node.props.get('length', 0))
- opplan.py: int(parent.get('priority', 100)) and child equivalent

When the LLM endpoint returns HTML instead of JSON (e.g. WAF block,
error page, schema mismatch), agent-generated code or ingestion tools
may pass HTML content into fields expected to be numeric. Without
error handling, the entire agent loop crashes.

Changes:
- Wrap all int() calls on externally-sourced data in try/except
- Fall back to sensible defaults (0, 100, etc.)
- Log warnings where appropriate
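
A minimal sketch of the guarded parse described above. `safe_int` is a hypothetical helper name; the PR wraps each call site in try/except and may not use a shared function.

```python
import logging

logger = logging.getLogger(__name__)


def safe_int(value: object, default: int = 0) -> int:
    """Parse externally sourced data as int, falling back to a default."""
    try:
        return int(value)  # type: ignore[call-overload]
    except (ValueError, TypeError):
        logger.warning("expected an integer, got %r; using %d", value, default)
        return default


# Why `or 0` is not a safety net: a non-empty HTML string is truthy,
# so it passes straight through to int().
row = {"status": "<html>403 Forbidden</html>"}
raw = row.get("status") or 0   # still the HTML string
status = safe_int(raw)         # falls back to 0 instead of raising
```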

Fixes #129
…alls

Local models (Ollama, qwen3-coder, etc.) occasionally produce
malformed JSON for tool call arguments:
- 'options' as a JSON string instead of a JSON array
- 'header' longer than max_length=12

Add BeforeValidator coercers that silently normalize these patterns
so the engagement flow stays alive instead of dropping into a bare
ask_user_question error loop.
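
The two malformed patterns above could be normalized with plain functions like the ones below, which would then be attached to the relevant fields through Pydantic's `BeforeValidator`. The function names are hypothetical, and truncating an over-long `header` is an assumed normalization; the commit message does not say how that case is handled.

```python
import json
from typing import Any

HEADER_MAX_LENGTH = 12  # max_length quoted in the commit message


def coerce_options(value: Any) -> Any:
    """Decode `options` when it arrives as a JSON-encoded string."""
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            return value  # leave it for normal validation to reject
        if isinstance(decoded, list):
            return decoded
    return value


def coerce_header(value: Any) -> Any:
    """Assumed behavior: clip an over-long header instead of failing."""
    if isinstance(value, str) and len(value) > HEADER_MAX_LENGTH:
        return value[:HEADER_MAX_LENGTH]
    return value
```

Both functions pass through anything they do not recognize, so genuinely invalid input still reaches the normal validation error path.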
- skills.py: add log warnings to silent skill-load failures (L157/174/374)
- complete_planning.py: add BeforeValidator coercer for engagement_name
  (empty→fallback, whitespace strip, >64 chars truncate)
- research/tools.py: wrap unprotected json.loads in try/except
- opplan after_model: block parallel objective_expand/collapse (race condition fix)
- DockerConfig.stall_seconds: 3.0→5.0 (reduce false-stall aborts on slow network scans)
- decepticon.py: add # noqa PLC0415 with lazy-load rationale comment
- complete_planning: add 7 unit tests for _sanitize_engagement_name
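
The three engagement_name rules listed above (empty to fallback, whitespace strip, truncate beyond 64 characters) fit in a few lines. This is a sketch only: the fallback value `"engagement"` is a placeholder, and the real `_sanitize_engagement_name` may differ in detail.

```python
MAX_NAME_LENGTH = 64
FALLBACK_NAME = "engagement"  # hypothetical; the real fallback is not shown in this PR


def sanitize_engagement_name(value: object) -> str:
    """Strip whitespace, substitute a fallback for empty input, cap the length."""
    name = value.strip() if isinstance(value, str) else ""
    if not name:
        return FALLBACK_NAME
    return name[:MAX_NAME_LENGTH]
```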
5 new middleware modules that eliminate manual work and enforce quality:

AutoContextMiddleware
  Auto-injects engagement state (workspace, scope, progress, findings)
  into every model call. Agent never manually writes context in task().

RoEGuardMiddleware
  Intercepts task() calls, extracts target domains/IPs, cross-references
  with roe.json scope patterns. Blocks out-of-scope delegations before
  they reach sub-agents. Scope patterns are cached for 5 minutes.
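
As an illustration of the scope check, the sketch below extracts IPv4 addresses and domain names from a task description and tests them against scope patterns. Everything here is assumed: the PR does not show the roe.json format, so scope is modeled as a flat list of domain globs and CIDR ranges, and the function names are hypothetical. The 5-minute cache is left out.

```python
import fnmatch
import ipaddress
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def extract_targets(text: str) -> set[str]:
    """Collect IPv4-looking strings and domain names mentioned in the text."""
    ips = set(IP_RE.findall(text))
    domains = {d.lower() for d in DOMAIN_RE.findall(text)}
    return ips | domains


def in_scope(target: str, scope_patterns: list[str]) -> bool:
    """True if the target matches a CIDR range or a domain glob."""
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        addr = None
    for pattern in scope_patterns:
        if addr is not None:
            try:
                if addr in ipaddress.ip_network(pattern, strict=False):
                    return True
            except ValueError:
                continue  # pattern is a domain glob, not a network
        elif fnmatch.fnmatchcase(target.lower(), pattern.lower()):
            return True
    return False


def out_of_scope_targets(task_text: str, scope_patterns: list[str]) -> list[str]:
    """Targets named in the task that no scope pattern covers."""
    return sorted(t for t in extract_targets(task_text) if not in_scope(t, scope_patterns))
```

A guard built on this would block the delegation whenever `out_of_scope_targets` returns a non-empty list.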

FindingGuardMiddleware
  Zero false positive enforcement via 5-method verification:
  1. Evidence check (code block/HTTP trace/tool output)
  2. Reproducibility (steps/PoC)
  3. Impact statement
  4. Anti-speculation (no hedging language)
  5. Severity-impact alignment (critical requires demonstrated exploitation)
  + Content hash dedup against existing findings

BashIntelMiddleware
  Post-processes bash tool output: extracts open ports (nmap), HTTP
  status codes, tech stack headers, version strings, error indicators.
  Injects compact intel summary above raw output.
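
A small sketch of the post-processing idea, limited to open ports and HTTP status codes. The regexes, the `[intel]` marker, and the function names are assumptions made for illustration; the middleware's actual extraction rules and output format are not shown in this PR.

```python
import re

# Matches nmap port-table rows such as "22/tcp   open   ssh".
NMAP_OPEN_RE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)", re.MULTILINE)
# Matches HTTP status lines such as "HTTP/1.1 403 Forbidden" or "HTTP/2 200".
HTTP_STATUS_RE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})\b", re.MULTILINE)


def summarize(output: str) -> str:
    """Build a compact summary of the recognized signals, or '' if none."""
    ports = [f"{port}/{proto} ({svc})" for port, proto, svc in NMAP_OPEN_RE.findall(output)]
    statuses = sorted(set(HTTP_STATUS_RE.findall(output)))
    lines = []
    if ports:
        lines.append("open ports: " + ", ".join(ports))
    if statuses:
        lines.append("http status: " + ", ".join(statuses))
    return "\n".join(lines)


def annotate(output: str) -> str:
    """Place the summary above the raw output; leave the output alone if empty."""
    summary = summarize(output)
    return f"[intel]\n{summary}\n\n{output}" if summary else output
```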

SmartRetryMiddleware
  On BLOCKED objectives, cross-references failure reason against
  bypass technique knowledge base. Injects alternative approach
  suggestions (parameter splitting, encoding bypass, JWT confusion...).

+ build_resume_briefing() for one-shot engagement restart context.
+ 10 bypass technique categories with 2-3 hints each.

Integrated into decepticon orchestrator + recon agent.

- docker_sandbox.py: align STALL_SECONDS constant (3.0→5.0) with DockerConfig default to fix test_constants_match_config_defaults
- subagent_streaming.py: guard tool-call id key lookup with None check; widen active_tool_calls type to dict[str, Any]
- llm/factory.py: widen kwargs type annotation to dict[str, Any], eliminating ~80 suppressed pyright warnings
- tools/ad/bloodhound.py: fix _build_bh_index to correctly iterate graph.nodes.values() and return dict[str, Node]
- tools/research/tools.py: fix ChainStep field access (node_kind, crown_jewel_label, entrypoint_label) — was crashing on AttributeError at runtime

Replaces LLM-judged VerificationResult with environment-grounded
verification that produces a scalar RLVR reward from raw system
signals — no LLM in the verification path.

## Motivation

The vaccine loop (attack → defense → re-attack) previously trusted
the defender agent to self-report re_attack_outcome. This is
gameable and produces noisy reward signal. The fix: use the target
system environment itself as the verifier.

## Architecture

### ExploitSpec (decepticon/schemas/exploit_spec.py)
Machine-readable replay spec written by offensive agents at
finding-discovery time:
- poc_command: exact shell command reproducing the exploit
- success_patterns: regexes proving exploit succeeded (min 1)
- negative_command: ZFP baseline (optional)
- target_checks: discriminated union of PortCheck / ServiceCheck /
  CredentialCheck / CommandOutputCheck / FileCheck probes

### EnvironmentVerifier (decepticon/core/env_verifier.py)
Independent verifier — no LLM:
1. capture_state() runs all target_checks pre/post defense
2. verify_blocked() replays poc_command with ZFP demotion
3. Outcome determined from signal table:
   - zfp_demoted → ERROR
   - no success signals → BLOCKED (reward 1.0)
   - signals + all checks still positive → PASSED (reward 0.0)
   - signals + some checks flipped → PARTIAL (reward 0.5)
4. compute_reward() → RLVRReward scalar
5. Appends to workspace/rlvr/rewards.jsonl (training stream)
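
Two mechanical pieces of the steps above lend themselves to a short sketch: finding which checks flipped between the pre- and post-defense snapshots, and appending a reward record to the JSONL stream. The snapshot shape (check name mapped to a boolean) and the record fields are assumptions; only the `rlvr/rewards.jsonl` path comes from this PR.

```python
import json
from pathlib import Path


def flipped_checks(pre: dict[str, bool], post: dict[str, bool]) -> list[str]:
    """Checks that were positive before the defense and are not afterwards."""
    return sorted(name for name, ok in pre.items() if ok and not post.get(name, False))


def append_reward(workspace: Path, record: dict) -> Path:
    """Append one reward record as a single JSON line; never rewrite the file."""
    path = workspace / "rlvr" / "rewards.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

Opening the file in append mode keeps the stream append-only, and one JSON object per line means each record can be validated on its own during a round-trip read.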

### Workspace layout
  workspace/
    findings/FIND-001-exploit-spec.json   ← offensive agent writes
    verification/FIND-001-pre-snapshot.json
    verification/FIND-001-post-snapshot.json
    verification/FIND-001-evidence.json
    rlvr/rewards.jsonl                    ← append-only RLVR stream

### Backward compatibility
ExploitSpec missing for a finding → falls back to legacy
_load_verification_result (LLM-written JSON). Gated by
VACCINE_USE_ENV_VERIFIER env var (default on).

### exploit_spec_register tool
LangChain @tool added to exploit, recon, scanner agents. Offensive
agents call it after writing FIND-NNN.md to register a self-contained
spec for env-grounded verification.

## Tests
8 new async unit tests (pytest-asyncio auto mode):
- pre/post defense reward transitions (PASSED→BLOCKED)
- ZFP demotion → ERROR
- PARTIAL reward from partial check flips
- rewards.jsonl append and round-trip JSON validity
- snapshot/evidence persistence
- spec load/missing round-trips

## Files changed
New:
  decepticon/schemas/exploit_spec.py
  decepticon/schemas/env_verification.py
  decepticon/core/env_verifier.py
  decepticon/tools/research/exploit_spec_writer.py
  tests/unit/core/test_env_verifier.py

Modified:
  decepticon/orchestrator.py        — Phase 4 uses _verify_finding()
  decepticon/core/engagement_loop.py — pre-snapshot before defender,
                                       _verify_finding_env() post-defender
  decepticon/agents/exploit.py      — exploit_spec_register added
  decepticon/agents/recon.py        — exploit_spec_register added
  decepticon/agents/scanner.py      — exploit_spec_register added
  decepticon/agents/prompts/*.md    — vaccine loop instructions added
# Conflicts:
#	clients/web/src/app/(dashboard)/page.tsx
#	containers/web.Dockerfile
#	decepticon/backends/docker_sandbox.py
#	decepticon/core/engagement_loop.py
#	decepticon/middleware/engagement.py
#	decepticon/middleware/opplan.py
#	decepticon/orchestrator.py
#	decepticon/tools/interaction/ask_user.py
#	decepticon/tools/research/scanner_tools.py
#	tests/unit/tools/test_ask_user_question.py
…t patterns, dedup

Upgrades EnvironmentVerifier from single-run binary verification to a
multi-signal triager-grade pipeline. Eliminates the four primary sources
of false positives that would cause a security triager to reject findings.

## New verification pipeline (verify_blocked)

1. **Duplicate fingerprint gate** — SHA-256(poc_command + success_patterns +
   target_host) keyed in rlvr/dedup.jsonl. Duplicate specs yield ERROR reward
   immediately, preventing the same finding from inflating reward signal.

2. **Baseline validity gate** — verify_baseline() runs the PoC BEFORE defenses
   are applied and confirms the exploit actually works. If baseline.valid=False
   (exploit never triggered), the finding is INVALID_SPEC → ERROR. Defenses can
   no longer earn reward for "blocking" exploits that were never exploitable.

3. **N-run consensus** — ExploitSpec gains `runs` (1–10) and `min_success_rate`
   (0.0–1.0). Each run is independent; agreed_signals = intersection across all
   successful runs. Flaky exploits that only work 1/3 times yield PARTIAL, not
   the full 1.0 reward. ZFP check runs once after consensus, not per-run.

4. **Impact pattern confirmation** — ExploitSpec gains `impact_patterns` (separate
   from trigger patterns). Patterns like `uid=0`, `root@`, data-exfil regexes
   confirm ACTUAL IMPACT, not just "exploit ran". Unconfirmed impact lowers
   confidence multiplier to 0.7 when impact_patterns are declared.

5. **CVSS 3.1 base score estimation** — _estimate_cvss() derives AV/AC/PR/UI/
   Scope/CIA from TargetCheck types and impact_patterns heuristics. Emitted on
   RLVRReward.cvss_score. Triagers can threshold on CVSS ≥ 7.0.

6. **Inconclusive detection** — _check_inconclusive() flags contradictions where
   a PortCheck says closed but a ServiceCheck for the same host says reachable.
   Yields ERROR reward instead of forcing a wrong BLOCKED/PASSED call.
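
The consensus step (item 3) can be sketched as below. It operates on already-captured PoC outputs so that it stays self-contained. Treating a run as successful when at least one success pattern matches is an assumption, and `ConsensusSketch` and `build_consensus` are hypothetical names that borrow the field names listed for PoCConsensus.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ConsensusSketch:
    n_runs: int
    n_success: int
    success_rate: float
    agreed_signals: list[str] = field(default_factory=list)


def matched_signals(output: str, patterns: list[str]) -> set[str]:
    """Success patterns that match somewhere in one run's output."""
    return {p for p in patterns if re.search(p, output)}


def build_consensus(outputs: list[str], success_patterns: list[str]) -> ConsensusSketch:
    per_run = [matched_signals(out, success_patterns) for out in outputs]
    # Assumption: a run succeeds if any success pattern matched.
    successful = [signals for signals in per_run if signals]
    # agreed_signals = intersection across all successful runs.
    agreed = set.intersection(*successful) if successful else set()
    rate = len(successful) / len(outputs) if outputs else 0.0
    return ConsensusSketch(
        n_runs=len(outputs),
        n_success=len(successful),
        success_rate=rate,
        agreed_signals=sorted(agreed),
    )


# Three runs of a flaky exploit: two succeed, one is refused.
consensus = build_consensus(
    ["uid=0(root) gid=0(root)", "connection refused", "uid=0(root) shell ready"],
    [r"uid=0", r"shell ready"],
)
```

Here two of three runs succeed, so the success rate is 2/3, and only `uid=0` appears in every successful run. Comparing that rate against `min_success_rate` then decides between PARTIAL and PASSED.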

## New schemas

- `PoCRunResult` — single run result with per-run signal matching
- `PoCConsensus` — N-run aggregate: n_runs, n_success, success_rate, agreed_signals
- `BaselineEvidence` — pre-defense PoC result confirming spec validity
- `CVSSEstimate` — CVSS 3.1 components + computed base_score + vector_string
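
For reference, the base_score and vector_string that `CVSSEstimate` carries follow from the public CVSS v3.1 base-score equations, sketched below with the metric weights from that specification. How `_estimate_cvss()` chooses the metric values from TargetCheck types and impact patterns is not shown in this PR and is not reproduced here; the function names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest one-decimal number >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(av: str, ac: str, pr: str, ui: str, scope: str, c: str, i: str, a: str) -> float:
    changed = scope == "C"
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10.0))


def vector_string(av: str, ac: str, pr: str, ui: str, scope: str, c: str, i: str, a: str) -> str:
    return f"CVSS:3.1/AV:{av}/AC:{ac}/PR:{pr}/UI:{ui}/S:{scope}/C:{c}/I:{i}/A:{a}"
```

A network-reachable, unauthenticated exploit with high confidentiality, integrity, and availability impact scores 9.8 under these equations, well above the 7.0 threshold mentioned for triage.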

## ExploitSpec additions

- `runs: int = 1` — consensus run count (recommend 3 for reliable findings)
- `min_success_rate: float = 1.0` — success fraction threshold (0.67 for 2/3)
- `impact_patterns: list[str] = []` — actual impact evidence regexes
- `target_host: str | None` — dedup fingerprint anchor
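
Since `target_host` anchors the dedup fingerprint, here is a sketch of how the fingerprint and the `rlvr/dedup.jsonl` lookup could work. The PR names the three hashed inputs but not how they are serialized, so the NUL-joined encoding, the record fields, and the function names below are all assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def fingerprint(poc_command: str, success_patterns: list[str], target_host: Optional[str]) -> str:
    """SHA-256 over poc_command + success_patterns + target_host."""
    # Assumption: fields are joined with a NUL separator before hashing.
    parts = [poc_command, *success_patterns, target_host or ""]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


def register(dedup_path: Path, finding_id: str, fp: str) -> Optional[str]:
    """Return the finding id that already owns fp, or None after recording it."""
    if dedup_path.exists():
        for line in dedup_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            if entry["fingerprint"] == fp:
                return entry["finding_id"]
    dedup_path.parent.mkdir(parents=True, exist_ok=True)
    with dedup_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"fingerprint": fp, "finding_id": finding_id}) + "\n")
    return None
```

A non-None return corresponds to the duplicate case, where the verifier would set duplicate_of and emit an ERROR reward instead of running the PoC again.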

## Backward compatibility

- Legacy `PoCEvidence` field on VerificationEvidence still populated (best-run)
- Legacy `_determine_outcome` preserved for existing call sites
- verify_blocked signature unchanged (baseline param is optional)
- All 8 existing tests pass unchanged; 5 new tests added (13 total)

## Test coverage added

- N-run consensus 2/3 → PARTIAL (below min_success_rate)
- N-run consensus 2/3 → PASSED (meets 0.5 threshold, signals still present)
- Baseline invalid → ERROR propagation through verify_blocked
- Impact patterns → evidence.impact_signals_matched + reward.impact_confirmed
- Duplicate detection → second registration yields duplicate_of + ERROR
VoidChecksum requested a review from PurpleCHOIms as a code owner on May 6, 2026 07:45
VoidChecksum added the enhancement (New feature or request) label on May 6, 2026
PurpleCHOIms deleted the branch feat/web-dashboard on May 9, 2026 07:56