feat(tools/research): semantic finding deduplication (report-only, opt-in)#437

Open
VoidChecksum wants to merge 1 commit into main from feat/semantic-finding-dedupe

What

Adds a semantic finding deduplication layer to the research tools. It is additive, opt-in, and report-only — no existing flow changes behavior, and no nodes are deleted or mutated.

Why

The knowledge graph deduplicates writes only by deterministic node ID (SHA1(kind + canonical key); see decepticon_core/types/kg.py and tools/research/neo4j_store.py). The multi-agent design (recon / exploit / analyst / vulnresearch, each running with fresh context) structurally produces separate nodes for the same underlying vulnerability when it is reached via a different route or described with different wording. Those near-duplicates survive ID-based dedup and surface as repeated entries in reports. Peers (Strix report/dedupe.py, RedAmon, CAI) ship an LLM-judge dedup layer; this adds a clean, pure, testable equivalent.
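To illustrate why ID-based dedup misses these cases, here is a minimal sketch. The `node_id` helper and the key strings are hypothetical; the real key construction lives in decepticon_core/types/kg.py and may differ.

```python
import hashlib


def node_id(kind: str, canonical_key: str) -> str:
    """Hypothetical stand-in for the deterministic ID: SHA1 over kind + key."""
    return hashlib.sha1(f"{kind}:{canonical_key}".encode("utf-8")).hexdigest()


# The same kind and key always map to the same ID, so exact re-writes collapse.
a = node_id("vulnerability", "sqli:app.example.com/login")
b = node_id("vulnerability", "sqli:app.example.com/login")

# The same bug keyed with different wording gets an unrelated ID,
# so it is stored as a separate node.
c = node_id("vulnerability", "sql injection in login form on app.example.com")

print(a == b, a == c)
```

Any hash-based ID behaves this way: it catches byte-identical keys only, which is the gap the semantic layer targets.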

How

New module packages/decepticon/decepticon/tools/research/dedupe.py:

  • find_duplicate(candidate, existing, judge) -> DuplicateVerdict — a pure function. The judge is an injected callable (a, b) -> dict, so tests stub it and no LLM client is imported in the module. A cheap deterministic pre-filter (same NodeKind + normalized host / endpoint / CWE overlap) runs before the judge, so non-candidates are rejected without invoking it. DuplicateVerdict is a small frozen dataclass (is_duplicate: bool, canonical_id: str | None, reason: str).
  • kg_dedupe_findings @tool — walks Finding/Vulnerability nodes, groups likely-duplicates via the pre-filter (union-find over pairwise pre-filter hits), and returns a structured JSON summary of duplicate clusters. It is report-only: it never deletes or mutates nodes. It defaults to the deterministic pre-filter (no LLM wiring) to stay simple and green; a live judge is a clean follow-up via find_duplicate's injectable seam.
  • Registered in RESEARCH_TOOLS exactly like the sibling kg_* tools.
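A rough sketch of the find_duplicate shape described above, not the code in this PR. The `Finding` class, the `prefilter` and `_norm` helpers, the "any shared field" overlap rule, and the judge's dict keys (`is_duplicate`, `reason`) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a KG node; the real node type differs."""
    id: str
    kind: str
    host: str = ""
    endpoint: str = ""
    cwe: str = ""


@dataclass(frozen=True)
class DuplicateVerdict:
    is_duplicate: bool
    canonical_id: Optional[str]
    reason: str


Judge = Callable[[Finding, Finding], dict]


def _norm(value: str) -> str:
    # Assumed normalization: trim, lowercase, drop a trailing slash.
    return value.strip().lower().rstrip("/")


def prefilter(a: Finding, b: Finding) -> bool:
    """Cheap deterministic check: same kind plus a shared host, endpoint or CWE."""
    if a.kind != b.kind:
        return False
    pairs = ((a.host, b.host), (a.endpoint, b.endpoint), (a.cwe, b.cwe))
    return any(_norm(x) != "" and _norm(x) == _norm(y) for x, y in pairs)


def find_duplicate(
    candidate: Finding, existing: Iterable[Finding], judge: Judge
) -> DuplicateVerdict:
    for other in existing:
        if other.id == candidate.id or not prefilter(candidate, other):
            continue  # the judge is never invoked for non-candidates
        answer = judge(candidate, other)
        if answer.get("is_duplicate"):
            return DuplicateVerdict(True, other.id, str(answer.get("reason", "")))
    return DuplicateVerdict(False, None, "no existing finding matched")


def stub_judge(a: Finding, b: Finding) -> dict:
    # Stands in for an LLM judge; always agrees with the pre-filter.
    return {"is_duplicate": True, "reason": "same host and CWE"}


f1 = Finding("n1", "vulnerability", host="App.Example.com", cwe="CWE-89")
f2 = Finding("n2", "vulnerability", host="app.example.com", cwe="cwe-89")
f3 = Finding("n3", "vulnerability", host="db.example.com", cwe="CWE-79")

verdict = find_duplicate(f2, [f1, f3], stub_judge)
```

Because the judge is passed in as a plain callable, the module needs no LLM import and the decision logic can be exercised with a stub like the one above.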

Scope / isolation

  • Does not touch reporting renderers or existing dedup/merge code.
  • tools.py change is 2 lines (one import + one list entry).
  • No package-lock.json / uv.lock changes.

Tests

packages/decepticon/tests/unit/research/test_dedupe.py (13 tests) builds fixture KnowledgeGraphs — two findings describing the same bug (same host + CWE, different wording) and two clearly-distinct findings — and, with a fake judge, asserts:

  • the pre-filter pairs the same-host+CWE candidates and not the distinct ones (and skips the judge entirely for non-candidates, verified with an exploding judge);
  • find_duplicate returns is_duplicate True/False correctly with the right canonical_id/reason;
  • kg_dedupe_findings returns the expected cluster summary and does not mutate the graph.
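The "exploding judge" pattern from the first bullet can be sketched as below. This is an illustration of the testing technique only: `same_bucket`, `is_duplicate`, and the dict-shaped findings are hypothetical stand-ins, not the fixtures in test_dedupe.py.

```python
def same_bucket(a: dict, b: dict) -> bool:
    """Hypothetical pre-filter stand-in: same kind, host and CWE (case-insensitive)."""
    return all(a[k].lower() == b[k].lower() for k in ("kind", "host", "cwe"))


def is_duplicate(a: dict, b: dict, judge) -> bool:
    """Minimal function under test: consult the judge only after the pre-filter."""
    return same_bucket(a, b) and bool(judge(a, b)["is_duplicate"])


def exploding_judge(a: dict, b: dict) -> dict:
    # Any call fails the test, proving the pre-filter short-circuited.
    raise AssertionError("judge must not be called for non-candidates")


SQLI = {"id": "n1", "kind": "vulnerability", "host": "app.example.com", "cwe": "CWE-89"}
SQLI_REWORDED = {"id": "n2", "kind": "vulnerability", "host": "APP.example.com", "cwe": "cwe-89"}
XSS = {"id": "n3", "kind": "vulnerability", "host": "cdn.example.com", "cwe": "CWE-79"}


def test_judge_skipped_for_non_candidates():
    assert is_duplicate(SQLI, XSS, exploding_judge) is False


def test_judge_consulted_for_candidates():
    calls = []

    def recording_judge(a: dict, b: dict) -> dict:
        calls.append((a["id"], b["id"]))
        return {"is_duplicate": True}

    assert is_duplicate(SQLI, SQLI_REWORDED, recording_judge) is True
    assert calls == [("n1", "n2")]
```

A judge that raises is a stricter check than a mock call count, since the test fails at the exact point of the unwanted call.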

No Docker / Neo4j / live-LLM.

Gate evidence

  • ruff check — clean
  • ruff format --check — clean
  • basedpyright dedupe.py tools.py: 0 errors, 0 warnings, 0 notes
  • pytest tests/unit/research tests/unit/tools/test_skills_registry.py: 264 passed (13 new; registry test confirms clean registration)

…t-in)

The knowledge graph dedups writes only by deterministic node ID (SHA1 of
kind + canonical key). The multi-agent design (recon/exploit/analyst/
vulnresearch, each with fresh context) structurally produces SEPARATE
nodes for the SAME underlying vulnerability when reached via different
routes or worded differently. Those near-duplicates survive ID dedup and
show up as repeated entries in reports.

Add a clean, pure, testable semantic dedup layer that complements (does
not replace) ID-based dedup:

- dedupe.find_duplicate(candidate, existing, judge) -> DuplicateVerdict:
  a pure function with an INJECTED judge callable (a, b) -> dict, so the
  decision logic is unit-testable with a stub and no LLM client is
  imported in the module. A cheap deterministic pre-filter (same NodeKind
  plus normalized host / endpoint / CWE overlap) runs BEFORE the judge so
  non-candidates are rejected without invoking it.
- kg_dedupe_findings @tool: walks Finding/Vulnerability nodes, clusters
  likely duplicates via the pre-filter, and returns a structured JSON
  summary of duplicate clusters. It is REPORT-ONLY — it never deletes or
  mutates nodes; merging is a deliberate follow-up. Registered in
  RESEARCH_TOOLS alongside the sibling kg_* tools.
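
The clustering step could look roughly like the following union-find sketch over pairwise pre-filter hits. The function name `cluster_duplicates`, the dict-shaped nodes, and the JSON layout (`cluster_count`, `clusters`) are assumptions; the actual summary format of kg_dedupe_findings is not shown in this PR text.

```python
import copy
import json
from itertools import combinations


def cluster_duplicates(nodes: list, is_candidate) -> str:
    """Group nodes linked by pre-filter hits and return a JSON report.

    Report-only: the input nodes are read, never modified or removed.
    """
    parent = {n["id"]: n["id"] for n in nodes}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union every pair the pre-filter flags; clusters are the transitive closure.
    for a, b in combinations(nodes, 2):
        if is_candidate(a, b):
            parent[find(a["id"])] = find(b["id"])

    groups = {}
    for n in nodes:
        groups.setdefault(find(n["id"]), []).append(n["id"])

    # Singletons are not duplicates, so only multi-member groups are reported.
    clusters = sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)
    return json.dumps({"cluster_count": len(clusters), "clusters": clusters})


def same_host_and_cwe(a: dict, b: dict) -> bool:
    # Hypothetical pre-filter for the example.
    return a["host"] == b["host"] and a["cwe"] == b["cwe"]


nodes = [
    {"id": "n1", "host": "app.example.com", "cwe": "CWE-89"},
    {"id": "n2", "host": "app.example.com", "cwe": "CWE-89"},
    {"id": "n3", "host": "cdn.example.com", "cwe": "CWE-79"},
]
snapshot = copy.deepcopy(nodes)
report = cluster_duplicates(nodes, same_host_and_cwe)
```

Union-find makes the grouping transitive: if A pairs with B and B pairs with C, all three land in one cluster even when A and C do not pair directly.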

Additive and isolated: no reporting renderers or existing dedup/merge
code touched. New unit tests build fixture graphs (same host+CWE with
different wording vs. clearly-distinct findings) and, with a fake judge,
assert the pre-filter pairing, find_duplicate verdicts, the cluster
summary, and that the tool does not mutate the graph. No Docker / Neo4j /
live-LLM in tests.