feat(tools/research): semantic finding deduplication (report-only, opt-in) by VoidChecksum · Pull Request #437 · PurpleAILAB/Decepticon

VoidChecksum · 2026-05-30T11:52:51Z

What

Adds a semantic finding deduplication layer to the research tools. It is additive, opt-in, and report-only — no existing flow changes behavior, and no nodes are deleted or mutated.

Why

The knowledge graph dedups writes only by deterministic node ID (SHA1(kind + canonical key) — see decepticon_core/types/kg.py and tools/research/neo4j_store.py). The multi-agent design (recon / exploit / analyst / vulnresearch, each running with fresh context) structurally produces separate nodes for the same underlying vulnerability when it is reached via a different route or described with different wording. Those near-duplicates survive ID-based dedup and surface as repeated entries in reports. Peers (Strix report/dedupe.py, RedAmon, CAI) ship an LLM-judge dedup layer; this adds a clean, pure, testable equivalent.
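To make the gap concrete, here is a minimal sketch of ID-based dedup. The `node_id` helper, the `:` separator, and the example keys are assumptions for illustration; the real derivation lives in decepticon_core/types/kg.py and may differ in detail.

```python
import hashlib


def node_id(kind: str, canonical_key: str) -> str:
    """Hypothetical stand-in for the deterministic ID: SHA1 over kind + key."""
    return hashlib.sha1(f"{kind}:{canonical_key}".encode("utf-8")).hexdigest()


# The same finding written twice with an identical key collapses to one ID.
a = node_id("vulnerability", "app.example.test|/login|sqli")
b = node_id("vulnerability", "app.example.test|/login|sqli")

# A second agent that words the key differently gets a brand-new node,
# even though it describes the same underlying bug.
c = node_id("vulnerability", "app.example.test|/login|sql injection in user param")

print(a == b)  # True
print(a == c)  # False
```

Any difference in the canonical key, however small, yields an unrelated hash, which is why a semantic layer is needed on top.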

How

New module packages/decepticon/decepticon/tools/research/dedupe.py:

find_duplicate(candidate, existing, judge) -> DuplicateVerdict — a pure function. The judge is an injected callable (a, b) -> dict, so tests stub it and no LLM client is imported in the module. A cheap deterministic pre-filter (same NodeKind + normalized host / endpoint / CWE overlap) runs before the judge, so non-candidates are rejected without invoking it. DuplicateVerdict is a small frozen dataclass (is_duplicate: bool, canonical_id: str | None, reason: str).
kg_dedupe_findings @tool — walks Finding/Vulnerability nodes, groups likely-duplicates via the pre-filter (union-find over pairwise pre-filter hits), and returns a structured JSON summary of duplicate clusters. It is report-only: it never deletes or mutates nodes. It defaults to the deterministic pre-filter (no LLM wiring) to stay simple and green; a live judge is a clean follow-up via find_duplicate's injectable seam.
Registered in RESEARCH_TOOLS exactly like the sibling kg_* tools.
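The shape described above could look roughly like the following sketch. It is not the module's actual code: the `Finding` stand-in, the `prefilter` and `cluster` helpers, and the exact pre-filter rule (same kind, same normalized host, at least one shared CWE, with endpoint matching left out) are assumptions. Only `find_duplicate`, `DuplicateVerdict`, and its three fields come from the PR description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a Finding/Vulnerability node."""
    id: str
    kind: str
    host: str
    cwe: frozenset[str] = field(default_factory=frozenset)
    summary: str = ""


@dataclass(frozen=True)
class DuplicateVerdict:
    is_duplicate: bool
    canonical_id: str | None
    reason: str


Judge = Callable[[Finding, Finding], dict[str, Any]]


def prefilter(a: Finding, b: Finding) -> bool:
    """Cheap deterministic check: same kind, same normalized host, shared CWE."""
    same_host = a.host.strip().lower() == b.host.strip().lower()
    return a.kind == b.kind and same_host and bool(a.cwe & b.cwe)


def find_duplicate(candidate: Finding, existing: list[Finding], judge: Judge) -> DuplicateVerdict:
    """Pure decision logic: the judge is injected and only sees pre-filtered pairs."""
    for other in existing:
        if other.id == candidate.id or not prefilter(candidate, other):
            continue  # rejected without invoking the judge
        answer = judge(candidate, other)
        if answer.get("is_duplicate"):
            return DuplicateVerdict(True, other.id, str(answer.get("reason", "")))
    return DuplicateVerdict(False, None, "no candidate passed pre-filter and judge")


def cluster(findings: list[Finding]) -> list[list[str]]:
    """Union-find over pairwise pre-filter hits; returns clusters of size > 1."""
    parent = {f.id: f.id for f in findings}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(findings):
        for b in findings[i + 1:]:
            if prefilter(a, b):
                parent[find(a.id)] = find(b.id)

    groups: dict[str, list[str]] = {}
    for f in findings:
        groups.setdefault(find(f.id), []).append(f.id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


f1 = Finding("n1", "vulnerability", "App.Example.test", frozenset({"CWE-89"}), "SQLi in login")
f2 = Finding("n2", "vulnerability", "app.example.test", frozenset({"CWE-89"}), "login form injectable")
f3 = Finding("n3", "vulnerability", "db.example.test", frozenset({"CWE-798"}), "hardcoded creds")


def stub_judge(a: Finding, b: Finding) -> dict[str, Any]:
    return {"is_duplicate": True, "reason": "same host and CWE"}


verdict = find_duplicate(f2, [f1, f3], stub_judge)
clusters = cluster([f1, f2, f3])
print(verdict)   # DuplicateVerdict(is_duplicate=True, canonical_id='n1', reason='same host and CWE')
print(clusters)  # [['n1', 'n2']]
```

Because the judge is just a callable, swapping the stub for a live LLM call later would not require changing `find_duplicate` itself.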

Scope / isolation

Does not touch reporting renderers or existing dedup/merge code.
tools.py change is 2 lines (one import + one list entry).
No package-lock.json / uv.lock changes.

Tests

packages/decepticon/tests/unit/research/test_dedupe.py (13 tests) builds fixture KnowledgeGraphs — two findings describing the same bug (same host + CWE, different wording) and two clearly-distinct findings — and, with a fake judge, asserts:

the pre-filter pairs the same-host+CWE candidates and not the distinct ones (and skips the judge entirely for non-candidates, verified with an exploding judge);
find_duplicate returns is_duplicate True/False correctly with the right canonical_id/reason;
kg_dedupe_findings returns the expected cluster summary and does not mutate the graph.
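The "exploding judge" and non-mutation checks can be illustrated with a self-contained sketch. The dict-based fixtures and the tuple-returning `find_duplicate` below are simplified stand-ins, not the code in test_dedupe.py. They show only the testing pattern: a judge that raises if it is ever called, and a deep-copied snapshot compared after the run.

```python
import copy


def prefilter(a: dict, b: dict) -> bool:
    """Hypothetical pre-filter: same kind, same host, at least one shared CWE."""
    return (
        a["kind"] == b["kind"]
        and a["host"] == b["host"]
        and bool(set(a["cwe"]) & set(b["cwe"]))
    )


def find_duplicate(candidate: dict, existing: list, judge) -> tuple:
    """Minimal stand-in returning (is_duplicate, canonical_id, reason)."""
    for other in existing:
        if prefilter(candidate, other) and judge(candidate, other)["is_duplicate"]:
            return True, other["id"], "judge confirmed"
    return False, None, "no match"


def exploding_judge(a: dict, b: dict) -> dict:
    # If the pre-filter works, this is never reached for distinct findings.
    raise AssertionError("judge must not be called for non-candidates")


def fake_judge(a: dict, b: dict) -> dict:
    return {"is_duplicate": True}


sqli_a = {"id": "n1", "kind": "vulnerability", "host": "app.example.test", "cwe": ["CWE-89"]}
sqli_b = {"id": "n2", "kind": "vulnerability", "host": "app.example.test", "cwe": ["CWE-89"]}
creds = {"id": "n3", "kind": "vulnerability", "host": "db.example.test", "cwe": ["CWE-798"]}

graph = [sqli_a, creds]
snapshot = copy.deepcopy(graph)

# Distinct findings: the pre-filter rejects the pair, so the judge never fires.
distinct = find_duplicate(creds, [sqli_a], exploding_judge)

# Same host + CWE: the pair reaches the fake judge and is reported as a duplicate.
same = find_duplicate(sqli_b, graph, fake_judge)

# Report-only: the fixture graph is untouched after both calls.
unchanged = graph == snapshot
```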

No Docker / Neo4j / live-LLM.

Gate evidence

ruff check — clean
ruff format --check — clean
basedpyright dedupe.py tools.py — 0 errors, 0 warnings, 0 notes
pytest tests/unit/research tests/unit/tools/test_skills_registry.py — 264 passed (13 new; registry test confirms clean registration)


…t-in)

The knowledge graph dedups writes only by deterministic node ID (SHA1 of kind + canonical key). The multi-agent design (recon/exploit/analyst/vulnresearch, each with fresh context) structurally produces SEPARATE nodes for the SAME underlying vulnerability when reached via different routes or worded differently. Those near-duplicates survive ID dedup and show up as repeated entries in reports.

Add a clean, pure, testable semantic dedup layer that complements (does not replace) ID-based dedup:

- dedupe.find_duplicate(candidate, existing, judge) -> DuplicateVerdict: a pure function with an INJECTED judge callable (a, b) -> dict, so the decision logic is unit-testable with a stub and no LLM client is imported in the module. A cheap deterministic pre-filter (same NodeKind plus normalized host / endpoint / CWE overlap) runs BEFORE the judge so non-candidates are rejected without invoking it.
- kg_dedupe_findings @tool: walks Finding/Vulnerability nodes, clusters likely duplicates via the pre-filter, and returns a structured JSON summary of duplicate clusters. It is REPORT-ONLY: it never deletes or mutates nodes; merging is a deliberate follow-up. Registered in RESEARCH_TOOLS alongside the sibling kg_* tools.

Additive and isolated: no reporting renderers or existing dedup/merge code touched.

New unit tests build fixture graphs (same host+CWE with different wording vs. clearly-distinct findings) and, with a fake judge, assert the pre-filter pairing, find_duplicate verdicts, the cluster summary, and that the tool does not mutate the graph. No Docker / Neo4j / live-LLM in tests.

VoidChecksum requested a review from PurpleCHOIms as a code owner May 30, 2026 11:52

VoidChecksum wants to merge 1 commit into main from feat/semantic-finding-dedupe
