feat(tools/research): semantic finding deduplication (report-only, opt-in)#437
Open
VoidChecksum wants to merge 1 commit into
…t-in)

The knowledge graph dedups writes only by deterministic node ID (SHA1 of kind + canonical key). The multi-agent design (recon/exploit/analyst/vulnresearch, each with fresh context) structurally produces SEPARATE nodes for the SAME underlying vulnerability when reached via different routes or worded differently. Those near-duplicates survive ID dedup and show up as repeated entries in reports.

Add a clean, pure, testable semantic dedup layer that complements (does not replace) ID-based dedup:

- dedupe.find_duplicate(candidate, existing, judge) -> DuplicateVerdict: a pure function with an INJECTED judge callable (a, b) -> dict, so the decision logic is unit-testable with a stub and no LLM client is imported in the module. A cheap deterministic pre-filter (same NodeKind plus normalized host / endpoint / CWE overlap) runs BEFORE the judge so non-candidates are rejected without invoking it.
- kg_dedupe_findings @tool: walks Finding/Vulnerability nodes, clusters likely duplicates via the pre-filter, and returns a structured JSON summary of duplicate clusters. It is REPORT-ONLY: it never deletes or mutates nodes; merging is a deliberate follow-up. Registered in RESEARCH_TOOLS alongside the sibling kg_* tools.

Additive and isolated: no reporting renderers or existing dedup/merge code touched.

New unit tests build fixture graphs (same host+CWE with different wording vs. clearly-distinct findings) and, with a fake judge, assert the pre-filter pairing, find_duplicate verdicts, the cluster summary, and that the tool does not mutate the graph. No Docker / Neo4j / live-LLM in tests.
What
Adds a semantic finding deduplication layer to the research tools. It is additive, opt-in, and report-only — no existing flow changes behavior, and no nodes are deleted or mutated.
Why
The knowledge graph dedups writes only by deterministic node ID (`SHA1(kind + canonical key)`; see `decepticon_core/types/kg.py` and `tools/research/neo4j_store.py`). The multi-agent design (recon / exploit / analyst / vulnresearch, each running with fresh context) structurally produces separate nodes for the same underlying vulnerability when it is reached via a different route or described with different wording. Those near-duplicates survive ID-based dedup and surface as repeated entries in reports. Peers (Strix `report/dedupe.py`, RedAmon, CAI) ship an LLM-judge dedup layer; this adds a clean, pure, testable equivalent.
How
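As background for the changes below, the gap in ID-based dedup can be illustrated with a minimal sketch. The `node_id` helper, the `:` separator, and the key strings are hypothetical stand-ins for illustration; the real ID derivation lives in the files referenced under Why and may differ in detail.

```python
import hashlib


def node_id(kind: str, canonical_key: str) -> str:
    """Hypothetical stand-in: SHA1 over kind + canonical key (separator assumed)."""
    return hashlib.sha1(f"{kind}:{canonical_key}".encode("utf-8")).hexdigest()


# Two writes with the same kind and key collapse to one node ID.
a = node_id("Finding", "app.example.test|/login|sqli")
b = node_id("Finding", "app.example.test|/login|sqli")

# A second agent describes the same bug with different wording, so the key
# (and therefore the ID) differs and a separate node is created.
c = node_id("Finding", "app.example.test|/login|sql injection in username")

print(a == b)  # True
print(a == c)  # False
```

Because the hash only sees the key text, any wording or routing difference produces a new ID, which is the case the semantic layer is meant to report.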
New module `packages/decepticon/decepticon/tools/research/dedupe.py`:
- `find_duplicate(candidate, existing, judge) -> DuplicateVerdict`: a pure function. The `judge` is an injected callable `(a, b) -> dict`, so tests stub it and no LLM client is imported in the module. A cheap deterministic pre-filter (same `NodeKind` + normalized host / endpoint / CWE overlap) runs before the judge, so non-candidates are rejected without invoking it.
- `DuplicateVerdict` is a small frozen dataclass (`is_duplicate: bool`, `canonical_id: str | None`, `reason: str`).
- `kg_dedupe_findings` `@tool`: walks `Finding` / `Vulnerability` nodes, groups likely duplicates via the pre-filter (union-find over pairwise pre-filter hits), and returns a structured JSON summary of duplicate clusters. It is report-only: it never deletes or mutates nodes. It defaults to the deterministic pre-filter (no LLM wiring) to keep the change simple and the checks green; a live judge is a clean follow-up via `find_duplicate`'s injectable seam.
- Registered in `RESEARCH_TOOLS` exactly like the sibling `kg_*` tools.
Scope / isolation
- `tools.py` change is 2 lines (one import + one list entry).
- No `package-lock.json` / `uv.lock` changes.
Tests
`packages/decepticon/tests/unit/research/test_dedupe.py` (13 tests) builds fixture `KnowledgeGraph`s: two findings describing the same bug (same host + CWE, different wording) and two clearly distinct findings. With a fake judge, it asserts:
- the pre-filter pairing;
- `find_duplicate` returns `is_duplicate` True/False correctly with the right `canonical_id` / `reason`;
- `kg_dedupe_findings` returns the expected cluster summary and does not mutate the graph.
No Docker / Neo4j / live-LLM.
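For reviewers, the injected-judge seam and the fake-judge style of test can be sketched roughly as follows. This is not the module's code: the `Node` fixture type, the `_norm` rules, the "any one field matches" pre-filter, and the `duplicate` / `reason` keys of the judge's dict are all assumptions made for illustration. Only the `find_duplicate` signature and the `DuplicateVerdict` fields come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Node:  # hypothetical stand-in for a Finding/Vulnerability node
    id: str
    kind: str
    host: str
    endpoint: str
    cwe: str


@dataclass(frozen=True)
class DuplicateVerdict:
    is_duplicate: bool
    canonical_id: Optional[str]
    reason: str


Judge = Callable[[Node, Node], dict]


def _norm(value: str) -> str:
    # Assumed normalization: trim, lowercase, drop trailing slashes.
    return value.strip().lower().rstrip("/")


def prefilter(a: Node, b: Node) -> bool:
    """Same kind plus at least one matching host/endpoint/CWE (assumed rule)."""
    if a.kind != b.kind:
        return False
    pairs = ((a.host, b.host), (a.endpoint, b.endpoint), (a.cwe, b.cwe))
    return any(_norm(x) != "" and _norm(x) == _norm(y) for x, y in pairs)


def find_duplicate(candidate: Node, existing: Iterable[Node], judge: Judge) -> DuplicateVerdict:
    for other in existing:
        if other.id == candidate.id or not prefilter(candidate, other):
            continue  # rejected cheaply; the judge is not invoked
        verdict = judge(candidate, other)
        if verdict.get("duplicate"):
            return DuplicateVerdict(True, other.id, str(verdict.get("reason", "")))
    return DuplicateVerdict(False, None, "no existing node passed pre-filter and judge")


calls = []


def fake_judge(a: Node, b: Node) -> dict:
    # Stub judge: records each invocation and always answers "duplicate".
    calls.append((a.id, b.id))
    return {"duplicate": True, "reason": "same host and CWE"}


sqli_a = Node("n1", "Finding", "App.Example.test", "/login", "CWE-89")
sqli_b = Node("n2", "Finding", "app.example.test", "/login/", "cwe-89")
xss = Node("n3", "Finding", "cdn.example.test", "/search", "CWE-79")

dup = find_duplicate(sqli_b, [sqli_a, xss], fake_judge)
distinct = find_duplicate(xss, [sqli_a, sqli_b], fake_judge)
print(dup.is_duplicate, dup.canonical_id)  # True n1
print(distinct.is_duplicate, len(calls))   # False 1
```

The stub judge always answers "duplicate", so the second result shows that the pre-filter alone rejected the distinct finding: the judge was called once in total.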
Gate evidence
- `ruff check`: clean
- `ruff format --check`: clean
- `basedpyright dedupe.py tools.py`: 0 errors, 0 warnings, 0 notes
- `pytest tests/unit/research tests/unit/tools/test_skills_registry.py`: 264 passed (13 new; registry test confirms clean registration)
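As a reading aid for the clustering step mentioned under How (union-find over pairwise pre-filter hits, report-only), here is a minimal sketch. The dict-based node shape, the `cluster_duplicates` and `same_host_and_cwe` names, and the JSON field names are assumptions for illustration, not the tool's actual schema.

```python
import copy
import json
from itertools import combinations


def cluster_duplicates(nodes, is_pair):
    """Group node IDs with union-find over pairwise hits; the input is only read."""
    parent = {n["id"]: n["id"] for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(nodes, 2):
        if is_pair(a, b):
            parent[find(a["id"])] = find(b["id"])  # union the two sets

    groups = {}
    for n in nodes:
        groups.setdefault(find(n["id"]), []).append(n["id"])
    # Only groups with more than one member are duplicate clusters.
    clusters = sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)
    return json.dumps({"cluster_count": len(clusters), "clusters": clusters})


def same_host_and_cwe(a, b):
    # Hypothetical pre-filter used only for this example.
    return (a["kind"], a["host"], a["cwe"]) == (b["kind"], b["host"], b["cwe"])


nodes = [
    {"id": "n1", "kind": "Finding", "host": "app.example.test", "cwe": "CWE-89"},
    {"id": "n2", "kind": "Finding", "host": "app.example.test", "cwe": "CWE-89"},
    {"id": "n3", "kind": "Finding", "host": "cdn.example.test", "cwe": "CWE-79"},
    {"id": "n4", "kind": "Finding", "host": "app.example.test", "cwe": "CWE-89"},
]
snapshot = copy.deepcopy(nodes)

report = cluster_duplicates(nodes, same_host_and_cwe)
print(report)             # {"cluster_count": 1, "clusters": [["n1", "n2", "n4"]]}
print(nodes == snapshot)  # True
```

Union-find keeps transitive hits in one cluster (n1 with n2, n2 with n4) without a merge step, and comparing against a deep copy is one simple way to check that reporting left the input untouched.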