feat: deterministic tested_by edges + dashboard badge (#113) #122
The file-analyzer LLM only sees the production↔test relationship when
analyzing a test file (production files don't import their tests), so
its emitted direction was unreliable across batches and recall was
massively undercounted (~7% on a real Nuxt 4 + Directus repo).
Move tested_by production entirely into the merge step. The linker:
- Strips every tested_by edge from batch input (LLM direction unreliable).
- Indexes file:* nodes and classifies each path as test or production.
- For each test, walks ordered candidate production paths (sibling
de-infix, __tests__/ walk-out, mirrored tests/→{src,app,lib,<root>}
tree, Maven/Gradle src/test/...→src/main/...).
- Emits canonical production → test edges and tags production nodes
"tested".
Supported conventions: JS/TS family (.test/.spec), Go (_test.go),
Python (test_*.py, *_test.py), Java (*Test/*Tests/*IT.java), Kotlin
(*Test/*Tests.kt), C# (*Test/*Tests.cs), C/C++ (test_*, *_test).
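As an illustration of the naming conventions listed above, a classifier could look like the minimal sketch below. The table contents, the extension sets, and the exact regexes are assumptions made for this example; the real `is_test_path` in the merge script may differ.

```python
import os
import re

# Hypothetical pattern table keyed by file extension (illustrative only).
_TEST_NAME_PATTERNS = {
    ".go": re.compile(r".+_test$"),
    ".py": re.compile(r"(test_.+|.+_test)$"),
    ".java": re.compile(r".+(Test|Tests|IT)$"),
    ".kt": re.compile(r".+(Test|Tests)$"),
    ".cs": re.compile(r".+(Test|Tests)$"),
    ".c": re.compile(r"(test_.+|.+_test)$"),
    ".cpp": re.compile(r"(test_.+|.+_test)$"),
}
# Assumed set of JS/TS-family extensions.
_JS_TS_EXTS = {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}


def is_test_path(path: str) -> bool:
    """Classify a file path as test (True) or production (False) by file name."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext in _JS_TS_EXTS:
        # JS/TS family: cart.test.ts / cart.spec.tsx carry the marker as an infix,
        # so after splitext the stem ends with ".test" or ".spec".
        return stem.endswith((".test", ".spec"))
    pattern = _TEST_NAME_PATTERNS.get(ext)
    return bool(pattern and pattern.match(stem))
```

Keeping the per-language rules in a table means a new convention is a one-line addition rather than a new conditional branch.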
Stdlib only, type-hinted in existing style. Hooked into
merge_and_normalize between node dedup (Step 5) and edge dedup
(Step 6). Reports drops under "Fixed" and additions under a new
"Tested-by linker" section.
Tests cover path classification, candidate generation, full link_tests
behaviour (forward direction, idempotence, LLM-edge stripping,
test-to-test rejection), and the merge integration. 31 cases, stdlib
unittest, runnable with `python -m unittest test_merge_batch_graphs.py`.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
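The ordered candidate walk described in the commit message above (sibling de-infix first, then `__tests__/` walk-out, then the Maven/Gradle mirror) might be sketched as follows. The function name, the `_add_unique` signature, and the exact ordering beyond "sibling before walk-out" are assumptions for illustration, not the script's actual code.

```python
import posixpath


def _add_unique(out: list[str], path: str) -> None:
    """Append path only if it is not already a candidate (keeps priority order)."""
    if path not in out:
        out.append(path)


def production_candidates_sketch(test_path: str) -> list[str]:
    """Return candidate production paths for a test path, best guess first."""
    out: list[str] = []
    parts = test_path.split("/")
    directory, filename = posixpath.split(test_path)

    # 1. JS/TS sibling de-infix: src/cart.test.ts -> src/cart.ts
    for infix in (".test.", ".spec."):
        if infix in filename:
            plain = filename.replace(infix, ".", 1)
            _add_unique(out, posixpath.join(directory, plain))
            # 2. __tests__/ walk-out: src/__tests__/cart.test.ts -> src/cart.ts
            if posixpath.basename(directory) == "__tests__":
                _add_unique(out, posixpath.join(posixpath.dirname(directory), plain))

    # 3. Maven/Gradle mirror: .../src/test/... -> .../src/main/..., minus the suffix
    stem, ext = posixpath.splitext(filename)
    for i in range(len(parts) - 2):
        if parts[i] == "src" and parts[i + 1] == "test":
            for suffix in ("Tests", "Test", "IT"):
                if stem.endswith(suffix) and len(stem) > len(suffix):
                    mirrored = parts[: i + 1] + ["main"] + parts[i + 2 : -1]
                    _add_unique(out, "/".join(mirrored + [stem[: -len(suffix)] + ext]))
                    break
            break
    return out
```

A caller would take the first candidate that exists in the `file:*` node index and is itself classified as production.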
Now that the merge script produces tested_by edges deterministically from path conventions, the LLM should not emit them: its direction is unreliable across batches and any emitted edges are stripped on merge.
- Remove tested_by row from file-analyzer's edge table.
- Add a note pointing to the deterministic linker.
- Document the new behaviour in the merge section of SKILL.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- is_test_path: collapse 7 per-language conditional blocks into a data-driven _TEST_NAME_PATTERNS table; JS/TS infix stays inline
- production_candidates: extract _join + module-level _add_unique to drop the nested closure and the repeated trailing-slash idiom
- Drop dead _TEST_DIR_SEGMENTS constant and the local _splitext reimplementation; use os.path.splitext
- link_tests: drop the impossible-malformed-tags guard, tighten the docstring, change edge description to "Path-based pairing (deterministic)", drop redundant break comment
- Trim Step 5b inline block that duplicated the module-level header
- Convert file-analyzer Note from blockquote to bold paragraph to match surrounding prompt style
Tests: split the strip-edges test from the unrelated-edges-survive test, add empty-input and missing-filePath cases, pin sibling-before-walkup and sibling-before-mirror priority order, drop brittle report text assertion. 36 tests, all passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Render a small green dot next to the complexity badge whenever a node's tags contain "tested", surfacing the deterministic linker's signal so users can see at a glance which files have paired tests. Plumb node.tags through both CustomNodeData construction sites in GraphView.tsx; KnowledgeGraphView.tsx already passes tags.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c257a55f0
    tags = prod_node.setdefault("tags", [])
    if "tested" not in tags:
        tags.append("tested")
Normalize tags type before adding "tested" tag
When a matched production file has malformed tags (for example null, a string, or any non-list value from an LLM batch), this block assumes list semantics and can raise (TypeError on membership or AttributeError on append), which aborts the whole merge. This regression is introduced by the new linker path and can break /understand on otherwise recoverable batch data; coerce non-list tags to [] (or another safe default) before checking/appending.
Codex flagged that prod_node.setdefault("tags", []) returns the existing
value when the key is present, so a raw LLM batch with tags=None or
tags="some string" would crash the whole merge on the next "tested" not
in tags membership check.
The TypeScript autoFixGraph normalizer that handles this case runs
downstream of merge-batch-graphs.py, not before it, so the Python side
has to defend itself. Coerce non-list tags to a fresh [] before the
membership/append.
Regression test exercises None / comma-string / single-string / int /
dict inputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
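The defensive coercion described in the commit message above can be sketched like this. The helper name `add_tested_tag` is hypothetical; the only behaviour taken from the commit is that a non-list `tags` value is replaced with a fresh list before the membership check and append.

```python
def add_tested_tag(node: dict) -> bool:
    """Add "tested" to node["tags"], coercing malformed tags to a fresh list.

    Returns True when the tag was newly added, False when it was already there.
    """
    tags = node.get("tags")
    if not isinstance(tags, list):
        # None, str, int, dict, ... from raw batch JSON: start over with [].
        tags = []
        node["tags"] = tags
    if "tested" in tags:
        return False
    tags.append("tested")
    return True
```

Using `node.get(...)` plus an `isinstance` check avoids the `setdefault` pitfall, where an existing malformed value is returned unchanged.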
Strip-and-rederive (current PR behaviour) drops real coverage signal on
projects whose test layout doesn't match a naming convention. On the
Google microservices-demo the LLM had emitted 7 valid tested_by edges
(3 with inverted direction); the strip pass dropped them and the path-
convention rederive could only re-pair 4 of them. Net: 7 → 4 edges,
3 production files lost their tested signal.
Replace strip-and-rederive with two-pass swap-then-supplement:
Pass 1 — walk LLM tested_by edges. Canonical (production → test)
edges pass through unchanged. Inverted (test → production) edges are
flipped in place; description gets a `[direction corrected]` audit
marker. Edges with no recoverable meaning (test↔test, prod↔prod,
orphan endpoint, duplicate pair) are dropped.
Pass 2 — for tests not yet paired by Pass 1, walk path-convention
candidates and emit a fresh production → test edge for the first
match. Pairs already covered by Pass 1 are skipped.
Tagging is consolidated into a final pass over all canonical edges so
production nodes get the "tested" tag whether the edge came from
Pass 1 (canonical / swapped) or Pass 2 (supplement).
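The two passes and the final tagging pass might look roughly like the sketch below. It is a simplified illustration, not the script's `link_tests`: the edge keys (`source`, `target`, `type`, `description`), the node keys (`id`, `filePath`, `tags`), and the injected `is_test` / `candidates` callables are assumptions, and weights are ignored.

```python
from typing import Callable, Iterable


def link_tests_sketch(
    nodes: list[dict],
    edges: list[dict],
    is_test: Callable[[str], bool],
    candidates: Callable[[str], Iterable[str]],
) -> tuple[list[dict], dict[str, int]]:
    """Two-pass linker sketch: returns (edges, counters)."""
    by_id = {n["id"]: n for n in nodes if n.get("filePath")}
    by_path = {n["filePath"]: n["id"] for n in by_id.values()}
    stats = {"added": 0, "dropped": 0, "tagged": 0, "swapped": 0}
    kept: list[dict] = []
    covered: set[tuple[str, str]] = set()  # canonical (production id, test id)

    # Pass 1: keep LLM evidence, fix direction.
    for edge in edges:
        if edge.get("type") != "tested_by":
            kept.append(edge)  # unrelated edges pass through untouched
            continue
        src, dst = edge.get("source"), edge.get("target")
        if src not in by_id or dst not in by_id:
            stats["dropped"] += 1  # orphan endpoint
            continue
        src_test = is_test(by_id[src]["filePath"])
        dst_test = is_test(by_id[dst]["filePath"])
        pair = (dst, src) if src_test else (src, dst)
        if src_test == dst_test or pair in covered:
            stats["dropped"] += 1  # test<->test, prod<->prod, or duplicate pair
            continue
        if src_test:  # inverted: flip in place and leave an audit marker
            edge["source"], edge["target"] = pair
            edge["description"] = (edge.get("description", "") + " [direction corrected]").strip()
            stats["swapped"] += 1
        covered.add(pair)
        kept.append(edge)

    # Pass 2: supplement via path conventions for tests Pass 1 did not pair.
    paired_tests = {test_id for _, test_id in covered}
    for test_id, node in by_id.items():
        if not is_test(node["filePath"]) or test_id in paired_tests:
            continue
        for cand in candidates(node["filePath"]):
            prod_id = by_path.get(cand)
            if prod_id is not None and not is_test(cand):
                kept.append({"source": prod_id, "target": test_id, "type": "tested_by",
                             "description": "Path-based pairing (deterministic)"})
                covered.add((prod_id, test_id))
                stats["added"] += 1
                break  # first match wins

    # Final pass: tag every production node that ends up with a canonical edge.
    for prod_id in {p for p, _ in covered}:
        tags = by_id[prod_id].get("tags")
        if not isinstance(tags, list):
            tags = by_id[prod_id]["tags"] = []
        if "tested" not in tags:
            tags.append("tested")
            stats["tagged"] += 1
    return kept, stats
```

Because tagging reads only from `covered`, it does not matter whether a pair came from a canonical edge, a flipped edge, or a path-convention supplement.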
Multi-language audit of production_candidates revealed three real-world
gaps surfaced by re-checking microservices-demo and common project
layouts:
- JS/TS walk-out only handled `__tests__/`. Extended to also walk out
of `<dir>/test/`, `<dir>/spec/`, and `<dir>/tests/` (some JS/TS
projects use these instead of __tests__/).
- Python walk-out only handled top-level `tests/`. Added in-package
`<pkg>/tests/test_<name>.py` → `<pkg>/<name>.py` (Django app style
and any project that colocates tests with the package).
- C# only had sibling fallback. Added two new mirrors:
* `<svc>/tests/X.cs` ↔ `<svc>/X.cs` and `<svc>/src/.../X.cs`
(microservices-demo cartservice exact layout).
* `<App>.Tests/Foo/BarTests.cs` ↔ `<App>/Foo/Bar.cs`
(.NET sibling-project convention).
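Two of the new patterns above, the Python in-package walk-out and the .NET sibling-project mirror, could be generated along these lines. The function name and the suffix handling are assumptions for this sketch; the real `production_candidates` covers more layouts and may order candidates differently.

```python
import posixpath


def extra_candidates(test_path: str) -> list[str]:
    """Candidate production paths for two of the layouts listed above."""
    directory, filename = posixpath.split(test_path)
    stem, ext = posixpath.splitext(filename)
    out: list[str] = []

    # Python in-package: <pkg>/tests/test_<name>.py -> <pkg>/<name>.py
    if ext == ".py" and stem.startswith("test_") and posixpath.basename(directory) == "tests":
        pkg = posixpath.dirname(directory)
        out.append(posixpath.join(pkg, stem[len("test_"):] + ".py"))

    # .NET sibling project: <App>.Tests/Foo/BarTests.cs -> <App>/Foo/Bar.cs
    if ext == ".cs":
        parts = test_path.split("/")
        for suffix in ("Tests", "Test"):
            if stem.endswith(suffix) and len(stem) > len(suffix):
                prod_name = stem[: -len(suffix)] + ext
                for i, part in enumerate(parts[:-1]):
                    if part.endswith(".Tests") and len(part) > len(".Tests"):
                        # Swap the "<App>.Tests" segment for "<App>" and keep the rest.
                        app = part[: -len(".Tests")]
                        out.append("/".join(parts[:i] + [app] + parts[i + 1 : -1] + [prod_name]))
                break
    return out
```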
Go is intentionally not changed — the "one _test.go covers several
.go files in the same package" pattern is now solved by Pass 1
(swapping LLM edges), not by trying to invent multi-pair path heuristics.
The file-analyzer prompt is updated: the `tested_by` row is restored
in the schema table because we now use those edges as evidence (Pass 1
canonicalizes the direction). The note explains direction will be
auto-corrected so the LLM doesn't need to be defensive about it.
link_tests now returns a 4-tuple (added, dropped, tagged, swapped);
the merge_and_normalize report distinguishes "edges produced
(supplement)" from "edges flipped" from "edges dropped".
Real-world validation on microservices-demo:
before: 7 tested_by edges, 3 inverted, 0 tagged
after PR: 4 tested_by edges, 0 inverted, 4 tagged ← strip-and-rederive
this: 7 tested_by edges, 0 inverted, 7 tagged ← swap-then-supplement
Tests: 47 pass (was 37). New cases cover all swap branches, the
shippingservice "one test, many sources" regression, and each new
language pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@codex review this
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4bb22fd9af
    if pair in covered:
        # Duplicate canonical edge: drop the dup, keep the first.
        dropped += 1
        continue
Keep max-weight duplicate tested_by edge
When duplicate tested_by pairs appear, this branch drops later edges purely by arrival order (covered), so the stronger edge can be discarded before the generic Step 6 deduper (which normally keeps the highest weight) ever runs. In multi-batch runs where the same pair is emitted with different confidences, the resulting graph can retain a lower-confidence edge solely because its batch was processed first, which degrades ranking/visual confidence semantics for tested_by links.
Addressed in a4bdc1c. Pass 1 now mirrors Step 6's weight-aware dedup locally:
- `pair_to_idx` tracks each kept pair's slot in the compacted edges list
- on a duplicate, compare weights and replace the slot when strictly heavier (tie keeps the first arrival, matching Step 6)
- swap operation is deferred until an edge is known to survive, so no work is spent canonicalizing a doomed duplicate
- `swapped_pairs` set replaces the raw `swapped` counter so the reported number reflects the final output (replacing a swapped edge with a heavier canonical one drops it from the count, and vice versa)
Five new unit tests cover all four weight × direction combinations.
Code review
Found 1 issue:
Codex P2: link_tests Pass 1 dropped duplicate (production, test) pairs
purely by arrival order — when two batches both emitted a tested_by
edge for the same pair with different confidences (0.3 vs 0.9), the
edge that happened to iterate first won. The general Step 6 deduper
at line 762 uses `weight > existing.weight` semantics, but it only
ever saw one of the duplicates, so it couldn't rescue the heavier one.
Refactor Pass 1 to mirror Step 6's weight comparison locally:
- Track `pair_to_idx` mapping each kept (prod, test) pair to its
slot in the compacted edges list. On a duplicate, look up the
existing kept edge and compare weights; if the new edge is
strictly heavier, swap (if needed) and replace the slot. Tie or
lighter → drop the new edge.
- Defer the swap operation until we know an edge will survive — no
point canonicalizing a doomed duplicate.
- Track surviving swap pairs in a separate `swapped_pairs` set so
the `swapped` counter reflects the FINAL output, not the wasted
work on edges that were later replaced. This means: replacing a
swapped edge with a heavier canonical one drops the swap from
the count; replacing a canonical edge with a heavier swapped one
adds it.
- Extract the swap-in-place mutation into `_swap_tested_by_in_place`
so it can be invoked from both code paths.
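The weight-aware duplicate handling in the list above can be sketched as follows. This is a reduced illustration under stated assumptions: every input edge is already known to connect one production node and one test node, a missing `weight` is treated as 0, and a replaced edge is counted as dropped. None of those details are confirmed by the commit.

```python
def dedup_tested_by(edges: list[dict], test_ids: set[str]) -> tuple[list[dict], int, int]:
    """Return (kept edges, dropped count, swapped count) for tested_by edges."""
    kept: list[dict] = []
    pair_to_idx: dict[tuple[str, str], int] = {}   # (prod, test) -> slot in kept
    swapped_pairs: set[tuple[str, str]] = set()    # pairs whose survivor was flipped
    dropped = 0

    for edge in edges:
        inverted = edge["source"] in test_ids
        pair = (edge["target"], edge["source"]) if inverted else (edge["source"], edge["target"])
        idx = pair_to_idx.get(pair)
        if idx is not None and edge.get("weight", 0) <= kept[idx].get("weight", 0):
            dropped += 1  # tie or lighter: keep the first arrival
            continue
        # The edge survives, so only now pay for canonicalizing it.
        if inverted:
            edge["source"], edge["target"] = pair
            swapped_pairs.add(pair)
        else:
            swapped_pairs.discard(pair)
        if idx is None:
            pair_to_idx[pair] = len(kept)
            kept.append(edge)
        else:
            kept[idx] = edge  # strictly heavier duplicate replaces the slot
            dropped += 1      # assumption: the replaced edge counts as dropped
    return kept, dropped, len(swapped_pairs)
```

Tracking swapped pairs in a set rather than a counter is what lets the reported number describe the final output: replacing a flipped edge with a heavier canonical one removes the pair from the set.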
Five new unit tests cover all four weight-vs-direction combinations
plus a tie case (existing test_drops_duplicate_canonical_edges, which
still passes — tie → keep first, no swap counted).
microservices-demo regression check unchanged: 7 → 7 edges, 3 swapped,
0 dropped, 7 tagged.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Closes #113.
What's wrong
@Bulkmaker reported in #113 that test-coverage info on the graph is currently unusable for two reasons:
- `tested_by` edge direction is inconsistent. The file-analyzer prompt says edges should be `production → test`, but production files don't import their tests, so the LLM only sees the relationship while analyzing a test file and naturally emits `test → production` (inverted). Same project, same file types, mixed directions across batches. […] `tested_by` edges came through (~7% of production files looked tested vs. true coverage being far higher).
- […] `tested_by` like any other edge. Nothing visually distinguishes a tested file from an untested one.
Confirmed reproduction on Google's `microservices-demo` (real corpus, not Nuxt-specific): 7 LLM-emitted `tested_by` edges, 3 of them (43%) inverted (`shippingservice_test.go → main.go`, `shippingservice_test.go → tracker.go`, `product_catalog_test.go → product_catalog.go`). 0 production nodes tagged.
Fix
Two pieces:
1. Canonicalize `tested_by` direction in `merge-batch-graphs.py` (two-pass linker)
Path-based linker integrated as Step 5b between node dedup and edge dedup, with a refined two-pass design:
Pass 1: preserve LLM evidence, fix direction.
The LLM's `tested_by` pairings are real (it sees `import { CartService } from '../src/services/CartService'` in the test file). What's wrong is direction: the source is the file being analyzed, i.e. a test. So we walk every existing `tested_by` edge and:
- Canonical (`production → test`) → keep unchanged
- Inverted (`test → production`) → flip in place; description gets `[direction corrected]` audit marker
- No recoverable meaning (`test ↔ test`, `prod ↔ prod`, orphan endpoint, duplicate pair) → drop
Pass 2: supplement via path conventions.
For tests Pass 1 didn't pair, walk candidate production paths from `production_candidates(test_path)` and emit a fresh `production → test` edge for the first match. Pairs already covered by Pass 1 are skipped.
Path-convention coverage (`production_candidates`):
- JS/TS: sibling de-infix (`.test.`/`.spec.`); walk-out from `__tests__/`, `test/`, `spec/`, `tests/` subdirs; mirrored `tests/` ↔ `{src,app,lib,""}` tree
- Go: `<name>_test.go` ↔ `<name>.go` (multi-source-per-test cases handled by Pass 1 swap, not by path heuristic)
- Python: `test_<name>.py` / `<name>_test.py`; in-package `<pkg>/tests/test_<name>.py` walk-out (Django apps); top-level `tests/` mirror
- Java/Kotlin: `src/test/<lang>/...` ↔ `src/main/<lang>/...`; sibling fallback
- C#: `<svc>/tests/X.cs` ↔ `<svc>/X.cs` and `<svc>/src/.../X.cs` (microservices-demo cartservice layout); .NET sibling-project mirror `<App>.Tests/X.cs` ↔ `<App>/X.cs`
Tagging is consolidated into a final pass over all canonical edges, so production nodes get the `"tested"` tag whether the edge came from Pass 1 (canonical / swapped) or Pass 2 (supplement). `tags` is coerced to a fresh list when it arrives malformed (None / string / int / dict from raw LLM batch JSON), since the TypeScript `autoFixGraph` normalizer runs downstream of this script. `link_tests` returns `(added, dropped, tagged, swapped)`; the merge report distinguishes the four counters.
The file-analyzer prompt keeps the `tested_by` row in the schema table: we now actively use the LLM-emitted edges as Pass 1 evidence. The note explains direction will be auto-canonicalized so the LLM doesn't need to be defensive about it.
2. Dashboard "tested" badge
Small green dot (`bg-node-function`, 6×6, subtle 4px glow) next to the existing complexity badge in `CustomNode.tsx`. Renders only when `data.tags?.includes("tested")`; older graphs without the tag look identical to before. `GraphView.tsx` had two `CustomNodeData` construction sites missing `tags: node.tags` (the layer-detail topology builder and `buildCustomFlowNode` helper). Both plumbed. `KnowledgeGraphView.tsx` already passes tags; no change there.
Forward / backward compatibility
Pure additive, no shim needed:
- `tags` field: […] `[]` if missing. Old nodes have `tags=[]`, badge code uses `?.includes(...)`, no-op. […] `tags` array.
- `tested_by` edge type: […] `/understand`, at which point Pass 1 swaps them in place.
- `tested` tag visibility: […]
No schema changes. No new edge types. No new node types. No new dashboard state.
Files
Test plan
- `cd understand-anything-plugin/skills/understand && python3 -m unittest test_merge_batch_graphs.py`: 47/47 pass (16 `production_candidates` + 11 `is_test_path` + 19 `link_tests` end-to-end + 1 `merge_and_normalize` integration)
- `pnpm --filter @understand-anything/core test`: 654/654 pass (untouched)
- `pnpm --filter @understand-anything/dashboard test`: 42/42 pass (untouched)
- `pnpm --filter @understand-anything/dashboard build`: clean
- […] `microservices-demo`: […] `tested_by` edges, 3 inverted, 0 production nodes tagged […]
- `link_tests` regression on the shippingservice case: one Go `_test.go` covering `main.go` + `tracker.go` + `quote.go` is preserved by Pass 1 swap (no path-convention pair exists; Pass 2 finds none, by design)
- Malformed `tags` (None / string / int / dict): verified non-crashing for all five cases
- […] `/understand`, confirm dashboard renders the green dot on tested files only, edges all flow production → test
Out of scope (deliberately, per #113 "minimal valuable thing")
- […] (`#[cfg(test)]`, no separate file).
- […] `tested_by` edges in already-loaded old graphs: Pass 1 handles this on the next merge run instead.
Pre-existing, not from this branch
`pnpm lint` errors with `eslint: command not found` (no eslint installed at root, no `eslint.config.*`). Same on `main`. Out of scope.
🤖 Generated with Claude Code