
feat: deterministic tested_by edges + dashboard badge (#113) #122

Merged
Lum1104 merged 7 commits into main from feat/issue-113-tested-by-coverage
May 9, 2026

Lum1104 (Owner) commented May 6, 2026

Closes #113.

What's wrong

@Bulkmaker reported in #113 that test-coverage info on the graph is currently unusable for three reasons:

  1. tested_by edge direction is inconsistent. The file-analyzer prompt says edges should be production → test, but production files don't import their tests, so the LLM only sees the relationship while analyzing a test file — and naturally emits test → production (inverted). Same project, same file types, mixed directions across batches.
  2. Massive undercounting. On a real Nuxt 4 + Directus monorepo with ~140 unit + e2e tests, only 17 tested_by edges came through (~7% of production files looked tested vs. true coverage being far higher).
  3. Dashboard treats tested_by like any other edge. Nothing visually distinguishes a tested file from an untested one.

Confirmed reproduction on Google's microservices-demo (real corpus, not Nuxt-specific): 7 LLM-emitted tested_by edges, 3 of them (43%) inverted (shippingservice_test.go → main.go, shippingservice_test.go → tracker.go, product_catalog_test.go → product_catalog.go). 0 production nodes tagged.

Fix

Two pieces:

1. Canonicalize tested_by direction in merge-batch-graphs.py (two-pass linker)

Path-based linker integrated as Step 5b between node dedup and edge dedup, with a refined two-pass design:

Pass 1 — preserve LLM evidence, fix direction.
The LLM's tested_by pairings are real (it sees import { CartService } from '../src/services/CartService' in the test file). What's wrong is direction: the source is the file being analyzed = a test. So we walk every existing tested_by edge and:

  • canonical (production → test) → keep unchanged
  • inverted (test → production) → flip in place; description gets [direction corrected] audit marker
  • semantically broken (test ↔ test, prod ↔ prod, orphan endpoint, duplicate pair) → drop
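The three outcomes above can be sketched as follows. This is a minimal illustration, not the code in merge-batch-graphs.py: the function name `canonicalize_tested_by`, the `"type"` edge key, the `file_paths` mapping, and the `is_test` callback are all assumptions made for the example. It also copies a flipped edge rather than mutating it in place, and it keeps the first arrival on duplicates (the weight-aware variant is discussed later in this thread).

```python
from typing import Callable


def canonicalize_tested_by(
    edges: list[dict],
    file_paths: dict[str, str],      # node id -> file path (hypothetical shape)
    is_test: Callable[[str], bool],  # hypothetical test/production classifier
) -> tuple[list[dict], int, int]:
    """Return (kept_edges, swapped, dropped)."""
    kept: list[dict] = []
    covered: set[tuple[str, str]] = set()  # (production id, test id) pairs seen
    swapped = dropped = 0
    for edge in edges:
        if edge.get("type") != "tested_by":
            kept.append(edge)              # unrelated edges pass through
            continue
        src, tgt = edge.get("source"), edge.get("target")
        if src not in file_paths or tgt not in file_paths:
            dropped += 1                   # orphan endpoint
            continue
        src_is_test = is_test(file_paths[src])
        tgt_is_test = is_test(file_paths[tgt])
        if src_is_test == tgt_is_test:
            dropped += 1                   # test<->test or prod<->prod
            continue
        pair = (tgt, src) if src_is_test else (src, tgt)
        if pair in covered:
            dropped += 1                   # duplicate pair: keep the first
            continue
        covered.add(pair)
        if src_is_test:
            # Inverted edge: flip it and leave an audit marker.
            edge = dict(edge, source=tgt, target=src)
            desc = edge.get("description") or ""
            edge["description"] = f"{desc} [direction corrected]".strip()
            swapped += 1
        kept.append(edge)
    return kept, swapped, dropped


paths = {"file:a.ts": "src/a.ts", "file:a.test.ts": "src/a.test.ts"}
demo_edges = [
    {"type": "tested_by", "source": "file:a.test.ts", "target": "file:a.ts",
     "description": "imports a"},
    {"type": "tested_by", "source": "file:a.test.ts", "target": "file:a.test.ts"},
    {"type": "tested_by", "source": "file:a.ts", "target": "file:missing"},
    {"type": "imports", "source": "file:a.test.ts", "target": "file:a.ts"},
]
demo_kept, demo_swapped, demo_dropped = canonicalize_tested_by(
    demo_edges, paths, lambda p: ".test." in p
)
```

In the demo, the first edge is inverted and gets flipped, the second (test to test) and third (orphan endpoint) are dropped, and the `imports` edge is left alone.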

Pass 2 — supplement via path conventions.
For tests Pass 1 didn't pair, walk candidate production paths from production_candidates(test_path) and emit a fresh production → test edge for the first match. Pairs already covered by Pass 1 are skipped.
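A rough sketch of the supplement step, assuming the candidate generator is passed in as a callable. The name `supplement_by_path` and its parameters are hypothetical; only the "first matching candidate wins, already-paired tests are skipped" behaviour comes from the description above.

```python
from typing import Callable, Iterable


def supplement_by_path(
    test_paths: Iterable[str],
    known_paths: set[str],                       # every file path in the graph
    covered_tests: set[str],                     # tests already paired by Pass 1
    candidates: Callable[[str], Iterable[str]],  # stand-in for production_candidates
) -> list[tuple[str, str]]:
    """Return new (production_path, test_path) pairs for unpaired tests."""
    new_pairs: list[tuple[str, str]] = []
    for test_path in test_paths:
        if test_path in covered_tests:
            continue                             # Pass 1 evidence takes precedence
        for candidate in candidates(test_path):
            if candidate in known_paths and candidate != test_path:
                new_pairs.append((candidate, test_path))
                break                            # first match wins
    return new_pairs


demo_pairs = supplement_by_path(
    ["src/a.test.ts", "src/b.test.ts", "src/c.test.ts"],
    {"src/a.ts", "src/b.ts", "src/a.test.ts", "src/b.test.ts", "src/c.test.ts"},
    {"src/b.test.ts"},
    lambda p: [p.replace(".test.", ".")],
)
```

Here `src/b.test.ts` is skipped because Pass 1 already paired it, and `src/c.test.ts` produces nothing because no candidate exists in the graph.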

Path-convention coverage (production_candidates):

  • JS/TS: sibling de-infix (.test./.spec.); walk-out from __tests__/, test/, spec/, tests/ subdirs; mirrored tests/ → {src,app,lib,""} tree
  • Go: sibling <name>_test.go → <name>.go (multi-source-per-test cases handled by Pass 1 swap, not by path heuristic)
  • Python: sibling test_<name>.py / <name>_test.py; in-package <pkg>/tests/test_<name>.py walk-out (Django apps); top-level tests/ mirror
  • Java/Kotlin: Maven/Gradle src/test/<lang>/... → src/main/<lang>/...; sibling fallback
  • C#: sibling fallback; <svc>/tests/X.cs → <svc>/X.cs and <svc>/src/.../X.cs (microservices-demo cartservice layout); .NET sibling-project mirror <App>.Tests/X.cs → <App>/X.cs
  • C/C++: sibling de-prefix/de-suffix
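To make the sibling conventions concrete, here is a small sketch covering only the same-directory cases for JS/TS, Go, and Python. It is an illustration under assumptions: `sibling_candidates` is a hypothetical name, and the real production_candidates also handles the walk-out and mirror layouts listed above, which are omitted here.

```python
import posixpath


def sibling_candidates(test_path: str) -> list[str]:
    """Return candidate production paths in the same directory as test_path."""
    directory, name = posixpath.split(test_path)
    stem, ext = posixpath.splitext(name)
    out: list[str] = []

    def add(base: str) -> None:
        if not base:
            return                       # nothing left after stripping the marker
        candidate = posixpath.join(directory, base + ext)
        if candidate != test_path and candidate not in out:
            out.append(candidate)

    # JS/TS de-infix: Foo.test.ts / Foo.spec.ts -> Foo.ts
    for infix in (".test", ".spec"):
        if stem.endswith(infix):
            add(stem[: -len(infix)])
    # Go and Python suffix form: name_test.go / name_test.py -> name.go / name.py
    if ext in (".go", ".py") and stem.endswith("_test"):
        add(stem[: -len("_test")])
    # Python prefix form: test_name.py -> name.py
    if ext == ".py" and stem.startswith("test_"):
        add(stem[len("test_"):])
    return out
```

For example, `sibling_candidates("pkg/shipping_test.go")` yields `["pkg/shipping.go"]`, while a non-test path such as `src/cart.ts` yields an empty list.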

Tagging is consolidated into a final pass over all canonical edges, so production nodes get the "tested" tag whether the edge came from Pass 1 (canonical / swapped) or Pass 2 (supplement). tags is coerced to a fresh list when it arrives malformed (None / string / int / dict from raw LLM batch JSON), since the TypeScript autoFixGraph normalizer runs downstream of this script.
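The coercion described here might look like the sketch below. The helper name `add_tested_tag` is an assumption for illustration; the point is that a non-list `tags` value is replaced with a fresh list before the membership check and append.

```python
def add_tested_tag(node: dict) -> bool:
    """Add "tested" to node["tags"]; return True if it was newly added."""
    tags = node.get("tags")
    if not isinstance(tags, list):
        tags = []                # malformed (None / str / int / dict): start fresh
        node["tags"] = tags
    if "tested" in tags:
        return False
    tags.append("tested")
    return True
```

Returning a boolean makes it easy for a caller to count how many production nodes were newly tagged.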

link_tests returns (added, dropped, tagged, swapped); the merge report distinguishes the four counters.

The file-analyzer prompt keeps the tested_by row in the schema table — we now actively use the LLM-emitted edges as Pass 1 evidence. The note explains direction will be auto-canonicalized so the LLM doesn't need to be defensive about it.

2. Dashboard "tested" badge

Small green dot (bg-node-function, 6×6, subtle 4px glow) next to the existing complexity badge in CustomNode.tsx. Renders only when data.tags?.includes("tested") — older graphs without the tag look identical to before.

GraphView.tsx had two CustomNodeData construction sites missing tags: node.tags (the layer-detail topology builder and buildCustomFlowNode helper). Both plumbed. KnowledgeGraphView.tsx already passes tags; no change there.

Forward / backward compatibility

Pure additive, no shim needed:

| | Old graph + new dashboard | New graph + old dashboard |
|---|---|---|
| tags field | Already in schema; auto-fix sets [] if missing. Old nodes have tags=[], badge code uses ?.includes(...), no-op. | Old dashboard ignores extra string in tags array. |
| tested_by edge type | Already in schema. Inverted edges from old graphs render in the wrong direction until you re-run /understand, at which point Pass 1 swaps them in place. | Same edge type, just canonical direction. Old dashboard renders fine. |
| tested tag visibility | Badge does not render. | Tag chip in NodeInfo / NodeTooltip already shows it as a gold pill, which is fine. |

No schema changes. No new edge types. No new node types. No new dashboard state.

Files

 understand-anything-plugin/agents/file-analyzer.md                                |   3 +-
 understand-anything-plugin/packages/dashboard/src/components/CustomNode.tsx       |  16 +-
 understand-anything-plugin/packages/dashboard/src/components/GraphView.tsx        |   2 +
 understand-anything-plugin/skills/understand/SKILL.md                             |   2 +-
 understand-anything-plugin/skills/understand/merge-batch-graphs.py                | 500+++
 understand-anything-plugin/skills/understand/test_merge_batch_graphs.py           | 800+++ (new file)
 .gitignore                                                                        |   2 +

Test plan

  • cd understand-anything-plugin/skills/understand && python3 -m unittest test_merge_batch_graphs.py — 47/47 pass (16 production_candidates + 11 is_test_path + 19 link_tests end-to-end + 1 merge_and_normalize integration)
  • pnpm --filter @understand-anything/core test — 654/654 pass (untouched)
  • pnpm --filter @understand-anything/dashboard test — 42/42 pass (untouched)
  • pnpm --filter @understand-anything/dashboard build — clean
  • Real-world validation on Google microservices-demo:
    • Before this PR: 7 tested_by edges, 3 inverted, 0 production nodes tagged
    • After early commits (strip-and-rederive): 4 edges, 0 inverted, 4 tagged — dropped 3 LLM signals the path-convention pass couldn't recover (Go multi-source-per-test, .NET Maven layout)
    • After swap-then-supplement (latest commit): 7 edges, 0 inverted, 7 tagged — full coverage signal preserved, all 3 inverted edges flipped in place
  • link_tests regression on the shippingservice case: one Go _test.go covering main.go + tracker.go + quote.go is preserved by Pass 1 swap (no path-convention pair exists; Pass 2 finds none, by design)
  • Codex P1: malformed tags (None / comma-string / single-string / int / dict) verified non-crashing for all five cases
  • Manual end-to-end on a real project: rerun /understand, confirm dashboard renders the green dot on tested files only, edges all flow production → test

Out of scope (deliberately, per #113 "minimal valuable thing")

  • "Show only untested" filter / per-layer coverage % — issue called these optional follow-ons.
  • Rust (tests are usually inline #[cfg(test)], no separate file).
  • C/C++ project-style mirrors (project structure varies wildly).
  • Auto-flipping inverted tested_by edges in already-loaded old graphs — Pass 1 handles this on the next merge run instead.

Pre-existing, not from this branch

pnpm lint errors with eslint: command not found (no eslint installed at root, no eslint.config.*). Same on main. Out of scope.

🤖 Generated with Claude Code

Lum1104 and others added 4 commits May 6, 2026 22:00
The file-analyzer LLM only sees the production↔test relationship when
analyzing a test file (production files don't import their tests), so
its emitted direction was unreliable across batches and recall was
massively undercounted (~7% on a real Nuxt 4 + Directus repo).

Move tested_by production entirely into the merge step. The linker:

- Strips every tested_by edge from batch input (LLM direction unreliable).
- Indexes file:* nodes and classifies each path as test or production.
- For each test, walks ordered candidate production paths (sibling
  de-infix, __tests__/ walk-out, mirrored tests/→{src,app,lib,<root>}
  tree, Maven/Gradle src/test/...→src/main/...).
- Emits canonical production → test edges and tags production nodes
  "tested".

Supported conventions: JS/TS family (.test/.spec), Go (_test.go),
Python (test_*.py, *_test.py), Java (*Test/*Tests/*IT.java), Kotlin
(*Test/*Tests.kt), C# (*Test/*Tests.cs), C/C++ (test_*, *_test).
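A table-driven classifier for most of these naming conventions could be sketched as below. The table contents, the JS/TS extension set, and the name `looks_like_test` are assumptions for illustration and do not reproduce the real is_test_path; C/C++ patterns are left out for brevity.

```python
import posixpath
import re

# Hypothetical table: file extension -> regex the file stem must fully match.
TEST_NAME_PATTERNS: dict[str, re.Pattern[str]] = {
    ".go": re.compile(r".+_test"),
    ".py": re.compile(r"test_.+|.+_test"),
    ".java": re.compile(r".+(Test|Tests|IT)"),
    ".kt": re.compile(r".+(Test|Tests)"),
    ".cs": re.compile(r".+(Test|Tests)"),
}
JS_TS_EXTS = {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}


def looks_like_test(path: str) -> bool:
    """Classify a path as a test file from its file name alone."""
    stem, ext = posixpath.splitext(posixpath.basename(path))
    if ext in JS_TS_EXTS:
        # JS/TS uses an infix (.test / .spec) before the extension.
        return len(stem) > 5 and stem.endswith((".test", ".spec"))
    pattern = TEST_NAME_PATTERNS.get(ext)
    return bool(pattern and pattern.fullmatch(stem))
```

Keeping the per-language rules in one table means adding a language is a one-line change rather than a new conditional block.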

Stdlib only, type-hinted in existing style. Hooked into
merge_and_normalize between node dedup (Step 5) and edge dedup
(Step 6). Reports drops under "Fixed" and additions under a new
"Tested-by linker" section.

Tests cover path classification, candidate generation, full link_tests
behaviour (forward direction, idempotence, LLM-edge stripping,
test-to-test rejection), and the merge integration. 31 cases, stdlib
unittest, runnable with `python -m unittest test_merge_batch_graphs.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Now that the merge script produces tested_by edges deterministically
from path conventions, the LLM should not emit them — its direction is
unreliable across batches and any emitted edges are stripped on merge.

- Remove tested_by row from file-analyzer's edge table.
- Add a note pointing to the deterministic linker.
- Document the new behaviour in the merge section of SKILL.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

- is_test_path: collapse 7 per-language conditional blocks into a
  data-driven _TEST_NAME_PATTERNS table; JS/TS infix stays inline
- production_candidates: extract _join + module-level _add_unique to
  drop the nested closure and the repeated trailing-slash idiom
- Drop dead _TEST_DIR_SEGMENTS constant and the local _splitext
  reimplementation; use os.path.splitext
- link_tests: drop the impossible-malformed-tags guard, tighten the
  docstring, change edge description to "Path-based pairing
  (deterministic)", drop redundant break comment
- Trim Step 5b inline block that duplicated the module-level header
- Convert file-analyzer Note from blockquote to bold paragraph to
  match surrounding prompt style

Tests: split the strip-edges test from the unrelated-edges-survive
test, add empty-input and missing-filePath cases, pin sibling-before-
walkup and sibling-before-mirror priority order, drop brittle report
text assertion. 36 tests, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Render a small green dot next to the complexity badge whenever a node's
tags contain "tested" — surfacing the deterministic linker's signal so
users can see at a glance which files have paired tests.

Plumb node.tags through both CustomNodeData construction sites in
GraphView.tsx; KnowledgeGraphView.tsx already passes tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c257a55f0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +482 to +484
tags = prod_node.setdefault("tags", [])
if "tested" not in tags:
    tags.append("tested")

P1: Normalize tags type before adding "tested" tag

When a matched production file has malformed tags (for example null, a string, or any non-list value from an LLM batch), this block assumes list semantics and can raise (TypeError on membership or AttributeError on append), which aborts the whole merge. This regression is introduced by the new linker path and can break /understand on otherwise recoverable batch data; coerce non-list tags to [] (or another safe default) before checking/appending.
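The failure mode can be reproduced in isolation. The snippet below is a demonstration only: `naive_tag` is a hypothetical wrapper around the three flagged lines that reports which exception, if any, the pattern raises for a given node.

```python
def naive_tag(node: dict) -> str:
    """Run the flagged pattern and report the outcome as a string."""
    # setdefault returns the existing value when the key is already present,
    # so a malformed value is handed straight to the membership check.
    tags = node.setdefault("tags", [])
    try:
        if "tested" not in tags:
            tags.append("tested")
    except (TypeError, AttributeError) as exc:
        return type(exc).__name__
    return "ok"


outcomes = {
    "missing": naive_tag({}),
    "none": naive_tag({"tags": None}),      # membership test on None fails
    "string": naive_tag({"tags": "a,b"}),   # str has no append
    "int": naive_tag({"tags": 3}),          # membership test on int fails
    "dict": naive_tag({"tags": {"k": 1}}),  # dict has no append
}
```

Only the missing-key case succeeds; every malformed value raises, which in the merge script would abort the whole run.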

Useful? React with 👍 / 👎.

Lum1104 and others added 2 commits May 7, 2026 10:25
Codex flagged that prod_node.setdefault("tags", []) returns the existing
value when the key is present, so a raw LLM batch with tags=None or
tags="some string" would crash the whole merge on the next "tested" not
in tags membership check.

The TypeScript autoFixGraph normalizer that handles this case runs
downstream of merge-batch-graphs.py, not before it, so the Python side
has to defend itself. Coerce non-list tags to a fresh [] before the
membership/append.

Regression test exercises None / comma-string / single-string / int /
dict inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Strip-and-rederive (current PR behaviour) drops real coverage signal on
projects whose test layout doesn't match a naming convention. On the
Google microservices-demo the LLM had emitted 7 valid tested_by edges
(3 with inverted direction); the strip pass dropped them and the path-
convention rederive could only re-pair 4 of them. Net: 7 → 4 edges,
3 production files lost their tested signal.

Replace strip-and-rederive with two-pass swap-then-supplement:

  Pass 1 — walk LLM tested_by edges. Canonical (production → test)
  edges pass through unchanged. Inverted (test → production) edges are
  flipped in place; description gets a `[direction corrected]` audit
  marker. Edges with no recoverable meaning (test↔test, prod↔prod,
  orphan endpoint, duplicate pair) are dropped.

  Pass 2 — for tests not yet paired by Pass 1, walk path-convention
  candidates and emit a fresh production → test edge for the first
  match. Pairs already covered by Pass 1 are skipped.

Tagging is consolidated into a final pass over all canonical edges so
production nodes get the "tested" tag whether the edge came from
Pass 1 (canonical / swapped) or Pass 2 (supplement).

Multi-language audit of production_candidates revealed three real-world
gaps surfaced by re-checking microservices-demo and common project
layouts:

  - JS/TS walk-out only handled `__tests__/`. Extended to also walk out
    of `<dir>/test/`, `<dir>/spec/`, and `<dir>/tests/` (some JS/TS
    projects use these instead of __tests__/).
  - Python walk-out only handled top-level `tests/`. Added in-package
    `<pkg>/tests/test_<name>.py` → `<pkg>/<name>.py` (Django app style
    and any project that colocates tests with the package).
  - C# only had sibling fallback. Added two new mirrors:
      * `<svc>/tests/X.cs` ↔ `<svc>/X.cs` and `<svc>/src/.../X.cs`
        (microservices-demo cartservice exact layout).
      * `<App>.Tests/Foo/BarTests.cs` ↔ `<App>/Foo/Bar.cs`
        (.NET sibling-project convention).

Go is intentionally not changed — the "one _test.go covers several
.go files in the same package" pattern is now solved by Pass 1
(swapping LLM edges), not by trying to invent multi-pair path heuristics.

The file-analyzer prompt is updated: the `tested_by` row is restored
in the schema table because we now use those edges as evidence (Pass 1
canonicalizes the direction). The note explains direction will be
auto-corrected so the LLM doesn't need to be defensive about it.

link_tests now returns a 4-tuple (added, dropped, tagged, swapped);
the merge_and_normalize report distinguishes "edges produced
(supplement)" from "edges flipped" from "edges dropped".

Real-world validation on microservices-demo:
  before:   7 tested_by edges, 3 inverted, 0 tagged
  after PR: 4 tested_by edges, 0 inverted, 4 tagged   ← strip-and-rederive
  this:     7 tested_by edges, 0 inverted, 7 tagged   ← swap-then-supplement

Tests: 47 pass (was 37). New cases cover all swap branches, the
shippingservice "one test, many sources" regression, and each new
language pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Lum1104 commented May 9, 2026

@codex review this


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bb22fd9af

Comment on lines +583 to +586
if pair in covered:
    # Duplicate canonical edge — drop the dup, keep the first.
    dropped += 1
    continue

P2: Keep max-weight duplicate tested_by edge

When duplicate tested_by pairs appear, this branch drops later edges purely by arrival order (covered), so the stronger edge can be discarded before the generic Step 6 deduper (which normally keeps the highest weight) ever runs. In multi-batch runs where the same pair is emitted with different confidences, the resulting graph can retain a lower-confidence edge solely because its batch was processed first, which degrades ranking/visual confidence semantics for tested_by links.

Useful? React with 👍 / 👎.

Owner Author


Addressed in a4bdc1c. Pass 1 now mirrors Step 6's weight-aware dedup locally:

  • pair_to_idx tracks each kept pair's slot in the compacted edges list
  • on a duplicate, compare weights and replace the slot when strictly heavier (tie keeps the first arrival, matching Step 6)
  • swap operation is deferred until an edge is known to survive — no work spent canonicalizing a doomed duplicate
  • swapped_pairs set replaces the raw swapped counter so the reported number reflects the final output (replacing a swapped edge with a heavier canonical one drops it from the count, and vice versa)
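The slot-replacement idea in the first two bullets can be sketched as follows, assuming edges have already been canonicalized to production → test. `dedup_by_weight` is a hypothetical name, and `_num` here is a stand-in that treats non-numeric weights as 0; the real helper of that name may behave differently. Swap deferral and the `swapped_pairs` accounting are not modelled.

```python
def _num(value: object) -> float:
    """Stand-in weight coercion: non-numeric values count as 0."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return 0.0


def dedup_by_weight(edges: list[dict]) -> list[dict]:
    """Keep one edge per (source, target); a strictly heavier duplicate wins."""
    kept: list[dict] = []
    pair_to_idx: dict[tuple[str, str], int] = {}  # pair -> slot in `kept`
    for edge in edges:
        pair = (edge["source"], edge["target"])
        idx = pair_to_idx.get(pair)
        if idx is None:
            pair_to_idx[pair] = len(kept)
            kept.append(edge)
        elif _num(edge.get("weight", 0)) > _num(kept[idx].get("weight", 0)):
            kept[idx] = edge   # strictly heavier: replace in the same slot
        # tie or lighter: the newcomer is dropped, the first arrival stays
    return kept
```

Replacing the slot instead of appending keeps the output order independent of which duplicate happened to be heavier.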

Five new unit tests cover all four weight × direction combinations plus a tie case.


Lum1104 commented May 9, 2026

Code review

Found 1 issue:

  1. link_tests Pass 1 drops duplicate (production, test) pairs by arrival order before Step 6's weight-based dedup runs. When two batches both emit a tested_by edge for the same pair with different confidences (e.g. 0.3 vs 0.9), whichever edge is iterated first wins, so the higher-weight one can be silently discarded. The general Step 6 deduper at merge-batch-graphs.py line 762 (_num(edge.get("weight", 0)) > _num(existing.get("weight", 0))) only sees one copy because the second was already discarded inside link_tests. The same bug affects both the canonical-dup branch (line 583) and the inverted-dup branch (line 592). Independently flagged by Codex on this PR (#122 (comment)). Suggested fix: in both "pair in covered" branches, look up the existing kept edge and replace it when the new edge has a higher weight (mirroring Step 6).

# Both endpoints must be known file nodes; one test, one production.
# Anything else (orphan, test↔test, prod↔prod, non-file endpoint)
# has no recoverable meaning — drop it.
if (src_class, tgt_class) == ("prod", "test"):
    pair = (src, tgt)
    if pair in covered:
        # Duplicate canonical edge — drop the dup, keep the first.
        dropped += 1
        continue
    covered.add(pair)
    edges[write_idx] = edge
    write_idx += 1
elif (src_class, tgt_class) == ("test", "prod"):
    pair = (tgt, src)
    if pair in covered:
        dropped += 1
        continue
    covered.add(pair)
    # Flip in place; mark provenance so reviewers can audit.
    edge["source"] = tgt
    edge["target"] = src
    edge["direction"] = "forward"
    prev = edge.get("description")

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Codex P2: link_tests Pass 1 dropped duplicate (production, test) pairs
purely by arrival order — when two batches both emitted a tested_by
edge for the same pair with different confidences (0.3 vs 0.9), the
edge that happened to iterate first won. The general Step 6 deduper
at line 762 mirrors `weight > existing.weight` semantics but it only
ever saw one of the duplicates, so it couldn't rescue the heavier one.

Refactor Pass 1 to mirror Step 6's weight comparison locally:

  - Track `pair_to_idx` mapping each kept (prod, test) pair to its
    slot in the compacted edges list. On a duplicate, look up the
    existing kept edge and compare weights; if the new edge is
    strictly heavier, swap (if needed) and replace the slot. Tie or
    lighter → drop the new edge.
  - Defer the swap operation until we know an edge will survive — no
    point canonicalizing a doomed duplicate.
  - Track surviving swap pairs in a separate `swapped_pairs` set so
    the `swapped` counter reflects the FINAL output, not the wasted
    work on edges that were later replaced. This means: replacing a
    swapped edge with a heavier canonical one drops the swap from
    the count; replacing a canonical edge with a heavier swapped one
    adds it.
  - Extract the swap-in-place mutation into `_swap_tested_by_in_place`
    so it can be invoked from both code paths.

Five new unit tests cover all four weight-vs-direction combinations
plus a tie case (existing test_drops_duplicate_canonical_edges, which
still passes — tie → keep first, no swap counted).

microservices-demo regression check unchanged: 7 → 7 edges, 3 swapped,
0 dropped, 7 tagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@Lum1104 Lum1104 merged commit 3eb7700 into main May 9, 2026
1 check passed
@Lum1104 Lum1104 deleted the feat/issue-113-tested-by-coverage branch May 9, 2026 03:18
Development

Successfully merging this pull request may close these issues.

Visualize test coverage on file nodes (tested_by edges + dashboard hint)
