feat(embeddings): include function body in semantic search text #345

Open
Minidoracat wants to merge 9 commits into tirth8205:main from Minidoracat:feat/body-embedding

Conversation

Minidoracat (Contributor) commented Apr 19, 2026

Summary

_node_to_text() now appends a truncated function body snippet next to the metadata we already embed, so semantic search can rank by what a function does rather than just its name. All four providers (Local / Google / MiniMax / OpenAI-compatible) benefit transparently; no schema migration, no change to EmbeddingStore.__init__ signature, no change to parser.py / migrations.py / search.py / tools.py.

The motivation is the note already in the README: signature-only embeddings (~10 tokens per node) flatten the quality gap between small models and long-context SOTA embeddings (Gemini 2 preview, Qwen3-8B). This PR lifts the input budget closer to what those models were trained to consume.

Design decisions (ADR summary)

  • Body snippet extraction — language-agnostic state machine. Drops the signature line (dedup with metadata), strips leading docstrings / header comments ("""...""", /** */, # ..., ///, *), and collapses blank runs before truncating. Three-AND signature-dedup gate (name + language decl keyword + param first-token) for Python / JS / TS / JSX / TSX / Java / Kotlin / Rust / Go; other languages fall through as "don't dedup" so the body line just eats a bit of its own budget.
  • Pure log / print lines dropped — logger.info(...) / console.log(...) / fmt.Println(...) / println!(...) lines that carry no retrieval signal are skipped everywhere in the body, language-agnostically. Tightened in a follow-up iteration so we don't misfire on serve_cmd.error(...) (argparse) or fmt.Fprintf(w, ...) (arbitrary writer): the receiver identifier is a fixed whitelist, and the helper walks to the balanced close of the log call to ensure nothing substantive trails it.
  • Per-provider char budget (ADR-B4) — local=700 / google=3000 / minimax=3000 / openai=6000, CRG_EMBED_BODY_MAX_CHARS env var overrides. MiniLM's 256-token window just fits; long-context models get enough room to actually use their capacity.
  • Sticky-flag opt-in (ADR-C3) — new embeddings_meta table on embeddings.db persists embed_body_enabled per-DB. Fresh repos auto-enable; existing DBs keep legacy signature-only behaviour until the user runs code-review-graph embed --include-body --confirm-reembed. Avoids one-time re-embed API bill surprise on upgrade; keeps mixed-state rows impossible across incremental runs.
  • Two-stage IO-bounded filter — stored text_hash = meta_hash:body_hash:file_hash. Stage 1 computes metadata-hash only (zero IO) and skips when metadata + file_hash (from GraphNode.file_hash) are unchanged. Stage 2 reads the file only when stage 1 can't prove freshness. Empty extracted bodies get sha256("") as a sentinel so they short-circuit too; read failures deliberately don't persist a sentinel so transient errors stay retryable.
  • Backward compat — legacy rows (pre-PR pure sha256 hex in text_hash, no :) parsed correctly by _split_combined_hash and trigger a single safe re-embed on first contact. CRG_DATA_DIR / CRG_REPO_ROOT env overrides fully respected (path containment check uses incremental.find_project_root).
  • Security — _FileLineCache resolves paths and enforces relative_to(repo_root.resolve()) so a poisoned nodes.file_path can't read /etc/passwd or traverse out of the repo into cloud embedding calls.
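The ADR-B4 budget lookup is simple enough to sketch. The numbers below come from this description; the helper name and the local-budget fallback for unknown providers are illustrative, not the PR's actual code:

```python
import os

# Per-provider body-snippet budgets from ADR-B4 (chars, not tokens).
_BODY_MAX_CHARS_BY_PROVIDER = {"local": 700, "google": 3000, "minimax": 3000, "openai": 6000}

def resolve_body_budget(provider: str) -> int:
    """Env override wins; unknown providers conservatively get the local budget."""
    override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS")
    if override is not None:
        return int(override)  # CRG_EMBED_BODY_MAX_CHARS covers edge cases
    return _BODY_MAX_CHARS_BY_PROVIDER.get(provider, 700)
```

Falling back to the smallest budget for unrecognized providers is the conservative choice sketched here; the real table may resolve differently.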

Benchmark evidence

Ran MRR@3 on two codebases with multiple providers, EN + ZH queries, legacy metadata-only vs body-enriched. Every provider's full table is below; the short version is: SOTA models consistently gain, small models are repo-dependent.

code_review_graph itself — 50 golden queries, 537-node corpus

| Provider | EN Δ | ZH Δ | Mean Δ |
|---|---|---|---|
| google/gemini-embedding-2-preview | +0.030 | +0.087 | +0.058 |
| qwen/qwen3-embedding-8b | +0.077 | (err) | +0.077 (EN) |
| openai/text-embedding-3-small | +0.120 | +0.167 | +0.143 |
| openai/text-embedding-3-large | +0.143 | +0.187 | +0.165 |
| baai/bge-m3 | +0.197 | +0.230 | +0.213 |
| local:all-MiniLM-L6-v2 | +0.083 | +0.173 | +0.128 |
| local new-api text-embedding-v4 (Qwen-v4) | +0.110 | +0.107 | +0.108 |

All seven providers positive. BGE-m3 went from MRR 0.60 to 0.81 — a mid-tier model landing in top-tier territory.

Second benchmark — an external TypeScript Discord-bot codebase, 12 golden queries, 1667-node corpus

| Provider | EN Δ | ZH Δ | Mean Δ |
|---|---|---|---|
| qwen/qwen3-embedding-8b | +0.208 | +0.069 | +0.139 |
| openai/text-embedding-3-large | +0.069 | +0.153 | +0.111 |
| google/gemini-embedding-2-preview | -0.014 | +0.111 | +0.049 |
| local new-api Qwen-v4 | -0.014 | +0.056 | +0.021 |
| baai/bge-m3 | -0.042 | (err) | -0.042 |
| local:all-MiniLM-L6-v2 | -0.181 | -0.014 | -0.097 |

Top-tier (Qwen3-8B / OpenAI-3-large / Gemini-2) stay positive on a framework-heavy TS codebase. MiniLM drops -0.18 on English — its 256-token window is too small to absorb Discord SDK boilerplate (interaction.deferReply(), t('commands:...', locale)) without crowding out real logic tokens. The sticky-flag opt-in plus CRG_EMBED_INCLUDE_BODY=0 escape hatch is the right shape for that failure mode: a user who sees regression on their specific repo can revert in a single env var without losing their existing embeddings.

Third / fourth benchmarks — two internal Laravel / PHP codebases, SOTA A/B

Ran a four-provider A/B (MiniLM + OpenAI 3-large + Gemini-2-preview + Qwen3-8B) against two real Laravel apps (281-node ledger service and 1887-node API forum backend, with 6 and 10 hand-graded queries respectively), comparing PHP signature dedup OFF vs ON:

| Provider | usdt-style ledger Δ-change | forum-API Δ-change |
|---|---|---|
| Gemini-2-preview | +0.083 | +0.050 |
| OpenAI 3-large | +0.111 | 0 |
| Qwen3-8B | -0.056 (ceiling artefact) | 0 |
| MiniLM | -0.028 | -0.050 |

Net across 8 (repo × provider) cells: +0.110 MRR@3 — gains concentrate on the top-tier cloud models where users most expect the quality upside; MiniLM regresses by -0.03 to -0.05, and those users can revert via CRG_EMBED_INCLUDE_BODY=0. An earlier MiniLM-only bench had me keep PHP in the conservative fallback; the SOTA A/B overturned that, and the _looks_like_signature docstring records the reversal so future contributors don't undo it again. PSR-3 / Monolog / Laravel-facade log-only patterns ($logger->info, Log::info, error_log) are unconditionally included in the language-agnostic log-skip regex.

Cross-repo ranking (body-on MRR, two-repo avg)

  1. qwen/qwen3-embedding-8b — 0.759 (+0.108 Δ avg)
  2. google/gemini-embedding-2-preview — 0.756 (+0.054 Δ avg, consistent positive)
  3. openai/text-embedding-3-large — 0.735 (+0.138 Δ avg, biggest consistent PR value-add)
  4. Qwen-v4 — 0.642
  5. BGE-m3 — 0.594 (+0.086 but high variance)
  6. MiniLM — 0.423 (+0.016 on average but -0.097 on the TS codebase)

Test plan

  • 146 new / adjusted unit tests covering body extraction pipeline, signature dedup (Py / JS / TS / Java / Kotlin / Rust / Go + decorator + recursive-call negatives), block-comment + triple-quote state machine (single / multi-line / unclosed / JSDoc / inline), log-only line skipping (7 languages + mixed-statement / non-logger identifier / tail-continuation negatives), per-provider char budget + env override, sticky-flag cross-run behaviour including CLI flip, filter short-circuit with and without populated file_hash, read-failure must-retry regression, legacy hash backward compat, CRG_DATA_DIR / CRG_REPO_ROOT external cache dir regression.
  • MRR regression CI gate (tests/test_embeddings_mrr.py) — Local MiniLM against a bundled 10-query golden set; enforces MRR(body-on) >= MRR(legacy) - 0.02 AND > MRR(legacy) (non-inferiority plus strict improvement). pytest.importorskip("sentence_transformers") so default .[dev] CI stays green; installing .[embeddings] activates the gate.
  • Real-endpoint smoke tests manually verified on a local new-api gateway (text-embedding-v4) + OpenRouter across eight embedding models.
  • 12 rounds of Codex review converged to NO SUBSTANTIVE ISSUES. All findings (combined hash with file_hash, path traversal containment, Kotlin fun without return type, empty-body sentinel, transient-read-failure retry, CRG_DATA_DIR fast-path misfire, log-skip over-match, _is_log_only_line tail-continuation) are fixed with regression tests.

Full suite locally: 158 embedding-related tests pass on top of the existing 1020 passing tests. Repo-wide failures I saw (60) sit in test_multilang / test_parser / test_main around Julia / GDScript / ReScript parsing and the new --tools flag — they reproduce on upstream/main without this branch, so they're unrelated.

Caveats

  • Body enrichment is not uniformly beneficial across all (model, codebase) combinations. SOTA models (Qwen3-8B / Gemini-2 / OpenAI-3-large) gain consistently on every repo I tested; smaller or framework-sensitive setups can regress on heavily-boilerplate codebases. CRG_EMBED_INCLUDE_BODY=0 reverts to metadata-only without touching existing embeddings.
  • The built-in CI gate only runs Local MiniLM — it's a non-inferiority guardrail, not a claim that every user will see a win. The cross-provider comparison above is a better guide for which model to pick, and follow-up work can layer an adaptive-budget mechanism once tokenizer APIs stabilise across providers.
  • google/gemini-embedding-001 is listed by OpenRouter but its upstream Google AI Studio provider returned no-response for our API key during testing. Used google/gemini-embedding-2-preview instead, which is the newest working Google embedding on OpenRouter today.

Follow-up work (tracked separately, NOT in this PR)

Scope-disciplined items intentionally left out of this PR so its
review surface stays bounded. Planned as atomic, independently-mergeable
PRs (ordered by dependency, not priority):

PR B — install --auto-embed-hook (in progress on my fork)

install currently writes MCP config + graph update/watch hooks but
nothing refreshes embeddings after edits. A new --auto-embed-hook
flag drops a small PostToolUse(Edit|Write|MultiEdit) hook that
runs code-review-graph embed --repo "$CLAUDE_PROJECT_DIR" in the
background (via Claude Code hook async: true + stdout/stderr
redirect). The two-stage filter + file_hash check this PR adds
means only 1-3 changed nodes actually hit the provider per save,
keeping API cost negligible — but the user must opt in because
cloud providers DO bill per token.

Handling this in install (not here) is the right place because
(a) install already owns hook writing, (b) the body-embedding
code in this PR stays purely library-level, (c) keeps this PR's
review surface stable. Stacks on this PR's HEAD on my fork.

PR A — code-review-graph config subcommand + .env auto-load

Provider credentials today must be in shell env (for CLI) OR the
MCP client's .mcp.json env block (for MCP server). There's no
shared config layer, so CLI users end up juggling export commands
or direnv. A config subcommand would fix that:

code-review-graph config set provider.openai.api_key <key>
code-review-graph config set --project provider.openai.model <id>
code-review-graph config show --source   # which layer each value came from

Three-layer resolution (env > project .code-review-graph/config.toml >
global ~/.config/code-review-graph/config.toml > defaults), TOML
format, backwards-compatible (existing CRG_OPENAI_* env vars keep
working, just as one layer in the stack).
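The three-layer lookup described above can be sketched as follows. This is a hedged illustration of the proposed resolution order only; the CRG_* env-var naming and the flat dotted-key config shape are assumptions, not the final design:

```python
import os
from typing import Any, Optional

def resolve_setting(key: str,
                    project_cfg: dict,
                    global_cfg: dict,
                    default: Optional[Any] = None) -> Optional[Any]:
    """env > project config > global config > default, first hit wins."""
    env_key = "CRG_" + key.upper().replace(".", "_")  # illustrative mapping
    if env_key in os.environ:
        return os.environ[env_key]
    for layer in (project_cfg, global_cfg):
        if key in layer:
            return layer[key]
    return default
```

With this shape, existing CRG_OPENAI_* exports keep working simply by being the highest-priority layer.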

Orthogonal to PR B (the hook just shells out to code-review-graph embed; it doesn't care where env comes from), but planned after
so the feature lands on stable ground.

PR D — Adaptive body-char budget

Per-model capacity replaces the static _BODY_MAX_CHARS_BY_PROVIDER
table once common embedding providers expose stable tokenizer APIs.
Until then the env override (CRG_EMBED_BODY_MAX_CHARS) covers edge
cases.

PR E — Framework-SDK-pattern skip (beyond log lines)

The log-only skip this PR adds helps log-heavy codebases. Discord
SDK / Rails / Flask / Spring / React framework boilerplate is a
separate noise class that would need either provider-adaptive
budget (PR D) or per-framework heuristics (less maintainable).
Left as future work; CRG_EMBED_INCLUDE_BODY=0 is the current
escape hatch for small models on framework-heavy repos (see
TypeScript Discord-bot benchmark above).

Why: signature-only embeddings carry ~10 tokens per node (name + kind +
params + return type + language). That ceiling wastes the long-context
capacity of modern embedding models — Gemini 2, Qwen3-8B and similar
rank SOTA on MTEB-code largely on their ability to understand the
*body* of a function, but we were handing them only its name. So the
quality advantage of large-context embeddings against local MiniLM was
getting flattened by our own input, not by model capability.

What: `_node_to_text()` now appends a truncated body snippet after the
signature metadata, with the signature line and leading docstrings /
header comments stripped so the budget goes to real logic instead of
documentation. Four design choices worth flagging:

  1. Per-provider char budget (ADR-B4): local=700 / google=3000 /
     minimax=3000 / openai=6000, env `CRG_EMBED_BODY_MAX_CHARS`
     overrides. This gives MiniLM just enough to not silently truncate
     and gives the long-context models room to actually use their
     capacity.

  2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on
     `embeddings.db` persists `embed_body_enabled` per-DB. Fresh repos
     auto-enable; existing DBs hold legacy behavior until the user
     runs `code-review-graph embed --include-body --confirm-reembed`.
     This avoids surprising cloud-provider users with a one-time
     re-embed API bill on upgrade, and the sticky flag keeps new nodes
     consistent with older ones across incremental runs (no mixed
     body-on / body-off state).

  3. Two-stage IO-bounded filter (ADR-filter-short-circuit): stored
     hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the
     metadata hash only (zero file IO) and skips when metadata +
     file_hash are unchanged. Stage 2 reads the file only when stage
     1 can't prove freshness, recomputes the body hash, persists the
     combined form. Empty extracted bodies get sha256("") as a
     sentinel so they also short-circuit next time; read failures
     deliberately do NOT persist a sentinel so transient errors stay
     retryable.

  4. Language-agnostic body extraction with block-comment + triple-
     quote state machine and conservative signature dedup via three-
     AND gate (node.name + declaration keyword + param first-token).
     Unknown languages fall back to "don't dedup" rather than risk
     dropping real body. Coverage today: Python / JS / TS / JSX / TSX /
     Java / Kotlin / Rust / Go; other languages keep the signature
     line and just eat a bit of their body budget.
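The three-AND gate in point 4 can be sketched like this. It is a minimal illustration under stated assumptions: the keyword table is abridged, and the helper name and signature are hypothetical, not the PR's `_looks_like_signature`:

```python
import re

# Abridged declaration-keyword table; the real coverage spans more languages.
_DECL_KEYWORDS = {
    "python": ("def",),
    "javascript": ("function",),
    "go": ("func",),
    "rust": ("fn",),
    "kotlin": ("fun",),
}

def looks_like_signature(line: str, name: str, language: str, first_param: str) -> bool:
    """Dedup only when ALL THREE match: node name, decl keyword, first-param token."""
    keywords = _DECL_KEYWORDS.get(language)
    if not keywords:
        return False  # unknown language: conservative fallback, never dedup
    has_keyword = any(re.search(rf"\b{kw}\b", line) for kw in keywords)
    has_param = not first_param or first_param in line
    return name in line and has_keyword and has_param
```

The conjunction is what keeps recursive calls like `return handle(req)` in the body: the name and param match, but the declaration keyword is absent.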

Backward compat: legacy rows (pre-existing sha256 hex in
`embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file
hash) are both parsed correctly by `_split_combined_hash` and trigger
a single safe re-embed on first contact.
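The backward-compat parsing reduces to a colon count. This is a hypothetical simplification of the `_split_combined_hash` behavior described above, including the empty-body sentinel:

```python
import hashlib

# Empty extracted bodies store sha256("") so they short-circuit next run.
EMPTY_BODY_SENTINEL = hashlib.sha256(b"").hexdigest()

def split_combined_hash(stored: str):
    """Legacy rows: bare sha256 hex (no ':'). Iter-2: meta:body. Current: meta:body:file."""
    parts = stored.split(":")
    if len(parts) == 1:
        return parts[0], None, None   # legacy -> missing parts force one safe re-embed
    if len(parts) == 2:
        return parts[0], parts[1], None
    return parts[0], parts[1], parts[2]
```

Any row with a `None` component fails the freshness check once and gets re-embedded into the current three-part form.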

Security: `_FileLineCache` resolves paths and enforces a `relative_to`
containment check against the repo root, so a poisoned
`nodes.file_path` can't read `/etc/passwd` or traverse out of the
repo into cloud embedding calls.
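The containment gate is a two-liner around `Path.relative_to`. A minimal sketch, assuming the same resolve-then-compare order described above (the helper name is illustrative):

```python
from pathlib import Path

def is_contained(candidate: str, repo_root: Path) -> bool:
    """Resolve symlinks and '..' on both sides, then require candidate under root."""
    try:
        Path(candidate).resolve().relative_to(repo_root.resolve())
    except ValueError:
        return False  # absolute escape or traversal: refuse to read
    return True
```

Resolving *before* the `relative_to` check is the important part: a literal prefix comparison would wave through `/repo/../etc/passwd`.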

Verification: regression MRR@3 on a 10-query golden set (bundled mini
repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent
queries like "escape HTML entities including quotes and backticks"
or "read GEMINI_API_KEY from the environment" move the correct node
into top-3 for the first time because the body now contains the
distinguishing tokens. Tested against real OpenAI-compatible endpoint
too (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded
on first run, 0 on a repeat with sticky flag honored — filter IO
short-circuit verified end-to-end.

Tests: 59 new cases (file-line cache LRU + containment, signature
dedup with decorator / recursive / params mismatch / unknown lang,
block-comment and triple-quote state machine with JSDoc / Javadoc /
Python single-line and multi-line docstrings / long comments,
per-provider budget resolution + env override, sticky flag across
runs including CLI flip, filter short-circuit with and without
populated file_hash, read-failure must-retry regression, legacy hash
backward compat). MRR regression test skipped gracefully when
sentence-transformers isn't installed (CI matrix with `.[dev]` only
stays green; CI jobs that install `.[embeddings]` activate the gate).

Scope: all four providers (Local / Google / MiniMax / OpenAI-
compatible) benefit transparently; no schema migration on graph.db,
no change to `EmbeddingStore.__init__` signature, no change to
parser.py / migrations.py / search.py / tools.py. Adds one new
subcommand `embed` and a private `embeddings_meta` table on
`embeddings.db`.

Why: Codex round 7 caught that the legacy inference
``db_path.parent -> repo root`` silently breaks for the supported
``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir,
not inside ``<repo>/.code-review-graph/``). In that case
``_FileLineCache`` received the cache dir as repo_root and every
body-snippet read failed the containment check, making body
enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two-
stage filter then wrote metadata-only hashes, so even opting back
in via ``embed --include-body --confirm-reembed`` wouldn't help.

What: ``_repo_root_guess`` now:
  1. keeps the fast path when db_path parent is literally
     ``.code-review-graph`` (zero-config common case, no cost)
  2. otherwise delegates to ``incremental.find_project_root`` which
     already honors ``CRG_REPO_ROOT`` and walks up for a ``.git``
     marker — the same resolution order the rest of the CLI uses
  3. returns None only when both fail, letting callers treat
     absolute file_path values as self-locating

Regression test: ``TestRepoRootGuessExternalDataDir`` covers the
legacy layout, the env-override fallback, and an end-to-end read
through ``_FileLineCache`` from an external cache dir. Existing
``TestFilterStageShortCircuit`` fixtures now write their DB under
``<tmp_path>/.code-review-graph/`` so they hit the fast path
rather than the cwd-relative fallback.

Why: Codex round 8 found the round-7 fast path still misfires when
``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g.
``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or
``/shared/cache/<repo>/.code-review-graph``. In that case
``db_path.parent.name == ".code-review-graph"`` is True, the fast
path returns ``parent.parent`` (``/var/tmp`` or the shared-cache
parent), and ``_FileLineCache`` silently falls back to metadata
for every node. My round-7 tests only covered an external dir
NOT named ``.code-review-graph`` so the regression slipped.

What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When
that env var is set we always defer to
``incremental.find_project_root`` regardless of what ``db_path``
looks like; the zero-config fast path only runs when the user is
on the default ``<repo>/.code-review-graph/graph.db`` layout and
has NOT overridden the data dir.

Regression test: ``test_crg_data_dir_named_dot_code_review_graph``
exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at
a ``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT``
pointed at a distant ``real-repo`` tree — and asserts the root
resolves to ``real-repo``, not the cache's parent.
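The resolution order after the round-8 fix can be sketched as below. The `find_project_root` shown is a stand-in for `incremental.find_project_root` (CRG_REPO_ROOT override, then a walk up to a `.git` marker); treat the whole block as an illustration, not the shipped code:

```python
import os
from pathlib import Path
from typing import Optional

def find_project_root() -> Optional[Path]:
    """Stand-in: honor CRG_REPO_ROOT, else walk up from cwd for a .git marker."""
    override = os.environ.get("CRG_REPO_ROOT")
    if override:
        return Path(override)
    for candidate in [Path.cwd(), *Path.cwd().parents]:
        if (candidate / ".git").exists():
            return candidate
    return None

def repo_root_guess(db_path: Path) -> Optional[Path]:
    if os.environ.get("CRG_DATA_DIR"):
        return find_project_root()    # external data dir: never trust db_path shape
    if db_path.parent.name == ".code-review-graph":
        return db_path.parent.parent  # zero-config fast path: <repo>/.code-review-graph/
    return find_project_root()
```

Checking CRG_DATA_DIR first is exactly what closes the round-8 hole: a cache dir that happens to be *named* `.code-review-graph` can no longer hijack the fast path.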

Why: cross-repo benchmark showed body enrichment regresses on
framework-heavy codebases (humanitzbot — TypeScript Discord bot —
MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found
that signature dedup + block-comment state machine are all working,
but TS/Discord function bodies are thick with framework boilerplate
that dilutes the retrieval signal. Examples from one regressing
function:

    await interaction.deferReply();
    const locale = interaction.locale;
    logger.info('handling power request', { signal });
    console.log('...');

On small models (MiniLM: 256-token window), those lines fill the
budget with `await`, `deferReply`, `locale`, `logger`, `console`
— high-frequency framework tokens that don't discriminate between
similarly-named commands (``_power`` vs ``_status`` vs ``_console``).

What: add a language-agnostic check to the existing body-extraction
state machine that skips lines which ARE purely a log/print call
with no other executable content. Covers:

  * Python:  ``print(...)``, ``logger.info(...)``, ``logging.debug(...)``
  * JS/TS:   ``console.log(...)``, ``logger.warn(...)``
  * Java:    ``System.out.println(...)``, ``log.debug(...)``
  * Go:      ``fmt.Println(...)``, ``log.Printf(...)``
  * Rust:    ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)``

Lines where a log call is mixed with substantive code
(``if err: logger.error(err); raise``) are preserved — they DO
carry retrieval signal.

Scope: one regex + one helper + one extra branch in the existing
state machine loop. Stays within ADR-A1 "language-agnostic body
extraction". Does NOT introduce framework-specific heuristics —
log/print is a universal pattern, not a framework one. Also does
NOT make the body hash non-deterministic (same input still produces
same output), so the combined-hash filter short-circuit (AC-13)
still holds.

Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust
log patterns, an integration test verifying scattered log lines
get stripped from a multi-statement TS body while real logic
survives, plus negative tests for mixed-statement lines and
non-log calls (``await interaction.deferReply()`` is NOT a log
call and stays kept).

Follow-up: framework-specific SDK boilerplate (Discord
``interaction.X``, React hooks, Spring ``@Autowired``) is where
the next round of improvement lives, but that can't be done
generically without either a tokenizer-based adaptive budget or
per-language SDK awareness — both are out of scope for this PR.

Round 10 flagged two real over-matches in 7daf93c:

1. The generic ``identifier.(error|warn|info|...)`` branch swallowed
   non-logger APIs. In THIS repo it dropped
   ``serve_cmd.error(...)`` (argparse's error method, which raises
   rather than logs). Same class of issue for
   ``fmt.Fprintf(w, "ok")`` where ``w`` is an HTTP response writer
   — that IS the function's output, not a log.

2. ``re.match`` is prefix-anchored, so
   ``logger.info('x') or do_work()`` passed the log-only check even
   though it has substantive trailing code. Violated the "mixed-
   statement lines stay kept" design promise.

Fixes:

* Receiver identifier is now an explicit whitelist — ``log``,
  ``logger``, ``logging``, ``logs``, ``console``, ``slog``,
  ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing),
  ``self.log``/``this.log``/``_log``/``_logger`` and the
  ``self.logger``/``this.logger`` variants. Custom loggers named
  differently fall through — they'll be kept as real code. Better
  to under-skip than mis-skip.
* Dropped ``fmt.Fprint*`` from the Go branch — those take an
  arbitrary ``io.Writer`` (commonly HTTP responses). Only
  ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like).
* ``_is_log_only_line`` now walks the matched string after the
  opening ``(``, tracking string literals and paren depth, to find
  the balanced close. Anything non-trivial after that ``)`` means
  the line has other executable content and is NOT log-only.

Regression tests: tail-continuation cases
(``logger.info('x') or do_work()``), non-logger identifier cases
(``serve_cmd.error``, ``db.warn``, ``item.notice``), and
``fmt.Fprintf`` cases now stay kept. Pure-log cases including
nested-paren args (``logger.info('val=' + str(compute(x, y)))``)
still get dropped.
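The whitelist-plus-balanced-close approach from the round-10 fixes can be sketched as follows. The receiver set and method list here are abridged subsets of the PR's actual coverage, and the helper name is illustrative:

```python
import re

# Abridged whitelist; the real one also covers slog, trace, self.log, this.logger, ...
_LOG_RECEIVERS = {"log", "logger", "logging", "logs", "console"}
_LOG_CALL_RE = re.compile(
    r"^\s*(?:await\s+)?([A-Za-z_]\w*)\.(?:debug|info|log|notice|warn|warning|error)\s*\("
)

def is_log_only_line(line: str) -> bool:
    m = _LOG_CALL_RE.match(line)
    if not m or m.group(1).lower() not in _LOG_RECEIVERS:
        return False  # unknown receiver (e.g. serve_cmd.error): keep the line
    depth, i, quote = 1, m.end(), None
    while i < len(line) and depth:
        ch = line[i]
        if quote:                     # inside a string literal
            if ch == "\\":
                i += 1                # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        i += 1
    # anything substantive after the balanced close means a mixed statement
    return depth == 0 and line[i:].strip() in ("", ";")
```

The tail check is what rescues `logger.info('x') or do_work()`: the paren walk finds the close of the log call, sees `or do_work()` trailing it, and keeps the line.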

Why: body enrichment already shows small positive Δ on PHP (Laravel
RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum
with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only
``.`` -style method access, so PHP-world logging idioms
(``$logger->info``, Laravel ``Log::info`` facade, bare
``error_log``/``syslog``/``trigger_error``) silently stayed in the
body text. Adding them is pure upside for log-heavy PHP / Laravel
code paths, no test-matrix cost anywhere else.

What this commit does:
  * PHP instance logger:  ``$logger->info(...)`` /
    ``$this->logger->debug(...)`` — covers Monolog convention.
  * PHP static facade:    ``Log::info(...)`` / ``\Log::error(...)`` —
    Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for
    fully-qualified imports).
  * PHP bare calls:       ``error_log(...)`` / ``syslog(...)`` /
    ``trigger_error(...)`` — stdlib helpers.
  * PSR-3 log levels:     debug / info / notice / warning / error /
    critical / alert / emergency, plus the common ``log``/``warn``
    aliases already used by the rest of the regex.

What this commit deliberately does NOT do: extend
``_looks_like_signature`` to PHP. I tested that locally — Laravel
RESTful controllers have very short templated bodies
(``$data = $request->validated(); Model::create($data);``) and the
signature line carries type information (``StoreRequest``,
``JsonResponse``, etc.) that the body never repeats. Stripping the
signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and
from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate
wastes budget" heuristic breaks on convention-over-configuration
PHP codebases, so PHP stays in the conservative fallback — signature
line is kept as part of the body snippet. This is documented in the
``_looks_like_signature`` docstring and locked in by
``test_signature_dedup_php_conservative``.

Tests: 10 new PHP log-pattern assertions (instance / static / bare /
negative non-logger calls like ``$user->save()`` / ``User::create()``
/ ``$response->setStatusCode()`` / ``$db->query()``) plus the
conservative-fallback regression test noted above.

Why: my first-try rollout kept PHP in the conservative fallback
based on a MiniLM-only bench that showed regressions on two
Laravel repos. Fresh A/B covering MiniLM + OpenAI 3-large +
Gemini-2-preview + Qwen3-8B across the same two repos tells a
different story — PHP dedup is net +0.110 MRR@3 across the 8
(repo × provider) cells, with the wins concentrated on the top-
tier cloud models where users most expect the quality gain:

  usdt-center  OpenAI 3-large    off Δ=-0.111  on Δ= 0.000  ↑ +0.111
  usdt-center  Gemini-2-preview  off Δ=+0.056  on Δ=+0.139  ↑ +0.083
  six-forum    Gemini-2-preview  off Δ=-0.083  on Δ=-0.033  ↑ +0.050
  six-forum    OpenAI 3-large    off Δ=+0.117  on Δ=+0.117  = 0
  six-forum    Qwen3-8B          off Δ=+0.117  on Δ=+0.117  = 0
  usdt-center  Qwen3-8B          off Δ=-0.111  on Δ=-0.167  ↓ -0.056 *
  usdt-center  MiniLM            off Δ=+0.028  on Δ= 0.000  ↓ -0.028
  six-forum    MiniLM            off Δ=+0.067  on Δ=+0.017  ↓ -0.050

  * Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling),
    any body token that isn't a retrieval signal is noise; this is
    a ceiling-effect artefact, not a regression from dedup.

What: adds PHP back to ``_looks_like_signature`` with the same
three-AND gate every other language uses — name + declaration
keyword + param first-token match. Covers class methods (all
visibility / static / abstract / final prefixes precede
``function``), free functions, classes, interfaces, traits, and
enums. Log-skip patterns from the previous commit stay unchanged.

Mini-LM-on-Laravel regression is documented in the module
docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0``
without touching any other part of the pipeline. The sticky-flag
opt-in is exactly the shape this kind of per-model tradeoff
needs, and why we built it.

Regression test in test_signature_dedup_php now asserts dedup
fires for standard PHP signatures (method / class / interface /
trait) and still returns False for body-only lines and for
overload-style param mismatches, so a future change that breaks
one of those categories gets caught.

Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:

  enum OrderStatus: string { ...     # declaration → dedup
  case PENDING = 'pending';          # enum value  → keep as body

This locks in the behavior Codex spot-checked manually: ``case``
is deliberately NOT in the dedup keyword allowlist, so enum
values stay in the body snippet where they carry real retrieval
signal.

Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
…anguages

Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.

What: adds two rows to the CLI reference ``<details>`` block in every
translated README (EN / zh-CN / ja-JP / ko-KR / hi-IN):

  code-review-graph embed            # fresh repo — body enrichment auto-on
  code-review-graph embed --include-body --confirm-reembed
                                     # existing DB — opt in to body

Native-phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.

Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.

Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026

Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
…lpers

Why: body-embedding (PR tirth8205#345) shipped a `code-review-graph embed` CLI
but `install` doesn't wire up any hook to keep embeddings fresh after
edits. Users who want auto-refresh have to hand-craft a PostToolUse
entry in settings.json — easy to get wrong (wrong schema field, no
stdio redirect, no cost-gate). This commit lands the building blocks
so the next commit can expose a single `--auto-embed-hook` flag.

What:
- New `generate_auto_embed_hook_entry(repo_root)` in skills.py. Returns
  a single PostToolUse entry using the canonical Claude Code Jan-2026
  schema (`async: true`, NOT the invalid `run_in_background`). Matcher
  covers `Edit|Write|MultiEdit` (Bash is intentionally left to the
  standalone update hook since Bash tools don't produce code edits).
- Chains `update --skip-flows && embed` so graph.db is fresh when
  embed reads it; the second update call is a hash-dedup no-op. This
  eliminates a cross-process race that the separate update hook and
  this new embed hook would otherwise have on graph.db.
- `CRG_EMBED_INCLUDE_BODY=0` is prefixed to each CLI call so the hook
  never auto-triggers PR tirth8205#345's body-mode cost bomb — users who want
  body-mode must run `embed --include-body --confirm-reembed` manually
  and accept the sticky-flag flip-flop (documented in README).
- Full stdout+stderr redirection (`>/dev/null 2>&1` + `|| true`) keeps
  the MCP JSON-RPC stream uncorrupted even when embed prints progress
  or exits non-zero (e.g. missing provider).
- `install_hooks(..., include_auto_embed=False)` adds the kwarg with
  a False default — existing callers see JSON-struct-identical output.
- New `remove_auto_embed_hook_entry(...)` with defensive handling for
  malformed JSON / non-dict root / non-dict hooks / non-list PostToolUse
  (all return False without writing, no .bak on no-match — protects
  pristine baselines).

Tests (36 new, all passing):
- TestGenerateAutoEmbedHookEntry (13 cases): schema conformance,
  chaining order, stdio redirect count, env-var placement via regex,
  path-quoting, matcher content.
- TestInstallHooksAutoEmbed (4 cases): entry dedup, JSON-struct deep
  equality vs pre-change behaviour, qoder platform parity.
- TestRemoveAutoEmbedHook (9 cases): exact-dict removal, no-op-no-bak,
  user-custom preservation, 4 malformed-JSON defensive tests.

Plan: .omc/plans/feat-install-auto-embed-hook.md (local, not committed).
Stacked on PR tirth8205#345 (feat/body-embedding). Depends on `embed` CLI
introduced there; if tirth8205#345 is rebased/squashed, this branch needs
refresh.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
Why: the hook-generation machinery from the previous commit is only
useful if users can opt in from the CLI. This wires the flag through
`install` and its `init` alias, surfaces pre-flight warnings for the
two most common silent-failure modes (missing sentence-transformers,
unsupported platform), and adds integration tests that exercise the
full argparse → subprocess → settings.json path.

What:
- `install` and `init` both gain a mutually-exclusive flag group:
  `--auto-embed-hook` installs the entry; `--no-auto-embed-hook`
  removes an exact-match entry without touching anything else in
  settings.json. Default is OFF (opt-in) — passing neither flag
  reproduces the pre-change behaviour byte-for-byte on the wire.
- `_handle_init` parses the flags into `want_auto_embed` /
  `want_remove_auto_embed`, emits a stderr warning when the flag is
  passed with a non-claude/qoder `--platform` (the hook block already
  skips those platforms, so the warning is the only user-facing
  signal), and a second stderr warning when the default local
  provider's `sentence_transformers` extra is missing — both warnings
  are non-blocking per the "never block the agent" principle.
- Removal path explicitly does NOT call install_hooks to avoid
  re-merging the default update/SessionStart entries on top of any
  user customisation that happened between install and removal.
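The mutually-exclusive flag group can be sketched with plain argparse; subparser and dest names here are assumptions for illustration, not the actual cli.py wiring:

```python
import argparse

parser = argparse.ArgumentParser(prog="code-review-graph")
sub = parser.add_subparsers(dest="command")
for name in ("install", "init"):  # init is an alias of install
    p = sub.add_parser(name)
    group = p.add_mutually_exclusive_group()  # passing both exits 2
    group.add_argument("--auto-embed-hook", action="store_true",
                       help="install the PostToolUse auto-embed entry")
    group.add_argument("--no-auto-embed-hook", action="store_true",
                       help="remove an exact-match entry, touch nothing else")

args = parser.parse_args(["install", "--auto-embed-hook"])
want_auto_embed = args.auto_embed_hook
want_remove_auto_embed = args.no_auto_embed_hook
```

With neither flag passed, both booleans default to False, which preserves the opt-in default.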

Tests:
- 9 new test_cli.py tests covering: flag acceptance on both
  subcommands, mutex-violation argparse exit 2, dry-run writes
  nothing, sentence-transformers find_spec monkeypatch warning,
  cursor-platform stderr warning, help-text contains both flags.
- New tests/test_integration_install_auto_embed.py with two
  subprocess-driven round-trips — install-and-verify-schema plus
  install-then-remove — that exercise the full package entry point
  via `python -m code_review_graph`. Replaces the v1 plan's fragile
  jq-based smoke script with a CI-gated Python test. Asserts
  `async: true`, double stdout/stderr redirect, `|| true` tail,
  `CRG_EMBED_INCLUDE_BODY=0` presence, and static (not
  $CLAUDE_PROJECT_DIR) repo path.

All 113 tests across the three touched files pass. Pre-existing 60
failures on test_parser / test_refactor / test_main (`--tools` flag)
are documented in PR tirth8205#345 as unrelated upstream issues.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
Why: Codex review flagged four issues (1 BLOCKING, 2 MAJOR, 1 MINOR) after
inspecting the three atomic commits above. This commit lands all four
fixes plus regression tests so the branch reaches NO SUBSTANTIVE ISSUES
before pushing to upstream.

BLOCKING — shell injection via repo path (skills.py:~588):
- `generate_auto_embed_hook_entry` was using `json.dumps(...)` to quote
  the repo path, but json-escaping is NOT sufficient for bash double
  quotes: `/tmp/repo$(touch /tmp/pwned)` would still trigger command
  substitution when the hook fires. Switched to `shlex.quote(...)`
  which wraps any path with shell metacharacters in single quotes
  that bash does not expand. Added two regression tests asserting
  that paths containing `$()` and backticks land inside single-quoted
  tokens, not double-quoted ones. Updated the existing path-with-spaces
  test to expect the new shlex quoting shape.
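The difference is easy to demonstrate: `json.dumps` produces a double-quoted string, and bash still performs command substitution inside double quotes, while `shlex.quote` falls back to single quotes, inside which bash expands nothing.

```python
import json
import shlex

malicious = "/tmp/repo$(touch /tmp/pwned)"

# json.dumps wraps in double quotes — bash would still expand $() inside.
json_quoted = json.dumps(malicious)

# shlex.quote wraps in single quotes — bash performs no expansion inside.
shell_quoted = shlex.quote(malicious)
```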

MAJOR — `--no-auto-embed-hook` had full-install side effects
(cli.py _handle_init):
- The flag was documented as "idempotent removal that preserves user
  config", but the original wiring still ran `install_platform_configs`,
  `.gitignore` update, `generate_skills`, `inject_claude_md`,
  `install_git_hook`, etc. Short-circuited `_handle_init` at the top:
  when `want_remove_auto_embed` is True we now do ONLY the removal
  and return, with a `--dry-run` preview path and clean no-op messages
  for `--no-hooks` and non-claude/qoder platforms. New test
  `test_no_auto_embed_hook_short_circuits_install_flow` asserts no
  .mcp.json / .gitignore / skills dir / CLAUDE.md / pre-commit hook
  is created when only `--no-auto-embed-hook` is passed.

MAJOR — malformed / JSONC settings.json was silently overwritten
(skills.py install_hooks):
- Backup `shutil.copy2` lived INSIDE the `try: json.loads(...)` block,
  so a parse failure skipped the backup and the subsequent write
  destroyed the original file. Reordered so the backup always runs
  first, then parse; on `JSONDecodeError` we now raise `RuntimeError`
  with a message pointing the user at the fresh `.bak` and asking
  them to fix or remove the comment manually, instead of silently
  resetting to `{}` and writing a blank shape. New test
  `test_install_aborts_on_malformed_json_with_backup` covers the
  JSONC scenario Codex specifically called out.
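The reordering amounts to hoisting the backup out of the parse try-block. A minimal sketch (helper name and message wording assumed, not the actual install_hooks code):

```python
import json
import shutil
from pathlib import Path


def load_settings_with_backup(path: Path) -> dict:
    """Back up first, parse second, so a parse failure never loses data."""
    backup = path.parent / (path.name + ".bak")
    shutil.copy2(path, backup)  # always runs, even if json.loads fails below
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        # Abort instead of silently resetting to {} and overwriting the file.
        raise RuntimeError(
            f"settings.json is not valid JSON (JSONC comments?). "
            f"Original preserved at {backup}; fix or remove the offending "
            f"content manually."
        ) from exc
```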

MINOR — warnings fired in dry-run / no-hooks modes (cli.py):
- The `sentence-transformers is missing` and non-claude/qoder
  platform warnings were printed even when no hook would be installed
  (`--dry-run` or `--no-hooks`). Gated both warnings behind
  `want_auto_embed and not skip_hooks and not dry_run` so the CLI
  stays quiet in the modes where the warning isn't actionable. Two
  new tests (`test_extras_warning_gated_by_skip_hooks`,
  `test_extras_warning_gated_by_dry_run`) keep this regression-safe.
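The gate itself is a single compound condition; a hedged sketch (function and parameter names assumed for illustration):

```python
import importlib.util
import sys


def maybe_warn_missing_extras(want_auto_embed: bool,
                              skip_hooks: bool,
                              dry_run: bool) -> bool:
    """Warn only when a hook would actually be installed."""
    if not (want_auto_embed and not skip_hooks and not dry_run):
        return False  # --dry-run / --no-hooks: warning isn't actionable
    if importlib.util.find_spec("sentence_transformers") is None:
        print("warning: sentence-transformers extra is missing; the default "
              "local provider will fail at embed time", file=sys.stderr)
        return True
    return False
```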

Tests: 120 passing (was 113). No new lint / mypy / bandit findings
on touched files beyond the pre-existing ones documented in PR tirth8205#345.