feat(embeddings): include function body in semantic search text #345

Open
Minidoracat wants to merge 9 commits into tirth8205:main from Minidoracat:feat/body-embedding

Conversation

Minidoracat (Contributor) commented Apr 19, 2026

Summary

_node_to_text() now appends a truncated function body snippet next to the metadata we already embed, so semantic search can rank by what a function does rather than just its name. All four providers (Local / Google / MiniMax / OpenAI-compatible) benefit transparently; no schema migration, no change to EmbeddingStore.__init__ signature, no change to parser.py / migrations.py / search.py / tools.py.

The motivation is the note already in the README: signature-only embeddings (~10 tokens per node) flatten the quality gap between small models and long-context SOTA embeddings (Gemini 2 preview, Qwen3-8B). This PR lifts the input budget closer to what those models were trained to consume.

Design decisions (ADR summary)

  • Body snippet extraction — language-agnostic state machine. Drops the signature line (dedup with metadata), strips leading docstrings / header comments ("""...""", /** */, # ..., ///, *), and collapses blank runs before truncating. Three-AND signature-dedup gate (name + language decl keyword + param first-token) for Python / JS / TS / JSX / TSX / Java / Kotlin / Rust / Go; other languages fall through as "don't dedup" so the body line just eats a bit of its own budget.
  • Pure log / print lines dropped — logger.info(...) / console.log(...) / fmt.Println(...) / println!(...) lines that carry no retrieval signal are skipped everywhere in the body, language-agnostically. Tightened in a follow-up iteration so we don't misfire on serve_cmd.error(...) (argparse) or fmt.Fprintf(w, ...) (arbitrary writer): the receiver identifier is a fixed whitelist, and the helper walks to the balanced close of the log call to ensure nothing substantive trails it.
  • Per-provider char budget (ADR-B4) — local=700 / google=3000 / minimax=3000 / openai=6000, CRG_EMBED_BODY_MAX_CHARS env var overrides. MiniLM's 256-token window just fits; long-context models get enough room to actually use their capacity.
  • Sticky-flag opt-in (ADR-C3) — new embeddings_meta table on embeddings.db persists embed_body_enabled per-DB. Fresh repos auto-enable; existing DBs keep legacy signature-only behaviour until the user runs code-review-graph embed --include-body --confirm-reembed. Avoids one-time re-embed API bill surprise on upgrade; keeps mixed-state rows impossible across incremental runs.
  • Two-stage IO-bounded filter — stored text_hash = meta_hash:body_hash:file_hash. Stage 1 computes metadata-hash only (zero IO) and skips when metadata + file_hash (from GraphNode.file_hash) are unchanged. Stage 2 reads the file only when stage 1 can't prove freshness. Empty extracted bodies get sha256("") as a sentinel so they short-circuit too; read failures deliberately don't persist a sentinel so transient errors stay retryable.
  • Backward compat — legacy rows (pre-PR pure sha256 hex in text_hash, no :) parsed correctly by _split_combined_hash and trigger a single safe re-embed on first contact. CRG_DATA_DIR / CRG_REPO_ROOT env overrides fully respected (path containment check uses incremental.find_project_root).
  • Security — _FileLineCache resolves paths and enforces relative_to(repo_root.resolve()) so a poisoned nodes.file_path can't read /etc/passwd or traverse out of the repo into cloud embedding calls.
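The ADR-B4 budget lookup is simple enough to sketch. The numbers below come from this description; the helper name and the local-budget fallback for unknown providers are illustrative, not the PR's actual code:

```python
import os

# Per-provider body-snippet budgets from ADR-B4 (chars, not tokens).
_BODY_MAX_CHARS_BY_PROVIDER = {"local": 700, "google": 3000, "minimax": 3000, "openai": 6000}

def resolve_body_budget(provider: str) -> int:
    """Env override wins; unknown providers conservatively get the local budget."""
    override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS")
    if override is not None:
        return int(override)  # CRG_EMBED_BODY_MAX_CHARS covers edge cases
    return _BODY_MAX_CHARS_BY_PROVIDER.get(provider, 700)
```

Falling back to the smallest budget for unrecognized providers is the conservative choice sketched here; the real table may resolve differently.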

Benchmark evidence

Ran MRR@3 on two codebases with multiple providers, EN + ZH queries, legacy metadata-only vs body-enriched. Every provider's full table is below; the short version is: SOTA models consistently gain, small models are repo-dependent.

code_review_graph itself — 50 golden queries, 537-node corpus

| Provider | EN Δ | ZH Δ | Mean Δ |
|---|---|---|---|
| google/gemini-embedding-2-preview | +0.030 | +0.087 | +0.058 |
| qwen/qwen3-embedding-8b | +0.077 | (err) | +0.077 (EN) |
| openai/text-embedding-3-small | +0.120 | +0.167 | +0.143 |
| openai/text-embedding-3-large | +0.143 | +0.187 | +0.165 |
| baai/bge-m3 | +0.197 | +0.230 | +0.213 |
| local:all-MiniLM-L6-v2 | +0.083 | +0.173 | +0.128 |
| local new-api text-embedding-v4 (Qwen-v4) | +0.110 | +0.107 | +0.108 |

All seven providers positive. BGE-m3 went from MRR 0.60 to 0.81 — a mid-tier model landing in top-tier territory.

Second benchmark — an external TypeScript Discord-bot codebase, 12 golden queries, 1667-node corpus

| Provider | EN Δ | ZH Δ | Mean Δ |
|---|---|---|---|
| qwen/qwen3-embedding-8b | +0.208 | +0.069 | +0.139 |
| openai/text-embedding-3-large | +0.069 | +0.153 | +0.111 |
| google/gemini-embedding-2-preview | -0.014 | +0.111 | +0.049 |
| local new-api Qwen-v4 | -0.014 | +0.056 | +0.021 |
| baai/bge-m3 | -0.042 | (err) | -0.042 |
| local:all-MiniLM-L6-v2 | -0.181 | -0.014 | -0.097 |

Top-tier (Qwen3-8B / OpenAI-3-large / Gemini-2) stay positive on a framework-heavy TS codebase. MiniLM drops -0.18 on English — its 256-token window is too small to absorb Discord SDK boilerplate (interaction.deferReply(), t('commands:...', locale)) without crowding out real logic tokens. The sticky-flag opt-in plus CRG_EMBED_INCLUDE_BODY=0 escape hatch is the right shape for that failure mode: a user who sees regression on their specific repo can revert in a single env var without losing their existing embeddings.

Third / fourth benchmarks — two internal Laravel / PHP codebases, SOTA A/B

Ran a four-provider A/B (MiniLM + OpenAI 3-large + Gemini-2-preview + Qwen3-8B) against two real Laravel apps (281-node ledger service and 1887-node API forum backend, with 6 and 10 hand-graded queries respectively), comparing PHP signature dedup OFF vs ON:

| Provider | usdt-style ledger Δ-change | forum-API Δ-change |
|---|---|---|
| Gemini-2-preview | +0.083 | +0.050 |
| OpenAI 3-large | +0.111 | 0 |
| Qwen3-8B | -0.056 (ceiling artefact) | 0 |
| MiniLM | -0.028 | -0.050 |

Net across 8 (repo × provider) cells: +0.110 MRR@3 — gains concentrate on the top-tier cloud models where users most expect the quality upside; MiniLM regresses by -0.03 to -0.05, and those users can revert via CRG_EMBED_INCLUDE_BODY=0. An earlier MiniLM-only bench had me keep PHP in the conservative fallback; the SOTA A/B overturned that, and the _looks_like_signature docstring records the reversal so future contributors don't undo it again. PSR-3 / Monolog / Laravel-facade log-only patterns ($logger->info, Log::info, error_log) are unconditionally included in the language-agnostic log-skip regex.

Cross-repo ranking (body-on MRR, two-repo avg)

  1. qwen/qwen3-embedding-8b — 0.759 (+0.108 Δ avg)
  2. google/gemini-embedding-2-preview — 0.756 (+0.054 Δ avg, consistent positive)
  3. openai/text-embedding-3-large — 0.735 (+0.138 Δ avg, biggest consistent PR value-add)
  4. Qwen-v4 — 0.642
  5. BGE-m3 — 0.594 (+0.086 but high variance)
  6. MiniLM — 0.423 (+0.016 on average but -0.097 on the TS codebase)

Test plan

  • 146 new / adjusted unit tests covering body extraction pipeline, signature dedup (Py / JS / TS / Java / Kotlin / Rust / Go + decorator + recursive-call negatives), block-comment + triple-quote state machine (single / multi-line / unclosed / JSDoc / inline), log-only line skipping (7 languages + mixed-statement / non-logger identifier / tail-continuation negatives), per-provider char budget + env override, sticky-flag cross-run behaviour including CLI flip, filter short-circuit with and without populated file_hash, read-failure must-retry regression, legacy hash backward compat, CRG_DATA_DIR / CRG_REPO_ROOT external cache dir regression.
  • MRR regression CI gate (tests/test_embeddings_mrr.py) — Local MiniLM against a bundled 10-query golden set; enforces MRR(body-on) >= MRR(legacy) - 0.02 AND > MRR(legacy) (non-inferiority plus strict improvement). pytest.importorskip("sentence_transformers") so default .[dev] CI stays green; installing .[embeddings] activates the gate.
  • Real-endpoint smoke tests manually verified on a local new-api gateway (text-embedding-v4) + OpenRouter across eight embedding models.
  • 12 rounds of Codex review converged to NO SUBSTANTIVE ISSUES. All findings (combined hash with file_hash, path traversal containment, Kotlin fun without return type, empty-body sentinel, transient-read-failure retry, CRG_DATA_DIR fast-path misfire, log-skip over-match, _is_log_only_line tail-continuation) are fixed with regression tests.

Full suite locally: 158 embedding-related tests pass on top of the existing 1020 passing tests. Repo-wide failures I saw (60) sit in test_multilang / test_parser / test_main around Julia / GDScript / ReScript parsing and the new --tools flag — they reproduce on upstream/main without this branch, so they're unrelated.

Caveats

  • Body enrichment is not uniformly beneficial across all (model, codebase) combinations. SOTA models (Qwen3-8B / Gemini-2 / OpenAI-3-large) gain consistently on every repo I tested; smaller or framework-sensitive setups can regress on heavily-boilerplate codebases. CRG_EMBED_INCLUDE_BODY=0 reverts to metadata-only without touching existing embeddings.
  • The built-in CI gate only runs Local MiniLM — it's a non-inferiority guardrail, not a claim that every user will see a win. The cross-provider comparison above is a better guide for which model to pick, and follow-up work can layer an adaptive-budget mechanism once tokenizer APIs stabilise across providers.
  • google/gemini-embedding-001 is listed by OpenRouter but its upstream Google AI Studio provider returned no-response for our API key during testing. Used google/gemini-embedding-2-preview instead, which is the newest working Google embedding on OpenRouter today.

Follow-up work (tracked separately, NOT in this PR)

Scope-disciplined items intentionally left out of this PR so its
review surface stays bounded. Planned as atomic, independently-mergeable
PRs (ordered by dependency, not priority):

PR B — install --auto-embed-hook (in progress on my fork)

install currently writes MCP config + graph update/watch hooks but
nothing refreshes embeddings after edits. A new --auto-embed-hook
flag drops a small PostToolUse(Edit|Write|MultiEdit) hook that
runs code-review-graph embed --repo "$CLAUDE_PROJECT_DIR" in the
background (via Claude Code hook async: true + stdout/stderr
redirect). The two-stage filter + file_hash check this PR adds
means only 1-3 changed nodes actually hit the provider per save,
keeping API cost negligible — but the user must opt in because
cloud providers DO bill per token.

Handling this in install (not here) is the right place because
(a) install already owns hook writing, (b) the body-embedding
code in this PR stays purely library-level, (c) keeps this PR's
review surface stable. Stacks on this PR's HEAD on my fork.

PR A — code-review-graph config subcommand + .env auto-load

Provider credentials today must be in shell env (for CLI) OR the
MCP client's .mcp.json env block (for MCP server). There's no
shared config layer, so CLI users end up juggling export commands
or direnv. A config subcommand would fix that:

code-review-graph config set provider.openai.api_key <key>
code-review-graph config set --project provider.openai.model <id>
code-review-graph config show --source   # which layer each value came from

Three-layer resolution (env > project .code-review-graph/config.toml >
global ~/.config/code-review-graph/config.toml > defaults), TOML
format, backwards-compatible (existing CRG_OPENAI_* env vars keep
working, just as one layer in the stack).
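The three-layer lookup described above can be sketched as follows. This is a hedged illustration of the proposed resolution order only; the CRG_* env-var naming and the flat dotted-key config shape are assumptions, not the final design:

```python
import os
from typing import Any, Optional

def resolve_setting(key: str,
                    project_cfg: dict,
                    global_cfg: dict,
                    default: Optional[Any] = None) -> Optional[Any]:
    """env > project config > global config > default, first hit wins."""
    env_key = "CRG_" + key.upper().replace(".", "_")  # illustrative mapping
    if env_key in os.environ:
        return os.environ[env_key]
    for layer in (project_cfg, global_cfg):
        if key in layer:
            return layer[key]
    return default
```

With this shape, existing CRG_OPENAI_* exports keep working simply by being the highest-priority layer.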

Orthogonal to PR B (the hook just shells out to code-review-graph embed; it doesn't care where env comes from), but planned after
so the feature lands on stable ground.

PR D — Adaptive body-char budget

Per-model capacity replaces the static _BODY_MAX_CHARS_BY_PROVIDER
table once common embedding providers expose stable tokenizer APIs.
Until then the env override (CRG_EMBED_BODY_MAX_CHARS) covers edge
cases.

PR E — Framework-SDK-pattern skip (beyond log lines)

The log-only skip this PR adds helps log-heavy codebases. Discord
SDK / Rails / Flask / Spring / React framework boilerplate is a
separate noise class that would need either provider-adaptive
budget (PR D) or per-framework heuristics (less maintainable).
Left as future work; CRG_EMBED_INCLUDE_BODY=0 is the current
escape hatch for small models on framework-heavy repos (see
TypeScript Discord-bot benchmark above).

Why: signature-only embeddings carry ~10 tokens per node (name + kind +
params + return type + language). That ceiling wastes the long-context
capacity of modern embedding models — Gemini 2, Qwen3-8B and similar
rank SOTA on MTEB-code largely on their ability to understand the
*body* of a function, but we were handing them only its name. So the
quality advantage of large-context embeddings against local MiniLM was
getting flattened by our own input, not by model capability.

What: `_node_to_text()` now appends a truncated body snippet after the
signature metadata, with the signature line and leading docstrings /
header comments stripped so the budget goes to real logic instead of
documentation. Four design choices worth flagging:

  1. Per-provider char budget (ADR-B4): local=700 / google=3000 /
     minimax=3000 / openai=6000, env `CRG_EMBED_BODY_MAX_CHARS`
     overrides. This gives MiniLM just enough to not silently truncate
     and gives the long-context models room to actually use their
     capacity.

  2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on
     `embeddings.db` persists `embed_body_enabled` per-DB. Fresh repos
     auto-enable; existing DBs hold legacy behavior until the user
     runs `code-review-graph embed --include-body --confirm-reembed`.
     This avoids surprising cloud-provider users with a one-time
     re-embed API bill on upgrade, and the sticky flag keeps new nodes
     consistent with older ones across incremental runs (no mixed
     body-on / body-off state).

  3. Two-stage IO-bounded filter (ADR-filter-short-circuit): stored
     hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the
     metadata hash only (zero file IO) and skips when metadata +
     file_hash are unchanged. Stage 2 reads the file only when stage
     1 can't prove freshness, recomputes the body hash, persists the
     combined form. Empty extracted bodies get sha256("") as a
     sentinel so they also short-circuit next time; read failures
     deliberately do NOT persist a sentinel so transient errors stay
     retryable.

  4. Language-agnostic body extraction with block-comment + triple-
     quote state machine and conservative signature dedup via three-
     AND gate (node.name + declaration keyword + param first-token).
     Unknown languages fall back to "don't dedup" rather than risk
     dropping real body. Coverage today: Python / JS / TS / JSX / TSX /
     Java / Kotlin / Rust / Go; other languages keep the signature
     line and just eat a bit of their body budget.
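The three-AND gate in point 4 can be sketched like this. It is a minimal illustration under stated assumptions: the keyword table is abridged, and the helper name and signature are hypothetical, not the PR's `_looks_like_signature`:

```python
import re

# Abridged declaration-keyword table; the real coverage spans more languages.
_DECL_KEYWORDS = {
    "python": ("def",),
    "javascript": ("function",),
    "go": ("func",),
    "rust": ("fn",),
    "kotlin": ("fun",),
}

def looks_like_signature(line: str, name: str, language: str, first_param: str) -> bool:
    """Dedup only when ALL THREE match: node name, decl keyword, first-param token."""
    keywords = _DECL_KEYWORDS.get(language)
    if not keywords:
        return False  # unknown language: conservative fallback, never dedup
    has_keyword = any(re.search(rf"\b{kw}\b", line) for kw in keywords)
    has_param = not first_param or first_param in line
    return name in line and has_keyword and has_param
```

The conjunction is what keeps recursive calls like `return handle(req)` in the body: the name and param match, but the declaration keyword is absent.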

Backward compat: legacy rows (pre-existing sha256 hex in
`embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file
hash) are both parsed correctly by `_split_combined_hash` and trigger
a single safe re-embed on first contact.
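The backward-compat parsing reduces to a colon count. This is a hypothetical simplification of the `_split_combined_hash` behavior described above, including the empty-body sentinel:

```python
import hashlib

# Empty extracted bodies store sha256("") so they short-circuit next run.
EMPTY_BODY_SENTINEL = hashlib.sha256(b"").hexdigest()

def split_combined_hash(stored: str):
    """Legacy rows: bare sha256 hex (no ':'). Iter-2: meta:body. Current: meta:body:file."""
    parts = stored.split(":")
    if len(parts) == 1:
        return parts[0], None, None   # legacy -> missing parts force one safe re-embed
    if len(parts) == 2:
        return parts[0], parts[1], None
    return parts[0], parts[1], parts[2]
```

Any row with a `None` component fails the freshness check once and gets re-embedded into the current three-part form.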

Security: `_FileLineCache` resolves paths and enforces a `relative_to`
containment check against the repo root, so a poisoned
`nodes.file_path` can't read `/etc/passwd` or traverse out of the
repo into cloud embedding calls.
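The containment gate is a two-liner around `Path.relative_to`. A minimal sketch, assuming the same resolve-then-compare order described above (the helper name is illustrative):

```python
from pathlib import Path

def is_contained(candidate: str, repo_root: Path) -> bool:
    """Resolve symlinks and '..' on both sides, then require candidate under root."""
    try:
        Path(candidate).resolve().relative_to(repo_root.resolve())
    except ValueError:
        return False  # absolute escape or traversal: refuse to read
    return True
```

Resolving *before* the `relative_to` check is the important part: a literal prefix comparison would wave through `/repo/../etc/passwd`.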

Verification: regression MRR@3 on a 10-query golden set (bundled mini
repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent
queries like "escape HTML entities including quotes and backticks"
or "read GEMINI_API_KEY from the environment" move the correct node
into top-3 for the first time because the body now contains the
distinguishing tokens. Tested against real OpenAI-compatible endpoint
too (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded
on first run, 0 on a repeat with sticky flag honored — filter IO
short-circuit verified end-to-end.

Tests: 59 new cases (file-line cache LRU + containment, signature
dedup with decorator / recursive / params mismatch / unknown lang,
block-comment and triple-quote state machine with JSDoc / Javadoc /
Python single-line and multi-line docstrings / long comments,
per-provider budget resolution + env override, sticky flag across
runs including CLI flip, filter short-circuit with and without
populated file_hash, read-failure must-retry regression, legacy hash
backward compat). MRR regression test skipped gracefully when
sentence-transformers isn't installed (CI matrix with `.[dev]` only
stays green; CI jobs that install `.[embeddings]` activate the gate).

Scope: all four providers (Local / Google / MiniMax / OpenAI-
compatible) benefit transparently; no schema migration on graph.db,
no change to `EmbeddingStore.__init__` signature, no change to
parser.py / migrations.py / search.py / tools.py. Adds one new
subcommand `embed` and a private `embeddings_meta` table on
`embeddings.db`.

Why: Codex round 7 caught that the legacy inference
``db_path.parent -> repo root`` silently breaks for the supported
``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir,
not inside ``<repo>/.code-review-graph/``). In that case
``_FileLineCache`` received the cache dir as repo_root and every
body-snippet read failed the containment check, making body
enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two-
stage filter then wrote metadata-only hashes, so even opting back
in via ``embed --include-body --confirm-reembed`` wouldn't help.

What: ``_repo_root_guess`` now:
  1. keeps the fast path when db_path parent is literally
     ``.code-review-graph`` (zero-config common case, no cost)
  2. otherwise delegates to ``incremental.find_project_root`` which
     already honors ``CRG_REPO_ROOT`` and walks up for a ``.git``
     marker — the same resolution order the rest of the CLI uses
  3. returns None only when both fail, letting callers treat
     absolute file_path values as self-locating

Regression test: ``TestRepoRootGuessExternalDataDir`` covers the
legacy layout, the env-override fallback, and an end-to-end read
through ``_FileLineCache`` from an external cache dir. Existing
``TestFilterStageShortCircuit`` fixtures now write their DB under
``<tmp_path>/.code-review-graph/`` so they hit the fast path
rather than the cwd-relative fallback.

Why: Codex round 8 found the round-7 fast path still misfires when
``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g.
``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or
``/shared/cache/<repo>/.code-review-graph``. In that case
``db_path.parent.name == ".code-review-graph"`` is True, the fast
path returns ``parent.parent`` (``/var/tmp`` or the shared-cache
parent), and ``_FileLineCache`` silently falls back to metadata
for every node. My round-7 tests only covered an external dir
NOT named ``.code-review-graph`` so the regression slipped.

What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When
that env var is set we always defer to
``incremental.find_project_root`` regardless of what ``db_path``
looks like; the zero-config fast path only runs when the user is
on the default ``<repo>/.code-review-graph/graph.db`` layout and
has NOT overridden the data dir.

Regression test: ``test_crg_data_dir_named_dot_code_review_graph``
exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at
a ``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT``
pointed at a distant ``real-repo`` tree — and asserts the root
resolves to ``real-repo``, not the cache's parent.
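The resolution order after the round-8 fix can be sketched as below. The `find_project_root` shown is a stand-in for `incremental.find_project_root` (CRG_REPO_ROOT override, then a walk up to a `.git` marker); treat the whole block as an illustration, not the shipped code:

```python
import os
from pathlib import Path
from typing import Optional

def find_project_root() -> Optional[Path]:
    """Stand-in: honor CRG_REPO_ROOT, else walk up from cwd for a .git marker."""
    override = os.environ.get("CRG_REPO_ROOT")
    if override:
        return Path(override)
    for candidate in [Path.cwd(), *Path.cwd().parents]:
        if (candidate / ".git").exists():
            return candidate
    return None

def repo_root_guess(db_path: Path) -> Optional[Path]:
    if os.environ.get("CRG_DATA_DIR"):
        return find_project_root()    # external data dir: never trust db_path shape
    if db_path.parent.name == ".code-review-graph":
        return db_path.parent.parent  # zero-config fast path: <repo>/.code-review-graph/
    return find_project_root()
```

Checking CRG_DATA_DIR first is exactly what closes the round-8 hole: a cache dir that happens to be *named* `.code-review-graph` can no longer hijack the fast path.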

Why: cross-repo benchmark showed body enrichment regresses on
framework-heavy codebases (humanitzbot — TypeScript Discord bot —
MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found
that signature dedup + block-comment state machine are all working,
but TS/Discord function bodies are thick with framework boilerplate
that dilutes the retrieval signal. Examples from one regressing
function:

    await interaction.deferReply();
    const locale = interaction.locale;
    logger.info('handling power request', { signal });
    console.log('...');

On small models (MiniLM: 256-token window), those lines fill the
budget with `await`, `deferReply`, `locale`, `logger`, `console`
— high-frequency framework tokens that don't discriminate between
similarly-named commands (``_power`` vs ``_status`` vs ``_console``).

What: add a language-agnostic check to the existing body-extraction
state machine that skips lines which ARE purely a log/print call
with no other executable content. Covers:

  * Python:  ``print(...)``, ``logger.info(...)``, ``logging.debug(...)``
  * JS/TS:   ``console.log(...)``, ``logger.warn(...)``
  * Java:    ``System.out.println(...)``, ``log.debug(...)``
  * Go:      ``fmt.Println(...)``, ``log.Printf(...)``
  * Rust:    ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)``

Lines where a log call is mixed with substantive code
(``if err: logger.error(err); raise``) are preserved — they DO
carry retrieval signal.

Scope: one regex + one helper + one extra branch in the existing
state machine loop. Stays within ADR-A1 "language-agnostic body
extraction". Does NOT introduce framework-specific heuristics —
log/print is a universal pattern, not a framework one. Also does
NOT make the body hash non-deterministic (same input still produces
same output), so the combined-hash filter short-circuit (AC-13)
still holds.

Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust
log patterns, an integration test verifying scattered log lines
get stripped from a multi-statement TS body while real logic
survives, plus negative tests for mixed-statement lines and
non-log calls (``await interaction.deferReply()`` is NOT a log
call and stays kept).

Follow-up: framework-specific SDK boilerplate (Discord
``interaction.X``, React hooks, Spring ``@Autowired``) is where
the next round of improvement lives, but that can't be done
generically without either a tokenizer-based adaptive budget or
per-language SDK awareness — both are out of scope for this PR.

Round 10 flagged two real over-matches in 7daf93c:

1. The generic ``identifier.(error|warn|info|...)`` branch swallowed
   non-logger APIs. In THIS repo it dropped
   ``serve_cmd.error(...)`` (argparse's error method, which raises
   rather than logs). Same class of issue for
   ``fmt.Fprintf(w, "ok")`` where ``w`` is an HTTP response writer
   — that IS the function's output, not a log.

2. ``re.match`` is prefix-anchored, so
   ``logger.info('x') or do_work()`` passed the log-only check even
   though it has substantive trailing code. Violated the "mixed-
   statement lines stay kept" design promise.

Fixes:

* Receiver identifier is now an explicit whitelist — ``log``,
  ``logger``, ``logging``, ``logs``, ``console``, ``slog``,
  ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing),
  ``self.log``/``this.log``/``_log``/``_logger`` and the
  ``self.logger``/``this.logger`` variants. Custom loggers named
  differently fall through — they'll be kept as real code. Better
  to under-skip than mis-skip.
* Dropped ``fmt.Fprint*`` from the Go branch — those take an
  arbitrary ``io.Writer`` (commonly HTTP responses). Only
  ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like).
* ``_is_log_only_line`` now walks the matched string after the
  opening ``(``, tracking string literals and paren depth, to find
  the balanced close. Anything non-trivial after that ``)`` means
  the line has other executable content and is NOT log-only.

Regression tests: tail-continuation cases
(``logger.info('x') or do_work()``), non-logger identifier cases
(``serve_cmd.error``, ``db.warn``, ``item.notice``), and
``fmt.Fprintf`` cases now stay kept. Pure-log cases including
nested-paren args (``logger.info('val=' + str(compute(x, y)))``)
still get dropped.
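The whitelist-plus-balanced-close approach from the round-10 fixes can be sketched as follows. The receiver set and method list here are abridged subsets of the PR's actual coverage, and the helper name is illustrative:

```python
import re

# Abridged whitelist; the real one also covers slog, trace, self.log, this.logger, ...
_LOG_RECEIVERS = {"log", "logger", "logging", "logs", "console"}
_LOG_CALL_RE = re.compile(
    r"^\s*(?:await\s+)?([A-Za-z_]\w*)\.(?:debug|info|log|notice|warn|warning|error)\s*\("
)

def is_log_only_line(line: str) -> bool:
    m = _LOG_CALL_RE.match(line)
    if not m or m.group(1).lower() not in _LOG_RECEIVERS:
        return False  # unknown receiver (e.g. serve_cmd.error): keep the line
    depth, i, quote = 1, m.end(), None
    while i < len(line) and depth:
        ch = line[i]
        if quote:                     # inside a string literal
            if ch == "\\":
                i += 1                # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        i += 1
    # anything substantive after the balanced close means a mixed statement
    return depth == 0 and line[i:].strip() in ("", ";")
```

The tail check is what rescues `logger.info('x') or do_work()`: the paren walk finds the close of the log call, sees `or do_work()` trailing it, and keeps the line.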

Why: body enrichment already shows small positive Δ on PHP (Laravel
RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum
with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only
``.`` -style method access, so PHP-world logging idioms
(``$logger->info``, Laravel ``Log::info`` facade, bare
``error_log``/``syslog``/``trigger_error``) silently stayed in the
body text. Adding them is pure upside for log-heavy PHP / Laravel
code paths, no test-matrix cost anywhere else.

What this commit does:
  * PHP instance logger:  ``$logger->info(...)`` /
    ``$this->logger->debug(...)`` — covers Monolog convention.
  * PHP static facade:    ``Log::info(...)`` / ``\Log::error(...)`` —
    Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for
    fully-qualified imports).
  * PHP bare calls:       ``error_log(...)`` / ``syslog(...)`` /
    ``trigger_error(...)`` — stdlib helpers.
  * PSR-3 log levels:     debug / info / notice / warning / error /
    critical / alert / emergency, plus the common ``log``/``warn``
    aliases already used by the rest of the regex.

What this commit deliberately does NOT do: extend
``_looks_like_signature`` to PHP. I tested that locally — Laravel
RESTful controllers have very short templated bodies
(``$data = $request->validated(); Model::create($data);``) and the
signature line carries type information (``StoreRequest``,
``JsonResponse``, etc.) that the body never repeats. Stripping the
signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and
from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate
wastes budget" heuristic breaks on convention-over-configuration
PHP codebases, so PHP stays in the conservative fallback — signature
line is kept as part of the body snippet. This is documented in the
``_looks_like_signature`` docstring and locked in by
``test_signature_dedup_php_conservative``.

Tests: 10 new PHP log-pattern assertions (instance / static / bare /
negative non-logger calls like ``$user->save()`` / ``User::create()``
/ ``$response->setStatusCode()`` / ``$db->query()``) plus the
conservative-fallback regression test noted above.

Why: my first-try rollout kept PHP in the conservative fallback
based on a MiniLM-only bench that showed regressions on two
Laravel repos. Fresh A/B covering MiniLM + OpenAI 3-large +
Gemini-2-preview + Qwen3-8B across the same two repos tells a
different story — PHP dedup is net +0.110 MRR@3 across the 8
(repo × provider) cells, with the wins concentrated on the top-
tier cloud models where users most expect the quality gain:

  usdt-center  OpenAI 3-large    off Δ=-0.111  on Δ= 0.000  ↑ +0.111
  usdt-center  Gemini-2-preview  off Δ=+0.056  on Δ=+0.139  ↑ +0.083
  six-forum    Gemini-2-preview  off Δ=-0.083  on Δ=-0.033  ↑ +0.050
  six-forum    OpenAI 3-large    off Δ=+0.117  on Δ=+0.117  = 0
  six-forum    Qwen3-8B          off Δ=+0.117  on Δ=+0.117  = 0
  usdt-center  Qwen3-8B          off Δ=-0.111  on Δ=-0.167  ↓ -0.056 *
  usdt-center  MiniLM            off Δ=+0.028  on Δ= 0.000  ↓ -0.028
  six-forum    MiniLM            off Δ=+0.067  on Δ=+0.017  ↓ -0.050

  * Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling),
    any body token that isn't a retrieval signal is noise; this is
    a ceiling-effect artefact, not a regression from dedup.

What: adds PHP back to ``_looks_like_signature`` with the same
three-AND gate every other language uses — name + declaration
keyword + param first-token match. Covers class methods (all
visibility / static / abstract / final prefixes precede
``function``), free functions, classes, interfaces, traits, and
enums. Log-skip patterns from the previous commit stay unchanged.

Mini-LM-on-Laravel regression is documented in the module
docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0``
without touching any other part of the pipeline. The sticky-flag
opt-in is exactly the shape this kind of per-model tradeoff
needs, and why we built it.

Regression test in test_signature_dedup_php now asserts dedup
fires for standard PHP signatures (method / class / interface /
trait) and still returns False for body-only lines and for
overload-style param mismatches, so a future change that breaks
one of those categories gets caught.

Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:

  enum OrderStatus: string { ...     # declaration → dedup
  case PENDING = 'pending';          # enum value  → keep as body

This locks in the behavior Codex spot-checked manually: ``case``
is deliberately NOT in the dedup keyword allowlist, so enum
values stay in the body snippet where they carry real retrieval
signal.

Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
…anguages

Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.

What: adds two rows to the CLI reference ``<details>`` block in every
translated README (EN / zh-CN / ja-JP / ko-KR / hi-IN):

  code-review-graph embed            # fresh repo — body enrichment auto-on
  code-review-graph embed --include-body --confirm-reembed
                                     # existing DB — opt in to body

Native-phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.

Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.

Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026

Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
…lpers

Why: body-embedding (PR tirth8205#345) shipped a `code-review-graph embed` CLI
but `install` doesn't wire up any hook to keep embeddings fresh after
edits. Users who want auto-refresh have to hand-craft a PostToolUse
entry in settings.json — easy to get wrong (wrong schema field, no
stdio redirect, no cost-gate). This commit lands the building blocks
so the next commit can expose a single `--auto-embed-hook` flag.

What:
- New `generate_auto_embed_hook_entry(repo_root)` in skills.py. Returns
  a single PostToolUse entry using the canonical Claude Code Jan-2026
  schema (`async: true`, NOT the invalid `run_in_background`). Matcher
  covers `Edit|Write|MultiEdit` (Bash is intentionally left to the
  standalone update hook since Bash tools don't produce code edits).
- Chains `update --skip-flows && embed` so graph.db is fresh when
  embed reads it; the second update call is a hash-dedup no-op. This
  eliminates a cross-process race that the separate update hook and
  this new embed hook would otherwise have on graph.db.
- `CRG_EMBED_INCLUDE_BODY=0` is prefixed to each CLI call so the hook
  never auto-triggers PR tirth8205#345's body-mode cost bomb — users who want
  body-mode must run `embed --include-body --confirm-reembed` manually
  and accept the sticky-flag flip-flop (documented in README).
- Full stdout+stderr redirection (`>/dev/null 2>&1` + `|| true`) keeps
  the MCP JSON-RPC stream uncorrupted even when embed prints progress
  or exits non-zero (e.g. missing provider).
- `install_hooks(..., include_auto_embed=False)` adds the kwarg with
  a False default — existing callers see JSON-struct-identical output.
- New `remove_auto_embed_hook_entry(...)` with defensive handling for
  malformed JSON / non-dict root / non-dict hooks / non-list PostToolUse
  (all return False without writing, no .bak on no-match — protects
  pristine baselines).

Tests (36 new, all passing):
- TestGenerateAutoEmbedHookEntry (13 cases): schema conformance,
  chaining order, stdio redirect count, env-var placement via regex,
  path-quoting, matcher content.
- TestInstallHooksAutoEmbed (4 cases): entry dedup, JSON-struct deep
  equality vs pre-change behaviour, qoder platform parity.
- TestRemoveAutoEmbedHook (9 cases): exact-dict removal, no-op-no-bak,
  user-custom preservation, 4 malformed-JSON defensive tests.

Plan: .omc/plans/feat-install-auto-embed-hook.md (local, not committed).
Stacked on PR tirth8205#345 (feat/body-embedding). Depends on `embed` CLI
introduced there; if tirth8205#345 is rebased/squashed, this branch needs
refresh.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
Why: the hook-generation machinery from the previous commit is only
useful if users can opt in from the CLI. This wires the flag through
`install` and its `init` alias, surfaces pre-flight warnings for the
two most common silent-failure modes (missing sentence-transformers,
unsupported platform), and adds integration tests that exercise the
full argparse → subprocess → settings.json path.

What:
- `install` and `init` both gain a mutually-exclusive flag group:
  `--auto-embed-hook` installs the entry; `--no-auto-embed-hook`
  removes an exact-match entry without touching anything else in
  settings.json. Default is OFF (opt-in) — passing neither flag
  reproduces the pre-change behaviour byte-for-byte on the wire.
- `_handle_init` parses the flags into `want_auto_embed` /
  `want_remove_auto_embed`, emits a stderr warning when the flag is
  passed with a non-claude/qoder `--platform` (the hook block already
  skips those platforms, so the warning is the only user-facing
  signal), and a second stderr warning when the default local
  provider's `sentence_transformers` extra is missing — both warnings
  are non-blocking per the "never block the agent" principle.
- Removal path explicitly does NOT call install_hooks to avoid
  re-merging the default update/SessionStart entries on top of any
  user customisation that happened between install and removal.
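The mutually-exclusive flag group can be sketched with plain argparse; subparser and dest names here are assumptions for illustration, not the actual cli.py wiring:

```python
import argparse

parser = argparse.ArgumentParser(prog="code-review-graph")
sub = parser.add_subparsers(dest="command")
for name in ("install", "init"):  # init is an alias of install
    p = sub.add_parser(name)
    group = p.add_mutually_exclusive_group()  # passing both exits 2
    group.add_argument("--auto-embed-hook", action="store_true",
                       help="install the PostToolUse auto-embed entry")
    group.add_argument("--no-auto-embed-hook", action="store_true",
                       help="remove an exact-match entry, touch nothing else")

args = parser.parse_args(["install", "--auto-embed-hook"])
want_auto_embed = args.auto_embed_hook
want_remove_auto_embed = args.no_auto_embed_hook
```

With neither flag passed, both booleans default to False, which preserves the opt-in default.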

Tests:
- 9 new test_cli.py tests covering: flag acceptance on both
  subcommands, mutex-violation argparse exit 2, dry-run writes
  nothing, sentence-transformers find_spec monkeypatch warning,
  cursor-platform stderr warning, help-text contains both flags.
- New tests/test_integration_install_auto_embed.py with two
  subprocess-driven round-trips — install-and-verify-schema plus
  install-then-remove — that exercise the full package entry point
  via `python -m code_review_graph`. Replaces the v1 plan's fragile
  jq-based smoke script with a CI-gated Python test. Asserts
  `async: true`, double stdout/stderr redirect, `|| true` tail,
  `CRG_EMBED_INCLUDE_BODY=0` presence, and static (not
  $CLAUDE_PROJECT_DIR) repo path.

All 113 tests across the three touched files pass. Pre-existing 60
failures on test_parser / test_refactor / test_main (`--tools` flag)
are documented in PR tirth8205#345 as unrelated upstream issues.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
Why: Codex review flagged four issues (1 BLOCKING, 2 MAJOR, 1 MINOR) after
inspecting the three atomic commits above. This commit lands all four
fixes plus regression tests so the branch reaches NO SUBSTANTIVE ISSUES
before pushing to upstream.

BLOCKING — shell injection via repo path (skills.py:~588):
- `generate_auto_embed_hook_entry` was using `json.dumps(...)` to quote
  the repo path, but json-escaping is NOT sufficient for bash double
  quotes: `/tmp/repo$(touch /tmp/pwned)` would still trigger command
  substitution when the hook fires. Switched to `shlex.quote(...)`
  which wraps any path with shell metacharacters in single quotes
  that bash does not expand. Added two regression tests asserting
  that paths containing `$()` and backticks land inside single-quoted
  tokens, not double-quoted ones. Updated the existing path-with-spaces
  test to expect the new shlex quoting shape.
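The difference is easy to demonstrate: `json.dumps` produces a double-quoted string, and bash still performs command substitution inside double quotes, while `shlex.quote` falls back to single quotes, inside which bash expands nothing.

```python
import json
import shlex

malicious = "/tmp/repo$(touch /tmp/pwned)"

# json.dumps wraps in double quotes — bash would still expand $() inside.
json_quoted = json.dumps(malicious)

# shlex.quote wraps in single quotes — bash performs no expansion inside.
shell_quoted = shlex.quote(malicious)
```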

MAJOR — `--no-auto-embed-hook` had full-install side effects
(cli.py _handle_init):
- The flag was documented as "idempotent removal that preserves user
  config", but the original wiring still ran `install_platform_configs`,
  `.gitignore` update, `generate_skills`, `inject_claude_md`,
  `install_git_hook`, etc. Short-circuited `_handle_init` at the top:
  when `want_remove_auto_embed` is True we now do ONLY the removal
  and return, with a `--dry-run` preview path and clean no-op messages
  for `--no-hooks` and non-claude/qoder platforms. New test
  `test_no_auto_embed_hook_short_circuits_install_flow` asserts no
  .mcp.json / .gitignore / skills dir / CLAUDE.md / pre-commit hook
  is created when only `--no-auto-embed-hook` is passed.

MAJOR — malformed / JSONC settings.json was silently overwritten
(skills.py install_hooks):
- Backup `shutil.copy2` lived INSIDE the `try: json.loads(...)` block,
  so a parse failure skipped the backup and the subsequent write
  destroyed the original file. Reordered so the backup always runs
  first, then parse; on `JSONDecodeError` we now raise `RuntimeError`
  with a message pointing the user at the fresh `.bak` and asking
  them to fix or remove the comment manually, instead of silently
  resetting to `{}` and writing a blank shape. New test
  `test_install_aborts_on_malformed_json_with_backup` covers the
  JSONC scenario Codex specifically called out.
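The reordering amounts to hoisting the backup out of the parse try-block. A minimal sketch (helper name and message wording assumed, not the actual install_hooks code):

```python
import json
import shutil
from pathlib import Path


def load_settings_with_backup(path: Path) -> dict:
    """Back up first, parse second, so a parse failure never loses data."""
    backup = path.parent / (path.name + ".bak")
    shutil.copy2(path, backup)  # always runs, even if json.loads fails below
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        # Abort instead of silently resetting to {} and overwriting the file.
        raise RuntimeError(
            f"settings.json is not valid JSON (JSONC comments?). "
            f"Original preserved at {backup}; fix or remove the offending "
            f"content manually."
        ) from exc
```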

MINOR — warnings fired in dry-run / no-hooks modes (cli.py):
- The `sentence-transformers is missing` and non-claude/qoder
  platform warnings were printed even when no hook would be installed
  (`--dry-run` or `--no-hooks`). Gated both warnings behind
  `want_auto_embed and not skip_hooks and not dry_run` so the CLI
  stays quiet in the modes where the warning isn't actionable. Two
  new tests (`test_extras_warning_gated_by_skip_hooks`,
  `test_extras_warning_gated_by_dry_run`) keep this regression-safe.
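The gate itself is a single compound condition; a hedged sketch (function and parameter names assumed for illustration):

```python
import importlib.util
import sys


def maybe_warn_missing_extras(want_auto_embed: bool,
                              skip_hooks: bool,
                              dry_run: bool) -> bool:
    """Warn only when a hook would actually be installed."""
    if not (want_auto_embed and not skip_hooks and not dry_run):
        return False  # --dry-run / --no-hooks: warning isn't actionable
    if importlib.util.find_spec("sentence_transformers") is None:
        print("warning: sentence-transformers extra is missing; the default "
              "local provider will fail at embed time", file=sys.stderr)
        return True
    return False
```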

Tests: 120 passing (was 113). No new lint / mypy / bandit findings
on touched files beyond the pre-existing ones documented in PR tirth8205#345.