feat(embeddings): include function body in semantic search text #345
Open
Minidoracat wants to merge 9 commits into tirth8205:main
Conversation
Why: signature-only embeddings carry ~10 tokens per node (name + kind +
params + return type + language). That ceiling wastes the long-context
capacity of modern embedding models — Gemini 2, Qwen3-8B and similar
rank SOTA on MTEB-code largely on their ability to understand the
*body* of a function, but we were handing them only its name. So the
quality advantage of large-context embeddings against local MiniLM was
getting flattened by our own input, not by model capability.
What: `_node_to_text()` now appends a truncated body snippet after the
signature metadata, with the signature line and leading docstrings /
header comments stripped so the budget goes to real logic instead of
documentation. Four design choices worth flagging:
1. Per-provider char budget (ADR-B4): local=700 / google=3000 /
minimax=3000 / openai=6000, env `CRG_EMBED_BODY_MAX_CHARS`
overrides. This gives MiniLM just enough to not silently truncate
and gives the long-context models room to actually use their
capacity.
2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on
`embeddings.db` persists `embed_body_enabled` per-DB. Fresh repos
auto-enable; existing DBs hold legacy behavior until the user
runs `code-review-graph embed --include-body --confirm-reembed`.
This avoids surprising cloud-provider users with a one-time
re-embed API bill on upgrade, and the sticky flag keeps new nodes
consistent with older ones across incremental runs (no mixed
body-on / body-off state).
3. Two-stage IO-bounded filter (ADR-filter-short-circuit): stored
hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the
metadata hash only (zero file IO) and skips when metadata +
file_hash are unchanged. Stage 2 reads the file only when stage
1 can't prove freshness, recomputes the body hash, persists the
combined form. Empty extracted bodies get sha256("") as a
sentinel so they also short-circuit next time; read failures
deliberately do NOT persist a sentinel so transient errors stay
retryable.
4. Language-agnostic body extraction with block-comment + triple-
quote state machine and conservative signature dedup via three-
AND gate (node.name + declaration keyword + param first-token).
Unknown languages fall back to "don't dedup" rather than risk
dropping real body. Coverage today: Python / JS / TS / JSX / TSX /
Java / Kotlin / Rust / Go; other languages keep the signature
line and just eat a bit of their body budget.
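The budget resolution in item 1 can be sketched in a few lines. This is a minimal illustration, not the shipped code: `resolve_body_budget` is a hypothetical helper name, while the table values, table name, and the `CRG_EMBED_BODY_MAX_CHARS` env var come from the PR.

```python
import os

# Values per ADR-B4; the table name matches the PR's
# _BODY_MAX_CHARS_BY_PROVIDER mentioned in the follow-up section.
_BODY_MAX_CHARS_BY_PROVIDER = {
    "local": 700,
    "google": 3000,
    "minimax": 3000,
    "openai": 6000,
}

def resolve_body_budget(provider: str) -> int:
    """Resolve the per-provider body-snippet char budget.

    CRG_EMBED_BODY_MAX_CHARS overrides the table; unknown providers
    fall back to the conservative local budget (assumption).
    """
    override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS")
    if override:
        return int(override)
    return _BODY_MAX_CHARS_BY_PROVIDER.get(provider, 700)
```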
Backward compat: legacy rows (pre-existing sha256 hex in
`embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file
hash) are both parsed correctly by `_split_combined_hash` and trigger
a single safe re-embed on first contact.
Security: `_FileLineCache` resolves paths and enforces a `relative_to`
containment check against the repo root, so a poisoned
`nodes.file_path` can't read `/etc/passwd` or traverse out of the
repo into cloud embedding calls.
Verification: regression MRR@3 on a 10-query golden set (bundled mini
repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent
queries like "escape HTML entities including quotes and backticks"
or "read GEMINI_API_KEY from the environment" move the correct node
into top-3 for the first time because the body now contains the
distinguishing tokens. Tested against real OpenAI-compatible endpoint
too (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded
on first run, 0 on a repeat with sticky flag honored — filter IO
short-circuit verified end-to-end.
Tests: 59 new cases (file-line cache LRU + containment, signature
dedup with decorator / recursive / params mismatch / unknown lang,
block-comment and triple-quote state machine with JSDoc / Javadoc /
Python single-line and multi-line docstrings / long comments,
per-provider budget resolution + env override, sticky flag across
runs including CLI flip, filter short-circuit with and without
populated file_hash, read-failure must-retry regression, legacy hash
backward compat). MRR regression test skipped gracefully when
sentence-transformers isn't installed (CI matrix with `.[dev]` only
stays green; CI jobs that install `.[embeddings]` activate the gate).
Scope: all four providers (Local / Google / MiniMax / OpenAI-
compatible) benefit transparently; no schema migration on graph.db,
no change to `EmbeddingStore.__init__` signature, no change to
parser.py / migrations.py / search.py / tools.py. Adds one new
subcommand `embed` and a private `embeddings_meta` table on
`embeddings.db`.
Why: Codex round 7 caught that the legacy inference
``db_path.parent -> repo root`` silently breaks for the supported
``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir,
not inside ``<repo>/.code-review-graph/``). In that case
``_FileLineCache`` received the cache dir as repo_root and every
body-snippet read failed the containment check, making body
enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two-
stage filter then wrote metadata-only hashes, so even opting back
in via ``embed --include-body --confirm-reembed`` wouldn't help.
What: ``_repo_root_guess`` now:
1. keeps the fast path when db_path parent is literally
``.code-review-graph`` (zero-config common case, no cost)
2. otherwise delegates to ``incremental.find_project_root`` which
already honors ``CRG_REPO_ROOT`` and walks up for a ``.git``
marker — the same resolution order the rest of the CLI uses
3. returns None only when both fail, letting callers treat
absolute file_path values as self-locating
Regression test: ``TestRepoRootGuessExternalDataDir`` covers the
legacy layout, the env-override fallback, and an end-to-end read
through ``_FileLineCache`` from an external cache dir. Existing
``TestFilterStageShortCircuit`` fixtures now write their DB under
``<tmp_path>/.code-review-graph/`` so they hit the fast path
rather than the cwd-relative fallback.
Why: Codex round 8 found the round-7 fast path still misfires when
``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g.
``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or
``/shared/cache/<repo>/.code-review-graph``. In that case
``db_path.parent.name == ".code-review-graph"`` is True, the fast
path returns ``parent.parent`` (``/var/tmp`` or the shared-cache
parent), and ``_FileLineCache`` silently falls back to metadata for
every node. My round-7 tests only covered an external dir NOT named
``.code-review-graph``, so the regression slipped.
What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When
that env var is set we always defer to
``incremental.find_project_root`` regardless of what ``db_path``
looks like; the zero-config fast path only runs when the user is on
the default ``<repo>/.code-review-graph/graph.db`` layout and has
NOT overridden the data dir.
Regression test: ``test_crg_data_dir_named_dot_code_review_graph``
exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at a
``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT`` pointed
at a distant ``real-repo`` tree — and asserts the root resolves to
``real-repo``, not the cache's parent.
Why: cross-repo benchmark showed body enrichment regresses on
framework-heavy codebases (humanitzbot — TypeScript Discord bot —
MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found
that signature dedup + block-comment state machine are all working,
but TS/Discord function bodies are thick with framework boilerplate
that dilutes the retrieval signal. Examples from one regressing
function:
await interaction.deferReply();
const locale = interaction.locale;
logger.info('handling power request', { signal });
console.log('...');
On small models (MiniLM: 256-token window), those lines fill the
budget with `await`, `deferReply`, `locale`, `logger`, `console`
— high-frequency framework tokens that don't discriminate between
similarly-named commands (``_power`` vs ``_status`` vs ``_console``).
What: add a language-agnostic check to the existing body-extraction
state machine that skips lines which ARE purely a log/print call
with no other executable content. Covers:
* Python: ``print(...)``, ``logger.info(...)``, ``logging.debug(...)``
* JS/TS: ``console.log(...)``, ``logger.warn(...)``
* Java: ``System.out.println(...)``, ``log.debug(...)``
* Go: ``fmt.Println(...)``, ``log.Printf(...)``
* Rust: ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)``
Lines where a log call is mixed with substantive code
(``if err: logger.error(err); raise``) are preserved — they DO
carry retrieval signal.
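A prefix check covering the patterns above might look like the sketch below. This is a reconstruction, not the shipped regex: the receiver names follow the PR text, and (as a later review round in this PR points out) a prefix-anchored match alone is not sufficient for mixed-statement lines, so the real check pairs it with a balanced-close walk.

```python
import re

_LOG_RECEIVERS = "log|logger|logging|logs|console|slog|trace|LOG|Logger|LOGGER"

# Illustrative log-only prefix check (assumption: the shipped
# _LOG_ONLY_LINE_RE differs in detail).
_LOG_ONLY_LINE_RE = re.compile(
    r"^\s*(?:await\s+)?"
    r"(?:"
    rf"(?:self\.|this\.|_)?(?:{_LOG_RECEIVERS})\.\w+"  # logger.info / console.log
    r"|System\.out\.println"                            # Java
    r"|fmt\.Print(?:ln|f)?"                             # Go (stdout only, no Fprint*)
    r"|print"                                           # Python
    r"|println!|eprintln!|dbg!"                         # Rust macros
    r")\s*\("
)

def looks_log_only(line: str) -> bool:
    return bool(_LOG_ONLY_LINE_RE.match(line))
```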
Scope: one regex + one helper + one extra branch in the existing
state machine loop. Stays within ADR-A1 "language-agnostic body
extraction". Does NOT introduce framework-specific heuristics —
log/print is a universal pattern, not a framework one. Also does
NOT make the body hash non-deterministic (same input still produces
same output), so the combined-hash filter short-circuit (AC-13)
still holds.
Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust
log patterns, an integration test verifying scattered log lines
get stripped from a multi-statement TS body while real logic
survives, plus negative tests for mixed-statement lines and
non-log calls (``await interaction.deferReply()`` is NOT a log
call and stays kept).
Follow-up: framework-specific SDK boilerplate (Discord
``interaction.X``, React hooks, Spring ``@Autowired``) is where
the next round of improvement lives, but that can't be done
generically without either a tokenizer-based adaptive budget or
per-language SDK awareness — both are out of scope for this PR.
Round 10 flagged two real over-matches in 7daf93c:
1. The generic ``identifier.(error|warn|info|...)`` branch swallowed
   non-logger APIs. In THIS repo it dropped ``serve_cmd.error(...)``
   (argparse's error method, which raises rather than logs). Same
   class of issue for ``fmt.Fprintf(w, "ok")`` where ``w`` is an
   HTTP response writer — that IS the function's output, not a log.
2. ``re.match`` is prefix-anchored, so ``logger.info('x') or
   do_work()`` passed the log-only check even though it has
   substantive trailing code. Violated the "mixed-statement lines
   stay kept" design promise.
Fixes:
* Receiver identifier is now an explicit whitelist — ``log``,
  ``logger``, ``logging``, ``logs``, ``console``, ``slog``,
  ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing),
  ``self.log``/``this.log``/``_log``/``_logger`` and the
  ``self.logger``/``this.logger`` variants. Custom loggers named
  differently fall through — they'll be kept as real code. Better
  to under-skip than mis-skip.
* Dropped ``fmt.Fprint*`` from the Go branch — those take an
  arbitrary ``io.Writer`` (commonly HTTP responses). Only
  ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like).
* ``_is_log_only_line`` now walks the matched string after the
  opening ``(``, tracking string literals and paren depth, to find
  the balanced close. Anything non-trivial after that ``)`` means
  the line has other executable content and is NOT log-only.
Regression tests: tail-continuation cases (``logger.info('x') or
do_work()``), non-logger identifier cases (``serve_cmd.error``,
``db.warn``, ``item.notice``), and ``fmt.Fprintf`` cases now stay
kept. Pure-log cases including nested-paren args
(``logger.info('val=' + str(compute(x, y)))``) still get dropped.
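The balanced-close walk can be sketched as a standalone helper; an illustrative reimplementation of the tail check described above, not the shipped `_is_log_only_line` (the function name and signature here are assumptions).

```python
def call_has_trivial_tail(line: str, open_idx: int) -> bool:
    """Walk from the opening '(' of a matched log call, tracking string
    literals and paren depth, to find the balanced close. Returns True
    only when nothing substantive trails the close, i.e. the line is
    purely the log call (an optional ';' is tolerated).
    """
    depth = 0
    in_str: str | None = None
    i = open_idx
    while i < len(line):
        ch = line[i]
        if in_str:
            if ch == "\\":
                i += 2  # skip the escaped character
                continue
            if ch == in_str:
                in_str = None
        elif ch in "'\"":
            in_str = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                tail = line[i + 1:].strip()
                return tail in ("", ";")  # anything else is real code
        i += 1
    return False  # unbalanced call: refuse to classify as log-only
```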
Why: body enrichment already shows small positive Δ on PHP (Laravel
RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum
with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only
``.``-style method access, so PHP-world logging idioms
(``$logger->info``, Laravel ``Log::info`` facade, bare
``error_log``/``syslog``/``trigger_error``) silently stayed in the
body text. Adding them is pure upside for log-heavy PHP / Laravel
code paths, no test-matrix cost anywhere else.
What this commit does:
* PHP instance logger: ``$logger->info(...)`` /
``$this->logger->debug(...)`` — covers Monolog convention.
* PHP static facade: ``Log::info(...)`` / ``\Log::error(...)`` —
Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for
fully-qualified imports).
* PHP bare calls: ``error_log(...)`` / ``syslog(...)`` /
``trigger_error(...)`` — stdlib helpers.
* PSR-3 log levels: debug / info / notice / warning / error /
critical / alert / emergency, plus the common ``log``/``warn``
aliases already used by the rest of the regex.
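The PHP branches described above might be expressed like this; an illustrative reconstruction, not the shipped `_LOG_ONLY_LINE_RE` additions.

```python
import re

# PSR-3 levels plus the common log/warn aliases named in the commit.
_PSR3_LEVELS = "debug|info|notice|warning|error|critical|alert|emergency|log|warn"

# Sketch of the PHP log-only branches (assumption: the real regex is
# merged into the existing language-agnostic pattern, not standalone).
_PHP_LOG_RE = re.compile(
    r"^\s*(?:"
    rf"\$(?:this->)?logger->(?:{_PSR3_LEVELS})"  # Monolog-style instance logger
    rf"|\\?Log::(?:{_PSR3_LEVELS})"              # Laravel Log facade, optionally \Log
    r"|error_log|syslog|trigger_error"           # bare stdlib helpers
    r")\s*\("
)

def looks_php_log_only(line: str) -> bool:
    return bool(_PHP_LOG_RE.match(line))
```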
What this commit deliberately does NOT do: extend
``_looks_like_signature`` to PHP. I tested that locally — Laravel
RESTful controllers have very short templated bodies
(``$data = $request->validated(); Model::create($data);``) and the
signature line carries type information (``StoreRequest``,
``JsonResponse``, etc.) that the body never repeats. Stripping the
signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and
from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate
wastes budget" heuristic breaks on convention-over-configuration
PHP codebases, so PHP stays in the conservative fallback — signature
line is kept as part of the body snippet. This is documented in the
``_looks_like_signature`` docstring and locked in by
``test_signature_dedup_php_conservative``.
Tests: 10 new PHP log-pattern assertions (instance / static / bare /
negative non-logger calls like ``$user->save()`` / ``User::create()``
/ ``$response->setStatusCode()`` / ``$db->query()``) plus the
conservative-fallback regression test noted above.
Why: my first-try rollout kept PHP in the conservative fallback
based on a MiniLM-only bench that showed regressions on two
Laravel repos. Fresh A/B covering MiniLM + OpenAI 3-large +
Gemini-2-preview + Qwen3-8B across the same two repos tells a
different story — PHP dedup is net +0.110 MRR@3 across the 8
(repo × provider) cells, with the wins concentrated on the top-
tier cloud models where users most expect the quality gain:
usdt-center OpenAI 3-large off Δ=-0.111 on Δ= 0.000 ↑ +0.111
usdt-center Gemini-2-preview off Δ=+0.056 on Δ=+0.139 ↑ +0.083
six-forum Gemini-2-preview off Δ=-0.083 on Δ=-0.033 ↑ +0.050
six-forum OpenAI 3-large off Δ=+0.117 on Δ=+0.117 = 0
six-forum Qwen3-8B off Δ=+0.117 on Δ=+0.117 = 0
usdt-center Qwen3-8B off Δ=-0.111 on Δ=-0.167 ↓ -0.056 *
usdt-center MiniLM off Δ=+0.028 on Δ= 0.000 ↓ -0.028
six-forum MiniLM off Δ=+0.067 on Δ=+0.017 ↓ -0.050
* Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling),
any body token that isn't a retrieval signal is noise; this is
a ceiling-effect artefact, not a regression from dedup.
What: adds PHP back to ``_looks_like_signature`` with the same
three-AND gate every other language uses — name + declaration
keyword + param first-token match. Covers class methods (all
visibility / static / abstract / final prefixes precede
``function``), free functions, classes, interfaces, traits, and
enums. Log-skip patterns from the previous commit stay unchanged.
Mini-LM-on-Laravel regression is documented in the module
docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0``
without touching any other part of the pipeline. The sticky-flag
opt-in is exactly the shape this kind of per-model tradeoff
needs, and why we built it.
Regression test in test_signature_dedup_php now asserts dedup
fires for standard PHP signatures (method / class / interface /
trait) and still returns False for body-only lines and for
overload-style param mismatches, so a future change that breaks
one of those categories gets caught.
Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:
enum OrderStatus: string { ... # declaration → dedup
case PENDING = 'pending'; # enum value → keep as body
This locks in the behavior Codex spot-checked manually: ``case``
is deliberately NOT in the dedup keyword allowlist, so enum
values stay in the body snippet where they carry real retrieval
signal.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request — Apr 19, 2026
…anguages
Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.
What: adds two rows to the CLI reference ``<details>`` block in every
translated README (EN / zh-CN / ja-JP / ko-KR / hi-IN):
code-review-graph embed # fresh repo — body enrichment auto-on
code-review-graph embed --include-body --confirm-reembed
# existing DB — opt in to body
Native-phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.
Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request — Apr 19, 2026
…lpers
Why: body-embedding (PR tirth8205#345) shipped a `code-review-graph
embed` CLI but `install` doesn't wire up any hook to keep embeddings
fresh after edits. Users who want auto-refresh have to hand-craft a
PostToolUse entry in settings.json — easy to get wrong (wrong schema
field, no stdio redirect, no cost gate). This commit lands the
building blocks so the next commit can expose a single
`--auto-embed-hook` flag.
What:
- New `generate_auto_embed_hook_entry(repo_root)` in skills.py.
  Returns a single PostToolUse entry using the canonical Claude Code
  Jan-2026 schema (`async: true`, NOT the invalid
  `run_in_background`). Matcher covers `Edit|Write|MultiEdit` (Bash
  is intentionally left to the standalone update hook since Bash
  tools don't produce code edits).
- Chains `update --skip-flows && embed` so graph.db is fresh when
  embed reads it; the second update call is a hash-dedup no-op. This
  eliminates a cross-process race that the separate update hook and
  this new embed hook would otherwise have on graph.db.
- `CRG_EMBED_INCLUDE_BODY=0` is prefixed to each CLI call so the
  hook never auto-triggers PR tirth8205#345's body-mode cost bomb —
  users who want body-mode must run `embed --include-body
  --confirm-reembed` manually and accept the sticky-flag flip-flop
  (documented in README).
- Full stdout+stderr redirection (`>/dev/null 2>&1` + `|| true`)
  keeps the MCP JSON-RPC stream uncorrupted even when embed prints
  progress or exits non-zero (e.g. missing provider).
- `install_hooks(..., include_auto_embed=False)` adds the kwarg with
  a False default — existing callers see JSON-struct-identical
  output.
- New `remove_auto_embed_hook_entry(...)` with defensive handling
  for malformed JSON / non-dict root / non-dict hooks / non-list
  PostToolUse (all return False without writing, no .bak on
  no-match — protects pristine baselines).
Tests (36 new, all passing):
- TestGenerateAutoEmbedHookEntry (13 cases): schema conformance,
  chaining order, stdio redirect count, env-var placement via regex,
  path quoting, matcher content.
- TestInstallHooksAutoEmbed (4 cases): entry dedup, JSON-struct deep
  equality vs pre-change behaviour, qoder platform parity.
- TestRemoveAutoEmbedHook (9 cases): exact-dict removal,
  no-op-no-bak, user-custom preservation, 4 malformed-JSON defensive
  tests.
Plan: .omc/plans/feat-install-auto-embed-hook.md (local, not
committed). Stacked on PR tirth8205#345 (feat/body-embedding).
Depends on the `embed` CLI introduced there; if tirth8205#345 is
rebased/squashed, this branch needs a refresh.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request — Apr 19, 2026
Why: the hook-generation machinery from the previous commit is only
useful if users can opt in from the CLI. This wires the flag through
`install` and its `init` alias, surfaces pre-flight warnings for the
two most common silent-failure modes (missing sentence-transformers,
unsupported platform), and adds integration tests that exercise the
full argparse → subprocess → settings.json path.
What:
- `install` and `init` both gain a mutually exclusive flag group:
  `--auto-embed-hook` installs the entry; `--no-auto-embed-hook`
  removes an exact-match entry without touching anything else in
  settings.json. Default is OFF (opt-in) — passing neither flag
  produces behaviour byte-for-byte equivalent to the pre-change
  output on the wire.
- `_handle_init` parses the flags into `want_auto_embed` /
  `want_remove_auto_embed`, emits a stderr warning when the flag is
  passed with a non-claude/qoder `--platform` (the hook block
  already skips those platforms, so the warning is the only
  user-facing signal), and a second stderr warning when the default
  local provider's `sentence_transformers` extra is missing — both
  warnings are non-blocking per the "never block the agent"
  principle.
- The removal path explicitly does NOT call install_hooks, to avoid
  re-merging the default update/SessionStart entries on top of any
  user customisation that happened between install and removal.
Tests:
- 9 new test_cli.py tests covering: flag acceptance on both
  subcommands, mutex-violation argparse exit 2, dry-run writes
  nothing, sentence-transformers find_spec monkeypatch warning,
  cursor-platform stderr warning, help text contains both flags.
- New tests/test_integration_install_auto_embed.py with two
  subprocess-driven round-trips — install-and-verify-schema plus
  install-then-remove — that exercise the full package entry point
  via `python -m code_review_graph`. Replaces the v1 plan's fragile
  jq-based smoke script with a CI-gated Python test.
Asserts `async: true`, double stdout/stderr redirect, `|| true`
tail, `CRG_EMBED_INCLUDE_BODY=0` presence, and a static (not
$CLAUDE_PROJECT_DIR) repo path. All 113 tests across the three
touched files pass. Pre-existing 60 failures on test_parser /
test_refactor / test_main (`--tools` flag) are documented in PR
tirth8205#345 as unrelated upstream issues.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request — Apr 19, 2026
Why: Codex review flagged 1 BLOCKING + 2 MAJOR + 1 MINOR issue after
inspecting the three atomic commits above. This commit lands all four
fixes plus regression tests so the branch reaches NO SUBSTANTIVE ISSUES
before pushing to upstream.
BLOCKING — shell injection via repo path (skills.py:~588):
- `generate_auto_embed_hook_entry` was using `json.dumps(...)` to quote
the repo path, but json-escaping is NOT sufficient for bash double
quotes: `/tmp/repo$(touch /tmp/pwned)` would still trigger command
substitution when the hook fires. Switched to `shlex.quote(...)`
which wraps any path with shell metacharacters in single quotes
that bash does not expand. Added two regression tests asserting
that paths containing `$()` and backticks land inside single-quoted
tokens, not double-quoted ones. Kept the path-with-spaces test
updated to expect the new shlex quoting shape.
MAJOR — `--no-auto-embed-hook` had full-install side effects
(cli.py _handle_init):
- The flag was documented as "idempotent removal that preserves user
config", but the original wiring still ran `install_platform_configs`,
`.gitignore` update, `generate_skills`, `inject_claude_md`,
`install_git_hook`, etc. Short-circuited `_handle_init` at the top:
when `want_remove_auto_embed` is True we now do ONLY the removal
and return, with a `--dry-run` preview path and clean no-op messages
for `--no-hooks` and non-claude/qoder platforms. New test
`test_no_auto_embed_hook_short_circuits_install_flow` asserts no
.mcp.json / .gitignore / skills dir / CLAUDE.md / pre-commit hook
is created when only `--no-auto-embed-hook` is passed.
MAJOR — malformed / JSONC settings.json was silently overwritten
(skills.py install_hooks):
- Backup `shutil.copy2` lived INSIDE the `try: json.loads(...)` block,
so a parse failure skipped the backup and the subsequent write
destroyed the original file. Reordered so the backup always runs
first, then parse; on `JSONDecodeError` we now raise `RuntimeError`
with a message pointing the user at the fresh `.bak` and asking
them to fix or remove the comment manually, instead of silently
resetting to `{}` and writing a blank shape. New test
`test_install_aborts_on_malformed_json_with_backup` covers the
JSONC scenario Codex specifically called out.
MINOR — warnings fired in dry-run / no-hooks modes (cli.py):
- The `sentence-transformers is missing` and non-claude/qoder
platform warnings were printed even when no hook would be installed
(`--dry-run` or `--no-hooks`). Gated both warnings behind
`want_auto_embed and not skip_hooks and not dry_run` so the CLI
stays quiet in the modes where the warning isn't actionable. Two
new tests (`test_extras_warning_gated_by_skip_hooks`,
`test_extras_warning_gated_by_dry_run`) keep this regression-safe.
Tests: 120 passing (was 113). No new lint / mypy / bandit findings
on touched files beyond the pre-existing ones documented in PR tirth8205#345.
Summary
`_node_to_text()` now appends a truncated function body snippet next to the metadata we already embed, so semantic search can rank by what a function does rather than just its name. All four providers (Local / Google / MiniMax / OpenAI-compatible) benefit transparently; no schema migration, no change to the `EmbeddingStore.__init__` signature, no change to `parser.py` / `migrations.py` / `search.py` / `tools.py`. The motivation is the note already in the README: signature-only embeddings (~10 tokens per node) flatten the quality gap between small models and long-context SOTA embeddings (Gemini 2 preview, Qwen3-8B). This PR lifts the input budget closer to what those models were trained to consume.
Design decisions (ADR summary)
"""...""",/** */,# ...,///,*), and collapses blank runs before truncating. Three-AND signature-dedup gate (name + language decl keyword + param first-token) for Python / JS / TS / JSX / TSX / Java / Kotlin / Rust / Go; other languages fall through as "don't dedup" so the body line just eats a bit of its own budget.logger.info(...)/console.log(...)/fmt.Println(...)/println!(...)lines that carry no retrieval signal are skipped everywhere in the body, language-agnostically. Tightened in a follow-up iteration so we don't misfire onserve_cmd.error(...)(argparse) orfmt.Fprintf(w, ...)(arbitrary writer) — receiver identifier is a fixed whitelist, and the helper walks the balanced close of the log call to ensure nothing substantive trails it.ADR-B4) —local=700 / google=3000 / minimax=3000 / openai=6000,CRG_EMBED_BODY_MAX_CHARSenv var overrides. MiniLM's 256-token window just fits; long-context models get enough room to actually use their capacity.ADR-C3) — newembeddings_metatable onembeddings.dbpersistsembed_body_enabledper-DB. Fresh repos auto-enable; existing DBs keep legacy signature-only behaviour until the user runscode-review-graph embed --include-body --confirm-reembed. Avoids one-time re-embed API bill surprise on upgrade; keeps mixed-state rows impossible across incremental runs.text_hash = meta_hash:body_hash:file_hash. Stage 1 computes metadata-hash only (zero IO) and skips when metadata +file_hash(fromGraphNode.file_hash) are unchanged. Stage 2 reads the file only when stage 1 can't prove freshness. 
Empty extracted bodies getsha256("")as a sentinel so they short-circuit too; read failures deliberately don't persist a sentinel so transient errors stay retryable.text_hash, no:) parsed correctly by_split_combined_hashand trigger a single safe re-embed on first contact.CRG_DATA_DIR/CRG_REPO_ROOTenv overrides fully respected (path containment check usesincremental.find_project_root)._FileLineCacheresolves paths and enforcesrelative_to(repo_root.resolve())so a poisonednodes.file_pathcan't read/etc/passwdor traverse out of the repo into cloud embedding calls.Benchmark evidence
Ran MRR@3 on two codebases with multiple providers, EN + ZH queries, legacy metadata-only vs body-enriched. Every provider's full table is below; the short version is: SOTA models consistently gain, small models are repo-dependent.
`code_review_graph` itself — 50 golden queries, 537-node corpus, providers including `text-embedding-v4` (Qwen-v4). All seven providers positive. BGE-m3 went from MRR 0.60 to 0.81 — a mid-tier model landing in top-tier territory.
Second benchmark — an external TypeScript Discord-bot codebase, 12 golden queries, 1667-node corpus.
Top-tier models (Qwen3-8B / OpenAI-3-large / Gemini-2) stay positive on a framework-heavy TS codebase. MiniLM drops -0.18 on English — its 256-token window is too small to absorb Discord SDK boilerplate (`interaction.deferReply()`, `t('commands:...', locale)`) without crowding out real logic tokens. The sticky-flag opt-in plus the `CRG_EMBED_INCLUDE_BODY=0` escape hatch is the right shape for that failure mode: a user who sees regression on their specific repo can revert with a single env var without losing their existing embeddings.
Third / fourth benchmarks — two internal Laravel / PHP codebases, SOTA A/B.
Ran a four-provider A/B (MiniLM + OpenAI 3-large + Gemini-2-preview + Qwen3-8B) against two real Laravel apps (281-node ledger service and 1887-node API forum backend, with 6 and 10 hand-graded queries respectively), comparing PHP signature dedup OFF vs ON:
Net across 8 (repo × provider) cells: +0.110 MRR@3 — gains concentrate on the top-tier cloud models where users most expect the quality upside. MiniLM regresses by -0.03 to -0.05, and those users can revert via `CRG_EMBED_INCLUDE_BODY=0`. An earlier MiniLM-only bench had me keep PHP in the conservative fallback; the SOTA A/B overturned that, and the `_looks_like_signature` docstring records the reversal so future contributors don't undo it again. PSR-3 / Monolog / Laravel-facade log-only patterns (`$logger->info`, `Log::info`, `error_log`) are unconditionally included in the language-agnostic log-skip regex.
Cross-repo ranking (body-on MRR, two-repo avg)
Test plan
- Unit tests: file-line cache LRU + containment, signature dedup, block-comment / triple-quote state machine, per-provider budget resolution + env override, sticky flag across runs, filter short-circuit with and without populated `file_hash`, read-failure must-retry regression, legacy hash backward compat, `CRG_DATA_DIR` / `CRG_REPO_ROOT` external cache dir regression.
- MRR regression gate (`tests/test_embeddings_mrr.py`) — Local MiniLM against a bundled 10-query golden set; enforces `MRR(body-on) >= MRR(legacy) - 0.02 AND > MRR(legacy)` (non-inferiority plus strict improvement). Guarded by `pytest.importorskip("sentence_transformers")` so default `.[dev]` CI stays green; installing `.[embeddings]` activates the gate.
- Real-endpoint verification against an OpenAI-compatible endpoint (`text-embedding-v4`) plus OpenRouter across eight embedding models.
- Codex review reached NO SUBSTANTIVE ISSUES. All findings (combined hash with `file_hash`, path traversal containment, Kotlin `fun` without return type, empty-body sentinel, transient-read-failure retry, `CRG_DATA_DIR` fast-path misfire, log-skip over-match, `_is_log_only_line` tail-continuation) are fixed with regression tests.
- Full suite locally: 158 embedding-related tests pass on top of the existing 1020 passing tests. Repo-wide failures I saw (60) sit in `test_multilang` / `test_parser` / `test_main` around Julia / GDScript / ReScript parsing and the new `--tools` flag — they reproduce on `upstream/main` without this branch, so they're unrelated.

Caveats
- `CRG_EMBED_INCLUDE_BODY=0` reverts to metadata-only without touching existing embeddings.
- `google/gemini-embedding-001` is listed by OpenRouter but its upstream Google AI Studio provider returned no response for our API key during testing. Used `google/gemini-embedding-2-preview` instead, which is the newest working Google embedding on OpenRouter today.

Follow-up work (tracked separately, NOT in this PR)
Scope-disciplined items intentionally left out of this PR so its review surface stays bounded. Planned as atomic, independently mergeable PRs (ordered by dependency, not priority):
PR B — `install --auto-embed-hook` (in progress on my fork)
`install` currently writes MCP config + graph update/watch hooks but nothing refreshes embeddings after edits. A new `--auto-embed-hook` flag drops a small `PostToolUse(Edit|Write|MultiEdit)` hook that runs `code-review-graph embed --repo "$CLAUDE_PROJECT_DIR"` in the background (via Claude Code hook `async: true` + stdout/stderr redirect). The two-stage filter + `file_hash` check this PR adds means only 1-3 changed nodes actually hit the provider per save, keeping API cost negligible — but the user must opt in because cloud providers DO bill per token.
Handling this in `install` (not here) is the right place because (a) `install` already owns hook writing, (b) the body-embedding code in this PR stays purely library-level, and (c) it keeps this PR's review surface stable. Stacks on this PR's HEAD on my fork.
PR A — `code-review-graph config` subcommand + `.env` auto-load
Provider credentials today must be in shell env (for CLI) OR the MCP client's `.mcp.json` env block (for MCP server). There's no shared config layer, so CLI users end up juggling `export` commands or `direnv`. A `config` subcommand would fix that: three-layer resolution (env > project `.code-review-graph/config.toml` > global `~/.config/code-review-graph/config.toml` > defaults), TOML format, backwards-compatible (existing `CRG_OPENAI_*` env vars keep working, just as one layer in the stack).
Orthogonal to PR B (the hook just shells out to `code-review-graph embed`; it doesn't care where env comes from), but planned after so the feature lands on stable ground.
PR D — Adaptive body-char budget
Per-model capacity replaces the static `_BODY_MAX_CHARS_BY_PROVIDER` table once common embedding providers expose stable tokenizer APIs. Until then the env override (`CRG_EMBED_BODY_MAX_CHARS`) covers edge cases.
PR E — Framework-SDK-pattern skip (beyond log lines)
The log-only skip this PR adds helps log-heavy codebases. Discord SDK / Rails / Flask / Spring / React framework boilerplate is a separate noise class that would need either a provider-adaptive budget (PR D) or per-framework heuristics (less maintainable). Left as future work; `CRG_EMBED_INCLUDE_BODY=0` is the current escape hatch for small models on framework-heavy repos (see the TypeScript Discord-bot benchmark above).