feat(install): add opt-in --auto-embed-hook flag for auto-refresh embeddings#349
Open
Minidoracat wants to merge 13 commits intotirth8205:mainfrom
Open
feat(install): add opt-in --auto-embed-hook flag for auto-refresh embeddings#349Minidoracat wants to merge 13 commits intotirth8205:mainfrom
Minidoracat wants to merge 13 commits intotirth8205:mainfrom
Conversation
Why: signature-only embeddings carry ~10 tokens per node (name + kind +
params + return type + language). That ceiling wastes the long-context
capacity of modern embedding models — Gemini 2, Qwen3-8B and similar
rank SOTA on MTEB-code largely on their ability to understand the
*body* of a function, but we were handing them only its name. So the
quality advantage of large-context embeddings against local MiniLM was
getting flattened by our own input, not by model capability.
What: `_node_to_text()` now appends a truncated body snippet after the
signature metadata, with the signature line and leading docstrings /
header comments stripped so the budget goes to real logic instead of
documentation. Four design choices worth flagging:
1. Per-provider char budget (ADR-B4): local=700 / google=3000 /
minimax=3000 / openai=6000, env `CRG_EMBED_BODY_MAX_CHARS`
overrides. This gives MiniLM just enough to not silently truncate
and gives the long-context models room to actually use their
capacity.
2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on
`embeddings.db` persists `embed_body_enabled` per-DB. Fresh repos
auto-enable; existing DBs hold legacy behavior until the user
runs `code-review-graph embed --include-body --confirm-reembed`.
This avoids surprising cloud-provider users with a one-time
re-embed API bill on upgrade, and the sticky flag keeps new nodes
consistent with older ones across incremental runs (no mixed
body-on / body-off state).
3. Two-stage IO-bounded filter (ADR-filter-short-circuit): stored
hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the
metadata hash only (zero file IO) and skips when metadata +
file_hash are unchanged. Stage 2 reads the file only when stage
1 can't prove freshness, recomputes the body hash, persists the
combined form. Empty extracted bodies get sha256("") as a
sentinel so they also short-circuit next time; read failures
deliberately do NOT persist a sentinel so transient errors stay
retryable.
4. Language-agnostic body extraction with block-comment + triple-
quote state machine and conservative signature dedup via three-
AND gate (node.name + declaration keyword + param first-token).
Unknown languages fall back to "don't dedup" rather than risk
dropping real body. Coverage today: Python / JS / TS / JSX / TSX /
Java / Kotlin / Rust / Go; other languages keep the signature
line and just eat a bit of their body budget.
Backward compat: legacy rows (pre-existing sha256 hex in
`embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file
hash) are both parsed correctly by `_split_combined_hash` and trigger
a single safe re-embed on first contact.
Security: `_FileLineCache` resolves paths and enforces a `relative_to`
containment check against the repo root, so a poisoned
`nodes.file_path` can't read `/etc/passwd` or traverse out of the
repo into cloud embedding calls.
Verification: regression MRR@3 on a 10-query golden set (bundled mini
repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent
queries like "escape HTML entities including quotes and backticks"
or "read GEMINI_API_KEY from the environment" move the correct node
into top-3 for the first time because the body now contains the
distinguishing tokens. Tested against real OpenAI-compatible endpoint
too (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded
on first run, 0 on a repeat with sticky flag honored — filter IO
short-circuit verified end-to-end.
Tests: 59 new cases (file-line cache LRU + containment, signature
dedup with decorator / recursive / params mismatch / unknown lang,
block-comment and triple-quote state machine with JSDoc / Javadoc /
Python single-line and multi-line docstrings / long comments,
per-provider budget resolution + env override, sticky flag across
runs including CLI flip, filter short-circuit with and without
populated file_hash, read-failure must-retry regression, legacy hash
backward compat). MRR regression test skipped gracefully when
sentence-transformers isn't installed (CI matrix with `.[dev]` only
stays green; CI jobs that install `.[embeddings]` activate the gate).
Scope: all four providers (Local / Google / MiniMax / OpenAI-
compatible) benefit transparently; no schema migration on graph.db,
no change to `EmbeddingStore.__init__` signature, no change to
parser.py / migrations.py / search.py / tools.py. Adds one new
subcommand `embed` and a private `embeddings_meta` table on
`embeddings.db`.
Why: Codex round 7 caught that the legacy inference
``db_path.parent -> repo root`` silently breaks for the supported
``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir,
not inside ``<repo>/.code-review-graph/``). In that case
``_FileLineCache`` received the cache dir as repo_root and every
body-snippet read failed the containment check, making body
enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two-
stage filter then wrote metadata-only hashes, so even opting back
in via ``embed --include-body --confirm-reembed`` wouldn't help.
What: ``_repo_root_guess`` now:
1. keeps the fast path when db_path parent is literally
``.code-review-graph`` (zero-config common case, no cost)
2. otherwise delegates to ``incremental.find_project_root`` which
already honors ``CRG_REPO_ROOT`` and walks up for a ``.git``
marker — the same resolution order the rest of the CLI uses
3. returns None only when both fail, letting callers treat
absolute file_path values as self-locating
Regression test: ``TestRepoRootGuessExternalDataDir`` covers the
legacy layout, the env-override fallback, and an end-to-end read
through ``_FileLineCache`` from an external cache dir. Existing
``TestFilterStageShortCircuit`` fixtures now write their DB under
``<tmp_path>/.code-review-graph/`` so they hit the fast path
rather than the cwd-relative fallback.
Why: Codex round 8 found the round-7 fast path still misfires when ``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g. ``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or ``/shared/cache/<repo>/.code-review-graph``. In that case ``db_path.parent.name == ".code-review-graph"`` is True, the fast path returns ``parent.parent`` (``/var/tmp`` or the shared-cache parent), and ``_FileLineCache`` silently falls back to metadata for every node. My round-7 tests only covered an external dir NOT named ``.code-review-graph`` so the regression slipped. What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When that env var is set we always defer to ``incremental.find_project_root`` regardless of what ``db_path`` looks like; the zero-config fast path only runs when the user is on the default ``<repo>/.code-review-graph/graph.db`` layout and has NOT overridden the data dir. Regression test: ``test_crg_data_dir_named_dot_code_review_graph`` exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at a ``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT`` pointed at a distant ``real-repo`` tree — and asserts the root resolves to ``real-repo``, not the cache's parent.
Why: cross-repo benchmark showed body enrichment regresses on
framework-heavy codebases (humanitzbot — TypeScript Discord bot —
MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found
that signature dedup + block-comment state machine are all working,
but TS/Discord function bodies are thick with framework boilerplate
that dilutes the retrieval signal. Examples from one regressing
function:
await interaction.deferReply();
const locale = interaction.locale;
logger.info('handling power request', { signal });
console.log('...');
On small models (MiniLM: 256-token window), those lines fill the
budget with `await`, `deferReply`, `locale`, `logger`, `console`
— high-frequency framework tokens that don't discriminate between
similarly-named commands (``_power`` vs ``_status`` vs ``_console``).
What: add a language-agnostic check to the existing body-extraction
state machine that skips lines which ARE purely a log/print call
with no other executable content. Covers:
* Python: ``print(...)``, ``logger.info(...)``, ``logging.debug(...)``
* JS/TS: ``console.log(...)``, ``logger.warn(...)``
* Java: ``System.out.println(...)``, ``log.debug(...)``
* Go: ``fmt.Println(...)``, ``log.Printf(...)``
* Rust: ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)``
Lines where a log call is mixed with substantive code
(``if err: logger.error(err); raise``) are preserved — they DO
carry retrieval signal.
Scope: one regex + one helper + one extra branch in the existing
state machine loop. Stays within ADR-A1 "language-agnostic body
extraction". Does NOT introduce framework-specific heuristics —
log/print is a universal pattern, not a framework one. Also does
NOT make the body hash non-deterministic (same input still produces
same output), so the combined-hash filter short-circuit (AC-13)
still holds.
Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust
log patterns, an integration test verifying scattered log lines
get stripped from a multi-statement TS body while real logic
survives, plus negative tests for mixed-statement lines and
non-log calls (``await interaction.deferReply()`` is NOT a log
call and stays kept).
Follow-up: framework-specific SDK boilerplate (Discord
``interaction.X``, React hooks, Spring ``@Autowired``) is where
the next round of improvement lives, but that can't be done
generically without either a tokenizer-based adaptive budget or
per-language SDK awareness — both are out of scope for this PR.
Round 10 flagged two real over-matches in 7daf93c: 1. The generic ``identifier.(error|warn|info|...)`` branch swallowed non-logger APIs. In THIS repo it dropped ``serve_cmd.error(...)`` (argparse's error method, which raises rather than logs). Same class of issue for ``fmt.Fprintf(w, "ok")`` where ``w`` is an HTTP response writer — that IS the function's output, not a log. 2. ``re.match`` is prefix-anchored, so ``logger.info('x') or do_work()`` passed the log-only check even though it has substantive trailing code. Violated the "mixed- statement lines stay kept" design promise. Fixes: * Receiver identifier is now an explicit whitelist — ``log``, ``logger``, ``logging``, ``logs``, ``console``, ``slog``, ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing), ``self.log``/``this.log``/``_log``/``_logger`` and the ``self.logger``/``this.logger`` variants. Custom loggers named differently fall through — they'll be kept as real code. Better to under-skip than mis-skip. * Dropped ``fmt.Fprint*`` from the Go branch — those take an arbitrary ``io.Writer`` (commonly HTTP responses). Only ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like). * ``_is_log_only_line`` now walks the matched string after the opening ``(``, tracking string literals and paren depth, to find the balanced close. Anything non-trivial after that ``)`` means the line has other executable content and is NOT log-only. Regression tests: tail-continuation cases (``logger.info('x') or do_work()``), non-logger identifier cases (``serve_cmd.error``, ``db.warn``, ``item.notice``), and ``fmt.Fprintf`` cases now stay kept. Pure-log cases including nested-paren args (``logger.info('val=' + str(compute(x, y)))``) still get dropped.
Why: body enrichment already shows small positive Δ on PHP (Laravel
RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum
with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only
``.`` -style method access, so PHP-world logging idioms
(``$logger->info``, Laravel ``Log::info`` facade, bare
``error_log``/``syslog``/``trigger_error``) silently stayed in the
body text. Adding them is pure upside for log-heavy PHP / Laravel
code paths, no test-matrix cost anywhere else.
What this commit does:
* PHP instance logger: ``$logger->info(...)`` /
``$this->logger->debug(...)`` — covers Monolog convention.
* PHP static facade: ``Log::info(...)`` / ``\Log::error(...)`` —
Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for
fully-qualified imports).
* PHP bare calls: ``error_log(...)`` / ``syslog(...)`` /
``trigger_error(...)`` — stdlib helpers.
* PSR-3 log levels: debug / info / notice / warning / error /
critical / alert / emergency, plus the common ``log``/``warn``
aliases already used by the rest of the regex.
What this commit deliberately does NOT do: extend
``_looks_like_signature`` to PHP. I tested that locally — Laravel
RESTful controllers have very short templated bodies
(``$data = $request->validated(); Model::create($data);``) and the
signature line carries type information (``StoreRequest``,
``JsonResponse``, etc.) that the body never repeats. Stripping the
signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and
from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate
wastes budget" heuristic breaks on convention-over-configuration
PHP codebases, so PHP stays in the conservative fallback — signature
line is kept as part of the body snippet. This is documented in the
``_looks_like_signature`` docstring and locked in by
``test_signature_dedup_php_conservative``.
Tests: 10 new PHP log-pattern assertions (instance / static / bare /
negative non-logger calls like ``$user->save()`` / ``User::create()``
/ ``$response->setStatusCode()`` / ``$db->query()``) plus the
conservative-fallback regression test noted above.
Why: my first-try rollout kept PHP in the conservative fallback
based on a MiniLM-only bench that showed regressions on two
Laravel repos. Fresh A/B covering MiniLM + OpenAI 3-large +
Gemini-2-preview + Qwen3-8B across the same two repos tells a
different story — PHP dedup is net +0.110 MRR@3 across the 8
(repo × provider) cells, with the wins concentrated on the top-
tier cloud models where users most expect the quality gain:
usdt-center OpenAI 3-large off Δ=-0.111 on Δ= 0.000 ↑ +0.111
usdt-center Gemini-2-preview off Δ=+0.056 on Δ=+0.139 ↑ +0.083
six-forum Gemini-2-preview off Δ=-0.083 on Δ=-0.033 ↑ +0.050
six-forum OpenAI 3-large off Δ=+0.117 on Δ=+0.117 = 0
six-forum Qwen3-8B off Δ=+0.117 on Δ=+0.117 = 0
usdt-center Qwen3-8B off Δ=-0.111 on Δ=-0.167 ↓ -0.056 *
usdt-center MiniLM off Δ=+0.028 on Δ= 0.000 ↓ -0.028
six-forum MiniLM off Δ=+0.067 on Δ=+0.017 ↓ -0.050
* Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling),
any body token that isn't a retrieval signal is noise; this is
a ceiling-effect artefact, not a regression from dedup.
What: adds PHP back to ``_looks_like_signature`` with the same
three-AND gate every other language uses — name + declaration
keyword + param first-token match. Covers class methods (all
visibility / static / abstract / final prefixes precede
``function``), free functions, classes, interfaces, traits, and
enums. Log-skip patterns from the previous commit stay unchanged.
Mini-LM-on-Laravel regression is documented in the module
docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0``
without touching any other part of the pipeline. The sticky-flag
opt-in is exactly the shape this kind of per-model tradeoff
needs, and why we built it.
Regression test in test_signature_dedup_php now asserts dedup
fires for standard PHP signatures (method / class / interface /
trait) and still returns False for body-only lines and for
overload-style param mismatches, so a future change that breaks
one of those categories gets caught.
Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:
enum OrderStatus: string { ... # declaration → dedup
case PENDING = 'pending'; # enum value → keep as body
This locks in the behavior Codex spot-checked manually: ``case``
is deliberately NOT in the dedup keyword allowlist, so enum
values stay in the body snippet where they carry real retrieval
signal.
…anguages
Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.
What: adds two rows to the CLI reference ``<details>`` block in every
translated README (EN / zh-CN / ja-JP / ko-KR / hi-IN):
code-review-graph embed # fresh repo — body enrichment auto-on
code-review-graph embed --include-body --confirm-reembed
# existing DB — opt in to body
Native-phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.
Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.
…lpers Why: body-embedding (PR tirth8205#345) shipped a `code-review-graph embed` CLI but `install` doesn't wire up any hook to keep embeddings fresh after edits. Users who want auto-refresh have to hand-craft a PostToolUse entry in settings.json — easy to get wrong (wrong schema field, no stdio redirect, no cost-gate). This commit lands the building blocks so the next commit can expose a single `--auto-embed-hook` flag. What: - New `generate_auto_embed_hook_entry(repo_root)` in skills.py. Returns a single PostToolUse entry using the canonical Claude Code Jan-2026 schema (`async: true`, NOT the invalid `run_in_background`). Matcher covers `Edit|Write|MultiEdit` (Bash is intentionally left to the standalone update hook since Bash tools don't produce code edits). - Chains `update --skip-flows && embed` so graph.db is fresh when embed reads it; the second update call is a hash-dedup no-op. This eliminates a cross-process race that the separate update hook and this new embed hook would otherwise have on graph.db. - `CRG_EMBED_INCLUDE_BODY=0` is prefixed to each CLI call so the hook never auto-triggers PR tirth8205#345's body-mode cost bomb — users who want body-mode must run `embed --include-body --confirm-reembed` manually and accept the sticky-flag flip-flop (documented in README). - Full stdout+stderr redirection (`>/dev/null 2>&1` + `|| true`) keeps the MCP JSON-RPC stream uncorrupted even when embed prints progress or exits non-zero (e.g. missing provider). - `install_hooks(..., include_auto_embed=False)` adds the kwarg with a False default — existing callers see JSON-struct-identical output. - New `remove_auto_embed_hook_entry(...)` with defensive handling for malformed JSON / non-dict root / non-dict hooks / non-list PostToolUse (all return False without writing, no .bak on no-match — protects pristine baselines). Tests (36 new, all passing): - TestGenerateAutoEmbedHookEntry (13 cases): schema conformance, chaining order, stdio redirect count, env-var placement via regex, path-quoting, matcher content. - TestInstallHooksAutoEmbed (4 cases): entry dedup, JSON-struct deep equality vs pre-change behaviour, qoder platform parity. - TestRemoveAutoEmbedHook (9 cases): exact-dict removal, no-op-no-bak, user-custom preservation, 4 malformed-JSON defensive tests. Plan: .omc/plans/feat-install-auto-embed-hook.md (local, not committed). Stacked on PR tirth8205#345 (feat/body-embedding). Depends on `embed` CLI introduced there; if tirth8205#345 is rebased/squashed, this branch needs refresh.
Why: the hook-generation machinery from the previous commit is only useful if users can opt in from the CLI. This wires the flag through `install` and its `init` alias, surfaces pre-flight warnings for the two most common silent-failure modes (missing sentence-transformers, unsupported platform), and adds integration tests that exercise the full argparse → subprocess → settings.json path. What: - `install` and `init` both gain a mutually-exclusive flag group: `--auto-embed-hook` installs the entry; `--no-auto-embed-hook` removes an exact-match entry without touching anything else in settings.json. Default is OFF (opt-in) — passing neither flag produces the pre-change behaviour byte-for-equivalent on the wire. - `_handle_init` parses the flags into `want_auto_embed` / `want_remove_auto_embed`, emits a stderr warning when the flag is passed with a non-claude/qoder `--platform` (the hook block already skips those platforms, so the warning is the only user-facing signal), and a second stderr warning when the default local provider's `sentence_transformers` extra is missing — both warnings are non-blocking per the "never block the agent" principle. - Removal path explicitly does NOT call install_hooks to avoid re-merging the default update/SessionStart entries on top of any user customisation that happened between install and removal. Tests: - 9 new test_cli.py tests covering: flag acceptance on both subcommands, mutex-violation argparse exit 2, dry-run writes nothing, sentence-transformers find_spec monkeypatch warning, cursor-platform stderr warning, help-text contains both flags. - New tests/test_integration_install_auto_embed.py with two subprocess-driven round-trips — install-and-verify-schema plus install-then-remove — that exercise the full package entry point via `python -m code_review_graph`. Replaces the v1 plan's fragile jq-based smoke script with a CI-gated Python test. Asserts `async: true`, double stdout/stderr redirect, `|| true` tail, `CRG_EMBED_INCLUDE_BODY=0` presence, and static (not $CLAUDE_PROJECT_DIR) repo path. All 113 tests across the three touched files pass. Pre-existing 60 failures on test_parser / test_refactor / test_main (`--tools` flag) are documented in PR tirth8205#345 as unrelated upstream issues.
Why: a new CLI flag is invisible without discoverable docs. First-time users rarely find new flags from code alone — the install-command reference table and the existing body-embedding explanation are the places people actually look. Each of the 5 translated READMEs gets both touchpoints so no audience is stuck on English-only docs. What: - CLI reference table: two new lines right after the existing `embed --include-body --confirm-reembed` entry, covering install + uninstall of the auto-embed hook. Native-phrased per language. - Body-embedding explanation block: a new paragraph inside the same blockquote that already covers body-mode opt-in, documenting (a) the opt-in feature, (b) the Claude-Code-and-Qoder-only / POSIX-only scope limitation, and (c) the sticky-flag flip-flop behavior so anyone who ran `embed --include-body --confirm-reembed` before enabling the hook understands their preference will get reset on every fire. Parity: grep confirms each README mentions `--auto-embed-hook` twice (CLI row + explanation paragraph). Native phrasing preserved across EN / zh-CN / ja-JP / ko-KR / hi-IN — no English-only paragraphs sneaking into the translations. POSIX-only caveat and `install --no-auto-embed-hook` removal path are present in all five.
Why: Codex review flagged 1 BLOCKING + 2 MAJOR + 1 MINOR issue after
inspecting the three atomic commits above. This commit lands all four
fixes plus regression tests so the branch reaches NO SUBSTANTIVE ISSUES
before pushing to upstream.
BLOCKING — shell injection via repo path (skills.py:~588):
- `generate_auto_embed_hook_entry` was using `json.dumps(...)` to quote
the repo path, but json-escaping is NOT sufficient for bash double
quotes: `/tmp/repo$(touch /tmp/pwned)` would still trigger command
substitution when the hook fires. Switched to `shlex.quote(...)`
which wraps any path with shell metacharacters in single quotes
that bash does not expand. Added two regression tests asserting
that paths containing `$()` and backticks land inside single-quoted
tokens, not double-quoted ones. Kept the path-with-spaces test
updated to expect the new shlex quoting shape.
MAJOR — `--no-auto-embed-hook` had full-install side effects
(cli.py _handle_init):
- The flag was documented as "idempotent removal that preserves user
config", but the original wiring still ran `install_platform_configs`,
`.gitignore` update, `generate_skills`, `inject_claude_md`,
`install_git_hook`, etc. Short-circuited `_handle_init` at the top:
when `want_remove_auto_embed` is True we now do ONLY the removal
and return, with a `--dry-run` preview path and clean no-op messages
for `--no-hooks` and non-claude/qoder platforms. New test
`test_no_auto_embed_hook_short_circuits_install_flow` asserts no
.mcp.json / .gitignore / skills dir / CLAUDE.md / pre-commit hook
is created when only `--no-auto-embed-hook` is passed.
MAJOR — malformed / JSONC settings.json was silently overwritten
(skills.py install_hooks):
- Backup `shutil.copy2` lived INSIDE the `try: json.loads(...)` block,
so a parse failure skipped the backup and the subsequent write
destroyed the original file. Reordered so the backup always runs
first, then parse; on `JSONDecodeError` we now raise `RuntimeError`
with a message pointing the user at the fresh `.bak` and asking
them to fix or remove the comment manually, instead of silently
resetting to `{}` and writing a blank shape. New test
`test_install_aborts_on_malformed_json_with_backup` covers the
JSONC scenario Codex specifically called out.
MINOR — warnings fired in dry-run / no-hooks modes (cli.py):
- The `sentence-transformers is missing` and non-claude/qoder
platform warnings were printed even when no hook would be installed
(`--dry-run` or `--no-hooks`). Gated both warnings behind
`want_auto_embed and not skip_hooks and not dry_run` so the CLI
stays quiet in the modes where the warning isn't actionable. Two
new tests (`test_extras_warning_gated_by_skip_hooks`,
`test_extras_warning_gated_by_dry_run`) keep this regression-safe.
Tests: 120 passing (was 113). No new lint / mypy / bandit findings
on touched files beyond the pre-existing ones documented in PR tirth8205#345.
Minidoracat
added a commit
to Minidoracat/code-review-graph
that referenced
this pull request
Apr 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Opt-in
code-review-graph install --auto-embed-hookflag that writes aPostToolUsehook into.claude/settings.json(and.qoder/settings.json) so body-embedding data stays fresh between manualembedruns. Counterpart--no-auto-embed-hookperforms a pure removal without touching any other install state. Default is OFF — passing no flag keeps today's behaviour JSON-struct-identical.Stacked on #345 — do NOT merge until #345 lands
This PR depends on the
code-review-graph embedCLI subcommand introduced by PR #345 (feat/body-embedding). Branch base isorigin/feat/body-embeddingon my fork. Once #345 merges intomain, I'll rebase onto upstreammainand re-run CI.Why
#345 ships the
embedCLI butinstalldoesn't wire up any hook to refresh embeddings after edits. Users had to hand-craft a PostToolUse entry themselves — easy to get wrong (invalid schema fields, stdout leaks into MCP stdio, no cost gate against cloud providers). This PR bakes the correct shape intoinstall.Follow-up #1 in PR #345's description is exactly this feature.
ADR highlights (full decision table in the plan)
--no-skills/--no-hooks/--no-instructionsflag family.async: true(Claude Code Jan-2026 schema), NOTrun_in_background(which does not exist in the schema).shell: \"bash\"implied; PowerShell deferred. Docs: https://code.claude.com/docs/en/hooksEdit|Write|MultiEdit. Intentionally diverges from the existing update hook'sEdit|Write|Bash— Bash tool calls don't touch code, and the chained command re-runsupdateanyway.update --skip-flows && embedin sequence. Secondupdateis a hash-dedup no-op on unchanged files but eliminates a cross-process race ongraph.dbbetween the standalone update hook and the new embed hook.install_hooks). Other platforms trigger a clear stderr warning and skip.CRG_EMBED_INCLUDE_BODY=0is prefixed to each CLI invocation so the hook NEVER auto-triggers body-mode re-embeds. Body-mode users must opt in manually viaembed --include-body --confirm-reembed; the hook resets the sticky flag back to0on each fire (documented in all 5 READMEs as intentional friction).>/dev/null 2>&1on every CLI invocation + trailing|| true. Protects the JSON-RPC stream from corrupting stdout/stderr leakage.--no-auto-embed-hookshort-circuits to a pure removal path. Does NOT re-runinstall_platform_configs/.gitignore/ skill generation / instruction injection / git-hook install. Preserves any user customisation of other PostToolUse entries.Shell-quoting security
The hook command embeds the repo path. Initial draft used
json.dumps(path)for quoting, but bash still expands\$()and backticks inside double-quoted strings. After Codex round 1 flagged this as BLOCKING, switched toshlex.quote(...)which wraps paths containing shell metacharacters in single quotes that bash does not expand. Regression tests cover\$(...)and backtick paths.Files
code_review_graph/skills.py: newgenerate_auto_embed_hook_entry(repo_root), newremove_auto_embed_hook_entry(repo_root, platform), modifiedinstall_hooks(repo_root, platform, include_auto_embed=False). Malformed-JSON handling now backs up before parsing and raisesRuntimeErrorinstead of silently overwriting.code_review_graph/cli.py: mutex flag group oninstallandinit; new_handle_initpre-flight warnings (gated onwant_auto_embed and not skip_hooks and not dry_run); early return for the removal path.tests/test_skills.py: 3 new test classes (TestGenerateAutoEmbedHookEntry,TestInstallHooksAutoEmbed,TestRemoveAutoEmbedHook) covering schema conformance, chaining order, stdio redirects, env-var placement via regex, path-quoting (including\$()and backticks), exact-dict dedup, 4 malformed-JSON defensive paths, and JSON-struct deep equality with pre-change behaviour.tests/test_cli.py: 12 new CLI tests covering argparse, dry-run, warnings (extras missing + unsupported platform), short-circuit semantics for--no-auto-embed-hook, and warning-gating regressions from Codex round 1.tests/test_integration_install_auto_embed.py: new file, 2 subprocess integration tests for full install and install-then-remove round-trip.Known limitations
&&,|| true, and>/dev/null 2>&1which work in bash / zsh / sh and via WSL or git-bash on Windows. Native Windows PowerShell is deferred to a follow-up; the schema supports\"shell\": \"powershell\"but the command string would need translation.code-review-graphcommand, inheriting the PATH resolution precedent of the existing update hook. If Claude Code forks the hook without~/.local/bin(pipx) or the uv tool path, the hook will silently no-op — the same failure mode as the existing update hook. Can be addressed via_detect_serve_command()-style resolution in a follow-up.embed --include-body --confirm-reembedwill have that sticky flag reset to OFF on every hook fire. Documented in all 5 READMEs. Re-opt-in manually when body-mode is desired.graph.db; WAL + 30s SQLite timeout absorb it but may produce noticeable slowdown during bulk refactors. Debouncing deferred to v1.1.Test plan
uv run pytest tests/test_skills.py tests/test_cli.py tests/test_integration_install_auto_embed.py -v— 120 passinguv run ruff check code_review_graph/— no new errors on my diff (12 pre-existing lint issues on base branch, unchanged)uv run mypy code_review_graph/skills.py code_review_graph/cli.py --ignore-missing-imports --no-strict-optional— clean on my filesuv run bandit -r code_review_graph/skills.py code_review_graph/cli.py— 0 high-severity issues/tmp/sandbox-repo: install → build → fire hook → verify embeddings row ingraph.db→--no-auto-embed-hookround-trip → verify update entry survives and embed entry goneFollow-ups (explicitly NOT in this PR, tracked for later)
code-review-graph enable-body-embed --provider X --model Ythat writes env into.mcp.json, flips the sticky flag, runs the firstembed --include-body --confirm-reembed, and installs this hook — end-to-end in one command (from feat(embeddings): include function body in semantic search text #345's follow-up list)._detect_serve_commandwithinstall_platform_configs).🤖 Generated with Claude Code