Skip to content

feat(install): add opt-in --auto-embed-hook flag for auto-refresh embeddings#349

Open
Minidoracat wants to merge 13 commits intotirth8205:mainfrom
Minidoracat:feat/install-auto-embed-hook
Open

feat(install): add opt-in --auto-embed-hook flag for auto-refresh embeddings#349
Minidoracat wants to merge 13 commits intotirth8205:mainfrom
Minidoracat:feat/install-auto-embed-hook

Conversation

@Minidoracat
Copy link
Copy Markdown
Contributor

Summary

Opt-in code-review-graph install --auto-embed-hook flag that writes a PostToolUse hook into .claude/settings.json (and .qoder/settings.json) so body-embedding data stays fresh between manual embed runs. Counterpart --no-auto-embed-hook performs a pure removal without touching any other install state. Default is OFF — passing no flag keeps today's behaviour JSON-struct-identical.

Stacked on #345 — do NOT merge until #345 lands

This PR depends on the code-review-graph embed CLI subcommand introduced by PR #345 (feat/body-embedding). Branch base is origin/feat/body-embedding on my fork. Once #345 merges into main, I'll rebase onto upstream main and re-run CI.

Why

#345 ships the embed CLI but install doesn't wire up any hook to refresh embeddings after edits. Users had to hand-craft a PostToolUse entry themselves — easy to get wrong (invalid schema fields, stdout leaks into MCP stdio, no cost gate against cloud providers). This PR bakes the correct shape into install.

Follow-up #1 in PR #345's description is exactly this feature.

ADR highlights (full decision table in the plan)

  1. Opt-in via explicit flag, not env var / interactive prompt — matches the existing --no-skills / --no-hooks / --no-instructions flag family.
  2. Hook schema: async: true (Claude Code Jan-2026 schema), NOT run_in_background (which does not exist in the schema). shell: \"bash\" implied; PowerShell deferred. Docs: https://code.claude.com/docs/en/hooks
  3. Matcher: Edit|Write|MultiEdit. Intentionally diverges from the existing update hook's Edit|Write|Bash — Bash tool calls don't touch code, and the chained command re-runs update anyway.
  4. Chaining: one command runs update --skip-flows && embed in sequence. Second update is a hash-dedup no-op on unchanged files but eliminates a cross-process race on graph.db between the standalone update hook and the new embed hook.
  5. Platform scope: claude and qoder only (same as existing install_hooks). Other platforms trigger a clear stderr warning and skip.
  6. Cost safety: CRG_EMBED_INCLUDE_BODY=0 is prefixed to each CLI invocation so the hook NEVER auto-triggers body-mode re-embeds. Body-mode users must opt in manually via embed --include-body --confirm-reembed; the hook resets the sticky flag back to 0 on each fire (documented in all 5 READMEs as intentional friction).
  7. MCP stdio safety: >/dev/null 2>&1 on every CLI invocation + trailing || true. Protects the JSON-RPC stream from corrupting stdout/stderr leakage.
  8. Opt-out: --no-auto-embed-hook short-circuits to a pure removal path. Does NOT re-run install_platform_configs / .gitignore / skill generation / instruction injection / git-hook install. Preserves any user customisation of other PostToolUse entries.

Shell-quoting security

The hook command embeds the repo path. Initial draft used json.dumps(path) for quoting, but bash still expands \$() and backticks inside double-quoted strings. After Codex round 1 flagged this as BLOCKING, switched to shlex.quote(...) which wraps paths containing shell metacharacters in single quotes that bash does not expand. Regression tests cover \$(...) and backtick paths.

Files

  • code_review_graph/skills.py: new generate_auto_embed_hook_entry(repo_root), new remove_auto_embed_hook_entry(repo_root, platform), modified install_hooks(repo_root, platform, include_auto_embed=False). Malformed-JSON handling now backs up before parsing and raises RuntimeError instead of silently overwriting.
  • code_review_graph/cli.py: mutex flag group on install and init; new _handle_init pre-flight warnings (gated on want_auto_embed and not skip_hooks and not dry_run); early return for the removal path.
  • tests/test_skills.py: 3 new test classes (TestGenerateAutoEmbedHookEntry, TestInstallHooksAutoEmbed, TestRemoveAutoEmbedHook) covering schema conformance, chaining order, stdio redirects, env-var placement via regex, path-quoting (including \$() and backticks), exact-dict dedup, 4 malformed-JSON defensive paths, and JSON-struct deep equality with pre-change behaviour.
  • tests/test_cli.py: 12 new CLI tests covering argparse, dry-run, warnings (extras missing + unsupported platform), short-circuit semantics for --no-auto-embed-hook, and warning-gating regressions from Codex round 1.
  • tests/test_integration_install_auto_embed.py: new file, 2 subprocess integration tests for full install and install-then-remove round-trip.
  • 5 README files (EN / zh-CN / ja-JP / ko-KR / hi-IN): CLI reference rows + feature/limitations/sticky-flag-flip-flop paragraph with native phrasing.

Known limitations

  • POSIX shells only. The hook command uses &&, || true, and >/dev/null 2>&1 which work in bash / zsh / sh and via WSL or git-bash on Windows. Native Windows PowerShell is deferred to a follow-up; the schema supports \"shell\": \"powershell\" but the command string would need translation.
  • PATH inheritance. The hook uses the bare code-review-graph command, inheriting the PATH resolution precedent of the existing update hook. If Claude Code forks the hook without ~/.local/bin (pipx) or the uv tool path, the hook will silently no-op — the same failure mode as the existing update hook. Can be addressed via _detect_serve_command()-style resolution in a follow-up.
  • Sticky-flag flip-flop. Users who opted into body-mode embedding via embed --include-body --confirm-reembed will have that sticky flag reset to OFF on every hook fire. Documented in all 5 READMEs. Re-opt-in manually when body-mode is desired.
  • Burst-edit lock contention. 20+ rapid edits create 2× as many update spawns on graph.db; WAL + 30s SQLite timeout absorb it but may produce noticeable slowdown during bulk refactors. Debouncing deferred to v1.1.

Test plan

  • uv run pytest tests/test_skills.py tests/test_cli.py tests/test_integration_install_auto_embed.py -v — 120 passing
  • uv run ruff check code_review_graph/ — no new errors on my diff (12 pre-existing lint issues on base branch, unchanged)
  • uv run mypy code_review_graph/skills.py code_review_graph/cli.py --ignore-missing-imports --no-strict-optional — clean on my files
  • uv run bandit -r code_review_graph/skills.py code_review_graph/cli.py — 0 high-severity issues
  • Real e2e in /tmp/sandbox-repo: install → build → fire hook → verify embeddings row in graph.db--no-auto-embed-hook round-trip → verify update entry survives and embed entry gone
  • Codex adversarial review 2 rounds → NO SUBSTANTIVE ISSUES on round 2
  • CI matrix (3.10 / 3.11 / 3.12 / 3.13) — pending PR runner

Follow-ups (explicitly NOT in this PR, tracked for later)

  1. One-shot code-review-graph enable-body-embed --provider X --model Y that writes env into .mcp.json, flips the sticky flag, runs the first embed --include-body --confirm-reembed, and installs this hook — end-to-end in one command (from feat(embeddings): include function body in semantic search text #345's follow-up list).
  2. Adaptive body-char budget (from feat(embeddings): include function body in semantic search text #345's follow-up list).
  3. Windows native PowerShell branch of the hook command.
  4. Absolute-PATH resolution for the hook command (share _detect_serve_command with install_platform_configs).

🤖 Generated with Claude Code

Why: signature-only embeddings carry ~10 tokens per node (name + kind +
params + return type + language). That ceiling wastes the long-context
capacity of modern embedding models — Gemini 2, Qwen3-8B and similar
rank SOTA on MTEB-code largely on their ability to understand the
*body* of a function, but we were handing them only its name. So the
quality advantage of large-context embeddings against local MiniLM was
getting flattened by our own input, not by model capability.

What: `_node_to_text()` now appends a truncated body snippet after the
signature metadata, with the signature line and leading docstrings /
header comments stripped so the budget goes to real logic instead of
documentation. Four design choices worth flagging:

  1. Per-provider char budget (ADR-B4): local=700 / google=3000 /
     minimax=3000 / openai=6000, env `CRG_EMBED_BODY_MAX_CHARS`
     overrides. This gives MiniLM just enough to not silently truncate
     and gives the long-context models room to actually use their
     capacity.

  2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on
     `embeddings.db` persists `embed_body_enabled` per-DB. Fresh repos
     auto-enable; existing DBs hold legacy behavior until the user
     runs `code-review-graph embed --include-body --confirm-reembed`.
     This avoids surprising cloud-provider users with a one-time
     re-embed API bill on upgrade, and the sticky flag keeps new nodes
     consistent with older ones across incremental runs (no mixed
     body-on / body-off state).

  3. Two-stage IO-bounded filter (ADR-filter-short-circuit): stored
     hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the
     metadata hash only (zero file IO) and skips when metadata +
     file_hash are unchanged. Stage 2 reads the file only when stage
     1 can't prove freshness, recomputes the body hash, persists the
     combined form. Empty extracted bodies get sha256("") as a
     sentinel so they also short-circuit next time; read failures
     deliberately do NOT persist a sentinel so transient errors stay
     retryable.

  4. Language-agnostic body extraction with block-comment + triple-
     quote state machine and conservative signature dedup via three-
     AND gate (node.name + declaration keyword + param first-token).
     Unknown languages fall back to "don't dedup" rather than risk
     dropping real body. Coverage today: Python / JS / TS / JSX / TSX /
     Java / Kotlin / Rust / Go; other languages keep the signature
     line and just eat a bit of their body budget.

Backward compat: legacy rows (pre-existing sha256 hex in
`embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file
hash) are both parsed correctly by `_split_combined_hash` and trigger
a single safe re-embed on first contact.

Security: `_FileLineCache` resolves paths and enforces a `relative_to`
containment check against the repo root, so a poisoned
`nodes.file_path` can't read `/etc/passwd` or traverse out of the
repo into cloud embedding calls.

Verification: regression MRR@3 on a 10-query golden set (bundled mini
repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent
queries like "escape HTML entities including quotes and backticks"
or "read GEMINI_API_KEY from the environment" move the correct node
into top-3 for the first time because the body now contains the
distinguishing tokens. Tested against real OpenAI-compatible endpoint
too (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded
on first run, 0 on a repeat with sticky flag honored — filter IO
short-circuit verified end-to-end.

Tests: 59 new cases (file-line cache LRU + containment, signature
dedup with decorator / recursive / params mismatch / unknown lang,
block-comment and triple-quote state machine with JSDoc / Javadoc /
Python single-line and multi-line docstrings / long comments,
per-provider budget resolution + env override, sticky flag across
runs including CLI flip, filter short-circuit with and without
populated file_hash, read-failure must-retry regression, legacy hash
backward compat). MRR regression test skipped gracefully when
sentence-transformers isn't installed (CI matrix with `.[dev]` only
stays green; CI jobs that install `.[embeddings]` activate the gate).

Scope: all four providers (Local / Google / MiniMax / OpenAI-
compatible) benefit transparently; no schema migration on graph.db,
no change to `EmbeddingStore.__init__` signature, no change to
parser.py / migrations.py / search.py / tools.py. Adds one new
subcommand `embed` and a private `embeddings_meta` table on
`embeddings.db`.
Why: Codex round 7 caught that the legacy inference
``db_path.parent -> repo root`` silently breaks for the supported
``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir,
not inside ``<repo>/.code-review-graph/``). In that case
``_FileLineCache`` received the cache dir as repo_root and every
body-snippet read failed the containment check, making body
enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two-
stage filter then wrote metadata-only hashes, so even opting back
in via ``embed --include-body --confirm-reembed`` wouldn't help.

What: ``_repo_root_guess`` now:
  1. keeps the fast path when db_path parent is literally
     ``.code-review-graph`` (zero-config common case, no cost)
  2. otherwise delegates to ``incremental.find_project_root`` which
     already honors ``CRG_REPO_ROOT`` and walks up for a ``.git``
     marker — the same resolution order the rest of the CLI uses
  3. returns None only when both fail, letting callers treat
     absolute file_path values as self-locating

Regression test: ``TestRepoRootGuessExternalDataDir`` covers the
legacy layout, the env-override fallback, and an end-to-end read
through ``_FileLineCache`` from an external cache dir. Existing
``TestFilterStageShortCircuit`` fixtures now write their DB under
``<tmp_path>/.code-review-graph/`` so they hit the fast path
rather than the cwd-relative fallback.
Why: Codex round 8 found the round-7 fast path still misfires when
``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g.
``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or
``/shared/cache/<repo>/.code-review-graph``. In that case
``db_path.parent.name == ".code-review-graph"`` is True, the fast
path returns ``parent.parent`` (``/var/tmp`` or the shared-cache
parent), and ``_FileLineCache`` silently falls back to metadata
for every node. My round-7 tests only covered an external dir
NOT named ``.code-review-graph`` so the regression slipped.

What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When
that env var is set we always defer to
``incremental.find_project_root`` regardless of what ``db_path``
looks like; the zero-config fast path only runs when the user is
on the default ``<repo>/.code-review-graph/graph.db`` layout and
has NOT overridden the data dir.

Regression test: ``test_crg_data_dir_named_dot_code_review_graph``
exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at
a ``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT``
pointed at a distant ``real-repo`` tree — and asserts the root
resolves to ``real-repo``, not the cache's parent.
Why: cross-repo benchmark showed body enrichment regresses on
framework-heavy codebases (humanitzbot — TypeScript Discord bot —
MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found
that signature dedup + block-comment state machine are all working,
but TS/Discord function bodies are thick with framework boilerplate
that dilutes the retrieval signal. Examples from one regressing
function:

    await interaction.deferReply();
    const locale = interaction.locale;
    logger.info('handling power request', { signal });
    console.log('...');

On small models (MiniLM: 256-token window), those lines fill the
budget with `await`, `deferReply`, `locale`, `logger`, `console`
— high-frequency framework tokens that don't discriminate between
similarly-named commands (``_power`` vs ``_status`` vs ``_console``).

What: add a language-agnostic check to the existing body-extraction
state machine that skips lines which ARE purely a log/print call
with no other executable content. Covers:

  * Python:  ``print(...)``, ``logger.info(...)``, ``logging.debug(...)``
  * JS/TS:   ``console.log(...)``, ``logger.warn(...)``
  * Java:    ``System.out.println(...)``, ``log.debug(...)``
  * Go:      ``fmt.Println(...)``, ``log.Printf(...)``
  * Rust:    ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)``

Lines where a log call is mixed with substantive code
(``if err: logger.error(err); raise``) are preserved — they DO
carry retrieval signal.

Scope: one regex + one helper + one extra branch in the existing
state machine loop. Stays within ADR-A1 "language-agnostic body
extraction". Does NOT introduce framework-specific heuristics —
log/print is a universal pattern, not a framework one. Also does
NOT make the body hash non-deterministic (same input still produces
same output), so the combined-hash filter short-circuit (AC-13)
still holds.

Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust
log patterns, an integration test verifying scattered log lines
get stripped from a multi-statement TS body while real logic
survives, plus negative tests for mixed-statement lines and
non-log calls (``await interaction.deferReply()`` is NOT a log
call and stays kept).

Follow-up: framework-specific SDK boilerplate (Discord
``interaction.X``, React hooks, Spring ``@Autowired``) is where
the next round of improvement lives, but that can't be done
generically without either a tokenizer-based adaptive budget or
per-language SDK awareness — both are out of scope for this PR.
Round 10 flagged two real over-matches in 7daf93c:

1. The generic ``identifier.(error|warn|info|...)`` branch swallowed
   non-logger APIs. In THIS repo it dropped
   ``serve_cmd.error(...)`` (argparse's error method, which raises
   rather than logs). Same class of issue for
   ``fmt.Fprintf(w, "ok")`` where ``w`` is an HTTP response writer
   — that IS the function's output, not a log.

2. ``re.match`` is prefix-anchored, so
   ``logger.info('x') or do_work()`` passed the log-only check even
   though it has substantive trailing code. Violated the "mixed-
   statement lines stay kept" design promise.

Fixes:

* Receiver identifier is now an explicit whitelist — ``log``,
  ``logger``, ``logging``, ``logs``, ``console``, ``slog``,
  ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing),
  ``self.log``/``this.log``/``_log``/``_logger`` and the
  ``self.logger``/``this.logger`` variants. Custom loggers named
  differently fall through — they'll be kept as real code. Better
  to under-skip than mis-skip.
* Dropped ``fmt.Fprint*`` from the Go branch — those take an
  arbitrary ``io.Writer`` (commonly HTTP responses). Only
  ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like).
* ``_is_log_only_line`` now walks the matched string after the
  opening ``(``, tracking string literals and paren depth, to find
  the balanced close. Anything non-trivial after that ``)`` means
  the line has other executable content and is NOT log-only.

Regression tests: tail-continuation cases
(``logger.info('x') or do_work()``), non-logger identifier cases
(``serve_cmd.error``, ``db.warn``, ``item.notice``), and
``fmt.Fprintf`` cases now stay kept. Pure-log cases including
nested-paren args (``logger.info('val=' + str(compute(x, y)))``)
still get dropped.
Why: body enrichment already shows small positive Δ on PHP (Laravel
RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum
with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only
``.`` -style method access, so PHP-world logging idioms
(``$logger->info``, Laravel ``Log::info`` facade, bare
``error_log``/``syslog``/``trigger_error``) silently stayed in the
body text. Adding them is pure upside for log-heavy PHP / Laravel
code paths, no test-matrix cost anywhere else.

What this commit does:
  * PHP instance logger:  ``$logger->info(...)`` /
    ``$this->logger->debug(...)`` — covers Monolog convention.
  * PHP static facade:    ``Log::info(...)`` / ``\Log::error(...)`` —
    Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for
    fully-qualified imports).
  * PHP bare calls:       ``error_log(...)`` / ``syslog(...)`` /
    ``trigger_error(...)`` — stdlib helpers.
  * PSR-3 log levels:     debug / info / notice / warning / error /
    critical / alert / emergency, plus the common ``log``/``warn``
    aliases already used by the rest of the regex.

What this commit deliberately does NOT do: extend
``_looks_like_signature`` to PHP. I tested that locally — Laravel
RESTful controllers have very short templated bodies
(``$data = $request->validated(); Model::create($data);``) and the
signature line carries type information (``StoreRequest``,
``JsonResponse``, etc.) that the body never repeats. Stripping the
signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and
from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate
wastes budget" heuristic breaks on convention-over-configuration
PHP codebases, so PHP stays in the conservative fallback — signature
line is kept as part of the body snippet. This is documented in the
``_looks_like_signature`` docstring and locked in by
``test_signature_dedup_php_conservative``.

Tests: 10 new PHP log-pattern assertions (instance / static / bare /
negative non-logger calls like ``$user->save()`` / ``User::create()``
/ ``$response->setStatusCode()`` / ``$db->query()``) plus the
conservative-fallback regression test noted above.
Why: my first-try rollout kept PHP in the conservative fallback
based on a MiniLM-only bench that showed regressions on two
Laravel repos. Fresh A/B covering MiniLM + OpenAI 3-large +
Gemini-2-preview + Qwen3-8B across the same two repos tells a
different story — PHP dedup is net +0.110 MRR@3 across the 8
(repo × provider) cells, with the wins concentrated on the top-
tier cloud models where users most expect the quality gain:

  usdt-center  OpenAI 3-large    off Δ=-0.111  on Δ= 0.000  ↑ +0.111
  usdt-center  Gemini-2-preview  off Δ=+0.056  on Δ=+0.139  ↑ +0.083
  six-forum    Gemini-2-preview  off Δ=-0.083  on Δ=-0.033  ↑ +0.050
  six-forum    OpenAI 3-large    off Δ=+0.117  on Δ=+0.117  = 0
  six-forum    Qwen3-8B          off Δ=+0.117  on Δ=+0.117  = 0
  usdt-center  Qwen3-8B          off Δ=-0.111  on Δ=-0.167  ↓ -0.056 *
  usdt-center  MiniLM            off Δ=+0.028  on Δ= 0.000  ↓ -0.028
  six-forum    MiniLM            off Δ=+0.067  on Δ=+0.017  ↓ -0.050

  * Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling),
    any body token that isn't a retrieval signal is noise; this is
    a ceiling-effect artefact, not a regression from dedup.

What: adds PHP back to ``_looks_like_signature`` with the same
three-AND gate every other language uses — name + declaration
keyword + param first-token match. Covers class methods (all
visibility / static / abstract / final prefixes precede
``function``), free functions, classes, interfaces, traits, and
enums. Log-skip patterns from the previous commit stay unchanged.

Mini-LM-on-Laravel regression is documented in the module
docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0``
without touching any other part of the pipeline. The sticky-flag
opt-in is exactly the shape this kind of per-model tradeoff
needs, and why we built it.

Regression test in test_signature_dedup_php now asserts dedup
fires for standard PHP signatures (method / class / interface /
trait) and still returns False for body-only lines and for
overload-style param mismatches, so a future change that breaks
one of those categories gets caught.
Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:

  enum OrderStatus: string { ...     # declaration → dedup
  case PENDING = 'pending';          # enum value  → keep as body

This locks in the behavior Codex spot-checked manually: ``case``
is deliberately NOT in the dedup keyword allowlist, so enum
values stay in the body snippet where they carry real retrieval
signal.
…anguages

Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.

What: adds two rows to the CLI reference ``<details>`` block in every
translated README (EN / zh-CN / ja-JP / ko-KR / hi-IN):

  code-review-graph embed            # fresh repo — body enrichment auto-on
  code-review-graph embed --include-body --confirm-reembed
                                     # existing DB — opt in to body

Native-phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.

Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.
…lpers

Why: body-embedding (PR tirth8205#345) shipped a `code-review-graph embed` CLI
but `install` doesn't wire up any hook to keep embeddings fresh after
edits. Users who want auto-refresh have to hand-craft a PostToolUse
entry in settings.json — easy to get wrong (wrong schema field, no
stdio redirect, no cost-gate). This commit lands the building blocks
so the next commit can expose a single `--auto-embed-hook` flag.

What:
- New `generate_auto_embed_hook_entry(repo_root)` in skills.py. Returns
  a single PostToolUse entry using the canonical Claude Code Jan-2026
  schema (`async: true`, NOT the invalid `run_in_background`). Matcher
  covers `Edit|Write|MultiEdit` (Bash is intentionally left to the
  standalone update hook since Bash tools don't produce code edits).
- Chains `update --skip-flows && embed` so graph.db is fresh when
  embed reads it; the second update call is a hash-dedup no-op. This
  eliminates a cross-process race that the separate update hook and
  this new embed hook would otherwise have on graph.db.
- `CRG_EMBED_INCLUDE_BODY=0` is prefixed to each CLI call so the hook
  never auto-triggers PR tirth8205#345's body-mode cost bomb — users who want
  body-mode must run `embed --include-body --confirm-reembed` manually
  and accept the sticky-flag flip-flop (documented in README).
- Full stdout+stderr redirection (`>/dev/null 2>&1` + `|| true`) keeps
  the MCP JSON-RPC stream uncorrupted even when embed prints progress
  or exits non-zero (e.g. missing provider).
- `install_hooks(..., include_auto_embed=False)` adds the kwarg with
  a False default — existing callers see JSON-struct-identical output.
- New `remove_auto_embed_hook_entry(...)` with defensive handling for
  malformed JSON / non-dict root / non-dict hooks / non-list PostToolUse
  (all return False without writing, no .bak on no-match — protects
  pristine baselines).

Tests (36 new, all passing):
- TestGenerateAutoEmbedHookEntry (13 cases): schema conformance,
  chaining order, stdio redirect count, env-var placement via regex,
  path-quoting, matcher content.
- TestInstallHooksAutoEmbed (4 cases): entry dedup, JSON-struct deep
  equality vs pre-change behaviour, qoder platform parity.
- TestRemoveAutoEmbedHook (9 cases): exact-dict removal, no-op-no-bak,
  user-custom preservation, 4 malformed-JSON defensive tests.

Plan: .omc/plans/feat-install-auto-embed-hook.md (local, not committed).
Stacked on PR tirth8205#345 (feat/body-embedding). Depends on `embed` CLI
introduced there; if tirth8205#345 is rebased/squashed, this branch needs
refresh.
Why: the hook-generation machinery from the previous commit is only
useful if users can opt in from the CLI. This wires the flag through
`install` and its `init` alias, surfaces pre-flight warnings for the
two most common silent-failure modes (missing sentence-transformers,
unsupported platform), and adds integration tests that exercise the
full argparse → subprocess → settings.json path.

What:
- `install` and `init` both gain a mutually-exclusive flag group:
  `--auto-embed-hook` installs the entry; `--no-auto-embed-hook`
  removes an exact-match entry without touching anything else in
  settings.json. Default is OFF (opt-in) — passing neither flag
  produces the pre-change behaviour byte-for-equivalent on the wire.
- `_handle_init` parses the flags into `want_auto_embed` /
  `want_remove_auto_embed`, emits a stderr warning when the flag is
  passed with a non-claude/qoder `--platform` (the hook block already
  skips those platforms, so the warning is the only user-facing
  signal), and a second stderr warning when the default local
  provider's `sentence_transformers` extra is missing — both warnings
  are non-blocking per the "never block the agent" principle.
- Removal path explicitly does NOT call install_hooks to avoid
  re-merging the default update/SessionStart entries on top of any
  user customisation that happened between install and removal.

Tests:
- 9 new test_cli.py tests covering: flag acceptance on both
  subcommands, mutex-violation argparse exit 2, dry-run writes
  nothing, sentence-transformers find_spec monkeypatch warning,
  cursor-platform stderr warning, help-text contains both flags.
- New tests/test_integration_install_auto_embed.py with two
  subprocess-driven round-trips — install-and-verify-schema plus
  install-then-remove — that exercise the full package entry point
  via `python -m code_review_graph`. Replaces the v1 plan's fragile
  jq-based smoke script with a CI-gated Python test. Asserts
  `async: true`, double stdout/stderr redirect, `|| true` tail,
  `CRG_EMBED_INCLUDE_BODY=0` presence, and static (not
  $CLAUDE_PROJECT_DIR) repo path.

All 113 tests across the three touched files pass. Pre-existing 60
failures on test_parser / test_refactor / test_main (`--tools` flag)
are documented in PR tirth8205#345 as unrelated upstream issues.
Why: a new CLI flag is invisible without discoverable docs. First-time
users rarely find new flags from code alone — the install-command
reference table and the existing body-embedding explanation are the
places people actually look. Each of the 5 translated READMEs gets
both touchpoints so no audience is stuck on English-only docs.

What:
- CLI reference table: two new lines right after the existing
  `embed --include-body --confirm-reembed` entry, covering install +
  uninstall of the auto-embed hook. Native-phrased per language.
- Body-embedding explanation block: a new paragraph inside the same
  blockquote that already covers body-mode opt-in, documenting (a)
  the opt-in feature, (b) the Claude-Code-and-Qoder-only / POSIX-only
  scope limitation, and (c) the sticky-flag flip-flop behavior so
  anyone who ran `embed --include-body --confirm-reembed` before
  enabling the hook understands their preference will get reset on
  every fire.

Parity: grep confirms each README mentions `--auto-embed-hook` twice
(CLI row + explanation paragraph). Native phrasing preserved across
EN / zh-CN / ja-JP / ko-KR / hi-IN — no English-only paragraphs
sneaking into the translations. POSIX-only caveat and
`install --no-auto-embed-hook` removal path are present in all five.
Why: Codex review flagged 1 BLOCKING + 2 MAJOR + 1 MINOR issue after
inspecting the three atomic commits above. This commit lands all four
fixes plus regression tests so the branch reaches NO SUBSTANTIVE ISSUES
before pushing to upstream.

BLOCKING — shell injection via repo path (skills.py:~588):
- `generate_auto_embed_hook_entry` was using `json.dumps(...)` to quote
  the repo path, but json-escaping is NOT sufficient for bash double
  quotes: `/tmp/repo$(touch /tmp/pwned)` would still trigger command
  substitution when the hook fires. Switched to `shlex.quote(...)`
  which wraps any path with shell metacharacters in single quotes
  that bash does not expand. Added two regression tests asserting
  that paths containing `$()` and backticks land inside single-quoted
  tokens, not double-quoted ones. Kept the path-with-spaces test
  updated to expect the new shlex quoting shape.

MAJOR — `--no-auto-embed-hook` had full-install side effects
(cli.py _handle_init):
- The flag was documented as "idempotent removal that preserves user
  config", but the original wiring still ran `install_platform_configs`,
  `.gitignore` update, `generate_skills`, `inject_claude_md`,
  `install_git_hook`, etc. Short-circuited `_handle_init` at the top:
  when `want_remove_auto_embed` is True we now do ONLY the removal
  and return, with a `--dry-run` preview path and clean no-op messages
  for `--no-hooks` and non-claude/qoder platforms. New test
  `test_no_auto_embed_hook_short_circuits_install_flow` asserts no
  .mcp.json / .gitignore / skills dir / CLAUDE.md / pre-commit hook
  is created when only `--no-auto-embed-hook` is passed.

MAJOR — malformed / JSONC settings.json was silently overwritten
(skills.py install_hooks):
- Backup `shutil.copy2` lived INSIDE the `try: json.loads(...)` block,
  so a parse failure skipped the backup and the subsequent write
  destroyed the original file. Reordered so the backup always runs
  first, then parse; on `JSONDecodeError` we now raise `RuntimeError`
  with a message pointing the user at the fresh `.bak` and asking
  them to fix or remove the comment manually, instead of silently
  resetting to `{}` and writing a blank shape. New test
  `test_install_aborts_on_malformed_json_with_backup` covers the
  JSONC scenario Codex specifically called out.

MINOR — warnings fired in dry-run / no-hooks modes (cli.py):
- The `sentence-transformers is missing` and non-claude/qoder
  platform warnings were printed even when no hook would be installed
  (`--dry-run` or `--no-hooks`). Gated both warnings behind
  `want_auto_embed and not skip_hooks and not dry_run` so the CLI
  stays quiet in the modes where the warning isn't actionable. Two
  new tests (`test_extras_warning_gated_by_skip_hooks`,
  `test_extras_warning_gated_by_dry_run`) keep this regression-safe.

Tests: 120 passing (was 113). No new lint / mypy / bandit findings
on touched files beyond the pre-existing ones documented in PR tirth8205#345.
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant