From 17898597cc9ed8923dd3d9dc6ef47a58ce349bea Mon Sep 17 00:00:00 2001
From: Minidoracat
Date: Sun, 19 Apr 2026 06:59:41 +0800
Subject: [PATCH 1/9] feat(embeddings): include function body in semantic search text
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Why: signature-only embeddings carry ~10 tokens per node (name + kind + params + return type + language). That ceiling wastes the long-context capacity of modern embedding models — Gemini 2, Qwen3-8B, and similar rank SOTA on MTEB-code largely on their ability to understand the *body* of a function, but we were handing them only its name. So the quality advantage of long-context embeddings over local MiniLM was being flattened by our own input, not by model capability.

What: `_node_to_text()` now appends a truncated body snippet after the signature metadata, with the signature line and leading docstrings / header comments stripped so the budget goes to real logic instead of documentation.

Four design choices worth flagging:

1. Per-provider char budget (ADR-B4): local=700 / google=3000 / minimax=3000 / openai=6000; env `CRG_EMBED_BODY_MAX_CHARS` overrides. This gives MiniLM just enough to not silently truncate and gives the long-context models room to actually use their capacity.

2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on `embeddings.db` persists `embed_body_enabled` per DB. Fresh repos auto-enable; existing DBs keep legacy behavior until the user runs `code-review-graph embed --include-body --confirm-reembed`. This avoids surprising cloud-provider users with a one-time re-embed API bill on upgrade, and the sticky flag keeps new nodes consistent with older ones across incremental runs (no mixed body-on / body-off state).

3. Two-stage IO-bounded filter (ADR-filter-short-circuit): the stored hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the metadata hash only (zero file IO) and skips when metadata + file_hash are unchanged.
Stage 2 reads the file only when stage 1 can't prove freshness, recomputes the body hash, and persists the combined form. Empty extracted bodies get sha256("") as a sentinel so they also short-circuit next time; read failures deliberately do NOT persist a sentinel, so transient errors stay retryable.

4. Language-agnostic body extraction with a block-comment + triple-quote state machine and conservative signature dedup via a three-way AND gate (node.name + declaration keyword + param first-token). Unknown languages fall back to "don't dedup" rather than risk dropping a real body. Coverage today: Python / JS / TS / JSX / TSX / Java / Kotlin / Rust / Go; other languages keep the signature line and just eat a bit of their body budget.

Backward compat: legacy rows (a pre-existing sha256 hex in `embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file hash) are both parsed correctly by `_split_combined_hash` and trigger a single safe re-embed on first contact.

Security: `_FileLineCache` resolves paths and enforces a `relative_to` containment check against the repo root, so a poisoned `nodes.file_path` can't read `/etc/passwd` or traverse out of the repo into cloud embedding calls.

Verification: regression MRR@3 on a 10-query golden set (bundled mini repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent queries like "escape HTML entities including quotes and backticks" or "read GEMINI_API_KEY from the environment" move the correct node into the top-3 for the first time because the body now contains the distinguishing tokens. Also tested against a real OpenAI-compatible endpoint (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded on the first run, 0 on a repeat with the sticky flag honored — filter IO short-circuit verified end-to-end.
Tests: 59 new cases (file-line cache LRU + containment, signature dedup with decorator / recursive / params mismatch / unknown lang, block-comment and triple-quote state machine with JSDoc / Javadoc / Python single-line and multi-line docstrings / long comments, per-provider budget resolution + env override, sticky flag across runs including CLI flip, filter short-circuit with and without populated file_hash, read-failure must-retry regression, legacy hash backward compat). The MRR regression test skips gracefully when sentence-transformers isn't installed (CI matrix legs that install only `.[dev]` stay green; CI jobs that install `.[embeddings]` activate the gate).

Scope: all four providers (Local / Google / MiniMax / OpenAI-compatible) benefit transparently; no schema migration on graph.db, no change to the `EmbeddingStore.__init__` signature, no change to parser.py / migrations.py / search.py / tools.py. Adds one new subcommand `embed` and a private `embeddings_meta` table on `embeddings.db`.
---
 README.hi-IN.md | 4 +-
 README.ja-JP.md | 4 +-
 README.ko-KR.md | 4 +-
 README.md | 18 +-
 README.zh-CN.md | 4 +-
 code_review_graph/cli.py | 97 +++
 code_review_graph/embeddings.py | 584 +++++++++++++-
 .../embeddings/eval/golden_queries.json | 101 +++
 .../embeddings/eval/mini_repo/api_client.py | 24 +
 .../embeddings/eval/mini_repo/embeddings.py | 32 +
 .../embeddings/eval/mini_repo/graph.py | 33 +
 .../embeddings/eval/mini_repo/incremental.py | 16 +
 .../embeddings/eval/mini_repo/parser.py | 33 +
 .../eval/mini_repo/visualization.py | 18 +
 tests/test_embeddings.py | 754 ++++++++++++++++++
 tests/test_embeddings_mrr.py | 183 +++++
 16 files changed, 1882 insertions(+), 27 deletions(-)
 create mode 100644 tests/fixtures/embeddings/eval/golden_queries.json
 create mode 100644 tests/fixtures/embeddings/eval/mini_repo/api_client.py
 create mode 100644 tests/fixtures/embeddings/eval/mini_repo/embeddings.py
 create mode 100644 tests/fixtures/embeddings/eval/mini_repo/graph.py
 create mode 100644
tests/fixtures/embeddings/eval/mini_repo/incremental.py create mode 100644 tests/fixtures/embeddings/eval/mini_repo/parser.py create mode 100644 tests/fixtures/embeddings/eval/mini_repo/visualization.py create mode 100644 tests/test_embeddings_mrr.py diff --git a/README.hi-IN.md b/README.hi-IN.md index ab40ade0..45bcc0ce 100644 --- a/README.hi-IN.md +++ b/README.hi-IN.md @@ -133,7 +133,7 @@ Build the code review graph for this project | **23 भाषाएं + नोटबुक** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **ब्लास्ट-रेडियस विश्लेषण** | दिखाता है कि किसी भी बदलाव से कौन से फ़ंक्शन, क्लासेज़, और फ़ाइलें प्रभावित होती हैं | | **ऑटो-अपडेट हुक्स** | बिना मैन्युअल हस्तक्षेप के हर फ़ाइल एडिट और git कमिट पर ग्राफ अपडेट होता है | -| **सिमेंटिक सर्च** | sentence-transformers, Google Gemini, MiniMax, या किसी भी OpenAI-compatible एंडपॉइंट (असली OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) के ज़रिए वैकल्पिक वेक्टर एम्बेडिंग | +| **सिमेंटिक सर्च** | sentence-transformers, Google Gemini, MiniMax, या किसी भी OpenAI-compatible एंडपॉइंट (असली OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) के ज़रिए वैकल्पिक वेक्टर एम्बेडिंग. फ़ंक्शन body का implementation हिस्सा एम्बेड होता है (signature और शुरुआती docstring/header comments हटा दिए जाते हैं ताकि snippet का token budget real logic पर खर्च हो, documentation पर नहीं). 
long-context मॉडल्स "कोड क्या करता है" के हिसाब से रैंक कर सकते हैं | | **इंटरैक्टिव विज़ुअलाइज़ेशन** | सर्च, कम्युनिटी लीजेंड टॉगल, और डिग्री-स्केल्ड नोड्स के साथ D3.js फ़ोर्स-डायरेक्टेड ग्राफ | | **हब और ब्रिज डिटेक्शन** | betweenness centrality के ज़रिए सबसे ज़्यादा कनेक्टेड नोड्स और आर्किटेक्चरल चोकपॉइंट्स खोजें | | **सरप्राइज़ स्कोरिंग** | अप्रत्याशित कपलिंग का पता लगाएं: क्रॉस-कम्युनिटी, क्रॉस-लैंग्वेज, पेरीफ़ेरल-टू-हब एज़ेज़ | @@ -286,7 +286,7 @@ export CRG_OPENAI_BATCH_SIZE=100 # टाइट बैच > **मॉडल चुनने की सलाह।** लंबे समय के उपयोग के लिए `-preview` / `-beta` / `-exp` वाले model ID (जैसे `google/gemini-embedding-2-preview`) से बचें — preview मॉडल्स के वज़न बदल सकते हैं (डाइमेंशन बदलने पर पूरा re-embed करना पड़ेगा) या बिना नोटिस deprecate हो सकते हैं। स्टेबल GA मॉडल्स की सलाह दी जाती है: `text-embedding-3-small` / `text-embedding-3-large` (OpenAI), `Qwen/Qwen3-Embedding-8B` (vLLM / LocalAI सेल्फ-होस्टेड के ज़रिए), या `gemini-embedding-001` (नेटिव Gemini provider के ज़रिए, `GOOGLE_API_KEY` चाहिए). > -> साथ ही ध्यान दें: वर्तमान में `code-review-graph` केवल **फ़ंक्शन सिग्नेचर** एम्बेड करता है (प्रति नोड ~10 tokens, जैसे `"parse_file function (path: str) returns Tree"`). जिन मॉडल्स की क्वालिटी का मुख्य source लंबे context में function body को समझना है (जैसे Gemini 2 या Qwen3-8B के MTEB-code SOTA स्कोर्स), वे इस इनपुट लंबाई पर छोटे मॉडल्स से कम अंतर दिखाएंगे। Body / docstring एम्बेडिंग को फ़ॉलो-अप एन्हांसमेंट के रूप में ट्रैक किया जा रहा है। +> `code-review-graph` signature के साथ-साथ **फ़ंक्शन body के implementation** हिस्से का truncated snippet भी एम्बेड करता है (signature और शुरुआती docstring/header comments हटा दिए जाते हैं ताकि snippet token budget real logic पर जाए; प्रति प्रोवाइडर char budget: local 700 / google 3000 / minimax 3000 / openai 6000). 
अब long-context मॉडल्स (Gemini 2, Qwen3-8B) फ़ंक्शन के नाम के बजाय "वह क्या करता है" के आधार पर रैंक कर सकते हैं। नए रेपो में यह अपने आप active हो जाता है; मौजूदा DB legacy signature-only व्यवहार रखते हैं जब तक आप `code-review-graph embed --include-body --confirm-reembed` से opt-in नहीं करते (एक बार का पूरा re-embed क्लाउड प्रोवाइडर्स पर API tokens खर्च कर सकता है — signature-only पर रहने के लिए `CRG_EMBED_INCLUDE_BODY=0` सेट करें). diff --git a/README.ja-JP.md b/README.ja-JP.md index 220ea864..6708388f 100644 --- a/README.ja-JP.md +++ b/README.ja-JP.md @@ -135,7 +135,7 @@ gitコミットやファイル保存のたびにフックが起動します。 | **23言語 + ノートブック** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **影響範囲分析** | 変更によって影響を受ける関数、クラス、ファイルを正確に表示 | | **自動更新フック** | ファイル編集やgitコミットのたびに手動操作なしでグラフを更新 | -| **セマンティック検索** | sentence-transformers、Google Gemini、MiniMax、またはOpenAI互換エンドポイント(本家OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)によるオプションのベクトル埋め込み | +| **セマンティック検索** | sentence-transformers、Google Gemini、MiniMax、またはOpenAI互換エンドポイント(本家OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)によるオプションのベクトル埋め込み。関数 body の実装部分を埋め込みます(シグネチャと先頭の docstring / ヘッダーコメントは取り除き、snippet の token 予算を実ロジックに回します)。長 context モデルが「コードが何をしているか」でランキングできます | | **インタラクティブ可視化** | D3.js力学レイアウトグラフ。検索、コミュニティ凡例切替、次数スケーリングノード対応 | | **ハブ・ブリッジ検出** | 最も接続の多いノードと媒介中心性によるアーキテクチャのボトルネックを発見 | | **サプライズスコアリング** | 予期しない結合を検出:コミュニティ間、言語間、周辺からハブへのエッジ | @@ -288,7 +288,7 @@ base URLがlocalhost(`127.0.0.1`、`localhost`、`0.0.0.0`、`::1`)を指し > **モデル選択のヒント。** `-preview` / `-beta` / `-exp` 付きのmodel ID(例:`google/gemini-embedding-2-preview`)は長期運用には避けてください。preview モデルは重みが変更される(次元が変わると全ノード re-embed 必須)か、予告なく deprecate される可能性があります。安定版 GA モデル推奨:`text-embedding-3-small` / `text-embedding-3-large`(OpenAI)、`Qwen/Qwen3-Embedding-8B`(vLLM / LocalAI 自前ホスト経由)、または `gemini-embedding-001`(ネイティブ Gemini provider 経由、`GOOGLE_API_KEY` が必要)。 > -> また注意:現状 
`code-review-graph` は**関数シグネチャのみ**を埋め込みます(ノードあたり約10トークン、例:`"parse_file function (path: str) returns Tree"`)。長 context で関数 body を理解する能力で差をつけるモデル(Gemini 2 や Qwen3-8B の MTEB-code SOTA スコア)は、この入力長では小型モデルとの差がかなり縮まります。Body / docstring 埋め込みはフォローアップ拡張として追跡中です。 +> `code-review-graph` はシグネチャに加えて**関数 body の実装**部分の切り取り片も埋め込みます(シグネチャと先頭の docstring / ヘッダーコメントは取り除き、snippet の token 予算を実ロジックに割り当てます。プロバイダごとの文字数上限:local 700 / google 3000 / minimax 3000 / openai 6000)。長 context モデル(Gemini 2、Qwen3-8B)は「関数が何をしているか」でランク付けでき、名前だけに頼らなくなります。新規リポジトリでは自動で有効;既存 DB はレガシーの signature-only 動作を保持し、`code-review-graph embed --include-body --confirm-reembed` で明示的に opt-in するまで切り替わりません(一度限りの全件 re-embed はクラウドプロバイダで API トークンを消費します。signature-only のままにしたい場合は `CRG_EMBED_INCLUDE_BODY=0`)。 diff --git a/README.ko-KR.md b/README.ko-KR.md index f593f065..68a555a4 100644 --- a/README.ko-KR.md +++ b/README.ko-KR.md @@ -135,7 +135,7 @@ git 커밋이나 파일 저장마다 훅이 실행됩니다. 그래프는 변경 | **23개 언어 + 노트북** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **영향 범위 분석** | 변경에 의해 영향 받는 함수, 클래스, 파일을 정확히 보여줍니다 | | **자동 업데이트 훅** | 수동 개입 없이 파일 편집 및 git 커밋마다 그래프가 업데이트됩니다 | -| **시맨틱 검색** | sentence-transformers, Google Gemini, MiniMax, 또는 OpenAI 호환 엔드포인트(실제 OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI)를 통한 선택적 벡터 임베딩 | +| **시맨틱 검색** | sentence-transformers, Google Gemini, MiniMax, 또는 OpenAI 호환 엔드포인트(실제 OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI)를 통한 선택적 벡터 임베딩. 함수 body의 구현 부분을 임베딩합니다(시그니처와 선두 docstring/헤더 주석은 제거하여 snippet 토큰 예산을 실제 로직에 할당). 
긴 context 모델이 "코드가 무엇을 하는지"로 랭킹할 수 있습니다 | | **인터랙티브 시각화** | 검색, 커뮤니티 범례 토글, 차수 기반 노드 크기 조정이 가능한 D3.js 포스 다이렉티드 그래프 | | **허브 및 브릿지 감지** | 가장 많이 연결된 노드와 매개 중심성을 통한 아키텍처 병목 지점 탐색 | | **서프라이즈 스코어링** | 예상치 못한 결합 감지: 커뮤니티 간, 언어 간, 주변부-허브 엣지 | @@ -288,7 +288,7 @@ base URL이 localhost(`127.0.0.1`, `localhost`, `0.0.0.0`, `::1`)를 가리킬 > **모델 선택 팁.** `-preview` / `-beta` / `-exp` 접미사가 붙은 model ID(예: `google/gemini-embedding-2-preview`)는 장기 운영용으로 피하세요. preview 모델은 가중치가 바뀌거나(차원 변경 시 전체 re-embed 필수) 예고 없이 deprecate될 수 있습니다. 안정 GA 모델 권장: `text-embedding-3-small` / `text-embedding-3-large`(OpenAI), `Qwen/Qwen3-Embedding-8B`(vLLM / LocalAI 자체 호스팅 경유), 또는 `gemini-embedding-001`(네이티브 Gemini provider 경유, `GOOGLE_API_KEY` 필요). > -> 참고로 현재 `code-review-graph`는 **함수 시그니처만** 임베딩합니다(노드당 약 10 토큰, 예: `"parse_file function (path: str) returns Tree"`). 긴 context로 함수 body를 이해하는 능력으로 차별화되는 모델(Gemini 2 또는 Qwen3-8B의 MTEB-code SOTA 점수)은 이 입력 길이에서 소형 모델과의 품질 차이가 훨씬 좁아집니다. Body / docstring 임베딩은 후속 개선 과제로 추적 중입니다. +> `code-review-graph`는 시그니처에 더해 **함수 body 구현부** 스니펫도 함께 임베딩합니다(시그니처와 선두 docstring/헤더 주석은 제거하고 snippet 토큰 예산을 실제 로직에 할당. 프로바이더별 문자 예산: local 700 / google 3000 / minimax 3000 / openai 6000). 이제 긴 context 모델(Gemini 2, Qwen3-8B)이 함수 이름만이 아니라 "함수가 무엇을 하는지"로 랭킹할 수 있습니다. 새 레포지토리는 자동으로 활성화; 기존 DB는 레거시 signature-only 동작을 유지하며, `code-review-graph embed --include-body --confirm-reembed`로 명시적 opt-in해야 전환됩니다(일회성 전량 re-embed는 클라우드 프로바이더에서 API 토큰을 소비. signature-only로 유지하려면 `CRG_EMBED_INCLUDE_BODY=0`). 
diff --git a/README.md b/README.md index 3dd714f4..ee5cd61c 100644 --- a/README.md +++ b/README.md @@ -195,7 +195,7 @@ The blast-radius analysis never misses an actually impacted file (perfect recall | **23 languages + notebooks** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **Blast-radius analysis** | Shows exactly which functions, classes, and files are affected by any change | | **Auto-update hooks** | Graph updates on every file edit and git commit without manual intervention | -| **Semantic search** | Optional vector embeddings via sentence-transformers, Google Gemini, MiniMax, or any OpenAI-compatible endpoint (real OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) | +| **Semantic search** | Optional vector embeddings via sentence-transformers, Google Gemini, MiniMax, or any OpenAI-compatible endpoint (real OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI). Embeds the function body logic (signature + leading docstrings/header comments are stripped so the token budget goes to implementation, not documentation) — long-context models can rank by what code *does* | | **Interactive visualisation** | D3.js force-directed graph with search, community legend toggles, and degree-scaled nodes | | **Hub & bridge detection** | Find most-connected nodes and architectural chokepoints via betweenness centrality | | **Surprise scoring** | Detect unexpected coupling: cross-community, cross-language, peripheral-to-hub edges | @@ -376,12 +376,16 @@ The cloud-egress warning is auto-skipped when the base URL points to localhost > `gemini-embedding-001` (via the native Gemini provider, which requires > `GOOGLE_API_KEY` instead of the OpenAI-compatible path). > -> Also note: `code-review-graph` currently embeds **function signatures only** -> (~10 tokens per node, e.g. `"parse_file function (path: str) returns Tree"`). 
-> Models whose headline quality comes from long-context body understanding -> (such as Gemini 2 or Qwen3-8B at their MTEB-code SOTA scores) will see a -> much narrower quality gap against smaller models at this input length. -> Body/docstring embedding is tracked as a follow-up enhancement. +> `code-review-graph` also embeds a truncated **function body +> implementation** snippet alongside the signature (signature + leading +> docstrings / header comments are stripped so the snippet tokens go to +> real logic; per-provider char budget: local 700 / google 3000 / +> minimax 3000 / openai 6000). Long-context models (Gemini 2, Qwen3-8B) +> can rank by what a function *does* and not just what it's named. New repositories pick this up automatically; +> existing DBs keep the legacy signature-only behavior until you opt in +> with `code-review-graph embed --include-body --confirm-reembed` (a +> one-time re-embed that may spend API tokens on cloud providers — set +> `CRG_EMBED_INCLUDE_BODY=0` to stay on signatures). 
#### Tool Filtering diff --git a/README.zh-CN.md b/README.zh-CN.md index 9d1c8fa0..2a8e387d 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -133,7 +133,7 @@ Build the code review graph for this project | **23 种语言 + 笔记本** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **影响半径分析** | 精确展示任何变更所影响的函数、类和文件 | | **自动更新钩子** | 每次文件编辑和 git 提交时自动更新图,无需手动干预 | -| **语义搜索** | 可选的向量嵌入,支持 sentence-transformers、Google Gemini、MiniMax,或任何 OpenAI 兼容端点(真实 OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI) | +| **语义搜索** | 可选的向量嵌入,支持 sentence-transformers、Google Gemini、MiniMax,或任何 OpenAI 兼容端点(真实 OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)。嵌入函数 body 的实作片段(剥除签名与开头的 docstring/header 注释,让 token 预算花在真正的逻辑而非文档),让长 context 模型按「代码做什么」排序 | | **交互式可视化** | D3.js 力导向图,支持搜索、社区图例切换和按度数缩放的节点 | | **Hub 与 Bridge 检测** | 查找连接最多的节点和通过介数中心性发现架构瓶颈 | | **异常评分** | 检测意外耦合:跨社区、跨语言、外围到核心的边 | @@ -286,7 +286,7 @@ export CRG_OPENAI_BATCH_SIZE=100 # 某些网关有更严 > **模型选择提示。** 避免用 `-preview` / `-beta` / `-exp` 结尾的 model ID(例如 `google/gemini-embedding-2-preview`)做长期使用——preview 模型可能更换权重(维度一变就要全量 re-embed)或被无预警下架。建议改用正式 GA 模型:`text-embedding-3-small` / `text-embedding-3-large`(OpenAI)、`Qwen/Qwen3-Embedding-8B`(经 vLLM / LocalAI 自宿主)、或 `gemini-embedding-001`(经原生 Gemini provider,需要 `GOOGLE_API_KEY`)。 > -> 另外请注意:目前 `code-review-graph` 只嵌入**函数签名**(每节点约 10 tokens,例如 `"parse_file function (path: str) returns Tree"`)。那些靠长 context 理解函数 body 来拉开差距的模型(Gemini 2 或 Qwen3-8B 在 MTEB-code 的 SOTA 分数)在这个输入长度下跟小模型的品质差距会小很多。Body / docstring 嵌入已列为后续增强任务。 +> `code-review-graph` 也会在签名后面接上截断过的**函数 body 实作**片段(剥除签名与开头的 docstring/header 注释,让 snippet token 留给真正的逻辑而非文档;每家 provider 的字元上限:local 700 / google 3000 / minimax 3000 / openai 6000),长 context 模型(Gemini 2、Qwen3-8B)终于能按「代码做什么」排序而不是只看名字。新 repo 自动开启;既有 DB 保留原本的 signature-only 行为,直到你执行 `code-review-graph embed --include-body --confirm-reembed` 显式 opt-in(一次性全量 
re-embed,对云 provider 会花 API token——想留在 signature-only 设 `CRG_EMBED_INCLUDE_BODY=0`)。 diff --git a/code_review_graph/cli.py b/code_review_graph/cli.py index 6e9349b6..d097b75e 100644 --- a/code_review_graph/cli.py +++ b/code_review_graph/cli.py @@ -480,6 +480,41 @@ def main() -> None: eval_cmd.add_argument("--report", action="store_true", help="Generate report from results") eval_cmd.add_argument("--output-dir", default=None, help="Output directory for results") + # embed (opt-in body enrichment gate) + embed_cmd = sub.add_parser( + "embed", + help="Compute embeddings for all graph nodes (opt-in body enrichment)", + ) + embed_cmd.add_argument("--repo", default=None, help="Repository root (auto-detected)") + embed_cmd.add_argument( + "--provider", + default=None, + help="Embedding provider: local (default) / google / minimax / openai", + ) + embed_cmd.add_argument( + "--model", + default=None, + help="Model name (local: sentence-transformers; openai: CRG_OPENAI_MODEL)", + ) + embed_cmd.add_argument( + "--include-body", + action="store_true", + help="Embed function bodies alongside signatures (sets CRG_EMBED_INCLUDE_BODY=1)", + ) + embed_cmd.add_argument( + "--confirm-reembed", + action="store_true", + help="Required alongside --include-body when embeddings.db already has rows — " + "without this flag the command refuses to trigger a full re-embed " + "(cost-safety gate for cloud providers).", + ) + embed_cmd.add_argument( + "--skip-cost-check", + action="store_true", + help="Bypass the ballpark token/cost estimate warning shown before " + "re-embed on cloud providers (CI / automation).", + ) + # detect-changes detect_cmd = sub.add_parser("detect-changes", help="Analyze change impact") detect_cmd.add_argument("--base", default="HEAD~1", help="Git diff base (default: HEAD~1)") @@ -722,6 +757,68 @@ def main() -> None: if result.get("files_updated", 0) > 0: _cli_post_process(store) + elif args.command == "embed": + from .embeddings import ( + CLOUD_PROVIDERS, + 
EmbeddingStore, + _warn_cloud_egress, + embed_all_nodes, + ) + + provider = getattr(args, "provider", None) or "local" + model = getattr(args, "model", None) + + # --include-body safety gate: existing DB requires --confirm-reembed. + if getattr(args, "include_body", False): + emb_for_count = EmbeddingStore(store.db_path) + try: + existing = emb_for_count.count() + finally: + emb_for_count.close() + if existing > 0 and not getattr(args, "confirm_reembed", False): + logging.error( + "--include-body on a non-empty embeddings DB (%d rows) " + "triggers a full re-embed. Re-run with --confirm-reembed " + "to proceed, or unset --include-body to keep the current " + "sticky gate. See CHANGELOG for cost notes.", + existing, + ) + sys.exit(2) + if ( + provider in CLOUD_PROVIDERS + and existing >= 500 + and not getattr(args, "skip_cost_check", False) + ): + print( + f"Heads up: re-embedding {existing} nodes via '{provider}' " + f"will spend API tokens. A rough ballpark is " + f"{existing * 200} tokens; check your provider's pricing. " + "Pass --skip-cost-check to silence this notice.", + file=sys.stderr, + ) + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + + emb_store = EmbeddingStore(store.db_path, provider=provider, model=model) + try: + if emb_store.provider is None: + logging.error( + "Embedding provider '%s' unavailable. 
For 'local', " + "install the 'embeddings' extra: " + "pip install code-review-graph[embeddings]", + provider, + ) + sys.exit(1) + if provider in CLOUD_PROVIDERS: + _warn_cloud_egress(emb_store.provider.name) + count = embed_all_nodes(store, emb_store) + print( + f"Embedded {count} new/changed nodes " + f"(provider={emb_store.provider.name}, " + f"total={emb_store.count()})" + ) + finally: + emb_store.close() + elif args.command == "status": stats = store.get_stats() print(f"Nodes: {stats.total_nodes}") diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index be29045f..6badc0a6 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -18,6 +18,7 @@ import sys import time from abc import ABC, abstractmethod +from collections import OrderedDict from pathlib import Path from typing import Any from urllib.parse import urlparse @@ -662,6 +663,11 @@ def _check_available() -> bool: text_hash TEXT NOT NULL, provider TEXT NOT NULL DEFAULT 'unknown' ); + +CREATE TABLE IF NOT EXISTS embeddings_meta ( + key TEXT PRIMARY KEY, + value TEXT NOT NULL +); """ @@ -688,8 +694,403 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: return dot / (norm_a * norm_b) -def _node_to_text(node: GraphNode) -> str: - """Convert a node to a searchable text representation.""" +# --------------------------------------------------------------------------- +# Body enrichment helpers (Iter 3 final: ADR-A1+dedup / B4 per-provider / C3 sticky) +# --------------------------------------------------------------------------- + +# ADR-B4: per-provider body char budget. Keys are provider-name prefixes +# (the part before ':'). Fallback applies to unknown providers. 
+_BODY_MAX_CHARS_BY_PROVIDER: dict[str, int] = { + "local": 700, # MiniLM ~256 token window; body + metadata must fit together + "google": 3000, # gemini-embedding-001 ~2048 token window, conservative 50% + "minimax": 3000, # embo-01 ~1024 token window, pay-per-use + "openai": 6000, # text-embedding-3-* 8192 token ceiling, safety margin +} +_BODY_MAX_CHARS_FALLBACK = 700 +_BODY_MAX_LINES_DEFAULT = 15 +_FILE_CACHE_MAX_ENTRIES = 128 +_FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB +_FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect +_EMBED_META_KEY_BODY_ENABLED = "embed_body_enabled" + + +def _resolve_body_max_chars(provider_name: str) -> int: + """Return the per-provider char budget for body-enriched embeddings. + + Precedence: + 1. ``CRG_EMBED_BODY_MAX_CHARS`` env var (positive int) — global override + 2. ``_BODY_MAX_CHARS_BY_PROVIDER[prefix]`` where prefix is + ``provider_name.split(':', 1)[0].lower()`` + 3. ``_BODY_MAX_CHARS_FALLBACK`` for unknown providers + """ + override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS") + if override: + try: + n = int(override) + if n > 0: + return n + except ValueError: + pass + prefix = provider_name.split(":", 1)[0].lower() if provider_name else "" + return _BODY_MAX_CHARS_BY_PROVIDER.get(prefix, _BODY_MAX_CHARS_FALLBACK) + + +class _FileLineCache: + """Per-embed-batch file reader with bounded LRU (``OrderedDict``). + + Returns a slice of file lines for a given 1-indexed inclusive range. + Returns ``[]`` (empty, never raises) for: + - missing file / directory / permission / generic ``OSError`` + - binary file (null byte in first ``_FILE_CACHE_SNIFF_BYTES``) + - oversize file (> ``_FILE_CACHE_MAX_BYTES``) + - out-of-range line numbers (``line_start <= 0`` or beyond EOF) + + Not thread-safe. ``embed_nodes`` is single-threaded; do not share + instances across batches. 
+ """ + + def __init__( + self, + repo_root: Path | None = None, + max_entries: int = _FILE_CACHE_MAX_ENTRIES, + ) -> None: + self._repo_root = repo_root + self._max_entries = max_entries + # value ``None`` marks a known-failed read so we don't re-probe + self._cache: OrderedDict[str, list[str] | None] = OrderedDict() + self._read_count = 0 + # Whether the most recent ``get_lines`` call saw a file that + # loaded at all. ``False`` only when the underlying file + # read/decode failed (missing / permission / binary / oversized + # / outside repo). Callers use this to decide whether an empty + # return reflects a legitimate empty slice or a transient read + # failure that should be retried on the next run, NOT persisted + # as an empty-body sentinel (Codex round 5 regression). + self._last_read_ok = True + + def _resolve(self, file_path: str) -> Path | None: + """Resolve ``file_path`` to an absolute path within the repo root. + + Returns ``None`` (treated as a read failure) if the resolved path + escapes the repo root. ``file_path`` comes straight out of the + graph DB, so an attacker who can write to ``nodes.file_path`` + could otherwise point us at ``/etc/passwd`` or ``../../secret`` + and get the contents embedded into a cloud provider. We only + allow the escape when ``repo_root`` is unset (e.g. unit tests that + construct the cache with absolute fixture paths). 
+ """ + p = Path(file_path) + if not p.is_absolute() and self._repo_root is not None: + p = self._repo_root / p + if self._repo_root is None: + return p + try: + resolved = p.resolve() + root = self._repo_root.resolve() + except (OSError, RuntimeError): + return None + try: + resolved.relative_to(root) + except ValueError: + return None + return resolved + + def _load(self, file_path: str) -> list[str] | None: + """Read and split a file; return ``None`` on any failure.""" + try: + p = self._resolve(file_path) + if p is None: + return None + st = p.stat() + if st.st_size > _FILE_CACHE_MAX_BYTES: + return None + with p.open("rb") as fh: + head = fh.read(_FILE_CACHE_SNIFF_BYTES) + if b"\x00" in head: + return None + rest = fh.read() + raw = head + rest + except (OSError, ValueError): + return None + try: + text = raw.decode("utf-8", errors="replace") + except Exception: + return None + return text.splitlines() + + def get_lines(self, file_path: str, line_start: int, line_end: int) -> list[str]: + if line_start <= 0 or line_end < line_start: + # Bogus range is a caller-side issue, not a read failure — + # retrying would hit the same bug, so mark as OK and let the + # caller persist whatever result they compute. + self._last_read_ok = True + return [] + key = file_path + if key in self._cache: + self._cache.move_to_end(key) + lines = self._cache[key] + else: + lines = self._load(key) + self._read_count += 1 + self._cache[key] = lines + if len(self._cache) > self._max_entries: + self._cache.popitem(last=False) + self._last_read_ok = lines is not None + if lines is None: + return [] + if line_start > len(lines): + return [] + return lines[line_start - 1:line_end] + + @property + def last_read_ok(self) -> bool: + return self._last_read_ok + + def clear(self) -> None: + self._cache.clear() + self._last_read_ok = True + + +def _looks_like_signature(line: str, node: GraphNode) -> bool: + """Conservative substring check: is ``line`` the declaration of ``node``? 
+ + Returns True only when ALL of: + - ``node.name`` appears in the line + - the line contains a language-specific declaration keyword + - (if ``node.params`` is set) the first param identifier appears + + Unknown languages return False (do not dedup — keep the line). + """ + s = line.strip() + if not s: + return False + if node.name not in s: + return False + + lang = (node.language or "").lower() + + if lang == "python": + if not ("def " in s or "class " in s): + return False + elif lang in ("javascript", "typescript", "jsx", "tsx"): + if not ( + "function " in s or "=>" in s or "class " in s or "async " in s + ): + return False + elif lang in ("java", "kotlin"): + has_type_name = bool( + node.return_type and node.return_type in s and node.name in s + ) + # Kotlin methods commonly omit an explicit return type (e.g. + # ``override fun save(user: User) { ... }``); fall back to the + # ``fun `` keyword so dedup still fires for those. + has_kotlin_fun = lang == "kotlin" and "fun " in s + if not ( + has_type_name or "class " in s or "interface " in s or has_kotlin_fun + ): + return False + elif lang == "rust": + if not any( + k in s for k in ("fn ", "impl ", "struct ", "enum ", "trait ") + ): + return False + elif lang == "go": + if "func " not in s: + return False + else: + return False + + if node.params: + pstripped = node.params.strip("()").strip() + if pstripped: + first = pstripped.split(",", 1)[0].split(":", 1)[0].strip() + # tokens like `*args`, `**kwargs`, `&self`, `self` are safe + # to skip param gate on (always match): bail only when we have + # a real identifier we can check. + bare = first.lstrip("*&") + if bare and bare != "self" and bare not in s: + return False + return True + + +def _extract_body_text( + node: GraphNode, + reader: _FileLineCache, + *, + max_chars: int, + max_lines: int = _BODY_MAX_LINES_DEFAULT, +) -> str: + """Return a truncated body snippet for semantic embedding. + + Pipeline: + 1. Skip ``File`` nodes and bogus line ranges. + 2. 
Read the node's ``[line_start, line_end]`` range via ``reader``. + 3. Drop line 0 if it looks like the node's signature (dedup with metadata). + 4. Run a block-comment + triple-quote state machine to skip leading + docstrings and header comments; leading single-line comments + (``#`` ``//`` ``--`` ``///`` ``*``) are dropped regardless of length. + 5. Collapse consecutive blank lines. + 6. Truncate by ``max_lines`` then by ``max_chars`` (prefer word boundary, + append ``' …'`` suffix when cut). + """ + if node.kind == "File": + return "" + if node.line_start is None or node.line_end is None: + return "" + if node.line_start <= 0 or node.line_end < node.line_start: + return "" + + lines = reader.get_lines(node.file_path, node.line_start, node.line_end) + if not lines: + return "" + + # Step 3: signature dedup + if _looks_like_signature(lines[0], node): + lines = lines[1:] + if not lines: + return "" + + # Step 4: block-comment + triple-quote state machine + kept: list[str] = [] + in_block_comment = False + in_triple_quote: str | None = None + drops = 0 + for line in lines: + s = line.strip() + + # triple-quote state (Python docstring continuation) + if in_triple_quote is not None: + drops += 1 + if in_triple_quote in s: + in_triple_quote = None + continue + + # block-comment state (C/Java/JS /* ... 
*/) + if in_block_comment: + drops += 1 + if "*/" in s: + in_block_comment = False + continue + + # leading-only triple-quote open + if not kept: + matched_triple = False + for q in ('"""', "'''"): + if s.startswith(q): + # same-line open+close: """one-liner.""" + if len(s) >= len(q) * 2 and s.endswith(q): + drops += 1 + matched_triple = True + break + # multi-line open: enter state (only if no closer on this line) + if q not in s[len(q):]: + in_triple_quote = q + drops += 1 + matched_triple = True + break + if matched_triple: + continue + + # block-comment open: /* or /** on its own line + if s.startswith("/*") or s.startswith("/**"): + if "*/" not in s: + # multi-line opener + in_block_comment = True + drops += 1 + continue + # single-line inline block comment: drop only if the line has no + # trailing code after the closing ``*/``. Lines like + # ``/* x */ real_code = 1`` keep flowing into the real-content + # branch (handled by the leading-``*`` case below + kept). + after_close = s.split("*/", 1)[1].strip() + if not kept and not after_close: + drops += 1 + continue + + # leading single-line comments: drop regardless of length (D-iter3-4) + if not kept and s.startswith(("#", "//", "--", "///", "*")): + drops += 1 + continue + + # blank line before any real code + if not kept and s == "": + drops += 1 + continue + + kept.append(line) + + # fallback safety cap + if drops >= 20 and not kept: + break + + if not kept: + return "" + + # Step 5: collapse blank-line runs + collapsed: list[str] = [] + prev_blank = False + for ln in kept: + is_blank = ln.strip() == "" + if is_blank and prev_blank: + continue + collapsed.append(ln) + prev_blank = is_blank + + # Step 6: truncate + collapsed = collapsed[:max_lines] + normalised = [ln.replace("\t", " ").rstrip() for ln in collapsed] + text = " \n ".join(normalised) + if len(text) > max_chars: + cut = text[:max_chars].rsplit(" ", 1) + text = (cut[0] if len(cut) == 2 and cut[0] else text[:max_chars]) + " …" + return text + + 
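Aside: the leading-noise stripping that steps 3–5 implement can be sketched in isolation. The following is a minimal standalone approximation — Python syntax only (`#` comments and `"""` docstrings, no C-style block-comment state), and the helper name `strip_leading_noise` is illustrative, not part of this patch:

```python
def strip_leading_noise(lines):
    """Drop a leading run of docstring / comment / blank lines, keep real code.

    Simplified sketch of the state machine above: once a real code line has
    been kept, later comments and blanks pass through untouched.
    """
    out = []
    in_doc = False
    for line in lines:
        s = line.strip()
        if in_doc:
            if '"""' in s:
                in_doc = False  # closing quote is consumed with the docstring
            continue
        if not out:
            if s.startswith('"""'):
                # one-liner docstring closes on the same line; otherwise
                # enter the skip-until-close state
                if not (len(s) >= 6 and s.endswith('"""')):
                    in_doc = True
                continue
            if s == "" or s.startswith("#"):
                continue
        out.append(line)
    return out
```

The real `_extract_body_text` additionally tracks `/* ... */` block comments and caps how many lines it will drop before bailing out.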
+def _split_combined_hash(stored: str) -> tuple[str, str, str]: + """Parse the combined hash stored in ``embeddings.text_hash``. + + Format evolution (all three forms round-trip through this helper): + + * **Iter 1 legacy**: single sha256 hex, no ``':'`` + → ``(stored, '', '')`` + * **Iter 2 (body flag only)**: ``'meta_hash:body_hash'`` + → ``(meta, body, '')`` + * **Iter 3.2 (body + file fingerprint)**: ``'meta:body:file_hash'`` + → ``(meta, body, file_hash)`` + + The third slot lets ``embed_nodes`` detect a body change without + reading the file again: when metadata is unchanged but the node's + file-level hash has shifted, the body snippet may have changed + (Codex-review finding: metadata-only stage-1 filter would otherwise + leave edits like ``return x * 2`` → ``return x * 3`` permanently + stale). Rows older than Iter 3.2 report an empty file-hash slot, + which callers treat as "unknown — must re-verify". + """ + parts = stored.split(":", 2) + if len(parts) == 1: + return parts[0], "", "" + if len(parts) == 2: + return parts[0], parts[1], "" + return parts[0], parts[1], parts[2] + + +def _node_to_text( + node: GraphNode, + reader: _FileLineCache | None = None, + *, + max_chars: int | None = None, +) -> str: + """Build searchable text = metadata + optional body snippet. + + Backward-compatible: callers passing ``reader=None`` get the pre-Iter-2 + metadata-only string exactly as before. Legacy tests that construct a + node and call ``_node_to_text(node)`` continue to pass unchanged. + + When ``reader`` is provided, ``max_chars`` is required (raises + ``ValueError`` otherwise — ``assert`` would be stripped under + ``python -O``). Caller is responsible for gating via ``_body_enabled`` + before passing a reader; this function does not re-check. 
+ """ parts = [node.name] if node.kind != "File": parts.append(node.kind.lower()) @@ -701,9 +1102,53 @@ def _node_to_text(node: GraphNode) -> str: parts.append(f"returns {node.return_type}") if node.language: parts.append(node.language) + + if reader is not None: + if max_chars is None: + raise ValueError("max_chars required when reader provided") + body = _extract_body_text(node, reader, max_chars=max_chars) + if body: + parts.append("body:") + parts.append(body) return " ".join(parts) +def _body_enabled(store: "EmbeddingStore") -> bool: + """ADR-C3 sticky-flag gate for body enrichment (Iter 3 D-iter3-1). + + Decision priority (first match wins; ALL paths write back the flag to + ``embeddings_meta.embed_body_enabled`` so subsequent runs honour it): + + 1. ``CRG_EMBED_INCLUDE_BODY=0`` -> False (write ``"0"``) + 2. ``CRG_EMBED_INCLUDE_BODY=1`` -> True (write ``"1"``) + 3. Existing flag -> honour stored value + 4. No flag + ``store.count() == 0`` -> True (auto-on for fresh DB) + 5. No flag + ``store.count() > 0`` -> False (legacy DB opts out by default) + + The CLI ``--include-body --confirm-reembed`` combo is implemented by + setting ``CRG_EMBED_INCLUDE_BODY=1`` before invoking embed, so case (2) + fires and flips ``"0"`` -> ``"1"`` on legacy DBs. + """ + env = os.environ.get("CRG_EMBED_INCLUDE_BODY") + if env == "0": + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "0") + return False + if env == "1": + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "1") + return True + stored = store._get_meta(_EMBED_META_KEY_BODY_ENABLED) + if stored == "1": + return True + if stored == "0": + return False + # Flag absent: decide based on DB emptiness and persist the choice. 
+ if store.count() == 0: + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "1") + return True + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "0") + return False + + class EmbeddingStore: """Manages vector embeddings for graph nodes in SQLite.""" @@ -743,31 +1188,146 @@ def __exit__(self, exc_type, exc_val, exc_tb) -> None: # type: ignore[no-untype def close(self) -> None: self._conn.close() + def _repo_root_guess(self) -> Path | None: + """Best-effort root inference for the ``_FileLineCache``. + + Graph files live in ``.code-review-graph/graph.db``; walk upward from + ``db_path`` to find that directory and return its parent. Returns + ``None`` if we can't pin it down — callers interpret that as + "use absolute file paths only". + """ + try: + parent = self.db_path.resolve().parent + if parent.name == ".code-review-graph": + return parent.parent + # db_path may live directly inside a repo root (tests/tmp_path) + return parent + except OSError: + return None + + # --- embeddings_meta (Iter 3 D-iter3-1 sticky flag storage) --- + + def _get_meta(self, key: str) -> str | None: + row = self._conn.execute( + "SELECT value FROM embeddings_meta WHERE key = ?", (key,) + ).fetchone() + return row["value"] if row else None + + def _set_meta(self, key: str, value: str) -> None: + self._conn.execute( + "INSERT INTO embeddings_meta (key, value) VALUES (?, ?) " + "ON CONFLICT(key) DO UPDATE SET value = excluded.value", + (key, value), + ) + self._conn.commit() + def embed_nodes(self, nodes: list[GraphNode], batch_size: int = 64) -> int: - """Compute and store embeddings for a list of nodes.""" + """Compute and store embeddings for a list of nodes. + + Two-stage filter (Iter 3 Directive 5): + + * **Stage 1 (zero IO)** — compute ``metadata_hash`` from the + metadata-only ``_node_to_text`` form and compare with the stored + hash's metadata half. If it matches AND body is disabled OR the + row already has a body part embedded, skip without reading any + file. 
+ * **Stage 2 (with IO)** — only for nodes that pass stage 1's + escape-hatch branches (metadata changed, or body was just + enabled on a legacy row): read the file through the + ``_FileLineCache``, compute ``body_hash``, and store the + combined ``"{metadata_hash}:{body_hash}"`` in ``text_hash``. + """ if not self.provider: return 0 - # Filter to nodes that need embedding - to_embed: list[tuple[GraphNode, str, str]] = [] provider_name = self.provider.name + body_on = _body_enabled(self) + max_chars = _resolve_body_max_chars(provider_name) if body_on else 0 + reader = _FileLineCache(repo_root=self._repo_root_guess()) if body_on else None + + to_embed: list[tuple[GraphNode, str, str]] = [] + + # Sort by file so the LRU cache gets consecutive hits within a file + sorted_nodes = sorted( + nodes, + key=lambda n: (n.file_path or "", n.line_start or 0), + ) - for node in nodes: + for node in sorted_nodes: if node.kind == "File": continue - text = _node_to_text(node) - text_hash = hashlib.sha256(text.encode()).hexdigest() + + # --- Stage 1: metadata-only hash, zero IO --- + meta_text = _node_to_text(node, reader=None) + meta_hash = hashlib.sha256(meta_text.encode()).hexdigest() existing = self._conn.execute( "SELECT text_hash, provider FROM embeddings WHERE qualified_name = ?", (node.qualified_name,), ).fetchone() - # Re-embed if text changed OR provider changed - if (existing and existing["text_hash"] == text_hash - and existing["provider"] == provider_name): + node_file_hash = node.file_hash or "" + if existing and existing["provider"] == provider_name: + stored_meta, stored_body, stored_file_hash = _split_combined_hash( + existing["text_hash"] + ) + if stored_meta == meta_hash: + # Metadata identical. Skip when we can PROVE the body + # snippet is still fresh too. 
We require *both* sides + # to carry a non-empty file fingerprint that matches; + # an empty fingerprint on either side is not proof of + # freshness — it just means we don't know — so we + # fall through to stage 2 and reread the file. This + # is the conservative path Codex round 3 flagged: + # metadata-only rows with absent ``node.file_hash`` + # would otherwise let body-only edits slip through. + if not body_on: + continue + if (stored_body != "" + and stored_file_hash != "" + and node_file_hash != "" + and stored_file_hash == node_file_hash): + continue + # else: fall through to stage 2 to refresh body_hash + + # --- Stage 2: body-enriched hash (reads file only when needed) --- + if body_on and reader is not None: + full_text = _node_to_text(node, reader=reader, max_chars=max_chars) + body_part = full_text[len(meta_text):] + if reader.last_read_ok: + # Always hash the body slot when body enrichment is + # on and the read succeeded, even when ``body_part`` + # is empty (single-line function whose only line was + # the signature, or a body entirely stripped as + # docstring/header). Writing ``sha256("")`` here + # lets stage 1 distinguish "body ran and produced + # nothing" from "body never ran" (legacy / body-off + # rows), so empty-body nodes short-circuit too. + body_hash = hashlib.sha256(body_part.encode()).hexdigest() + combined_hash = f"{meta_hash}:{body_hash}:{node_file_hash}" + final_text = full_text + else: + # File read failed (missing / permission / binary / + # oversized / outside repo). Do NOT persist the + # empty-body sentinel — that would freeze the node + # on a transient failure (Codex round 5). Fall back + # to the body-off stored form so the next run + # retries. We still embed the metadata-only text + # now, since we need *some* vector for this node. 
+ combined_hash = f"{meta_hash}::{node_file_hash}" + final_text = meta_text + else: + combined_hash = f"{meta_hash}::{node_file_hash}" + final_text = meta_text + + if (existing and existing["provider"] == provider_name + and existing["text_hash"] == combined_hash): continue - to_embed.append((node, text, text_hash)) + + to_embed.append((node, final_text, combined_hash)) + + if reader is not None: + reader.clear() if not to_embed: return 0 diff --git a/tests/fixtures/embeddings/eval/golden_queries.json b/tests/fixtures/embeddings/eval/golden_queries.json new file mode 100644 index 00000000..cfec44ad --- /dev/null +++ b/tests/fixtures/embeddings/eval/golden_queries.json @@ -0,0 +1,101 @@ +[ + { + "query": "parse source code tree into nodes", + "expected_top3": [ + "parser.py::parse_tree", + "parser.py::walk_nodes", + "parser.py::extract_functions" + ], + "type": "signature-searchable", + "notes": "node names contain parse/tree/nodes" + }, + { + "query": "split string into tokens", + "expected_top3": [ + "parser.py::tokenize", + "parser.py::parse_tree", + "parser.py::build_root" + ], + "type": "signature-searchable", + "notes": "tokenize name is the obvious hit" + }, + { + "query": "cosine similarity between embedding vectors", + "expected_top3": [ + "embeddings.py::cosine_similarity", + "embeddings.py::EmbeddingIndex.search", + "embeddings.py::embed_text" + ], + "type": "signature-searchable", + "notes": "cosine_similarity name is explicit" + }, + { + "query": "escape HTML entities including quotes and backticks", + "expected_top3": [ + "visualization.py::esc_h", + "visualization.py::render_html" + ], + "type": "body-dependent", + "notes": "esc_h name is cryptic; only body shows replace('&', '&') / replace('\"', '"') / replace('`', '`')" + }, + { + "query": "breadth first search with visited set and queue", + "expected_top3": [ + "graph.py::Graph.impact_radius", + "graph.py::Graph.neighbors", + "graph.py::Graph.add_edge" + ], + "type": "body-dependent", + "notes": 
"impact_radius name does not say BFS; body has visited=set(), deque(), while queue" + }, + { + "query": "read OPENAI_API_KEY from the environment", + "expected_top3": [ + "api_client.py::OpenAIClient.__init__", + "api_client.py::GeminiClient.__init__", + "api_client.py::MinimaxClient.__init__" + ], + "type": "body-dependent", + "notes": "all three __init__ share the same signature; only body contains the env var name" + }, + { + "query": "read GEMINI_API_KEY from the environment", + "expected_top3": [ + "api_client.py::GeminiClient.__init__", + "api_client.py::OpenAIClient.__init__", + "api_client.py::MinimaxClient.__init__" + ], + "type": "body-dependent", + "notes": "requires body to distinguish the three __init__ siblings" + }, + { + "query": "render graph nodes as html list", + "expected_top3": [ + "visualization.py::render_html", + "visualization.py::esc_h", + "graph.py::Graph.neighbors" + ], + "type": "mixed", + "notes": "render_html name has semantic signal; esc_h relies on body" + }, + { + "query": "detect git file changes since a ref", + "expected_top3": [ + "incremental.py::detect_changes", + "incremental.py::get_changed_files", + "graph.py::Graph.add_edge" + ], + "type": "mixed", + "notes": "detect_changes is signature-searchable; get_changed_files needs body-ish context" + }, + { + "query": "store embedding vector by qualified name", + "expected_top3": [ + "embeddings.py::EmbeddingIndex.add", + "embeddings.py::EmbeddingIndex.search", + "embeddings.py::EmbeddingIndex.__init__" + ], + "type": "signature-searchable", + "notes": "EmbeddingIndex.add name matches" + } +] diff --git a/tests/fixtures/embeddings/eval/mini_repo/api_client.py b/tests/fixtures/embeddings/eval/mini_repo/api_client.py new file mode 100644 index 00000000..de194c95 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/api_client.py @@ -0,0 +1,24 @@ +"""Mini cloud API clients for MRR eval fixture — identical signatures, body differs.""" + +import os + + +class OpenAIClient: + def 
__init__(self): + self.api_key = os.environ.get("OPENAI_API_KEY", "") + self.base_url = "https://api.openai.com/v1" + self.default_model = "text-embedding-3-small" + + +class GeminiClient: + def __init__(self): + self.api_key = os.environ.get("GEMINI_API_KEY", "") + self.base_url = "https://generativelanguage.googleapis.com/v1beta" + self.default_model = "gemini-embedding-001" + + +class MinimaxClient: + def __init__(self): + self.api_key = os.environ.get("MINIMAX_API_KEY", "") + self.base_url = "https://api.minimax.io/v1" + self.default_model = "embo-01" diff --git a/tests/fixtures/embeddings/eval/mini_repo/embeddings.py b/tests/fixtures/embeddings/eval/mini_repo/embeddings.py new file mode 100644 index 00000000..b5d30502 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/embeddings.py @@ -0,0 +1,32 @@ +"""Mini embedding index for MRR eval fixture.""" + + +def cosine_similarity(a, b): + if len(a) != len(b): + return 0.0 + dot = sum(x * y for x, y in zip(a, b)) + norm_a = sum(x * x for x in a) ** 0.5 + norm_b = sum(x * x for x in b) ** 0.5 + if norm_a == 0 or norm_b == 0: + return 0.0 + return dot / (norm_a * norm_b) + + +def embed_text(text): + tokens = text.lower().split() + return [float(len(t)) for t in tokens] + + +class EmbeddingIndex: + def __init__(self): + self.store = {} + + def add(self, qname, vec): + self.store[qname] = vec + + def search(self, query_vec, limit=10): + scored = [] + for qn, vec in self.store.items(): + scored.append((qn, cosine_similarity(query_vec, vec))) + scored.sort(key=lambda x: x[1], reverse=True) + return scored[:limit] diff --git a/tests/fixtures/embeddings/eval/mini_repo/graph.py b/tests/fixtures/embeddings/eval/mini_repo/graph.py new file mode 100644 index 00000000..8103ae11 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/graph.py @@ -0,0 +1,33 @@ +"""Mini graph store for MRR eval fixture.""" + +from collections import deque + + +class Graph: + def __init__(self): + self.nodes = {} + self.edges = {} + + 
def add_node(self, name): + self.nodes[name] = name + self.edges.setdefault(name, []) + + def add_edge(self, src, dst): + self.edges.setdefault(src, []).append(dst) + + def neighbors(self, name): + return list(self.edges.get(name, [])) + + def impact_radius(self, start, depth): + visited = set() + queue = deque([(start, 0)]) + result = [] + while queue: + node, d = queue.popleft() + if node in visited or d > depth: + continue + visited.add(node) + result.append(node) + for nbr in self.edges.get(node, []): + queue.append((nbr, d + 1)) + return result diff --git a/tests/fixtures/embeddings/eval/mini_repo/incremental.py b/tests/fixtures/embeddings/eval/mini_repo/incremental.py new file mode 100644 index 00000000..e9c905b2 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/incremental.py @@ -0,0 +1,16 @@ +"""Mini git change detection for MRR eval fixture.""" + +import subprocess + + +def detect_changes(repo_root, since_ref): + """Return list of files changed in repo since the given git ref.""" + out = subprocess.run( + ["git", "-C", repo_root, "diff", "--name-only", since_ref, "HEAD"], + capture_output=True, text=True, check=False, + ) + return [line for line in out.stdout.splitlines() if line.strip()] + + +def get_changed_files(repo_root): + return detect_changes(repo_root, "HEAD~1") diff --git a/tests/fixtures/embeddings/eval/mini_repo/parser.py b/tests/fixtures/embeddings/eval/mini_repo/parser.py new file mode 100644 index 00000000..6874d7d0 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/parser.py @@ -0,0 +1,33 @@ +"""Mini AST parser for MRR eval fixture.""" + + +def parse_tree(source): + """Parse source text and return an AST root.""" + tokens = tokenize(source) + root = build_root(tokens) + return root + + +def walk_nodes(tree, visitor): + stack = [tree] + while stack: + node = stack.pop() + visitor(node) + stack.extend(node.children) + + +def extract_functions(tree): + found = [] + walk_nodes(tree, lambda n: found.append(n) if n.kind == 
"function" else None) + return found + + +def tokenize(source): + return source.split() + + +def build_root(tokens): + class N: + kind = "root" + children = [] + return N() diff --git a/tests/fixtures/embeddings/eval/mini_repo/visualization.py b/tests/fixtures/embeddings/eval/mini_repo/visualization.py new file mode 100644 index 00000000..5e131ccd --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/visualization.py @@ -0,0 +1,18 @@ +"""Mini HTML rendering for MRR eval fixture.""" + + +def esc_h(s): + """Replace dangerous characters so the string is safe inside HTML.""" + out = s.replace("&", "&") + out = out.replace("<", "<").replace(">", ">") + out = out.replace('"', """) + out = out.replace("`", "`") + return out + + +def render_html(graph): + buf = ["") + return "".join(buf) diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 2f5cc4b0..a6283e2c 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -8,16 +8,25 @@ import pytest from code_review_graph.embeddings import ( + _BODY_MAX_CHARS_BY_PROVIDER, + _BODY_MAX_CHARS_FALLBACK, + _EMBED_META_KEY_BODY_ENABLED, LOCAL_DEFAULT_MODEL, EmbeddingStore, LocalEmbeddingProvider, MiniMaxEmbeddingProvider, OpenAIEmbeddingProvider, + _body_enabled, _cosine_similarity, _decode_vector, _encode_vector, + _extract_body_text, + _FileLineCache, _is_localhost_url, + _looks_like_signature, _node_to_text, + _resolve_body_max_chars, + _split_combined_hash, get_provider, ) from code_review_graph.graph import GraphNode @@ -1034,3 +1043,748 @@ def test_subdomain_spoof_triggers_warning(self, capsys): get_provider("openai") captured = capsys.readouterr() assert "cloud" in captured.err.lower() + + +# --------------------------------------------------------------------------- +# Body enrichment tests (Iter 3 final: ADR-A1+dedup / B4 per-provider / C3 sticky) +# --------------------------------------------------------------------------- + + +def _mk_node(**kwargs): + """Minimal GraphNode factory for 
body-enrichment tests.""" + defaults = dict( + id=1, kind="Function", name="my_func", + qualified_name="sample.py::my_func", file_path="sample.py", + line_start=1, line_end=5, language="python", + parent_name=None, params=None, return_type=None, + is_test=False, file_hash=None, extra={}, + ) + defaults.update(kwargs) + return GraphNode(**defaults) + + +class TestFileLineCache: + def test_reads_existing_file(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("line1\nline2\nline3\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("a.py", 1, 3) == ["line1", "line2", "line3"] + assert cache.get_lines("a.py", 2, 2) == ["line2"] + + def test_caches_across_calls(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("x\ny\nz\n") + cache = _FileLineCache(repo_root=tmp_path) + cache.get_lines("a.py", 1, 2) + cache.get_lines("a.py", 2, 3) + cache.get_lines("a.py", 1, 1) + assert cache._read_count == 1 + + def test_missing_file_returns_empty(self, tmp_path): + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("does_not_exist.py", 1, 5) == [] + + def test_binary_file_detected_returns_empty(self, tmp_path): + p = tmp_path / "b.bin" + p.write_bytes(b"abc\x00def\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("b.bin", 1, 3) == [] + + def test_out_of_range_lines_returns_empty(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("only-one\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("a.py", 999, 1000) == [] + assert cache.get_lines("a.py", 0, 5) == [] + assert cache.get_lines("a.py", 3, 1) == [] + + def test_lru_eviction_at_max_entries(self, tmp_path): + cache = _FileLineCache(repo_root=tmp_path, max_entries=4) + for i in range(6): + p = tmp_path / f"f{i}.py" + p.write_text(f"content {i}\n") + cache.get_lines(f"f{i}.py", 1, 1) + # Earliest two entries should have been evicted + assert "f0.py" not in cache._cache + assert "f1.py" not in cache._cache + assert "f5.py" in 
cache._cache + assert len(cache._cache) == 4 + + +class TestSignatureDedup: + @pytest.mark.parametrize("lang,line,name,params", [ + ("python", "def my_func(x, y):", "my_func", "(x, y)"), + ("javascript", "function handle(req, res) {", "handle", "(req, res)"), + ("typescript", "const handle = (req: Req) => {", "handle", "(req: Req)"), + ("java", "public int compute(int x) {", "compute", "(int x)"), + ("rust", "pub fn build_tree(root: Node) -> Tree {", + "build_tree", "(root: Node)"), + ]) + def test_signature_line_deduplication(self, lang, line, name, params): + node = _mk_node(language=lang, name=name, params=params, + return_type="int" if lang == "java" else None) + assert _looks_like_signature(line, node) is True + + def test_signature_dedup_preserves_body_when_no_match(self): + # Decoy signature-like text but node.name is different -> must NOT dedup + node = _mk_node(language="python", name="compute_totals", + params="(items)") + assert _looks_like_signature("def unrelated(x): pass", node) is False + + def test_signature_dedup_respects_params_mismatch(self): + # Overload with different param first-token + node = _mk_node(language="python", name="handle", params="(request)") + assert _looks_like_signature( + "def handle(completely_different_arg):", node, + ) is False + + def test_signature_dedup_unknown_language_conservative(self): + node = _mk_node(language="kotlin2", name="foo", params="()") + assert _looks_like_signature("fun foo() {}", node) is False + + def test_signature_dedup_go_and_typescript(self): + go_node = _mk_node(language="go", name="Handle", params="(w, r)") + assert _looks_like_signature( + "func Handle(w http.ResponseWriter, r *http.Request) {", go_node, + ) is True + ts_node = _mk_node(language="typescript", name="doThing", params="(x)") + assert _looks_like_signature("async function doThing(x) {", ts_node) is True + + def test_decorator_line_not_mistaken_for_signature(self): + # Iter 3 D-iter3-5: @decorator line has node.name false-positive 
risk + # because tree-sitter may include the decorator in the node range. + # Three-AND gate requires a declaration keyword; @my_decorator has none. + node = _mk_node(language="python", name="my_func", params="(x)") + assert _looks_like_signature("@my_decorator", node) is False + # Also the real def line DOES match. + assert _looks_like_signature("def my_func(x):", node) is True + + def test_recursive_call_first_body_line_preserved(self): + # Iter 3 D-iter3-5: body first line `return fact(n-1)` contains + # node.name but lacks `def ` keyword, so three-AND gate returns False + # and dedup does NOT drop the line. + node = _mk_node(language="python", name="fact", params="(n)") + assert _looks_like_signature("return n * fact(n - 1)", node) is False + + +class TestBlockCommentStateMachine: + def _extract(self, fixture_text: str, tmp_path, **node_kwargs) -> str: + p = tmp_path / "f.py" + p.write_text(fixture_text) + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node( + file_path="f.py", + line_start=node_kwargs.get("line_start", 1), + line_end=node_kwargs.get("line_end", fixture_text.count("\n") + 1), + language=node_kwargs.get("language", "python"), + name=node_kwargs.get("name", "never-matches-anything"), + ) + return _extract_body_text(node, reader, max_chars=4000, max_lines=40) + + def test_jsdoc_multiline_header_skipped(self, tmp_path): + src = ( + "/**\n" + " * docstring line 1\n" + " * docstring line 2\n" + " */\n" + "return 42;\n" + ) + out = self._extract(src, tmp_path, language="javascript") + assert "docstring" not in out + assert "return 42" in out + + def test_javadoc_preserves_body(self, tmp_path): + src = ( + "/** header doc */\n" + "int answer = 42;\n" + "return answer;\n" + ) + out = self._extract(src, tmp_path, language="java") + assert "header doc" not in out + assert "answer = 42" in out or "return answer" in out + + def test_block_comment_unclosed_falls_back(self, tmp_path): + # Unclosed /* -> state machine stays in block_comment, drops 
everything. + # Fallback cap prevents hang, function returns "". + src = "/*\n" + ("noise line\n" * 25) + out = self._extract(src, tmp_path, language="javascript") + assert out == "" + + def test_inline_block_comment_not_swallowed(self, tmp_path): + src = "/* x */ real_code = 1\nmore = 2\n" + out = self._extract(src, tmp_path, language="javascript") + # inline /* x */ with trailing code: should NOT enter block state + assert "real_code = 1" in out or "more = 2" in out + + def test_python_single_line_triple_quote_docstring_skipped(self, tmp_path): + src = '"""one-liner docstring."""\nreturn value\n' + out = self._extract(src, tmp_path, language="python") + assert "one-liner docstring" not in out + assert "return value" in out + + def test_python_multiline_triple_quote_docstring_skipped(self, tmp_path): + src = ( + '"""\n' + " line 1\n" + " line 2\n" + '"""\n' + "compute = 3\n" + ) + out = self._extract(src, tmp_path, language="python") + assert "line 1" not in out and "line 2" not in out + assert "compute = 3" in out + + def test_comment_prefix_dropped_regardless_of_length(self, tmp_path): + long_comment = "# " + ("x" * 200) # >> 120 chars + src = f"{long_comment}\nreal = 1\n" + out = self._extract(src, tmp_path, language="python") + assert long_comment not in out + assert "real = 1" in out + + +class TestBodyExtraction: + def _extract(self, src: str, tmp_path, **node_kwargs) -> str: + p = tmp_path / "f.py" + p.write_text(src) + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node( + file_path="f.py", + line_start=node_kwargs.get("line_start", 1), + line_end=node_kwargs.get("line_end", src.count("\n") + 1), + language=node_kwargs.get("language", "python"), + kind=node_kwargs.get("kind", "Function"), + name=node_kwargs.get("name", "whatever-no-sig-match"), + ) + return _extract_body_text( + node, reader, + max_chars=node_kwargs.get("max_chars", 4000), + max_lines=node_kwargs.get("max_lines", 40), + ) + + def test_python_function_body(self, tmp_path): + src = 
( + "def parse_tree(data):\n" + " root = build_root(data)\n" + " walk_all_nodes(root)\n" + " return root\n" + ) + out = self._extract( + src, tmp_path, name="parse_tree", line_start=1, line_end=4, + ) + # signature line dropped via dedup; body should contain real code + assert "build_root" in out + assert "walk_all_nodes" in out + + def test_file_node_returns_empty(self, tmp_path): + p = tmp_path / "x.py" + p.write_text("print('hi')\n") + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node(kind="File", file_path="x.py", line_start=1, line_end=1) + assert _extract_body_text(node, reader, max_chars=1000) == "" + + def test_leading_triple_quote_docstring_skipped(self, tmp_path): + src = '"""module docstring."""\nvalue = 1\n' + out = self._extract(src, tmp_path, name="never") + assert "module docstring" not in out + assert "value = 1" in out + + def test_comment_only_region_skipped(self, tmp_path): + src = "# comment 1\n# comment 2\nactual_code = 42\n" + out = self._extract(src, tmp_path, name="never") + assert "comment 1" not in out + assert "actual_code" in out + + def test_truncation_by_chars(self, tmp_path): + src = "value_a = {}\n".format("x" * 100) # single long line + out = self._extract( + src, tmp_path, name="never", + max_chars=40, line_start=1, line_end=1, + ) + assert out.endswith("…") + assert len(out) <= 45 # max_chars + suffix overhead + + def test_truncation_by_lines(self, tmp_path): + lines = "\n".join(f"stmt_{i} = {i}" for i in range(30)) + "\n" + out = self._extract( + src=lines, tmp_path=tmp_path, name="never", + max_chars=100000, max_lines=3, + ) + # only 3 stmts kept; stmt_3+ must not appear + assert "stmt_0" in out + assert "stmt_29" not in out + + def test_single_line_node(self, tmp_path): + src = "def tiny(): return 1\n" + out = self._extract( + src, tmp_path, name="tiny", params="()", + line_start=1, line_end=1, + ) + # signature dedup drops the only line; body should be empty + assert out == "" + + +class 
TestBodyMaxCharsPerProvider: + @pytest.mark.parametrize("provider_name,expected", [ + ("local:all-MiniLM-L6-v2", _BODY_MAX_CHARS_BY_PROVIDER["local"]), + ("google:gemini-embedding-001", _BODY_MAX_CHARS_BY_PROVIDER["google"]), + ("minimax:embo-01", _BODY_MAX_CHARS_BY_PROVIDER["minimax"]), + ("openai:text-embedding-3-small", _BODY_MAX_CHARS_BY_PROVIDER["openai"]), + ]) + def test_body_max_chars_per_provider(self, provider_name, expected): + with patch.dict(os.environ, {}, clear=False): + os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None) + assert _resolve_body_max_chars(provider_name) == expected + + def test_body_max_chars_env_override(self): + with patch.dict(os.environ, {"CRG_EMBED_BODY_MAX_CHARS": "1234"}): + assert _resolve_body_max_chars("local:anything") == 1234 + assert _resolve_body_max_chars("openai:foo") == 1234 + + def test_body_max_chars_unknown_provider_fallback(self): + with patch.dict(os.environ, {}, clear=False): + os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None) + assert _resolve_body_max_chars("exotic:brand-new") == \ + _BODY_MAX_CHARS_FALLBACK + + def test_body_max_chars_provider_prefix_parsing(self): + with patch.dict(os.environ, {}, clear=False): + os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None) + assert _resolve_body_max_chars("openai:text-embedding-v4") == \ + _BODY_MAX_CHARS_BY_PROVIDER["openai"] + assert _resolve_body_max_chars("OPENAI:foo") == \ + _BODY_MAX_CHARS_BY_PROVIDER["openai"] + + +class TestBodyEnabledStoreGate: + """ADR-C3 sticky flag (Iter 3 D-iter3-1 / D-iter3-2).""" + + def _fresh_store(self, tmp_path): + with patch("code_review_graph.embeddings.get_provider", return_value=None): + return EmbeddingStore(tmp_path / "e.db") + + def _scrub_env(self): + for k in ("CRG_EMBED_INCLUDE_BODY",): + os.environ.pop(k, None) + + def test_body_enabled_empty_db_auto_enables(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + assert _body_enabled(store) is True + assert 
store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_body_enabled_nonempty_db_requires_opt_in(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + # Simulate an existing embedding row without flag being set yet. + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1, 0.2]), "abc", "unknown"), + ) + store._conn.commit() + assert _body_enabled(store) is False + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + finally: + store.close() + + def test_body_enabled_env_on_overrides(self, tmp_path): + with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "1"}): + store = self._fresh_store(tmp_path) + try: + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1]), "abc", "unknown"), + ) + store._conn.commit() + assert _body_enabled(store) is True + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_body_enabled_env_off_overrides(self, tmp_path): + with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "0"}): + store = self._fresh_store(tmp_path) + try: + assert _body_enabled(store) is False + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + finally: + store.close() + + def test_sticky_flag_empty_db_writes_on(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) is None + _body_enabled(store) + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_sticky_flag_existing_db_writes_off(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", 
_encode_vector([0.1]), "abc", "unknown"), + ) + store._conn.commit() + _body_enabled(store) + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + finally: + store.close() + + def test_sticky_flag_persists_across_runs(self, tmp_path): + """Iter 2 mid-run bug regression: once on, stays on after count() > 0.""" + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + # Run 1: empty DB -> auto-on, flag = "1" + assert _body_enabled(store) is True + # Simulate embeddings landing (count > 0). + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1]), "h", "unknown"), + ) + store._conn.commit() + assert store.count() > 0 + # Run 2: no env — MUST honour the "1" flag, NOT fall back to + # the count()>0 -> False branch. + assert _body_enabled(store) is True + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_sticky_flag_cli_flip_from_off_to_on(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + # Setup: legacy DB with flag="0" (decided at first embed without env) + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1]), "h", "unknown"), + ) + store._conn.commit() + assert _body_enabled(store) is False + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + # CLI user runs `embed --include-body --confirm-reembed` + # (sets env var before calling _body_enabled). + with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "1"}): + assert _body_enabled(store) is True + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + # Next run without env: flag is now "1" -> stays on. 
+ self._scrub_env() + assert _body_enabled(store) is True + finally: + store.close() + + +class TestFilterStageShortCircuit: + """Directive 5: metadata_hash short-circuit skips file IO for unchanged nodes.""" + + def _fake_provider(self, dim=4): + mock = MagicMock() + mock.name = "local:fake" + mock.embed = MagicMock(return_value=[[0.1] * dim, [0.2] * dim, [0.3] * dim]) + mock.embed_query = MagicMock(return_value=[0.1] * dim) + return mock + + def _store_with_fake(self, tmp_path, provider): + with patch( + "code_review_graph.embeddings.get_provider", return_value=provider, + ): + return EmbeddingStore(tmp_path / "e.db") + + def test_incremental_skips_unchanged_node_without_file_read(self, tmp_path): + # Create real source file for node + src = tmp_path / "sample.py" + src.write_text( + "def compute(x):\n" + " a = x * 2\n" + " return a\n", + ) + provider = self._fake_provider() + # Force sticky flag on so stage-2 would normally read file + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + # Unchanged node carries a populated file_hash — only then + # can stage 1 safely short-circuit (empty fingerprints now + # force a re-read; see Codex round 3 regression fix). + node = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=3, + qualified_name="sample.py::compute", + file_hash="file-fingerprint-v1", + ) + # First embed — reader WILL be used; returns 1 embedded. + assert store.embed_nodes([node]) == 1 + provider.embed.reset_mock() + + # Now patch reader.get_lines so we can assert it's not hit + with patch.object( + _FileLineCache, "get_lines", + autospec=True, return_value=[], + ) as mock_get_lines: + # Re-embed same node — metadata + file_hash both match + # the stored combined hash → stage 1 short-circuits + # with no file IO. 
+ assert store.embed_nodes([node]) == 0 + assert mock_get_lines.call_count == 0 + assert provider.embed.call_count == 0 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_read_failure_does_not_freeze_node(self, tmp_path): + """Codex round 5: transient file-read failure (missing file, binary, + permission error, out-of-repo) must NOT persist an empty-body + sentinel that makes stage 1 skip forever. Once the file becomes + readable again the body must be recomputed.""" + src = tmp_path / "maybe.py" + src.write_text("def maybe():\n a = 1\n return a\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node = _mk_node( + language="python", name="maybe", params="()", + file_path=str(src), line_start=1, line_end=3, + qualified_name="maybe.py::maybe", + file_hash="maybe-fingerprint", + ) + # First embed with a patched reader that simulates a + # transient read failure (last_read_ok=False). + with patch.object( + _FileLineCache, "get_lines", + autospec=True, return_value=[], + ) as mock_get_lines: + def fake_fail(self, *a, **kw): + self._last_read_ok = False + return [] + mock_get_lines.side_effect = fake_fail + assert store.embed_nodes([node]) == 1 + + provider.embed.reset_mock() + # Second embed with a real reader — the failed body + # snippet should NOT short-circuit stage 1 (stored_body + # is ""), so stage 2 reads the file again and persists a + # proper body hash. + assert store.embed_nodes([node]) == 1 + assert provider.embed.call_count == 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_empty_body_does_not_reread_on_unchanged_run(self, tmp_path): + """Codex round 4: single-line functions whose body extracts to "" + used to fall through stage 1 on every incremental run. 
Stage 1 + must now skip them too (empty body hash is written as sha256('') + to distinguish 'body ran, produced nothing' from legacy rows).""" + src = tmp_path / "tiny.py" + src.write_text("def tiny(): return 1\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node = _mk_node( + language="python", name="tiny", params="()", + file_path=str(src), line_start=1, line_end=1, + qualified_name="tiny.py::tiny", + file_hash="tiny-fingerprint", + ) + assert store.embed_nodes([node]) == 1 + provider.embed.reset_mock() + with patch.object( + _FileLineCache, "get_lines", + autospec=True, return_value=[], + ) as mock_get_lines: + assert store.embed_nodes([node]) == 0 + assert mock_get_lines.call_count == 0 + assert provider.embed.call_count == 0 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_body_edit_with_empty_file_hash_still_reembeds(self, tmp_path): + """Codex round 3: when both stored and current file_hash are empty, + stage 1 must NOT short-circuit — otherwise body-only edits slip + through on callers that don't populate node.file_hash.""" + src = tmp_path / "sample.py" + src.write_text("def compute(x):\n return x * 2\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + # file_hash deliberately omitted (None → empty string slot) + node_v1 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + assert store.embed_nodes([node_v1]) == 1 + provider.embed.reset_mock() + src.write_text("def compute(x):\n return x * 3\n") + node_v2 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + # Without a file_hash signal we must re-read and 
recompute + # body_hash; the body text has changed so this is 1 embed. + assert store.embed_nodes([node_v2]) == 1 + assert provider.embed.call_count == 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_metadata_change_triggers_stage2_read(self, tmp_path): + src = tmp_path / "sample.py" + src.write_text("def compute(x):\n return x * 2\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + store.embed_nodes([node]) + provider.embed.reset_mock() + # Flip return_type => metadata_hash changes => stage 2 must run. + node_changed = _mk_node( + language="python", name="compute", params="(x)", + return_type="int", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + with patch.object( + _FileLineCache, "get_lines", autospec=True, + return_value=["def compute(x):", " return x * 2"], + ) as mock_get_lines: + assert store.embed_nodes([node_changed]) == 1 + assert mock_get_lines.call_count >= 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_split_combined_hash_legacy_row(self): + # Iter 1: pure sha256 hex, no separator + legacy = "a" * 64 + assert _split_combined_hash(legacy) == (legacy, "", "") + # Iter 2: metadata + body only (no file fingerprint yet) + iter2 = "abc:def" + assert _split_combined_hash(iter2) == ("abc", "def", "") + # Iter 2 body-disabled stored form + iter2_body_off = "abc:" + assert _split_combined_hash(iter2_body_off) == ("abc", "", "") + # Iter 3.2: metadata + body + file fingerprint + iter3 = "meta:body:fhash" + assert _split_combined_hash(iter3) == ("meta", "body", "fhash") + # Iter 3.2 body-disabled but file_hash captured + iter3_body_off = "meta::fhash" + 
assert _split_combined_hash(iter3_body_off) == ("meta", "", "fhash") + + def test_body_only_change_triggers_reembed(self, tmp_path): + """Codex-review regression: metadata-unchanged body edit used to be + skipped forever. Now file_hash shift forces stage-2 re-read.""" + src = tmp_path / "sample.py" + src.write_text("def compute(x):\n return x * 2\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node_v1 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", file_hash="h-v1", + ) + assert store.embed_nodes([node_v1]) == 1 + provider.embed.reset_mock() + # Body edits but metadata identical; file_hash changes. + src.write_text("def compute(x):\n return x * 3\n") + node_v2 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", file_hash="h-v2", + ) + assert store.embed_nodes([node_v2]) == 1 + assert provider.embed.call_count == 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + +class TestFileLineCacheContainment: + """Codex-review regression: file_path must stay inside repo_root.""" + + def test_absolute_path_outside_root_rejected(self, tmp_path): + outside = tmp_path.parent / ("outside-" + tmp_path.name + ".secret") + outside.write_text("top secret\n") + try: + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines(str(outside), 1, 1) == [] + finally: + outside.unlink(missing_ok=True) + + def test_relative_traversal_rejected(self, tmp_path): + outside = tmp_path.parent / "escape-target.secret" + outside.write_text("top secret\n") + try: + cache = _FileLineCache(repo_root=tmp_path) + # "../escape-target.secret" tries to pop out of the repo + assert cache.get_lines( + f"../{outside.name}", 1, 1, + ) == [] + finally: + 
outside.unlink(missing_ok=True) + + def test_inside_root_still_reads(self, tmp_path): + inside = tmp_path / "hello.py" + inside.write_text("x = 1\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("hello.py", 1, 1) == ["x = 1"] + + def test_no_root_preserves_absolute_behavior(self, tmp_path): + inside = tmp_path / "hello.py" + inside.write_text("y = 2\n") + # Without repo_root we intentionally skip the containment check so + # unit tests that construct nodes with absolute fixture paths still + # work (see TestFileLineCache.test_reads_existing_file style, but + # called with no repo_root). + cache = _FileLineCache(repo_root=None) + assert cache.get_lines(str(inside), 1, 1) == ["y = 2"] + + +class TestKotlinFunSignatureDedup: + """Codex-review regression: Kotlin ``fun`` without return type.""" + + def test_kotlin_fun_without_return_type(self): + node = _mk_node( + language="kotlin", name="save", params="(user: User)", + return_type=None, + ) + assert _looks_like_signature( + "override fun save(user: User) {", node, + ) is True diff --git a/tests/test_embeddings_mrr.py b/tests/test_embeddings_mrr.py new file mode 100644 index 00000000..0746ff38 --- /dev/null +++ b/tests/test_embeddings_mrr.py @@ -0,0 +1,183 @@ +"""MRR regression test for body-enriched embeddings (AC-6 Iter 3). + +Runs the Local sentence-transformers provider against a fixed mini_repo + +10 golden queries and asserts that enabling body enrichment does NOT +regress Mean Reciprocal Rank at top-3 beyond the -0.02 tolerance AND +produces a strictly positive delta (non-inferiority + soft signal). + +Skipped when ``sentence-transformers`` is not installed, so the project's +default ``.[dev]`` test matrix keeps passing; CI jobs that care about the +gate should install ``.[embeddings]`` to activate this module. 
+""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest + +from code_review_graph.graph import GraphNode + +pytest.importorskip("sentence_transformers") + +from code_review_graph.embeddings import ( # noqa: E402 — after importorskip + LocalEmbeddingProvider, + _cosine_similarity, + _FileLineCache, + _node_to_text, + _resolve_body_max_chars, +) + +FIXTURE_DIR = Path(__file__).parent / "fixtures" / "embeddings" / "eval" +MINI_REPO = FIXTURE_DIR / "mini_repo" + + +def _mk(qualified_name, file_rel, line_start, line_end, *, + kind="Function", name=None, parent_name=None, params=None, + return_type=None): + """Build a GraphNode for mini_repo fixture nodes.""" + return GraphNode( + id=0, kind=kind, + name=name or qualified_name.rsplit("::", 1)[1].split(".")[-1], + qualified_name=qualified_name, + file_path=str(MINI_REPO / file_rel), + line_start=line_start, line_end=line_end, + language="python", + parent_name=parent_name, params=params, return_type=return_type, + is_test=False, file_hash=None, extra={}, + ) + + +# Accurate line ranges from mini_repo files (see tests/fixtures/.../mini_repo/). 
+NODES: list[GraphNode] = [ + # parser.py + _mk("parser.py::parse_tree", "parser.py", 4, 9, params="(source)"), + _mk("parser.py::walk_nodes", "parser.py", 11, 16, params="(tree, visitor)"), + _mk("parser.py::extract_functions", "parser.py", 19, 22, params="(tree)"), + _mk("parser.py::tokenize", "parser.py", 25, 26, params="(source)"), + _mk("parser.py::build_root", "parser.py", 29, 33, params="(tokens)"), + # graph.py + _mk("graph.py::Graph.__init__", "graph.py", 7, 9, params="(self)", + parent_name="Graph"), + _mk("graph.py::Graph.add_node", "graph.py", 11, 13, params="(self, name)", + parent_name="Graph"), + _mk("graph.py::Graph.add_edge", "graph.py", 15, 16, params="(self, src, dst)", + parent_name="Graph"), + _mk("graph.py::Graph.neighbors", "graph.py", 18, 19, params="(self, name)", + parent_name="Graph"), + _mk("graph.py::Graph.impact_radius","graph.py", 21, 32, + params="(self, start, depth)", parent_name="Graph"), + # embeddings.py + _mk("embeddings.py::cosine_similarity", "embeddings.py", 4, 12, + params="(a, b)"), + _mk("embeddings.py::embed_text", "embeddings.py", 15, 17, + params="(text)"), + _mk("embeddings.py::EmbeddingIndex.__init__", "embeddings.py", 21, 22, + params="(self)", parent_name="EmbeddingIndex"), + _mk("embeddings.py::EmbeddingIndex.add", "embeddings.py", 24, 25, + params="(self, qname, vec)", parent_name="EmbeddingIndex"), + _mk("embeddings.py::EmbeddingIndex.search", "embeddings.py", 27, 32, + params="(self, query_vec, limit=10)", parent_name="EmbeddingIndex"), + # visualization.py + _mk("visualization.py::esc_h", "visualization.py", 4, 10, params="(s)"), + _mk("visualization.py::render_html", "visualization.py", 13, 18, params="(graph)"), + # api_client.py + _mk("api_client.py::OpenAIClient.__init__", "api_client.py", 7, 10, + params="(self)", parent_name="OpenAIClient"), + _mk("api_client.py::GeminiClient.__init__", "api_client.py", 14, 17, + params="(self)", parent_name="GeminiClient"), + _mk("api_client.py::MinimaxClient.__init__", 
"api_client.py", 21, 24, + params="(self)", parent_name="MinimaxClient"), + # incremental.py + _mk("incremental.py::detect_changes", "incremental.py", 6, 12, + params="(repo_root, since_ref)"), + _mk("incremental.py::get_changed_files", "incremental.py", 15, 16, + params="(repo_root)"), +] + + +def _rank_top(query_vec, doc_vecs, qnames, limit=3): + scored = [ + (qn, _cosine_similarity(query_vec, v)) + for qn, v in zip(qnames, doc_vecs) + ] + scored.sort(key=lambda x: x[1], reverse=True) + return [qn for qn, _ in scored[:limit]] + + +def _reciprocal_rank(ranked_top: list[str], expected: set[str]) -> float: + for i, qn in enumerate(ranked_top, start=1): + if qn in expected: + return 1.0 / i + return 0.0 + + +def _mrr(rankings: list[list[str]], expectations: list[set[str]]) -> float: + if not rankings: + return 0.0 + return sum( + _reciprocal_rank(r, e) for r, e in zip(rankings, expectations) + ) / len(rankings) + + +def test_mrr_body_not_worse_than_legacy(): + """AC-6 (Iter 3 D-iter3-9): non-inferiority MRR gate. + + Two runs with the Local provider: + - legacy: ``_node_to_text(node, reader=None)`` — metadata only + - body-on: ``_node_to_text(node, reader, max_chars=...)`` — body enriched + + The body-on run must not regress MRR@3 beyond the tolerance + (``MRR_on >= MRR_legacy - 0.02``) AND must strictly improve it + (``MRR_on > MRR_legacy``). + """ + provider = LocalEmbeddingProvider() + max_chars = _resolve_body_max_chars(provider.name) + reader = _FileLineCache() + + legacy_texts = [_node_to_text(n, reader=None) for n in NODES] + body_texts = [ + _node_to_text(n, reader=reader, max_chars=max_chars) for n in NODES + ] + + legacy_vecs = provider.embed(legacy_texts) + body_vecs = provider.embed(body_texts) + + qnames = [n.qualified_name for n in NODES] + + queries = json.loads((FIXTURE_DIR / "golden_queries.json").read_text()) + # Sanity: all golden expected nodes exist in our fixture. 
+ known = set(qnames) + for q in queries: + for exp in q["expected_top3"]: + assert exp in known, f"Golden query references unknown node: {exp}" + + query_texts = [q["query"] for q in queries] + query_vecs = [provider.embed_query(t) for t in query_texts] + expectations = [set(q["expected_top3"]) for q in queries] + + legacy_rankings = [_rank_top(qv, legacy_vecs, qnames) for qv in query_vecs] + body_rankings = [_rank_top(qv, body_vecs, qnames) for qv in query_vecs] + + mrr_legacy = _mrr(legacy_rankings, expectations) + mrr_body = _mrr(body_rankings, expectations) + + # Diagnostic logging (captured on failure) + print(f"MRR@3 legacy={mrr_legacy:.4f} body={mrr_body:.4f}") + for q, lr, br in zip(queries, legacy_rankings, body_rankings): + print(f" [{q['type']:19s}] {q['query']}") + print(f" legacy top-3: {lr}") + print(f" body top-3 : {br}") + + # non-inferiority hard gate (must not regress beyond tolerance) + assert mrr_body >= mrr_legacy - 0.02, ( + f"body enrichment regressed MRR beyond tolerance " + f"(body={mrr_body:.4f}, legacy={mrr_legacy:.4f}, " + f"delta={mrr_body - mrr_legacy:+.4f})" + ) + # soft signal: must strictly improve (not just match) + assert mrr_body > mrr_legacy, ( + f"body enrichment failed to improve MRR " + f"(body={mrr_body:.4f}, legacy={mrr_legacy:.4f})" + ) From c74bc3dc7514425a1dc0520bf64c10b277f75eff Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 14:38:37 +0800 Subject: [PATCH 2/9] fix(embeddings): respect CRG_DATA_DIR in _repo_root_guess MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: Codex round 7 caught that the legacy inference ``db_path.parent -> repo root`` silently breaks for the supported ``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir, not inside ``/.code-review-graph/``). 
In that case ``_FileLineCache`` received the cache dir as repo_root and every body-snippet read failed the containment check, making body enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two- stage filter then wrote metadata-only hashes, so even opting back in via ``embed --include-body --confirm-reembed`` wouldn't help. What: ``_repo_root_guess`` now: 1. keeps the fast path when db_path parent is literally ``.code-review-graph`` (zero-config common case, no cost) 2. otherwise delegates to ``incremental.find_project_root`` which already honors ``CRG_REPO_ROOT`` and walks up for a ``.git`` marker — the same resolution order the rest of the CLI uses 3. returns None only when both fail, letting callers treat absolute file_path values as self-locating Regression test: ``TestRepoRootGuessExternalDataDir`` covers the legacy layout, the env-override fallback, and an end-to-end read through ``_FileLineCache`` from an external cache dir. Existing ``TestFilterStageShortCircuit`` fixtures now write their DB under ``/.code-review-graph/`` so they hit the fast path rather than the cwd-relative fallback. --- code_review_graph/embeddings.py | 32 +++++++++---- tests/test_embeddings.py | 85 ++++++++++++++++++++++++++++++++- 2 files changed, 108 insertions(+), 9 deletions(-) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 6badc0a6..05da85b2 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -1189,20 +1189,36 @@ def close(self) -> None: self._conn.close() def _repo_root_guess(self) -> Path | None: - """Best-effort root inference for the ``_FileLineCache``. - - Graph files live in ``.code-review-graph/graph.db``; walk upward from - ``db_path`` to find that directory and return its parent. Returns - ``None`` if we can't pin it down — callers interpret that as - "use absolute file paths only". + """Best-effort repo-root inference for the ``_FileLineCache``. 
+ + Priority (Codex round 7 fix — ``CRG_DATA_DIR`` lets users keep + ``graph.db`` outside the repo, so ``db_path.parent`` is NOT a + reliable proxy for the repo root in that case): + + 1. Legacy layout ``/.code-review-graph/graph.db`` — if + the parent directory is named ``.code-review-graph`` the + repo root is its parent. This keeps the common case + zero-config and cheap. + 2. Delegate to :func:`incremental.find_project_root`, which + honors ``CRG_REPO_ROOT`` first and otherwise walks up from + cwd for a ``.git`` (or SVN) marker. This covers + ``CRG_DATA_DIR`` external-cache setups and MCP servers + launched with the repo as cwd. + 3. ``None`` if neither yields a path we can resolve — callers + must treat that as "use absolute file paths only" (which + graph-emitted ``file_path`` already is). """ try: parent = self.db_path.resolve().parent if parent.name == ".code-review-graph": return parent.parent - # db_path may live directly inside a repo root (tests/tmp_path) - return parent except OSError: + pass + try: + from .incremental import find_project_root + root = find_project_root() + return root.resolve() if root else None + except Exception: return None # --- embeddings_meta (Iter 3 D-iter3-1 sticky flag storage) --- diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index a6283e2c..5da4961e 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1498,10 +1498,18 @@ def _fake_provider(self, dim=4): return mock def _store_with_fake(self, tmp_path, provider): + # Put the DB under ``/.code-review-graph/`` so + # ``_repo_root_guess`` takes the legacy fast-path and returns + # ``tmp_path`` — that's where these tests stage their fixture + # source files. Without this layout the fallback walks up from + # cwd and returns the real project root, which sits outside + # ``tmp_path`` and makes every body read fail containment. 
+ crg_dir = tmp_path / ".code-review-graph" + crg_dir.mkdir(exist_ok=True) with patch( "code_review_graph.embeddings.get_provider", return_value=provider, ): - return EmbeddingStore(tmp_path / "e.db") + return EmbeddingStore(crg_dir / "e.db") def test_incremental_skips_unchanged_node_without_file_read(self, tmp_path): # Create real source file for node @@ -1777,6 +1785,81 @@ def test_no_root_preserves_absolute_behavior(self, tmp_path): assert cache.get_lines(str(inside), 1, 1) == ["y = 2"] +class TestRepoRootGuessExternalDataDir: + """Codex round 7 regression: ``CRG_DATA_DIR`` / ``CRG_REPO_ROOT``. + + When ``graph.db`` lives outside the repo (``CRG_DATA_DIR`` is set), + ``_repo_root_guess`` must NOT return the cache directory — otherwise + ``_FileLineCache`` resolves repo-relative paths against the cache + and every body read fails the containment check, silently degrading + to metadata-only embeddings. We verify the guess falls back to + ``find_project_root`` (which honors ``CRG_REPO_ROOT`` or walks up to + a ``.git`` marker) in that case, while preserving the fast legacy + ``/.code-review-graph/graph.db`` inference. 
+ """ + + def _store(self, db_path): + with patch( + "code_review_graph.embeddings.get_provider", return_value=None, + ): + return EmbeddingStore(db_path) + + def test_legacy_layout_infers_repo_from_db_parent(self, tmp_path): + repo = tmp_path / "repo" + (repo / ".code-review-graph").mkdir(parents=True) + (repo / ".git").mkdir() + db = repo / ".code-review-graph" / "graph.db" + db.touch() + store = self._store(db) + try: + assert store._repo_root_guess() == repo.resolve() + finally: + store.close() + + def test_external_data_dir_falls_back_to_crg_repo_root(self, tmp_path): + repo = tmp_path / "my-repo" + (repo / ".git").mkdir(parents=True) + (repo / "sample.py").write_text("def f():\n return 42\n") + cache = tmp_path / "external-cache" + cache.mkdir() + db = cache / "graph.db" + db.touch() + with patch.dict( + os.environ, {"CRG_REPO_ROOT": str(repo)}, + ): + store = self._store(db) + try: + # Without the fix this would point at ``cache`` and make + # every downstream read fail containment. + assert store._repo_root_guess() == repo.resolve() + finally: + store.close() + + def test_external_data_dir_resolves_file_read(self, tmp_path): + repo = tmp_path / "my-repo" + (repo / ".git").mkdir(parents=True) + (repo / "sample.py").write_text("def f():\n return 42\n") + cache = tmp_path / "external-cache" + cache.mkdir() + db = cache / "graph.db" + db.touch() + with patch.dict( + os.environ, {"CRG_REPO_ROOT": str(repo)}, + ): + store = self._store(db) + try: + root = store._repo_root_guess() + reader = _FileLineCache(repo_root=root) + # Repo-relative file_path resolves correctly against the + # real repo, not the external cache dir. 
+ assert reader.get_lines("sample.py", 1, 2) == [ + "def f():", " return 42", + ] + assert reader.last_read_ok is True + finally: + store.close() + + class TestKotlinFunSignatureDedup: """Codex-review regression: Kotlin ``fun`` without return type.""" From 970c46b06d1f4523aaa048d1ae65a07bae497355 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 14:41:49 +0800 Subject: [PATCH 3/9] fix(embeddings): skip fast path when CRG_DATA_DIR is set MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: Codex round 8 found the round-7 fast path still misfires when ``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g. ``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or ``/shared/cache//.code-review-graph``. In that case ``db_path.parent.name == ".code-review-graph"`` is True, the fast path returns ``parent.parent`` (``/var/tmp`` or the shared-cache parent), and ``_FileLineCache`` silently falls back to metadata for every node. My round-7 tests only covered an external dir NOT named ``.code-review-graph`` so the regression slipped. What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When that env var is set we always defer to ``incremental.find_project_root`` regardless of what ``db_path`` looks like; the zero-config fast path only runs when the user is on the default ``/.code-review-graph/graph.db`` layout and has NOT overridden the data dir. Regression test: ``test_crg_data_dir_named_dot_code_review_graph`` exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at a ``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT`` pointed at a distant ``real-repo`` tree — and asserts the root resolves to ``real-repo``, not the cache's parent. 
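The misfire described above can be reproduced with nothing but `pathlib` — a minimal illustration (the paths are hypothetical examples, matching the ones named in this message):

```python
from pathlib import Path

# Illustrative only: why the name-based fast path misfires when
# CRG_DATA_DIR itself ends in ".code-review-graph".
db = Path("/var/tmp/.code-review-graph/graph.db")

# The fast-path check succeeds...
assert db.parent.name == ".code-review-graph"
# ...but parent.parent is the cache dir's parent ("/var/tmp"),
# which is not a repo root at all.
assert db.parent.parent == Path("/var/tmp")
```

Hence the fix below: when the env override is present, skip the directory-name heuristic entirely and defer to `find_project_root`.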
--- code_review_graph/embeddings.py | 50 ++++++++++++++++++--------------- tests/test_embeddings.py | 33 ++++++++++++++++++++++ 2 files changed, 61 insertions(+), 22 deletions(-) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 05da85b2..6e9078f9 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -1191,29 +1191,35 @@ def close(self) -> None: def _repo_root_guess(self) -> Path | None: """Best-effort repo-root inference for the ``_FileLineCache``. - Priority (Codex round 7 fix — ``CRG_DATA_DIR`` lets users keep - ``graph.db`` outside the repo, so ``db_path.parent`` is NOT a - reliable proxy for the repo root in that case): - - 1. Legacy layout ``/.code-review-graph/graph.db`` — if - the parent directory is named ``.code-review-graph`` the - repo root is its parent. This keeps the common case - zero-config and cheap. - 2. Delegate to :func:`incremental.find_project_root`, which - honors ``CRG_REPO_ROOT`` first and otherwise walks up from - cwd for a ``.git`` (or SVN) marker. This covers - ``CRG_DATA_DIR`` external-cache setups and MCP servers - launched with the repo as cwd. - 3. ``None`` if neither yields a path we can resolve — callers - must treat that as "use absolute file paths only" (which - graph-emitted ``file_path`` already is). + Priority (Codex rounds 7 + 8 — ``CRG_DATA_DIR`` lets users keep + ``graph.db`` outside the repo, so neither ``db_path.parent`` nor + even the ``.code-review-graph`` directory name is a reliable + proxy for the repo root in that case; an override like + ``CRG_DATA_DIR=/var/tmp/.code-review-graph`` would otherwise + hand us ``/var/tmp`` as the repo root): + + 1. When ``CRG_DATA_DIR`` is NOT set, trust the zero-config + default layout: if ``db_path.parent`` is literally named + ``.code-review-graph`` then the repo root is its parent. + This keeps the common case cheap (no env reads, no dir + walk). + 2. 
Otherwise — ``CRG_DATA_DIR`` is set, or the DB lives + somewhere non-standard — delegate to + :func:`incremental.find_project_root`, which honors + ``CRG_REPO_ROOT`` first and otherwise walks up from cwd + for a ``.git`` (or SVN) marker. + 3. Returns ``None`` only when both paths fail to resolve; + callers then treat absolute ``file_path`` values on + ``GraphNode`` as self-locating. """ - try: - parent = self.db_path.resolve().parent - if parent.name == ".code-review-graph": - return parent.parent - except OSError: - pass + data_dir_override = os.environ.get("CRG_DATA_DIR", "").strip() + if not data_dir_override: + try: + parent = self.db_path.resolve().parent + if parent.name == ".code-review-graph": + return parent.parent + except OSError: + pass try: from .incremental import find_project_root root = find_project_root() diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 5da4961e..5606de64 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1835,6 +1835,39 @@ def test_external_data_dir_falls_back_to_crg_repo_root(self, tmp_path): finally: store.close() + def test_crg_data_dir_named_dot_code_review_graph(self, tmp_path): + """Codex round 8: a ``CRG_DATA_DIR`` whose last segment happens + to be ``.code-review-graph`` (e.g. ``/var/tmp/.code-review-graph`` + or a shared cache dir set up that way) must NOT trigger the + fast path — ``db_path.parent.parent`` would be the cache's + parent directory, not the repo root. When the env override is + set we always defer to ``find_project_root`` so that the + repo-root resolution uses the real git walk / ``CRG_REPO_ROOT`` + override instead of a coincidental directory name.""" + repo = tmp_path / "real-repo" + (repo / ".git").mkdir(parents=True) + (repo / "sample.py").write_text("x = 1\n") + # Cache dir whose last segment IS '.code-review-graph' but + # sits far away from the real repo — classic misfire case. 
+ cache = tmp_path / "var-tmp" / ".code-review-graph" + cache.mkdir(parents=True) + db = cache / "graph.db" + db.touch() + with patch.dict( + os.environ, + {"CRG_DATA_DIR": str(cache), "CRG_REPO_ROOT": str(repo)}, + ): + store = self._store(db) + try: + root = store._repo_root_guess() + # WRONG (fast-path trap): tmp_path / "var-tmp" + # RIGHT: repo (via find_project_root + CRG_REPO_ROOT) + assert root == repo.resolve() + reader = _FileLineCache(repo_root=root) + assert reader.get_lines("sample.py", 1, 1) == ["x = 1"] + finally: + store.close() + def test_external_data_dir_resolves_file_read(self, tmp_path): repo = tmp_path / "my-repo" (repo / ".git").mkdir(parents=True) From 7daf93ccb8b5572273029a815fdae175d6562d32 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 20:57:03 +0800 Subject: [PATCH 4/9] feat(embeddings): drop pure log/print lines from body snippets MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: cross-repo benchmark showed body enrichment regresses on framework-heavy codebases (humanitzbot — TypeScript Discord bot — MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found that signature dedup + block-comment state machine are all working, but TS/Discord function bodies are thick with framework boilerplate that dilutes the retrieval signal. Examples from one regressing function: await interaction.deferReply(); const locale = interaction.locale; logger.info('handling power request', { signal }); console.log('...'); On small models (MiniLM: 256-token window), those lines fill the budget with `await`, `deferReply`, `locale`, `logger`, `console` — high-frequency framework tokens that don't discriminate between similarly-named commands (``_power`` vs ``_status`` vs ``_console``). What: add a language-agnostic check to the existing body-extraction state machine that skips lines which ARE purely a log/print call with no other executable content. 
Covers: * Python: ``print(...)``, ``logger.info(...)``, ``logging.debug(...)`` * JS/TS: ``console.log(...)``, ``logger.warn(...)`` * Java: ``System.out.println(...)``, ``log.debug(...)`` * Go: ``fmt.Println(...)``, ``log.Printf(...)`` * Rust: ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)`` Lines where a log call is mixed with substantive code (``if err: logger.error(err); raise``) are preserved — they DO carry retrieval signal. Scope: one regex + one helper + one extra branch in the existing state machine loop. Stays within ADR-A1 "language-agnostic body extraction". Does NOT introduce framework-specific heuristics — log/print is a universal pattern, not a framework one. Also does NOT make the body hash non-deterministic (same input still produces same output), so the combined-hash filter short-circuit (AC-13) still holds. Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust log patterns, an integration test verifying scattered log lines get stripped from a multi-statement TS body while real logic survives, plus negative tests for mixed-statement lines and non-log calls (``await interaction.deferReply()`` is NOT a log call and stays kept). Follow-up: framework-specific SDK boilerplate (Discord ``interaction.X``, React hooks, Spring ``@Autowired``) is where the next round of improvement lives, but that can't be done generically without either a tokenizer-based adaptive budget or per-language SDK awareness — both are out of scope for this PR. 
--- code_review_graph/embeddings.py | 48 ++++++++++++++++++ tests/test_embeddings.py | 87 +++++++++++++++++++++++++++++++++ 2 files changed, 135 insertions(+) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 6e9078f9..6641e482 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -13,6 +13,7 @@ import hashlib import logging import os +import re import sqlite3 import struct import sys @@ -708,6 +709,44 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: } _BODY_MAX_CHARS_FALLBACK = 700 _BODY_MAX_LINES_DEFAULT = 15 + +# Pure log/print lines carry almost no semantic signal for code retrieval +# — a function's purpose is almost never "the fact that it logs" — but +# framework-heavy codebases (Discord bots, Rails controllers, Flask +# views, Spring services) can easily burn the body budget on them. +# Skip lines that ARE a log/print call and nothing else. Language- +# agnostic: Python logger/logging/print, JS/TS console.X / logger.X, +# Java System.out.println + log.X, Go fmt.Print* + log.Print*, Rust +# println!/eprintln!/dbg!. Keep lines where the log is mixed with +# other statements. +_LOG_ONLY_LINE_RE = re.compile( + r'^\s*' + r'(?:' + # identifier.method(...) — catches logger.info / log.debug / + # self.logger.warning / console.log / trace.error / etc. + r'(?:[A-Za-z_][\w.]*)\s*\.\s*' + r'(?:log|debug|info|warn|warning|error|trace|fatal|notice|fine|finer|finest' + r'|Print|Println|Printf)\s*\(' + r'|' + # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...) + r'(?:print|println|eprintln|dbg)\s*!?\s*\(' + r'|' + # java: System.out.println(...) / System.err.print(...) + r'System\s*\.\s*(?:out|err)\s*\.\s*(?:print|println|printf)\s*\(' + r'|' + # go: fmt.Println(...) / fmt.Printf(...) / fmt.Fprintln(...) 
+ r'fmt\s*\.\s*(?:Print|Println|Printf|Fprintln|Fprintf|Fprint)\s*\(' + r')' +) + + +def _is_log_only_line(s: str) -> bool: + """Return True when the line is purely a log/print call with no + other executable content.""" + stripped = s.strip().rstrip(";").strip() + if not stripped.endswith(")"): + return False + return bool(_LOG_ONLY_LINE_RE.match(stripped)) _FILE_CACHE_MAX_ENTRIES = 128 _FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB _FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect @@ -1017,6 +1056,15 @@ def _extract_body_text( drops += 1 continue + # Pure log/print lines: drop anywhere in the body. Framework- + # heavy code (Discord / Rails / Flask / Spring) otherwise burns + # the budget on Discord.defer() / console.log / logger.info + # that carry no retrieval signal. Keep lines where logging is + # mixed with other statements. + if _is_log_only_line(s): + drops += 1 + continue + kept.append(line) # fallback safety cap diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 5606de64..118f76ff 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1893,6 +1893,93 @@ def test_external_data_dir_resolves_file_read(self, tmp_path): store.close() +class TestLogOnlyLineSkipping: + """Language-agnostic log/print-line skipping. + + Framework-heavy code (Discord bots, web controllers, server + handlers) interleaves ``console.log`` / ``logger.info`` / + ``fmt.Println`` / ``print()`` lines throughout function bodies. + Those lines carry almost no retrieval signal — a function's + purpose is rarely "the fact that it logs" — but they eat the body + char budget, and on small models (MiniLM 256-token window) they + crowd out the real logic tokens. Dropping pure log lines is a + language-agnostic way to raise token-to-signal density. 
+ """ + + def test_is_log_only_line_python(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" logger.info('handling request')") is True + assert _is_log_only_line("logging.debug('x', extra={'y': 1})") is True + assert _is_log_only_line(" print('hi')") is True + assert _is_log_only_line(" print('hi');") is True # trailing ; + + def test_is_log_only_line_js_ts(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" console.log('hi');") is True + assert _is_log_only_line(" console.error(err);") is True + assert _is_log_only_line("logger.warn(`connection ${host}`);") is True + + def test_is_log_only_line_java(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" System.out.println(\"hi\");") is True + assert _is_log_only_line("log.debug(msg);") is True + + def test_is_log_only_line_go(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" fmt.Println(msg)") is True + assert _is_log_only_line(" log.Printf(\"%v\", err)") is True + + def test_is_log_only_line_rust(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" println!(\"hi\");") is True + assert _is_log_only_line(" eprintln!(\"err: {}\", e);") is True + assert _is_log_only_line(" dbg!(value);") is True + + def test_is_log_only_line_keeps_mixed_statements(self): + """Log call mixed with other code is retrieval signal — keep it.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line( + "value = compute(); logger.info(value); return value" + ) is False + assert _is_log_only_line( + "if err: logger.error(err); raise" + ) is False + assert _is_log_only_line( + "x = logger.info('x')" # assignment, not pure log + ) is False + + def test_is_log_only_line_not_a_log(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line("return 
self.compute(x)") is False + assert _is_log_only_line("db.execute('SELECT *')") is False + assert _is_log_only_line("await interaction.deferReply()") is False + + def test_body_extraction_drops_scattered_log_lines(self, tmp_path): + """Log lines interleaved with real code get stripped, freeing + the budget for actual logic.""" + src = tmp_path / "handler.ts" + src.write_text( + "export async function handle(req) {\n" + " const user = await lookupUser(req.id);\n" + " logger.info('fetched user', { id: req.id });\n" + " const result = await computeScore(user);\n" + " console.log('score', result);\n" + " return { score: result, user: user.name };\n" + "}\n" + ) + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node( + language="typescript", name="handle", params="(req)", + file_path=str(src), line_start=1, line_end=7, + ) + body = _extract_body_text(node, reader, max_chars=4000) + assert "logger.info" not in body + assert "console.log" not in body + assert "lookupUser" in body + assert "computeScore" in body + assert "return { score" in body + + class TestKotlinFunSignatureDedup: """Codex-review regression: Kotlin ``fun`` without return type.""" From 56a317e6c94af9693cef278ce33e90263bdecb66 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 21:01:54 +0800 Subject: [PATCH 5/9] fix(embeddings): tighten log-only detection (Codex round 10) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 10 flagged two real over-matches in 7daf93c: 1. The generic ``identifier.(error|warn|info|...)`` branch swallowed non-logger APIs. In THIS repo it dropped ``serve_cmd.error(...)`` (argparse's error method, which raises rather than logs). Same class of issue for ``fmt.Fprintf(w, "ok")`` where ``w`` is an HTTP response writer — that IS the function's output, not a log. 2. ``re.match`` is prefix-anchored, so ``logger.info('x') or do_work()`` passed the log-only check even though it has substantive trailing code. 
Violated the "mixed-statement lines stay kept" design promise. Fixes: * Receiver identifier is now an explicit whitelist — ``log``, ``logger``, ``logging``, ``logs``, ``console``, ``slog``, ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing), ``self.log``/``this.log``/``_log``/``_logger`` and the ``self.logger``/``this.logger`` variants. Custom loggers named differently fall through — they'll be kept as real code. Better to under-skip than mis-skip. * Dropped ``fmt.Fprint*`` from the Go branch — those take an arbitrary ``io.Writer`` (commonly HTTP responses). Only ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like). * ``_is_log_only_line`` now walks the matched string after the opening ``(``, tracking string literals and paren depth, to find the balanced close. Anything non-trivial after that ``)`` means the line has other executable content and is NOT log-only. Regression tests: tail-continuation cases (``logger.info('x') or do_work()``), non-logger identifier cases (``serve_cmd.error``, ``db.warn``, ``item.notice``), and ``fmt.Fprintf`` cases now stay kept. Pure-log cases including nested-paren args (``logger.info('val=' + str(compute(x, y)))``) still get dropped. --- code_review_graph/embeddings.py | 69 ++++++++++++++++++++++++++++----- tests/test_embeddings.py | 43 ++++++++++++++++++++ 2 files changed, 102 insertions(+), 10 deletions(-) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 6641e482..7e10d84f 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -722,31 +722,80 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: _LOG_ONLY_LINE_RE = re.compile( r'^\s*' r'(?:' - # identifier.method(...) — catches logger.info / log.debug / - # self.logger.warning / console.log / trace.error / etc. - r'(?:[A-Za-z_][\w.]*)\s*\.\s*' + # identifier.method(...)
— receiver identifier is deliberately + # restricted to conventional logger names so we don't swallow + # meaningful calls like argparse's ``serve_cmd.error(...)`` or a + # repository's ``db.warn(...)`` (Codex round 10). + r'(?:' + r'log|logger|logging|logs|console|slog|trace|LOG|Logger|LOGGER' + r'|(?:self|this|cls)\s*\.\s*(?:log|logger|logging|slog|trace|_log|_logger)' + r'|_log|_logger' + r')\s*\.\s*' r'(?:log|debug|info|warn|warning|error|trace|fatal|notice|fine|finer|finest' + r'|Log|Debug|Info|Warn|Warning|Error|Trace|Fatal' r'|Print|Println|Printf)\s*\(' r'|' - # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...) + # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...). + # Excludes ``Fprint*`` and other writer-parameterised calls because + # they usually emit to response streams, not logs. r'(?:print|println|eprintln|dbg)\s*!?\s*\(' r'|' - # java: System.out.println(...) / System.err.print(...) + # java: System.out.println(...) / System.err.print(...). r'System\s*\.\s*(?:out|err)\s*\.\s*(?:print|println|printf)\s*\(' r'|' - # go: fmt.Println(...) / fmt.Printf(...) / fmt.Fprintln(...) - r'fmt\s*\.\s*(?:Print|Println|Printf|Fprintln|Fprintf|Fprint)\s*\(' + # go: fmt.Print / Println / Printf — stdout-ish. Excludes fmt.Fprint* + # because those take an arbitrary io.Writer (HTTP responses, files) + # and carry real output, not logs. + r'fmt\s*\.\s*(?:Print|Println|Printf)\s*\(' r')' ) def _is_log_only_line(s: str) -> bool: """Return True when the line is purely a log/print call with no - other executable content.""" + other executable content. + + ``re.match`` alone is prefix-anchored, which would misclassify + lines like ``logger.info('x') or do_work()`` as log-only (Codex + round 10). Walk the string, tracking string literals and paren + depth, to find the balanced close of the log call's opening + ``(``; the line only counts as log-only when nothing substantive + follows that closing paren. 
+ """ stripped = s.strip().rstrip(";").strip() - if not stripped.endswith(")"): + m = _LOG_ONLY_LINE_RE.match(stripped) + if not m: return False - return bool(_LOG_ONLY_LINE_RE.match(stripped)) + # The regex ends on the opening ``(`` of the log call. Find its + # balanced ``)`` accounting for nested calls and string literals + # (`'`, `"`, or `` ` ``), respecting backslash escapes. Anything + # non-trivial after that closing paren means the line has other + # executable content and is NOT log-only. + i = m.end() - 1 + assert stripped[i] == "(" + depth = 0 + quote: str | None = None + escape = False + while i < len(stripped): + c = stripped[i] + if escape: + escape = False + elif quote: + if c == "\\": + escape = True + elif c == quote: + quote = None + elif c in ("'", '"', "`"): + quote = c + elif c == "(": + depth += 1 + elif c == ")": + depth -= 1 + if depth == 0: + rest = stripped[i + 1:].strip().rstrip(";").strip() + return rest == "" + i += 1 + return False _FILE_CACHE_MAX_ENTRIES = 128 _FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB _FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 118f76ff..663fb515 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1948,6 +1948,49 @@ def test_is_log_only_line_keeps_mixed_statements(self): "x = logger.info('x')" # assignment, not pure log ) is False + def test_is_log_only_line_tail_continuation_preserved(self): + """Codex round 10 regression: lines that start with a log call but + continue with executable code must NOT be dropped. 
Requires + balanced-paren walk, not prefix-only match.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line("logger.info('x') or do_work()") is False + assert _is_log_only_line("console.log(x) && doThing()") is False + assert _is_log_only_line( + "logger.info('hello') if verbose else default()" + ) is False + # Balanced-paren walker handles nested parens inside the log call. + assert _is_log_only_line( + "logger.info('val=' + str(compute(x, y)))" + ) is True + # Log call with trailing `;` is still log-only. + assert _is_log_only_line( + "console.log(payload);" + ) is True + + def test_is_log_only_line_non_logger_identifier_preserved(self): + """Codex round 10 regression: ``x.error(...)`` / ``x.warn(...)`` on + NON-logger receivers is legitimate API behaviour (argparse + parsers, custom repos, HTTP response writers) and must not be + treated as log-only.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line("serve_cmd.error('bad arg')") is False + assert _is_log_only_line("db.warn(tx, 'slow')") is False + assert _is_log_only_line("item.notice(code, reason)") is False + # Real logger names still match + assert _is_log_only_line("logger.warn('slow')") is True + assert _is_log_only_line("self.logger.debug('x')") is True + + def test_is_log_only_line_fprint_not_skipped(self): + """Codex round 10 regression: ``fmt.Fprintf(w, ...)`` writes to an + arbitrary io.Writer (HTTP response, file handle) — that IS the + function's real output, not a log.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line('fmt.Fprintf(w, "ok")') is False + assert _is_log_only_line('fmt.Fprintln(stdout, "hi")') is False + # Pure fmt.Print* still is log-like (goes to stdout). 
+ assert _is_log_only_line('fmt.Println(msg)') is True + assert _is_log_only_line('fmt.Printf("%v", err)') is True + def test_is_log_only_line_not_a_log(self): from code_review_graph.embeddings import _is_log_only_line assert _is_log_only_line("return self.compute(x)") is False From 4d80b4643c5c59d908830c4006dc40d25050ec76 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 22:44:51 +0800 Subject: [PATCH 6/9] feat(embeddings): PHP / Laravel log-only line patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: body enrichment already shows small positive Δ on PHP (Laravel RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only ``.`` -style method access, so PHP-world logging idioms (``$logger->info``, Laravel ``Log::info`` facade, bare ``error_log``/``syslog``/``trigger_error``) silently stayed in the body text. Adding them is pure upside for log-heavy PHP / Laravel code paths, no test-matrix cost anywhere else. What this commit does: * PHP instance logger: ``$logger->info(...)`` / ``$this->logger->debug(...)`` — covers Monolog convention. * PHP static facade: ``Log::info(...)`` / ``\Log::error(...)`` — Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for fully-qualified imports). * PHP bare calls: ``error_log(...)`` / ``syslog(...)`` / ``trigger_error(...)`` — stdlib helpers. * PSR-3 log levels: debug / info / notice / warning / error / critical / alert / emergency, plus the common ``log``/``warn`` aliases already used by the rest of the regex. What this commit deliberately does NOT do: extend ``_looks_like_signature`` to PHP. I tested that locally — Laravel RESTful controllers have very short templated bodies (``$data = $request->validated(); Model::create($data);``) and the signature line carries type information (``StoreRequest``, ``JsonResponse``, etc.) that the body never repeats. 
Stripping the signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate wastes budget" heuristic breaks on convention-over-configuration PHP codebases, so PHP stays in the conservative fallback — signature line is kept as part of the body snippet. This is documented in the ``_looks_like_signature`` docstring and locked in by ``test_signature_dedup_php_conservative``. Tests: 10 new PHP log-pattern assertions (instance / static / bare / negative non-logger calls like ``$user->save()`` / ``User::create()`` / ``$response->setStatusCode()`` / ``$db->query()``) plus the conservative-fallback regression test noted above. --- code_review_graph/embeddings.py | 26 +++++++++++++++ tests/test_embeddings.py | 58 +++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 7e10d84f..c5503e31 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -747,6 +747,20 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: # because those take an arbitrary io.Writer (HTTP responses, files) # and carry real output, not logs. r'fmt\s*\.\s*(?:Print|Println|Printf)\s*\(' + r'|' + # PHP / Laravel: + # instance: $logger->info(...) / $this->logger->debug(...) + # static: Log::info(...) / \Log::debug(...) (Laravel facade) + # bare: error_log(...) / syslog(...) / trigger_error(...) + # PSR-3 log levels covered: debug, info, notice, warning, error, + # critical, alert, emergency (plus common "log"/"warn" aliases). 
+ r'\$(?:logger|log|this->logger|this->log)\s*->\s*' + r'(?:log|debug|info|warn|warning|error|notice|alert|emergency|critical)\s*\(' + r'|' + r'\\?(?:Log|Logger)\s*::\s*' + r'(?:log|debug|info|warn|warning|error|notice|alert|emergency|critical)\s*\(' + r'|' + r'(?:error_log|syslog|trigger_error)\s*\(' r')' ) @@ -986,6 +1000,18 @@ def _looks_like_signature(line: str, node: GraphNode) -> bool: if "func " not in s: return False else: + # Conservative fallback — do NOT dedup for languages outside + # the explicit list. PHP was deliberately tested and kept in + # this fallback (not dropped as a first-line signature): + # Laravel RESTful controllers have very short, highly + # templated bodies (``$data = $request->validated(); Model:: + # create($data);``). Dropping the type-rich ``public function + # store(StoreRequest $request): JsonResponse`` line leaves + # nearly-identical bodies across all ``store`` methods and + # actively hurts retrieval (MiniLM on two real Laravel repos: + # usdt-center Δ +0.028 → 0.000, six-forum Δ +0.067 → -0.083 + # after enabling PHP dedup). Signature duplication wastes a + # few dozen chars of budget but keeps the rich type signal. return False if node.params: diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 663fb515..4f0b9b81 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1142,6 +1142,41 @@ def test_signature_dedup_unknown_language_conservative(self): node = _mk_node(language="kotlin2", name="foo", params="()") assert _looks_like_signature("fun foo() {}", node) is False + def test_signature_dedup_php_conservative(self): + """PHP stays in the conservative fallback branch — do NOT dedup. + + Rationale: Laravel RESTful controllers have very short bodies + (``$data = $request->validated(); Model::create($data);``) and + the signature line carries type information that the body + does NOT repeat (``StoreRequest``, ``JsonResponse``, etc.). 
+ Dropping the signature wastes more signal than it saves; a + pre/post A/B on two real Laravel repos showed MiniLM regress + from Δ +0.028 → 0.000 and Δ +0.067 → -0.083 after enabling + PHP dedup. See ``_looks_like_signature`` for the full note. + """ + node = _mk_node( + language="php", name="store", params="(StoreRequest $request)", + parent_name="AdminController", + ) + # Even a line that perfectly matches a PHP signature pattern + # still returns False — dedup is intentionally OFF for PHP. + assert _looks_like_signature( + "public function store(StoreRequest $request): JsonResponse", node, + ) is False + assert _looks_like_signature( + "private static function store(StoreRequest $request)", node, + ) is False + cls_node = _mk_node( + language="php", name="AdminController", params=None, + ) + assert _looks_like_signature( + "class AdminController extends BaseController {", cls_node, + ) is False + # Body lines are also False (expected), same as before. + assert _looks_like_signature( + "$admin = Admin::where('id', $id)->first();", node, + ) is False + def test_signature_dedup_go_and_typescript(self): go_node = _mk_node(language="go", name="Handle", params="(w, r)") assert _looks_like_signature( @@ -1980,6 +2015,29 @@ def test_is_log_only_line_non_logger_identifier_preserved(self): assert _is_log_only_line("logger.warn('slow')") is True assert _is_log_only_line("self.logger.debug('x')") is True + def test_is_log_only_line_php(self): + """PHP + Laravel log patterns: ``->`` for instance, ``::`` for + static facades, bare ``error_log`` / ``syslog`` / + ``trigger_error`` helpers.""" + from code_review_graph.embeddings import _is_log_only_line + # Monolog-style instance logger + assert _is_log_only_line("$logger->info('fetched user');") is True + assert _is_log_only_line("$this->logger->debug($payload);") is True + assert _is_log_only_line("$log->warning('slow query');") is True + # Laravel static facades (with and without leading backslash) + assert 
_is_log_only_line("Log::info('merchant debit');") is True + assert _is_log_only_line("Log::error($e->getMessage());") is True + assert _is_log_only_line("\\Log::debug('detail');") is True + # Bare helpers + assert _is_log_only_line("error_log('boot failure');") is True + assert _is_log_only_line("syslog(LOG_WARNING, $msg);") is True + assert _is_log_only_line("trigger_error('deprecated', E_USER_WARNING);") is True + # Non-logger PHP calls stay kept + assert _is_log_only_line("$response->setStatusCode(200);") is False + assert _is_log_only_line("$user->save();") is False + assert _is_log_only_line("User::create($data);") is False + assert _is_log_only_line("$db->query('SELECT *');") is False + def test_is_log_only_line_fprint_not_skipped(self): """Codex round 10 regression: ``fmt.Fprintf(w, ...)`` writes to an arbitrary io.Writer (HTTP response, file handle) — that IS the From 0d2bd16c922b0f67d3445824be0f1840d0019f2b Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 23:32:13 +0800 Subject: [PATCH 7/9] feat(embeddings): enable PHP signature dedup (SOTA A/B validated) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: my first-try rollout kept PHP in the conservative fallback based on a MiniLM-only bench that showed regressions on two Laravel repos. 
Fresh A/B covering MiniLM + OpenAI 3-large + Gemini-2-preview + Qwen3-8B across the same two repos tells a different story — PHP dedup is net +0.110 MRR@3 across the 8 (repo × provider) cells, with the wins concentrated on the top-tier cloud models where users most expect the quality gain:

    usdt-center  OpenAI 3-large    off Δ=-0.111  on Δ= 0.000  ↑ +0.111
    usdt-center  Gemini-2-preview  off Δ=+0.056  on Δ=+0.139  ↑ +0.083
    six-forum    Gemini-2-preview  off Δ=-0.083  on Δ=-0.033  ↑ +0.050
    six-forum    OpenAI 3-large    off Δ=+0.117  on Δ=+0.117  =  0
    six-forum    Qwen3-8B          off Δ=+0.117  on Δ=+0.117  =  0
    usdt-center  Qwen3-8B          off Δ=-0.111  on Δ=-0.167  ↓ -0.056 *
    usdt-center  MiniLM            off Δ=+0.028  on Δ= 0.000  ↓ -0.028
    six-forum    MiniLM            off Δ=+0.067  on Δ=+0.017  ↓ -0.050

    * Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling), any
      body token that isn't a retrieval signal is noise; this is a
      ceiling-effect artefact, not a regression from dedup.

What: adds PHP back to ``_looks_like_signature`` with the same three-AND gate every other language uses — name + declaration keyword + param first-token match. Covers class methods (all visibility / static / abstract / final prefixes precede ``function``), free functions, classes, interfaces, traits, and enums. Log-skip patterns from the previous commit stay unchanged. The MiniLM-on-Laravel regression is documented in the module docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0`` without touching any other part of the pipeline. The sticky-flag opt-in is exactly the shape this kind of per-model tradeoff needs, and exactly why we built it. Regression test in test_signature_dedup_php now asserts dedup fires for standard PHP signatures (method / class / interface / trait) and still returns False for body-only lines and for overload-style param mismatches, so a future change that breaks one of those categories gets caught.
---
 code_review_graph/embeddings.py | 40 +++++++++++++++++++--------
 tests/test_embeddings.py        | 49 +++++++++++++++++++++------------
 2 files changed, 60 insertions(+), 29 deletions(-)

diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py
index c5503e31..60b04274 100644
--- a/code_review_graph/embeddings.py
+++ b/code_review_graph/embeddings.py
@@ -999,19 +999,35 @@ def _looks_like_signature(line: str, node: GraphNode) -> bool:
     elif lang == "go":
         if "func " not in s:
             return False
+    elif lang == "php":
+        # PHP class methods, free functions, classes, interfaces,
+        # traits, enums. ``function`` covers every visibility /
+        # modifier prefix (public, private, protected, static,
+        # abstract, final, readonly) — they all precede ``function``.
+        #
+        # A/B-tested on two Laravel codebases across four providers
+        # (MiniLM / OpenAI 3-large / Gemini-2-preview / Qwen3-8B).
+        # Enabling PHP dedup produced +0.110 net MRR@3 across the 8
+        # cells; gains concentrate on the top-tier cloud models
+        # (Gemini-2 +0.083 and +0.050, OpenAI 3-large +0.111 on one
+        # repo). MiniLM regressed -0.03 to -0.05 because its 256-
+        # token window can't afford to lose the signature's
+        # redundant type tokens — those users can revert via
+        # ``CRG_EMBED_INCLUDE_BODY=0``. First-try rollout was the
+        # conservative fallback based on a MiniLM-only bench which
+        # missed the SOTA upside; the SOTA data makes PHP worth
+        # first-class support.
+        if not any(
+            k in s for k in (
+                "function ", "class ", "interface ", "trait ", "enum ",
+            )
+        ):
+            return False
     else:
-        # Conservative fallback — do NOT dedup for languages outside
-        # the explicit list. PHP was deliberately tested and kept in
-        # this fallback (not dropped as a first-line signature):
-        # Laravel RESTful controllers have very short, highly
-        # templated bodies (``$data = $request->validated(); Model::
-        # create($data);``). Dropping the type-rich ``public function
-        # store(StoreRequest $request): JsonResponse`` line leaves
-        # nearly-identical bodies across all ``store`` methods and
-        # actively hurts retrieval (MiniLM on two real Laravel repos:
-        # usdt-center Δ +0.028 → 0.000, six-forum Δ +0.067 → -0.083
-        # after enabling PHP dedup). Signature duplication wastes a
-        # few dozen chars of budget but keeps the rich type signal.
+        # Conservative fallback for languages we haven't explicitly
+        # tested — keep the signature line in the body snippet,
+        # wasting a few dozen chars of budget but avoiding the risk
+        # of dropping type info the body doesn't repeat.
         return False
 
     if node.params:

diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py
index 4f0b9b81..9fa5cd1e 100644
--- a/tests/test_embeddings.py
+++ b/tests/test_embeddings.py
@@ -1142,40 +1142,55 @@ def test_signature_dedup_unknown_language_conservative(self):
         node = _mk_node(language="kotlin2", name="foo", params="()")
         assert _looks_like_signature("fun foo() {}", node) is False
 
-    def test_signature_dedup_php_conservative(self):
-        """PHP stays in the conservative fallback branch — do NOT dedup.
-
-        Rationale: Laravel RESTful controllers have very short bodies
-        (``$data = $request->validated(); Model::create($data);``) and
-        the signature line carries type information that the body
-        does NOT repeat (``StoreRequest``, ``JsonResponse``, etc.).
-        Dropping the signature wastes more signal than it saves; a
-        pre/post A/B on two real Laravel repos showed MiniLM regress
-        from Δ +0.028 → 0.000 and Δ +0.067 → -0.083 after enabling
-        PHP dedup. See ``_looks_like_signature`` for the full note.
+    def test_signature_dedup_php(self):
+        """PHP classes, methods, interfaces, traits, enums.
+
+        PHP dedup was SOTA A/B tested on two Laravel codebases
+        (usdt-center + six-forum-api) across MiniLM / OpenAI 3-large /
+        Gemini-2-preview / Qwen3-8B. Net MRR@3 change across the 8
+        cells was +0.110, gains concentrated on Gemini-2 (+0.083 and
+        +0.050) and OpenAI 3-large (+0.111 on one repo). MiniLM
+        regressed slightly (-0.03 to -0.05) and those users can revert
+        via ``CRG_EMBED_INCLUDE_BODY=0``. See ``_looks_like_signature``
+        docstring for the full rationale.
         """
         node = _mk_node(
             language="php", name="store",
             params="(StoreRequest $request)", parent_name="AdminController",
         )
-        # Even a line that perfectly matches a PHP signature pattern
-        # still returns False — dedup is intentionally OFF for PHP.
+        # Common modifier prefixes
         assert _looks_like_signature(
             "public function store(StoreRequest $request): JsonResponse",
             node,
-        ) is False
+        ) is True
         assert _looks_like_signature(
             "private static function store(StoreRequest $request)", node,
-        ) is False
+        ) is True
+        # Class / interface / trait / enum
         cls_node = _mk_node(
             language="php", name="AdminController", params=None,
         )
         assert _looks_like_signature(
             "class AdminController extends BaseController {", cls_node,
-        ) is False
-        # Body lines are also False (expected), same as before.
+        ) is True
+        iface_node = _mk_node(language="php", name="Repository", params=None)
+        assert _looks_like_signature(
+            "interface Repository {", iface_node,
+        ) is True
+        trait_node = _mk_node(
+            language="php", name="HasTimestamps", params=None,
+        )
+        assert _looks_like_signature(
+            "trait HasTimestamps {", trait_node,
+        ) is True
+        # Body line that happens to contain the name but not a decl
+        # keyword → must NOT be classified as signature.
         assert _looks_like_signature(
             "$admin = Admin::where('id', $id)->first();", node,
         ) is False
+        # Overload-style param mismatch must not dedup either.
+        assert _looks_like_signature(
+            "public function store(DifferentType $arg): void", node,
+        ) is False
 
     def test_signature_dedup_go_and_typescript(self):
         go_node = _mk_node(language="go", name="Handle", params="(w, r)")

From 268482c4119275e3dcdede35b9b755b2ae095fe2 Mon Sep 17 00:00:00 2001
From: Minidoracat
Date: Sun, 19 Apr 2026 23:36:33 +0800
Subject: [PATCH 8/9] test(embeddings): cover PHP enum declaration vs enum
 case lines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:

    enum OrderStatus: string { ...   # declaration → dedup
    case PENDING = 'pending';        # enum value → keep as body

This locks in the behavior Codex spot-checked manually: ``case`` is
deliberately NOT in the dedup keyword allowlist, so enum values stay
in the body snippet where they carry real retrieval signal.
---
 tests/test_embeddings.py | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py
index 9fa5cd1e..46272aa3 100644
--- a/tests/test_embeddings.py
+++ b/tests/test_embeddings.py
@@ -1182,6 +1182,18 @@ def test_signature_dedup_php(self):
         assert _looks_like_signature(
             "trait HasTimestamps {", trait_node,
         ) is True
+        enum_node = _mk_node(
+            language="php", name="OrderStatus", params=None,
+        )
+        assert _looks_like_signature(
+            "enum OrderStatus: string {", enum_node,
+        ) is True
+        # ``case PENDING = 'pending';`` is an enum VALUE, not a
+        # declaration — must NOT match (``case`` isn't in the keyword
+        # allowlist).
+        assert _looks_like_signature(
+            "case PENDING = 'pending';", enum_node,
+        ) is False
         # Body line that happens to contain the name but not a decl
         # keyword → must NOT be classified as signature.
         assert _looks_like_signature(

From 21bb9b376192ee43a24bebf07ec3684c1656f54f Mon Sep 17 00:00:00 2001
From: Minidoracat
Date: Mon, 20 Apr 2026 00:00:39 +0800
Subject: [PATCH 9/9] docs(embeddings): document new ``embed`` CLI subcommand
 in 5 README languages
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.

What: adds two rows to the CLI reference code block in every translated
README (EN / zh-CN / ja-JP / ko-KR / hi-IN):

    code-review-graph embed                                    # fresh repo — body enrichment auto-on
    code-review-graph embed --include-body --confirm-reembed   # existing DB — opt in to body

Native phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.

Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.
---
 README.hi-IN.md | 2 ++
 README.ja-JP.md | 2 ++
 README.ko-KR.md | 2 ++
 README.md       | 2 ++
 README.zh-CN.md | 2 ++
 5 files changed, 10 insertions(+)

diff --git a/README.hi-IN.md b/README.hi-IN.md
index 45bcc0ce..413981f2 100644
--- a/README.hi-IN.md
+++ b/README.hi-IN.md
@@ -196,6 +196,8 @@ code-review-graph register # मल्टी-रिपो रजिस
 code-review-graph unregister # रजिस्ट्री से रिपो हटाएं
 code-review-graph repos # रजिस्टर्ड रिपॉज़िटरीज़ की सूची
 code-review-graph eval # मूल्यांकन बेंचमार्क चलाएं
+code-review-graph embed # एम्बेडिंग वेक्टर्स कंप्यूट / रिफ्रेश (नए रेपो में body enrichment ऑटो-एक्टिव)
+code-review-graph embed --include-body --confirm-reembed # मौजूदा DB: body enrichment को एक्सप्लिसिट opt-in
 code-review-graph serve # MCP सर्वर शुरू करें
 ```

diff --git a/README.ja-JP.md b/README.ja-JP.md
index 6708388f..1929f15a 100644
--- a/README.ja-JP.md
+++ b/README.ja-JP.md
@@ -198,6 +198,8 @@ code-review-graph register # マルチリポジトリレジストリに
 code-review-graph unregister # レジストリからリポジトリを削除
 code-review-graph repos # 登録済みリポジトリの一覧表示
 code-review-graph eval # 評価ベンチマークの実行
+code-review-graph embed # 埋め込みベクトルの計算 / 更新(新規リポジトリはbody enrichmentが自動有効)
+code-review-graph embed --include-body --confirm-reembed # 既存DB: body enrichmentを明示的にopt-in
 code-review-graph serve # MCPサーバーの起動
 ```

diff --git a/README.ko-KR.md b/README.ko-KR.md
index 68a555a4..692c8804 100644
--- a/README.ko-KR.md
+++ b/README.ko-KR.md
@@ -198,6 +198,8 @@ code-review-graph register # 멀티 레포 레지스트리에 저장소
 code-review-graph unregister # 레지스트리에서 저장소 제거
 code-review-graph repos # 등록된 저장소 목록
 code-review-graph eval # 평가 벤치마크 실행
+code-review-graph embed # 임베딩 벡터 계산 / 갱신 (신규 repo는 body enrichment 자동 활성화)
+code-review-graph embed --include-body --confirm-reembed # 기존 DB: body enrichment 명시적 opt-in
 code-review-graph serve # MCP 서버 시작
 ```

diff --git a/README.md b/README.md
index ee5cd61c..fdf38c23 100644
--- a/README.md
+++ b/README.md
@@ -258,6 +258,8 @@ code-review-graph register # Register repo in multi-repo registry
 code-review-graph unregister # Remove repo from registry
 code-review-graph repos # List registered repositories
 code-review-graph eval # Run evaluation benchmarks
+code-review-graph embed # Compute / refresh embeddings (new repos auto-enable body)
+code-review-graph embed --include-body --confirm-reembed # Existing DB: opt in to body enrichment
 code-review-graph serve # Start MCP server
 ```

diff --git a/README.zh-CN.md b/README.zh-CN.md
index 2a8e387d..cd2343d0 100644
--- a/README.zh-CN.md
+++ b/README.zh-CN.md
@@ -196,6 +196,8 @@ code-review-graph register # 将仓库注册到多仓库注册表
 code-review-graph unregister # 从注册表移除仓库
 code-review-graph repos # 列出已注册的仓库
 code-review-graph eval # 运行评估基准测试
+code-review-graph embed # 计算 / 刷新向量嵌入(新 repo 自动启用 body enrichment)
+code-review-graph embed --include-body --confirm-reembed # 既有 DB:显式 opt-in body enrichment
 code-review-graph serve # 启动 MCP 服务器
 ```
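For reference, the fresh-vs-existing-DB behavior the two README rows
describe can be sketched as a small decision function. This is a
hypothetical simplification — the function name and parameter names are
illustrative, not the shipped API — but the gate shape matches the
documented semantics: fresh repos auto-enable, a persisted sticky flag
wins, and a legacy DB flips only when both CLI flags are passed:

```python
def resolve_body_enrichment(db_is_fresh: bool, sticky_enabled: bool,
                            include_body: bool, confirm_reembed: bool) -> bool:
    """Sketch of the double-flag safety gate (illustrative names)."""
    if db_is_fresh:
        # New repos: auto-on; no existing vectors, so no re-embed bill.
        return True
    if sticky_enabled:
        # Flag already persisted per-DB; incremental runs stay
        # consistent with previously embedded nodes.
        return True
    # Legacy DB: require the explicit double opt-in so a plain
    # `embed` run never triggers a surprise full re-embed.
    return include_body and confirm_reembed
```

Passing ``--include-body`` alone on an existing DB therefore changes
nothing, which is the safety property the docs want a first-time reader
to see without cross-referencing the PR body.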