From 17898597cc9ed8923dd3d9dc6ef47a58ce349bea Mon Sep 17 00:00:00 2001
From: Minidoracat
Date: Sun, 19 Apr 2026 06:59:41 +0800
Subject: [PATCH 1/9] feat(embeddings): include function body in semantic search text
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Why: signature-only embeddings carry ~10 tokens per node (name + kind + params + return type + language). That ceiling wastes the long-context capacity of modern embedding models — Gemini 2, Qwen3-8B, and similar rank SOTA on MTEB-code largely on their ability to understand the *body* of a function, but we were handing them only its name. So the quality advantage of long-context embeddings over local MiniLM was being flattened by our own input, not by model capability.

What: `_node_to_text()` now appends a truncated body snippet after the signature metadata, with the signature line and leading docstrings / header comments stripped so the budget goes to real logic instead of documentation.

Four design choices worth flagging:

1. Per-provider char budget (ADR-B4): local=700 / google=3000 / minimax=3000 / openai=6000; env `CRG_EMBED_BODY_MAX_CHARS` overrides. This gives MiniLM just enough to not silently truncate and gives the long-context models room to actually use their capacity.

2. Sticky-flag opt-in (ADR-C3): new `embeddings_meta` table on `embeddings.db` persists `embed_body_enabled` per DB. Fresh repos auto-enable; existing DBs keep legacy behavior until the user runs `code-review-graph embed --include-body --confirm-reembed`. This avoids surprising cloud-provider users with a one-time re-embed API bill on upgrade, and the sticky flag keeps new nodes consistent with older ones across incremental runs (no mixed body-on / body-off state).

3. Two-stage IO-bounded filter (ADR-filter-short-circuit): the stored hash is `meta_hash:body_hash:file_hash`. Stage 1 computes the metadata hash only (zero file IO) and skips when metadata + file_hash are unchanged.
Stage 2 reads the file only when stage 1 can't prove freshness, recomputes the body hash, and persists the combined form. Empty extracted bodies get sha256("") as a sentinel so they also short-circuit next time; read failures deliberately do NOT persist a sentinel, so transient errors stay retryable.

4. Language-agnostic body extraction with a block-comment + triple-quote state machine and conservative signature dedup via a three-way AND gate (node.name + declaration keyword + param first-token). Unknown languages fall back to "don't dedup" rather than risk dropping a real body. Coverage today: Python / JS / TS / JSX / TSX / Java / Kotlin / Rust / Go; other languages keep the signature line and just eat a bit of their body budget.

Backward compat: legacy rows (a pre-existing sha256 hex in `embeddings.text_hash`, no `:`) and Iter-2 rows (`meta:body`, no file hash) are both parsed correctly by `_split_combined_hash` and trigger a single safe re-embed on first contact.

Security: `_FileLineCache` resolves paths and enforces a `relative_to` containment check against the repo root, so a poisoned `nodes.file_path` can't read `/etc/passwd` or traverse out of the repo into cloud embedding calls.

Verification: regression MRR@3 on a 10-query golden set (bundled mini repo, local MiniLM-L6-v2) goes from 0.9500 to 1.0000; body-dependent queries like "escape HTML entities including quotes and backticks" or "read GEMINI_API_KEY from the environment" move the correct node into the top-3 for the first time because the body now contains the distinguishing tokens. Also tested against a real OpenAI-compatible endpoint (http://127.0.0.1:3000/v1 + text-embedding-v4): 28 nodes embedded on the first run, 0 on a repeat with the sticky flag honored — filter IO short-circuit verified end-to-end.
Tests: 59 new cases (file-line cache LRU + containment, signature dedup with decorator / recursive / params mismatch / unknown lang, block-comment and triple-quote state machine with JSDoc / Javadoc / Python single-line and multi-line docstrings / long comments, per-provider budget resolution + env override, sticky flag across runs including CLI flip, filter short-circuit with and without populated file_hash, read-failure must-retry regression, legacy hash backward compat). The MRR regression test skips gracefully when sentence-transformers isn't installed (CI matrix legs that install only `.[dev]` stay green; CI jobs that install `.[embeddings]` activate the gate).

Scope: all four providers (Local / Google / MiniMax / OpenAI-compatible) benefit transparently; no schema migration on graph.db, no change to the `EmbeddingStore.__init__` signature, no change to parser.py / migrations.py / search.py / tools.py. Adds one new subcommand `embed` and a private `embeddings_meta` table on `embeddings.db`.
---
 README.hi-IN.md | 4 +-
 README.ja-JP.md | 4 +-
 README.ko-KR.md | 4 +-
 README.md | 18 +-
 README.zh-CN.md | 4 +-
 code_review_graph/cli.py | 97 +++
 code_review_graph/embeddings.py | 584 +++++++++++++-
 .../embeddings/eval/golden_queries.json | 101 +++
 .../embeddings/eval/mini_repo/api_client.py | 24 +
 .../embeddings/eval/mini_repo/embeddings.py | 32 +
 .../embeddings/eval/mini_repo/graph.py | 33 +
 .../embeddings/eval/mini_repo/incremental.py | 16 +
 .../embeddings/eval/mini_repo/parser.py | 33 +
 .../eval/mini_repo/visualization.py | 18 +
 tests/test_embeddings.py | 754 ++++++++++++++++++
 tests/test_embeddings_mrr.py | 183 +++++
 16 files changed, 1882 insertions(+), 27 deletions(-)
 create mode 100644 tests/fixtures/embeddings/eval/golden_queries.json
 create mode 100644 tests/fixtures/embeddings/eval/mini_repo/api_client.py
 create mode 100644 tests/fixtures/embeddings/eval/mini_repo/embeddings.py
 create mode 100644 tests/fixtures/embeddings/eval/mini_repo/graph.py
 create mode 100644
tests/fixtures/embeddings/eval/mini_repo/incremental.py create mode 100644 tests/fixtures/embeddings/eval/mini_repo/parser.py create mode 100644 tests/fixtures/embeddings/eval/mini_repo/visualization.py create mode 100644 tests/test_embeddings_mrr.py diff --git a/README.hi-IN.md b/README.hi-IN.md index ab40ade0..45bcc0ce 100644 --- a/README.hi-IN.md +++ b/README.hi-IN.md @@ -133,7 +133,7 @@ Build the code review graph for this project | **23 भाषाएं + नोटबुक** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **ब्लास्ट-रेडियस विश्लेषण** | दिखाता है कि किसी भी बदलाव से कौन से फ़ंक्शन, क्लासेज़, और फ़ाइलें प्रभावित होती हैं | | **ऑटो-अपडेट हुक्स** | बिना मैन्युअल हस्तक्षेप के हर फ़ाइल एडिट और git कमिट पर ग्राफ अपडेट होता है | -| **सिमेंटिक सर्च** | sentence-transformers, Google Gemini, MiniMax, या किसी भी OpenAI-compatible एंडपॉइंट (असली OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) के ज़रिए वैकल्पिक वेक्टर एम्बेडिंग | +| **सिमेंटिक सर्च** | sentence-transformers, Google Gemini, MiniMax, या किसी भी OpenAI-compatible एंडपॉइंट (असली OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) के ज़रिए वैकल्पिक वेक्टर एम्बेडिंग. फ़ंक्शन body का implementation हिस्सा एम्बेड होता है (signature और शुरुआती docstring/header comments हटा दिए जाते हैं ताकि snippet का token budget real logic पर खर्च हो, documentation पर नहीं). 
long-context मॉडल्स "कोड क्या करता है" के हिसाब से रैंक कर सकते हैं | | **इंटरैक्टिव विज़ुअलाइज़ेशन** | सर्च, कम्युनिटी लीजेंड टॉगल, और डिग्री-स्केल्ड नोड्स के साथ D3.js फ़ोर्स-डायरेक्टेड ग्राफ | | **हब और ब्रिज डिटेक्शन** | betweenness centrality के ज़रिए सबसे ज़्यादा कनेक्टेड नोड्स और आर्किटेक्चरल चोकपॉइंट्स खोजें | | **सरप्राइज़ स्कोरिंग** | अप्रत्याशित कपलिंग का पता लगाएं: क्रॉस-कम्युनिटी, क्रॉस-लैंग्वेज, पेरीफ़ेरल-टू-हब एज़ेज़ | @@ -286,7 +286,7 @@ export CRG_OPENAI_BATCH_SIZE=100 # टाइट बैच > **मॉडल चुनने की सलाह।** लंबे समय के उपयोग के लिए `-preview` / `-beta` / `-exp` वाले model ID (जैसे `google/gemini-embedding-2-preview`) से बचें — preview मॉडल्स के वज़न बदल सकते हैं (डाइमेंशन बदलने पर पूरा re-embed करना पड़ेगा) या बिना नोटिस deprecate हो सकते हैं। स्टेबल GA मॉडल्स की सलाह दी जाती है: `text-embedding-3-small` / `text-embedding-3-large` (OpenAI), `Qwen/Qwen3-Embedding-8B` (vLLM / LocalAI सेल्फ-होस्टेड के ज़रिए), या `gemini-embedding-001` (नेटिव Gemini provider के ज़रिए, `GOOGLE_API_KEY` चाहिए). > -> साथ ही ध्यान दें: वर्तमान में `code-review-graph` केवल **फ़ंक्शन सिग्नेचर** एम्बेड करता है (प्रति नोड ~10 tokens, जैसे `"parse_file function (path: str) returns Tree"`). जिन मॉडल्स की क्वालिटी का मुख्य source लंबे context में function body को समझना है (जैसे Gemini 2 या Qwen3-8B के MTEB-code SOTA स्कोर्स), वे इस इनपुट लंबाई पर छोटे मॉडल्स से कम अंतर दिखाएंगे। Body / docstring एम्बेडिंग को फ़ॉलो-अप एन्हांसमेंट के रूप में ट्रैक किया जा रहा है। +> `code-review-graph` signature के साथ-साथ **फ़ंक्शन body के implementation** हिस्से का truncated snippet भी एम्बेड करता है (signature और शुरुआती docstring/header comments हटा दिए जाते हैं ताकि snippet token budget real logic पर जाए; प्रति प्रोवाइडर char budget: local 700 / google 3000 / minimax 3000 / openai 6000). 
अब long-context मॉडल्स (Gemini 2, Qwen3-8B) फ़ंक्शन के नाम के बजाय "वह क्या करता है" के आधार पर रैंक कर सकते हैं। नए रेपो में यह अपने आप active हो जाता है; मौजूदा DB legacy signature-only व्यवहार रखते हैं जब तक आप `code-review-graph embed --include-body --confirm-reembed` से opt-in नहीं करते (एक बार का पूरा re-embed क्लाउड प्रोवाइडर्स पर API tokens खर्च कर सकता है — signature-only पर रहने के लिए `CRG_EMBED_INCLUDE_BODY=0` सेट करें). diff --git a/README.ja-JP.md b/README.ja-JP.md index 220ea864..6708388f 100644 --- a/README.ja-JP.md +++ b/README.ja-JP.md @@ -135,7 +135,7 @@ gitコミットやファイル保存のたびにフックが起動します。 | **23言語 + ノートブック** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **影響範囲分析** | 変更によって影響を受ける関数、クラス、ファイルを正確に表示 | | **自動更新フック** | ファイル編集やgitコミットのたびに手動操作なしでグラフを更新 | -| **セマンティック検索** | sentence-transformers、Google Gemini、MiniMax、またはOpenAI互換エンドポイント(本家OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)によるオプションのベクトル埋め込み | +| **セマンティック検索** | sentence-transformers、Google Gemini、MiniMax、またはOpenAI互換エンドポイント(本家OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)によるオプションのベクトル埋め込み。関数 body の実装部分を埋め込みます(シグネチャと先頭の docstring / ヘッダーコメントは取り除き、snippet の token 予算を実ロジックに回します)。長 context モデルが「コードが何をしているか」でランキングできます | | **インタラクティブ可視化** | D3.js力学レイアウトグラフ。検索、コミュニティ凡例切替、次数スケーリングノード対応 | | **ハブ・ブリッジ検出** | 最も接続の多いノードと媒介中心性によるアーキテクチャのボトルネックを発見 | | **サプライズスコアリング** | 予期しない結合を検出:コミュニティ間、言語間、周辺からハブへのエッジ | @@ -288,7 +288,7 @@ base URLがlocalhost(`127.0.0.1`、`localhost`、`0.0.0.0`、`::1`)を指し > **モデル選択のヒント。** `-preview` / `-beta` / `-exp` 付きのmodel ID(例:`google/gemini-embedding-2-preview`)は長期運用には避けてください。preview モデルは重みが変更される(次元が変わると全ノード re-embed 必須)か、予告なく deprecate される可能性があります。安定版 GA モデル推奨:`text-embedding-3-small` / `text-embedding-3-large`(OpenAI)、`Qwen/Qwen3-Embedding-8B`(vLLM / LocalAI 自前ホスト経由)、または `gemini-embedding-001`(ネイティブ Gemini provider 経由、`GOOGLE_API_KEY` が必要)。 > -> また注意:現状 
`code-review-graph` は**関数シグネチャのみ**を埋め込みます(ノードあたり約10トークン、例:`"parse_file function (path: str) returns Tree"`)。長 context で関数 body を理解する能力で差をつけるモデル(Gemini 2 や Qwen3-8B の MTEB-code SOTA スコア)は、この入力長では小型モデルとの差がかなり縮まります。Body / docstring 埋め込みはフォローアップ拡張として追跡中です。 +> `code-review-graph` はシグネチャに加えて**関数 body の実装**部分の切り取り片も埋め込みます(シグネチャと先頭の docstring / ヘッダーコメントは取り除き、snippet の token 予算を実ロジックに割り当てます。プロバイダごとの文字数上限:local 700 / google 3000 / minimax 3000 / openai 6000)。長 context モデル(Gemini 2、Qwen3-8B)は「関数が何をしているか」でランク付けでき、名前だけに頼らなくなります。新規リポジトリでは自動で有効;既存 DB はレガシーの signature-only 動作を保持し、`code-review-graph embed --include-body --confirm-reembed` で明示的に opt-in するまで切り替わりません(一度限りの全件 re-embed はクラウドプロバイダで API トークンを消費します。signature-only のままにしたい場合は `CRG_EMBED_INCLUDE_BODY=0`)。 diff --git a/README.ko-KR.md b/README.ko-KR.md index f593f065..68a555a4 100644 --- a/README.ko-KR.md +++ b/README.ko-KR.md @@ -135,7 +135,7 @@ git 커밋이나 파일 저장마다 훅이 실행됩니다. 그래프는 변경 | **23개 언어 + 노트북** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **영향 범위 분석** | 변경에 의해 영향 받는 함수, 클래스, 파일을 정확히 보여줍니다 | | **자동 업데이트 훅** | 수동 개입 없이 파일 편집 및 git 커밋마다 그래프가 업데이트됩니다 | -| **시맨틱 검색** | sentence-transformers, Google Gemini, MiniMax, 또는 OpenAI 호환 엔드포인트(실제 OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI)를 통한 선택적 벡터 임베딩 | +| **시맨틱 검색** | sentence-transformers, Google Gemini, MiniMax, 또는 OpenAI 호환 엔드포인트(실제 OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI)를 통한 선택적 벡터 임베딩. 함수 body의 구현 부분을 임베딩합니다(시그니처와 선두 docstring/헤더 주석은 제거하여 snippet 토큰 예산을 실제 로직에 할당). 
긴 context 모델이 "코드가 무엇을 하는지"로 랭킹할 수 있습니다 | | **인터랙티브 시각화** | 검색, 커뮤니티 범례 토글, 차수 기반 노드 크기 조정이 가능한 D3.js 포스 다이렉티드 그래프 | | **허브 및 브릿지 감지** | 가장 많이 연결된 노드와 매개 중심성을 통한 아키텍처 병목 지점 탐색 | | **서프라이즈 스코어링** | 예상치 못한 결합 감지: 커뮤니티 간, 언어 간, 주변부-허브 엣지 | @@ -288,7 +288,7 @@ base URL이 localhost(`127.0.0.1`, `localhost`, `0.0.0.0`, `::1`)를 가리킬 > **모델 선택 팁.** `-preview` / `-beta` / `-exp` 접미사가 붙은 model ID(예: `google/gemini-embedding-2-preview`)는 장기 운영용으로 피하세요. preview 모델은 가중치가 바뀌거나(차원 변경 시 전체 re-embed 필수) 예고 없이 deprecate될 수 있습니다. 안정 GA 모델 권장: `text-embedding-3-small` / `text-embedding-3-large`(OpenAI), `Qwen/Qwen3-Embedding-8B`(vLLM / LocalAI 자체 호스팅 경유), 또는 `gemini-embedding-001`(네이티브 Gemini provider 경유, `GOOGLE_API_KEY` 필요). > -> 참고로 현재 `code-review-graph`는 **함수 시그니처만** 임베딩합니다(노드당 약 10 토큰, 예: `"parse_file function (path: str) returns Tree"`). 긴 context로 함수 body를 이해하는 능력으로 차별화되는 모델(Gemini 2 또는 Qwen3-8B의 MTEB-code SOTA 점수)은 이 입력 길이에서 소형 모델과의 품질 차이가 훨씬 좁아집니다. Body / docstring 임베딩은 후속 개선 과제로 추적 중입니다. +> `code-review-graph`는 시그니처에 더해 **함수 body 구현부** 스니펫도 함께 임베딩합니다(시그니처와 선두 docstring/헤더 주석은 제거하고 snippet 토큰 예산을 실제 로직에 할당. 프로바이더별 문자 예산: local 700 / google 3000 / minimax 3000 / openai 6000). 이제 긴 context 모델(Gemini 2, Qwen3-8B)이 함수 이름만이 아니라 "함수가 무엇을 하는지"로 랭킹할 수 있습니다. 새 레포지토리는 자동으로 활성화; 기존 DB는 레거시 signature-only 동작을 유지하며, `code-review-graph embed --include-body --confirm-reembed`로 명시적 opt-in해야 전환됩니다(일회성 전량 re-embed는 클라우드 프로바이더에서 API 토큰을 소비. signature-only로 유지하려면 `CRG_EMBED_INCLUDE_BODY=0`). 
diff --git a/README.md b/README.md index 3dd714f4..ee5cd61c 100644 --- a/README.md +++ b/README.md @@ -195,7 +195,7 @@ The blast-radius analysis never misses an actually impacted file (perfect recall | **23 languages + notebooks** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **Blast-radius analysis** | Shows exactly which functions, classes, and files are affected by any change | | **Auto-update hooks** | Graph updates on every file edit and git commit without manual intervention | -| **Semantic search** | Optional vector embeddings via sentence-transformers, Google Gemini, MiniMax, or any OpenAI-compatible endpoint (real OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) | +| **Semantic search** | Optional vector embeddings via sentence-transformers, Google Gemini, MiniMax, or any OpenAI-compatible endpoint (real OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI). Embeds the function body logic (signature + leading docstrings/header comments are stripped so the token budget goes to implementation, not documentation) — long-context models can rank by what code *does* | | **Interactive visualisation** | D3.js force-directed graph with search, community legend toggles, and degree-scaled nodes | | **Hub & bridge detection** | Find most-connected nodes and architectural chokepoints via betweenness centrality | | **Surprise scoring** | Detect unexpected coupling: cross-community, cross-language, peripheral-to-hub edges | @@ -376,12 +376,16 @@ The cloud-egress warning is auto-skipped when the base URL points to localhost > `gemini-embedding-001` (via the native Gemini provider, which requires > `GOOGLE_API_KEY` instead of the OpenAI-compatible path). > -> Also note: `code-review-graph` currently embeds **function signatures only** -> (~10 tokens per node, e.g. `"parse_file function (path: str) returns Tree"`). 
-> Models whose headline quality comes from long-context body understanding -> (such as Gemini 2 or Qwen3-8B at their MTEB-code SOTA scores) will see a -> much narrower quality gap against smaller models at this input length. -> Body/docstring embedding is tracked as a follow-up enhancement. +> `code-review-graph` also embeds a truncated **function body +> implementation** snippet alongside the signature (signature + leading +> docstrings / header comments are stripped so the snippet tokens go to +> real logic; per-provider char budget: local 700 / google 3000 / +> minimax 3000 / openai 6000). Long-context models (Gemini 2, Qwen3-8B) +> can rank by what a function *does* and not just what it's named. New repositories pick this up automatically; +> existing DBs keep the legacy signature-only behavior until you opt in +> with `code-review-graph embed --include-body --confirm-reembed` (a +> one-time re-embed that may spend API tokens on cloud providers — set +> `CRG_EMBED_INCLUDE_BODY=0` to stay on signatures). 
#### Tool Filtering diff --git a/README.zh-CN.md b/README.zh-CN.md index 9d1c8fa0..2a8e387d 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -133,7 +133,7 @@ Build the code review graph for this project | **23 种语言 + 笔记本** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **影响半径分析** | 精确展示任何变更所影响的函数、类和文件 | | **自动更新钩子** | 每次文件编辑和 git 提交时自动更新图,无需手动干预 | -| **语义搜索** | 可选的向量嵌入,支持 sentence-transformers、Google Gemini、MiniMax,或任何 OpenAI 兼容端点(真实 OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI) | +| **语义搜索** | 可选的向量嵌入,支持 sentence-transformers、Google Gemini、MiniMax,或任何 OpenAI 兼容端点(真实 OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)。嵌入函数 body 的实作片段(剥除签名与开头的 docstring/header 注释,让 token 预算花在真正的逻辑而非文档),让长 context 模型按「代码做什么」排序 | | **交互式可视化** | D3.js 力导向图,支持搜索、社区图例切换和按度数缩放的节点 | | **Hub 与 Bridge 检测** | 查找连接最多的节点和通过介数中心性发现架构瓶颈 | | **异常评分** | 检测意外耦合:跨社区、跨语言、外围到核心的边 | @@ -286,7 +286,7 @@ export CRG_OPENAI_BATCH_SIZE=100 # 某些网关有更严 > **模型选择提示。** 避免用 `-preview` / `-beta` / `-exp` 结尾的 model ID(例如 `google/gemini-embedding-2-preview`)做长期使用——preview 模型可能更换权重(维度一变就要全量 re-embed)或被无预警下架。建议改用正式 GA 模型:`text-embedding-3-small` / `text-embedding-3-large`(OpenAI)、`Qwen/Qwen3-Embedding-8B`(经 vLLM / LocalAI 自宿主)、或 `gemini-embedding-001`(经原生 Gemini provider,需要 `GOOGLE_API_KEY`)。 > -> 另外请注意:目前 `code-review-graph` 只嵌入**函数签名**(每节点约 10 tokens,例如 `"parse_file function (path: str) returns Tree"`)。那些靠长 context 理解函数 body 来拉开差距的模型(Gemini 2 或 Qwen3-8B 在 MTEB-code 的 SOTA 分数)在这个输入长度下跟小模型的品质差距会小很多。Body / docstring 嵌入已列为后续增强任务。 +> `code-review-graph` 也会在签名后面接上截断过的**函数 body 实作**片段(剥除签名与开头的 docstring/header 注释,让 snippet token 留给真正的逻辑而非文档;每家 provider 的字元上限:local 700 / google 3000 / minimax 3000 / openai 6000),长 context 模型(Gemini 2、Qwen3-8B)终于能按「代码做什么」排序而不是只看名字。新 repo 自动开启;既有 DB 保留原本的 signature-only 行为,直到你执行 `code-review-graph embed --include-body --confirm-reembed` 显式 opt-in(一次性全量 
re-embed,对云 provider 会花 API token——想留在 signature-only 设 `CRG_EMBED_INCLUDE_BODY=0`)。 diff --git a/code_review_graph/cli.py b/code_review_graph/cli.py index 6e9349b6..d097b75e 100644 --- a/code_review_graph/cli.py +++ b/code_review_graph/cli.py @@ -480,6 +480,41 @@ def main() -> None: eval_cmd.add_argument("--report", action="store_true", help="Generate report from results") eval_cmd.add_argument("--output-dir", default=None, help="Output directory for results") + # embed (opt-in body enrichment gate) + embed_cmd = sub.add_parser( + "embed", + help="Compute embeddings for all graph nodes (opt-in body enrichment)", + ) + embed_cmd.add_argument("--repo", default=None, help="Repository root (auto-detected)") + embed_cmd.add_argument( + "--provider", + default=None, + help="Embedding provider: local (default) / google / minimax / openai", + ) + embed_cmd.add_argument( + "--model", + default=None, + help="Model name (local: sentence-transformers; openai: CRG_OPENAI_MODEL)", + ) + embed_cmd.add_argument( + "--include-body", + action="store_true", + help="Embed function bodies alongside signatures (sets CRG_EMBED_INCLUDE_BODY=1)", + ) + embed_cmd.add_argument( + "--confirm-reembed", + action="store_true", + help="Required alongside --include-body when embeddings.db already has rows — " + "without this flag the command refuses to trigger a full re-embed " + "(cost-safety gate for cloud providers).", + ) + embed_cmd.add_argument( + "--skip-cost-check", + action="store_true", + help="Bypass the ballpark token/cost estimate warning shown before " + "re-embed on cloud providers (CI / automation).", + ) + # detect-changes detect_cmd = sub.add_parser("detect-changes", help="Analyze change impact") detect_cmd.add_argument("--base", default="HEAD~1", help="Git diff base (default: HEAD~1)") @@ -722,6 +757,68 @@ def main() -> None: if result.get("files_updated", 0) > 0: _cli_post_process(store) + elif args.command == "embed": + from .embeddings import ( + CLOUD_PROVIDERS, + 
EmbeddingStore, + _warn_cloud_egress, + embed_all_nodes, + ) + + provider = getattr(args, "provider", None) or "local" + model = getattr(args, "model", None) + + # --include-body safety gate: existing DB requires --confirm-reembed. + if getattr(args, "include_body", False): + emb_for_count = EmbeddingStore(store.db_path) + try: + existing = emb_for_count.count() + finally: + emb_for_count.close() + if existing > 0 and not getattr(args, "confirm_reembed", False): + logging.error( + "--include-body on a non-empty embeddings DB (%d rows) " + "triggers a full re-embed. Re-run with --confirm-reembed " + "to proceed, or unset --include-body to keep the current " + "sticky gate. See CHANGELOG for cost notes.", + existing, + ) + sys.exit(2) + if ( + provider in CLOUD_PROVIDERS + and existing >= 500 + and not getattr(args, "skip_cost_check", False) + ): + print( + f"Heads up: re-embedding {existing} nodes via '{provider}' " + f"will spend API tokens. A rough ballpark is " + f"{existing * 200} tokens; check your provider's pricing. " + "Pass --skip-cost-check to silence this notice.", + file=sys.stderr, + ) + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + + emb_store = EmbeddingStore(store.db_path, provider=provider, model=model) + try: + if emb_store.provider is None: + logging.error( + "Embedding provider '%s' unavailable. 
For 'local', " + "install the 'embeddings' extra: " + "pip install code-review-graph[embeddings]", + provider, + ) + sys.exit(1) + if provider in CLOUD_PROVIDERS: + _warn_cloud_egress(emb_store.provider.name) + count = embed_all_nodes(store, emb_store) + print( + f"Embedded {count} new/changed nodes " + f"(provider={emb_store.provider.name}, " + f"total={emb_store.count()})" + ) + finally: + emb_store.close() + elif args.command == "status": stats = store.get_stats() print(f"Nodes: {stats.total_nodes}") diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index be29045f..6badc0a6 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -18,6 +18,7 @@ import sys import time from abc import ABC, abstractmethod +from collections import OrderedDict from pathlib import Path from typing import Any from urllib.parse import urlparse @@ -662,6 +663,11 @@ def _check_available() -> bool: text_hash TEXT NOT NULL, provider TEXT NOT NULL DEFAULT 'unknown' ); + +CREATE TABLE IF NOT EXISTS embeddings_meta ( + key TEXT PRIMARY KEY, + value TEXT NOT NULL +); """ @@ -688,8 +694,403 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: return dot / (norm_a * norm_b) -def _node_to_text(node: GraphNode) -> str: - """Convert a node to a searchable text representation.""" +# --------------------------------------------------------------------------- +# Body enrichment helpers (Iter 3 final: ADR-A1+dedup / B4 per-provider / C3 sticky) +# --------------------------------------------------------------------------- + +# ADR-B4: per-provider body char budget. Keys are provider-name prefixes +# (the part before ':'). Fallback applies to unknown providers. 
+_BODY_MAX_CHARS_BY_PROVIDER: dict[str, int] = { + "local": 700, # MiniLM ~256 token window; body + metadata must fit together + "google": 3000, # gemini-embedding-001 ~2048 token window, conservative 50% + "minimax": 3000, # embo-01 ~1024 token window, pay-per-use + "openai": 6000, # text-embedding-3-* 8192 token ceiling, safety margin +} +_BODY_MAX_CHARS_FALLBACK = 700 +_BODY_MAX_LINES_DEFAULT = 15 +_FILE_CACHE_MAX_ENTRIES = 128 +_FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB +_FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect +_EMBED_META_KEY_BODY_ENABLED = "embed_body_enabled" + + +def _resolve_body_max_chars(provider_name: str) -> int: + """Return the per-provider char budget for body-enriched embeddings. + + Precedence: + 1. ``CRG_EMBED_BODY_MAX_CHARS`` env var (positive int) — global override + 2. ``_BODY_MAX_CHARS_BY_PROVIDER[prefix]`` where prefix is + ``provider_name.split(':', 1)[0].lower()`` + 3. ``_BODY_MAX_CHARS_FALLBACK`` for unknown providers + """ + override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS") + if override: + try: + n = int(override) + if n > 0: + return n + except ValueError: + pass + prefix = provider_name.split(":", 1)[0].lower() if provider_name else "" + return _BODY_MAX_CHARS_BY_PROVIDER.get(prefix, _BODY_MAX_CHARS_FALLBACK) + + +class _FileLineCache: + """Per-embed-batch file reader with bounded LRU (``OrderedDict``). + + Returns a slice of file lines for a given 1-indexed inclusive range. + Returns ``[]`` (empty, never raises) for: + - missing file / directory / permission / generic ``OSError`` + - binary file (null byte in first ``_FILE_CACHE_SNIFF_BYTES``) + - oversize file (> ``_FILE_CACHE_MAX_BYTES``) + - out-of-range line numbers (``line_start <= 0`` or beyond EOF) + + Not thread-safe. ``embed_nodes`` is single-threaded; do not share + instances across batches. 
+ """ + + def __init__( + self, + repo_root: Path | None = None, + max_entries: int = _FILE_CACHE_MAX_ENTRIES, + ) -> None: + self._repo_root = repo_root + self._max_entries = max_entries + # value ``None`` marks a known-failed read so we don't re-probe + self._cache: OrderedDict[str, list[str] | None] = OrderedDict() + self._read_count = 0 + # Whether the most recent ``get_lines`` call saw a file that + # loaded at all. ``False`` only when the underlying file + # read/decode failed (missing / permission / binary / oversized + # / outside repo). Callers use this to decide whether an empty + # return reflects a legitimate empty slice or a transient read + # failure that should be retried on the next run, NOT persisted + # as an empty-body sentinel (Codex round 5 regression). + self._last_read_ok = True + + def _resolve(self, file_path: str) -> Path | None: + """Resolve ``file_path`` to an absolute path within the repo root. + + Returns ``None`` (treated as a read failure) if the resolved path + escapes the repo root. ``file_path`` comes straight out of the + graph DB, so an attacker who can write to ``nodes.file_path`` + could otherwise point us at ``/etc/passwd`` or ``../../secret`` + and get the contents embedded into a cloud provider. We only + allow the escape when ``repo_root`` is unset (e.g. unit tests that + construct the cache with absolute fixture paths). 
+ """ + p = Path(file_path) + if not p.is_absolute() and self._repo_root is not None: + p = self._repo_root / p + if self._repo_root is None: + return p + try: + resolved = p.resolve() + root = self._repo_root.resolve() + except (OSError, RuntimeError): + return None + try: + resolved.relative_to(root) + except ValueError: + return None + return resolved + + def _load(self, file_path: str) -> list[str] | None: + """Read and split a file; return ``None`` on any failure.""" + try: + p = self._resolve(file_path) + if p is None: + return None + st = p.stat() + if st.st_size > _FILE_CACHE_MAX_BYTES: + return None + with p.open("rb") as fh: + head = fh.read(_FILE_CACHE_SNIFF_BYTES) + if b"\x00" in head: + return None + rest = fh.read() + raw = head + rest + except (OSError, ValueError): + return None + try: + text = raw.decode("utf-8", errors="replace") + except Exception: + return None + return text.splitlines() + + def get_lines(self, file_path: str, line_start: int, line_end: int) -> list[str]: + if line_start <= 0 or line_end < line_start: + # Bogus range is a caller-side issue, not a read failure — + # retrying would hit the same bug, so mark as OK and let the + # caller persist whatever result they compute. + self._last_read_ok = True + return [] + key = file_path + if key in self._cache: + self._cache.move_to_end(key) + lines = self._cache[key] + else: + lines = self._load(key) + self._read_count += 1 + self._cache[key] = lines + if len(self._cache) > self._max_entries: + self._cache.popitem(last=False) + self._last_read_ok = lines is not None + if lines is None: + return [] + if line_start > len(lines): + return [] + return lines[line_start - 1:line_end] + + @property + def last_read_ok(self) -> bool: + return self._last_read_ok + + def clear(self) -> None: + self._cache.clear() + self._last_read_ok = True + + +def _looks_like_signature(line: str, node: GraphNode) -> bool: + """Conservative substring check: is ``line`` the declaration of ``node``? 
+ + Returns True only when ALL of: + - ``node.name`` appears in the line + - the line contains a language-specific declaration keyword + - (if ``node.params`` is set) the first param identifier appears + + Unknown languages return False (do not dedup — keep the line). + """ + s = line.strip() + if not s: + return False + if node.name not in s: + return False + + lang = (node.language or "").lower() + + if lang == "python": + if not ("def " in s or "class " in s): + return False + elif lang in ("javascript", "typescript", "jsx", "tsx"): + if not ( + "function " in s or "=>" in s or "class " in s or "async " in s + ): + return False + elif lang in ("java", "kotlin"): + has_type_name = bool( + node.return_type and node.return_type in s and node.name in s + ) + # Kotlin methods commonly omit an explicit return type (e.g. + # ``override fun save(user: User) { ... }``); fall back to the + # ``fun `` keyword so dedup still fires for those. + has_kotlin_fun = lang == "kotlin" and "fun " in s + if not ( + has_type_name or "class " in s or "interface " in s or has_kotlin_fun + ): + return False + elif lang == "rust": + if not any( + k in s for k in ("fn ", "impl ", "struct ", "enum ", "trait ") + ): + return False + elif lang == "go": + if "func " not in s: + return False + else: + return False + + if node.params: + pstripped = node.params.strip("()").strip() + if pstripped: + first = pstripped.split(",", 1)[0].split(":", 1)[0].strip() + # tokens like `*args`, `**kwargs`, `&self`, `self` are safe + # to skip param gate on (always match): bail only when we have + # a real identifier we can check. + bare = first.lstrip("*&") + if bare and bare != "self" and bare not in s: + return False + return True + + +def _extract_body_text( + node: GraphNode, + reader: _FileLineCache, + *, + max_chars: int, + max_lines: int = _BODY_MAX_LINES_DEFAULT, +) -> str: + """Return a truncated body snippet for semantic embedding. + + Pipeline: + 1. Skip ``File`` nodes and bogus line ranges. + 2. 
Read the node's ``[line_start, line_end]`` range via ``reader``. + 3. Drop line 0 if it looks like the node's signature (dedup with metadata). + 4. Run a block-comment + triple-quote state machine to skip leading + docstrings and header comments; leading single-line comments + (``#`` ``//`` ``--`` ``///`` ``*``) are dropped regardless of length. + 5. Collapse consecutive blank lines. + 6. Truncate by ``max_lines`` then by ``max_chars`` (prefer word boundary, + append ``' …'`` suffix when cut). + """ + if node.kind == "File": + return "" + if node.line_start is None or node.line_end is None: + return "" + if node.line_start <= 0 or node.line_end < node.line_start: + return "" + + lines = reader.get_lines(node.file_path, node.line_start, node.line_end) + if not lines: + return "" + + # Step 3: signature dedup + if _looks_like_signature(lines[0], node): + lines = lines[1:] + if not lines: + return "" + + # Step 4: block-comment + triple-quote state machine + kept: list[str] = [] + in_block_comment = False + in_triple_quote: str | None = None + drops = 0 + for line in lines: + s = line.strip() + + # triple-quote state (Python docstring continuation) + if in_triple_quote is not None: + drops += 1 + if in_triple_quote in s: + in_triple_quote = None + continue + + # block-comment state (C/Java/JS /* ... 
*/) + if in_block_comment: + drops += 1 + if "*/" in s: + in_block_comment = False + continue + + # leading-only triple-quote open + if not kept: + matched_triple = False + for q in ('"""', "'''"): + if s.startswith(q): + # same-line open+close: """one-liner.""" + if len(s) >= len(q) * 2 and s.endswith(q): + drops += 1 + matched_triple = True + break + # multi-line open: enter state (only if no closer on this line) + if q not in s[len(q):]: + in_triple_quote = q + drops += 1 + matched_triple = True + break + if matched_triple: + continue + + # block-comment open: /* or /** on its own line + if s.startswith("/*") or s.startswith("/**"): + if "*/" not in s: + # multi-line opener + in_block_comment = True + drops += 1 + continue + # single-line inline block comment: drop only if the line has no + # trailing code after the closing ``*/``. Lines like + # ``/* x */ real_code = 1`` keep flowing into the real-content + # branch (handled by the leading-``*`` case below + kept). + after_close = s.split("*/", 1)[1].strip() + if not kept and not after_close: + drops += 1 + continue + + # leading single-line comments: drop regardless of length (D-iter3-4) + if not kept and s.startswith(("#", "//", "--", "///", "*")): + drops += 1 + continue + + # blank line before any real code + if not kept and s == "": + drops += 1 + continue + + kept.append(line) + + # fallback safety cap + if drops >= 20 and not kept: + break + + if not kept: + return "" + + # Step 5: collapse blank-line runs + collapsed: list[str] = [] + prev_blank = False + for ln in kept: + is_blank = ln.strip() == "" + if is_blank and prev_blank: + continue + collapsed.append(ln) + prev_blank = is_blank + + # Step 6: truncate + collapsed = collapsed[:max_lines] + normalised = [ln.replace("\t", " ").rstrip() for ln in collapsed] + text = " \n ".join(normalised) + if len(text) > max_chars: + cut = text[:max_chars].rsplit(" ", 1) + text = (cut[0] if len(cut) == 2 and cut[0] else text[:max_chars]) + " …" + return text + + 
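Aside: the leading-noise stripping that steps 3–5 implement can be sketched in isolation. The following is a minimal standalone approximation — Python syntax only (`#` comments and `"""` docstrings, no C-style block-comment state), and the helper name `strip_leading_noise` is illustrative, not part of this patch:

```python
def strip_leading_noise(lines):
    """Drop a leading run of docstring / comment / blank lines, keep real code.

    Simplified sketch of the state machine above: once a real code line has
    been kept, later comments and blanks pass through untouched.
    """
    out = []
    in_doc = False
    for line in lines:
        s = line.strip()
        if in_doc:
            if '"""' in s:
                in_doc = False  # closing quote is consumed with the docstring
            continue
        if not out:
            if s.startswith('"""'):
                # one-liner docstring closes on the same line; otherwise
                # enter the skip-until-close state
                if not (len(s) >= 6 and s.endswith('"""')):
                    in_doc = True
                continue
            if s == "" or s.startswith("#"):
                continue
        out.append(line)
    return out
```

The real `_extract_body_text` additionally tracks `/* ... */` block comments and caps how many lines it will drop before bailing out.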
+def _split_combined_hash(stored: str) -> tuple[str, str, str]: + """Parse the combined hash stored in ``embeddings.text_hash``. + + Format evolution (all three forms round-trip through this helper): + + * **Iter 1 legacy**: single sha256 hex, no ``':'`` + → ``(stored, '', '')`` + * **Iter 2 (body flag only)**: ``'meta_hash:body_hash'`` + → ``(meta, body, '')`` + * **Iter 3.2 (body + file fingerprint)**: ``'meta:body:file_hash'`` + → ``(meta, body, file_hash)`` + + The third slot lets ``embed_nodes`` detect a body change without + reading the file again: when metadata is unchanged but the node's + file-level hash has shifted, the body snippet may have changed + (Codex-review finding: metadata-only stage-1 filter would otherwise + leave edits like ``return x * 2`` → ``return x * 3`` permanently + stale). Rows older than Iter 3.2 report an empty file-hash slot, + which callers treat as "unknown — must re-verify". + """ + parts = stored.split(":", 2) + if len(parts) == 1: + return parts[0], "", "" + if len(parts) == 2: + return parts[0], parts[1], "" + return parts[0], parts[1], parts[2] + + +def _node_to_text( + node: GraphNode, + reader: _FileLineCache | None = None, + *, + max_chars: int | None = None, +) -> str: + """Build searchable text = metadata + optional body snippet. + + Backward-compatible: callers passing ``reader=None`` get the pre-Iter-2 + metadata-only string exactly as before. Legacy tests that construct a + node and call ``_node_to_text(node)`` continue to pass unchanged. + + When ``reader`` is provided, ``max_chars`` is required (raises + ``ValueError`` otherwise — ``assert`` would be stripped under + ``python -O``). Caller is responsible for gating via ``_body_enabled`` + before passing a reader; this function does not re-check. 
+ """ parts = [node.name] if node.kind != "File": parts.append(node.kind.lower()) @@ -701,9 +1102,53 @@ def _node_to_text(node: GraphNode) -> str: parts.append(f"returns {node.return_type}") if node.language: parts.append(node.language) + + if reader is not None: + if max_chars is None: + raise ValueError("max_chars required when reader provided") + body = _extract_body_text(node, reader, max_chars=max_chars) + if body: + parts.append("body:") + parts.append(body) return " ".join(parts) +def _body_enabled(store: "EmbeddingStore") -> bool: + """ADR-C3 sticky-flag gate for body enrichment (Iter 3 D-iter3-1). + + Decision priority (first match wins; ALL paths write back the flag to + ``embeddings_meta.embed_body_enabled`` so subsequent runs honour it): + + 1. ``CRG_EMBED_INCLUDE_BODY=0`` -> False (write ``"0"``) + 2. ``CRG_EMBED_INCLUDE_BODY=1`` -> True (write ``"1"``) + 3. Existing flag -> honour stored value + 4. No flag + ``store.count() == 0`` -> True (auto-on for fresh DB) + 5. No flag + ``store.count() > 0`` -> False (legacy DB opts out by default) + + The CLI ``--include-body --confirm-reembed`` combo is implemented by + setting ``CRG_EMBED_INCLUDE_BODY=1`` before invoking embed, so case (2) + fires and flips ``"0"`` -> ``"1"`` on legacy DBs. + """ + env = os.environ.get("CRG_EMBED_INCLUDE_BODY") + if env == "0": + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "0") + return False + if env == "1": + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "1") + return True + stored = store._get_meta(_EMBED_META_KEY_BODY_ENABLED) + if stored == "1": + return True + if stored == "0": + return False + # Flag absent: decide based on DB emptiness and persist the choice. 
+ if store.count() == 0: + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "1") + return True + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "0") + return False + + class EmbeddingStore: """Manages vector embeddings for graph nodes in SQLite.""" @@ -743,31 +1188,146 @@ def __exit__(self, exc_type, exc_val, exc_tb) -> None: # type: ignore[no-untype def close(self) -> None: self._conn.close() + def _repo_root_guess(self) -> Path | None: + """Best-effort root inference for the ``_FileLineCache``. + + Graph files live in ``.code-review-graph/graph.db``; walk upward from + ``db_path`` to find that directory and return its parent. Returns + ``None`` if we can't pin it down — callers interpret that as + "use absolute file paths only". + """ + try: + parent = self.db_path.resolve().parent + if parent.name == ".code-review-graph": + return parent.parent + # db_path may live directly inside a repo root (tests/tmp_path) + return parent + except OSError: + return None + + # --- embeddings_meta (Iter 3 D-iter3-1 sticky flag storage) --- + + def _get_meta(self, key: str) -> str | None: + row = self._conn.execute( + "SELECT value FROM embeddings_meta WHERE key = ?", (key,) + ).fetchone() + return row["value"] if row else None + + def _set_meta(self, key: str, value: str) -> None: + self._conn.execute( + "INSERT INTO embeddings_meta (key, value) VALUES (?, ?) " + "ON CONFLICT(key) DO UPDATE SET value = excluded.value", + (key, value), + ) + self._conn.commit() + def embed_nodes(self, nodes: list[GraphNode], batch_size: int = 64) -> int: - """Compute and store embeddings for a list of nodes.""" + """Compute and store embeddings for a list of nodes. + + Two-stage filter (Iter 3 Directive 5): + + * **Stage 1 (zero IO)** — compute ``metadata_hash`` from the + metadata-only ``_node_to_text`` form and compare with the stored + hash's metadata half. If it matches AND body is disabled OR the + row already has a body part embedded, skip without reading any + file. 
+ * **Stage 2 (with IO)** — only for nodes that pass stage 1's + escape-hatch branches (metadata changed, or body was just + enabled on a legacy row): read the file through the + ``_FileLineCache``, compute ``body_hash``, and store the + combined ``"{metadata_hash}:{body_hash}"`` in ``text_hash``. + """ if not self.provider: return 0 - # Filter to nodes that need embedding - to_embed: list[tuple[GraphNode, str, str]] = [] provider_name = self.provider.name + body_on = _body_enabled(self) + max_chars = _resolve_body_max_chars(provider_name) if body_on else 0 + reader = _FileLineCache(repo_root=self._repo_root_guess()) if body_on else None + + to_embed: list[tuple[GraphNode, str, str]] = [] + + # Sort by file so the LRU cache gets consecutive hits within a file + sorted_nodes = sorted( + nodes, + key=lambda n: (n.file_path or "", n.line_start or 0), + ) - for node in nodes: + for node in sorted_nodes: if node.kind == "File": continue - text = _node_to_text(node) - text_hash = hashlib.sha256(text.encode()).hexdigest() + + # --- Stage 1: metadata-only hash, zero IO --- + meta_text = _node_to_text(node, reader=None) + meta_hash = hashlib.sha256(meta_text.encode()).hexdigest() existing = self._conn.execute( "SELECT text_hash, provider FROM embeddings WHERE qualified_name = ?", (node.qualified_name,), ).fetchone() - # Re-embed if text changed OR provider changed - if (existing and existing["text_hash"] == text_hash - and existing["provider"] == provider_name): + node_file_hash = node.file_hash or "" + if existing and existing["provider"] == provider_name: + stored_meta, stored_body, stored_file_hash = _split_combined_hash( + existing["text_hash"] + ) + if stored_meta == meta_hash: + # Metadata identical. Skip when we can PROVE the body + # snippet is still fresh too. 
We require *both* sides + # to carry a non-empty file fingerprint that matches; + # an empty fingerprint on either side is not proof of + # freshness — it just means we don't know — so we + # fall through to stage 2 and reread the file. This + # is the conservative path Codex round 3 flagged: + # metadata-only rows with absent ``node.file_hash`` + # would otherwise let body-only edits slip through. + if not body_on: + continue + if (stored_body != "" + and stored_file_hash != "" + and node_file_hash != "" + and stored_file_hash == node_file_hash): + continue + # else: fall through to stage 2 to refresh body_hash + + # --- Stage 2: body-enriched hash (reads file only when needed) --- + if body_on and reader is not None: + full_text = _node_to_text(node, reader=reader, max_chars=max_chars) + body_part = full_text[len(meta_text):] + if reader.last_read_ok: + # Always hash the body slot when body enrichment is + # on and the read succeeded, even when ``body_part`` + # is empty (single-line function whose only line was + # the signature, or a body entirely stripped as + # docstring/header). Writing ``sha256("")`` here + # lets stage 1 distinguish "body ran and produced + # nothing" from "body never ran" (legacy / body-off + # rows), so empty-body nodes short-circuit too. + body_hash = hashlib.sha256(body_part.encode()).hexdigest() + combined_hash = f"{meta_hash}:{body_hash}:{node_file_hash}" + final_text = full_text + else: + # File read failed (missing / permission / binary / + # oversized / outside repo). Do NOT persist the + # empty-body sentinel — that would freeze the node + # on a transient failure (Codex round 5). Fall back + # to the body-off stored form so the next run + # retries. We still embed the metadata-only text + # now, since we need *some* vector for this node. 
+ combined_hash = f"{meta_hash}::{node_file_hash}" + final_text = meta_text + else: + combined_hash = f"{meta_hash}::{node_file_hash}" + final_text = meta_text + + if (existing and existing["provider"] == provider_name + and existing["text_hash"] == combined_hash): continue - to_embed.append((node, text, text_hash)) + + to_embed.append((node, final_text, combined_hash)) + + if reader is not None: + reader.clear() if not to_embed: return 0 diff --git a/tests/fixtures/embeddings/eval/golden_queries.json b/tests/fixtures/embeddings/eval/golden_queries.json new file mode 100644 index 00000000..cfec44ad --- /dev/null +++ b/tests/fixtures/embeddings/eval/golden_queries.json @@ -0,0 +1,101 @@ +[ + { + "query": "parse source code tree into nodes", + "expected_top3": [ + "parser.py::parse_tree", + "parser.py::walk_nodes", + "parser.py::extract_functions" + ], + "type": "signature-searchable", + "notes": "node names contain parse/tree/nodes" + }, + { + "query": "split string into tokens", + "expected_top3": [ + "parser.py::tokenize", + "parser.py::parse_tree", + "parser.py::build_root" + ], + "type": "signature-searchable", + "notes": "tokenize name is the obvious hit" + }, + { + "query": "cosine similarity between embedding vectors", + "expected_top3": [ + "embeddings.py::cosine_similarity", + "embeddings.py::EmbeddingIndex.search", + "embeddings.py::embed_text" + ], + "type": "signature-searchable", + "notes": "cosine_similarity name is explicit" + }, + { + "query": "escape HTML entities including quotes and backticks", + "expected_top3": [ + "visualization.py::esc_h", + "visualization.py::render_html" + ], + "type": "body-dependent", + "notes": "esc_h name is cryptic; only body shows replace('&', '&') / replace('\"', '"') / replace('`', '`')" + }, + { + "query": "breadth first search with visited set and queue", + "expected_top3": [ + "graph.py::Graph.impact_radius", + "graph.py::Graph.neighbors", + "graph.py::Graph.add_edge" + ], + "type": "body-dependent", + "notes": 
"impact_radius name does not say BFS; body has visited=set(), deque(), while queue" + }, + { + "query": "read OPENAI_API_KEY from the environment", + "expected_top3": [ + "api_client.py::OpenAIClient.__init__", + "api_client.py::GeminiClient.__init__", + "api_client.py::MinimaxClient.__init__" + ], + "type": "body-dependent", + "notes": "all three __init__ share the same signature; only body contains the env var name" + }, + { + "query": "read GEMINI_API_KEY from the environment", + "expected_top3": [ + "api_client.py::GeminiClient.__init__", + "api_client.py::OpenAIClient.__init__", + "api_client.py::MinimaxClient.__init__" + ], + "type": "body-dependent", + "notes": "requires body to distinguish the three __init__ siblings" + }, + { + "query": "render graph nodes as html list", + "expected_top3": [ + "visualization.py::render_html", + "visualization.py::esc_h", + "graph.py::Graph.neighbors" + ], + "type": "mixed", + "notes": "render_html name has semantic signal; esc_h relies on body" + }, + { + "query": "detect git file changes since a ref", + "expected_top3": [ + "incremental.py::detect_changes", + "incremental.py::get_changed_files", + "graph.py::Graph.add_edge" + ], + "type": "mixed", + "notes": "detect_changes is signature-searchable; get_changed_files needs body-ish context" + }, + { + "query": "store embedding vector by qualified name", + "expected_top3": [ + "embeddings.py::EmbeddingIndex.add", + "embeddings.py::EmbeddingIndex.search", + "embeddings.py::EmbeddingIndex.__init__" + ], + "type": "signature-searchable", + "notes": "EmbeddingIndex.add name matches" + } +] diff --git a/tests/fixtures/embeddings/eval/mini_repo/api_client.py b/tests/fixtures/embeddings/eval/mini_repo/api_client.py new file mode 100644 index 00000000..de194c95 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/api_client.py @@ -0,0 +1,24 @@ +"""Mini cloud API clients for MRR eval fixture — identical signatures, body differs.""" + +import os + + +class OpenAIClient: + def 
__init__(self): + self.api_key = os.environ.get("OPENAI_API_KEY", "") + self.base_url = "https://api.openai.com/v1" + self.default_model = "text-embedding-3-small" + + +class GeminiClient: + def __init__(self): + self.api_key = os.environ.get("GEMINI_API_KEY", "") + self.base_url = "https://generativelanguage.googleapis.com/v1beta" + self.default_model = "gemini-embedding-001" + + +class MinimaxClient: + def __init__(self): + self.api_key = os.environ.get("MINIMAX_API_KEY", "") + self.base_url = "https://api.minimax.io/v1" + self.default_model = "embo-01" diff --git a/tests/fixtures/embeddings/eval/mini_repo/embeddings.py b/tests/fixtures/embeddings/eval/mini_repo/embeddings.py new file mode 100644 index 00000000..b5d30502 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/embeddings.py @@ -0,0 +1,32 @@ +"""Mini embedding index for MRR eval fixture.""" + + +def cosine_similarity(a, b): + if len(a) != len(b): + return 0.0 + dot = sum(x * y for x, y in zip(a, b)) + norm_a = sum(x * x for x in a) ** 0.5 + norm_b = sum(x * x for x in b) ** 0.5 + if norm_a == 0 or norm_b == 0: + return 0.0 + return dot / (norm_a * norm_b) + + +def embed_text(text): + tokens = text.lower().split() + return [float(len(t)) for t in tokens] + + +class EmbeddingIndex: + def __init__(self): + self.store = {} + + def add(self, qname, vec): + self.store[qname] = vec + + def search(self, query_vec, limit=10): + scored = [] + for qn, vec in self.store.items(): + scored.append((qn, cosine_similarity(query_vec, vec))) + scored.sort(key=lambda x: x[1], reverse=True) + return scored[:limit] diff --git a/tests/fixtures/embeddings/eval/mini_repo/graph.py b/tests/fixtures/embeddings/eval/mini_repo/graph.py new file mode 100644 index 00000000..8103ae11 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/graph.py @@ -0,0 +1,33 @@ +"""Mini graph store for MRR eval fixture.""" + +from collections import deque + + +class Graph: + def __init__(self): + self.nodes = {} + self.edges = {} + + 
def add_node(self, name): + self.nodes[name] = name + self.edges.setdefault(name, []) + + def add_edge(self, src, dst): + self.edges.setdefault(src, []).append(dst) + + def neighbors(self, name): + return list(self.edges.get(name, [])) + + def impact_radius(self, start, depth): + visited = set() + queue = deque([(start, 0)]) + result = [] + while queue: + node, d = queue.popleft() + if node in visited or d > depth: + continue + visited.add(node) + result.append(node) + for nbr in self.edges.get(node, []): + queue.append((nbr, d + 1)) + return result diff --git a/tests/fixtures/embeddings/eval/mini_repo/incremental.py b/tests/fixtures/embeddings/eval/mini_repo/incremental.py new file mode 100644 index 00000000..e9c905b2 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/incremental.py @@ -0,0 +1,16 @@ +"""Mini git change detection for MRR eval fixture.""" + +import subprocess + + +def detect_changes(repo_root, since_ref): + """Return list of files changed in repo since the given git ref.""" + out = subprocess.run( + ["git", "-C", repo_root, "diff", "--name-only", since_ref, "HEAD"], + capture_output=True, text=True, check=False, + ) + return [line for line in out.stdout.splitlines() if line.strip()] + + +def get_changed_files(repo_root): + return detect_changes(repo_root, "HEAD~1") diff --git a/tests/fixtures/embeddings/eval/mini_repo/parser.py b/tests/fixtures/embeddings/eval/mini_repo/parser.py new file mode 100644 index 00000000..6874d7d0 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/parser.py @@ -0,0 +1,33 @@ +"""Mini AST parser for MRR eval fixture.""" + + +def parse_tree(source): + """Parse source text and return an AST root.""" + tokens = tokenize(source) + root = build_root(tokens) + return root + + +def walk_nodes(tree, visitor): + stack = [tree] + while stack: + node = stack.pop() + visitor(node) + stack.extend(node.children) + + +def extract_functions(tree): + found = [] + walk_nodes(tree, lambda n: found.append(n) if n.kind == 
"function" else None) + return found + + +def tokenize(source): + return source.split() + + +def build_root(tokens): + class N: + kind = "root" + children = [] + return N() diff --git a/tests/fixtures/embeddings/eval/mini_repo/visualization.py b/tests/fixtures/embeddings/eval/mini_repo/visualization.py new file mode 100644 index 00000000..5e131ccd --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/visualization.py @@ -0,0 +1,18 @@ +"""Mini HTML rendering for MRR eval fixture.""" + + +def esc_h(s): + """Replace dangerous characters so the string is safe inside HTML.""" + out = s.replace("&", "&") + out = out.replace("<", "<").replace(">", ">") + out = out.replace('"', """) + out = out.replace("`", "`") + return out + + +def render_html(graph): + buf = ["") + return "".join(buf) diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 2f5cc4b0..a6283e2c 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -8,16 +8,25 @@ import pytest from code_review_graph.embeddings import ( + _BODY_MAX_CHARS_BY_PROVIDER, + _BODY_MAX_CHARS_FALLBACK, + _EMBED_META_KEY_BODY_ENABLED, LOCAL_DEFAULT_MODEL, EmbeddingStore, LocalEmbeddingProvider, MiniMaxEmbeddingProvider, OpenAIEmbeddingProvider, + _body_enabled, _cosine_similarity, _decode_vector, _encode_vector, + _extract_body_text, + _FileLineCache, _is_localhost_url, + _looks_like_signature, _node_to_text, + _resolve_body_max_chars, + _split_combined_hash, get_provider, ) from code_review_graph.graph import GraphNode @@ -1034,3 +1043,748 @@ def test_subdomain_spoof_triggers_warning(self, capsys): get_provider("openai") captured = capsys.readouterr() assert "cloud" in captured.err.lower() + + +# --------------------------------------------------------------------------- +# Body enrichment tests (Iter 3 final: ADR-A1+dedup / B4 per-provider / C3 sticky) +# --------------------------------------------------------------------------- + + +def _mk_node(**kwargs): + """Minimal GraphNode factory for 
body-enrichment tests.""" + defaults = dict( + id=1, kind="Function", name="my_func", + qualified_name="sample.py::my_func", file_path="sample.py", + line_start=1, line_end=5, language="python", + parent_name=None, params=None, return_type=None, + is_test=False, file_hash=None, extra={}, + ) + defaults.update(kwargs) + return GraphNode(**defaults) + + +class TestFileLineCache: + def test_reads_existing_file(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("line1\nline2\nline3\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("a.py", 1, 3) == ["line1", "line2", "line3"] + assert cache.get_lines("a.py", 2, 2) == ["line2"] + + def test_caches_across_calls(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("x\ny\nz\n") + cache = _FileLineCache(repo_root=tmp_path) + cache.get_lines("a.py", 1, 2) + cache.get_lines("a.py", 2, 3) + cache.get_lines("a.py", 1, 1) + assert cache._read_count == 1 + + def test_missing_file_returns_empty(self, tmp_path): + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("does_not_exist.py", 1, 5) == [] + + def test_binary_file_detected_returns_empty(self, tmp_path): + p = tmp_path / "b.bin" + p.write_bytes(b"abc\x00def\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("b.bin", 1, 3) == [] + + def test_out_of_range_lines_returns_empty(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("only-one\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("a.py", 999, 1000) == [] + assert cache.get_lines("a.py", 0, 5) == [] + assert cache.get_lines("a.py", 3, 1) == [] + + def test_lru_eviction_at_max_entries(self, tmp_path): + cache = _FileLineCache(repo_root=tmp_path, max_entries=4) + for i in range(6): + p = tmp_path / f"f{i}.py" + p.write_text(f"content {i}\n") + cache.get_lines(f"f{i}.py", 1, 1) + # Earliest two entries should have been evicted + assert "f0.py" not in cache._cache + assert "f1.py" not in cache._cache + assert "f5.py" in 
cache._cache + assert len(cache._cache) == 4 + + +class TestSignatureDedup: + @pytest.mark.parametrize("lang,line,name,params", [ + ("python", "def my_func(x, y):", "my_func", "(x, y)"), + ("javascript", "function handle(req, res) {", "handle", "(req, res)"), + ("typescript", "const handle = (req: Req) => {", "handle", "(req: Req)"), + ("java", "public int compute(int x) {", "compute", "(int x)"), + ("rust", "pub fn build_tree(root: Node) -> Tree {", + "build_tree", "(root: Node)"), + ]) + def test_signature_line_deduplication(self, lang, line, name, params): + node = _mk_node(language=lang, name=name, params=params, + return_type="int" if lang == "java" else None) + assert _looks_like_signature(line, node) is True + + def test_signature_dedup_preserves_body_when_no_match(self): + # Decoy signature-like text but node.name is different -> must NOT dedup + node = _mk_node(language="python", name="compute_totals", + params="(items)") + assert _looks_like_signature("def unrelated(x): pass", node) is False + + def test_signature_dedup_respects_params_mismatch(self): + # Overload with different param first-token + node = _mk_node(language="python", name="handle", params="(request)") + assert _looks_like_signature( + "def handle(completely_different_arg):", node, + ) is False + + def test_signature_dedup_unknown_language_conservative(self): + node = _mk_node(language="kotlin2", name="foo", params="()") + assert _looks_like_signature("fun foo() {}", node) is False + + def test_signature_dedup_go_and_typescript(self): + go_node = _mk_node(language="go", name="Handle", params="(w, r)") + assert _looks_like_signature( + "func Handle(w http.ResponseWriter, r *http.Request) {", go_node, + ) is True + ts_node = _mk_node(language="typescript", name="doThing", params="(x)") + assert _looks_like_signature("async function doThing(x) {", ts_node) is True + + def test_decorator_line_not_mistaken_for_signature(self): + # Iter 3 D-iter3-5: @decorator line has node.name false-positive 
risk + # because tree-sitter may include the decorator in the node range. + # Three-AND gate requires a declaration keyword; @my_decorator has none. + node = _mk_node(language="python", name="my_func", params="(x)") + assert _looks_like_signature("@my_decorator", node) is False + # Also the real def line DOES match. + assert _looks_like_signature("def my_func(x):", node) is True + + def test_recursive_call_first_body_line_preserved(self): + # Iter 3 D-iter3-5: body first line `return fact(n-1)` contains + # node.name but lacks `def ` keyword, so three-AND gate returns False + # and dedup does NOT drop the line. + node = _mk_node(language="python", name="fact", params="(n)") + assert _looks_like_signature("return n * fact(n - 1)", node) is False + + +class TestBlockCommentStateMachine: + def _extract(self, fixture_text: str, tmp_path, **node_kwargs) -> str: + p = tmp_path / "f.py" + p.write_text(fixture_text) + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node( + file_path="f.py", + line_start=node_kwargs.get("line_start", 1), + line_end=node_kwargs.get("line_end", fixture_text.count("\n") + 1), + language=node_kwargs.get("language", "python"), + name=node_kwargs.get("name", "never-matches-anything"), + ) + return _extract_body_text(node, reader, max_chars=4000, max_lines=40) + + def test_jsdoc_multiline_header_skipped(self, tmp_path): + src = ( + "/**\n" + " * docstring line 1\n" + " * docstring line 2\n" + " */\n" + "return 42;\n" + ) + out = self._extract(src, tmp_path, language="javascript") + assert "docstring" not in out + assert "return 42" in out + + def test_javadoc_preserves_body(self, tmp_path): + src = ( + "/** header doc */\n" + "int answer = 42;\n" + "return answer;\n" + ) + out = self._extract(src, tmp_path, language="java") + assert "header doc" not in out + assert "answer = 42" in out or "return answer" in out + + def test_block_comment_unclosed_falls_back(self, tmp_path): + # Unclosed /* -> state machine stays in block_comment, drops 
everything. + # Fallback cap prevents hang, function returns "". + src = "/*\n" + ("noise line\n" * 25) + out = self._extract(src, tmp_path, language="javascript") + assert out == "" + + def test_inline_block_comment_not_swallowed(self, tmp_path): + src = "/* x */ real_code = 1\nmore = 2\n" + out = self._extract(src, tmp_path, language="javascript") + # inline /* x */ with trailing code: should NOT enter block state + assert "real_code = 1" in out or "more = 2" in out + + def test_python_single_line_triple_quote_docstring_skipped(self, tmp_path): + src = '"""one-liner docstring."""\nreturn value\n' + out = self._extract(src, tmp_path, language="python") + assert "one-liner docstring" not in out + assert "return value" in out + + def test_python_multiline_triple_quote_docstring_skipped(self, tmp_path): + src = ( + '"""\n' + " line 1\n" + " line 2\n" + '"""\n' + "compute = 3\n" + ) + out = self._extract(src, tmp_path, language="python") + assert "line 1" not in out and "line 2" not in out + assert "compute = 3" in out + + def test_comment_prefix_dropped_regardless_of_length(self, tmp_path): + long_comment = "# " + ("x" * 200) # >> 120 chars + src = f"{long_comment}\nreal = 1\n" + out = self._extract(src, tmp_path, language="python") + assert long_comment not in out + assert "real = 1" in out + + +class TestBodyExtraction: + def _extract(self, src: str, tmp_path, **node_kwargs) -> str: + p = tmp_path / "f.py" + p.write_text(src) + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node( + file_path="f.py", + line_start=node_kwargs.get("line_start", 1), + line_end=node_kwargs.get("line_end", src.count("\n") + 1), + language=node_kwargs.get("language", "python"), + kind=node_kwargs.get("kind", "Function"), + name=node_kwargs.get("name", "whatever-no-sig-match"), + ) + return _extract_body_text( + node, reader, + max_chars=node_kwargs.get("max_chars", 4000), + max_lines=node_kwargs.get("max_lines", 40), + ) + + def test_python_function_body(self, tmp_path): + src = 
( + "def parse_tree(data):\n" + " root = build_root(data)\n" + " walk_all_nodes(root)\n" + " return root\n" + ) + out = self._extract( + src, tmp_path, name="parse_tree", line_start=1, line_end=4, + ) + # signature line dropped via dedup; body should contain real code + assert "build_root" in out + assert "walk_all_nodes" in out + + def test_file_node_returns_empty(self, tmp_path): + p = tmp_path / "x.py" + p.write_text("print('hi')\n") + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node(kind="File", file_path="x.py", line_start=1, line_end=1) + assert _extract_body_text(node, reader, max_chars=1000) == "" + + def test_leading_triple_quote_docstring_skipped(self, tmp_path): + src = '"""module docstring."""\nvalue = 1\n' + out = self._extract(src, tmp_path, name="never") + assert "module docstring" not in out + assert "value = 1" in out + + def test_comment_only_region_skipped(self, tmp_path): + src = "# comment 1\n# comment 2\nactual_code = 42\n" + out = self._extract(src, tmp_path, name="never") + assert "comment 1" not in out + assert "actual_code" in out + + def test_truncation_by_chars(self, tmp_path): + src = "value_a = {}\n".format("x" * 100) # single long line + out = self._extract( + src, tmp_path, name="never", + max_chars=40, line_start=1, line_end=1, + ) + assert out.endswith("…") + assert len(out) <= 45 # max_chars + suffix overhead + + def test_truncation_by_lines(self, tmp_path): + lines = "\n".join(f"stmt_{i} = {i}" for i in range(30)) + "\n" + out = self._extract( + src=lines, tmp_path=tmp_path, name="never", + max_chars=100000, max_lines=3, + ) + # only 3 stmts kept; stmt_3+ must not appear + assert "stmt_0" in out + assert "stmt_29" not in out + + def test_single_line_node(self, tmp_path): + src = "def tiny(): return 1\n" + out = self._extract( + src, tmp_path, name="tiny", params="()", + line_start=1, line_end=1, + ) + # signature dedup drops the only line; body should be empty + assert out == "" + + +class 
TestBodyMaxCharsPerProvider: + @pytest.mark.parametrize("provider_name,expected", [ + ("local:all-MiniLM-L6-v2", _BODY_MAX_CHARS_BY_PROVIDER["local"]), + ("google:gemini-embedding-001", _BODY_MAX_CHARS_BY_PROVIDER["google"]), + ("minimax:embo-01", _BODY_MAX_CHARS_BY_PROVIDER["minimax"]), + ("openai:text-embedding-3-small", _BODY_MAX_CHARS_BY_PROVIDER["openai"]), + ]) + def test_body_max_chars_per_provider(self, provider_name, expected): + with patch.dict(os.environ, {}, clear=False): + os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None) + assert _resolve_body_max_chars(provider_name) == expected + + def test_body_max_chars_env_override(self): + with patch.dict(os.environ, {"CRG_EMBED_BODY_MAX_CHARS": "1234"}): + assert _resolve_body_max_chars("local:anything") == 1234 + assert _resolve_body_max_chars("openai:foo") == 1234 + + def test_body_max_chars_unknown_provider_fallback(self): + with patch.dict(os.environ, {}, clear=False): + os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None) + assert _resolve_body_max_chars("exotic:brand-new") == \ + _BODY_MAX_CHARS_FALLBACK + + def test_body_max_chars_provider_prefix_parsing(self): + with patch.dict(os.environ, {}, clear=False): + os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None) + assert _resolve_body_max_chars("openai:text-embedding-v4") == \ + _BODY_MAX_CHARS_BY_PROVIDER["openai"] + assert _resolve_body_max_chars("OPENAI:foo") == \ + _BODY_MAX_CHARS_BY_PROVIDER["openai"] + + +class TestBodyEnabledStoreGate: + """ADR-C3 sticky flag (Iter 3 D-iter3-1 / D-iter3-2).""" + + def _fresh_store(self, tmp_path): + with patch("code_review_graph.embeddings.get_provider", return_value=None): + return EmbeddingStore(tmp_path / "e.db") + + def _scrub_env(self): + for k in ("CRG_EMBED_INCLUDE_BODY",): + os.environ.pop(k, None) + + def test_body_enabled_empty_db_auto_enables(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + assert _body_enabled(store) is True + assert 
store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_body_enabled_nonempty_db_requires_opt_in(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + # Simulate an existing embedding row without flag being set yet. + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1, 0.2]), "abc", "unknown"), + ) + store._conn.commit() + assert _body_enabled(store) is False + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + finally: + store.close() + + def test_body_enabled_env_on_overrides(self, tmp_path): + with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "1"}): + store = self._fresh_store(tmp_path) + try: + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1]), "abc", "unknown"), + ) + store._conn.commit() + assert _body_enabled(store) is True + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_body_enabled_env_off_overrides(self, tmp_path): + with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "0"}): + store = self._fresh_store(tmp_path) + try: + assert _body_enabled(store) is False + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + finally: + store.close() + + def test_sticky_flag_empty_db_writes_on(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) is None + _body_enabled(store) + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_sticky_flag_existing_db_writes_off(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", 
_encode_vector([0.1]), "abc", "unknown"), + ) + store._conn.commit() + _body_enabled(store) + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + finally: + store.close() + + def test_sticky_flag_persists_across_runs(self, tmp_path): + """Iter 2 mid-run bug regression: once on, stays on after count() > 0.""" + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + # Run 1: empty DB -> auto-on, flag = "1" + assert _body_enabled(store) is True + # Simulate embeddings landing (count > 0). + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1]), "h", "unknown"), + ) + store._conn.commit() + assert store.count() > 0 + # Run 2: no env — MUST honour the "1" flag, NOT fall back to + # the count()>0 -> False branch. + assert _body_enabled(store) is True + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + finally: + store.close() + + def test_sticky_flag_cli_flip_from_off_to_on(self, tmp_path): + self._scrub_env() + store = self._fresh_store(tmp_path) + try: + # Setup: legacy DB with flag="0" (decided at first embed without env) + store._conn.execute( + "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) " + "VALUES (?, ?, ?, ?)", + ("x::y", _encode_vector([0.1]), "h", "unknown"), + ) + store._conn.commit() + assert _body_enabled(store) is False + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0" + # CLI user runs `embed --include-body --confirm-reembed` + # (sets env var before calling _body_enabled). + with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "1"}): + assert _body_enabled(store) is True + assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1" + # Next run without env: flag is now "1" -> stays on. 
+ self._scrub_env() + assert _body_enabled(store) is True + finally: + store.close() + + +class TestFilterStageShortCircuit: + """Directive 5: metadata_hash short-circuit skips file IO for unchanged nodes.""" + + def _fake_provider(self, dim=4): + mock = MagicMock() + mock.name = "local:fake" + mock.embed = MagicMock(return_value=[[0.1] * dim, [0.2] * dim, [0.3] * dim]) + mock.embed_query = MagicMock(return_value=[0.1] * dim) + return mock + + def _store_with_fake(self, tmp_path, provider): + with patch( + "code_review_graph.embeddings.get_provider", return_value=provider, + ): + return EmbeddingStore(tmp_path / "e.db") + + def test_incremental_skips_unchanged_node_without_file_read(self, tmp_path): + # Create real source file for node + src = tmp_path / "sample.py" + src.write_text( + "def compute(x):\n" + " a = x * 2\n" + " return a\n", + ) + provider = self._fake_provider() + # Force sticky flag on so stage-2 would normally read file + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + # Unchanged node carries a populated file_hash — only then + # can stage 1 safely short-circuit (empty fingerprints now + # force a re-read; see Codex round 3 regression fix). + node = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=3, + qualified_name="sample.py::compute", + file_hash="file-fingerprint-v1", + ) + # First embed — reader WILL be used; returns 1 embedded. + assert store.embed_nodes([node]) == 1 + provider.embed.reset_mock() + + # Now patch reader.get_lines so we can assert it's not hit + with patch.object( + _FileLineCache, "get_lines", + autospec=True, return_value=[], + ) as mock_get_lines: + # Re-embed same node — metadata + file_hash both match + # the stored combined hash → stage 1 short-circuits + # with no file IO. 
+ assert store.embed_nodes([node]) == 0 + assert mock_get_lines.call_count == 0 + assert provider.embed.call_count == 0 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_read_failure_does_not_freeze_node(self, tmp_path): + """Codex round 5: transient file-read failure (missing file, binary, + permission error, out-of-repo) must NOT persist an empty-body + sentinel that makes stage 1 skip forever. Once the file becomes + readable again the body must be recomputed.""" + src = tmp_path / "maybe.py" + src.write_text("def maybe():\n a = 1\n return a\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node = _mk_node( + language="python", name="maybe", params="()", + file_path=str(src), line_start=1, line_end=3, + qualified_name="maybe.py::maybe", + file_hash="maybe-fingerprint", + ) + # First embed with a patched reader that simulates a + # transient read failure (last_read_ok=False). + with patch.object( + _FileLineCache, "get_lines", + autospec=True, return_value=[], + ) as mock_get_lines: + def fake_fail(self, *a, **kw): + self._last_read_ok = False + return [] + mock_get_lines.side_effect = fake_fail + assert store.embed_nodes([node]) == 1 + + provider.embed.reset_mock() + # Second embed with a real reader — the failed body + # snippet should NOT short-circuit stage 1 (stored_body + # is ""), so stage 2 reads the file again and persists a + # proper body hash. + assert store.embed_nodes([node]) == 1 + assert provider.embed.call_count == 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_empty_body_does_not_reread_on_unchanged_run(self, tmp_path): + """Codex round 4: single-line functions whose body extracts to "" + used to fall through stage 1 on every incremental run. 
Stage 1 + must now skip them too (empty body hash is written as sha256('') + to distinguish 'body ran, produced nothing' from legacy rows).""" + src = tmp_path / "tiny.py" + src.write_text("def tiny(): return 1\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node = _mk_node( + language="python", name="tiny", params="()", + file_path=str(src), line_start=1, line_end=1, + qualified_name="tiny.py::tiny", + file_hash="tiny-fingerprint", + ) + assert store.embed_nodes([node]) == 1 + provider.embed.reset_mock() + with patch.object( + _FileLineCache, "get_lines", + autospec=True, return_value=[], + ) as mock_get_lines: + assert store.embed_nodes([node]) == 0 + assert mock_get_lines.call_count == 0 + assert provider.embed.call_count == 0 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_body_edit_with_empty_file_hash_still_reembeds(self, tmp_path): + """Codex round 3: when both stored and current file_hash are empty, + stage 1 must NOT short-circuit — otherwise body-only edits slip + through on callers that don't populate node.file_hash.""" + src = tmp_path / "sample.py" + src.write_text("def compute(x):\n return x * 2\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + # file_hash deliberately omitted (None → empty string slot) + node_v1 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + assert store.embed_nodes([node_v1]) == 1 + provider.embed.reset_mock() + src.write_text("def compute(x):\n return x * 3\n") + node_v2 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + # Without a file_hash signal we must re-read and 
recompute + # body_hash; the body text has changed so this is 1 embed. + assert store.embed_nodes([node_v2]) == 1 + assert provider.embed.call_count == 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_metadata_change_triggers_stage2_read(self, tmp_path): + src = tmp_path / "sample.py" + src.write_text("def compute(x):\n return x * 2\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + store.embed_nodes([node]) + provider.embed.reset_mock() + # Flip return_type => metadata_hash changes => stage 2 must run. + node_changed = _mk_node( + language="python", name="compute", params="(x)", + return_type="int", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", + ) + with patch.object( + _FileLineCache, "get_lines", autospec=True, + return_value=["def compute(x):", " return x * 2"], + ) as mock_get_lines: + assert store.embed_nodes([node_changed]) == 1 + assert mock_get_lines.call_count >= 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + def test_split_combined_hash_legacy_row(self): + # Iter 1: pure sha256 hex, no separator + legacy = "a" * 64 + assert _split_combined_hash(legacy) == (legacy, "", "") + # Iter 2: metadata + body only (no file fingerprint yet) + iter2 = "abc:def" + assert _split_combined_hash(iter2) == ("abc", "def", "") + # Iter 2 body-disabled stored form + iter2_body_off = "abc:" + assert _split_combined_hash(iter2_body_off) == ("abc", "", "") + # Iter 3.2: metadata + body + file fingerprint + iter3 = "meta:body:fhash" + assert _split_combined_hash(iter3) == ("meta", "body", "fhash") + # Iter 3.2 body-disabled but file_hash captured + iter3_body_off = "meta::fhash" + 
assert _split_combined_hash(iter3_body_off) == ("meta", "", "fhash") + + def test_body_only_change_triggers_reembed(self, tmp_path): + """Codex-review regression: metadata-unchanged body edit used to be + skipped forever. Now file_hash shift forces stage-2 re-read.""" + src = tmp_path / "sample.py" + src.write_text("def compute(x):\n return x * 2\n") + provider = self._fake_provider() + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + try: + store = self._store_with_fake(tmp_path, provider) + try: + node_v1 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", file_hash="h-v1", + ) + assert store.embed_nodes([node_v1]) == 1 + provider.embed.reset_mock() + # Body edits but metadata identical; file_hash changes. + src.write_text("def compute(x):\n return x * 3\n") + node_v2 = _mk_node( + language="python", name="compute", params="(x)", + file_path=str(src), line_start=1, line_end=2, + qualified_name="sample.py::compute", file_hash="h-v2", + ) + assert store.embed_nodes([node_v2]) == 1 + assert provider.embed.call_count == 1 + finally: + store.close() + finally: + os.environ.pop("CRG_EMBED_INCLUDE_BODY", None) + + +class TestFileLineCacheContainment: + """Codex-review regression: file_path must stay inside repo_root.""" + + def test_absolute_path_outside_root_rejected(self, tmp_path): + outside = tmp_path.parent / ("outside-" + tmp_path.name + ".secret") + outside.write_text("top secret\n") + try: + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines(str(outside), 1, 1) == [] + finally: + outside.unlink(missing_ok=True) + + def test_relative_traversal_rejected(self, tmp_path): + outside = tmp_path.parent / "escape-target.secret" + outside.write_text("top secret\n") + try: + cache = _FileLineCache(repo_root=tmp_path) + # "../escape-target.secret" tries to pop out of the repo + assert cache.get_lines( + f"../{outside.name}", 1, 1, + ) == [] + finally: + 
outside.unlink(missing_ok=True) + + def test_inside_root_still_reads(self, tmp_path): + inside = tmp_path / "hello.py" + inside.write_text("x = 1\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("hello.py", 1, 1) == ["x = 1"] + + def test_no_root_preserves_absolute_behavior(self, tmp_path): + inside = tmp_path / "hello.py" + inside.write_text("y = 2\n") + # Without repo_root we intentionally skip the containment check so + # unit tests that construct nodes with absolute fixture paths still + # work (see TestFileLineCache.test_reads_existing_file style, but + # called with no repo_root). + cache = _FileLineCache(repo_root=None) + assert cache.get_lines(str(inside), 1, 1) == ["y = 2"] + + +class TestKotlinFunSignatureDedup: + """Codex-review regression: Kotlin ``fun`` without return type.""" + + def test_kotlin_fun_without_return_type(self): + node = _mk_node( + language="kotlin", name="save", params="(user: User)", + return_type=None, + ) + assert _looks_like_signature( + "override fun save(user: User) {", node, + ) is True diff --git a/tests/test_embeddings_mrr.py b/tests/test_embeddings_mrr.py new file mode 100644 index 00000000..0746ff38 --- /dev/null +++ b/tests/test_embeddings_mrr.py @@ -0,0 +1,183 @@ +"""MRR regression test for body-enriched embeddings (AC-6 Iter 3). + +Runs the Local sentence-transformers provider against a fixed mini_repo + +10 golden queries and asserts that enabling body enrichment does NOT +regress Mean Reciprocal Rank at top-3 beyond the -0.02 tolerance AND +produces a strictly positive delta (non-inferiority + soft signal). + +Skipped when ``sentence-transformers`` is not installed, so the project's +default ``.[dev]`` test matrix keeps passing; CI jobs that care about the +gate should install ``.[embeddings]`` to activate this module. 
+""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest + +from code_review_graph.graph import GraphNode + +pytest.importorskip("sentence_transformers") + +from code_review_graph.embeddings import ( # noqa: E402 — after importorskip + LocalEmbeddingProvider, + _cosine_similarity, + _FileLineCache, + _node_to_text, + _resolve_body_max_chars, +) + +FIXTURE_DIR = Path(__file__).parent / "fixtures" / "embeddings" / "eval" +MINI_REPO = FIXTURE_DIR / "mini_repo" + + +def _mk(qualified_name, file_rel, line_start, line_end, *, + kind="Function", name=None, parent_name=None, params=None, + return_type=None): + """Build a GraphNode for mini_repo fixture nodes.""" + return GraphNode( + id=0, kind=kind, + name=name or qualified_name.rsplit("::", 1)[1].split(".")[-1], + qualified_name=qualified_name, + file_path=str(MINI_REPO / file_rel), + line_start=line_start, line_end=line_end, + language="python", + parent_name=parent_name, params=params, return_type=return_type, + is_test=False, file_hash=None, extra={}, + ) + + +# Accurate line ranges from mini_repo files (see tests/fixtures/.../mini_repo/). 
+NODES: list[GraphNode] = [ + # parser.py + _mk("parser.py::parse_tree", "parser.py", 4, 9, params="(source)"), + _mk("parser.py::walk_nodes", "parser.py", 11, 16, params="(tree, visitor)"), + _mk("parser.py::extract_functions", "parser.py", 19, 22, params="(tree)"), + _mk("parser.py::tokenize", "parser.py", 25, 26, params="(source)"), + _mk("parser.py::build_root", "parser.py", 29, 33, params="(tokens)"), + # graph.py + _mk("graph.py::Graph.__init__", "graph.py", 7, 9, params="(self)", + parent_name="Graph"), + _mk("graph.py::Graph.add_node", "graph.py", 11, 13, params="(self, name)", + parent_name="Graph"), + _mk("graph.py::Graph.add_edge", "graph.py", 15, 16, params="(self, src, dst)", + parent_name="Graph"), + _mk("graph.py::Graph.neighbors", "graph.py", 18, 19, params="(self, name)", + parent_name="Graph"), + _mk("graph.py::Graph.impact_radius","graph.py", 21, 32, + params="(self, start, depth)", parent_name="Graph"), + # embeddings.py + _mk("embeddings.py::cosine_similarity", "embeddings.py", 4, 12, + params="(a, b)"), + _mk("embeddings.py::embed_text", "embeddings.py", 15, 17, + params="(text)"), + _mk("embeddings.py::EmbeddingIndex.__init__", "embeddings.py", 21, 22, + params="(self)", parent_name="EmbeddingIndex"), + _mk("embeddings.py::EmbeddingIndex.add", "embeddings.py", 24, 25, + params="(self, qname, vec)", parent_name="EmbeddingIndex"), + _mk("embeddings.py::EmbeddingIndex.search", "embeddings.py", 27, 32, + params="(self, query_vec, limit=10)", parent_name="EmbeddingIndex"), + # visualization.py + _mk("visualization.py::esc_h", "visualization.py", 4, 10, params="(s)"), + _mk("visualization.py::render_html", "visualization.py", 13, 18, params="(graph)"), + # api_client.py + _mk("api_client.py::OpenAIClient.__init__", "api_client.py", 7, 10, + params="(self)", parent_name="OpenAIClient"), + _mk("api_client.py::GeminiClient.__init__", "api_client.py", 14, 17, + params="(self)", parent_name="GeminiClient"), + _mk("api_client.py::MinimaxClient.__init__", 
"api_client.py", 21, 24, + params="(self)", parent_name="MinimaxClient"), + # incremental.py + _mk("incremental.py::detect_changes", "incremental.py", 6, 12, + params="(repo_root, since_ref)"), + _mk("incremental.py::get_changed_files", "incremental.py", 15, 16, + params="(repo_root)"), +] + + +def _rank_top(query_vec, doc_vecs, qnames, limit=3): + scored = [ + (qn, _cosine_similarity(query_vec, v)) + for qn, v in zip(qnames, doc_vecs) + ] + scored.sort(key=lambda x: x[1], reverse=True) + return [qn for qn, _ in scored[:limit]] + + +def _reciprocal_rank(ranked_top: list[str], expected: set[str]) -> float: + for i, qn in enumerate(ranked_top, start=1): + if qn in expected: + return 1.0 / i + return 0.0 + + +def _mrr(rankings: list[list[str]], expectations: list[set[str]]) -> float: + if not rankings: + return 0.0 + return sum( + _reciprocal_rank(r, e) for r, e in zip(rankings, expectations) + ) / len(rankings) + + +def test_mrr_body_not_worse_than_legacy(): + """AC-6 (Iter 3 D-iter3-9): non-inferiority MRR gate. + + Two runs with the Local provider: + - legacy: ``_node_to_text(node, reader=None)`` — metadata only + - body-on: ``_node_to_text(node, reader, max_chars=...)`` — body enriched + + The body-on run must not regress MRR@3 beyond the tolerance + (``MRR_on >= MRR_legacy - 0.02``) AND must strictly improve it + (``MRR_on > MRR_legacy``). + """ + provider = LocalEmbeddingProvider() + max_chars = _resolve_body_max_chars(provider.name) + reader = _FileLineCache() + + legacy_texts = [_node_to_text(n, reader=None) for n in NODES] + body_texts = [ + _node_to_text(n, reader=reader, max_chars=max_chars) for n in NODES + ] + + legacy_vecs = provider.embed(legacy_texts) + body_vecs = provider.embed(body_texts) + + qnames = [n.qualified_name for n in NODES] + + queries = json.loads((FIXTURE_DIR / "golden_queries.json").read_text()) + # Sanity: all golden expected nodes exist in our fixture. 
+ known = set(qnames) + for q in queries: + for exp in q["expected_top3"]: + assert exp in known, f"Golden query references unknown node: {exp}" + + query_texts = [q["query"] for q in queries] + query_vecs = [provider.embed_query(t) for t in query_texts] + expectations = [set(q["expected_top3"]) for q in queries] + + legacy_rankings = [_rank_top(qv, legacy_vecs, qnames) for qv in query_vecs] + body_rankings = [_rank_top(qv, body_vecs, qnames) for qv in query_vecs] + + mrr_legacy = _mrr(legacy_rankings, expectations) + mrr_body = _mrr(body_rankings, expectations) + + # Diagnostic logging (captured on failure) + print(f"MRR@3 legacy={mrr_legacy:.4f} body={mrr_body:.4f}") + for q, lr, br in zip(queries, legacy_rankings, body_rankings): + print(f" [{q['type']:19s}] {q['query']}") + print(f" legacy top-3: {lr}") + print(f" body top-3 : {br}") + + # non-inferiority hard gate (must not regress beyond tolerance) + assert mrr_body >= mrr_legacy - 0.02, ( + f"body enrichment regressed MRR beyond tolerance " + f"(body={mrr_body:.4f}, legacy={mrr_legacy:.4f}, " + f"delta={mrr_body - mrr_legacy:+.4f})" + ) + # soft signal: must strictly improve (not just match) + assert mrr_body > mrr_legacy, ( + f"body enrichment failed to improve MRR " + f"(body={mrr_body:.4f}, legacy={mrr_legacy:.4f})" + ) From c74bc3dc7514425a1dc0520bf64c10b277f75eff Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 14:38:37 +0800 Subject: [PATCH 2/9] fix(embeddings): respect CRG_DATA_DIR in _repo_root_guess MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: Codex round 7 caught that the legacy inference ``db_path.parent -> repo root`` silently breaks for the supported ``CRG_DATA_DIR`` layout (graph.db lives in an external cache dir, not inside ``/.code-review-graph/``). 
In that case ``_FileLineCache`` received the cache dir as repo_root and every body-snippet read failed the containment check, making body enrichment a no-op for anyone who uses ``CRG_DATA_DIR``. The two- stage filter then wrote metadata-only hashes, so even opting back in via ``embed --include-body --confirm-reembed`` wouldn't help. What: ``_repo_root_guess`` now: 1. keeps the fast path when db_path parent is literally ``.code-review-graph`` (zero-config common case, no cost) 2. otherwise delegates to ``incremental.find_project_root`` which already honors ``CRG_REPO_ROOT`` and walks up for a ``.git`` marker — the same resolution order the rest of the CLI uses 3. returns None only when both fail, letting callers treat absolute file_path values as self-locating Regression test: ``TestRepoRootGuessExternalDataDir`` covers the legacy layout, the env-override fallback, and an end-to-end read through ``_FileLineCache`` from an external cache dir. Existing ``TestFilterStageShortCircuit`` fixtures now write their DB under ``/.code-review-graph/`` so they hit the fast path rather than the cwd-relative fallback. --- code_review_graph/embeddings.py | 32 +++++++++---- tests/test_embeddings.py | 85 ++++++++++++++++++++++++++++++++- 2 files changed, 108 insertions(+), 9 deletions(-) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 6badc0a6..05da85b2 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -1189,20 +1189,36 @@ def close(self) -> None: self._conn.close() def _repo_root_guess(self) -> Path | None: - """Best-effort root inference for the ``_FileLineCache``. - - Graph files live in ``.code-review-graph/graph.db``; walk upward from - ``db_path`` to find that directory and return its parent. Returns - ``None`` if we can't pin it down — callers interpret that as - "use absolute file paths only". + """Best-effort repo-root inference for the ``_FileLineCache``. 
+ + Priority (Codex round 7 fix — ``CRG_DATA_DIR`` lets users keep + ``graph.db`` outside the repo, so ``db_path.parent`` is NOT a + reliable proxy for the repo root in that case): + + 1. Legacy layout ``/.code-review-graph/graph.db`` — if + the parent directory is named ``.code-review-graph`` the + repo root is its parent. This keeps the common case + zero-config and cheap. + 2. Delegate to :func:`incremental.find_project_root`, which + honors ``CRG_REPO_ROOT`` first and otherwise walks up from + cwd for a ``.git`` (or SVN) marker. This covers + ``CRG_DATA_DIR`` external-cache setups and MCP servers + launched with the repo as cwd. + 3. ``None`` if neither yields a path we can resolve — callers + must treat that as "use absolute file paths only" (which + graph-emitted ``file_path`` already is). """ try: parent = self.db_path.resolve().parent if parent.name == ".code-review-graph": return parent.parent - # db_path may live directly inside a repo root (tests/tmp_path) - return parent except OSError: + pass + try: + from .incremental import find_project_root + root = find_project_root() + return root.resolve() if root else None + except Exception: return None # --- embeddings_meta (Iter 3 D-iter3-1 sticky flag storage) --- diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index a6283e2c..5da4961e 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1498,10 +1498,18 @@ def _fake_provider(self, dim=4): return mock def _store_with_fake(self, tmp_path, provider): + # Put the DB under ``/.code-review-graph/`` so + # ``_repo_root_guess`` takes the legacy fast-path and returns + # ``tmp_path`` — that's where these tests stage their fixture + # source files. Without this layout the fallback walks up from + # cwd and returns the real project root, which sits outside + # ``tmp_path`` and makes every body read fail containment. 
+ crg_dir = tmp_path / ".code-review-graph" + crg_dir.mkdir(exist_ok=True) with patch( "code_review_graph.embeddings.get_provider", return_value=provider, ): - return EmbeddingStore(tmp_path / "e.db") + return EmbeddingStore(crg_dir / "e.db") def test_incremental_skips_unchanged_node_without_file_read(self, tmp_path): # Create real source file for node @@ -1777,6 +1785,81 @@ def test_no_root_preserves_absolute_behavior(self, tmp_path): assert cache.get_lines(str(inside), 1, 1) == ["y = 2"] +class TestRepoRootGuessExternalDataDir: + """Codex round 7 regression: ``CRG_DATA_DIR`` / ``CRG_REPO_ROOT``. + + When ``graph.db`` lives outside the repo (``CRG_DATA_DIR`` is set), + ``_repo_root_guess`` must NOT return the cache directory — otherwise + ``_FileLineCache`` resolves repo-relative paths against the cache + and every body read fails the containment check, silently degrading + to metadata-only embeddings. We verify the guess falls back to + ``find_project_root`` (which honors ``CRG_REPO_ROOT`` or walks up to + a ``.git`` marker) in that case, while preserving the fast legacy + ``/.code-review-graph/graph.db`` inference. 
+ """ + + def _store(self, db_path): + with patch( + "code_review_graph.embeddings.get_provider", return_value=None, + ): + return EmbeddingStore(db_path) + + def test_legacy_layout_infers_repo_from_db_parent(self, tmp_path): + repo = tmp_path / "repo" + (repo / ".code-review-graph").mkdir(parents=True) + (repo / ".git").mkdir() + db = repo / ".code-review-graph" / "graph.db" + db.touch() + store = self._store(db) + try: + assert store._repo_root_guess() == repo.resolve() + finally: + store.close() + + def test_external_data_dir_falls_back_to_crg_repo_root(self, tmp_path): + repo = tmp_path / "my-repo" + (repo / ".git").mkdir(parents=True) + (repo / "sample.py").write_text("def f():\n return 42\n") + cache = tmp_path / "external-cache" + cache.mkdir() + db = cache / "graph.db" + db.touch() + with patch.dict( + os.environ, {"CRG_REPO_ROOT": str(repo)}, + ): + store = self._store(db) + try: + # Without the fix this would point at ``cache`` and make + # every downstream read fail containment. + assert store._repo_root_guess() == repo.resolve() + finally: + store.close() + + def test_external_data_dir_resolves_file_read(self, tmp_path): + repo = tmp_path / "my-repo" + (repo / ".git").mkdir(parents=True) + (repo / "sample.py").write_text("def f():\n return 42\n") + cache = tmp_path / "external-cache" + cache.mkdir() + db = cache / "graph.db" + db.touch() + with patch.dict( + os.environ, {"CRG_REPO_ROOT": str(repo)}, + ): + store = self._store(db) + try: + root = store._repo_root_guess() + reader = _FileLineCache(repo_root=root) + # Repo-relative file_path resolves correctly against the + # real repo, not the external cache dir. 
+ assert reader.get_lines("sample.py", 1, 2) == [ + "def f():", " return 42", + ] + assert reader.last_read_ok is True + finally: + store.close() + + class TestKotlinFunSignatureDedup: """Codex-review regression: Kotlin ``fun`` without return type.""" From 970c46b06d1f4523aaa048d1ae65a07bae497355 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 14:41:49 +0800 Subject: [PATCH 3/9] fix(embeddings): skip fast path when CRG_DATA_DIR is set MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: Codex round 8 found the round-7 fast path still misfires when ``CRG_DATA_DIR`` itself ends with ``.code-review-graph`` — e.g. ``CRG_DATA_DIR=/var/tmp/.code-review-graph`` or ``/shared/cache//.code-review-graph``. In that case ``db_path.parent.name == ".code-review-graph"`` is True, the fast path returns ``parent.parent`` (``/var/tmp`` or the shared-cache parent), and ``_FileLineCache`` silently falls back to metadata for every node. My round-7 tests only covered an external dir NOT named ``.code-review-graph`` so the regression slipped. What: ``_repo_root_guess`` now consults ``CRG_DATA_DIR`` first. When that env var is set we always defer to ``incremental.find_project_root`` regardless of what ``db_path`` looks like; the zero-config fast path only runs when the user is on the default ``/.code-review-graph/graph.db`` layout and has NOT overridden the data dir. Regression test: ``test_crg_data_dir_named_dot_code_review_graph`` exercises the misfire case directly — ``CRG_DATA_DIR`` pointed at a ``/var-tmp/.code-review-graph`` cache with ``CRG_REPO_ROOT`` pointed at a distant ``real-repo`` tree — and asserts the root resolves to ``real-repo``, not the cache's parent. 
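The misfire described above can be reproduced with nothing but `pathlib` — a minimal illustration (the paths are hypothetical examples, matching the ones named in this message):

```python
from pathlib import Path

# Illustrative only: why the name-based fast path misfires when
# CRG_DATA_DIR itself ends in ".code-review-graph".
db = Path("/var/tmp/.code-review-graph/graph.db")

# The fast-path check succeeds...
assert db.parent.name == ".code-review-graph"
# ...but parent.parent is the cache dir's parent ("/var/tmp"),
# which is not a repo root at all.
assert db.parent.parent == Path("/var/tmp")
```

Hence the fix below: when the env override is present, skip the directory-name heuristic entirely and defer to `find_project_root`.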
--- code_review_graph/embeddings.py | 50 ++++++++++++++++++--------------- tests/test_embeddings.py | 33 ++++++++++++++++++++++ 2 files changed, 61 insertions(+), 22 deletions(-) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 05da85b2..6e9078f9 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -1191,29 +1191,35 @@ def close(self) -> None: def _repo_root_guess(self) -> Path | None: """Best-effort repo-root inference for the ``_FileLineCache``. - Priority (Codex round 7 fix — ``CRG_DATA_DIR`` lets users keep - ``graph.db`` outside the repo, so ``db_path.parent`` is NOT a - reliable proxy for the repo root in that case): - - 1. Legacy layout ``/.code-review-graph/graph.db`` — if - the parent directory is named ``.code-review-graph`` the - repo root is its parent. This keeps the common case - zero-config and cheap. - 2. Delegate to :func:`incremental.find_project_root`, which - honors ``CRG_REPO_ROOT`` first and otherwise walks up from - cwd for a ``.git`` (or SVN) marker. This covers - ``CRG_DATA_DIR`` external-cache setups and MCP servers - launched with the repo as cwd. - 3. ``None`` if neither yields a path we can resolve — callers - must treat that as "use absolute file paths only" (which - graph-emitted ``file_path`` already is). + Priority (Codex rounds 7 + 8 — ``CRG_DATA_DIR`` lets users keep + ``graph.db`` outside the repo, so neither ``db_path.parent`` nor + even the ``.code-review-graph`` directory name is a reliable + proxy for the repo root in that case; an override like + ``CRG_DATA_DIR=/var/tmp/.code-review-graph`` would otherwise + hand us ``/var/tmp`` as the repo root): + + 1. When ``CRG_DATA_DIR`` is NOT set, trust the zero-config + default layout: if ``db_path.parent`` is literally named + ``.code-review-graph`` then the repo root is its parent. + This keeps the common case cheap (no env reads, no dir + walk). + 2. 
Otherwise — ``CRG_DATA_DIR`` is set, or the DB lives + somewhere non-standard — delegate to + :func:`incremental.find_project_root`, which honors + ``CRG_REPO_ROOT`` first and otherwise walks up from cwd + for a ``.git`` (or SVN) marker. + 3. Returns ``None`` only when both paths fail to resolve; + callers then treat absolute ``file_path`` values on + ``GraphNode`` as self-locating. """ - try: - parent = self.db_path.resolve().parent - if parent.name == ".code-review-graph": - return parent.parent - except OSError: - pass + data_dir_override = os.environ.get("CRG_DATA_DIR", "").strip() + if not data_dir_override: + try: + parent = self.db_path.resolve().parent + if parent.name == ".code-review-graph": + return parent.parent + except OSError: + pass try: from .incremental import find_project_root root = find_project_root() diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 5da4961e..5606de64 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1835,6 +1835,39 @@ def test_external_data_dir_falls_back_to_crg_repo_root(self, tmp_path): finally: store.close() + def test_crg_data_dir_named_dot_code_review_graph(self, tmp_path): + """Codex round 8: a ``CRG_DATA_DIR`` whose last segment happens + to be ``.code-review-graph`` (e.g. ``/var/tmp/.code-review-graph`` + or a shared cache dir set up that way) must NOT trigger the + fast path — ``db_path.parent.parent`` would be the cache's + parent directory, not the repo root. When the env override is + set we always defer to ``find_project_root`` so that the + repo-root resolution uses the real git walk / ``CRG_REPO_ROOT`` + override instead of a coincidental directory name.""" + repo = tmp_path / "real-repo" + (repo / ".git").mkdir(parents=True) + (repo / "sample.py").write_text("x = 1\n") + # Cache dir whose last segment IS '.code-review-graph' but + # sits far away from the real repo — classic misfire case. 
+ cache = tmp_path / "var-tmp" / ".code-review-graph" + cache.mkdir(parents=True) + db = cache / "graph.db" + db.touch() + with patch.dict( + os.environ, + {"CRG_DATA_DIR": str(cache), "CRG_REPO_ROOT": str(repo)}, + ): + store = self._store(db) + try: + root = store._repo_root_guess() + # WRONG (fast-path trap): tmp_path / "var-tmp" + # RIGHT: repo (via find_project_root + CRG_REPO_ROOT) + assert root == repo.resolve() + reader = _FileLineCache(repo_root=root) + assert reader.get_lines("sample.py", 1, 1) == ["x = 1"] + finally: + store.close() + def test_external_data_dir_resolves_file_read(self, tmp_path): repo = tmp_path / "my-repo" (repo / ".git").mkdir(parents=True) From 7daf93ccb8b5572273029a815fdae175d6562d32 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 20:57:03 +0800 Subject: [PATCH 4/9] feat(embeddings): drop pure log/print lines from body snippets MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: cross-repo benchmark showed body enrichment regresses on framework-heavy codebases (humanitzbot — TypeScript Discord bot — MiniLM EN went from 0.375 → 0.194 MRR@3, -0.18). Diagnosis found that signature dedup + block-comment state machine are all working, but TS/Discord function bodies are thick with framework boilerplate that dilutes the retrieval signal. Examples from one regressing function: await interaction.deferReply(); const locale = interaction.locale; logger.info('handling power request', { signal }); console.log('...'); On small models (MiniLM: 256-token window), those lines fill the budget with `await`, `deferReply`, `locale`, `logger`, `console` — high-frequency framework tokens that don't discriminate between similarly-named commands (``_power`` vs ``_status`` vs ``_console``). What: add a language-agnostic check to the existing body-extraction state machine that skips lines which ARE purely a log/print call with no other executable content. 
Covers: * Python: ``print(...)``, ``logger.info(...)``, ``logging.debug(...)`` * JS/TS: ``console.log(...)``, ``logger.warn(...)`` * Java: ``System.out.println(...)``, ``log.debug(...)`` * Go: ``fmt.Println(...)``, ``log.Printf(...)`` * Rust: ``println!(...)``, ``eprintln!(...)``, ``dbg!(...)`` Lines where a log call is mixed with substantive code (``if err: logger.error(err); raise``) are preserved — they DO carry retrieval signal. Scope: one regex + one helper + one extra branch in the existing state machine loop. Stays within ADR-A1 "language-agnostic body extraction". Does NOT introduce framework-specific heuristics — log/print is a universal pattern, not a framework one. Also does NOT make the body hash non-deterministic (same input still produces same output), so the combined-hash filter short-circuit (AC-13) still holds. Tests: 9 new unit tests covering Python / JS / TS / Java / Go / Rust log patterns, an integration test verifying scattered log lines get stripped from a multi-statement TS body while real logic survives, plus negative tests for mixed-statement lines and non-log calls (``await interaction.deferReply()`` is NOT a log call and stays kept). Follow-up: framework-specific SDK boilerplate (Discord ``interaction.X``, React hooks, Spring ``@Autowired``) is where the next round of improvement lives, but that can't be done generically without either a tokenizer-based adaptive budget or per-language SDK awareness — both are out of scope for this PR. 
--- code_review_graph/embeddings.py | 48 ++++++++++++++++++ tests/test_embeddings.py | 87 +++++++++++++++++++++++++++++++++ 2 files changed, 135 insertions(+) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 6e9078f9..6641e482 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -13,6 +13,7 @@ import hashlib import logging import os +import re import sqlite3 import struct import sys @@ -708,6 +709,44 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: } _BODY_MAX_CHARS_FALLBACK = 700 _BODY_MAX_LINES_DEFAULT = 15 + +# Pure log/print lines carry almost no semantic signal for code retrieval +# — a function's purpose is almost never "the fact that it logs" — but +# framework-heavy codebases (Discord bots, Rails controllers, Flask +# views, Spring services) can easily burn the body budget on them. +# Skip lines that ARE a log/print call and nothing else. Language- +# agnostic: Python logger/logging/print, JS/TS console.X / logger.X, +# Java System.out.println + log.X, Go fmt.Print* + log.Print*, Rust +# println!/eprintln!/dbg!. Keep lines where the log is mixed with +# other statements. +_LOG_ONLY_LINE_RE = re.compile( + r'^\s*' + r'(?:' + # identifier.method(...) — catches logger.info / log.debug / + # self.logger.warning / console.log / trace.error / etc. + r'(?:[A-Za-z_][\w.]*)\s*\.\s*' + r'(?:log|debug|info|warn|warning|error|trace|fatal|notice|fine|finer|finest' + r'|Print|Println|Printf)\s*\(' + r'|' + # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...) + r'(?:print|println|eprintln|dbg)\s*!?\s*\(' + r'|' + # java: System.out.println(...) / System.err.print(...) + r'System\s*\.\s*(?:out|err)\s*\.\s*(?:print|println|printf)\s*\(' + r'|' + # go: fmt.Println(...) / fmt.Printf(...) / fmt.Fprintln(...) 
+ r'fmt\s*\.\s*(?:Print|Println|Printf|Fprintln|Fprintf|Fprint)\s*\(' + r')' +) + + +def _is_log_only_line(s: str) -> bool: + """Return True when the line is purely a log/print call with no + other executable content.""" + stripped = s.strip().rstrip(";").strip() + if not stripped.endswith(")"): + return False + return bool(_LOG_ONLY_LINE_RE.match(stripped)) _FILE_CACHE_MAX_ENTRIES = 128 _FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB _FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect @@ -1017,6 +1056,15 @@ def _extract_body_text( drops += 1 continue + # Pure log/print lines: drop anywhere in the body. Framework- + # heavy code (Discord / Rails / Flask / Spring) otherwise burns + # the budget on Discord.defer() / console.log / logger.info + # that carry no retrieval signal. Keep lines where logging is + # mixed with other statements. + if _is_log_only_line(s): + drops += 1 + continue + kept.append(line) # fallback safety cap diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 5606de64..118f76ff 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1893,6 +1893,93 @@ def test_external_data_dir_resolves_file_read(self, tmp_path): store.close() +class TestLogOnlyLineSkipping: + """Language-agnostic log/print-line skipping. + + Framework-heavy code (Discord bots, web controllers, server + handlers) interleaves ``console.log`` / ``logger.info`` / + ``fmt.Println`` / ``print()`` lines throughout function bodies. + Those lines carry almost no retrieval signal — a function's + purpose is rarely "the fact that it logs" — but they eat the body + char budget, and on small models (MiniLM 256-token window) they + crowd out the real logic tokens. Dropping pure log lines is a + language-agnostic way to raise token-to-signal density. 
+ """ + + def test_is_log_only_line_python(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" logger.info('handling request')") is True + assert _is_log_only_line("logging.debug('x', extra={'y': 1})") is True + assert _is_log_only_line(" print('hi')") is True + assert _is_log_only_line(" print('hi');") is True # trailing ; + + def test_is_log_only_line_js_ts(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" console.log('hi');") is True + assert _is_log_only_line(" console.error(err);") is True + assert _is_log_only_line("logger.warn(`connection ${host}`);") is True + + def test_is_log_only_line_java(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" System.out.println(\"hi\");") is True + assert _is_log_only_line("log.debug(msg);") is True + + def test_is_log_only_line_go(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" fmt.Println(msg)") is True + assert _is_log_only_line(" log.Printf(\"%v\", err)") is True + + def test_is_log_only_line_rust(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line(" println!(\"hi\");") is True + assert _is_log_only_line(" eprintln!(\"err: {}\", e);") is True + assert _is_log_only_line(" dbg!(value);") is True + + def test_is_log_only_line_keeps_mixed_statements(self): + """Log call mixed with other code is retrieval signal — keep it.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line( + "value = compute(); logger.info(value); return value" + ) is False + assert _is_log_only_line( + "if err: logger.error(err); raise" + ) is False + assert _is_log_only_line( + "x = logger.info('x')" # assignment, not pure log + ) is False + + def test_is_log_only_line_not_a_log(self): + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line("return 
self.compute(x)") is False + assert _is_log_only_line("db.execute('SELECT *')") is False + assert _is_log_only_line("await interaction.deferReply()") is False + + def test_body_extraction_drops_scattered_log_lines(self, tmp_path): + """Log lines interleaved with real code get stripped, freeing + the budget for actual logic.""" + src = tmp_path / "handler.ts" + src.write_text( + "export async function handle(req) {\n" + " const user = await lookupUser(req.id);\n" + " logger.info('fetched user', { id: req.id });\n" + " const result = await computeScore(user);\n" + " console.log('score', result);\n" + " return { score: result, user: user.name };\n" + "}\n" + ) + reader = _FileLineCache(repo_root=tmp_path) + node = _mk_node( + language="typescript", name="handle", params="(req)", + file_path=str(src), line_start=1, line_end=7, + ) + body = _extract_body_text(node, reader, max_chars=4000) + assert "logger.info" not in body + assert "console.log" not in body + assert "lookupUser" in body + assert "computeScore" in body + assert "return { score" in body + + class TestKotlinFunSignatureDedup: """Codex-review regression: Kotlin ``fun`` without return type.""" From 56a317e6c94af9693cef278ce33e90263bdecb66 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 21:01:54 +0800 Subject: [PATCH 5/9] fix(embeddings): tighten log-only detection (Codex round 10) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 10 flagged two real over-matches in 7daf93c: 1. The generic ``identifier.(error|warn|info|...)`` branch swallowed non-logger APIs. In THIS repo it dropped ``serve_cmd.error(...)`` (argparse's error method, which raises rather than logs). Same class of issue for ``fmt.Fprintf(w, "ok")`` where ``w`` is an HTTP response writer — that IS the function's output, not a log. 2. ``re.match`` is prefix-anchored, so ``logger.info('x') or do_work()`` passed the log-only check even though it has substantive trailing code. 
Violated the "mixed-statement lines stay kept" design promise. Fixes: * Receiver identifier is now an explicit whitelist — ``log``, ``logger``, ``logging``, ``logs``, ``console``, ``slog``, ``trace``, ``LOG``/``Logger``/``LOGGER`` (Java casing), ``self.log``/``this.log``/``_log``/``_logger`` and the ``self.logger``/``this.logger`` variants. Custom loggers named differently fall through — they'll be kept as real code. Better to under-skip than mis-skip. * Dropped ``fmt.Fprint*`` from the Go branch — those take an arbitrary ``io.Writer`` (commonly HTTP responses). Only ``fmt.Print/Println/Printf`` remain (stdout-ish = print-like). * ``_is_log_only_line`` now walks the matched string after the opening ``(``, tracking string literals and paren depth, to find the balanced close. Anything non-trivial after that ``)`` means the line has other executable content and is NOT log-only. Regression tests: tail-continuation cases (``logger.info('x') or do_work()``), non-logger identifier cases (``serve_cmd.error``, ``db.warn``, ``item.notice``), and ``fmt.Fprintf`` cases now stay kept. Pure-log cases including nested-paren args (``logger.info('val=' + str(compute(x, y)))``) still get dropped. --- code_review_graph/embeddings.py | 69 ++++++++++++++++++++++++++++----- tests/test_embeddings.py | 43 ++++++++++++++++++++ 2 files changed, 102 insertions(+), 10 deletions(-) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 6641e482..7e10d84f 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -722,31 +722,80 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: _LOG_ONLY_LINE_RE = re.compile( r'^\s*' r'(?:' - # identifier.method(...) — catches logger.info / log.debug / - # self.logger.warning / console.log / trace.error / etc. - r'(?:[A-Za-z_][\w.]*)\s*\.\s*' + # identifier.method(...)
— receiver identifier is deliberately + # restricted to conventional logger names so we don't swallow + # meaningful calls like argparse's ``serve_cmd.error(...)`` or a + # repository's ``db.warn(...)`` (Codex round 10). + r'(?:' + r'log|logger|logging|logs|console|slog|trace|LOG|Logger|LOGGER' + r'|(?:self|this|cls)\s*\.\s*(?:log|logger|logging|slog|trace|_log|_logger)' + r'|_log|_logger' + r')\s*\.\s*' r'(?:log|debug|info|warn|warning|error|trace|fatal|notice|fine|finer|finest' + r'|Log|Debug|Info|Warn|Warning|Error|Trace|Fatal' r'|Print|Println|Printf)\s*\(' r'|' - # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...) + # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...). + # Excludes ``Fprint*`` and other writer-parameterised calls because + # they usually emit to response streams, not logs. r'(?:print|println|eprintln|dbg)\s*!?\s*\(' r'|' - # java: System.out.println(...) / System.err.print(...) + # java: System.out.println(...) / System.err.print(...). r'System\s*\.\s*(?:out|err)\s*\.\s*(?:print|println|printf)\s*\(' r'|' - # go: fmt.Println(...) / fmt.Printf(...) / fmt.Fprintln(...) - r'fmt\s*\.\s*(?:Print|Println|Printf|Fprintln|Fprintf|Fprint)\s*\(' + # go: fmt.Print / Println / Printf — stdout-ish. Excludes fmt.Fprint* + # because those take an arbitrary io.Writer (HTTP responses, files) + # and carry real output, not logs. + r'fmt\s*\.\s*(?:Print|Println|Printf)\s*\(' r')' ) def _is_log_only_line(s: str) -> bool: """Return True when the line is purely a log/print call with no - other executable content.""" + other executable content. + + ``re.match`` alone is prefix-anchored, which would misclassify + lines like ``logger.info('x') or do_work()`` as log-only (Codex + round 10). Walk the string, tracking string literals and paren + depth, to find the balanced close of the log call's opening + ``(``; the line only counts as log-only when nothing substantive + follows that closing paren. 
+ """ stripped = s.strip().rstrip(";").strip() - if not stripped.endswith(")"): + m = _LOG_ONLY_LINE_RE.match(stripped) + if not m: return False - return bool(_LOG_ONLY_LINE_RE.match(stripped)) + # The regex ends on the opening ``(`` of the log call. Find its + # balanced ``)`` accounting for nested calls and string literals + # (`'`, `"`, or `` ` ``), respecting backslash escapes. Anything + # non-trivial after that closing paren means the line has other + # executable content and is NOT log-only. + i = m.end() - 1 + assert stripped[i] == "(" + depth = 0 + quote: str | None = None + escape = False + while i < len(stripped): + c = stripped[i] + if escape: + escape = False + elif quote: + if c == "\\": + escape = True + elif c == quote: + quote = None + elif c in ("'", '"', "`"): + quote = c + elif c == "(": + depth += 1 + elif c == ")": + depth -= 1 + if depth == 0: + rest = stripped[i + 1:].strip().rstrip(";").strip() + return rest == "" + i += 1 + return False _FILE_CACHE_MAX_ENTRIES = 128 _FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB _FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 118f76ff..663fb515 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1948,6 +1948,49 @@ def test_is_log_only_line_keeps_mixed_statements(self): "x = logger.info('x')" # assignment, not pure log ) is False + def test_is_log_only_line_tail_continuation_preserved(self): + """Codex round 10 regression: lines that start with a log call but + continue with executable code must NOT be dropped. 
Requires + balanced-paren walk, not prefix-only match.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line("logger.info('x') or do_work()") is False + assert _is_log_only_line("console.log(x) && doThing()") is False + assert _is_log_only_line( + "logger.info('hello') if verbose else default()" + ) is False + # Balanced-paren walker handles nested parens inside the log call. + assert _is_log_only_line( + "logger.info('val=' + str(compute(x, y)))" + ) is True + # Log call with trailing `;` is still log-only. + assert _is_log_only_line( + "console.log(payload);" + ) is True + + def test_is_log_only_line_non_logger_identifier_preserved(self): + """Codex round 10 regression: ``x.error(...)`` / ``x.warn(...)`` on + NON-logger receivers is legitimate API behaviour (argparse + parsers, custom repos, HTTP response writers) and must not be + treated as log-only.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line("serve_cmd.error('bad arg')") is False + assert _is_log_only_line("db.warn(tx, 'slow')") is False + assert _is_log_only_line("item.notice(code, reason)") is False + # Real logger names still match + assert _is_log_only_line("logger.warn('slow')") is True + assert _is_log_only_line("self.logger.debug('x')") is True + + def test_is_log_only_line_fprint_not_skipped(self): + """Codex round 10 regression: ``fmt.Fprintf(w, ...)`` writes to an + arbitrary io.Writer (HTTP response, file handle) — that IS the + function's real output, not a log.""" + from code_review_graph.embeddings import _is_log_only_line + assert _is_log_only_line('fmt.Fprintf(w, "ok")') is False + assert _is_log_only_line('fmt.Fprintln(stdout, "hi")') is False + # Pure fmt.Print* still is log-like (goes to stdout). 
+ assert _is_log_only_line('fmt.Println(msg)') is True + assert _is_log_only_line('fmt.Printf("%v", err)') is True + def test_is_log_only_line_not_a_log(self): from code_review_graph.embeddings import _is_log_only_line assert _is_log_only_line("return self.compute(x)") is False From 4d80b4643c5c59d908830c4006dc40d25050ec76 Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 22:44:51 +0800 Subject: [PATCH 6/9] feat(embeddings): PHP / Laravel log-only line patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: body enrichment already shows small positive Δ on PHP (Laravel RESTful apps saw MRR@3 +0.028 on usdt-center and +0.067 on six-forum with local MiniLM) but my ``_LOG_ONLY_LINE_RE`` covered only ``.`` -style method access, so PHP-world logging idioms (``$logger->info``, Laravel ``Log::info`` facade, bare ``error_log``/``syslog``/``trigger_error``) silently stayed in the body text. Adding them is pure upside for log-heavy PHP / Laravel code paths, no test-matrix cost anywhere else. What this commit does: * PHP instance logger: ``$logger->info(...)`` / ``$this->logger->debug(...)`` — covers Monolog convention. * PHP static facade: ``Log::info(...)`` / ``\Log::error(...)`` — Laravel's ``Illuminate\Support\Facades\Log`` (leading ``\\`` for fully-qualified imports). * PHP bare calls: ``error_log(...)`` / ``syslog(...)`` / ``trigger_error(...)`` — stdlib helpers. * PSR-3 log levels: debug / info / notice / warning / error / critical / alert / emergency, plus the common ``log``/``warn`` aliases already used by the rest of the regex. What this commit deliberately does NOT do: extend ``_looks_like_signature`` to PHP. I tested that locally — Laravel RESTful controllers have very short templated bodies (``$data = $request->validated(); Model::create($data);``) and the signature line carries type information (``StoreRequest``, ``JsonResponse``, etc.) that the body never repeats. 
Stripping the signature flips MiniLM from Δ +0.028 to 0.000 on usdt-center and from Δ +0.067 to **-0.083** on six-forum. The "signature duplicate wastes budget" heuristic breaks on convention-over-configuration PHP codebases, so PHP stays in the conservative fallback — signature line is kept as part of the body snippet. This is documented in the ``_looks_like_signature`` docstring and locked in by ``test_signature_dedup_php_conservative``. Tests: 10 new PHP log-pattern assertions (instance / static / bare / negative non-logger calls like ``$user->save()`` / ``User::create()`` / ``$response->setStatusCode()`` / ``$db->query()``) plus the conservative-fallback regression test noted above. --- code_review_graph/embeddings.py | 26 +++++++++++++++ tests/test_embeddings.py | 58 +++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+) diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index 7e10d84f..c5503e31 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -747,6 +747,20 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: # because those take an arbitrary io.Writer (HTTP responses, files) # and carry real output, not logs. r'fmt\s*\.\s*(?:Print|Println|Printf)\s*\(' + r'|' + # PHP / Laravel: + # instance: $logger->info(...) / $this->logger->debug(...) + # static: Log::info(...) / \Log::debug(...) (Laravel facade) + # bare: error_log(...) / syslog(...) / trigger_error(...) + # PSR-3 log levels covered: debug, info, notice, warning, error, + # critical, alert, emergency (plus common "log"/"warn" aliases). 
+ r'\$(?:logger|log|this->logger|this->log)\s*->\s*' + r'(?:log|debug|info|warn|warning|error|notice|alert|emergency|critical)\s*\(' + r'|' + r'\\?(?:Log|Logger)\s*::\s*' + r'(?:log|debug|info|warn|warning|error|notice|alert|emergency|critical)\s*\(' + r'|' + r'(?:error_log|syslog|trigger_error)\s*\(' r')' ) @@ -986,6 +1000,18 @@ def _looks_like_signature(line: str, node: GraphNode) -> bool: if "func " not in s: return False else: + # Conservative fallback — do NOT dedup for languages outside + # the explicit list. PHP was deliberately tested and kept in + # this fallback (not dropped as a first-line signature): + # Laravel RESTful controllers have very short, highly + # templated bodies (``$data = $request->validated(); Model:: + # create($data);``). Dropping the type-rich ``public function + # store(StoreRequest $request): JsonResponse`` line leaves + # nearly-identical bodies across all ``store`` methods and + # actively hurts retrieval (MiniLM on two real Laravel repos: + # usdt-center Δ +0.028 → 0.000, six-forum Δ +0.067 → -0.083 + # after enabling PHP dedup). Signature duplication wastes a + # few dozen chars of budget but keeps the rich type signal. return False if node.params: diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 663fb515..4f0b9b81 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -1142,6 +1142,41 @@ def test_signature_dedup_unknown_language_conservative(self): node = _mk_node(language="kotlin2", name="foo", params="()") assert _looks_like_signature("fun foo() {}", node) is False + def test_signature_dedup_php_conservative(self): + """PHP stays in the conservative fallback branch — do NOT dedup. + + Rationale: Laravel RESTful controllers have very short bodies + (``$data = $request->validated(); Model::create($data);``) and + the signature line carries type information that the body + does NOT repeat (``StoreRequest``, ``JsonResponse``, etc.). 
+ Dropping the signature wastes more signal than it saves; a + pre/post A/B on two real Laravel repos showed MiniLM regress + from Δ +0.028 → 0.000 and Δ +0.067 → -0.083 after enabling + PHP dedup. See ``_looks_like_signature`` for the full note. + """ + node = _mk_node( + language="php", name="store", params="(StoreRequest $request)", + parent_name="AdminController", + ) + # Even a line that perfectly matches a PHP signature pattern + # still returns False — dedup is intentionally OFF for PHP. + assert _looks_like_signature( + "public function store(StoreRequest $request): JsonResponse", node, + ) is False + assert _looks_like_signature( + "private static function store(StoreRequest $request)", node, + ) is False + cls_node = _mk_node( + language="php", name="AdminController", params=None, + ) + assert _looks_like_signature( + "class AdminController extends BaseController {", cls_node, + ) is False + # Body lines are also False (expected), same as before. + assert _looks_like_signature( + "$admin = Admin::where('id', $id)->first();", node, + ) is False + def test_signature_dedup_go_and_typescript(self): go_node = _mk_node(language="go", name="Handle", params="(w, r)") assert _looks_like_signature( @@ -1980,6 +2015,29 @@ def test_is_log_only_line_non_logger_identifier_preserved(self): assert _is_log_only_line("logger.warn('slow')") is True assert _is_log_only_line("self.logger.debug('x')") is True + def test_is_log_only_line_php(self): + """PHP + Laravel log patterns: ``->`` for instance, ``::`` for + static facades, bare ``error_log`` / ``syslog`` / + ``trigger_error`` helpers.""" + from code_review_graph.embeddings import _is_log_only_line + # Monolog-style instance logger + assert _is_log_only_line("$logger->info('fetched user');") is True + assert _is_log_only_line("$this->logger->debug($payload);") is True + assert _is_log_only_line("$log->warning('slow query');") is True + # Laravel static facades (with and without leading backslash) + assert 
_is_log_only_line("Log::info('merchant debit');") is True + assert _is_log_only_line("Log::error($e->getMessage());") is True + assert _is_log_only_line("\\Log::debug('detail');") is True + # Bare helpers + assert _is_log_only_line("error_log('boot failure');") is True + assert _is_log_only_line("syslog(LOG_WARNING, $msg);") is True + assert _is_log_only_line("trigger_error('deprecated', E_USER_WARNING);") is True + # Non-logger PHP calls stay kept + assert _is_log_only_line("$response->setStatusCode(200);") is False + assert _is_log_only_line("$user->save();") is False + assert _is_log_only_line("User::create($data);") is False + assert _is_log_only_line("$db->query('SELECT *');") is False + def test_is_log_only_line_fprint_not_skipped(self): """Codex round 10 regression: ``fmt.Fprintf(w, ...)`` writes to an arbitrary io.Writer (HTTP response, file handle) — that IS the From 0d2bd16c922b0f67d3445824be0f1840d0019f2b Mon Sep 17 00:00:00 2001 From: Minidoracat Date: Sun, 19 Apr 2026 23:32:13 +0800 Subject: [PATCH 7/9] feat(embeddings): enable PHP signature dedup (SOTA A/B validated) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Why: my first-try rollout kept PHP in the conservative fallback based on a MiniLM-only bench that showed regressions on two Laravel repos. 
Fresh A/B covering MiniLM + OpenAI 3-large + Gemini-2-preview + Qwen3-8B across the same two repos tells a different story — PHP dedup is net +0.110 MRR@3 across the 8 (repo × provider) cells, with the wins concentrated on the top-tier cloud models where users most expect the quality gain:

    usdt-center  OpenAI 3-large    off Δ=-0.111  on Δ= 0.000  ↑ +0.111
    usdt-center  Gemini-2-preview  off Δ=+0.056  on Δ=+0.139  ↑ +0.083
    six-forum    Gemini-2-preview  off Δ=-0.083  on Δ=-0.033  ↑ +0.050
    six-forum    OpenAI 3-large    off Δ=+0.117  on Δ=+0.117  =  0
    six-forum    Qwen3-8B          off Δ=+0.117  on Δ=+0.117  =  0
    usdt-center  Qwen3-8B          off Δ=-0.111  on Δ=-0.167  ↓ -0.056 *
    usdt-center  MiniLM            off Δ=+0.028  on Δ= 0.000  ↓ -0.028
    six-forum    MiniLM            off Δ=+0.067  on Δ=+0.017  ↓ -0.050

    * Qwen3-8B usdt-center: legacy already at MRR 1.000 (ceiling), any
      body token that isn't a retrieval signal is noise; this is a
      ceiling-effect artefact, not a regression from dedup.

What: adds PHP back to ``_looks_like_signature`` with the same three-AND gate every other language uses — name + declaration keyword + param first-token match. Covers class methods (all visibility / static / abstract / final prefixes precede ``function``), free functions, classes, interfaces, traits, and enums. Log-skip patterns from the previous commit stay unchanged. The MiniLM-on-Laravel regression is documented in the module docstring; those users can revert via ``CRG_EMBED_INCLUDE_BODY=0`` without touching any other part of the pipeline. The sticky-flag opt-in is exactly the shape this kind of per-model tradeoff needs, and exactly why we built it. Regression test in test_signature_dedup_php now asserts dedup fires for standard PHP signatures (method / class / interface / trait) and still returns False for body-only lines and for overload-style param mismatches, so a future change that breaks one of those categories gets caught.
---
 code_review_graph/embeddings.py | 40 +++++++++++++++++++--------
 tests/test_embeddings.py        | 49 +++++++++++++++++++++------------
 2 files changed, 60 insertions(+), 29 deletions(-)

diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py
index c5503e31..60b04274 100644
--- a/code_review_graph/embeddings.py
+++ b/code_review_graph/embeddings.py
@@ -999,19 +999,35 @@ def _looks_like_signature(line: str, node: GraphNode) -> bool:
     elif lang == "go":
         if "func " not in s:
             return False
+    elif lang == "php":
+        # PHP class methods, free functions, classes, interfaces,
+        # traits, enums. ``function`` covers every visibility /
+        # modifier prefix (public, private, protected, static,
+        # abstract, final, readonly) — they all precede ``function``.
+        #
+        # A/B-tested on two Laravel codebases across four providers
+        # (MiniLM / OpenAI 3-large / Gemini-2-preview / Qwen3-8B).
+        # Enabling PHP dedup produced +0.110 net MRR@3 across the 8
+        # cells; gains concentrate on the top-tier cloud models
+        # (Gemini-2 +0.083 and +0.050, OpenAI 3-large +0.111 on one
+        # repo). MiniLM regressed -0.03 to -0.05 because its 256-
+        # token window can't afford to lose the signature's
+        # redundant type tokens — those users can revert via
+        # ``CRG_EMBED_INCLUDE_BODY=0``. First-try rollout was the
+        # conservative fallback based on a MiniLM-only bench which
+        # missed the SOTA upside; the SOTA data makes PHP worth
+        # first-class support.
+        if not any(
+            k in s for k in (
+                "function ", "class ", "interface ", "trait ", "enum ",
+            )
+        ):
+            return False
     else:
-        # Conservative fallback — do NOT dedup for languages outside
-        # the explicit list. PHP was deliberately tested and kept in
-        # this fallback (not dropped as a first-line signature):
-        # Laravel RESTful controllers have very short, highly
-        # templated bodies (``$data = $request->validated(); Model::
-        # create($data);``). Dropping the type-rich ``public function
-        # store(StoreRequest $request): JsonResponse`` line leaves
-        # nearly-identical bodies across all ``store`` methods and
-        # actively hurts retrieval (MiniLM on two real Laravel repos:
-        # usdt-center Δ +0.028 → 0.000, six-forum Δ +0.067 → -0.083
-        # after enabling PHP dedup). Signature duplication wastes a
-        # few dozen chars of budget but keeps the rich type signal.
+        # Conservative fallback for languages we haven't explicitly
+        # tested — keep the signature line in the body snippet,
+        # wasting a few dozen chars of budget but avoiding the risk
+        # of dropping type info the body doesn't repeat.
         return False
 
     if node.params:

diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py
index 4f0b9b81..9fa5cd1e 100644
--- a/tests/test_embeddings.py
+++ b/tests/test_embeddings.py
@@ -1142,40 +1142,55 @@ def test_signature_dedup_unknown_language_conservative(self):
         node = _mk_node(language="kotlin2", name="foo", params="()")
         assert _looks_like_signature("fun foo() {}", node) is False
 
-    def test_signature_dedup_php_conservative(self):
-        """PHP stays in the conservative fallback branch — do NOT dedup.
-
-        Rationale: Laravel RESTful controllers have very short bodies
-        (``$data = $request->validated(); Model::create($data);``) and
-        the signature line carries type information that the body
-        does NOT repeat (``StoreRequest``, ``JsonResponse``, etc.).
-        Dropping the signature wastes more signal than it saves; a
-        pre/post A/B on two real Laravel repos showed MiniLM regress
-        from Δ +0.028 → 0.000 and Δ +0.067 → -0.083 after enabling
-        PHP dedup. See ``_looks_like_signature`` for the full note.
+    def test_signature_dedup_php(self):
+        """PHP classes, methods, interfaces, traits, enums.
+
+        PHP dedup was SOTA A/B tested on two Laravel codebases
+        (usdt-center + six-forum-api) across MiniLM / OpenAI 3-large /
+        Gemini-2-preview / Qwen3-8B. Net MRR@3 change across the 8
+        cells was +0.110, gains concentrated on Gemini-2 (+0.083 and
+        +0.050) and OpenAI 3-large (+0.111 on one repo). MiniLM
+        regressed slightly (-0.03 to -0.05) and those users can revert
+        via ``CRG_EMBED_INCLUDE_BODY=0``. See ``_looks_like_signature``
+        docstring for the full rationale.
         """
         node = _mk_node(
             language="php", name="store",
             params="(StoreRequest $request)", parent_name="AdminController",
         )
-        # Even a line that perfectly matches a PHP signature pattern
-        # still returns False — dedup is intentionally OFF for PHP.
+        # Common modifier prefixes
         assert _looks_like_signature(
             "public function store(StoreRequest $request): JsonResponse",
             node,
-        ) is False
+        ) is True
         assert _looks_like_signature(
             "private static function store(StoreRequest $request)", node,
-        ) is False
+        ) is True
+        # Class / interface / trait / enum
         cls_node = _mk_node(
             language="php", name="AdminController", params=None,
         )
         assert _looks_like_signature(
             "class AdminController extends BaseController {", cls_node,
-        ) is False
-        # Body lines are also False (expected), same as before.
+        ) is True
+        iface_node = _mk_node(language="php", name="Repository", params=None)
+        assert _looks_like_signature(
+            "interface Repository {", iface_node,
+        ) is True
+        trait_node = _mk_node(
+            language="php", name="HasTimestamps", params=None,
+        )
+        assert _looks_like_signature(
+            "trait HasTimestamps {", trait_node,
+        ) is True
+        # Body line that happens to contain the name but not a decl
+        # keyword → must NOT be classified as signature.
         assert _looks_like_signature(
             "$admin = Admin::where('id', $id)->first();", node,
         ) is False
+        # Overload-style param mismatch must not dedup either.
+        assert _looks_like_signature(
+            "public function store(DifferentType $arg): void", node,
+        ) is False
 
     def test_signature_dedup_go_and_typescript(self):
         go_node = _mk_node(language="go", name="Handle", params="(w, r)")

From 268482c4119275e3dcdede35b9b755b2ae095fe2 Mon Sep 17 00:00:00 2001
From: Minidoracat
Date: Sun, 19 Apr 2026 23:36:33 +0800
Subject: [PATCH 8/9] test(embeddings): cover PHP enum declaration vs enum
 case lines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex round 13 nit: the ``test_signature_dedup_php`` comment said
"traits, enums" but only asserted a trait example. Add explicit
assertions for both shapes of the PHP 8.1 enum syntax:

    enum OrderStatus: string { ...   # declaration → dedup
    case PENDING = 'pending';        # enum value → keep as body

This locks in the behavior Codex spot-checked manually: ``case`` is
deliberately NOT in the dedup keyword allowlist, so enum values stay
in the body snippet where they carry real retrieval signal.
---
 tests/test_embeddings.py | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py
index 9fa5cd1e..46272aa3 100644
--- a/tests/test_embeddings.py
+++ b/tests/test_embeddings.py
@@ -1182,6 +1182,18 @@ def test_signature_dedup_php(self):
         assert _looks_like_signature(
             "trait HasTimestamps {", trait_node,
         ) is True
+        enum_node = _mk_node(
+            language="php", name="OrderStatus", params=None,
+        )
+        assert _looks_like_signature(
+            "enum OrderStatus: string {", enum_node,
+        ) is True
+        # ``case PENDING = 'pending';`` is an enum VALUE, not a
+        # declaration — must NOT match (``case`` isn't in the keyword
+        # allowlist).
+        assert _looks_like_signature(
+            "case PENDING = 'pending';", enum_node,
+        ) is False
         # Body line that happens to contain the name but not a decl
         # keyword → must NOT be classified as signature.
         assert _looks_like_signature(

From 21bb9b376192ee43a24bebf07ec3684c1656f54f Mon Sep 17 00:00:00 2001
From: Minidoracat
Date: Mon, 20 Apr 2026 00:00:39 +0800
Subject: [PATCH 9/9] docs(embeddings): document new ``embed`` CLI subcommand
 in 5 README languages
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Why: the body-embedding feature ships a new ``code-review-graph embed``
subcommand (with ``--include-body`` / ``--confirm-reembed`` /
``--skip-cost-check`` flags) but the existing CLI reference tables in
each README stopped at ``eval`` / ``serve``. A first-time user after
this PR merges wouldn't discover the subcommand from docs alone.

What: adds two rows to the CLI reference code block in every translated
README (EN / zh-CN / ja-JP / ko-KR / hi-IN):

    code-review-graph embed                                    # fresh repo — body enrichment auto-on
    code-review-graph embed --include-body --confirm-reembed   # existing DB — opt in to body

Native phrasing for each language explains the fresh-vs-existing-DB
distinction so the sticky-flag + double-flag safety gate is obvious
from the docs without cross-referencing the PR body.

Notes for translators / follow-up contributors: the ``embed`` CLI also
supports optional ``--provider`` / ``--model`` / ``--skip-cost-check``
flags that the full ``code-review-graph embed --help`` output covers;
the CLI table sticks to the two shapes most users need.
---
 README.hi-IN.md | 2 ++
 README.ja-JP.md | 2 ++
 README.ko-KR.md | 2 ++
 README.md       | 2 ++
 README.zh-CN.md | 2 ++
 5 files changed, 10 insertions(+)

diff --git a/README.hi-IN.md b/README.hi-IN.md
index 45bcc0ce..413981f2 100644
--- a/README.hi-IN.md
+++ b/README.hi-IN.md
@@ -196,6 +196,8 @@ code-review-graph register # मल्टी-रिपो रजिस
 code-review-graph unregister # रजिस्ट्री से रिपो हटाएं
 code-review-graph repos # रजिस्टर्ड रिपॉज़िटरीज़ की सूची
 code-review-graph eval # मूल्यांकन बेंचमार्क चलाएं
+code-review-graph embed # एम्बेडिंग वेक्टर्स कंप्यूट / रिफ्रेश (नए रेपो में body enrichment ऑटो-एक्टिव)
+code-review-graph embed --include-body --confirm-reembed # मौजूदा DB: body enrichment को एक्सप्लिसिट opt-in
 code-review-graph serve # MCP सर्वर शुरू करें
 ```

diff --git a/README.ja-JP.md b/README.ja-JP.md
index 6708388f..1929f15a 100644
--- a/README.ja-JP.md
+++ b/README.ja-JP.md
@@ -198,6 +198,8 @@ code-review-graph register # マルチリポジトリレジストリに
 code-review-graph unregister # レジストリからリポジトリを削除
 code-review-graph repos # 登録済みリポジトリの一覧表示
 code-review-graph eval # 評価ベンチマークの実行
+code-review-graph embed # 埋め込みベクトルの計算 / 更新(新規リポジトリはbody enrichmentが自動有効)
+code-review-graph embed --include-body --confirm-reembed # 既存DB: body enrichmentを明示的にopt-in
 code-review-graph serve # MCPサーバーの起動
 ```

diff --git a/README.ko-KR.md b/README.ko-KR.md
index 68a555a4..692c8804 100644
--- a/README.ko-KR.md
+++ b/README.ko-KR.md
@@ -198,6 +198,8 @@ code-review-graph register # 멀티 레포 레지스트리에 저장소
 code-review-graph unregister # 레지스트리에서 저장소 제거
 code-review-graph repos # 등록된 저장소 목록
 code-review-graph eval # 평가 벤치마크 실행
+code-review-graph embed # 임베딩 벡터 계산 / 갱신 (신규 repo는 body enrichment 자동 활성화)
+code-review-graph embed --include-body --confirm-reembed # 기존 DB: body enrichment 명시적 opt-in
 code-review-graph serve # MCP 서버 시작
 ```

diff --git a/README.md b/README.md
index ee5cd61c..fdf38c23 100644
--- a/README.md
+++ b/README.md
@@ -258,6 +258,8 @@ code-review-graph register # Register repo in multi-repo registry
 code-review-graph unregister # Remove repo from registry
 code-review-graph repos # List registered repositories
 code-review-graph eval # Run evaluation benchmarks
+code-review-graph embed # Compute / refresh embeddings (new repos auto-enable body)
+code-review-graph embed --include-body --confirm-reembed # Existing DB: opt in to body enrichment
 code-review-graph serve # Start MCP server
 ```

diff --git a/README.zh-CN.md b/README.zh-CN.md
index 2a8e387d..cd2343d0 100644
--- a/README.zh-CN.md
+++ b/README.zh-CN.md
@@ -196,6 +196,8 @@ code-review-graph register # 将仓库注册到多仓库注册表
 code-review-graph unregister # 从注册表移除仓库
 code-review-graph repos # 列出已注册的仓库
 code-review-graph eval # 运行评估基准测试
+code-review-graph embed # 计算 / 刷新向量嵌入(新 repo 自动启用 body enrichment)
+code-review-graph embed --include-body --confirm-reembed # 既有 DB:显式 opt-in body enrichment
 code-review-graph serve # 启动 MCP 服务器
 ```
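For reference, the fresh-vs-existing-DB behavior the two README rows
describe can be sketched as a small decision function. This is a
hypothetical simplification — the function name and parameter names are
illustrative, not the shipped API — but the gate shape matches the
documented semantics: fresh repos auto-enable, a persisted sticky flag
wins, and a legacy DB flips only when both CLI flags are passed:

```python
def resolve_body_enrichment(db_is_fresh: bool, sticky_enabled: bool,
                            include_body: bool, confirm_reembed: bool) -> bool:
    """Sketch of the double-flag safety gate (illustrative names)."""
    if db_is_fresh:
        # New repos: auto-on; no existing vectors, so no re-embed bill.
        return True
    if sticky_enabled:
        # Flag already persisted per-DB; incremental runs stay
        # consistent with previously embedded nodes.
        return True
    # Legacy DB: require the explicit double opt-in so a plain
    # `embed` run never triggers a surprise full re-embed.
    return include_body and confirm_reembed
```

Passing ``--include-body`` alone on an existing DB therefore changes
nothing, which is the safety property the docs want a first-time reader
to see without cross-referencing the PR body.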