diff --git a/README.hi-IN.md b/README.hi-IN.md index ab40ade0..413981f2 100644 --- a/README.hi-IN.md +++ b/README.hi-IN.md @@ -133,7 +133,7 @@ Build the code review graph for this project | **23 भाषाएं + नोटबुक** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **ब्लास्ट-रेडियस विश्लेषण** | दिखाता है कि किसी भी बदलाव से कौन से फ़ंक्शन, क्लासेज़, और फ़ाइलें प्रभावित होती हैं | | **ऑटो-अपडेट हुक्स** | बिना मैन्युअल हस्तक्षेप के हर फ़ाइल एडिट और git कमिट पर ग्राफ अपडेट होता है | -| **सिमेंटिक सर्च** | sentence-transformers, Google Gemini, MiniMax, या किसी भी OpenAI-compatible एंडपॉइंट (असली OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) के ज़रिए वैकल्पिक वेक्टर एम्बेडिंग | +| **सिमेंटिक सर्च** | sentence-transformers, Google Gemini, MiniMax, या किसी भी OpenAI-compatible एंडपॉइंट (असली OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) के ज़रिए वैकल्पिक वेक्टर एम्बेडिंग. फ़ंक्शन body का implementation हिस्सा एम्बेड होता है (signature और शुरुआती docstring/header comments हटा दिए जाते हैं ताकि snippet का token budget real logic पर खर्च हो, documentation पर नहीं). 
long-context मॉडल्स "कोड क्या करता है" के हिसाब से रैंक कर सकते हैं | | **इंटरैक्टिव विज़ुअलाइज़ेशन** | सर्च, कम्युनिटी लीजेंड टॉगल, और डिग्री-स्केल्ड नोड्स के साथ D3.js फ़ोर्स-डायरेक्टेड ग्राफ | | **हब और ब्रिज डिटेक्शन** | betweenness centrality के ज़रिए सबसे ज़्यादा कनेक्टेड नोड्स और आर्किटेक्चरल चोकपॉइंट्स खोजें | | **सरप्राइज़ स्कोरिंग** | अप्रत्याशित कपलिंग का पता लगाएं: क्रॉस-कम्युनिटी, क्रॉस-लैंग्वेज, पेरीफ़ेरल-टू-हब एज़ेज़ | @@ -196,6 +196,8 @@ code-review-graph register # मल्टी-रिपो रजिस code-review-graph unregister # रजिस्ट्री से रिपो हटाएं code-review-graph repos # रजिस्टर्ड रिपॉज़िटरीज़ की सूची code-review-graph eval # मूल्यांकन बेंचमार्क चलाएं +code-review-graph embed # एम्बेडिंग वेक्टर्स कंप्यूट / रिफ्रेश (नए रेपो में body enrichment ऑटो-एक्टिव) +code-review-graph embed --include-body --confirm-reembed # मौजूदा DB: body enrichment को एक्सप्लिसिट opt-in code-review-graph serve # MCP सर्वर शुरू करें ``` @@ -286,7 +288,7 @@ export CRG_OPENAI_BATCH_SIZE=100 # टाइट बैच > **मॉडल चुनने की सलाह।** लंबे समय के उपयोग के लिए `-preview` / `-beta` / `-exp` वाले model ID (जैसे `google/gemini-embedding-2-preview`) से बचें — preview मॉडल्स के वज़न बदल सकते हैं (डाइमेंशन बदलने पर पूरा re-embed करना पड़ेगा) या बिना नोटिस deprecate हो सकते हैं। स्टेबल GA मॉडल्स की सलाह दी जाती है: `text-embedding-3-small` / `text-embedding-3-large` (OpenAI), `Qwen/Qwen3-Embedding-8B` (vLLM / LocalAI सेल्फ-होस्टेड के ज़रिए), या `gemini-embedding-001` (नेटिव Gemini provider के ज़रिए, `GOOGLE_API_KEY` चाहिए). > -> साथ ही ध्यान दें: वर्तमान में `code-review-graph` केवल **फ़ंक्शन सिग्नेचर** एम्बेड करता है (प्रति नोड ~10 tokens, जैसे `"parse_file function (path: str) returns Tree"`). 
जिन मॉडल्स की क्वालिटी का मुख्य source लंबे context में function body को समझना है (जैसे Gemini 2 या Qwen3-8B के MTEB-code SOTA स्कोर्स), वे इस इनपुट लंबाई पर छोटे मॉडल्स से कम अंतर दिखाएंगे। Body / docstring एम्बेडिंग को फ़ॉलो-अप एन्हांसमेंट के रूप में ट्रैक किया जा रहा है। +> `code-review-graph` signature के साथ-साथ **फ़ंक्शन body के implementation** हिस्से का truncated snippet भी एम्बेड करता है (signature और शुरुआती docstring/header comments हटा दिए जाते हैं ताकि snippet token budget real logic पर जाए; प्रति प्रोवाइडर char budget: local 700 / google 3000 / minimax 3000 / openai 6000). अब long-context मॉडल्स (Gemini 2, Qwen3-8B) फ़ंक्शन के नाम के बजाय "वह क्या करता है" के आधार पर रैंक कर सकते हैं। नए रेपो में यह अपने आप active हो जाता है; मौजूदा DB legacy signature-only व्यवहार रखते हैं जब तक आप `code-review-graph embed --include-body --confirm-reembed` से opt-in नहीं करते (एक बार का पूरा re-embed क्लाउड प्रोवाइडर्स पर API tokens खर्च कर सकता है — signature-only पर रहने के लिए `CRG_EMBED_INCLUDE_BODY=0` सेट करें). 
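The sticky opt-in described in the README rows above is enforced by a safety gate in `cli.py` later in this diff (a non-empty `embeddings.db` refuses `--include-body` without `--confirm-reembed`, and large cloud re-embeds print a ballpark token notice). A minimal sketch of that decision logic, with simplified hypothetical names — `include_body_gate` and the literal provider set are illustrations, not the actual API:

```python
def include_body_gate(
    existing_rows: int,
    confirm_reembed: bool,
    provider: str,
    skip_cost_check: bool,
) -> tuple[bool, str]:
    """Return (proceed, notice). Mirrors the --include-body gate:
    a non-empty embeddings DB requires an explicit --confirm-reembed,
    and cloud re-embeds of >= 500 rows get a rough token estimate."""
    if existing_rows > 0 and not confirm_reembed:
        return False, (
            f"--include-body on a non-empty embeddings DB ({existing_rows} "
            "rows) triggers a full re-embed; re-run with --confirm-reembed"
        )
    notice = ""
    # Assumed stand-in for CLOUD_PROVIDERS imported from embeddings.py.
    if (
        provider in {"google", "minimax", "openai"}
        and existing_rows >= 500
        and not skip_cost_check
    ):
        notice = (
            f"re-embedding {existing_rows} nodes via '{provider}' will "
            f"spend API tokens (~{existing_rows * 200} ballpark)"
        )
    return True, notice
```

The 200-tokens-per-node ballpark matches the estimate the CLI hunk prints; local providers skip the notice entirely because no API spend is involved.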
diff --git a/README.ja-JP.md b/README.ja-JP.md index 220ea864..1929f15a 100644 --- a/README.ja-JP.md +++ b/README.ja-JP.md @@ -135,7 +135,7 @@ gitコミットやファイル保存のたびにフックが起動します。 | **23言語 + ノートブック** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **影響範囲分析** | 変更によって影響を受ける関数、クラス、ファイルを正確に表示 | | **自動更新フック** | ファイル編集やgitコミットのたびに手動操作なしでグラフを更新 | -| **セマンティック検索** | sentence-transformers、Google Gemini、MiniMax、またはOpenAI互換エンドポイント(本家OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)によるオプションのベクトル埋め込み | +| **セマンティック検索** | sentence-transformers、Google Gemini、MiniMax、またはOpenAI互換エンドポイント(本家OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)によるオプションのベクトル埋め込み。関数 body の実装部分を埋め込みます(シグネチャと先頭の docstring / ヘッダーコメントは取り除き、snippet の token 予算を実ロジックに回します)。長 context モデルが「コードが何をしているか」でランキングできます | | **インタラクティブ可視化** | D3.js力学レイアウトグラフ。検索、コミュニティ凡例切替、次数スケーリングノード対応 | | **ハブ・ブリッジ検出** | 最も接続の多いノードと媒介中心性によるアーキテクチャのボトルネックを発見 | | **サプライズスコアリング** | 予期しない結合を検出:コミュニティ間、言語間、周辺からハブへのエッジ | @@ -198,6 +198,8 @@ code-review-graph register # マルチリポジトリレジストリに code-review-graph unregister # レジストリからリポジトリを削除 code-review-graph repos # 登録済みリポジトリの一覧表示 code-review-graph eval # 評価ベンチマークの実行 +code-review-graph embed # 埋め込みベクトルの計算 / 更新(新規リポジトリはbody enrichmentが自動有効) +code-review-graph embed --include-body --confirm-reembed # 既存DB: body enrichmentを明示的にopt-in code-review-graph serve # MCPサーバーの起動 ``` @@ -288,7 +290,7 @@ base URLがlocalhost(`127.0.0.1`、`localhost`、`0.0.0.0`、`::1`)を指し > **モデル選択のヒント。** `-preview` / `-beta` / `-exp` 付きのmodel ID(例:`google/gemini-embedding-2-preview`)は長期運用には避けてください。preview モデルは重みが変更される(次元が変わると全ノード re-embed 必須)か、予告なく deprecate される可能性があります。安定版 GA モデル推奨:`text-embedding-3-small` / `text-embedding-3-large`(OpenAI)、`Qwen/Qwen3-Embedding-8B`(vLLM / LocalAI 自前ホスト経由)、または `gemini-embedding-001`(ネイティブ Gemini provider 経由、`GOOGLE_API_KEY` が必要)。 > -> また注意:現状 `code-review-graph` 
は**関数シグネチャのみ**を埋め込みます(ノードあたり約10トークン、例:`"parse_file function (path: str) returns Tree"`)。長 context で関数 body を理解する能力で差をつけるモデル(Gemini 2 や Qwen3-8B の MTEB-code SOTA スコア)は、この入力長では小型モデルとの差がかなり縮まります。Body / docstring 埋め込みはフォローアップ拡張として追跡中です。 +> `code-review-graph` はシグネチャに加えて**関数 body の実装**部分の切り取り片も埋め込みます(シグネチャと先頭の docstring / ヘッダーコメントは取り除き、snippet の token 予算を実ロジックに割り当てます。プロバイダごとの文字数上限:local 700 / google 3000 / minimax 3000 / openai 6000)。長 context モデル(Gemini 2、Qwen3-8B)は「関数が何をしているか」でランク付けでき、名前だけに頼らなくなります。新規リポジトリでは自動で有効;既存 DB はレガシーの signature-only 動作を保持し、`code-review-graph embed --include-body --confirm-reembed` で明示的に opt-in するまで切り替わりません(一度限りの全件 re-embed はクラウドプロバイダで API トークンを消費します。signature-only のままにしたい場合は `CRG_EMBED_INCLUDE_BODY=0`)。 diff --git a/README.ko-KR.md b/README.ko-KR.md index f593f065..692c8804 100644 --- a/README.ko-KR.md +++ b/README.ko-KR.md @@ -135,7 +135,7 @@ git 커밋이나 파일 저장마다 훅이 실행됩니다. 그래프는 변경 | **23개 언어 + 노트북** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **영향 범위 분석** | 변경에 의해 영향 받는 함수, 클래스, 파일을 정확히 보여줍니다 | | **자동 업데이트 훅** | 수동 개입 없이 파일 편집 및 git 커밋마다 그래프가 업데이트됩니다 | -| **시맨틱 검색** | sentence-transformers, Google Gemini, MiniMax, 또는 OpenAI 호환 엔드포인트(실제 OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI)를 통한 선택적 벡터 임베딩 | +| **시맨틱 검색** | sentence-transformers, Google Gemini, MiniMax, 또는 OpenAI 호환 엔드포인트(실제 OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI)를 통한 선택적 벡터 임베딩. 함수 body의 구현 부분을 임베딩합니다(시그니처와 선두 docstring/헤더 주석은 제거하여 snippet 토큰 예산을 실제 로직에 할당). 
긴 context 모델이 "코드가 무엇을 하는지"로 랭킹할 수 있습니다 | | **인터랙티브 시각화** | 검색, 커뮤니티 범례 토글, 차수 기반 노드 크기 조정이 가능한 D3.js 포스 다이렉티드 그래프 | | **허브 및 브릿지 감지** | 가장 많이 연결된 노드와 매개 중심성을 통한 아키텍처 병목 지점 탐색 | | **서프라이즈 스코어링** | 예상치 못한 결합 감지: 커뮤니티 간, 언어 간, 주변부-허브 엣지 | @@ -198,6 +198,8 @@ code-review-graph register # 멀티 레포 레지스트리에 저장소 code-review-graph unregister # 레지스트리에서 저장소 제거 code-review-graph repos # 등록된 저장소 목록 code-review-graph eval # 평가 벤치마크 실행 +code-review-graph embed # 임베딩 벡터 계산 / 갱신 (신규 repo는 body enrichment 자동 활성화) +code-review-graph embed --include-body --confirm-reembed # 기존 DB: body enrichment 명시적 opt-in code-review-graph serve # MCP 서버 시작 ``` @@ -288,7 +290,7 @@ base URL이 localhost(`127.0.0.1`, `localhost`, `0.0.0.0`, `::1`)를 가리킬 > **모델 선택 팁.** `-preview` / `-beta` / `-exp` 접미사가 붙은 model ID(예: `google/gemini-embedding-2-preview`)는 장기 운영용으로 피하세요. preview 모델은 가중치가 바뀌거나(차원 변경 시 전체 re-embed 필수) 예고 없이 deprecate될 수 있습니다. 안정 GA 모델 권장: `text-embedding-3-small` / `text-embedding-3-large`(OpenAI), `Qwen/Qwen3-Embedding-8B`(vLLM / LocalAI 자체 호스팅 경유), 또는 `gemini-embedding-001`(네이티브 Gemini provider 경유, `GOOGLE_API_KEY` 필요). > -> 참고로 현재 `code-review-graph`는 **함수 시그니처만** 임베딩합니다(노드당 약 10 토큰, 예: `"parse_file function (path: str) returns Tree"`). 긴 context로 함수 body를 이해하는 능력으로 차별화되는 모델(Gemini 2 또는 Qwen3-8B의 MTEB-code SOTA 점수)은 이 입력 길이에서 소형 모델과의 품질 차이가 훨씬 좁아집니다. Body / docstring 임베딩은 후속 개선 과제로 추적 중입니다. +> `code-review-graph`는 시그니처에 더해 **함수 body 구현부** 스니펫도 함께 임베딩합니다(시그니처와 선두 docstring/헤더 주석은 제거하고 snippet 토큰 예산을 실제 로직에 할당. 프로바이더별 문자 예산: local 700 / google 3000 / minimax 3000 / openai 6000). 이제 긴 context 모델(Gemini 2, Qwen3-8B)이 함수 이름만이 아니라 "함수가 무엇을 하는지"로 랭킹할 수 있습니다. 새 레포지토리는 자동으로 활성화; 기존 DB는 레거시 signature-only 동작을 유지하며, `code-review-graph embed --include-body --confirm-reembed`로 명시적 opt-in해야 전환됩니다(일회성 전량 re-embed는 클라우드 프로바이더에서 API 토큰을 소비. signature-only로 유지하려면 `CRG_EMBED_INCLUDE_BODY=0`). 
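The per-provider character budgets quoted in the READMEs (local 700 / google 3000 / minimax 3000 / openai 6000) resolve through a three-step precedence implemented in the `embeddings.py` hunk later in this diff: env override, then provider-prefix lookup, then fallback. A self-contained sketch of that resolution (the public name `resolve_body_max_chars` is an illustrative stand-in for the module-private `_resolve_body_max_chars`):

```python
import os

# Budgets mirrored from _BODY_MAX_CHARS_BY_PROVIDER in this diff.
BODY_MAX_CHARS_BY_PROVIDER = {
    "local": 700,    # MiniLM ~256-token window
    "google": 3000,  # gemini-embedding-001 ~2048-token window, conservative
    "minimax": 3000, # embo-01 ~1024-token window, pay-per-use
    "openai": 6000,  # text-embedding-3-* 8192-token ceiling, safety margin
}
BODY_MAX_CHARS_FALLBACK = 700


def resolve_body_max_chars(provider_name: str) -> int:
    """CRG_EMBED_BODY_MAX_CHARS override first (positive int only),
    then lookup by the prefix before ':', then the fallback."""
    override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS")
    if override:
        try:
            n = int(override)
            if n > 0:
                return n
        except ValueError:
            pass  # malformed override: fall through to the table
    prefix = provider_name.split(":", 1)[0].lower() if provider_name else ""
    return BODY_MAX_CHARS_BY_PROVIDER.get(prefix, BODY_MAX_CHARS_FALLBACK)
```

Note the budgets are characters, not tokens — the conservative ratios leave headroom for the signature metadata that is embedded alongside the body snippet.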
diff --git a/README.md b/README.md index 3dd714f4..fdf38c23 100644 --- a/README.md +++ b/README.md @@ -195,7 +195,7 @@ The blast-radius analysis never misses an actually impacted file (perfect recall | **23 languages + notebooks** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **Blast-radius analysis** | Shows exactly which functions, classes, and files are affected by any change | | **Auto-update hooks** | Graph updates on every file edit and git commit without manual intervention | -| **Semantic search** | Optional vector embeddings via sentence-transformers, Google Gemini, MiniMax, or any OpenAI-compatible endpoint (real OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI) | +| **Semantic search** | Optional vector embeddings via sentence-transformers, Google Gemini, MiniMax, or any OpenAI-compatible endpoint (real OpenAI, Azure, new-api, LiteLLM, vLLM, LocalAI). 
Embeds the function body logic (signature + leading docstrings/header comments are stripped so the token budget goes to implementation, not documentation) — long-context models can rank by what code *does* | | **Interactive visualisation** | D3.js force-directed graph with search, community legend toggles, and degree-scaled nodes | | **Hub & bridge detection** | Find most-connected nodes and architectural chokepoints via betweenness centrality | | **Surprise scoring** | Detect unexpected coupling: cross-community, cross-language, peripheral-to-hub edges | @@ -258,6 +258,8 @@ code-review-graph register # Register repo in multi-repo registry code-review-graph unregister # Remove repo from registry code-review-graph repos # List registered repositories code-review-graph eval # Run evaluation benchmarks +code-review-graph embed # Compute / refresh embeddings (new repos auto-enable body) +code-review-graph embed --include-body --confirm-reembed # Existing DB: opt in to body enrichment code-review-graph serve # Start MCP server ``` @@ -376,12 +378,16 @@ The cloud-egress warning is auto-skipped when the base URL points to localhost > `gemini-embedding-001` (via the native Gemini provider, which requires > `GOOGLE_API_KEY` instead of the OpenAI-compatible path). > -> Also note: `code-review-graph` currently embeds **function signatures only** -> (~10 tokens per node, e.g. `"parse_file function (path: str) returns Tree"`). -> Models whose headline quality comes from long-context body understanding -> (such as Gemini 2 or Qwen3-8B at their MTEB-code SOTA scores) will see a -> much narrower quality gap against smaller models at this input length. -> Body/docstring embedding is tracked as a follow-up enhancement. 
+> `code-review-graph` also embeds a truncated **function body +> implementation** snippet alongside the signature (signature + leading +> docstrings / header comments are stripped so the snippet tokens go to +> real logic; per-provider char budget: local 700 / google 3000 / +> minimax 3000 / openai 6000). Long-context models (Gemini 2, Qwen3-8B) +> can rank by what a function *does* and not just what it's named. New repositories pick this up automatically; +> existing DBs keep the legacy signature-only behavior until you opt in +> with `code-review-graph embed --include-body --confirm-reembed` (a +> one-time re-embed that may spend API tokens on cloud providers — set +> `CRG_EMBED_INCLUDE_BODY=0` to stay on signatures). #### Tool Filtering diff --git a/README.zh-CN.md b/README.zh-CN.md index 9d1c8fa0..cd2343d0 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -133,7 +133,7 @@ Build the code review graph for this project | **23 种语言 + 笔记本** | Python, TypeScript/TSX, JavaScript, Vue, Svelte, Go, Rust, Java, Scala, C#, Ruby, Kotlin, Swift, PHP, Solidity, C/C++, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Jupyter/Databricks (.ipynb) | | **影响半径分析** | 精确展示任何变更所影响的函数、类和文件 | | **自动更新钩子** | 每次文件编辑和 git 提交时自动更新图,无需手动干预 | -| **语义搜索** | 可选的向量嵌入,支持 sentence-transformers、Google Gemini、MiniMax,或任何 OpenAI 兼容端点(真实 OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI) | +| **语义搜索** | 可选的向量嵌入,支持 sentence-transformers、Google Gemini、MiniMax,或任何 OpenAI 兼容端点(真实 OpenAI、Azure、new-api、LiteLLM、vLLM、LocalAI)。嵌入函数 body 的实作片段(剥除签名与开头的 docstring/header 注释,让 token 预算花在真正的逻辑而非文档),让长 context 模型按「代码做什么」排序 | | **交互式可视化** | D3.js 力导向图,支持搜索、社区图例切换和按度数缩放的节点 | | **Hub 与 Bridge 检测** | 查找连接最多的节点和通过介数中心性发现架构瓶颈 | | **异常评分** | 检测意外耦合:跨社区、跨语言、外围到核心的边 | @@ -196,6 +196,8 @@ code-review-graph register # 将仓库注册到多仓库注册表 code-review-graph unregister # 从注册表移除仓库 code-review-graph repos # 列出已注册的仓库 code-review-graph eval # 运行评估基准测试 +code-review-graph embed # 计算 / 刷新向量嵌入(新 repo 自动启用 body enrichment) +code-review-graph embed 
--include-body --confirm-reembed # 既有 DB:显式 opt-in body enrichment code-review-graph serve # 启动 MCP 服务器 ``` @@ -286,7 +288,7 @@ export CRG_OPENAI_BATCH_SIZE=100 # 某些网关有更严 > **模型选择提示。** 避免用 `-preview` / `-beta` / `-exp` 结尾的 model ID(例如 `google/gemini-embedding-2-preview`)做长期使用——preview 模型可能更换权重(维度一变就要全量 re-embed)或被无预警下架。建议改用正式 GA 模型:`text-embedding-3-small` / `text-embedding-3-large`(OpenAI)、`Qwen/Qwen3-Embedding-8B`(经 vLLM / LocalAI 自宿主)、或 `gemini-embedding-001`(经原生 Gemini provider,需要 `GOOGLE_API_KEY`)。 > -> 另外请注意:目前 `code-review-graph` 只嵌入**函数签名**(每节点约 10 tokens,例如 `"parse_file function (path: str) returns Tree"`)。那些靠长 context 理解函数 body 来拉开差距的模型(Gemini 2 或 Qwen3-8B 在 MTEB-code 的 SOTA 分数)在这个输入长度下跟小模型的品质差距会小很多。Body / docstring 嵌入已列为后续增强任务。 +> `code-review-graph` 也会在签名后面接上截断过的**函数 body 实作**片段(剥除签名与开头的 docstring/header 注释,让 snippet token 留给真正的逻辑而非文档;每家 provider 的字元上限:local 700 / google 3000 / minimax 3000 / openai 6000),长 context 模型(Gemini 2、Qwen3-8B)终于能按「代码做什么」排序而不是只看名字。新 repo 自动开启;既有 DB 保留原本的 signature-only 行为,直到你执行 `code-review-graph embed --include-body --confirm-reembed` 显式 opt-in(一次性全量 re-embed,对云 provider 会花 API token——想留在 signature-only 设 `CRG_EMBED_INCLUDE_BODY=0`)。 diff --git a/code_review_graph/cli.py b/code_review_graph/cli.py index 6e9349b6..d097b75e 100644 --- a/code_review_graph/cli.py +++ b/code_review_graph/cli.py @@ -480,6 +480,41 @@ def main() -> None: eval_cmd.add_argument("--report", action="store_true", help="Generate report from results") eval_cmd.add_argument("--output-dir", default=None, help="Output directory for results") + # embed (opt-in body enrichment gate) + embed_cmd = sub.add_parser( + "embed", + help="Compute embeddings for all graph nodes (opt-in body enrichment)", + ) + embed_cmd.add_argument("--repo", default=None, help="Repository root (auto-detected)") + embed_cmd.add_argument( + "--provider", + default=None, + help="Embedding provider: local (default) / google / minimax / openai", + ) + embed_cmd.add_argument( + "--model", + 
default=None, + help="Model name (local: sentence-transformers; openai: CRG_OPENAI_MODEL)", + ) + embed_cmd.add_argument( + "--include-body", + action="store_true", + help="Embed function bodies alongside signatures (sets CRG_EMBED_INCLUDE_BODY=1)", + ) + embed_cmd.add_argument( + "--confirm-reembed", + action="store_true", + help="Required alongside --include-body when embeddings.db already has rows — " + "without this flag the command refuses to trigger a full re-embed " + "(cost-safety gate for cloud providers).", + ) + embed_cmd.add_argument( + "--skip-cost-check", + action="store_true", + help="Bypass the ballpark token/cost estimate warning shown before " + "re-embed on cloud providers (CI / automation).", + ) + # detect-changes detect_cmd = sub.add_parser("detect-changes", help="Analyze change impact") detect_cmd.add_argument("--base", default="HEAD~1", help="Git diff base (default: HEAD~1)") @@ -722,6 +757,68 @@ def main() -> None: if result.get("files_updated", 0) > 0: _cli_post_process(store) + elif args.command == "embed": + from .embeddings import ( + CLOUD_PROVIDERS, + EmbeddingStore, + _warn_cloud_egress, + embed_all_nodes, + ) + + provider = getattr(args, "provider", None) or "local" + model = getattr(args, "model", None) + + # --include-body safety gate: existing DB requires --confirm-reembed. + if getattr(args, "include_body", False): + emb_for_count = EmbeddingStore(store.db_path) + try: + existing = emb_for_count.count() + finally: + emb_for_count.close() + if existing > 0 and not getattr(args, "confirm_reembed", False): + logging.error( + "--include-body on a non-empty embeddings DB (%d rows) " + "triggers a full re-embed. Re-run with --confirm-reembed " + "to proceed, or unset --include-body to keep the current " + "sticky gate. 
See CHANGELOG for cost notes.", + existing, + ) + sys.exit(2) + if ( + provider in CLOUD_PROVIDERS + and existing >= 500 + and not getattr(args, "skip_cost_check", False) + ): + print( + f"Heads up: re-embedding {existing} nodes via '{provider}' " + f"will spend API tokens. A rough ballpark is " + f"{existing * 200} tokens; check your provider's pricing. " + "Pass --skip-cost-check to silence this notice.", + file=sys.stderr, + ) + os.environ["CRG_EMBED_INCLUDE_BODY"] = "1" + + emb_store = EmbeddingStore(store.db_path, provider=provider, model=model) + try: + if emb_store.provider is None: + logging.error( + "Embedding provider '%s' unavailable. For 'local', " + "install the 'embeddings' extra: " + "pip install code-review-graph[embeddings]", + provider, + ) + sys.exit(1) + if provider in CLOUD_PROVIDERS: + _warn_cloud_egress(emb_store.provider.name) + count = embed_all_nodes(store, emb_store) + print( + f"Embedded {count} new/changed nodes " + f"(provider={emb_store.provider.name}, " + f"total={emb_store.count()})" + ) + finally: + emb_store.close() + elif args.command == "status": stats = store.get_stats() print(f"Nodes: {stats.total_nodes}") diff --git a/code_review_graph/embeddings.py b/code_review_graph/embeddings.py index be29045f..60b04274 100644 --- a/code_review_graph/embeddings.py +++ b/code_review_graph/embeddings.py @@ -13,11 +13,13 @@ import hashlib import logging import os +import re import sqlite3 import struct import sys import time from abc import ABC, abstractmethod +from collections import OrderedDict from pathlib import Path from typing import Any from urllib.parse import urlparse @@ -662,6 +664,11 @@ def _check_available() -> bool: text_hash TEXT NOT NULL, provider TEXT NOT NULL DEFAULT 'unknown' ); + +CREATE TABLE IF NOT EXISTS embeddings_meta ( + key TEXT PRIMARY KEY, + value TEXT NOT NULL +); """ @@ -688,8 +695,541 @@ def _cosine_similarity(a: list[float], b: list[float]) -> float: return dot / (norm_a * norm_b) -def _node_to_text(node: 
GraphNode) -> str: - """Convert a node to a searchable text representation.""" +# --------------------------------------------------------------------------- +# Body enrichment helpers (Iter 3 final: ADR-A1+dedup / B4 per-provider / C3 sticky) +# --------------------------------------------------------------------------- + +# ADR-B4: per-provider body char budget. Keys are provider-name prefixes +# (the part before ':'). Fallback applies to unknown providers. +_BODY_MAX_CHARS_BY_PROVIDER: dict[str, int] = { + "local": 700, # MiniLM ~256 token window; body + metadata must fit together + "google": 3000, # gemini-embedding-001 ~2048 token window, conservative 50% + "minimax": 3000, # embo-01 ~1024 token window, pay-per-use + "openai": 6000, # text-embedding-3-* 8192 token ceiling, safety margin +} +_BODY_MAX_CHARS_FALLBACK = 700 +_BODY_MAX_LINES_DEFAULT = 15 + +# Pure log/print lines carry almost no semantic signal for code retrieval +# — a function's purpose is almost never "the fact that it logs" — but +# framework-heavy codebases (Discord bots, Rails controllers, Flask +# views, Spring services) can easily burn the body budget on them. +# Skip lines that ARE a log/print call and nothing else. Language- +# agnostic: Python logger/logging/print, JS/TS console.X / logger.X, +# Java System.out.println + log.X, Go fmt.Print* + log.Print*, Rust +# println!/eprintln!/dbg!. Keep lines where the log is mixed with +# other statements. +_LOG_ONLY_LINE_RE = re.compile( + r'^\s*' + r'(?:' + # identifier.method(...) — receiver identifier is deliberately + # restricted to conventional logger names so we don't swallow + # meaningful calls like argparse's ``serve_cmd.error(...)`` or a + # repository's ``db.warn(...)`` (Codex round 10). 
+ r'(?:' + r'log|logger|logging|logs|console|slog|trace|LOG|Logger|LOGGER' + r'|(?:self|this|cls)\s*\.\s*(?:log|logger|logging|slog|trace|_log|_logger)' + r'|_log|_logger' + r')\s*\.\s*' + r'(?:log|debug|info|warn|warning|error|trace|fatal|notice|fine|finer|finest' + r'|Log|Debug|Info|Warn|Warning|Error|Trace|Fatal' + r'|Print|Println|Printf)\s*\(' + r'|' + # bare calls: print(...) / println!(...) / eprintln!(...) / dbg!(...). + # Excludes ``Fprint*`` and other writer-parameterised calls because + # they usually emit to response streams, not logs. + r'(?:print|println|eprintln|dbg)\s*!?\s*\(' + r'|' + # java: System.out.println(...) / System.err.print(...). + r'System\s*\.\s*(?:out|err)\s*\.\s*(?:print|println|printf)\s*\(' + r'|' + # go: fmt.Print / Println / Printf — stdout-ish. Excludes fmt.Fprint* + # because those take an arbitrary io.Writer (HTTP responses, files) + # and carry real output, not logs. + r'fmt\s*\.\s*(?:Print|Println|Printf)\s*\(' + r'|' + # PHP / Laravel: + # instance: $logger->info(...) / $this->logger->debug(...) + # static: Log::info(...) / \Log::debug(...) (Laravel facade) + # bare: error_log(...) / syslog(...) / trigger_error(...) + # PSR-3 log levels covered: debug, info, notice, warning, error, + # critical, alert, emergency (plus common "log"/"warn" aliases). + r'\$(?:logger|log|this->logger|this->log)\s*->\s*' + r'(?:log|debug|info|warn|warning|error|notice|alert|emergency|critical)\s*\(' + r'|' + r'\\?(?:Log|Logger)\s*::\s*' + r'(?:log|debug|info|warn|warning|error|notice|alert|emergency|critical)\s*\(' + r'|' + r'(?:error_log|syslog|trigger_error)\s*\(' + r')' +) + + +def _is_log_only_line(s: str) -> bool: + """Return True when the line is purely a log/print call with no + other executable content. + + ``re.match`` alone is prefix-anchored, which would misclassify + lines like ``logger.info('x') or do_work()`` as log-only (Codex + round 10). 
Walk the string, tracking string literals and paren + depth, to find the balanced close of the log call's opening + ``(``; the line only counts as log-only when nothing substantive + follows that closing paren. + """ + stripped = s.strip().rstrip(";").strip() + m = _LOG_ONLY_LINE_RE.match(stripped) + if not m: + return False + # The regex ends on the opening ``(`` of the log call. Find its + # balanced ``)`` accounting for nested calls and string literals + # (`'`, `"`, or `` ` ``), respecting backslash escapes. Anything + # non-trivial after that closing paren means the line has other + # executable content and is NOT log-only. + i = m.end() - 1 + assert stripped[i] == "(" + depth = 0 + quote: str | None = None + escape = False + while i < len(stripped): + c = stripped[i] + if escape: + escape = False + elif quote: + if c == "\\": + escape = True + elif c == quote: + quote = None + elif c in ("'", '"', "`"): + quote = c + elif c == "(": + depth += 1 + elif c == ")": + depth -= 1 + if depth == 0: + rest = stripped[i + 1:].strip().rstrip(";").strip() + return rest == "" + i += 1 + return False +_FILE_CACHE_MAX_ENTRIES = 128 +_FILE_CACHE_MAX_BYTES = 2 * 1024 * 1024 # skip files larger than 2 MB +_FILE_CACHE_SNIFF_BYTES = 4096 # null-byte probe window for binary detect +_EMBED_META_KEY_BODY_ENABLED = "embed_body_enabled" + + +def _resolve_body_max_chars(provider_name: str) -> int: + """Return the per-provider char budget for body-enriched embeddings. + + Precedence: + 1. ``CRG_EMBED_BODY_MAX_CHARS`` env var (positive int) — global override + 2. ``_BODY_MAX_CHARS_BY_PROVIDER[prefix]`` where prefix is + ``provider_name.split(':', 1)[0].lower()`` + 3. 
``_BODY_MAX_CHARS_FALLBACK`` for unknown providers + """ + override = os.environ.get("CRG_EMBED_BODY_MAX_CHARS") + if override: + try: + n = int(override) + if n > 0: + return n + except ValueError: + pass + prefix = provider_name.split(":", 1)[0].lower() if provider_name else "" + return _BODY_MAX_CHARS_BY_PROVIDER.get(prefix, _BODY_MAX_CHARS_FALLBACK) + + +class _FileLineCache: + """Per-embed-batch file reader with bounded LRU (``OrderedDict``). + + Returns a slice of file lines for a given 1-indexed inclusive range. + Returns ``[]`` (empty, never raises) for: + - missing file / directory / permission / generic ``OSError`` + - binary file (null byte in first ``_FILE_CACHE_SNIFF_BYTES``) + - oversize file (> ``_FILE_CACHE_MAX_BYTES``) + - out-of-range line numbers (``line_start <= 0`` or beyond EOF) + + Not thread-safe. ``embed_nodes`` is single-threaded; do not share + instances across batches. + """ + + def __init__( + self, + repo_root: Path | None = None, + max_entries: int = _FILE_CACHE_MAX_ENTRIES, + ) -> None: + self._repo_root = repo_root + self._max_entries = max_entries + # value ``None`` marks a known-failed read so we don't re-probe + self._cache: OrderedDict[str, list[str] | None] = OrderedDict() + self._read_count = 0 + # Whether the most recent ``get_lines`` call saw a file that + # loaded at all. ``False`` only when the underlying file + # read/decode failed (missing / permission / binary / oversized + # / outside repo). Callers use this to decide whether an empty + # return reflects a legitimate empty slice or a transient read + # failure that should be retried on the next run, NOT persisted + # as an empty-body sentinel (Codex round 5 regression). + self._last_read_ok = True + + def _resolve(self, file_path: str) -> Path | None: + """Resolve ``file_path`` to an absolute path within the repo root. + + Returns ``None`` (treated as a read failure) if the resolved path + escapes the repo root. 
``file_path`` comes straight out of the + graph DB, so an attacker who can write to ``nodes.file_path`` + could otherwise point us at ``/etc/passwd`` or ``../../secret`` + and get the contents embedded into a cloud provider. We only + allow the escape when ``repo_root`` is unset (e.g. unit tests that + construct the cache with absolute fixture paths). + """ + p = Path(file_path) + if not p.is_absolute() and self._repo_root is not None: + p = self._repo_root / p + if self._repo_root is None: + return p + try: + resolved = p.resolve() + root = self._repo_root.resolve() + except (OSError, RuntimeError): + return None + try: + resolved.relative_to(root) + except ValueError: + return None + return resolved + + def _load(self, file_path: str) -> list[str] | None: + """Read and split a file; return ``None`` on any failure.""" + try: + p = self._resolve(file_path) + if p is None: + return None + st = p.stat() + if st.st_size > _FILE_CACHE_MAX_BYTES: + return None + with p.open("rb") as fh: + head = fh.read(_FILE_CACHE_SNIFF_BYTES) + if b"\x00" in head: + return None + rest = fh.read() + raw = head + rest + except (OSError, ValueError): + return None + try: + text = raw.decode("utf-8", errors="replace") + except Exception: + return None + return text.splitlines() + + def get_lines(self, file_path: str, line_start: int, line_end: int) -> list[str]: + if line_start <= 0 or line_end < line_start: + # Bogus range is a caller-side issue, not a read failure — + # retrying would hit the same bug, so mark as OK and let the + # caller persist whatever result they compute. 
+ self._last_read_ok = True + return [] + key = file_path + if key in self._cache: + self._cache.move_to_end(key) + lines = self._cache[key] + else: + lines = self._load(key) + self._read_count += 1 + self._cache[key] = lines + if len(self._cache) > self._max_entries: + self._cache.popitem(last=False) + self._last_read_ok = lines is not None + if lines is None: + return [] + if line_start > len(lines): + return [] + return lines[line_start - 1:line_end] + + @property + def last_read_ok(self) -> bool: + return self._last_read_ok + + def clear(self) -> None: + self._cache.clear() + self._last_read_ok = True + + +def _looks_like_signature(line: str, node: GraphNode) -> bool: + """Conservative substring check: is ``line`` the declaration of ``node``? + + Returns True only when ALL of: + - ``node.name`` appears in the line + - the line contains a language-specific declaration keyword + - (if ``node.params`` is set) the first param identifier appears + + Unknown languages return False (do not dedup — keep the line). + """ + s = line.strip() + if not s: + return False + if node.name not in s: + return False + + lang = (node.language or "").lower() + + if lang == "python": + if not ("def " in s or "class " in s): + return False + elif lang in ("javascript", "typescript", "jsx", "tsx"): + if not ( + "function " in s or "=>" in s or "class " in s or "async " in s + ): + return False + elif lang in ("java", "kotlin"): + has_type_name = bool( + node.return_type and node.return_type in s and node.name in s + ) + # Kotlin methods commonly omit an explicit return type (e.g. + # ``override fun save(user: User) { ... }``); fall back to the + # ``fun `` keyword so dedup still fires for those. 
+ has_kotlin_fun = lang == "kotlin" and "fun " in s + if not ( + has_type_name or "class " in s or "interface " in s or has_kotlin_fun + ): + return False + elif lang == "rust": + if not any( + k in s for k in ("fn ", "impl ", "struct ", "enum ", "trait ") + ): + return False + elif lang == "go": + if "func " not in s: + return False + elif lang == "php": + # PHP class methods, free functions, classes, interfaces, + # traits, enums. ``function`` covers every visibility / + # modifier prefix (public, private, protected, static, + # abstract, final, readonly) — they all precede ``function``. + # + # A/B-tested on two Laravel codebases across four providers + # (MiniLM / OpenAI 3-large / Gemini-2-preview / Qwen3-8B). + # Enabling PHP dedup produced +0.110 net MRR@3 across the 8 + # cells; gains concentrate on the top-tier cloud models + # (Gemini-2 +0.083 and +0.050, OpenAI 3-large +0.111 on one + # repo). MiniLM regressed -0.03 to -0.05 because its 256- + # token window can't afford to lose the signature's + # redundant type tokens — those users can revert via + # ``CRG_EMBED_INCLUDE_BODY=0``. First-try rollout was the + # conservative fallback based on a MiniLM-only bench which + # missed the SOTA upside; the SOTA data makes PHP worth + # first-class support. + if not any( + k in s for k in ( + "function ", "class ", "interface ", "trait ", "enum ", + ) + ): + return False + else: + # Conservative fallback for languages we haven't explicitly + # tested — keep the signature line in the body snippet, + # wasting a few dozen chars of budget but avoiding the risk + # of dropping type info the body doesn't repeat. + return False + + if node.params: + pstripped = node.params.strip("()").strip() + if pstripped: + first = pstripped.split(",", 1)[0].split(":", 1)[0].strip() + # tokens like `*args`, `**kwargs`, `&self`, `self` are safe + # to skip param gate on (always match): bail only when we have + # a real identifier we can check. 
+ bare = first.lstrip("*&") + if bare and bare != "self" and bare not in s: + return False + return True + + +def _extract_body_text( + node: GraphNode, + reader: _FileLineCache, + *, + max_chars: int, + max_lines: int = _BODY_MAX_LINES_DEFAULT, +) -> str: + """Return a truncated body snippet for semantic embedding. + + Pipeline: + 1. Skip ``File`` nodes and bogus line ranges. + 2. Read the node's ``[line_start, line_end]`` range via ``reader``. + 3. Drop line 0 if it looks like the node's signature (dedup with metadata). + 4. Run a block-comment + triple-quote state machine to skip leading + docstrings and header comments; leading single-line comments + (``#`` ``//`` ``--`` ``///`` ``*``) are dropped regardless of length. + 5. Collapse consecutive blank lines. + 6. Truncate by ``max_lines`` then by ``max_chars`` (prefer word boundary, + append ``' …'`` suffix when cut). + """ + if node.kind == "File": + return "" + if node.line_start is None or node.line_end is None: + return "" + if node.line_start <= 0 or node.line_end < node.line_start: + return "" + + lines = reader.get_lines(node.file_path, node.line_start, node.line_end) + if not lines: + return "" + + # Step 3: signature dedup + if _looks_like_signature(lines[0], node): + lines = lines[1:] + if not lines: + return "" + + # Step 4: block-comment + triple-quote state machine + kept: list[str] = [] + in_block_comment = False + in_triple_quote: str | None = None + drops = 0 + for line in lines: + s = line.strip() + + # triple-quote state (Python docstring continuation) + if in_triple_quote is not None: + drops += 1 + if in_triple_quote in s: + in_triple_quote = None + continue + + # block-comment state (C/Java/JS /* ... 
*/) + if in_block_comment: + drops += 1 + if "*/" in s: + in_block_comment = False + continue + + # leading-only triple-quote open + if not kept: + matched_triple = False + for q in ('"""', "'''"): + if s.startswith(q): + # same-line open+close: """one-liner.""" + if len(s) >= len(q) * 2 and s.endswith(q): + drops += 1 + matched_triple = True + break + # multi-line open: enter state (only if no closer on this line) + if q not in s[len(q):]: + in_triple_quote = q + drops += 1 + matched_triple = True + break + if matched_triple: + continue + + # block-comment open: /* or /** on its own line + if s.startswith("/*") or s.startswith("/**"): + if "*/" not in s: + # multi-line opener + in_block_comment = True + drops += 1 + continue + # single-line inline block comment: drop only if the line has no + # trailing code after the closing ``*/``. Lines like + # ``/* x */ real_code = 1`` keep flowing into the real-content + # branch (handled by the leading-``*`` case below + kept). + after_close = s.split("*/", 1)[1].strip() + if not kept and not after_close: + drops += 1 + continue + + # leading single-line comments: drop regardless of length (D-iter3-4) + if not kept and s.startswith(("#", "//", "--", "///", "*")): + drops += 1 + continue + + # blank line before any real code + if not kept and s == "": + drops += 1 + continue + + # Pure log/print lines: drop anywhere in the body. Framework- + # heavy code (Discord / Rails / Flask / Spring) otherwise burns + # the budget on Discord.defer() / console.log / logger.info + # that carry no retrieval signal. Keep lines where logging is + # mixed with other statements. 
+ if _is_log_only_line(s): + drops += 1 + continue + + kept.append(line) + + # fallback safety cap + if drops >= 20 and not kept: + break + + if not kept: + return "" + + # Step 5: collapse blank-line runs + collapsed: list[str] = [] + prev_blank = False + for ln in kept: + is_blank = ln.strip() == "" + if is_blank and prev_blank: + continue + collapsed.append(ln) + prev_blank = is_blank + + # Step 6: truncate + collapsed = collapsed[:max_lines] + normalised = [ln.replace("\t", " ").rstrip() for ln in collapsed] + text = " \n ".join(normalised) + if len(text) > max_chars: + cut = text[:max_chars].rsplit(" ", 1) + text = (cut[0] if len(cut) == 2 and cut[0] else text[:max_chars]) + " …" + return text + + +def _split_combined_hash(stored: str) -> tuple[str, str, str]: + """Parse the combined hash stored in ``embeddings.text_hash``. + + Format evolution (all three forms round-trip through this helper): + + * **Iter 1 legacy**: single sha256 hex, no ``':'`` + → ``(stored, '', '')`` + * **Iter 2 (body flag only)**: ``'meta_hash:body_hash'`` + → ``(meta, body, '')`` + * **Iter 3.2 (body + file fingerprint)**: ``'meta:body:file_hash'`` + → ``(meta, body, file_hash)`` + + The third slot lets ``embed_nodes`` detect a body change without + reading the file again: when metadata is unchanged but the node's + file-level hash has shifted, the body snippet may have changed + (Codex-review finding: metadata-only stage-1 filter would otherwise + leave edits like ``return x * 2`` → ``return x * 3`` permanently + stale). Rows older than Iter 3.2 report an empty file-hash slot, + which callers treat as "unknown — must re-verify". 
+ """ + parts = stored.split(":", 2) + if len(parts) == 1: + return parts[0], "", "" + if len(parts) == 2: + return parts[0], parts[1], "" + return parts[0], parts[1], parts[2] + + +def _node_to_text( + node: GraphNode, + reader: _FileLineCache | None = None, + *, + max_chars: int | None = None, +) -> str: + """Build searchable text = metadata + optional body snippet. + + Backward-compatible: callers passing ``reader=None`` get the pre-Iter-2 + metadata-only string exactly as before. Legacy tests that construct a + node and call ``_node_to_text(node)`` continue to pass unchanged. + + When ``reader`` is provided, ``max_chars`` is required (raises + ``ValueError`` otherwise — ``assert`` would be stripped under + ``python -O``). Caller is responsible for gating via ``_body_enabled`` + before passing a reader; this function does not re-check. + """ parts = [node.name] if node.kind != "File": parts.append(node.kind.lower()) @@ -701,9 +1241,53 @@ def _node_to_text(node: GraphNode) -> str: parts.append(f"returns {node.return_type}") if node.language: parts.append(node.language) + + if reader is not None: + if max_chars is None: + raise ValueError("max_chars required when reader provided") + body = _extract_body_text(node, reader, max_chars=max_chars) + if body: + parts.append("body:") + parts.append(body) return " ".join(parts) +def _body_enabled(store: "EmbeddingStore") -> bool: + """ADR-C3 sticky-flag gate for body enrichment (Iter 3 D-iter3-1). + + Decision priority (first match wins; ALL paths write back the flag to + ``embeddings_meta.embed_body_enabled`` so subsequent runs honour it): + + 1. ``CRG_EMBED_INCLUDE_BODY=0`` -> False (write ``"0"``) + 2. ``CRG_EMBED_INCLUDE_BODY=1`` -> True (write ``"1"``) + 3. Existing flag -> honour stored value + 4. No flag + ``store.count() == 0`` -> True (auto-on for fresh DB) + 5. 
No flag + ``store.count() > 0`` -> False (legacy DB opts out by default) + + The CLI ``--include-body --confirm-reembed`` combo is implemented by + setting ``CRG_EMBED_INCLUDE_BODY=1`` before invoking embed, so case (2) + fires and flips ``"0"`` -> ``"1"`` on legacy DBs. + """ + env = os.environ.get("CRG_EMBED_INCLUDE_BODY") + if env == "0": + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "0") + return False + if env == "1": + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "1") + return True + stored = store._get_meta(_EMBED_META_KEY_BODY_ENABLED) + if stored == "1": + return True + if stored == "0": + return False + # Flag absent: decide based on DB emptiness and persist the choice. + if store.count() == 0: + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "1") + return True + store._set_meta(_EMBED_META_KEY_BODY_ENABLED, "0") + return False + + class EmbeddingStore: """Manages vector embeddings for graph nodes in SQLite.""" @@ -743,31 +1327,168 @@ def __exit__(self, exc_type, exc_val, exc_tb) -> None: # type: ignore[no-untype def close(self) -> None: self._conn.close() + def _repo_root_guess(self) -> Path | None: + """Best-effort repo-root inference for the ``_FileLineCache``. + + Priority (Codex rounds 7 + 8 — ``CRG_DATA_DIR`` lets users keep + ``graph.db`` outside the repo, so neither ``db_path.parent`` nor + even the ``.code-review-graph`` directory name is a reliable + proxy for the repo root in that case; an override like + ``CRG_DATA_DIR=/var/tmp/.code-review-graph`` would otherwise + hand us ``/var/tmp`` as the repo root): + + 1. When ``CRG_DATA_DIR`` is NOT set, trust the zero-config + default layout: if ``db_path.parent`` is literally named + ``.code-review-graph`` then the repo root is its parent. + This keeps the common case cheap (no env reads, no dir + walk). + 2. 
Otherwise — ``CRG_DATA_DIR`` is set, or the DB lives + somewhere non-standard — delegate to + :func:`incremental.find_project_root`, which honors + ``CRG_REPO_ROOT`` first and otherwise walks up from cwd + for a ``.git`` (or SVN) marker. + 3. Returns ``None`` only when both paths fail to resolve; + callers then treat absolute ``file_path`` values on + ``GraphNode`` as self-locating. + """ + data_dir_override = os.environ.get("CRG_DATA_DIR", "").strip() + if not data_dir_override: + try: + parent = self.db_path.resolve().parent + if parent.name == ".code-review-graph": + return parent.parent + except OSError: + pass + try: + from .incremental import find_project_root + root = find_project_root() + return root.resolve() if root else None + except Exception: + return None + + # --- embeddings_meta (Iter 3 D-iter3-1 sticky flag storage) --- + + def _get_meta(self, key: str) -> str | None: + row = self._conn.execute( + "SELECT value FROM embeddings_meta WHERE key = ?", (key,) + ).fetchone() + return row["value"] if row else None + + def _set_meta(self, key: str, value: str) -> None: + self._conn.execute( + "INSERT INTO embeddings_meta (key, value) VALUES (?, ?) " + "ON CONFLICT(key) DO UPDATE SET value = excluded.value", + (key, value), + ) + self._conn.commit() + def embed_nodes(self, nodes: list[GraphNode], batch_size: int = 64) -> int: - """Compute and store embeddings for a list of nodes.""" + """Compute and store embeddings for a list of nodes. + + Two-stage filter (Iter 3 Directive 5): + + * **Stage 1 (zero IO)** — compute ``metadata_hash`` from the + metadata-only ``_node_to_text`` form and compare with the stored + hash's metadata half. If it matches, skip without reading any + file — but only when body enrichment is disabled, or the stored + file fingerprint still matches the node's current one. 
+ * **Stage 2 (with IO)** — only for nodes that pass stage 1's + escape-hatch branches (metadata changed, or body was just + enabled on a legacy row): read the file through the + ``_FileLineCache``, compute ``body_hash``, and store the + combined ``"{metadata_hash}:{body_hash}"`` in ``text_hash``. + """ if not self.provider: return 0 - # Filter to nodes that need embedding - to_embed: list[tuple[GraphNode, str, str]] = [] provider_name = self.provider.name + body_on = _body_enabled(self) + max_chars = _resolve_body_max_chars(provider_name) if body_on else 0 + reader = _FileLineCache(repo_root=self._repo_root_guess()) if body_on else None + + to_embed: list[tuple[GraphNode, str, str]] = [] + + # Sort by file so the LRU cache gets consecutive hits within a file + sorted_nodes = sorted( + nodes, + key=lambda n: (n.file_path or "", n.line_start or 0), + ) - for node in nodes: + for node in sorted_nodes: if node.kind == "File": continue - text = _node_to_text(node) - text_hash = hashlib.sha256(text.encode()).hexdigest() + + # --- Stage 1: metadata-only hash, zero IO --- + meta_text = _node_to_text(node, reader=None) + meta_hash = hashlib.sha256(meta_text.encode()).hexdigest() existing = self._conn.execute( "SELECT text_hash, provider FROM embeddings WHERE qualified_name = ?", (node.qualified_name,), ).fetchone() - # Re-embed if text changed OR provider changed - if (existing and existing["text_hash"] == text_hash - and existing["provider"] == provider_name): + node_file_hash = node.file_hash or "" + if existing and existing["provider"] == provider_name: + stored_meta, stored_body, stored_file_hash = _split_combined_hash( + existing["text_hash"] + ) + if stored_meta == meta_hash: + # Metadata identical. Skip when we can PROVE the body + # snippet is still fresh too. 
We require *both* sides + # to carry a non-empty file fingerprint that matches; + # an empty fingerprint on either side is not proof of + # freshness — it just means we don't know — so we + # fall through to stage 2 and reread the file. This + # is the conservative path Codex round 3 flagged: + # metadata-only rows with absent ``node.file_hash`` + # would otherwise let body-only edits slip through. + if not body_on: + continue + if (stored_body != "" + and stored_file_hash != "" + and node_file_hash != "" + and stored_file_hash == node_file_hash): + continue + # else: fall through to stage 2 to refresh body_hash + + # --- Stage 2: body-enriched hash (reads file only when needed) --- + if body_on and reader is not None: + full_text = _node_to_text(node, reader=reader, max_chars=max_chars) + body_part = full_text[len(meta_text):] + if reader.last_read_ok: + # Always hash the body slot when body enrichment is + # on and the read succeeded, even when ``body_part`` + # is empty (single-line function whose only line was + # the signature, or a body entirely stripped as + # docstring/header). Writing ``sha256("")`` here + # lets stage 1 distinguish "body ran and produced + # nothing" from "body never ran" (legacy / body-off + # rows), so empty-body nodes short-circuit too. + body_hash = hashlib.sha256(body_part.encode()).hexdigest() + combined_hash = f"{meta_hash}:{body_hash}:{node_file_hash}" + final_text = full_text + else: + # File read failed (missing / permission / binary / + # oversized / outside repo). Do NOT persist the + # empty-body sentinel — that would freeze the node + # on a transient failure (Codex round 5). Fall back + # to the body-off stored form so the next run + # retries. We still embed the metadata-only text + # now, since we need *some* vector for this node. 
+ combined_hash = f"{meta_hash}::{node_file_hash}" + final_text = meta_text + else: + combined_hash = f"{meta_hash}::{node_file_hash}" + final_text = meta_text + + if (existing and existing["provider"] == provider_name + and existing["text_hash"] == combined_hash): continue - to_embed.append((node, text, text_hash)) + + to_embed.append((node, final_text, combined_hash)) + + if reader is not None: + reader.clear() if not to_embed: return 0 diff --git a/tests/fixtures/embeddings/eval/golden_queries.json b/tests/fixtures/embeddings/eval/golden_queries.json new file mode 100644 index 00000000..cfec44ad --- /dev/null +++ b/tests/fixtures/embeddings/eval/golden_queries.json @@ -0,0 +1,101 @@ +[ + { + "query": "parse source code tree into nodes", + "expected_top3": [ + "parser.py::parse_tree", + "parser.py::walk_nodes", + "parser.py::extract_functions" + ], + "type": "signature-searchable", + "notes": "node names contain parse/tree/nodes" + }, + { + "query": "split string into tokens", + "expected_top3": [ + "parser.py::tokenize", + "parser.py::parse_tree", + "parser.py::build_root" + ], + "type": "signature-searchable", + "notes": "tokenize name is the obvious hit" + }, + { + "query": "cosine similarity between embedding vectors", + "expected_top3": [ + "embeddings.py::cosine_similarity", + "embeddings.py::EmbeddingIndex.search", + "embeddings.py::embed_text" + ], + "type": "signature-searchable", + "notes": "cosine_similarity name is explicit" + }, + { + "query": "escape HTML entities including quotes and backticks", + "expected_top3": [ + "visualization.py::esc_h", + "visualization.py::render_html" + ], + "type": "body-dependent", + "notes": "esc_h name is cryptic; only body shows replace('&', '&amp;') / replace('\"', '&quot;') / replace('`', '&#96;')" + }, + { + "query": "breadth first search with visited set and queue", + "expected_top3": [ + "graph.py::Graph.impact_radius", + "graph.py::Graph.neighbors", + "graph.py::Graph.add_edge" + ], + "type": "body-dependent", + "notes": 
"impact_radius name does not say BFS; body has visited=set(), deque(), while queue" + }, + { + "query": "read OPENAI_API_KEY from the environment", + "expected_top3": [ + "api_client.py::OpenAIClient.__init__", + "api_client.py::GeminiClient.__init__", + "api_client.py::MinimaxClient.__init__" + ], + "type": "body-dependent", + "notes": "all three __init__ share the same signature; only body contains the env var name" + }, + { + "query": "read GEMINI_API_KEY from the environment", + "expected_top3": [ + "api_client.py::GeminiClient.__init__", + "api_client.py::OpenAIClient.__init__", + "api_client.py::MinimaxClient.__init__" + ], + "type": "body-dependent", + "notes": "requires body to distinguish the three __init__ siblings" + }, + { + "query": "render graph nodes as html list", + "expected_top3": [ + "visualization.py::render_html", + "visualization.py::esc_h", + "graph.py::Graph.neighbors" + ], + "type": "mixed", + "notes": "render_html name has semantic signal; esc_h relies on body" + }, + { + "query": "detect git file changes since a ref", + "expected_top3": [ + "incremental.py::detect_changes", + "incremental.py::get_changed_files", + "graph.py::Graph.add_edge" + ], + "type": "mixed", + "notes": "detect_changes is signature-searchable; get_changed_files needs body-ish context" + }, + { + "query": "store embedding vector by qualified name", + "expected_top3": [ + "embeddings.py::EmbeddingIndex.add", + "embeddings.py::EmbeddingIndex.search", + "embeddings.py::EmbeddingIndex.__init__" + ], + "type": "signature-searchable", + "notes": "EmbeddingIndex.add name matches" + } +] diff --git a/tests/fixtures/embeddings/eval/mini_repo/api_client.py b/tests/fixtures/embeddings/eval/mini_repo/api_client.py new file mode 100644 index 00000000..de194c95 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/api_client.py @@ -0,0 +1,24 @@ +"""Mini cloud API clients for MRR eval fixture — identical signatures, body differs.""" + +import os + + +class OpenAIClient: + def 
__init__(self): + self.api_key = os.environ.get("OPENAI_API_KEY", "") + self.base_url = "https://api.openai.com/v1" + self.default_model = "text-embedding-3-small" + + +class GeminiClient: + def __init__(self): + self.api_key = os.environ.get("GEMINI_API_KEY", "") + self.base_url = "https://generativelanguage.googleapis.com/v1beta" + self.default_model = "gemini-embedding-001" + + +class MinimaxClient: + def __init__(self): + self.api_key = os.environ.get("MINIMAX_API_KEY", "") + self.base_url = "https://api.minimax.io/v1" + self.default_model = "embo-01" diff --git a/tests/fixtures/embeddings/eval/mini_repo/embeddings.py b/tests/fixtures/embeddings/eval/mini_repo/embeddings.py new file mode 100644 index 00000000..b5d30502 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/embeddings.py @@ -0,0 +1,32 @@ +"""Mini embedding index for MRR eval fixture.""" + + +def cosine_similarity(a, b): + if len(a) != len(b): + return 0.0 + dot = sum(x * y for x, y in zip(a, b)) + norm_a = sum(x * x for x in a) ** 0.5 + norm_b = sum(x * x for x in b) ** 0.5 + if norm_a == 0 or norm_b == 0: + return 0.0 + return dot / (norm_a * norm_b) + + +def embed_text(text): + tokens = text.lower().split() + return [float(len(t)) for t in tokens] + + +class EmbeddingIndex: + def __init__(self): + self.store = {} + + def add(self, qname, vec): + self.store[qname] = vec + + def search(self, query_vec, limit=10): + scored = [] + for qn, vec in self.store.items(): + scored.append((qn, cosine_similarity(query_vec, vec))) + scored.sort(key=lambda x: x[1], reverse=True) + return scored[:limit] diff --git a/tests/fixtures/embeddings/eval/mini_repo/graph.py b/tests/fixtures/embeddings/eval/mini_repo/graph.py new file mode 100644 index 00000000..8103ae11 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/graph.py @@ -0,0 +1,33 @@ +"""Mini graph store for MRR eval fixture.""" + +from collections import deque + + +class Graph: + def __init__(self): + self.nodes = {} + self.edges = {} + + 
def add_node(self, name): + self.nodes[name] = name + self.edges.setdefault(name, []) + + def add_edge(self, src, dst): + self.edges.setdefault(src, []).append(dst) + + def neighbors(self, name): + return list(self.edges.get(name, [])) + + def impact_radius(self, start, depth): + visited = set() + queue = deque([(start, 0)]) + result = [] + while queue: + node, d = queue.popleft() + if node in visited or d > depth: + continue + visited.add(node) + result.append(node) + for nbr in self.edges.get(node, []): + queue.append((nbr, d + 1)) + return result diff --git a/tests/fixtures/embeddings/eval/mini_repo/incremental.py b/tests/fixtures/embeddings/eval/mini_repo/incremental.py new file mode 100644 index 00000000..e9c905b2 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/incremental.py @@ -0,0 +1,16 @@ +"""Mini git change detection for MRR eval fixture.""" + +import subprocess + + +def detect_changes(repo_root, since_ref): + """Return list of files changed in repo since the given git ref.""" + out = subprocess.run( + ["git", "-C", repo_root, "diff", "--name-only", since_ref, "HEAD"], + capture_output=True, text=True, check=False, + ) + return [line for line in out.stdout.splitlines() if line.strip()] + + +def get_changed_files(repo_root): + return detect_changes(repo_root, "HEAD~1") diff --git a/tests/fixtures/embeddings/eval/mini_repo/parser.py b/tests/fixtures/embeddings/eval/mini_repo/parser.py new file mode 100644 index 00000000..6874d7d0 --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/parser.py @@ -0,0 +1,33 @@ +"""Mini AST parser for MRR eval fixture.""" + + +def parse_tree(source): + """Parse source text and return an AST root.""" + tokens = tokenize(source) + root = build_root(tokens) + return root + + +def walk_nodes(tree, visitor): + stack = [tree] + while stack: + node = stack.pop() + visitor(node) + stack.extend(node.children) + + +def extract_functions(tree): + found = [] + walk_nodes(tree, lambda n: found.append(n) if n.kind == 
"function" else None) + return found + + +def tokenize(source): + return source.split() + + +def build_root(tokens): + class N: + kind = "root" + children = [] + return N() diff --git a/tests/fixtures/embeddings/eval/mini_repo/visualization.py b/tests/fixtures/embeddings/eval/mini_repo/visualization.py new file mode 100644 index 00000000..5e131ccd --- /dev/null +++ b/tests/fixtures/embeddings/eval/mini_repo/visualization.py @@ -0,0 +1,18 @@ +"""Mini HTML rendering for MRR eval fixture.""" + + +def esc_h(s): + """Replace dangerous characters so the string is safe inside HTML.""" + out = s.replace("&", "&amp;") + out = out.replace("<", "&lt;").replace(">", "&gt;") + out = out.replace('"', "&quot;") + out = out.replace("`", "&#96;") + return out + + +def render_html(graph): + buf = ["<ul>"] + for name in graph.nodes: + buf.append("<li>" + esc_h(name) + "</li>") + buf.append("</ul>
") + return "".join(buf) diff --git a/tests/test_embeddings.py b/tests/test_embeddings.py index 2f5cc4b0..46272aa3 100644 --- a/tests/test_embeddings.py +++ b/tests/test_embeddings.py @@ -8,16 +8,25 @@ import pytest from code_review_graph.embeddings import ( + _BODY_MAX_CHARS_BY_PROVIDER, + _BODY_MAX_CHARS_FALLBACK, + _EMBED_META_KEY_BODY_ENABLED, LOCAL_DEFAULT_MODEL, EmbeddingStore, LocalEmbeddingProvider, MiniMaxEmbeddingProvider, OpenAIEmbeddingProvider, + _body_enabled, _cosine_similarity, _decode_vector, _encode_vector, + _extract_body_text, + _FileLineCache, _is_localhost_url, + _looks_like_signature, _node_to_text, + _resolve_body_max_chars, + _split_combined_hash, get_provider, ) from code_review_graph.graph import GraphNode @@ -1034,3 +1043,1079 @@ def test_subdomain_spoof_triggers_warning(self, capsys): get_provider("openai") captured = capsys.readouterr() assert "cloud" in captured.err.lower() + + +# --------------------------------------------------------------------------- +# Body enrichment tests (Iter 3 final: ADR-A1+dedup / B4 per-provider / C3 sticky) +# --------------------------------------------------------------------------- + + +def _mk_node(**kwargs): + """Minimal GraphNode factory for body-enrichment tests.""" + defaults = dict( + id=1, kind="Function", name="my_func", + qualified_name="sample.py::my_func", file_path="sample.py", + line_start=1, line_end=5, language="python", + parent_name=None, params=None, return_type=None, + is_test=False, file_hash=None, extra={}, + ) + defaults.update(kwargs) + return GraphNode(**defaults) + + +class TestFileLineCache: + def test_reads_existing_file(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("line1\nline2\nline3\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("a.py", 1, 3) == ["line1", "line2", "line3"] + assert cache.get_lines("a.py", 2, 2) == ["line2"] + + def test_caches_across_calls(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("x\ny\nz\n") + cache = 
_FileLineCache(repo_root=tmp_path) + cache.get_lines("a.py", 1, 2) + cache.get_lines("a.py", 2, 3) + cache.get_lines("a.py", 1, 1) + assert cache._read_count == 1 + + def test_missing_file_returns_empty(self, tmp_path): + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("does_not_exist.py", 1, 5) == [] + + def test_binary_file_detected_returns_empty(self, tmp_path): + p = tmp_path / "b.bin" + p.write_bytes(b"abc\x00def\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("b.bin", 1, 3) == [] + + def test_out_of_range_lines_returns_empty(self, tmp_path): + p = tmp_path / "a.py" + p.write_text("only-one\n") + cache = _FileLineCache(repo_root=tmp_path) + assert cache.get_lines("a.py", 999, 1000) == [] + assert cache.get_lines("a.py", 0, 5) == [] + assert cache.get_lines("a.py", 3, 1) == [] + + def test_lru_eviction_at_max_entries(self, tmp_path): + cache = _FileLineCache(repo_root=tmp_path, max_entries=4) + for i in range(6): + p = tmp_path / f"f{i}.py" + p.write_text(f"content {i}\n") + cache.get_lines(f"f{i}.py", 1, 1) + # Earliest two entries should have been evicted + assert "f0.py" not in cache._cache + assert "f1.py" not in cache._cache + assert "f5.py" in cache._cache + assert len(cache._cache) == 4 + + +class TestSignatureDedup: + @pytest.mark.parametrize("lang,line,name,params", [ + ("python", "def my_func(x, y):", "my_func", "(x, y)"), + ("javascript", "function handle(req, res) {", "handle", "(req, res)"), + ("typescript", "const handle = (req: Req) => {", "handle", "(req: Req)"), + ("java", "public int compute(int x) {", "compute", "(int x)"), + ("rust", "pub fn build_tree(root: Node) -> Tree {", + "build_tree", "(root: Node)"), + ]) + def test_signature_line_deduplication(self, lang, line, name, params): + node = _mk_node(language=lang, name=name, params=params, + return_type="int" if lang == "java" else None) + assert _looks_like_signature(line, node) is True + + def 
test_signature_dedup_preserves_body_when_no_match(self): + # Decoy signature-like text but node.name is different -> must NOT dedup + node = _mk_node(language="python", name="compute_totals", + params="(items)") + assert _looks_like_signature("def unrelated(x): pass", node) is False + + def test_signature_dedup_respects_params_mismatch(self): + # Overload with different param first-token + node = _mk_node(language="python", name="handle", params="(request)") + assert _looks_like_signature( + "def handle(completely_different_arg):", node, + ) is False + + def test_signature_dedup_unknown_language_conservative(self): + node = _mk_node(language="kotlin2", name="foo", params="()") + assert _looks_like_signature("fun foo() {}", node) is False + + def test_signature_dedup_php(self): + """PHP classes, methods, interfaces, traits, enums. + + PHP dedup was SOTA A/B tested on two Laravel codebases + (usdt-center + six-forum-api) across MiniLM / OpenAI 3-large / + Gemini-2-preview / Qwen3-8B. Net MRR@3 change across the 8 + cells was +0.110, gains concentrated on Gemini-2 (+0.083 and + +0.050) and OpenAI 3-large (+0.111 on one repo). MiniLM + regressed slightly (-0.03 to -0.05) and those users can revert + via ``CRG_EMBED_INCLUDE_BODY=0``. See ``_looks_like_signature`` + docstring for the full rationale. 
+ """ + node = _mk_node( + language="php", name="store", params="(StoreRequest $request)", + parent_name="AdminController", + ) + # Common modifier prefixes + assert _looks_like_signature( + "public function store(StoreRequest $request): JsonResponse", node, + ) is True + assert _looks_like_signature( + "private static function store(StoreRequest $request)", node, + ) is True + # Class / interface / trait / enum + cls_node = _mk_node( + language="php", name="AdminController", params=None, + ) + assert _looks_like_signature( + "class AdminController extends BaseController {", cls_node, + ) is True + iface_node = _mk_node(language="php", name="Repository", params=None) + assert _looks_like_signature( + "interface Repository {", iface_node, + ) is True + trait_node = _mk_node( + language="php", name="HasTimestamps", params=None, + ) + assert _looks_like_signature( + "trait HasTimestamps {", trait_node, + ) is True + enum_node = _mk_node( + language="php", name="OrderStatus", params=None, + ) + assert _looks_like_signature( + "enum OrderStatus: string {", enum_node, + ) is True + # ``case PENDING = 'pending';`` is an enum VALUE, not a + # declaration — must NOT match (``case`` isn't in the keyword + # allowlist). + assert _looks_like_signature( + "case PENDING = 'pending';", enum_node, + ) is False + # Body line that happens to contain the name but not a decl + # keyword → must NOT be classified as signature. + assert _looks_like_signature( + "$admin = Admin::where('id', $id)->first();", node, + ) is False + # Overload-style param mismatch must not dedup either. 
+        assert _looks_like_signature(
+            "public function store(DifferentType $arg): void", node,
+        ) is False
+
+    def test_signature_dedup_go_and_typescript(self):
+        go_node = _mk_node(language="go", name="Handle", params="(w, r)")
+        assert _looks_like_signature(
+            "func Handle(w http.ResponseWriter, r *http.Request) {", go_node,
+        ) is True
+        ts_node = _mk_node(language="typescript", name="doThing", params="(x)")
+        assert _looks_like_signature("async function doThing(x) {", ts_node) is True
+
+    def test_decorator_line_not_mistaken_for_signature(self):
+        # Iter 3 D-iter3-5: @decorator line has node.name false-positive risk
+        # because tree-sitter may include the decorator in the node range.
+        # Three-AND gate requires a declaration keyword; @my_decorator has none.
+        node = _mk_node(language="python", name="my_func", params="(x)")
+        assert _looks_like_signature("@my_decorator", node) is False
+        # Also the real def line DOES match.
+        assert _looks_like_signature("def my_func(x):", node) is True
+
+    def test_recursive_call_first_body_line_preserved(self):
+        # Iter 3 D-iter3-5: body first line `return fact(n-1)` contains
+        # node.name but lacks `def ` keyword, so three-AND gate returns False
+        # and dedup does NOT drop the line.
+        node = _mk_node(language="python", name="fact", params="(n)")
+        assert _looks_like_signature("return n * fact(n - 1)", node) is False
+
+
+class TestBlockCommentStateMachine:
+    def _extract(self, fixture_text: str, tmp_path, **node_kwargs) -> str:
+        p = tmp_path / "f.py"
+        p.write_text(fixture_text)
+        reader = _FileLineCache(repo_root=tmp_path)
+        node = _mk_node(
+            file_path="f.py",
+            line_start=node_kwargs.get("line_start", 1),
+            line_end=node_kwargs.get("line_end", fixture_text.count("\n") + 1),
+            language=node_kwargs.get("language", "python"),
+            name=node_kwargs.get("name", "never-matches-anything"),
+        )
+        return _extract_body_text(node, reader, max_chars=4000, max_lines=40)
+
+    def test_jsdoc_multiline_header_skipped(self, tmp_path):
+        src = (
+            "/**\n"
+            " * docstring line 1\n"
+            " * docstring line 2\n"
+            " */\n"
+            "return 42;\n"
+        )
+        out = self._extract(src, tmp_path, language="javascript")
+        assert "docstring" not in out
+        assert "return 42" in out
+
+    def test_javadoc_preserves_body(self, tmp_path):
+        src = (
+            "/** header doc */\n"
+            "int answer = 42;\n"
+            "return answer;\n"
+        )
+        out = self._extract(src, tmp_path, language="java")
+        assert "header doc" not in out
+        assert "answer = 42" in out or "return answer" in out
+
+    def test_block_comment_unclosed_falls_back(self, tmp_path):
+        # Unclosed /* -> state machine stays in block_comment, drops everything.
+        # Fallback cap prevents hang, function returns "".
+        src = "/*\n" + ("noise line\n" * 25)
+        out = self._extract(src, tmp_path, language="javascript")
+        assert out == ""
+
+    def test_inline_block_comment_not_swallowed(self, tmp_path):
+        src = "/* x */ real_code = 1\nmore = 2\n"
+        out = self._extract(src, tmp_path, language="javascript")
+        # inline /* x */ with trailing code: should NOT enter block state
+        assert "real_code = 1" in out or "more = 2" in out
+
+    def test_python_single_line_triple_quote_docstring_skipped(self, tmp_path):
+        src = '"""one-liner docstring."""\nreturn value\n'
+        out = self._extract(src, tmp_path, language="python")
+        assert "one-liner docstring" not in out
+        assert "return value" in out
+
+    def test_python_multiline_triple_quote_docstring_skipped(self, tmp_path):
+        src = (
+            '"""\n'
+            " line 1\n"
+            " line 2\n"
+            '"""\n'
+            "compute = 3\n"
+        )
+        out = self._extract(src, tmp_path, language="python")
+        assert "line 1" not in out and "line 2" not in out
+        assert "compute = 3" in out
+
+    def test_comment_prefix_dropped_regardless_of_length(self, tmp_path):
+        long_comment = "# " + ("x" * 200)  # >> 120 chars
+        src = f"{long_comment}\nreal = 1\n"
+        out = self._extract(src, tmp_path, language="python")
+        assert long_comment not in out
+        assert "real = 1" in out
+
+
+class TestBodyExtraction:
+    def _extract(self, src: str, tmp_path, **node_kwargs) -> str:
+        p = tmp_path / "f.py"
+        p.write_text(src)
+        reader = _FileLineCache(repo_root=tmp_path)
+        node = _mk_node(
+            file_path="f.py",
+            line_start=node_kwargs.get("line_start", 1),
+            line_end=node_kwargs.get("line_end", src.count("\n") + 1),
+            language=node_kwargs.get("language", "python"),
+            kind=node_kwargs.get("kind", "Function"),
+            name=node_kwargs.get("name", "whatever-no-sig-match"),
+        )
+        return _extract_body_text(
+            node, reader,
+            max_chars=node_kwargs.get("max_chars", 4000),
+            max_lines=node_kwargs.get("max_lines", 40),
+        )
+
+    def test_python_function_body(self, tmp_path):
+        src = (
+            "def parse_tree(data):\n"
+            "    root = build_root(data)\n"
+            "    walk_all_nodes(root)\n"
+            "    return root\n"
+        )
+        out = self._extract(
+            src, tmp_path, name="parse_tree", line_start=1, line_end=4,
+        )
+        # signature line dropped via dedup; body should contain real code
+        assert "build_root" in out
+        assert "walk_all_nodes" in out
+
+    def test_file_node_returns_empty(self, tmp_path):
+        p = tmp_path / "x.py"
+        p.write_text("print('hi')\n")
+        reader = _FileLineCache(repo_root=tmp_path)
+        node = _mk_node(kind="File", file_path="x.py", line_start=1, line_end=1)
+        assert _extract_body_text(node, reader, max_chars=1000) == ""
+
+    def test_leading_triple_quote_docstring_skipped(self, tmp_path):
+        src = '"""module docstring."""\nvalue = 1\n'
+        out = self._extract(src, tmp_path, name="never")
+        assert "module docstring" not in out
+        assert "value = 1" in out
+
+    def test_comment_only_region_skipped(self, tmp_path):
+        src = "# comment 1\n# comment 2\nactual_code = 42\n"
+        out = self._extract(src, tmp_path, name="never")
+        assert "comment 1" not in out
+        assert "actual_code" in out
+
+    def test_truncation_by_chars(self, tmp_path):
+        src = "value_a = {}\n".format("x" * 100)  # single long line
+        out = self._extract(
+            src, tmp_path, name="never",
+            max_chars=40, line_start=1, line_end=1,
+        )
+        assert out.endswith("…")
+        assert len(out) <= 45  # max_chars + suffix overhead
+
+    def test_truncation_by_lines(self, tmp_path):
+        lines = "\n".join(f"stmt_{i} = {i}" for i in range(30)) + "\n"
+        out = self._extract(
+            src=lines, tmp_path=tmp_path, name="never",
+            max_chars=100000, max_lines=3,
+        )
+        # only 3 stmts kept; stmt_3+ must not appear
+        assert "stmt_0" in out
+        assert "stmt_29" not in out
+
+    def test_single_line_node(self, tmp_path):
+        src = "def tiny(): return 1\n"
+        out = self._extract(
+            src, tmp_path, name="tiny", params="()",
+            line_start=1, line_end=1,
+        )
+        # signature dedup drops the only line; body should be empty
+        assert out == ""
+
+
+class TestBodyMaxCharsPerProvider:
+    @pytest.mark.parametrize("provider_name,expected", [
+        ("local:all-MiniLM-L6-v2", _BODY_MAX_CHARS_BY_PROVIDER["local"]),
+        ("google:gemini-embedding-001", _BODY_MAX_CHARS_BY_PROVIDER["google"]),
+        ("minimax:embo-01", _BODY_MAX_CHARS_BY_PROVIDER["minimax"]),
+        ("openai:text-embedding-3-small", _BODY_MAX_CHARS_BY_PROVIDER["openai"]),
+    ])
+    def test_body_max_chars_per_provider(self, provider_name, expected):
+        with patch.dict(os.environ, {}, clear=False):
+            os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None)
+            assert _resolve_body_max_chars(provider_name) == expected
+
+    def test_body_max_chars_env_override(self):
+        with patch.dict(os.environ, {"CRG_EMBED_BODY_MAX_CHARS": "1234"}):
+            assert _resolve_body_max_chars("local:anything") == 1234
+            assert _resolve_body_max_chars("openai:foo") == 1234
+
+    def test_body_max_chars_unknown_provider_fallback(self):
+        with patch.dict(os.environ, {}, clear=False):
+            os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None)
+            assert _resolve_body_max_chars("exotic:brand-new") == \
+                _BODY_MAX_CHARS_FALLBACK
+
+    def test_body_max_chars_provider_prefix_parsing(self):
+        with patch.dict(os.environ, {}, clear=False):
+            os.environ.pop("CRG_EMBED_BODY_MAX_CHARS", None)
+            assert _resolve_body_max_chars("openai:text-embedding-v4") == \
+                _BODY_MAX_CHARS_BY_PROVIDER["openai"]
+            assert _resolve_body_max_chars("OPENAI:foo") == \
+                _BODY_MAX_CHARS_BY_PROVIDER["openai"]
+
+
+class TestBodyEnabledStoreGate:
+    """ADR-C3 sticky flag (Iter 3 D-iter3-1 / D-iter3-2)."""
+
+    def _fresh_store(self, tmp_path):
+        with patch("code_review_graph.embeddings.get_provider", return_value=None):
+            return EmbeddingStore(tmp_path / "e.db")
+
+    def _scrub_env(self):
+        for k in ("CRG_EMBED_INCLUDE_BODY",):
+            os.environ.pop(k, None)
+
+    def test_body_enabled_empty_db_auto_enables(self, tmp_path):
+        self._scrub_env()
+        store = self._fresh_store(tmp_path)
+        try:
+            assert _body_enabled(store) is True
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1"
+        finally:
+            store.close()
+
+    def test_body_enabled_nonempty_db_requires_opt_in(self, tmp_path):
+        self._scrub_env()
+        store = self._fresh_store(tmp_path)
+        try:
+            # Simulate an existing embedding row without flag being set yet.
+            store._conn.execute(
+                "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) "
+                "VALUES (?, ?, ?, ?)",
+                ("x::y", _encode_vector([0.1, 0.2]), "abc", "unknown"),
+            )
+            store._conn.commit()
+            assert _body_enabled(store) is False
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0"
+        finally:
+            store.close()
+
+    def test_body_enabled_env_on_overrides(self, tmp_path):
+        with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "1"}):
+            store = self._fresh_store(tmp_path)
+            try:
+                store._conn.execute(
+                    "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) "
+                    "VALUES (?, ?, ?, ?)",
+                    ("x::y", _encode_vector([0.1]), "abc", "unknown"),
+                )
+                store._conn.commit()
+                assert _body_enabled(store) is True
+                assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1"
+            finally:
+                store.close()
+
+    def test_body_enabled_env_off_overrides(self, tmp_path):
+        with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "0"}):
+            store = self._fresh_store(tmp_path)
+            try:
+                assert _body_enabled(store) is False
+                assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0"
+            finally:
+                store.close()
+
+    def test_sticky_flag_empty_db_writes_on(self, tmp_path):
+        self._scrub_env()
+        store = self._fresh_store(tmp_path)
+        try:
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) is None
+            _body_enabled(store)
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1"
+        finally:
+            store.close()
+
+    def test_sticky_flag_existing_db_writes_off(self, tmp_path):
+        self._scrub_env()
+        store = self._fresh_store(tmp_path)
+        try:
+            store._conn.execute(
+                "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) "
+                "VALUES (?, ?, ?, ?)",
+                ("x::y", _encode_vector([0.1]), "abc", "unknown"),
+            )
+            store._conn.commit()
+            _body_enabled(store)
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0"
+        finally:
+            store.close()
+
+    def test_sticky_flag_persists_across_runs(self, tmp_path):
+        """Iter 2 mid-run bug regression: once on, stays on after count() > 0."""
+        self._scrub_env()
+        store = self._fresh_store(tmp_path)
+        try:
+            # Run 1: empty DB -> auto-on, flag = "1"
+            assert _body_enabled(store) is True
+            # Simulate embeddings landing (count > 0).
+            store._conn.execute(
+                "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) "
+                "VALUES (?, ?, ?, ?)",
+                ("x::y", _encode_vector([0.1]), "h", "unknown"),
+            )
+            store._conn.commit()
+            assert store.count() > 0
+            # Run 2: no env — MUST honour the "1" flag, NOT fall back to
+            # the count()>0 -> False branch.
+            assert _body_enabled(store) is True
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1"
+        finally:
+            store.close()
+
+    def test_sticky_flag_cli_flip_from_off_to_on(self, tmp_path):
+        self._scrub_env()
+        store = self._fresh_store(tmp_path)
+        try:
+            # Setup: legacy DB with flag="0" (decided at first embed without env)
+            store._conn.execute(
+                "INSERT INTO embeddings (qualified_name, vector, text_hash, provider) "
+                "VALUES (?, ?, ?, ?)",
+                ("x::y", _encode_vector([0.1]), "h", "unknown"),
+            )
+            store._conn.commit()
+            assert _body_enabled(store) is False
+            assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "0"
+            # CLI user runs `embed --include-body --confirm-reembed`
+            # (sets env var before calling _body_enabled).
+            with patch.dict(os.environ, {"CRG_EMBED_INCLUDE_BODY": "1"}):
+                assert _body_enabled(store) is True
+                assert store._get_meta(_EMBED_META_KEY_BODY_ENABLED) == "1"
+            # Next run without env: flag is now "1" -> stays on.
+            self._scrub_env()
+            assert _body_enabled(store) is True
+        finally:
+            store.close()
+
+
+class TestFilterStageShortCircuit:
+    """Directive 5: metadata_hash short-circuit skips file IO for unchanged nodes."""
+
+    def _fake_provider(self, dim=4):
+        mock = MagicMock()
+        mock.name = "local:fake"
+        mock.embed = MagicMock(return_value=[[0.1] * dim, [0.2] * dim, [0.3] * dim])
+        mock.embed_query = MagicMock(return_value=[0.1] * dim)
+        return mock
+
+    def _store_with_fake(self, tmp_path, provider):
+        # Put the DB under ``tmp_path/.code-review-graph/`` so
+        # ``_repo_root_guess`` takes the legacy fast-path and returns
+        # ``tmp_path`` — that's where these tests stage their fixture
+        # source files. Without this layout the fallback walks up from
+        # cwd and returns the real project root, which sits outside
+        # ``tmp_path`` and makes every body read fail containment.
+        crg_dir = tmp_path / ".code-review-graph"
+        crg_dir.mkdir(exist_ok=True)
+        with patch(
+            "code_review_graph.embeddings.get_provider", return_value=provider,
+        ):
+            return EmbeddingStore(crg_dir / "e.db")
+
+    def test_incremental_skips_unchanged_node_without_file_read(self, tmp_path):
+        # Create real source file for node
+        src = tmp_path / "sample.py"
+        src.write_text(
+            "def compute(x):\n"
+            "    a = x * 2\n"
+            "    return a\n",
+        )
+        provider = self._fake_provider()
+        # Force sticky flag on so stage-2 would normally read file
+        os.environ["CRG_EMBED_INCLUDE_BODY"] = "1"
+        try:
+            store = self._store_with_fake(tmp_path, provider)
+            try:
+                # Unchanged node carries a populated file_hash — only then
+                # can stage 1 safely short-circuit (empty fingerprints now
+                # force a re-read; see Codex round 3 regression fix).
+                node = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    file_path=str(src), line_start=1, line_end=3,
+                    qualified_name="sample.py::compute",
+                    file_hash="file-fingerprint-v1",
+                )
+                # First embed — reader WILL be used; returns 1 embedded.
+                assert store.embed_nodes([node]) == 1
+                provider.embed.reset_mock()
+
+                # Now patch reader.get_lines so we can assert it's not hit
+                with patch.object(
+                    _FileLineCache, "get_lines",
+                    autospec=True, return_value=[],
+                ) as mock_get_lines:
+                    # Re-embed same node — metadata + file_hash both match
+                    # the stored combined hash → stage 1 short-circuits
+                    # with no file IO.
+                    assert store.embed_nodes([node]) == 0
+                    assert mock_get_lines.call_count == 0
+                    assert provider.embed.call_count == 0
+            finally:
+                store.close()
+        finally:
+            os.environ.pop("CRG_EMBED_INCLUDE_BODY", None)
+
+    def test_read_failure_does_not_freeze_node(self, tmp_path):
+        """Codex round 5: transient file-read failure (missing file, binary,
+        permission error, out-of-repo) must NOT persist an empty-body
+        sentinel that makes stage 1 skip forever. Once the file becomes
+        readable again the body must be recomputed."""
+        src = tmp_path / "maybe.py"
+        src.write_text("def maybe():\n a = 1\n return a\n")
+        provider = self._fake_provider()
+        os.environ["CRG_EMBED_INCLUDE_BODY"] = "1"
+        try:
+            store = self._store_with_fake(tmp_path, provider)
+            try:
+                node = _mk_node(
+                    language="python", name="maybe", params="()",
+                    file_path=str(src), line_start=1, line_end=3,
+                    qualified_name="maybe.py::maybe",
+                    file_hash="maybe-fingerprint",
+                )
+                # First embed with a patched reader that simulates a
+                # transient read failure (last_read_ok=False).
+                with patch.object(
+                    _FileLineCache, "get_lines",
+                    autospec=True, return_value=[],
+                ) as mock_get_lines:
+                    def fake_fail(self, *a, **kw):
+                        self._last_read_ok = False
+                        return []
+                    mock_get_lines.side_effect = fake_fail
+                    assert store.embed_nodes([node]) == 1
+
+                provider.embed.reset_mock()
+                # Second embed with a real reader — the failed body
+                # snippet should NOT short-circuit stage 1 (stored_body
+                # is ""), so stage 2 reads the file again and persists a
+                # proper body hash.
+                assert store.embed_nodes([node]) == 1
+                assert provider.embed.call_count == 1
+            finally:
+                store.close()
+        finally:
+            os.environ.pop("CRG_EMBED_INCLUDE_BODY", None)
+
+    def test_empty_body_does_not_reread_on_unchanged_run(self, tmp_path):
+        """Codex round 4: single-line functions whose body extracts to ""
+        used to fall through stage 1 on every incremental run. Stage 1
+        must now skip them too (empty body hash is written as sha256('')
+        to distinguish 'body ran, produced nothing' from legacy rows)."""
+        src = tmp_path / "tiny.py"
+        src.write_text("def tiny(): return 1\n")
+        provider = self._fake_provider()
+        os.environ["CRG_EMBED_INCLUDE_BODY"] = "1"
+        try:
+            store = self._store_with_fake(tmp_path, provider)
+            try:
+                node = _mk_node(
+                    language="python", name="tiny", params="()",
+                    file_path=str(src), line_start=1, line_end=1,
+                    qualified_name="tiny.py::tiny",
+                    file_hash="tiny-fingerprint",
+                )
+                assert store.embed_nodes([node]) == 1
+                provider.embed.reset_mock()
+                with patch.object(
+                    _FileLineCache, "get_lines",
+                    autospec=True, return_value=[],
+                ) as mock_get_lines:
+                    assert store.embed_nodes([node]) == 0
+                    assert mock_get_lines.call_count == 0
+                    assert provider.embed.call_count == 0
+            finally:
+                store.close()
+        finally:
+            os.environ.pop("CRG_EMBED_INCLUDE_BODY", None)
+
+    def test_body_edit_with_empty_file_hash_still_reembeds(self, tmp_path):
+        """Codex round 3: when both stored and current file_hash are empty,
+        stage 1 must NOT short-circuit — otherwise body-only edits slip
+        through on callers that don't populate node.file_hash."""
+        src = tmp_path / "sample.py"
+        src.write_text("def compute(x):\n return x * 2\n")
+        provider = self._fake_provider()
+        os.environ["CRG_EMBED_INCLUDE_BODY"] = "1"
+        try:
+            store = self._store_with_fake(tmp_path, provider)
+            try:
+                # file_hash deliberately omitted (None → empty string slot)
+                node_v1 = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    file_path=str(src), line_start=1, line_end=2,
+                    qualified_name="sample.py::compute",
+                )
+                assert store.embed_nodes([node_v1]) == 1
+                provider.embed.reset_mock()
+                src.write_text("def compute(x):\n return x * 3\n")
+                node_v2 = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    file_path=str(src), line_start=1, line_end=2,
+                    qualified_name="sample.py::compute",
+                )
+                # Without a file_hash signal we must re-read and recompute
+                # body_hash; the body text has changed so this is 1 embed.
+                assert store.embed_nodes([node_v2]) == 1
+                assert provider.embed.call_count == 1
+            finally:
+                store.close()
+        finally:
+            os.environ.pop("CRG_EMBED_INCLUDE_BODY", None)
+
+    def test_metadata_change_triggers_stage2_read(self, tmp_path):
+        src = tmp_path / "sample.py"
+        src.write_text("def compute(x):\n return x * 2\n")
+        provider = self._fake_provider()
+        os.environ["CRG_EMBED_INCLUDE_BODY"] = "1"
+        try:
+            store = self._store_with_fake(tmp_path, provider)
+            try:
+                node = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    file_path=str(src), line_start=1, line_end=2,
+                    qualified_name="sample.py::compute",
+                )
+                store.embed_nodes([node])
+                provider.embed.reset_mock()
+                # Flip return_type => metadata_hash changes => stage 2 must run.
+                node_changed = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    return_type="int",
+                    file_path=str(src), line_start=1, line_end=2,
+                    qualified_name="sample.py::compute",
+                )
+                with patch.object(
+                    _FileLineCache, "get_lines", autospec=True,
+                    return_value=["def compute(x):", " return x * 2"],
+                ) as mock_get_lines:
+                    assert store.embed_nodes([node_changed]) == 1
+                    assert mock_get_lines.call_count >= 1
+            finally:
+                store.close()
+        finally:
+            os.environ.pop("CRG_EMBED_INCLUDE_BODY", None)
+
+    def test_split_combined_hash_legacy_row(self):
+        # Iter 1: pure sha256 hex, no separator
+        legacy = "a" * 64
+        assert _split_combined_hash(legacy) == (legacy, "", "")
+        # Iter 2: metadata + body only (no file fingerprint yet)
+        iter2 = "abc:def"
+        assert _split_combined_hash(iter2) == ("abc", "def", "")
+        # Iter 2 body-disabled stored form
+        iter2_body_off = "abc:"
+        assert _split_combined_hash(iter2_body_off) == ("abc", "", "")
+        # Iter 3.2: metadata + body + file fingerprint
+        iter3 = "meta:body:fhash"
+        assert _split_combined_hash(iter3) == ("meta", "body", "fhash")
+        # Iter 3.2 body-disabled but file_hash captured
+        iter3_body_off = "meta::fhash"
+        assert _split_combined_hash(iter3_body_off) == ("meta", "", "fhash")
+
+    def test_body_only_change_triggers_reembed(self, tmp_path):
+        """Codex-review regression: metadata-unchanged body edit used to be
+        skipped forever. Now file_hash shift forces stage-2 re-read."""
+        src = tmp_path / "sample.py"
+        src.write_text("def compute(x):\n return x * 2\n")
+        provider = self._fake_provider()
+        os.environ["CRG_EMBED_INCLUDE_BODY"] = "1"
+        try:
+            store = self._store_with_fake(tmp_path, provider)
+            try:
+                node_v1 = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    file_path=str(src), line_start=1, line_end=2,
+                    qualified_name="sample.py::compute", file_hash="h-v1",
+                )
+                assert store.embed_nodes([node_v1]) == 1
+                provider.embed.reset_mock()
+                # Body edits but metadata identical; file_hash changes.
+                src.write_text("def compute(x):\n return x * 3\n")
+                node_v2 = _mk_node(
+                    language="python", name="compute", params="(x)",
+                    file_path=str(src), line_start=1, line_end=2,
+                    qualified_name="sample.py::compute", file_hash="h-v2",
+                )
+                assert store.embed_nodes([node_v2]) == 1
+                assert provider.embed.call_count == 1
+            finally:
+                store.close()
+        finally:
+            os.environ.pop("CRG_EMBED_INCLUDE_BODY", None)
+
+
+class TestFileLineCacheContainment:
+    """Codex-review regression: file_path must stay inside repo_root."""
+
+    def test_absolute_path_outside_root_rejected(self, tmp_path):
+        outside = tmp_path.parent / ("outside-" + tmp_path.name + ".secret")
+        outside.write_text("top secret\n")
+        try:
+            cache = _FileLineCache(repo_root=tmp_path)
+            assert cache.get_lines(str(outside), 1, 1) == []
+        finally:
+            outside.unlink(missing_ok=True)
+
+    def test_relative_traversal_rejected(self, tmp_path):
+        outside = tmp_path.parent / "escape-target.secret"
+        outside.write_text("top secret\n")
+        try:
+            cache = _FileLineCache(repo_root=tmp_path)
+            # "../escape-target.secret" tries to pop out of the repo
+            assert cache.get_lines(
+                f"../{outside.name}", 1, 1,
+            ) == []
+        finally:
+            outside.unlink(missing_ok=True)
+
+    def test_inside_root_still_reads(self, tmp_path):
+        inside = tmp_path / "hello.py"
+        inside.write_text("x = 1\n")
+        cache = _FileLineCache(repo_root=tmp_path)
+        assert cache.get_lines("hello.py", 1, 1) == ["x = 1"]
+
+    def test_no_root_preserves_absolute_behavior(self, tmp_path):
+        inside = tmp_path / "hello.py"
+        inside.write_text("y = 2\n")
+        # Without repo_root we intentionally skip the containment check so
+        # unit tests that construct nodes with absolute fixture paths still
+        # work (see TestFileLineCache.test_reads_existing_file style, but
+        # called with no repo_root).
+        cache = _FileLineCache(repo_root=None)
+        assert cache.get_lines(str(inside), 1, 1) == ["y = 2"]
+
+
+class TestRepoRootGuessExternalDataDir:
+    """Codex round 7 regression: ``CRG_DATA_DIR`` / ``CRG_REPO_ROOT``.
+
+    When ``graph.db`` lives outside the repo (``CRG_DATA_DIR`` is set),
+    ``_repo_root_guess`` must NOT return the cache directory — otherwise
+    ``_FileLineCache`` resolves repo-relative paths against the cache
+    and every body read fails the containment check, silently degrading
+    to metadata-only embeddings. We verify the guess falls back to
+    ``find_project_root`` (which honors ``CRG_REPO_ROOT`` or walks up to
+    a ``.git`` marker) in that case, while preserving the fast legacy
+    ``<repo>/.code-review-graph/graph.db`` inference.
+    """
+
+    def _store(self, db_path):
+        with patch(
+            "code_review_graph.embeddings.get_provider", return_value=None,
+        ):
+            return EmbeddingStore(db_path)
+
+    def test_legacy_layout_infers_repo_from_db_parent(self, tmp_path):
+        repo = tmp_path / "repo"
+        (repo / ".code-review-graph").mkdir(parents=True)
+        (repo / ".git").mkdir()
+        db = repo / ".code-review-graph" / "graph.db"
+        db.touch()
+        store = self._store(db)
+        try:
+            assert store._repo_root_guess() == repo.resolve()
+        finally:
+            store.close()
+
+    def test_external_data_dir_falls_back_to_crg_repo_root(self, tmp_path):
+        repo = tmp_path / "my-repo"
+        (repo / ".git").mkdir(parents=True)
+        (repo / "sample.py").write_text("def f():\n return 42\n")
+        cache = tmp_path / "external-cache"
+        cache.mkdir()
+        db = cache / "graph.db"
+        db.touch()
+        with patch.dict(
+            os.environ, {"CRG_REPO_ROOT": str(repo)},
+        ):
+            store = self._store(db)
+            try:
+                # Without the fix this would point at ``cache`` and make
+                # every downstream read fail containment.
+                assert store._repo_root_guess() == repo.resolve()
+            finally:
+                store.close()
+
+    def test_crg_data_dir_named_dot_code_review_graph(self, tmp_path):
+        """Codex round 8: a ``CRG_DATA_DIR`` whose last segment happens
+        to be ``.code-review-graph`` (e.g. ``/var/tmp/.code-review-graph``
+        or a shared cache dir set up that way) must NOT trigger the
+        fast path — ``db_path.parent.parent`` would be the cache's
+        parent directory, not the repo root. When the env override is
+        set we always defer to ``find_project_root`` so that the
+        repo-root resolution uses the real git walk / ``CRG_REPO_ROOT``
+        override instead of a coincidental directory name."""
+        repo = tmp_path / "real-repo"
+        (repo / ".git").mkdir(parents=True)
+        (repo / "sample.py").write_text("x = 1\n")
+        # Cache dir whose last segment IS '.code-review-graph' but
+        # sits far away from the real repo — classic misfire case.
+        cache = tmp_path / "var-tmp" / ".code-review-graph"
+        cache.mkdir(parents=True)
+        db = cache / "graph.db"
+        db.touch()
+        with patch.dict(
+            os.environ,
+            {"CRG_DATA_DIR": str(cache), "CRG_REPO_ROOT": str(repo)},
+        ):
+            store = self._store(db)
+            try:
+                root = store._repo_root_guess()
+                # WRONG (fast-path trap): tmp_path / "var-tmp"
+                # RIGHT: repo (via find_project_root + CRG_REPO_ROOT)
+                assert root == repo.resolve()
+                reader = _FileLineCache(repo_root=root)
+                assert reader.get_lines("sample.py", 1, 1) == ["x = 1"]
+            finally:
+                store.close()
+
+    def test_external_data_dir_resolves_file_read(self, tmp_path):
+        repo = tmp_path / "my-repo"
+        (repo / ".git").mkdir(parents=True)
+        (repo / "sample.py").write_text("def f():\n return 42\n")
+        cache = tmp_path / "external-cache"
+        cache.mkdir()
+        db = cache / "graph.db"
+        db.touch()
+        with patch.dict(
+            os.environ, {"CRG_REPO_ROOT": str(repo)},
+        ):
+            store = self._store(db)
+            try:
+                root = store._repo_root_guess()
+                reader = _FileLineCache(repo_root=root)
+                # Repo-relative file_path resolves correctly against the
+                # real repo, not the external cache dir.
+                assert reader.get_lines("sample.py", 1, 2) == [
+                    "def f():", " return 42",
+                ]
+                assert reader.last_read_ok is True
+            finally:
+                store.close()
+
+
+class TestLogOnlyLineSkipping:
+    """Language-agnostic log/print-line skipping.
+
+    Framework-heavy code (Discord bots, web controllers, server
+    handlers) interleaves ``console.log`` / ``logger.info`` /
+    ``fmt.Println`` / ``print()`` lines throughout function bodies.
+    Those lines carry almost no retrieval signal — a function's
+    purpose is rarely "the fact that it logs" — but they eat the body
+    char budget, and on small models (MiniLM 256-token window) they
+    crowd out the real logic tokens. Dropping pure log lines is a
+    language-agnostic way to raise token-to-signal density.
+    """
+
+    def test_is_log_only_line_python(self):
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line(" logger.info('handling request')") is True
+        assert _is_log_only_line("logging.debug('x', extra={'y': 1})") is True
+        assert _is_log_only_line(" print('hi')") is True
+        assert _is_log_only_line(" print('hi');") is True  # trailing ;
+
+    def test_is_log_only_line_js_ts(self):
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line(" console.log('hi');") is True
+        assert _is_log_only_line(" console.error(err);") is True
+        assert _is_log_only_line("logger.warn(`connection ${host}`);") is True
+
+    def test_is_log_only_line_java(self):
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line(" System.out.println(\"hi\");") is True
+        assert _is_log_only_line("log.debug(msg);") is True
+
+    def test_is_log_only_line_go(self):
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line(" fmt.Println(msg)") is True
+        assert _is_log_only_line(" log.Printf(\"%v\", err)") is True
+
+    def test_is_log_only_line_rust(self):
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line(" println!(\"hi\");") is True
+        assert _is_log_only_line(" eprintln!(\"err: {}\", e);") is True
+        assert _is_log_only_line(" dbg!(value);") is True
+
+    def test_is_log_only_line_keeps_mixed_statements(self):
+        """Log call mixed with other code is retrieval signal — keep it."""
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line(
+            "value = compute(); logger.info(value); return value"
+        ) is False
+        assert _is_log_only_line(
+            "if err: logger.error(err); raise"
+        ) is False
+        assert _is_log_only_line(
+            "x = logger.info('x')"  # assignment, not pure log
+        ) is False
+
+    def test_is_log_only_line_tail_continuation_preserved(self):
+        """Codex round 10 regression: lines that start with a log call but
+        continue with executable code must NOT be dropped. Requires
+        balanced-paren walk, not prefix-only match."""
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line("logger.info('x') or do_work()") is False
+        assert _is_log_only_line("console.log(x) && doThing()") is False
+        assert _is_log_only_line(
+            "logger.info('hello') if verbose else default()"
+        ) is False
+        # Balanced-paren walker handles nested parens inside the log call.
+        assert _is_log_only_line(
+            "logger.info('val=' + str(compute(x, y)))"
+        ) is True
+        # Log call with trailing `;` is still log-only.
+        assert _is_log_only_line(
+            "console.log(payload);"
+        ) is True
+
+    def test_is_log_only_line_non_logger_identifier_preserved(self):
+        """Codex round 10 regression: ``x.error(...)`` / ``x.warn(...)`` on
+        NON-logger receivers is legitimate API behaviour (argparse
+        parsers, custom repos, HTTP response writers) and must not be
+        treated as log-only."""
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line("serve_cmd.error('bad arg')") is False
+        assert _is_log_only_line("db.warn(tx, 'slow')") is False
+        assert _is_log_only_line("item.notice(code, reason)") is False
+        # Real logger names still match
+        assert _is_log_only_line("logger.warn('slow')") is True
+        assert _is_log_only_line("self.logger.debug('x')") is True
+
+    def test_is_log_only_line_php(self):
+        """PHP + Laravel log patterns: ``->`` for instance, ``::`` for
+        static facades, bare ``error_log`` / ``syslog`` /
+        ``trigger_error`` helpers."""
+        from code_review_graph.embeddings import _is_log_only_line
+        # Monolog-style instance logger
+        assert _is_log_only_line("$logger->info('fetched user');") is True
+        assert _is_log_only_line("$this->logger->debug($payload);") is True
+        assert _is_log_only_line("$log->warning('slow query');") is True
+        # Laravel static facades (with and without leading backslash)
+        assert _is_log_only_line("Log::info('merchant debit');") is True
+        assert _is_log_only_line("Log::error($e->getMessage());") is True
+        assert _is_log_only_line("\\Log::debug('detail');") is True
+        # Bare helpers
+        assert _is_log_only_line("error_log('boot failure');") is True
+        assert _is_log_only_line("syslog(LOG_WARNING, $msg);") is True
+        assert _is_log_only_line("trigger_error('deprecated', E_USER_WARNING);") is True
+        # Non-logger PHP calls stay kept
+        assert _is_log_only_line("$response->setStatusCode(200);") is False
+        assert _is_log_only_line("$user->save();") is False
+        assert _is_log_only_line("User::create($data);") is False
+        assert _is_log_only_line("$db->query('SELECT *');") is False
+
+    def test_is_log_only_line_fprint_not_skipped(self):
+        """Codex round 10 regression: ``fmt.Fprintf(w, ...)`` writes to an
+        arbitrary io.Writer (HTTP response, file handle) — that IS the
+        function's real output, not a log."""
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line('fmt.Fprintf(w, "ok")') is False
+        assert _is_log_only_line('fmt.Fprintln(stdout, "hi")') is False
+        # Pure fmt.Print* still is log-like (goes to stdout).
+        assert _is_log_only_line('fmt.Println(msg)') is True
+        assert _is_log_only_line('fmt.Printf("%v", err)') is True
+
+    def test_is_log_only_line_not_a_log(self):
+        from code_review_graph.embeddings import _is_log_only_line
+        assert _is_log_only_line("return self.compute(x)") is False
+        assert _is_log_only_line("db.execute('SELECT *')") is False
+        assert _is_log_only_line("await interaction.deferReply()") is False
+
+    def test_body_extraction_drops_scattered_log_lines(self, tmp_path):
+        """Log lines interleaved with real code get stripped, freeing
+        the budget for actual logic."""
+        src = tmp_path / "handler.ts"
+        src.write_text(
+            "export async function handle(req) {\n"
+            "  const user = await lookupUser(req.id);\n"
+            "  logger.info('fetched user', { id: req.id });\n"
+            "  const result = await computeScore(user);\n"
+            "  console.log('score', result);\n"
+            "  return { score: result, user: user.name };\n"
+            "}\n"
+        )
+        reader = _FileLineCache(repo_root=tmp_path)
+        node = _mk_node(
+            language="typescript", name="handle", params="(req)",
+            file_path=str(src), line_start=1, line_end=7,
+        )
+        body = _extract_body_text(node, reader, max_chars=4000)
+        assert "logger.info" not in body
+        assert "console.log" not in body
+        assert "lookupUser" in body
+        assert "computeScore" in body
+        assert "return { score" in body
+
+
+class TestKotlinFunSignatureDedup:
+    """Codex-review regression: Kotlin ``fun`` without return type."""
+
+    def test_kotlin_fun_without_return_type(self):
+        node = _mk_node(
+            language="kotlin", name="save", params="(user: User)",
+            return_type=None,
+        )
+        assert _looks_like_signature(
+            "override fun save(user: User) {", node,
+        ) is True
diff --git a/tests/test_embeddings_mrr.py b/tests/test_embeddings_mrr.py
new file mode 100644
index 00000000..0746ff38
--- /dev/null
+++ b/tests/test_embeddings_mrr.py
@@ -0,0 +1,183 @@
+"""MRR regression test for body-enriched embeddings (AC-6 Iter 3).
+
+Runs the Local sentence-transformers provider against a fixed mini_repo +
+10 golden queries and asserts that enabling body enrichment does NOT
+regress Mean Reciprocal Rank at top-3 beyond the -0.02 tolerance AND
+produces a strictly positive delta (non-inferiority + soft signal).
+
+Skipped when ``sentence-transformers`` is not installed, so the project's
+default ``.[dev]`` test matrix keeps passing; CI jobs that care about the
+gate should install ``.[embeddings]`` to activate this module.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from code_review_graph.graph import GraphNode
+
+pytest.importorskip("sentence_transformers")
+
+from code_review_graph.embeddings import (  # noqa: E402 — after importorskip
+    LocalEmbeddingProvider,
+    _cosine_similarity,
+    _FileLineCache,
+    _node_to_text,
+    _resolve_body_max_chars,
+)
+
+FIXTURE_DIR = Path(__file__).parent / "fixtures" / "embeddings" / "eval"
+MINI_REPO = FIXTURE_DIR / "mini_repo"
+
+
+def _mk(qualified_name, file_rel, line_start, line_end, *,
+        kind="Function", name=None, parent_name=None, params=None,
+        return_type=None):
+    """Build a GraphNode for mini_repo fixture nodes."""
+    return GraphNode(
+        id=0, kind=kind,
+        name=name or qualified_name.rsplit("::", 1)[1].split(".")[-1],
+        qualified_name=qualified_name,
+        file_path=str(MINI_REPO / file_rel),
+        line_start=line_start, line_end=line_end,
+        language="python",
+        parent_name=parent_name, params=params, return_type=return_type,
+        is_test=False, file_hash=None, extra={},
+    )
+
+
+# Accurate line ranges from mini_repo files (see tests/fixtures/.../mini_repo/).
+NODES: list[GraphNode] = [
+    # parser.py
+    _mk("parser.py::parse_tree", "parser.py", 4, 9, params="(source)"),
+    _mk("parser.py::walk_nodes", "parser.py", 11, 16, params="(tree, visitor)"),
+    _mk("parser.py::extract_functions", "parser.py", 19, 22, params="(tree)"),
+    _mk("parser.py::tokenize", "parser.py", 25, 26, params="(source)"),
+    _mk("parser.py::build_root", "parser.py", 29, 33, params="(tokens)"),
+    # graph.py
+    _mk("graph.py::Graph.__init__", "graph.py", 7, 9, params="(self)",
+        parent_name="Graph"),
+    _mk("graph.py::Graph.add_node", "graph.py", 11, 13, params="(self, name)",
+        parent_name="Graph"),
+    _mk("graph.py::Graph.add_edge", "graph.py", 15, 16, params="(self, src, dst)",
+        parent_name="Graph"),
+    _mk("graph.py::Graph.neighbors", "graph.py", 18, 19, params="(self, name)",
+        parent_name="Graph"),
+    _mk("graph.py::Graph.impact_radius", "graph.py", 21, 32,
+        params="(self, start, depth)", parent_name="Graph"),
+    # embeddings.py
+    _mk("embeddings.py::cosine_similarity", "embeddings.py", 4, 12,
+        params="(a, b)"),
+    _mk("embeddings.py::embed_text", "embeddings.py", 15, 17,
+        params="(text)"),
+    _mk("embeddings.py::EmbeddingIndex.__init__", "embeddings.py", 21, 22,
+        params="(self)", parent_name="EmbeddingIndex"),
+    _mk("embeddings.py::EmbeddingIndex.add", "embeddings.py", 24, 25,
+        params="(self, qname, vec)", parent_name="EmbeddingIndex"),
+    _mk("embeddings.py::EmbeddingIndex.search", "embeddings.py", 27, 32,
+        params="(self, query_vec, limit=10)", parent_name="EmbeddingIndex"),
+    # visualization.py
+    _mk("visualization.py::esc_h", "visualization.py", 4, 10, params="(s)"),
+    _mk("visualization.py::render_html", "visualization.py", 13, 18, params="(graph)"),
+    # api_client.py
+    _mk("api_client.py::OpenAIClient.__init__", "api_client.py", 7, 10,
+        params="(self)", parent_name="OpenAIClient"),
+    _mk("api_client.py::GeminiClient.__init__", "api_client.py", 14, 17,
+        params="(self)", parent_name="GeminiClient"),
+    _mk("api_client.py::MinimaxClient.__init__", "api_client.py", 21, 24,
+        params="(self)", parent_name="MinimaxClient"),
+    # incremental.py
+    _mk("incremental.py::detect_changes", "incremental.py", 6, 12,
+        params="(repo_root, since_ref)"),
+    _mk("incremental.py::get_changed_files", "incremental.py", 15, 16,
+        params="(repo_root)"),
+]
+
+
+def _rank_top(query_vec, doc_vecs, qnames, limit=3):
+    scored = [
+        (qn, _cosine_similarity(query_vec, v))
+        for qn, v in zip(qnames, doc_vecs)
+    ]
+    scored.sort(key=lambda x: x[1], reverse=True)
+    return [qn for qn, _ in scored[:limit]]
+
+
+def _reciprocal_rank(ranked_top: list[str], expected: set[str]) -> float:
+    for i, qn in enumerate(ranked_top, start=1):
+        if qn in expected:
+            return 1.0 / i
+    return 0.0
+
+
+def _mrr(rankings: list[list[str]], expectations: list[set[str]]) -> float:
+    if not rankings:
+        return 0.0
+    return sum(
+        _reciprocal_rank(r, e) for r, e in zip(rankings, expectations)
+    ) / len(rankings)
+
+
+def test_mrr_body_not_worse_than_legacy():
+    """AC-6 (Iter 3 D-iter3-9): non-inferiority MRR gate.
+
+    Two runs with the Local provider:
+    - legacy: ``_node_to_text(node, reader=None)`` — metadata only
+    - body-on: ``_node_to_text(node, reader, max_chars=...)`` — body enriched
+
+    The body-on run must not regress MRR@3 beyond the tolerance
+    (``MRR_on >= MRR_legacy - 0.02``) AND must strictly improve it
+    (``MRR_on > MRR_legacy``).
+ """ + provider = LocalEmbeddingProvider() + max_chars = _resolve_body_max_chars(provider.name) + reader = _FileLineCache() + + legacy_texts = [_node_to_text(n, reader=None) for n in NODES] + body_texts = [ + _node_to_text(n, reader=reader, max_chars=max_chars) for n in NODES + ] + + legacy_vecs = provider.embed(legacy_texts) + body_vecs = provider.embed(body_texts) + + qnames = [n.qualified_name for n in NODES] + + queries = json.loads((FIXTURE_DIR / "golden_queries.json").read_text()) + # Sanity: all golden expected nodes exist in our fixture. + known = set(qnames) + for q in queries: + for exp in q["expected_top3"]: + assert exp in known, f"Golden query references unknown node: {exp}" + + query_texts = [q["query"] for q in queries] + query_vecs = [provider.embed_query(t) for t in query_texts] + expectations = [set(q["expected_top3"]) for q in queries] + + legacy_rankings = [_rank_top(qv, legacy_vecs, qnames) for qv in query_vecs] + body_rankings = [_rank_top(qv, body_vecs, qnames) for qv in query_vecs] + + mrr_legacy = _mrr(legacy_rankings, expectations) + mrr_body = _mrr(body_rankings, expectations) + + # Diagnostic logging (captured on failure) + print(f"MRR@3 legacy={mrr_legacy:.4f} body={mrr_body:.4f}") + for q, lr, br in zip(queries, legacy_rankings, body_rankings): + print(f" [{q['type']:19s}] {q['query']}") + print(f" legacy top-3: {lr}") + print(f" body top-3 : {br}") + + # non-inferiority hard gate (must not regress beyond tolerance) + assert mrr_body >= mrr_legacy - 0.02, ( + f"body enrichment regressed MRR beyond tolerance " + f"(body={mrr_body:.4f}, legacy={mrr_legacy:.4f}, " + f"delta={mrr_body - mrr_legacy:+.4f})" + ) + # soft signal: must strictly improve (not just match) + assert mrr_body > mrr_legacy, ( + f"body enrichment failed to improve MRR " + f"(body={mrr_body:.4f}, legacy={mrr_legacy:.4f})" + )