fix: restore dropped parser fixes from #274 / #276 / #285 (+ Julia impl from #319)#362
fix: restore dropped parser fixes from #274 / #276 / #285 (+ Julia impl from #319)#362npkriami18 wants to merge 6 commits intotirth8205:mainfrom
Conversation
The parser gated CALLS edge emission on `enclosing_func` being set, so calls made from module scope (top-level script glue, CLI entrypoints, `if __name__ == "__main__"` blocks, and Jupyter/Databricks notebook cells) produced zero CALLS edges. Any function invoked only from those contexts was flagged as dead by `find_dead_code`, even when the function was the entire reason the script existed. Notebooks are particularly affected because every cell is module-scope by definition, so the existing notebook parser (PR tirth8205#69) emitted nodes and IMPORTS_FROM edges but no CALLS edges — making the dead-code detector's notebook coverage vacuous. Fix: when `enclosing_func` is None, attribute the CALLS edge to the File node instead of dropping it. Matches the existing convention used by `_extract_value_references` and CONTAINS edges. Applied to all 5 gated emission sites: generic Python/JS/TS path, JSX components, Elixir, Solidity `emit`, and R. Downstream: `detect_entry_points` now filters File-sourced CALLS via `get_all_call_targets(include_file_sources=False)` so script-only callees remain detectable as entry points (otherwise `run_job()` called from `script.py` module scope would look "called" by `script.py` and disappear from flow analysis). Verified end-to-end against a Databricks `.ipynb` that calls `Predict.extract_data_from_sample_ids()` from cell-level code: edge count went from 0 to 14 CALLS edges, and `find_dead_code` no longer flags the method. Tests: - `test_module_scope_calls_attributed_to_file` — bare `.py` script - `test_module_scope_calls_in_notebook` — `.ipynb` file - `test_detect_entry_points_module_scope_caller_is_still_root` — flow analysis treats File-sourced CALLS correctly - `test_module_scope_caller_prevents_dead_code_flag` — end-to-end parse → store → find_dead_code - `test_if_main_block_caller_prevents_dead_code_flag` — same for `__main__` block (cherry picked from commit fe383c7)
…on-less scripts (tirth8205#276) Two parser improvements that expand code-review-graph's file coverage to extension-less Unix scripts and Korn shell files. Feature 1: .ksh extension → bash parser (tirth8205#235) ----------------------------------------------- Register .ksh (Korn shell) with tree-sitter-bash alongside the existing .sh / .bash / .zsh entries shipped in v2.3.0. Korn shell is close enough to bash syntactically that tree-sitter-bash handles the structural features the graph captures correctly. Context: in the close comment on PR tirth8205#230, @tirth8205 explicitly flagged this as worth adding: "The .ksh extension in particular looks worth adding — I didn't include it in tirth8205#227." Tests: test_detects_language extended with .ksh assertion; test_ksh_extension_parses_as_bash — end-to-end regression test that copies sample.sh to a temp .ksh file, parses it, and asserts identical function set and edge counts. Feature 2: shebang-based language detection (tirth8205#237) -------------------------------------------------- detect_language() was extension-only — any file with no extension returned None and was silently skipped. This misses a huge category of production files: git hooks, CI scripts, bin/ entry points, installers. New SHEBANG_INTERPRETER_TO_LANGUAGE table maps common interpreter basenames to languages already registered: bash/sh/zsh/ksh/dash/ash -> bash python/python2/python3/pypy/pypy3 -> python node/nodejs -> javascript ruby, perl, lua, Rscript, php New _detect_language_from_shebang(path) static method reads the first 256 bytes, handles direct form (#!/bin/bash), env indirection (#!/usr/bin/env bash), env -S flags, trailing flags (#!/bin/bash -e), CRLF, binary content, and strict UTF-8 decoding. detect_language() now falls back to the shebang probe for files with no extension (suffix == ""). Files with a known extension are never re-read — extension-based detection stays authoritative. Tests (16 new in test_parser.py): every interpreter mapping, env -S flag, trailing flags, missing shebang, empty file, binary content, unknown interpreter, extension-does-not-get-overridden, and end-to-end parse_file producing function nodes from an extension-less bash script. Files changed ------------- - code_review_graph/parser.py — .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE table + _detect_language_from_shebang() + detect_language() fallback - tests/test_multilang.py — .ksh detection + end-to-end ksh parsing test - tests/test_parser.py — 16 shebang detection tests (cherry picked from commit e6e3144)
…irth8205#274) Five root-cause fixes that together resolve every pre-existing Windows test failure on main. Each fix has targeted regression tests; the net effect is full green CI on Windows (8 failures -> 0). Bug 1: get_data_dir() writes non-UTF-8 .gitignore on Windows (tirth8205#239) -------------------------------------------------------------------- write_text() called without encoding="utf-8". The em-dash in the header is U+2014 which Python encodes as cp1252 byte 0x97 on Windows. Any later UTF-8 read fails with UnicodeDecodeError. Fix: add encoding="utf-8" (matches sibling _ensure_repo_gitignore). Test: test_auto_gitignore_is_valid_utf8 — asserts UTF-8 byte sequence, rejects cp1252 byte. Bug 2: Databricks notebook detection fails on CRLF line endings (tirth8205#239) ---------------------------------------------------------------------- source.startswith(b"# Databricks notebook source\n") hard-codes LF. Windows git checkout (core.autocrlf=true) produces CRLF. All Databricks handling silently bypassed — 4 tests fail. Fix: parse first line robustly, strip trailing \r before exact match. Tests: test_databricks_header_crlf_line_endings, test_databricks_header_lf_line_endings_still_work, test_databricks_header_prefix_false_positive_rejected. Bug 3: Stale FastMCP API in async regression guard (tirth8205#239) --------------------------------------------------------- test_heavy_tools_are_coroutines called mcp.get_tools() which does not exist in fastmcp>=2.14.0 (pinned in pyproject.toml). The guard has been silently broken since it was written — the protection promised by PR tirth8205#231 for tirth8205#46/tirth8205#136 was never actually enforced. Fix: resolve tools via getattr(crg_main, name) like the sibling test. Drop @pytest.mark.asyncio since no event loop is needed. Test: test_regression_guard_does_not_depend_on_fastmcp_internals — AST-walks the guard source to ensure no mcp internal API references. Bug 4-5: find_repo_root walks above test sandbox (tirth8205#241) ------------------------------------------------------- test_returns_none_without_git and test_falls_back_to_start fail on any machine where tmp_path has a git-initialized ancestor (dotfiles repo at ~/.git — very common on developer machines). Fix: add optional stop_at parameter to find_repo_root() and find_project_root(). When set, the walk examines stop_at for .git and then stops. Default is None (existing walk-to-root behavior). Fully backward-compatible — all 7 production callers unchanged. Tests: test_stop_at_prevents_escape_to_outer_git, test_stop_at_finds_git_at_boundary, test_stop_at_forwarded_to_find_repo_root. Files changed ------------- - code_review_graph/incremental.py — encoding fix + stop_at API - code_review_graph/parser.py — CRLF-tolerant Databricks detection - tests/test_incremental.py — gitignore UTF-8 guard + stop_at tests - tests/test_main.py — fixed async guard + meta-guard - tests/test_notebook.py — CRLF + LF + false-positive guards (cherry picked from commit aa627fb)
(cherry picked from commit f092922)
(cherry picked from commit 425810b)
(cherry picked from commit 536fd4b)
|
Thanks for catching this @npkriami18 — and thanks for doing the blame-and-cherry-pick work to restore authorship properly. Two of the four reverted fixes are mine (#274 and #276); this PR picks them back up exactly as I wrote them. No intent lost in translation. Verified from my end:
Your conflict resolution notes match what I'd have done too:
Root cause is worth flagging for future super-PR resolution: when the base of a stacked feature branch predates a fix that lives in a completely unrelated region, Happy to approve if reviewers have the bit for it, or land my OK here and wait for @tirth8205. Zero objections from me — restore my two commits as-is. Not addressed here and I agree they're out of scope:
Thanks again for the careful work. |
Fixes #361.
Context
Four already-merged fixes were silently reverted on
mainby twochore: resolve merge conflicts with maincommits (3cf31ec,cb9d25a). Neither conflict emitted markers because the feature branches had not touched those regions at all — only themainside had them — sogit mergeproduced a clean auto-resolution with the wrong side accepted. See #361 for the full root-cause analysis with blame evidence.What this PR does
Cherry-picks the four commits back onto
main, preserving authorship:fe383c7e6e3144aa627fbf092922+425810b+536fd4bConflict resolution notes:
aa627fb: onlycode_review_graph/parser.py(the one file onmainthat actually reverted);incremental.pystop_at changes and test scaffolding were still onmain, so the no-op hunks dropped out.tests/test_main.pyHEAD-side kept (preservesTestApplyToolFilterintroduced bya3a043b).incremental.py: HEAD side kept (retains SVN support added after fix: resolve all 8 Windows test failures — UTF-8, CRLF, FastMCP API, stop_at (#239, #241) #274).f092922: trivial — kept both ReScript and Julia test-file patterns in_TEST_PATH_PATTERNS.Test results
44 test failures fixed. Remaining 16 + 6 are out of scope:
TestApplyToolFiltererrors — separateFastMCP._tool_managerprivate-API bug introduced bya3a043b.test_stdio_calls_mcp_run_stdio— separate stdio-transport regression.Quality gate
uv run ruff check code_review_graph/parser.py code_review_graph/incremental.pyuv run mypy code_review_graph/parser.py code_review_graph/incremental.py --ignore-missing-imports --no-strict-optionalNon-goals
skills.py,cli.py, or any Copilot work (see feature: Add GitHub Copilot support (CLI and CHAT) ‼️‼️‼️ #129 for that)._tool_manageror Java/PHP/GDScript regressions — those deserve their own PRs.cc @tirth8205 — could you take a look? @azizur100389 @michael-denyer @danvinci — pinging as original authors in case I misread any of your original intent during the cherry-pick.