Skip to content

fix: restore dropped parser fixes from #274 / #276 / #285 (+ Julia impl from #319)#362

Closed
npkriami18 wants to merge 6 commits intotirth8205:mainfrom
npkriami18:fix/restore-reverted-parser-fixes
Closed

fix: restore dropped parser fixes from #274 / #276 / #285 (+ Julia impl from #319)#362
npkriami18 wants to merge 6 commits intotirth8205:mainfrom
npkriami18:fix/restore-reverted-parser-fixes

Conversation

@npkriami18
Copy link
Copy Markdown

Fixes #361.

Context

Four already-merged fixes were silently reverted on main by two chore: resolve merge conflicts with main commits (3cf31ec, cb9d25a). Neither conflict emitted markers because the feature branches had not touched those regions at all — only the main side had them — so git merge produced a clean auto-resolution with the wrong side accepted. See #361 for the full root-cause analysis with blame evidence.

What this PR does

Cherry-picks the four commits back onto main, preserving authorship:

Commit PR Author Tests restored
fe383c7 #285 @michael-denyer 7 module-scope CALLS tests
e6e3144 #276 @azizur100389 10 shebang-detection tests
aa627fb #274 @azizur100389 1 Databricks CRLF test
f092922 + 425810b + 536fd4b #319 @danvinci 21 Julia parser tests

Conflict resolution notes:

  • aa627fb: only code_review_graph/parser.py (the one file on main that actually reverted); incremental.py stop_at changes and test scaffolding were still on main, so the no-op hunks dropped out. tests/test_main.py HEAD-side kept (preserves TestApplyToolFilter introduced by a3a043b).
  • incremental.py: HEAD side kept (retains SVN support added after fix: resolve all 8 Windows test failures — UTF-8, CRLF, FastMCP API, stop_at (#239, #241) #274).
  • f092922: trivial — kept both ReScript and Julia test-file patterns in _TEST_PATH_PATTERNS.

Test results

# Before (upstream/main @ b0f8527)
60 failed, 1015 passed, 1 skipped, 2 xpassed, 6 errors

# After (this branch)
16 failed, 1059 passed, 1 skipped, 2 xpassed, 6 errors

44 test failures fixed. Remaining 16 + 6 are out of scope:

  • 6 TestApplyToolFilter errors — separate FastMCP._tool_manager private-API bug introduced by a3a043b.
  • 1 test_stdio_calls_mcp_run_stdio — separate stdio-transport regression.
  • 6 Java / 1 PHP / 9 GDScript — pre-existing regressions unrelated to any of the four PRs restored here.

Quality gate

  • uv run ruff check code_review_graph/parser.py code_review_graph/incremental.py
  • uv run mypy code_review_graph/parser.py code_review_graph/incremental.py --ignore-missing-imports --no-strict-optional

Non-goals

cc @tirth8205 — could you take a look? @azizur100389 @michael-denyer @danvinci — pinging as original authors in case I misread any of your original intent during the cherry-pick.

michael-denyer and others added 6 commits April 21, 2026 19:12
The parser gated CALLS edge emission on `enclosing_func` being set, so
calls made from module scope (top-level script glue, CLI entrypoints,
`if __name__ == "__main__"` blocks, and Jupyter/Databricks notebook
cells) produced zero CALLS edges. Any function invoked only from those
contexts was flagged as dead by `find_dead_code`, even when the
function was the entire reason the script existed.

Notebooks are particularly affected because every cell is module-scope
by definition, so the existing notebook parser (PR tirth8205#69) emitted nodes
and IMPORTS_FROM edges but no CALLS edges — making the dead-code
detector's notebook coverage vacuous.

Fix: when `enclosing_func` is None, attribute the CALLS edge to the
File node instead of dropping it. Matches the existing convention used
by `_extract_value_references` and CONTAINS edges. Applied to all 5
gated emission sites: generic Python/JS/TS path, JSX components,
Elixir, Solidity `emit`, and R.

Downstream: `detect_entry_points` now filters File-sourced CALLS via
`get_all_call_targets(include_file_sources=False)` so script-only
callees remain detectable as entry points (otherwise `run_job()`
called from `script.py` module scope would look "called" by `script.py`
and disappear from flow analysis).

Verified end-to-end against a Databricks `.ipynb` that calls
`Predict.extract_data_from_sample_ids()` from cell-level code: edge
count went from 0 to 14 CALLS edges, and `find_dead_code` no longer
flags the method.

Tests:
- `test_module_scope_calls_attributed_to_file` — bare `.py` script
- `test_module_scope_calls_in_notebook` — `.ipynb` file
- `test_detect_entry_points_module_scope_caller_is_still_root` — flow
  analysis treats File-sourced CALLS correctly
- `test_module_scope_caller_prevents_dead_code_flag` — end-to-end
  parse → store → find_dead_code
- `test_if_main_block_caller_prevents_dead_code_flag` — same for
  `__main__` block

(cherry picked from commit fe383c7)
…on-less scripts (tirth8205#276)

Two parser improvements that expand code-review-graph's file coverage
to extension-less Unix scripts and Korn shell files.

Feature 1: .ksh extension → bash parser (tirth8205#235)
-----------------------------------------------
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing
.sh / .bash / .zsh entries shipped in v2.3.0.  Korn shell is close enough
to bash syntactically that tree-sitter-bash handles the structural
features the graph captures correctly.

Context: in the close comment on PR tirth8205#230, @tirth8205 explicitly flagged
this as worth adding: "The .ksh extension in particular looks worth
adding — I didn't include it in tirth8205#227."

Tests: test_detects_language extended with .ksh assertion;
test_ksh_extension_parses_as_bash — end-to-end regression test that
copies sample.sh to a temp .ksh file, parses it, and asserts identical
function set and edge counts.

Feature 2: shebang-based language detection (tirth8205#237)
--------------------------------------------------
detect_language() was extension-only — any file with no extension returned
None and was silently skipped.  This misses a huge category of production
files: git hooks, CI scripts, bin/ entry points, installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps common interpreter
basenames to languages already registered:
  bash/sh/zsh/ksh/dash/ash -> bash
  python/python2/python3/pypy/pypy3 -> python
  node/nodejs -> javascript
  ruby, perl, lua, Rscript, php

New _detect_language_from_shebang(path) static method reads the first
256 bytes, handles direct form (#!/bin/bash), env indirection
(#!/usr/bin/env bash), env -S flags, trailing flags (#!/bin/bash -e),
CRLF, binary content, and strict UTF-8 decoding.

detect_language() now falls back to the shebang probe for files with
no extension (suffix == "").  Files with a known extension are never
re-read — extension-based detection stays authoritative.

Tests (16 new in test_parser.py): every interpreter mapping, env -S flag,
trailing flags, missing shebang, empty file, binary content, unknown
interpreter, extension-does-not-get-overridden, and end-to-end
parse_file producing function nodes from an extension-less bash script.

Files changed
-------------
- code_review_graph/parser.py — .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE
  table + _detect_language_from_shebang() + detect_language() fallback
- tests/test_multilang.py — .ksh detection + end-to-end ksh parsing test
- tests/test_parser.py — 16 shebang detection tests

(cherry picked from commit e6e3144)
…irth8205#274)

Five root-cause fixes that together resolve every pre-existing Windows
test failure on main.  Each fix has targeted regression tests; the net
effect is full green CI on Windows (8 failures -> 0).

Bug 1: get_data_dir() writes non-UTF-8 .gitignore on Windows (tirth8205#239)
--------------------------------------------------------------------
write_text() called without encoding="utf-8".  The em-dash in the header
is U+2014 which Python encodes as cp1252 byte 0x97 on Windows.  Any later
UTF-8 read fails with UnicodeDecodeError.

Fix: add encoding="utf-8" (matches sibling _ensure_repo_gitignore).
Test: test_auto_gitignore_is_valid_utf8 — asserts UTF-8 byte sequence,
rejects cp1252 byte.

Bug 2: Databricks notebook detection fails on CRLF line endings (tirth8205#239)
----------------------------------------------------------------------
source.startswith(b"# Databricks notebook source\n") hard-codes LF.
Windows git checkout (core.autocrlf=true) produces CRLF.  All Databricks
handling silently bypassed — 4 tests fail.

Fix: parse first line robustly, strip trailing \r before exact match.
Tests: test_databricks_header_crlf_line_endings,
test_databricks_header_lf_line_endings_still_work,
test_databricks_header_prefix_false_positive_rejected.

Bug 3: Stale FastMCP API in async regression guard (tirth8205#239)
---------------------------------------------------------
test_heavy_tools_are_coroutines called mcp.get_tools() which does not
exist in fastmcp>=2.14.0 (pinned in pyproject.toml).  The guard has
been silently broken since it was written — the protection promised by
PR tirth8205#231 for tirth8205#46/tirth8205#136 was never actually enforced.

Fix: resolve tools via getattr(crg_main, name) like the sibling test.
Drop @pytest.mark.asyncio since no event loop is needed.
Test: test_regression_guard_does_not_depend_on_fastmcp_internals —
AST-walks the guard source to ensure no mcp internal API references.

Bug 4-5: find_repo_root walks above test sandbox (tirth8205#241)
-------------------------------------------------------
test_returns_none_without_git and test_falls_back_to_start fail on any
machine where tmp_path has a git-initialized ancestor (dotfiles repo at
~/.git — very common on developer machines).

Fix: add optional stop_at parameter to find_repo_root() and
find_project_root().  When set, the walk examines stop_at for .git and
then stops.  Default is None (existing walk-to-root behavior).  Fully
backward-compatible — all 7 production callers unchanged.
Tests: test_stop_at_prevents_escape_to_outer_git,
test_stop_at_finds_git_at_boundary,
test_stop_at_forwarded_to_find_repo_root.

Files changed
-------------
- code_review_graph/incremental.py — encoding fix + stop_at API
- code_review_graph/parser.py — CRLF-tolerant Databricks detection
- tests/test_incremental.py — gitignore UTF-8 guard + stop_at tests
- tests/test_main.py — fixed async guard + meta-guard
- tests/test_notebook.py — CRLF + LF + false-positive guards

(cherry picked from commit aa627fb)
(cherry picked from commit f092922)
(cherry picked from commit 425810b)
(cherry picked from commit 536fd4b)
@azizur100389
Copy link
Copy Markdown
Contributor

Thanks for catching this @npkriami18 — and thanks for doing the blame-and-cherry-pick work to restore authorship properly. Two of the four reverted fixes are mine (#274 and #276); this PR picks them back up exactly as I wrote them. No intent lost in translation.

Verified from my end:

Your conflict resolution notes match what I'd have done too:

  • Keeping the incremental.py HEAD side is correct — the SVN support added after fix: resolve all 8 Windows test failures — UTF-8, CRLF, FastMCP API, stop_at (#239, #241) #274 is additive and doesn't touch the stop_at branch or get_data_dir encoding branch. I re-read the diff; no interaction.
  • tests/test_main.py HEAD-side keeps TestApplyToolFilter, which is independent of my work. Good call.
  • The f092922 trivial merge to keep both ReScript and Julia patterns in _TEST_PATH_PATTERNS is fine.

Root cause is worth flagging for future super-PR resolution: when the base of a stacked feature branch predates a fix that lives in a completely unrelated region, git merge auto-resolves by taking the feature side and silently drops the fix. There are no conflict markers because the feature branch simply doesn't know the fix exists. The only defense is diffing main against the merge result on any file the super-PR touches — not just the hunks the super-PR modified. #361's writeup is the cleanest diagnosis of this class of bug I've seen on the repo.

Happy to approve if reviewers have the bit for it, or land my OK here and wait for @tirth8205. Zero objections from me — restore my two commits as-is.

Not addressed here and I agree they're out of scope:

Thanks again for the careful work.

@npkriami18
Copy link
Copy Markdown
Author

Closing in favour of a single unified restoration PR — combining all reverted parser fixes (#274, #275, #276, #278, #280, #285, #298, #316, #319) plus the _tool_manager FastMCP API drift and stdio banner test into one green PR.

@npkriami18 npkriami18 closed this Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parser fixes from #274, #276, #285 (and Julia impl from #319) were silently reverted by merge-conflict resolutions

4 participants