fix: add encoding='utf-8' to subprocess.run() calls to fix Windows non-UTF-8 locale #409

Open
suainam wants to merge 1 commit into tirth8205:main from suainam:main

suainam commented Apr 30, 2026

Fix Windows Encoding Issue in subprocess.run() Calls

Problem

The code-review-graph build command fails on Windows systems with non-UTF-8 default encoding (e.g., Chinese Windows with GBK, Japanese Windows with Shift-JIS).

Error Message

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 3921: illegal multibyte sequence
AttributeError: 'NoneType' object has no attribute 'splitlines'

Root Cause

In incremental.py, all subprocess.run() calls use text=True without specifying encoding='utf-8':

result = subprocess.run(
    cmd,
    capture_output=True,
    text=True,  # ❌ Uses system default encoding (GBK on Chinese Windows)
    cwd=str(repo_root),
    timeout=_GIT_TIMEOUT,
)

When text=True is used without an explicit encoding:

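  1. Python decodes the child process's output with the locale's preferred encoding (e.g., GBK on Chinese Windows)
  2. Git outputs UTF-8 encoded content
  3. Python tries to decode UTF-8 bytes as GBK → UnicodeDecodeError
  4. The exception does not propagate to the caller (on Windows it is raised in subprocess's background reader thread), and result.stdout ends up as None
  5. The code then crashes with AttributeError when it calls .splitlines() on None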
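
The chain above can be reproduced without Git or a GBK machine. The sketch below is illustrative only: `decode_like_text_mode` is a hypothetical helper (it is not part of code-review-graph or of `subprocess`) that imitates the symptom of getting `None` back when decoding fails, and the file name is made up.

```python
from typing import Optional


def decode_like_text_mode(raw: bytes, encoding: str) -> Optional[str]:
    """Hypothetical stand-in for the pipe decoding done under text=True.

    Returns None when the bytes cannot be decoded, imitating the
    `result.stdout is None` symptom described in the steps above.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# UTF-8 bytes such as Git might emit for a non-ASCII file name (made-up path).
git_output = "docs/中.md\n".encode("utf-8")

as_gbk = decode_like_text_mode(git_output, "gbk")     # mismatched codec
as_utf8 = decode_like_text_mode(git_output, "utf-8")  # matching codec

print(as_gbk)               # the mismatched decode yields no text at all
print(as_utf8.splitlines())
```

Calling `.splitlines()` on `as_gbk` would raise the same `AttributeError` shown in the error message, because the value is `None` rather than a string.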

Affected Code Locations

All subprocess.run() calls in incremental.py:

  • Line ~356: get_all_tracked_files() - git ls-files
  • Line ~420: get_changed_files() - git diff
  • Line ~450: get_file_content_at_commit() - git show
  • And other git command invocations

Impact

  • Affected platforms: Windows with non-UTF-8 system encoding (Chinese, Japanese, Korean, etc.)
  • Affected operations: All git operations during graph building and updating
  • Severity: Critical - completely blocks graph building on affected systems

Solution

Add encoding='utf-8' parameter to all subprocess.run() calls:

result = subprocess.run(
    cmd,
    capture_output=True,
    text=True,
    encoding='utf-8',  # ✅ Explicitly use UTF-8
    cwd=str(repo_root),
    timeout=_GIT_TIMEOUT,
)

Proposed Changes

File: code_review_graph/incremental.py

Replace all occurrences of:

text=True,

With:

text=True, encoding='utf-8',

This ensures consistent UTF-8 decoding across all platforms.
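
An alternative to editing every call site is to route the calls through one small wrapper, so the encoding is set in a single place. This is only a sketch: `run_text`, its parameters, and the 30-second default are assumptions made for illustration (the project uses its own `_GIT_TIMEOUT`), and the PR itself changes the calls in place.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def run_text(
    cmd: Sequence[str],
    cwd: Optional[Path] = None,
    timeout: float = 30.0,  # assumed default, not the project's _GIT_TIMEOUT
) -> subprocess.CompletedProcess:
    """Run a command and decode its output as UTF-8 regardless of locale."""
    return subprocess.run(
        list(cmd),
        capture_output=True,
        text=True,
        encoding="utf-8",  # explicit, so the locale default is never consulted
        cwd=str(cwd) if cwd is not None else None,
        timeout=timeout,
    )


# Demonstration: a child process that writes raw UTF-8 bytes to stdout.
# The path is written with escapes so the command line itself stays ASCII.
child = "import sys; sys.stdout.buffer.write('docs/\\u4e2d\\u6587.md\\n'.encode('utf-8'))"
result = run_text([sys.executable, "-c", child])
print(result.stdout.splitlines())
```

With a wrapper like this, a later call added without `encoding=` cannot silently reintroduce the bug, at the cost of touching more lines than the in-place fix.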

Testing

Before Fix

# On Chinese Windows (GBK)
$ python -m code_review_graph build
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80...

After Fix

# On Chinese Windows (GBK)
$ python -m code_review_graph build
Building code-review-graph...
✓ Parsed 437 files
✓ Created 14,888 nodes
✓ Created 86,584 edges
[OK] Graph built successfully!

Workaround (Current)

Users on affected systems can use this workaround script:

#!/usr/bin/env python3
"""Temporary workaround for Windows encoding issue."""
import shutil
from pathlib import Path

def patch_incremental_py(file_path: Path):
    """Add encoding='utf-8' after each text=True, and return the backup path."""
    backup_path = file_path.with_suffix('.py.backup')
    content = file_path.read_text(encoding='utf-8')
    marker = "text=True, encoding='utf-8',"
    # Skip if already patched: a second pass would duplicate the keyword
    # argument and overwrite the backup with the patched file.
    if marker not in content:
        shutil.copy2(file_path, backup_path)
        file_path.write_text(content.replace('text=True,', marker), encoding='utf-8')
    return backup_path

# Patch, build, restore...

Additional Notes

  • This change is backward compatible: on platforms whose default encoding is already UTF-8, behavior is unchanged
  • Python's subprocess documentation recommends explicit encoding for cross-platform compatibility
  • Similar issues may exist in other files that call subprocess.run() with text=True
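
To look for the similar call sites mentioned in the last note, a small static scan can help. The sketch below is an assumption about how one might do it, not tooling from the project: `find_text_calls_without_encoding` is a hypothetical helper, and it flags any call that passes `text=True` (or the older `universal_newlines=True`) without `encoding=`, not only `subprocess.run()`.

```python
import ast
from typing import List


def find_text_calls_without_encoding(source: str) -> List[int]:
    """Return line numbers of calls that request text mode but omit encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Map keyword names to their AST values; skip **kwargs (arg is None).
        kwargs = {kw.arg: kw.value for kw in node.keywords if kw.arg is not None}
        text_mode = any(
            isinstance(kwargs.get(name), ast.Constant) and kwargs[name].value is True
            for name in ("text", "universal_newlines")
        )
        if text_mode and "encoding" not in kwargs:
            hits.append(node.lineno)
    return sorted(hits)


# Made-up sample: line 2 lacks encoding=, line 3 already has it.
sample = (
    "import subprocess\n"
    "subprocess.run(['git', 'ls-files'], capture_output=True, text=True)\n"
    "subprocess.run(['git', 'diff'], text=True, encoding='utf-8')\n"
)
print(find_text_calls_without_encoding(sample))
```

To scan a package, the function can be applied to the text of each `*.py` file, for example via `Path.rglob("*.py")` and `read_text(encoding="utf-8")`.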


Environment:

  • OS: Windows 10 Pro (Chinese locale, GBK encoding)
  • Python: 3.10+
  • code-review-graph: 2.3.2
  • Git: 2.x (outputs UTF-8)

Reproduction:

  1. Install code-review-graph on Chinese Windows
  2. Run code-review-graph build in any git repository
  3. Observe UnicodeDecodeError
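
Before reproducing, it can be useful to confirm that a machine is in the affected configuration. The following is a minimal sketch, assuming that text mode without `encoding=` falls back to the locale's preferred encoding; `is_utf8` is a hypothetical helper introduced here for illustration.

```python
import codecs
import locale


def is_utf8(encoding_name: str) -> bool:
    """Return True if the given codec name resolves to Python's UTF-8 codec."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False  # unknown codec name: treat as not UTF-8


preferred = locale.getpreferredencoding(False)
print(f"preferred encoding: {preferred}")
print("text=True without encoding= would decode as UTF-8:", is_utf8(preferred))
```

On a machine where the second line prints `False`, the unpatched build is expected to be at risk whenever Git emits non-ASCII output.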

Expected behavior:
Graph builds successfully regardless of system encoding.

Actual behavior:
Build fails with encoding error on non-UTF-8 Windows systems.

dpesch added a commit to 11com7/code-review-graph that referenced this pull request May 3, 2026
- PR tirth8205#400 (niveku): auto-select ThreadPoolExecutor on Windows MCP stdio
  to avoid ProcessPool pipe-handle inheritance deadlock; stdin=DEVNULL
  on all git subprocess calls (closes upstream tirth8205#46, tirth8205#136, tirth8205#401)
- PR tirth8205#409 (suainam): encoding='utf-8' on subprocess.run() calls for
  Windows non-UTF-8 locales (GBK, Shift-JIS, etc.)

These are pending upstream review. Merged here for internal use.