Pin UTF-8 encoding for cache and manifest I/O #668

Closed
azizur100389 wants to merge 1 commit into safishamsi:v3 from azizur100389:fix/utf8-encoding-cache-manifest

@azizur100389
Contributor

Summary

On Windows the default text codec is typically cp1252 (it follows the system locale), so Path.read_text() / Path.write_text() called without an explicit encoding can raise UnicodeDecodeError / UnicodeEncodeError when the graphify cache or manifest holds non-ASCII content, for example:

  • Chinese / Japanese / Korean identifier names extracted from source code
  • File paths under directories with accented Latin or CJK characters
  • Emoji or non-ASCII labels in semantic extraction output
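
A minimal sketch of the failure mode, for readers who are not on Windows. It names the cp1252 codec explicitly, so it behaves the same on any platform; the helper names `can_encode` and `decode_utf8_bytes_as_cp1252` are hypothetical and are not part of graphify.

```python
from typing import Optional


def can_encode(text: str, codec: str) -> bool:
    """Return True if `text` can be encoded with `codec` without error."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def decode_utf8_bytes_as_cp1252(text: str) -> Optional[str]:
    """Encode `text` as UTF-8, then decode those bytes as cp1252.

    This mimics reading a UTF-8 file with a cp1252 default codec.
    Returns None when the bytes are not valid cp1252.
    """
    try:
        return text.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        return None


# Writing: CJK text has no cp1252 representation at all.
print(can_encode("变量", "cp1252"))   # False
print(can_encode("变量", "utf-8"))    # True

# Reading: some UTF-8 byte sequences are invalid cp1252 (decode error),
# others decode "successfully" into the wrong characters.
print(decode_utf8_bytes_as_cp1252("变"))  # None
print(decode_utf8_bytes_as_cp1252("é"))   # Ã©
```

The last line is the quieter variant of the same problem: accented Latin text may not raise at all and instead comes back as mojibake.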

Four call sites are affected:

  • cache.load_cached(): read_text(encoding="utf-8") + catch UnicodeDecodeError
  • cache.save_cached(): write_text(encoding="utf-8"), with ensure_ascii=False passed to the JSON serializer so CJK/emoji stay readable UTF-8 instead of \uXXXX escapes
  • detect.load_manifest(): read_text(encoding="utf-8")
  • detect.save_manifest(): write_text(encoding="utf-8"), with ensure_ascii=False passed to the JSON serializer
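
The pattern applied at the four call sites can be sketched roughly as follows. This is an illustration, not the actual graphify code: `save_json_utf8` and `load_json_utf8` are hypothetical stand-ins for the cache and manifest functions, and treating an unreadable file as a miss (returning `None`) is an assumption about the desired behaviour.

```python
import json
from pathlib import Path
from typing import Optional


def save_json_utf8(path: Path, payload: dict) -> None:
    """Write `payload` as JSON, pinned to UTF-8 regardless of locale."""
    # ensure_ascii is an option of json.dumps, not of write_text:
    # with False, non-ASCII characters are emitted as-is instead of \uXXXX.
    text = json.dumps(payload, ensure_ascii=False, indent=2)
    path.write_text(text, encoding="utf-8")


def load_json_utf8(path: Path) -> Optional[dict]:
    """Read JSON written by save_json_utf8; return None if unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError, json.JSONDecodeError):
        # Assumed policy: a missing, mis-encoded, or corrupt file
        # is treated like a cache miss rather than a crash.
        return None
```

Catching UnicodeDecodeError on the read side matters mainly for files written before the encoding was pinned, which may contain bytes that are not valid UTF-8.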

The repo already uses encoding="utf-8" in __main__.py and export.py; this extends the same convention to the file types most likely to hold user data.

Test plan

  • New tests/test_encoding.py (13 tests) covers cache + manifest roundtrips with Chinese, Japanese, Korean, accented Latin, emoji, and mixed-script payloads, including non-ASCII filenames
  • Verifies cache files on disk are valid UTF-8 JSON readable by external tools
  • No regressions in existing tests
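
For illustration, a roundtrip check in the spirit of the test plan might look like the sketch below. It is not the contents of tests/test_encoding.py; the sample strings, the `roundtrip` helper, and the `manifest.json` file name are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample payloads, one per script family in the test plan.
SAMPLES = {
    "chinese": "变量名",
    "japanese": "関数",
    "korean": "변수",
    "accented": "café/résumé.py",
    "emoji": "🚀 launch",
}


def roundtrip(payload: dict, directory: Path):
    """Write payload as UTF-8 JSON, then return (reloaded payload, raw bytes)."""
    target = directory / "manifest.json"
    target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    raw = target.read_bytes()
    raw.decode("utf-8")  # strict decode: raises if the file is not valid UTF-8
    return json.loads(target.read_text(encoding="utf-8")), raw


def check_all() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        for text in SAMPLES.values():
            # Non-ASCII text is used both as a key (like a file path) and a value.
            payload = {text: {"label": text}}
            loaded, raw = roundtrip(payload, Path(tmp))
            assert loaded == payload
            # The text is stored as literal UTF-8, not as \uXXXX escapes.
            assert text.encode("utf-8") in raw


check_all()
```

Checking the raw bytes, and not only the reloaded value, is what covers the "readable by external tools" point: a file full of \uXXXX escapes would also roundtrip through Python but would fail the byte-level assertion.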

On Windows the default text codec is cp1252, so Path.read_text() and
Path.write_text() raise UnicodeDecodeError / UnicodeEncodeError when
the cache or manifest contain non-ASCII content (CJK identifier names,
accented Latin file paths, emoji in node labels, etc.).

Fix four call sites:
- cache.load_cached(): read_text(encoding="utf-8") + catch UnicodeDecodeError
- cache.save_cached(): write_text(encoding="utf-8") with ensure_ascii=False
  so CJK/emoji are stored as readable UTF-8 instead of \uXXXX escapes
- detect.load_manifest(): read_text(encoding="utf-8")
- detect.save_manifest(): write_text(encoding="utf-8") with ensure_ascii=False

The repo already uses encoding="utf-8" in __main__.py and export.py;
this just extends the same convention to the file types most likely
to contain user data with non-ASCII content.
@azizur100389
Contributor Author

Closing as redundant: I checked the v6 default branch, and graphify/cache.py already uses encoding="utf-8" (lines 87 and 95); detect.load_manifest/save_manifest already use it too. This PR was opened against v3 before I realized v6 was the active branch. Sorry for the noise.
