fix: handle lone surrogates in MCP write tools#1235
Conversation
|
Thanks for the surrogate fix — Destructive rebase against an old baseThe diff (
These were intentional fixes for known crash modes; merging this PR would re-open all of them. Hardcoded contributor pathsTwo changes can't apply to any other user:
Path forwardThe actual fix is small (~10 lines: Removing from the v3.3.5 milestone for now. |
|
Hi Igor, Thanks for the detailed review! I've opened a new clean PR as you suggested: #1422 Changes: - Branched off latest develop (no destructive rebase) - Only mcp_server.py is touched (+21 / -2) - No hardcoded paths - _clean() applied in tool_add_drawer and tool_diary_write before SHA256 hash and ChromaDB write Please take a look when you have a moment. Appreciate the guidance! Best, YuanYi
…---------- Original ----------
From: ***@***.***
Subject: Re: [MemPalace/mempalace] fix: handle lone surrogates in MCP write tools (PR #1235)
igorls left a comment (MemPalace/mempalace#1235)
Thanks for the surrogate fix — \udc95-style lone surrogates from MCP clients are a real problem and the _clean() approach is the right shape. Unfortunately I can't merge this PR as-is; flagging the blockers so we can land the fix properly in a fresh PR.
Destructive rebase against an old base
The diff (+181 / -213 across 2 files) is much larger than the description suggests. Most of the deletions remove recent safety fixes that landed on develop after this branch forked:
The entire _vector_disabled system (#1222 SIGSEGV mitigation): _refresh_vector_disabled_flag(), _tool_status_via_sqlite(), the BM25 fallback in tool_check_duplicate, and the vector_disabled reporting in tool_search / tool_reconnect
_HNSW_BLOAT_GUARD import and application
_pin_hnsw_threads() calls (HNSW race fix #974 / #965)
These were intentional fixes for known crash modes; merging this PR would re-open all of them.
Hardcoded contributor paths
Two changes can't apply to any other user:
mcp_server.py adds logging.FileHandler(r\"C:\Users\SJC\mempalace_mcp.log\", encoding=\"utf-8\") — would crash on every non-SJC machine on Windows, and on Linux/macOS entirely
docs/fix-lone-surrogate.md references D:\ProgramData\Python312\Lib\site-packages\mempalace\mcp_server.py and D:\gitfork\mcp_server.py — local-machine paths that don't belong in repo docs
Path forward
The actual fix is small (~10 lines: _clean() helper + use it in tool_add_drawer and tool_diary_write + surrogatepass on the SHA256 hashes). Would you be willing to open a fresh PR against current develop with just those changes? Happy to review quickly.
Removing from the v3.3.5 milestone for now.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
Triage notifications on the go with GitHub Mobile for iOS or Android.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
Add tests/test_clean_lone_surrogates.py covering: Unit tests (TestCleanLoneSurrogates, 11 cases): - _clean() passes normal ASCII and CJK strings unchanged - lone surrogates (high/low, single/multiple) are replaced with U+FFFD - real emoji (\U0001f600 astral code points) pass through unchanged - empty string, all-surrogate string, SHA-256-hash-after-clean - the specific \udcad surrogate observed in WorkBuddy production logs Integration tests (TestLoneSurrogateCleaning, 6 cases): - tool_add_drawer: surrogate in content and in metadata fields - tool_check_duplicate: surrogate in query - tool_search: surrogate in search query - tool_update_drawer: surrogate in updated content - tool_diary_write: surrogate in diary entry Fix test environment issue: - conftest.py redirects HOME to a temp dir, causing chromadb's ONNXMiniLM_L6_V2 to look for its ONNX model in the wrong location and trigger a 79 MB network download on every run. - Fix: at module import time, recover the real USERPROFILE from conftest._original_env and patch ONNXMiniLM_L6_V2.DOWNLOAD_PATH before any ChromaDB collection fixture is invoked.
Problem
MCP write tools (mempalace_add_drawer, mempalace_diary_write) fail with UnicodeEncodeError when processing strings containing lone surrogates injected by MCP clients like WorkBuddy.
Solution
Add _clean() function that removes lone surrogates using surrogatepass/surrogateignore encoding before writing to ChromaDB.
Changes
Testing
See docs/fix-lone-surrogate.md for full details.