fix(mcp): sanitize lone surrogates before ChromaDB ops (#1235)#1422
fix(mcp): sanitize lone surrogates before ChromaDB ops (#1235)#1422YuanYiZheXue wants to merge 3 commits into
Conversation
…e#1235) MCP clients can emit lone surrogates (\udc00–\udfff) that cause Python's str.encode('utf-8') to raise UnicodeEncodeError, which bubbles up as -32000 Internal Error from ChromaDB. Add _clean(text) helper that uses 'surrogatepass'/'replace' to remove lone surrogates before the string reaches ChromaDB. Apply it in tool_add_drawer and tool_diary_write, and use 'surrogatepass' error handler on the SHA256 hash inputs for defensive idempotency.
There was a problem hiding this comment.
Code Review
This pull request introduces a _clean utility to strip lone surrogates from input strings, mitigating potential UnicodeEncodeError crashes in ChromaDB. The logic is integrated into tool_add_drawer and tool_diary_write. Reviewers suggest extending this cleaning to additional parameters and metadata fields across other tools to ensure consistency. Furthermore, the docstring for _clean requires technical correction regarding paired surrogates, and the use of surrogatepass in hashing operations is identified as redundant for already-cleaned strings.
| def _clean(text: str) -> str: | ||
| """Remove lone surrogates that break UTF-8 encoding (issue #1235). | ||
|
|
||
| MCP clients (Claude Desktop, WorkBuddy) can emit lone surrogates | ||
| (``\\udc00``–``\\udfff``) when relaying binary-in-Unicode or | ||
| corrupted text. Python's ``str.encode('utf-8')`` raises | ||
| ``UnicodeEncodeError`` on these; ChromaDB's ``add()`` / | ||
| ``upsert()`` then crashes with -32000 Internal Error. | ||
|
|
||
| Replace lone surrogates with U+FFFD (REPLACEMENT CHARACTER) so | ||
| the string is legal UTF-8 while preserving as much content as | ||
| possible. Paired surrogates (valid UTF-16) are left alone — | ||
| the Python interpreter already decoded them correctly. | ||
| """ | ||
| return text.encode("utf-8", "surrogatepass").decode("utf-8", "replace") |
There was a problem hiding this comment.
This is a critical fix for ChromaDB stability. However, the cleaning logic should be applied to all user-provided strings that reach ChromaDB to prevent similar crashes in other tools.
Specifically, _clean() is currently missing in:
tool_check_duplicatefor thecontentparameter.tool_update_drawerfor thecontentparameter.tool_searchfor thequeryparameter (if not handled bysanitize_query).- Metadata fields in
tool_add_drawersuch assource_fileandadded_by.
Also, the docstring's claim that "Paired surrogates... are left alone" is slightly misleading. In Python 3, if a str contains a pair of surrogate code points (e.g., \ud83d\ude00), this function will replace both with \ufffd\ufffd because encode('utf-8', 'surrogatepass') treats them as individual surrogates. Valid characters (like the actual emoji \U0001f600) are indeed left alone, but they are no longer surrogates at that point.
| wing = sanitize_name(wing, "wing") | ||
| room = sanitize_name(room, "room") | ||
| content = sanitize_content(content) | ||
| content = _clean(content) |
There was a problem hiding this comment.
The source_file and added_by parameters should also be cleaned before being stored in the collection metadata. Since they are not sanitized via sanitize_name, they could contain lone surrogates that would cause ChromaDB to crash during the upsert operation.
| content = _clean(content) | |
| content = _clean(content) | |
| source_file = _clean(source_file) if source_file else source_file | |
| added_by = _clean(added_by) if added_by else added_by |
|
|
||
| drawer_id = ( | ||
| f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode()).hexdigest()[:24]}" | ||
| f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode('utf-8', 'surrogatepass')).hexdigest()[:24]}" |
There was a problem hiding this comment.
Since content has already been cleaned and wing/room are sanitized (blocking surrogates), the use of surrogatepass here is redundant. Standard .encode() would be sufficient and clearer.
| f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode('utf-8', 'surrogatepass')).hexdigest()[:24]}" | |
| f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode()).hexdigest()[:24]}" |
- Fix _clean() docstring (paired surrogate description was misleading) - Apply _clean() to source_file and added_by metadata fields - Remove redundant surrogatepass from SHA256 hashes (content already cleaned) - Apply _clean() to content in tool_check_duplicate - Apply _clean() to new_doc in tool_update_drawer - Apply _clean() to sanitized query in tool_search Addresses feedback from gemini-code-assist[bot] on PR MemPalace#1422.
igorls
left a comment
There was a problem hiding this comment.
Thanks for the fix — the implementation is correct and addresses a real crash.
Requested changes before merge:
-
Add test coverage. A helper touching 5 production tool paths needs at least a unit test for
_clean()(strings with/without surrogates, valid surrogate pairs) and one integration test exercising a tool path (e.g.tool_add_drawer) with a surrogate-laden payload to confirm ChromaDB no longer crashes. -
Trim the docstring. The 14-line explanation belongs in the commit message / issue #1235. Per project conventions, keep the docstring to 2–3 lines max.
-
PR title — already updated to a conventional-commit style.
Add tests/test_clean_lone_surrogates.py covering: Unit tests (TestCleanLoneSurrogates, 11 cases): - _clean() passes normal ASCII and CJK strings unchanged - lone surrogates (high/low, single/multiple) are replaced with U+FFFD - real emoji (\U0001f600 astral code points) pass through unchanged - empty string, all-surrogate string, SHA-256-hash-after-clean - the specific \udcad surrogate observed in WorkBuddy production logs Integration tests (TestLoneSurrogateCleaning, 6 cases): - tool_add_drawer: surrogate in content and in metadata fields - tool_check_duplicate: surrogate in query - tool_search: surrogate in search query - tool_update_drawer: surrogate in updated content - tool_diary_write: surrogate in diary entry Fix test environment issue: - conftest.py redirects HOME to a temp dir, causing chromadb's ONNXMiniLM_L6_V2 to look for its ONNX model in the wrong location and trigger a 79 MB network download on every run. - Fix: at module import time, recover the real USERPROFILE from conftest._original_env and patch ONNXMiniLM_L6_V2.DOWNLOAD_PATH before any ChromaDB collection fixture is invoked.
Tests AddedTests have been added in tests/test_clean_lone_surrogates.py: Unit tests (11) for _clean(): lone surrogates, real emoji preservation, empty strings, SHA-256 compatibility Integration tests (6) covering all 5 tool paths:
Commit: 726495d — all 17 tests pass. Note: A test-environment fix is included. ONNXMiniLM_L6_V2.DOWNLOAD_PATH is patched at module import time using conftest._original_env to avoid 79 MB ONNX model re-downloads on every test run. |
No description provided.