Skip to content

fix(mcp): sanitize lone surrogates before ChromaDB ops (#1235)#1422

Open
YuanYiZheXue wants to merge 3 commits into
MemPalace:developfrom
YuanYiZheXue:fix/lone-surrogate-clean
Open

fix(mcp): sanitize lone surrogates before ChromaDB ops (#1235)#1422
YuanYiZheXue wants to merge 3 commits into
MemPalace:developfrom
YuanYiZheXue:fix/lone-surrogate-clean

Conversation

@YuanYiZheXue
Copy link
Copy Markdown

No description provided.

…e#1235)

MCP clients can emit lone surrogates (\udc00–\udfff) that
cause Python's str.encode('utf-8') to raise UnicodeEncodeError,
which bubbles up as -32000 Internal Error from ChromaDB.

Add _clean(text) helper that uses 'surrogatepass'/'replace' to
remove lone surrogates before the string reaches ChromaDB.
Apply it in tool_add_drawer and tool_diary_write, and use
'surrogatepass' error handler on the SHA256 hash inputs for
defensive idempotency.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a _clean utility to strip lone surrogates from input strings, mitigating potential UnicodeEncodeError crashes in ChromaDB. The logic is integrated into tool_add_drawer and tool_diary_write. Reviewers suggest extending this cleaning to additional parameters and metadata fields across other tools to ensure consistency. Furthermore, the docstring for _clean requires technical correction regarding paired surrogates, and the use of surrogatepass in hashing operations is identified as redundant for already-cleaned strings.

Comment thread mempalace/mcp_server.py
Comment on lines +90 to +104
def _clean(text: str) -> str:
"""Remove lone surrogates that break UTF-8 encoding (issue #1235).

MCP clients (Claude Desktop, WorkBuddy) can emit lone surrogates
(``\\udc00``–``\\udfff``) when relaying binary-in-Unicode or
corrupted text. Python's ``str.encode('utf-8')`` raises
``UnicodeEncodeError`` on these; ChromaDB's ``add()`` /
``upsert()`` then crashes with -32000 Internal Error.

Replace lone surrogates with U+FFFD (REPLACEMENT CHARACTER) so
the string is legal UTF-8 while preserving as much content as
possible. Paired surrogates (valid UTF-16) are left alone —
the Python interpreter already decoded them correctly.
"""
return text.encode("utf-8", "surrogatepass").decode("utf-8", "replace")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This is a critical fix for ChromaDB stability. However, the cleaning logic should be applied to all user-provided strings that reach ChromaDB to prevent similar crashes in other tools.

Specifically, _clean() is currently missing in:

  1. tool_check_duplicate for the content parameter.
  2. tool_update_drawer for the content parameter.
  3. tool_search for the query parameter (if not handled by sanitize_query).
  4. Metadata fields in tool_add_drawer such as source_file and added_by.

Also, the docstring's claim that "Paired surrogates... are left alone" is slightly misleading. In Python 3, if a str contains a pair of surrogate code points (e.g., \ud83d\ude00), this function will replace both with \ufffd\ufffd because encode('utf-8', 'surrogatepass') treats them as individual surrogates. Valid characters (like the actual emoji \U0001f600) are indeed left alone, but they are no longer surrogates at that point.

Comment thread mempalace/mcp_server.py
wing = sanitize_name(wing, "wing")
room = sanitize_name(room, "room")
content = sanitize_content(content)
content = _clean(content)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The source_file and added_by parameters should also be cleaned before being stored in the collection metadata. Since they are not sanitized via sanitize_name, they could contain lone surrogates that would cause ChromaDB to crash during the upsert operation.

Suggested change
content = _clean(content)
content = _clean(content)
source_file = _clean(source_file) if source_file else source_file
added_by = _clean(added_by) if added_by else added_by

Comment thread mempalace/mcp_server.py Outdated

drawer_id = (
f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode()).hexdigest()[:24]}"
f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode('utf-8', 'surrogatepass')).hexdigest()[:24]}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since content has already been cleaned and wing/room are sanitized (blocking surrogates), the use of surrogatepass here is redundant. Standard .encode() would be sufficient and clearer.

Suggested change
f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode('utf-8', 'surrogatepass')).hexdigest()[:24]}"
f"drawer_{wing}_{room}_{hashlib.sha256((wing + room + content).encode()).hexdigest()[:24]}"

- Fix _clean() docstring (paired surrogate description was misleading)
- Apply _clean() to source_file and added_by metadata fields
- Remove redundant surrogatepass from SHA256 hashes (content already cleaned)
- Apply _clean() to content in tool_check_duplicate
- Apply _clean() to new_doc in tool_update_drawer
- Apply _clean() to sanitized query in tool_search

Addresses feedback from gemini-code-assist[bot] on PR MemPalace#1422.
@igorls igorls changed the title test fix(mcp): sanitize lone surrogates before ChromaDB ops (#1235) May 15, 2026
Copy link
Copy Markdown
Member

@igorls igorls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix — the implementation is correct and addresses a real crash.

Requested changes before merge:

  1. Add test coverage. A helper touching 5 production tool paths needs at least a unit test for _clean() (strings with/without surrogates, valid surrogate pairs) and one integration test exercising a tool path (e.g. tool_add_drawer) with a surrogate-laden payload to confirm ChromaDB no longer crashes.

  2. Trim the docstring. The 14-line explanation belongs in the commit message / issue #1235. Per project conventions, keep the docstring to 2–3 lines max.

  3. PR title — already updated to a conventional-commit style.

@igorls igorls added bug Something isn't working area/mcp MCP server and tools labels May 15, 2026
@igorls igorls added this to the v3.3.6 milestone May 15, 2026
Add tests/test_clean_lone_surrogates.py covering:

Unit tests (TestCleanLoneSurrogates, 11 cases):
- _clean() passes normal ASCII and CJK strings unchanged
- lone surrogates (high/low, single/multiple) are replaced with U+FFFD
- real emoji (\U0001f600 astral code points) pass through unchanged
- empty string, all-surrogate string, SHA-256-hash-after-clean
- the specific \udcad surrogate observed in WorkBuddy production logs

Integration tests (TestLoneSurrogateCleaning, 6 cases):
- tool_add_drawer: surrogate in content and in metadata fields
- tool_check_duplicate: surrogate in query
- tool_search: surrogate in search query
- tool_update_drawer: surrogate in updated content
- tool_diary_write: surrogate in diary entry

Fix test environment issue:
- conftest.py redirects HOME to a temp dir, causing chromadb's
  ONNXMiniLM_L6_V2 to look for its ONNX model in the wrong location
  and trigger a 79 MB network download on every run.
- Fix: at module import time, recover the real USERPROFILE from
  conftest._original_env and patch ONNXMiniLM_L6_V2.DOWNLOAD_PATH
  before any ChromaDB collection fixture is invoked.
@YuanYiZheXue
Copy link
Copy Markdown
Author

Tests Added

Tests have been added in tests/test_clean_lone_surrogates.py:

Unit tests (11) for _clean(): lone surrogates, real emoji preservation, empty strings, SHA-256 compatibility

Integration tests (6) covering all 5 tool paths:

  • tool_add_drawer (content + metadata fields)
  • tool_check_duplicate
  • tool_search
  • tool_update_drawer
  • tool_diary_write

Commit: 726495d — all 17 tests pass.

Note: A test-environment fix is included. ONNXMiniLM_L6_V2.DOWNLOAD_PATH is patched at module import time using conftest._original_env to avoid 79 MB ONNX model re-downloads on every test run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP server and tools bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants