fix: harden KnowledgeGraph and MCP server for multi-process access #948
felipecpaiva wants to merge 4 commits into MemPalace:develop from
Conversation
When multiple mcp-proxy SSE connections share the same mempalace data, concurrent processes compete for SQLite and ChromaDB resources.

KnowledgeGraph changes:
- Increase busy_timeout from 10s to 60s
- Add _sqlite_retry decorator with exponential backoff for lock/busy errors
- Use BEGIN IMMEDIATE for writes (detect contention at transaction start)
- Add WAL autocheckpoint and journal_size_limit pragmas
- Register atexit handler for clean WAL shutdown

MCP server changes:
- Rate-limit chroma.sqlite3 mtime checks to 5s intervals to prevent PersistentClient recreation (and HNSW index reload) on every write
- Add _refresh_db_mtime() after ChromaDB writes to prevent self-triggered reconnects
- Bypass mtime cooldown for safety-critical DB disappearance detection
- Fix tool_reconnect to fully clear client and mtime state

Tests:
- Retry decorator: lock retry, exhaustion, non-lock errors, busy variant
- Connection pragmas: wal_autocheckpoint, journal_size_limit, WAL mode
- Multi-process concurrent writes: 4 processes × 20 triples, zero failures
Hi, `KnowledgeGraph.__init__` registers a bound method with atexit, which holds a strong reference to the instance and prevents it from being garbage-collected; repeated short-lived KnowledgeGraph creations will accumulate open SQLite connections until process exit.

Severity: action required | Category: reliability
How to fix: Avoid atexit holding self
Agent prompt to fix - you can give this to your LLM of choice:
We noticed a couple of other issues in this PR as well — happy to share if helpful.

Found by Qodo code review
Registering `self.close` in `KnowledgeGraph.__init__` kept the atexit registry strongly referencing every instance, leaking SQLite connections for short-lived KGs (e.g. fact_checker._check_kg_contradictions, which constructs one per call). Instances could never be GC'd until process exit.

- Drop per-instance atexit registration
- Close the KG explicitly in fact_checker via try/finally
- Preserve clean WAL checkpoint on shutdown for the long-lived MCP server KG by registering atexit on the module-level singleton, which is retained by the module anyway, so no new references are leaked
- Regression test: weakref to a closed KG must resolve to None after GC
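The resulting ownership split can be sketched like this. These are hypothetical stand-ins for KnowledgeGraph, the fact_checker helper, and the server's singleton accessor, not the PR's actual code.

```python
import atexit
import gc
import weakref

class KnowledgeGraph:
    """Stand-in: note there is no atexit registration in __init__."""
    def __init__(self):
        self.conn = object()  # placeholder for the sqlite3 connection

    def close(self):
        self.conn = None

def check_kg_contradictions():
    """Short-lived use: the caller owns cleanup via try/finally."""
    kg = KnowledgeGraph()
    try:
        return []  # ... run contradiction queries against kg ...
    finally:
        kg.close()

# Long-lived use: the module retains the singleton anyway, so the
# atexit registration adds no extra reference.
_kg_singleton = None

def get_kg():
    global _kg_singleton
    if _kg_singleton is None:
        _kg_singleton = KnowledgeGraph()
        atexit.register(_kg_singleton.close)
    return _kg_singleton

# Regression check: a closed, dereferenced short-lived KG is collectable.
kg = KnowledgeGraph()
ref = weakref.ref(kg)
kg.close()
del kg
gc.collect()
assert ref() is None
```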
Thanks for flagging this — good catch. Addressed in 2feaa71.

What changed

Why this shape

You listed three options: I took (1) + (3). The weakref variant (2) would work, but option (1) is simpler and makes ownership explicit: the MCP server owns the process-wide singleton, and short-lived constructors are responsible for their own cleanup.

A grep for … Suite is green (…). Happy to hear the other issues you noticed — please share.
Hi maintainers — both workflow runs on this PR are sitting in … Locally on …

Happy to rebase onto latest
cc @bensig @igorls @milla-jovovich — friendly ping on the CI approval in the comment above. The …
@felipecpaiva thx for this - small lint issue |
@bensig Yeah, I missed that due to a ruff version difference.
Integrate the upstream HNSW capacity probe (_refresh_vector_disabled_flag) into both the first-call and reconnect paths of _get_client(), alongside our mtime rate-limiting. Merge the _last_mtime_check and _vector_disabled resets in tool_reconnect.
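For reference, the rate-limited mtime check plus the post-write refresh can be sketched as below. The class, constant, and state layout are illustrative assumptions, and `_connect` returns a placeholder where the real server would build a chromadb.PersistentClient.

```python
import os
import time

MTIME_CHECK_INTERVAL = 5.0  # seconds between chroma.sqlite3 stat() calls

class ClientCache:
    """Sketch of mtime-rate-limited client reuse (names hypothetical)."""

    def __init__(self, db_path):
        self.db_path = db_path
        self._client = None
        self._last_mtime = None
        self._last_mtime_check = 0.0

    def _get_client(self):
        now = time.monotonic()
        if self._client is not None:
            if not os.path.exists(self.db_path):
                # safety-critical: DB disappearance bypasses the cooldown
                self._client = None
                self._last_mtime = None
            elif now - self._last_mtime_check >= MTIME_CHECK_INTERVAL:
                self._last_mtime_check = now
                mtime = os.path.getmtime(self.db_path)
                if mtime != self._last_mtime:
                    self._client = None  # external write detected: recreate
        if self._client is None:
            self._client = self._connect()
        return self._client

    def _connect(self):
        self._last_mtime = (
            os.path.getmtime(self.db_path) if os.path.exists(self.db_path) else None
        )
        self._last_mtime_check = time.monotonic()
        return object()  # placeholder for chromadb.PersistentClient(...)

    def _refresh_db_mtime(self):
        """Call after our own writes so we don't reconnect to ourselves."""
        if os.path.exists(self.db_path):
            self._last_mtime = os.path.getmtime(self.db_path)
```

The key behaviors: within the cooldown no stat() happens at all, an external mtime change after the cooldown triggers one client rebuild, and a write followed by `_refresh_db_mtime()` does not.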
Also @bensig: the branch had a small merge conflict; it's fixed now.
Summary

When mempalace runs behind mcp-proxy (SSE mode) with multiple concurrent clients, two classes of concurrency bugs surface:

1. SQLite "database is locked" in KnowledgeGraph — busy_timeout was only 10s with no application-level retry. Concurrent writers from separate mcp-proxy processes exhaust the timeout.
2. ChromaDB client cache thrashing — every write changes chroma.sqlite3 mtime. Other processes detect this and recreate PersistentClient, reloading the full HNSW index from disk on every operation.

KnowledgeGraph fixes

- busy_timeout 10s → 60s
- _sqlite_retry decorator: exponential backoff with jitter (5 retries, only for "locked"/"busy" errors)
- BEGIN IMMEDIATE for write transactions (detect contention at start, not mid-transaction)
- PRAGMA wal_autocheckpoint=1000 + journal_size_limit=64MB (manage WAL growth)
- atexit.register(self.close) for clean WAL checkpointing on shutdown

MCP server fixes

- Rate-limit chroma.sqlite3 stat/mtime checks to 5-second intervals
- _refresh_db_mtime() after writes prevents self-inflicted client recreation
- tool_reconnect now fully clears client + mtime state

Tests

- TestSQLiteRetryDecorator — 5 cases: retry success, exhaustion, non-lock errors, busy variant
- TestConnectionPragmas — 3 cases: autocheckpoint, journal_size_limit, WAL mode
- TestMultiProcessLocking — 4 processes × 20 triples to same DB file, zero failures

Test plan

- python -m pytest tests/ -v --ignore=tests/benchmarks — 958 passed
- ruff check + ruff format --check — clean
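The multi-process scenario above can be reproduced standalone with plain sqlite3. The schema and table name here are illustrative, not mempalace's real schema; the point is that WAL mode, a generous busy_timeout, and BEGIN IMMEDIATE let 4 concurrent writer processes complete without "database is locked" failures.

```python
import multiprocessing as mp
import os
import sqlite3
import tempfile

def _writer(db_path, worker_id, n_rows):
    """Insert rows using BEGIN IMMEDIATE so contention surfaces at
    transaction start, where busy_timeout can wait it out."""
    # isolation_level=None: manual transaction control
    conn = sqlite3.connect(db_path, timeout=60, isolation_level=None)
    conn.execute("PRAGMA busy_timeout=60000")
    for i in range(n_rows):
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        conn.execute(
            "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
            (f"s{worker_id}", "wrote", f"o{i}"),
        )
        conn.execute("COMMIT")
    conn.close()

def run_demo(n_procs=4, n_rows=20):
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.commit()
    conn.close()

    procs = [
        mp.Process(target=_writer, args=(db_path, w, n_rows))
        for w in range(n_procs)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    conn = sqlite3.connect(db_path)
    count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
    conn.close()
    for suffix in ("", "-wal", "-shm"):  # clean up WAL sidecar files too
        try:
            os.unlink(db_path + suffix)
        except FileNotFoundError:
            pass
    return count  # expect n_procs * n_rows rows, zero failures

if __name__ == "__main__":
    print(run_demo())
```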