
fix: harden KnowledgeGraph and MCP server for multi-process access#948

Open
felipecpaiva wants to merge 4 commits into MemPalace:develop from felipecpaiva:fix/multi-process-safety

Conversation

@felipecpaiva

Summary

When mempalace runs behind mcp-proxy (SSE mode) with multiple concurrent clients, two classes of concurrency bugs surface:

  1. SQLite "database is locked" in KnowledgeGraph — busy_timeout was only 10s with no application-level retry. Concurrent writers from separate mcp-proxy processes exhaust the timeout.

  2. ChromaDB client cache thrashing — every write changes chroma.sqlite3 mtime. Other processes detect this and recreate PersistentClient, reloading the full HNSW index from disk on every operation.

KnowledgeGraph fixes

  • busy_timeout 10s → 60s
  • _sqlite_retry decorator: exponential backoff with jitter (5 retries, only for "locked"/"busy" errors)
  • BEGIN IMMEDIATE for write transactions (detect contention at start, not mid-transaction)
  • PRAGMA wal_autocheckpoint=1000 + journal_size_limit=64MB (manage WAL growth)
  • atexit.register(self.close) for clean WAL checkpointing on shutdown

MCP server fixes

  • Rate-limit chroma.sqlite3 stat/mtime checks to 5-second intervals
  • _refresh_db_mtime() after writes prevents self-inflicted client recreation
  • Bypass cooldown for safety-critical DB disappearance detection (rebuild scenarios)
  • tool_reconnect now fully clears client + mtime state
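
In sketch form, the rate-limited mtime check might work like this (class and attribute names are hypothetical; the real server wraps a ChromaDB PersistentClient rather than a bare path):

```python
import os
import time

_MTIME_CHECK_COOLDOWN = 5.0  # seconds between stat() calls

class ClientCache:
    """Illustrative sketch of cooldown-gated change detection, not the PR's code."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._last_mtime = None
        self._last_mtime_check = float("-inf")

    def _db_changed(self) -> bool:
        now = time.monotonic()
        # Safety-critical path: a vanished DB file bypasses the cooldown,
        # so rebuild scenarios are detected immediately.
        if not os.path.exists(self.db_path):
            return True
        if now - self._last_mtime_check < _MTIME_CHECK_COOLDOWN:
            return False  # within cooldown: assume unchanged
        self._last_mtime_check = now
        mtime = os.stat(self.db_path).st_mtime_ns
        if mtime != self._last_mtime:
            self._last_mtime = mtime
            return True
        return False

    def refresh_db_mtime(self) -> None:
        """Call after our own writes so they aren't treated as external changes."""
        if os.path.exists(self.db_path):
            self._last_mtime = os.stat(self.db_path).st_mtime_ns
```

The cooldown caps stat() traffic, while refresh_db_mtime() prevents the self-inflicted client recreation described above.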

Tests

  • TestSQLiteRetryDecorator — 5 cases: retry success, exhaustion, non-lock errors, busy variant
  • TestConnectionPragmas — 3 cases: autocheckpoint, journal_size_limit, WAL mode
  • TestMultiProcessLocking — 4 processes × 20 triples to same DB file, zero failures
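
The test code itself isn't shown in this thread; the sketch below uses threads with one connection each to demonstrate the same contention pattern self-containedly (SQLite locking is per connection, so the timeout behavior is comparable, though the actual test exercises true OS-level multi-process locking):

```python
import os
import sqlite3
import tempfile
import threading

def _writer(db_path: str, worker_id: int, n_rows: int, failures: list) -> None:
    # One connection per worker, mirroring one connection per process.
    conn = sqlite3.connect(db_path, timeout=60.0)
    try:
        for i in range(n_rows):
            try:
                with conn:  # implicit BEGIN ... COMMIT per insert
                    conn.execute(
                        "INSERT INTO triples (s, p, o) VALUES (?, ?, ?)",
                        (f"worker{worker_id}", "wrote", f"row{i}"),
                    )
            except sqlite3.OperationalError:
                failures.append((worker_id, i))
    finally:
        conn.close()

def run_concurrent_writes(db_path: str, n_workers: int = 4, n_rows: int = 20) -> list:
    """Hammer one DB file from n_workers concurrent connections; return failures."""
    failures: list = []
    workers = [
        threading.Thread(target=_writer, args=(db_path, w, n_rows, failures))
        for w in range(n_workers)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return failures
```

With WAL mode and a generous busy timeout, all 80 inserts should land with an empty failures list.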

Test plan

  • python -m pytest tests/ -v --ignore=tests/benchmarks — 958 passed
  • ruff check + ruff format --check — clean
  • Manual: connect 2+ Claude Code sessions via SSE, trigger concurrent kg_add calls
  • Manual: concurrent add_drawer calls from multiple clients — no cache thrashing

When multiple mcp-proxy SSE connections share the same mempalace data,
concurrent processes compete for SQLite and ChromaDB resources.

KnowledgeGraph changes:
- Increase busy_timeout from 10s to 60s
- Add _sqlite_retry decorator with exponential backoff for lock/busy errors
- Use BEGIN IMMEDIATE for writes (detect contention at transaction start)
- Add WAL autocheckpoint and journal_size_limit pragmas
- Register atexit handler for clean WAL shutdown

MCP server changes:
- Rate-limit chroma.sqlite3 mtime checks to 5s intervals to prevent
  PersistentClient recreation (and HNSW index reload) on every write
- Add _refresh_db_mtime() after ChromaDB writes to prevent self-triggered
  reconnects
- Bypass mtime cooldown for safety-critical DB disappearance detection
- Fix tool_reconnect to fully clear client and mtime state

Tests:
- Retry decorator: lock retry, exhaustion, non-lock errors, busy variant
- Connection pragmas: wal_autocheckpoint, journal_size_limit, WAL mode
- Multi-process concurrent writes: 4 processes x 20 triples, zero failures
@Qodo-Free-For-OSS

Hi, KnowledgeGraph.__init__ registers a bound method with atexit, which holds a strong reference to the instance and prevents it from being garbage-collected; repeated short-lived KnowledgeGraph creations will accumulate open SQLite connections until process exit.

Severity: action required | Category: reliability

How to fix: Avoid atexit holding self

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

KnowledgeGraph.__init__ registers self.close with atexit, which strongly retains the instance and leaks SQLite connections in any code path that constructs KnowledgeGraph without explicitly calling close().

Issue Context

This is triggered in fact_checker._check_kg_contradictions() which creates a new KnowledgeGraph per call and does not close it.

Fix Focus Areas

  • mempalace/knowledge_graph.py[109-113]
  • mempalace/fact_checker.py[179-217]

Suggested fix approaches

  • Remove the per-instance atexit.register(self.close) and rely on explicit close() in long-lived owners (MCP server already has a global KG) and/or context management.
  • Alternatively, register an atexit handler that does not capture self strongly (e.g., store weakref to the instance and close if still alive).
  • Ensure fact_checker closes the KG (e.g., try/finally: kg.close()), which is the most direct leak stopper regardless of atexit strategy.
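
The weakref variant in the second bullet is what weakref.finalize is built for: the finalizer retains the connection's close method, not the instance. A sketch under those assumptions (stand-in class, not the project's code):

```python
import sqlite3
import weakref

class KG:
    """Illustrative stand-in for KnowledgeGraph; shows only the cleanup pattern."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        # finalize keeps a strong reference to the *connection's* close method,
        # not to self, so the instance stays garbage-collectable. The callback
        # fires when the instance is GC'd or at interpreter exit.
        self._finalizer = weakref.finalize(self, self._conn.close)

    def close(self) -> None:
        # Calling the finalizer is idempotent and detaches it from the registry.
        self._finalizer()
```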

We noticed a couple of other issues in this PR as well — happy to share if helpful.


Found by Qodo code review

Registering `self.close` in `KnowledgeGraph.__init__` kept the atexit
registry strongly referencing every instance, leaking SQLite connections
for short-lived KGs (e.g. fact_checker._check_kg_contradictions which
constructs one per call). Instances could never be GC'd until process
exit.

- Drop per-instance atexit registration
- Close the KG explicitly in fact_checker via try/finally
- Preserve clean WAL checkpoint on shutdown for the long-lived MCP
  server KG by registering atexit on the module-level singleton, which
  is retained by the module anyway so no new references are leaked
- Regression test: weakref to a closed KG must resolve to None after GC
@felipecpaiva
Author

felipecpaiva commented Apr 17, 2026

Thanks for flagging this — good catch. Addressed in 2feaa71.

What changed

  • knowledge_graph.py: removed atexit.register(self.close) from __init__ (and the now-unused import atexit). The bound-method registration was keeping a strong reference to every instance in atexit's registry, so short-lived KGs could never be GC'd.
  • fact_checker.py::_check_kg_contradictions: wraps the query loop in try/finally: kg.close() so the per-call KG always releases its SQLite connection, regardless of atexit strategy.
  • mcp_server.py: moved the atexit registration to module scope on the _kg singleton (atexit.register(_kg.close)). The clean WAL-checkpoint-on-shutdown benefit is preserved for the long-lived owner, and there's no retention issue because the module already holds _kg for the process lifetime.
  • Regression test in tests/test_knowledge_graph.py::TestGarbageCollection: create a KG, take a weakref, close + del + gc.collect(), assert the weakref resolves to None. This is the exact property that was broken before.
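
The property that regression test pins down can be demonstrated in isolation — a hypothetical minimal pair (stand-in classes, not the project's) showing why the bound-method atexit registration leaked:

```python
import atexit
import gc
import weakref

class LeakyKG:
    """Pre-fix shape: the atexit registry strongly retains every instance."""
    def __init__(self):
        atexit.register(self.close)  # bound method -> strong ref to self
    def close(self):
        pass

class FixedKG:
    """Post-fix shape: no per-instance atexit registration."""
    def close(self):
        pass

leaky = LeakyKG()
leaky_ref = weakref.ref(leaky)
leaky.close()
del leaky
gc.collect()

fixed = FixedKG()
fixed_ref = weakref.ref(fixed)
fixed.close()
del fixed
gc.collect()
```

After collection, leaky_ref() still resolves to the instance (atexit holds it until process exit), while fixed_ref() resolves to None — the exact weakref property the new test asserts.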

Why this shape

You listed three options:

  1. Remove the per-instance atexit and rely on explicit close() in long-lived owners.
  2. Use a weakref-based atexit handler.
  3. Explicit close() in fact_checker.

I took (1) + (3). The weakref variant (2) would work, but option (1) is simpler and makes ownership explicit: the MCP server owns the process-wide singleton, short-lived constructors are responsible for their own cleanup. A grep for KnowledgeGraph( confirmed fact_checker._check_kg_contradictions is the only production short-lived caller; tests already call kg.close() explicitly.

Suite is green (pytest tests/ --ignore=tests/benchmarks → 959 passed), lint/format clean.

Happy to hear the other issues you noticed — please share.

@felipecpaiva
Author

Hi maintainers — both workflow runs on this PR are sitting in action_required state (first-time-contributor gate on Actions), so Tests hasn't actually executed yet. Could one of you click "Approve and run workflows" so reviewers have a green check to anchor on?

Locally on 2feaa71:

  • pytest tests/ --ignore=tests/benchmarks → 959 passed
  • ruff check + ruff format --check → clean

Happy to rebase onto latest develop if preferred (mergeable: true as of now; PR #1014's schema columns don't overlap our modified regions).

@felipecpaiva
Author

cc @bensig @igorls @milla-jovovich — friendly ping on the CI approval in the comment above. The Tests workflow is blocked on the first-time-contributor gate, so there's no green check for you to anchor on yet. Tests pass locally (959/959, ruff clean). Happy to address anything you'd want adjusted before this can move.

@igorls igorls added bug Something isn't working area/kg Knowledge graph area/mcp MCP server and tools labels Apr 24, 2026
@bensig
Collaborator

bensig commented Apr 24, 2026

@felipecpaiva thx for this - small lint issue in tests/test_knowledge_graph.py

@felipecpaiva
Author

@bensig Yeah, I missed that due to a ruff version difference.
Fixed and pushed.

Integrate upstream HNSW capacity probe (_refresh_vector_disabled_flag)
into both the first-call and reconnect paths of _get_client(), alongside
our mtime rate-limiting. Merge _last_mtime_check and _vector_disabled
resets in tool_reconnect.
@felipecpaiva
Author

Also @bensig — there was a small merge conflict; fixed it on the branch.
