
fix: harden KnowledgeGraph and MCP server for multi-process access#948

Open
felipecpaiva wants to merge 4 commits into MemPalace:develop from felipecpaiva:fix/multi-process-safety

Conversation

@felipecpaiva

Summary

When mempalace runs behind mcp-proxy (SSE mode) with multiple concurrent clients, two classes of concurrency bugs surface:

  1. SQLite "database is locked" in KnowledgeGraph — busy_timeout was only 10s with no application-level retry. Concurrent writers from separate mcp-proxy processes exhaust the timeout.

  2. ChromaDB client cache thrashing — every write changes chroma.sqlite3 mtime. Other processes detect this and recreate PersistentClient, reloading the full HNSW index from disk on every operation.

KnowledgeGraph fixes

  • busy_timeout 10s → 60s
  • _sqlite_retry decorator: exponential backoff with jitter (5 retries, only for "locked"/"busy" errors)
  • BEGIN IMMEDIATE for write transactions (detect contention at start, not mid-transaction)
  • PRAGMA wal_autocheckpoint=1000 + journal_size_limit=64MB (manage WAL growth)
  • atexit.register(self.close) for clean WAL checkpointing on shutdown

MCP server fixes

  • Rate-limit chroma.sqlite3 stat/mtime checks to 5-second intervals
  • _refresh_db_mtime() after writes prevents self-inflicted client recreation
  • Bypass cooldown for safety-critical DB disappearance detection (rebuild scenarios)
  • tool_reconnect now fully clears client + mtime state
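
In sketch form, the rate-limited mtime check might work like this (class and attribute names are hypothetical; the real server wraps a ChromaDB PersistentClient rather than a bare path):

```python
import os
import time

_MTIME_CHECK_COOLDOWN = 5.0  # seconds between stat() calls

class ClientCache:
    """Illustrative sketch of cooldown-gated change detection, not the PR's code."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._last_mtime = None
        self._last_mtime_check = float("-inf")

    def _db_changed(self) -> bool:
        now = time.monotonic()
        # Safety-critical path: a vanished DB file bypasses the cooldown,
        # so rebuild scenarios are detected immediately.
        if not os.path.exists(self.db_path):
            return True
        if now - self._last_mtime_check < _MTIME_CHECK_COOLDOWN:
            return False  # within cooldown: assume unchanged
        self._last_mtime_check = now
        mtime = os.stat(self.db_path).st_mtime_ns
        if mtime != self._last_mtime:
            self._last_mtime = mtime
            return True
        return False

    def refresh_db_mtime(self) -> None:
        """Call after our own writes so they aren't treated as external changes."""
        if os.path.exists(self.db_path):
            self._last_mtime = os.stat(self.db_path).st_mtime_ns
```

The cooldown caps stat() traffic, while refresh_db_mtime() prevents the self-inflicted client recreation described above.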

Tests

  • TestSQLiteRetryDecorator — 5 cases: retry success, exhaustion, non-lock errors, busy variant
  • TestConnectionPragmas — 3 cases: autocheckpoint, journal_size_limit, WAL mode
  • TestMultiProcessLocking — 4 processes × 20 triples to same DB file, zero failures
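
The test code itself isn't shown in this thread; the sketch below uses threads with one connection each to demonstrate the same contention pattern self-containedly (SQLite locking is per connection, so the timeout behavior is comparable, though the actual test exercises true OS-level multi-process locking):

```python
import os
import sqlite3
import tempfile
import threading

def _writer(db_path: str, worker_id: int, n_rows: int, failures: list) -> None:
    # One connection per worker, mirroring one connection per process.
    conn = sqlite3.connect(db_path, timeout=60.0)
    try:
        for i in range(n_rows):
            try:
                with conn:  # implicit BEGIN ... COMMIT per insert
                    conn.execute(
                        "INSERT INTO triples (s, p, o) VALUES (?, ?, ?)",
                        (f"worker{worker_id}", "wrote", f"row{i}"),
                    )
            except sqlite3.OperationalError:
                failures.append((worker_id, i))
    finally:
        conn.close()

def run_concurrent_writes(db_path: str, n_workers: int = 4, n_rows: int = 20) -> list:
    """Hammer one DB file from n_workers concurrent connections; return failures."""
    failures: list = []
    workers = [
        threading.Thread(target=_writer, args=(db_path, w, n_rows, failures))
        for w in range(n_workers)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return failures
```

With WAL mode and a generous busy timeout, all 80 inserts should land with an empty failures list.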

Test plan

  • python -m pytest tests/ -v --ignore=tests/benchmarks — 958 passed
  • ruff check + ruff format --check — clean
  • Manual: connect 2+ Claude Code sessions via SSE, trigger concurrent kg_add calls
  • Manual: concurrent add_drawer calls from multiple clients — no cache thrashing

When multiple mcp-proxy SSE connections share the same mempalace data,
concurrent processes compete for SQLite and ChromaDB resources.

KnowledgeGraph changes:
- Increase busy_timeout from 10s to 60s
- Add _sqlite_retry decorator with exponential backoff for lock/busy errors
- Use BEGIN IMMEDIATE for writes (detect contention at transaction start)
- Add WAL autocheckpoint and journal_size_limit pragmas
- Register atexit handler for clean WAL shutdown

MCP server changes:
- Rate-limit chroma.sqlite3 mtime checks to 5s intervals to prevent
  PersistentClient recreation (and HNSW index reload) on every write
- Add _refresh_db_mtime() after ChromaDB writes to prevent self-triggered
  reconnects
- Bypass mtime cooldown for safety-critical DB disappearance detection
- Fix tool_reconnect to fully clear client and mtime state

Tests:
- Retry decorator: lock retry, exhaustion, non-lock errors, busy variant
- Connection pragmas: wal_autocheckpoint, journal_size_limit, WAL mode
- Multi-process concurrent writes: 4 processes x 20 triples, zero failures
@Qodo-Free-For-OSS

Hi, KnowledgeGraph.__init__ registers a bound method with atexit, which holds a strong reference to the instance and prevents it from being garbage-collected; repeated short-lived KnowledgeGraph creations will accumulate open SQLite connections until process exit.

Severity: action required | Category: reliability

How to fix: Avoid atexit holding self

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

KnowledgeGraph.__init__ registers self.close with atexit, which strongly retains the instance and leaks SQLite connections in any code path that constructs KnowledgeGraph without explicitly calling close().

Issue Context

This is triggered in fact_checker._check_kg_contradictions() which creates a new KnowledgeGraph per call and does not close it.

Fix Focus Areas

  • mempalace/knowledge_graph.py[109-113]
  • mempalace/fact_checker.py[179-217]

Suggested fix approaches

  • Remove the per-instance atexit.register(self.close) and rely on explicit close() in long-lived owners (MCP server already has a global KG) and/or context management.
  • Alternatively, register an atexit handler that does not capture self strongly (e.g., store weakref to the instance and close if still alive).
  • Ensure fact_checker closes the KG (e.g., try/finally: kg.close()), which is the most direct leak stopper regardless of atexit strategy.
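
The weakref variant in the second bullet is what weakref.finalize is built for: the finalizer retains the connection's close method, not the instance. A sketch under those assumptions (stand-in class, not the project's code):

```python
import sqlite3
import weakref

class KG:
    """Illustrative stand-in for KnowledgeGraph; shows only the cleanup pattern."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        # finalize keeps a strong reference to the *connection's* close method,
        # not to self, so the instance stays garbage-collectable. The callback
        # fires when the instance is GC'd or at interpreter exit.
        self._finalizer = weakref.finalize(self, self._conn.close)

    def close(self) -> None:
        # Calling the finalizer is idempotent and detaches it from the registry.
        self._finalizer()
```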

We noticed a couple of other issues in this PR as well — happy to share if helpful.


Found by Qodo code review

Registering `self.close` in `KnowledgeGraph.__init__` kept the atexit
registry strongly referencing every instance, leaking SQLite connections
for short-lived KGs (e.g. fact_checker._check_kg_contradictions which
constructs one per call). Instances could never be GC'd until process
exit.

- Drop per-instance atexit registration
- Close the KG explicitly in fact_checker via try/finally
- Preserve clean WAL checkpoint on shutdown for the long-lived MCP
  server KG by registering atexit on the module-level singleton, which
  is retained by the module anyway so no new references are leaked
- Regression test: weakref to a closed KG must resolve to None after GC
@felipecpaiva
Author

felipecpaiva commented Apr 17, 2026

Thanks for flagging this — good catch. Addressed in 2feaa71.

What changed

  • knowledge_graph.py: removed atexit.register(self.close) from __init__ (and the now-unused import atexit). The bound-method registration was keeping a strong reference to every instance in atexit's registry, so short-lived KGs could never be GC'd.
  • fact_checker.py::_check_kg_contradictions: wraps the query loop in try/finally: kg.close() so the per-call KG always releases its SQLite connection, regardless of atexit strategy.
  • mcp_server.py: moved the atexit registration to module scope on the _kg singleton (atexit.register(_kg.close)). The clean WAL-checkpoint-on-shutdown benefit is preserved for the long-lived owner, and there's no retention issue because the module already holds _kg for the process lifetime.
  • Regression test in tests/test_knowledge_graph.py::TestGarbageCollection: create a KG, take a weakref, close + del + gc.collect(), assert the weakref resolves to None. This is the exact property that was broken before.
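
The property that regression test pins down can be demonstrated in isolation — a hypothetical minimal pair (stand-in classes, not the project's) showing why the bound-method atexit registration leaked:

```python
import atexit
import gc
import weakref

class LeakyKG:
    """Pre-fix shape: the atexit registry strongly retains every instance."""
    def __init__(self):
        atexit.register(self.close)  # bound method -> strong ref to self
    def close(self):
        pass

class FixedKG:
    """Post-fix shape: no per-instance atexit registration."""
    def close(self):
        pass

leaky = LeakyKG()
leaky_ref = weakref.ref(leaky)
leaky.close()
del leaky
gc.collect()

fixed = FixedKG()
fixed_ref = weakref.ref(fixed)
fixed.close()
del fixed
gc.collect()
```

After collection, leaky_ref() still resolves to the instance (atexit holds it until process exit), while fixed_ref() resolves to None — the exact weakref property the new test asserts.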

Why this shape

You listed three options:

  1. Remove the per-instance atexit and rely on explicit close() in long-lived owners.
  2. Use a weakref-based atexit handler.
  3. Explicit close() in fact_checker.

I took (1) + (3). The weakref variant (2) would work, but option (1) is simpler and makes ownership explicit: the MCP server owns the process-wide singleton, short-lived constructors are responsible for their own cleanup. A grep for KnowledgeGraph( confirmed fact_checker._check_kg_contradictions is the only production short-lived caller; tests already call kg.close() explicitly.

Suite is green (pytest tests/ --ignore=tests/benchmarks → 959 passed), lint/format clean.

Happy to hear the other issues you noticed — please share.

@felipecpaiva
Author

Hi maintainers — both workflow runs on this PR are sitting in action_required state (first-time-contributor gate on Actions), so Tests hasn't actually executed yet. Could one of you click "Approve and run workflows" so reviewers have a green check to anchor on?

Locally on 2feaa71:

  • pytest tests/ --ignore=tests/benchmarks → 959 passed
  • ruff check + ruff format --check → clean

Happy to rebase onto latest develop if preferred (mergeable: true as of now; PR #1014's schema columns don't overlap our modified regions).

@felipecpaiva
Author

cc @bensig @igorls @milla-jovovich — friendly ping on the CI approval in the comment above. The Tests workflow is blocked on the first-time-contributor gate, so there's no green check for you to anchor on yet. Tests pass locally (959/959, ruff clean). Happy to address anything you'd want adjusted before this can move.

@igorls igorls added bug Something isn't working area/kg Knowledge graph area/mcp MCP server and tools labels Apr 24, 2026
@bensig
Collaborator

bensig commented Apr 24, 2026

@felipecpaiva thx for this - small lint issue in tests/test_knowledge_graph.py

@felipecpaiva
Author

@bensig Yeah, I missed that due to a ruff version difference.
Fixed and pushed.

Integrate upstream HNSW capacity probe (_refresh_vector_disabled_flag)
into both the first-call and reconnect paths of _get_client(), alongside
our mtime rate-limiting. Merge _last_mtime_check and _vector_disabled
resets in tool_reconnect.
@felipecpaiva
Author

Also @bensig — there was a small merge conflict; fixed it on the branch.
