mempalace repair crashes mid-rebuild on ChromaDB 1.5.8, leaving palace unrecoverable #1238

@gounthar


What happened?

I ran mempalace repair --yes on a palace with ~196K drawers. The command crashed mid-rebuild and left the palace with only 235 embeddings, no usable backup, and no way to recover without drawers_export.json.

Two separate failures occurred in sequence.

Failure 1: FTS5 table not registered

Before repair would even start, it hit:

chromadb.errors.InternalError: Database error: error returned from database:
(code: 1) no such table: embedding_fulltext_search

The FTS5 shadow tables were present in sqlite_master, but embedding_fulltext_search itself was not registered. I had to recreate it manually:

DROP TABLE IF EXISTS embedding_fulltext_search_config;
DROP TABLE IF EXISTS embedding_fulltext_search_content;
DROP TABLE IF EXISTS embedding_fulltext_search_data;
DROP TABLE IF EXISTS embedding_fulltext_search_docsize;
DROP TABLE IF EXISTS embedding_fulltext_search_idx;
CREATE VIRTUAL TABLE embedding_fulltext_search USING fts5(string_value, tokenize='trigram');
INSERT INTO embedding_fulltext_search (rowid, string_value)
  SELECT rowid, string_value FROM embedding_metadata;

(~2M rows, took a few minutes.)
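
For anyone checking whether they're in the same state before running the SQL above: the virtual table itself has to appear in sqlite_master, not just its shadow tables. A minimal check, assuming the default chroma.sqlite3 file name and the ~/.mempalace/palace layout from this report:

import os
import sqlite3

# Assumed location of the palace's ChromaDB SQLite file.
db_path = os.path.expanduser("~/.mempalace/palace/chroma.sqlite3")

conn = sqlite3.connect(db_path)
names = {
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE name LIKE 'embedding_fulltext_search%'"
    )
}
conn.close()

print("registered objects:", sorted(names))
# Shadow tables (_config, _content, _data, _docsize, _idx) present but the
# virtual table itself missing is exactly the broken state described above.
print("FTS5 table missing:", "embedding_fulltext_search" not in names)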

Failure 2: Compaction crash mid-rebuild

After the FTS5 fix, repair extracted all drawers fine, then crashed:

chromadb.errors.InternalError: Error in compaction: Failed to apply logs to the metadata segment

State after the crash:

  • The original collection was already deleted from SQLite before the crash hit.
  • Only 235 embeddings had been re-inserted when it died.
  • palace.backup was overwritten at the start of the repair run, so it held the broken 235-embedding state, not the original.
  • The only copy of the full 196K drawers was drawers_export.json (560 MB).

What did you expect?

Either the repair succeeds, or it fails without destroying the original data. Deleting the SQLite collection before the new inserts are done means a mid-rebuild crash leaves nothing to fall back to.

How to reproduce:

  1. Palace with ~196K+ drawers on ChromaDB 1.5.8
  2. mempalace repair --yes
  3. Observe crash with Error in compaction: Failed to apply logs to the metadata segment

I suspect this is more likely to trigger on larger palaces due to how the Rust compaction backend handles big collections, but I haven't confirmed a minimum size threshold.

What actually fixed it:

# 1. Delete only the HNSW segment directory (leaves SQLite untouched)
rm -rf ~/.mempalace/palace/<uuid>/

# 2. Trigger HNSW recreation from SQLite
mempalace status

# 3. Re-import missing drawers using mempalace's ChromaBackend
#    (so _HNSW_BLOAT_GUARD gets applied automatically)
python3 import_missing.py

The SQLite data was intact the whole time. Only the HNSW index was broken. Repair didn't need to touch SQLite at all.
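
For completeness, a stripped-down sketch of what import_missing.py did. The real script goes through mempalace's ChromaBackend (so _HNSW_BLOAT_GUARD applies); this version talks to chromadb directly for illustration only, and it assumes drawers_export.json is a list of objects with id, document, and metadata keys, which may not match the actual export format:

import json
import os
import chromadb

# Assumed paths and collection name -- adjust to your palace.
client = chromadb.PersistentClient(path=os.path.expanduser("~/.mempalace/palace"))
collection = client.get_or_create_collection("drawers")

with open("drawers_export.json") as f:
    exported = json.load(f)  # assumed shape: [{"id": ..., "document": ..., "metadata": ...}, ...]

# Work out which drawers survived the crash.
existing = set(collection.get(include=[])["ids"])
missing = [d for d in exported if d["id"] not in existing]
print(f"{len(missing)} drawers to re-import")

# Re-insert in batches to stay under ChromaDB's per-call limits.
BATCH = 1000
for i in range(0, len(missing), BATCH):
    chunk = missing[i:i + BATCH]
    collection.add(
        ids=[d["id"] for d in chunk],
        documents=[d["document"] for d in chunk],
        metadatas=[d["metadata"] for d in chunk],
    )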

Suggestions

Short term: --mode hnsw-only flag

Most "palace won't start" cases I've hit involve a corrupted or bloated HNSW index on top of intact SQLite. Deleting just the HNSW segment directory is enough:

rm -rf ~/.mempalace/palace/<uuid>/
mempalace status

A --mode hnsw-only option doing exactly this would cover the majority of cases safely. No re-embedding, no SQLite surgery, no data-loss risk.
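
A rough sketch of what the flag could do internally (hypothetical; the function name and the assumption that ChromaDB's segments table maps vector segments to on-disk directory names are mine, not existing mempalace code):

import os
import shutil
import sqlite3

def repair_hnsw_only(palace_dir: str) -> None:
    """Drop only the HNSW segment directories; never touch the SQLite data."""
    db_path = os.path.join(palace_dir, "chroma.sqlite3")
    conn = sqlite3.connect(db_path)
    # Assumption: vector segment ids in the segments table match the
    # per-segment directory names under the palace directory.
    segment_ids = [
        row[0] for row in conn.execute("SELECT id FROM segments WHERE scope = 'VECTOR'")
    ]
    conn.close()

    for segment_id in segment_ids:
        seg_dir = os.path.join(palace_dir, segment_id)
        if os.path.isdir(seg_dir):
            shutil.rmtree(seg_dir)
            print(f"removed HNSW segment {segment_id}")
    # The next collection access (e.g. `mempalace status`) rebuilds the HNSW
    # index from the intact SQLite data.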

Medium term: keep the previous backup until rebuild succeeds

repair overwrites palace.backup at the start of each run. Keeping palace.backup.prev until the new build completes would give users a real fallback if the rebuild fails.
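
Even a simple rotation would be enough, e.g. (hypothetical sketch, file names assumed):

import os
import shutil

def _remove(path: str) -> None:
    if os.path.isdir(path):
        shutil.rmtree(path)
    elif os.path.exists(path):
        os.remove(path)

def start_repair_backup(palace_dir: str) -> None:
    """Rotate the old backup aside instead of overwriting it."""
    backup = os.path.join(palace_dir, "palace.backup")
    prev = os.path.join(palace_dir, "palace.backup.prev")
    if os.path.exists(backup):
        _remove(prev)                # drop an older .prev if one is lying around
        shutil.move(backup, prev)    # last known-good state survives the rebuild

def finish_repair_backup(palace_dir: str) -> None:
    """Call only after the rebuild completed and was verified."""
    _remove(os.path.join(palace_dir, "palace.backup.prev"))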

Long term: catch the compaction error before touching SQLite

If ChromaDB throws Error in compaction, repair should abort before wiping the original collection data, not after.
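
Put differently: populate a new collection first, verify it, and only then drop the original. A hypothetical control flow (not actual mempalace internals), using the plain chromadb client API:

def safe_repair(client, name: str, drawer_batches) -> None:
    """Rebuild into a temporary collection; the original is only dropped on success."""
    rebuilt = client.get_or_create_collection(f"{name}__rebuild")
    expected = 0
    try:
        for batch in drawer_batches:        # batch: {"ids": [...], "documents": [...], ...}
            rebuilt.add(**batch)
            expected += len(batch["ids"])
        if rebuilt.count() != expected:     # insert or compaction lost rows
            raise RuntimeError("rebuild incomplete")
    except Exception:
        client.delete_collection(f"{name}__rebuild")   # abort; original untouched
        raise
    # Only now is it safe to remove the original and promote the rebuilt copy.
    client.delete_collection(name)
    rebuilt.modify(name=name)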

Environment:

  • OS: WSL2 / Debian (Linux x86_64)
  • Python version: 3.12
  • MemPalace version: 3.3.3
  • ChromaDB version: 1.5.8
