
[Bug] RocksDB lock contention: multiple _SingleAccountBackend instances open same DB path (local backend) #1072

@plhys

Description


Environment

  • OpenViking: v0.2.8 (Docker, local vectordb backend)
  • OS: Linux 6.12.18 (Synology NAS)

Problem

When using the local vectordb backend, OpenViking creates multiple _SingleAccountBackend instances (one per account_id). Each instance calls create_collection_adapter(config), which creates a new LocalCollectionAdapter → get_or_create_local_collection() → opens the same RocksDB/LevelDB store path.

RocksDB enforces an exclusive file lock on the store/LOCK file. The first instance acquires it; every subsequent instance fails immediately with:

```
IO error: lock /app/data/vectordb/context/store/LOCK: already held by process
```

This makes all vector operations (count, query, search, upsert) fail for any account except the first one initialized.
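The "one holder only" behavior can be reproduced outside RocksDB. The sketch below uses flock(2) on a scratch file; RocksDB's actual mechanism (fcntl record locks plus an in-process lock table) differs in detail, but the outcome is the same: the first opener wins, every later opener fails immediately.

```python
import fcntl
import os
import tempfile

# Illustration only: two independent opens of the same lock file, both
# requesting an exclusive, non-blocking lock -- mirroring two adapter
# instances opening the same store/LOCK path.
fd, lock_path = tempfile.mkstemp()
os.close(fd)

first = open(lock_path, "w")
fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)  # first opener: succeeds

second = open(lock_path, "w")
try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)  # second opener
    conflict = False
except BlockingIOError:
    conflict = True  # analogous to "lock .../store/LOCK: already held by process"

print(conflict)
```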

Root Cause Analysis

The issue is in the architecture:

  1. VikingVectorIndexBackend._get_backend_for_account() creates a new _SingleAccountBackend per account_id
  2. Each backend calls create_collection_adapter(config) in factory.py, with no caching
  3. Each adapter opens its own PersistCollectionStoreManager → RocksDB at the same path
  4. RocksDB only allows one writer per database directory

Since account filtering is done at the query level (via account_id field filter), all accounts share the same physical collection. There's no need for separate DB handles.
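To make the "no need for separate DB handles" point concrete, here is a hypothetical reduction (not OpenViking's real API): one shared physical store, with per-account isolation done entirely by an account_id filter at query time.

```python
# Hypothetical sketch: a single shared collection serving all accounts.
records = [
    {"account_id": "acct-1", "id": "a", "text": "hello"},
    {"account_id": "acct-2", "id": "b", "text": "world"},
    {"account_id": "acct-1", "id": "c", "text": "again"},
]

def query(account_id: str) -> list[dict]:
    # One physical store suffices; the filter scopes results per account,
    # so a second DB handle per account buys nothing.
    return [r for r in records if r["account_id"] == account_id]

print(len(query("acct-1")))  # 2
```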

Contributing Factors

  • lock_timeout defaults to 0.0 (immediate failure) in lock_manager.py and transaction_config.py
  • self._adapter.query() and self._adapter.count() are synchronous calls inside async def methods, blocking the event loop and preventing lock release
  • self.embedder.embed() in hierarchical_retriever.py is also synchronous in async context

Fix Applied (verified working)

Three patches that together resolve the issue:

Patch 1: Adapter instance cache in factory.py (PRIMARY FIX)

```python
_local_adapter_cache: dict[str, CollectionAdapter] = {}

def create_collection_adapter(config) -> CollectionAdapter:
    # ... existing code ...
    if backend == "local" and hasattr(config, "path") and hasattr(config, "name"):
        from pathlib import Path

        # Key the cache on the resolved store path so every account targeting
        # the same collection reuses one adapter (and one RocksDB handle).
        cache_key = str(Path(config.path) / "vectordb" / (config.name or "context"))
        if cache_key in _local_adapter_cache:
            return _local_adapter_cache[cache_key]
        adapter = adapter_cls.from_config(config)
        _local_adapter_cache[cache_key] = adapter
        return adapter
    return adapter_cls.from_config(config)
```
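A standalone reduction of this pattern, with a stand-in Adapter class (hypothetical; the real factory builds adapter_cls.from_config(config)), shows the invariant the patch establishes: repeated calls for the same store path return the same object, so the DB is opened exactly once.

```python
from pathlib import Path

class Adapter:
    """Stand-in for the real adapter; in the real code __init__ opens the DB."""
    def __init__(self, store_path: str):
        self.store_path = store_path

_cache: dict[str, Adapter] = {}

def create_adapter(path: str, name: str = "context") -> Adapter:
    # Same cache key scheme as Patch 1: path / "vectordb" / name.
    cache_key = str(Path(path) / "vectordb" / name)
    if cache_key not in _cache:
        _cache[cache_key] = Adapter(cache_key)
    return _cache[cache_key]

a = create_adapter("/app/data")
b = create_adapter("/app/data")
print(a is b)  # True: one adapter, one DB handle
```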

Patch 2: lock_timeout 0 → 5s

In lock_manager.py and transaction_config.py:

```python
lock_timeout: float = 5.0  # was 0.0
```
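The semantics of this change, illustrated with a generic threading.Lock rather than the project's lock_manager: a zero timeout fails immediately under contention, while a positive timeout waits for the current holder to release.

```python
import threading

lock = threading.Lock()
lock.acquire()  # simulate another component currently holding the lock

# Old behavior (timeout 0.0): fail immediately if contended.
contended_immediate = lock.acquire(blocking=False)

# New behavior (timeout 5.0): wait up to 5 s; a cooperating holder that
# releases shortly after lets the second acquirer succeed.
threading.Timer(0.1, lock.release).start()
contended_waited = lock.acquire(timeout=5.0)

print(contended_immediate, contended_waited)
```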

Patch 3: asyncio.to_thread for blocking calls

In viking_vector_index_backend.py:

```python
return await asyncio.to_thread(self._adapter.query, ...)
return await asyncio.to_thread(self._adapter.count, filter=filter)
```

In hierarchical_retriever.py:

```python
result = await asyncio.to_thread(self.embedder.embed, query.query, True)
```
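A self-contained sketch of what Patch 3 buys, with time.sleep standing in for a blocking adapter.query()/embedder.embed() call: run directly inside a coroutine, two 0.2 s calls would serialize (~0.4 s) and stall the event loop; dispatched via asyncio.to_thread, they overlap and the loop stays free.

```python
import asyncio
import time

def blocking_query() -> str:
    time.sleep(0.2)  # stand-in for a synchronous DB/embedding call
    return "rows"

async def main() -> float:
    start = time.monotonic()
    # Both blocking calls run on worker threads concurrently, so the
    # event loop is never blocked and total wall time is ~0.2 s, not ~0.4 s.
    await asyncio.gather(
        asyncio.to_thread(blocking_query),
        asyncio.to_thread(blocking_query),
    )
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(elapsed)
```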

Related PRs

These existing PRs address parts of the problem but not the root cause (adapter caching):

None of these PRs fix the fundamental issue: multiple adapter instances opening the same RocksDB. Even with all these PRs merged, the lock contention will persist because each _SingleAccountBackend still creates its own DB handle.

Suggested Long-term Fix

The proper fix should be at the VikingVectorIndexBackend facade level: ensure all _SingleAccountBackend instances share a single CollectionAdapter (and thus a single RocksDB handle), since they all operate on the same physical collection with account-level filtering.
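A minimal sketch of that facade-level design, with names simplified and the classes reduced to stubs (hypothetical constructor signatures, not the real ones): the facade owns a single adapter and hands the same instance to every per-account backend, so only one DB handle ever exists.

```python
class CollectionAdapter:
    """Stub for the real adapter; would hold the single RocksDB handle."""

class _SingleAccountBackend:
    def __init__(self, account_id: str, adapter: CollectionAdapter):
        self.account_id = account_id
        self.adapter = adapter  # injected and shared, never created here

class VikingVectorIndexBackend:
    def __init__(self):
        self._adapter = CollectionAdapter()  # opened exactly once
        self._backends: dict[str, _SingleAccountBackend] = {}

    def _get_backend_for_account(self, account_id: str) -> _SingleAccountBackend:
        # Per-account backends are still cached, but all of them wrap the
        # facade's single adapter; isolation stays at the query-filter level.
        if account_id not in self._backends:
            self._backends[account_id] = _SingleAccountBackend(
                account_id, self._adapter
            )
        return self._backends[account_id]

facade = VikingVectorIndexBackend()
b1 = facade._get_backend_for_account("acct-1")
b2 = facade._get_backend_for_account("acct-2")
print(b1.adapter is b2.adapter)  # True
```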
