feat(consolidation): parallelize llm_batch processing within a single op by connorblack · Pull Request #1604 · vectorize-io/hindsight

connorblack · 2026-05-13T02:09:00Z

Summary

Replace the serial `for llm_batch in llm_batches` loop in run_consolidation_job with a bounded asyncio.gather, using a Semaphore sized at config.consolidation_llm_max_concurrent. The env knob has existed for a while (default 8) but was unused for single-bank workloads: the previous serial loop bottlenecked on single-LLM-call latency regardless of available capacity.
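For readers unfamiliar with the pattern, here is a minimal sketch of a bounded gather, not the code in this PR. The names `run_wave`, `guarded`, and `fake_llm` are hypothetical, and the fake LLM call is a short sleep that records how many calls are in flight at once.

```python
import asyncio


async def run_wave(batches, max_concurrent, call_llm):
    """Run one task per batch with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max(1, max_concurrent))

    async def guarded(batch):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await call_llm(batch)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(guarded(b) for b in batches))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_llm(batch):
        # Hypothetical stand-in for an LLM call; tracks peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return batch * 2

    results = await run_wave(list(range(20)), max_concurrent=8, call_llm=fake_llm)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "results; peak in-flight calls:", peak)
```

The semaphore bounds concurrency without changing result order, which is what lets the caller keep emitting per-batch output in batch order afterwards.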

Why this matters

In production we observed consolidation_llm_max_concurrent=8 configured but the consolidator only ever made one in-flight LLM call per op. With Qwen3.6-35B-A3B-FP8 on a single GB10 GPU, latency per consolidation call is 5-10s once chat_template_kwargs.enable_thinking=false is set; with parallelism unlocked, the same wave processes 8 memories in roughly the same wall time as 1.

Verified on a 14k-memory bank: the per-batch log lines for all batches in a wave carry the same-millisecond timestamp (evidence that gather returned them as a wave), with no double-creates or skipped failures across hundreds of batches.

Correctness invariants preserved

Tag-group security boundary: memories with different tags still never share an LLM call (slicing happens upstream of the gather; each task operates on a single tag group's slice).
Adaptive split-on-failure protocol: the inner pending queue (halve sub-batch on LLM failure, mark failed only at size=1) remains serial within each batch — that ordering is required by the split protocol.
Disjoint memory sets per batch: LIMIT \$2 query produces a fixed set per fetch round; tag-group slicing produces non-overlapping subsets. No two batches in a gather wave can race on the same memory_id.
DB commit semantics: cross-batch IDs aggregated into a single executemany per fetch round (functionally equivalent to the previous per-batch commits — IDs are disjoint).
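The split-on-failure invariant above can be illustrated with a small sketch. This is an assumption-laden illustration of the halving idea, not the PR's implementation: `process_with_split` and `flaky_llm` are hypothetical names, and the real helper also handles results, tags, and cancellation.

```python
import asyncio
from collections import deque


async def process_with_split(memories, call_llm):
    """Serially drain a pending queue, halving any sub-batch whose call fails."""
    pending = deque([list(memories)])
    succeeded, failed = [], []
    while pending:
        sub_batch = pending.popleft()
        try:
            await call_llm(sub_batch)
            succeeded.extend(sub_batch)
        except Exception:
            if len(sub_batch) == 1:
                # Size 1 cannot be split further, so mark it failed.
                failed.extend(sub_batch)
            else:
                mid = len(sub_batch) // 2
                # Push both halves to the front so they are retried next, in order.
                pending.appendleft(sub_batch[mid:])
                pending.appendleft(sub_batch[:mid])
    return succeeded, failed


async def flaky_llm(sub_batch):
    # Hypothetical failure mode: any sub-batch containing "bad" is rejected.
    if "bad" in sub_batch:
        raise RuntimeError("llm rejected batch")


if __name__ == "__main__":
    print(asyncio.run(process_with_split(["a", "b", "bad", "c"], flaky_llm)))
```

Because each retry depends on the outcome of the previous call, this inner loop stays serial; only whole batches run side by side in the gather wave.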

Module surface added

ConsolidationPerfLog.__iadd__ — accumulator (matches TokenUsage.__add__ codebase idiom).
_BatchExecutionResult dataclass — self-contained per-batch outputs for parent aggregation.
_record_result(stats, result) -> _ResultDelta — centralizes the action-vocabulary mapping (created/updated/merged/multiple/skipped/failed). Side benefit: pre-patch silently dropped merged actions from per-batch log lines; the new helper tracks them.
_merge_pass_result(existing, new) -> dict — extracted from the dense inline block in the multi-pass loop; testable in isolation.
_resolve_obs_tags_list(memory_tags, scope_spec) — translates observation_scopes spec into concrete tag-set passes; called once per llm_batch instead of once per sub_batch.
_is_op_cancelled(memory_engine, operation_id) -> bool — predicate used at both the per-sub-batch check inside _execute_one_llm_batch and the end-of-wave checkpoint in run_consolidation_job.
_execute_one_llm_batch(...) async helper extracted from the for-loop body. Accepts operation_id so per-sub-batch cancellation polling restores per-batch cancellation granularity (the gather wave checkpoint alone would have lengthened cancellation latency from ~5s to ~45s).

Per-batch logging

Each batch executes against its own ConsolidationPerfLog instance so timings/llm_calls/prompt_chars are not raced across concurrent gather participants. Per-task perf is merged into the parent via += after gather completes. The pre-patch snap-delta-style logging (capture-before, log-after) was incorrect under parallel execution; the new shape carries per-batch perf in the result dataclass and emits log lines deterministically ordered by batch_num.
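A rough sketch of the per-task accumulator shape described above follows. `PerfLog` is a hypothetical stand-in for ConsolidationPerfLog with assumed fields, and `run_batch` fakes the work; the point is only that each task owns its log and the parent merges with `+=` after gather returns.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class PerfLog:
    """Hypothetical stand-in for ConsolidationPerfLog; fields are assumed."""
    llm_calls: int = 0
    prompt_chars: int = 0
    timings: dict = field(default_factory=dict)

    def __iadd__(self, other):
        # Accumulate another log into this one and return self for `+=`.
        self.llm_calls += other.llm_calls
        self.prompt_chars += other.prompt_chars
        for name, seconds in other.timings.items():
            self.timings[name] = self.timings.get(name, 0.0) + seconds
        return self


async def run_batch(batch_num, prompt):
    perf = PerfLog()  # private to this task, so nothing is shared mid-flight
    await asyncio.sleep(0)  # stands in for the LLM call
    perf.llm_calls += 1
    perf.prompt_chars += len(prompt)
    perf.timings["llm"] = 0.5
    return batch_num, perf


async def main():
    parent = PerfLog()
    results = await asyncio.gather(run_batch(2, "bb"), run_batch(1, "a"))
    # Merge and log in batch_num order, independent of completion order.
    for batch_num, perf in sorted(results, key=lambda r: r[0]):
        parent += perf
    return parent


if __name__ == "__main__":
    print(asyncio.run(main()))
```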

Throughput notes

Empirical: ~16× speedup of consolidation throughput when combined with thinking-off (~5s/call vs ~80s/call). At a fixed concurrency of 8, the 8 in-flight LLM calls per wave saturate the GPU; cranking to 16 increases per-call latency due to KV-cache contention without proportional throughput gain (this is GPU-bound, not Python-bound). Conservative production setting: leave consolidation_llm_max_concurrent at the default of 8.

Test plan

The local pytest suite at hindsight-api-slim/tests/test_consolidation*.py requires HINDSIGHT_API_LLM_API_KEY for testcontainers setup, so it was not exercised against this fork. CI's core lane (paths-filter: hindsight-api-slim/**) will run the suite end-to-end.
Production smoke verified on a 14k-memory bank: zero 404s, zero schema rejections, zero duplicate observations, zero consolidation_failed_at regressions across 1000+ batches.
pre-commit hooks (ruff check --fix, ruff format, ty check) all green per scripts/hooks/lint.sh.

Notes for reviewers

Three commits on the branch — visible iteration via two rounds of internal review (reuse / quality / efficiency). Squash on merge is fine if that's the project preference; the final state is what matters.

Cancellation latency under parallelism: cancellation is checked per sub-batch via operation_id (roughly one DB round trip per sub_batch), plus once at the end-of-wave checkpoint. With wave size ≤ consolidation_batch_size (default 50) and waves completing in seconds with thinking off, worst-case cancellation latency is bounded by the slowest LLM call in the active wave.
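To make the polling shape concrete, here is a hedged sketch with hypothetical names (`is_op_cancelled`, `run_batch`, and an in-memory `cancelled_ops` set in place of the DB lookup). It shows why the latency bound is one LLM call: the check runs before each sub-batch, so a cancellation that arrives during a call takes effect once that call returns.

```python
import asyncio


async def is_op_cancelled(cancelled_ops, operation_id):
    """Hypothetical predicate; the real one would query the database."""
    if operation_id is None:
        return False  # nothing to poll when no operation is attached
    await asyncio.sleep(0)  # stands in for the DB round trip
    return operation_id in cancelled_ops


async def run_batch(sub_batches, call_llm, cancelled_ops, operation_id):
    done = []
    for sub_batch in sub_batches:
        # Poll before every sub-batch so cancellation waits for at most one call.
        if await is_op_cancelled(cancelled_ops, operation_id):
            break
        await call_llm(sub_batch)
        done.append(sub_batch)
    return done


async def demo():
    cancelled_ops = set()

    async def fake_llm(sub_batch):
        # Simulate a cancel request arriving while "s2" is being processed.
        if sub_batch == "s2":
            cancelled_ops.add("op-1")

    return await run_batch(["s1", "s2", "s3"], fake_llm, cancelled_ops, "op-1")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```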

Related follow-ups (not in this PR)

Inter-wave fetch pipelining: the next DB fetch round can't start until the current gather wave completes. With <0.1% inter-wave overhead in practice (5ms fetch vs ~6s wave) this is negligible at default config; would matter only on much smaller round sizes. Worth filing as a separate issue if motivated by observed throughput pain.
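The overhead figure can be checked with quick arithmetic. In the sketch below the 5 ms fetch and ~6 s wave come from the note above; the 50 ms wave is a made-up value, included only to show when the serial fetch would start to matter.

```python
def inter_wave_overhead(fetch_seconds, wave_seconds):
    """Fraction of each round spent on the serial fetch rather than the wave."""
    return fetch_seconds / (fetch_seconds + wave_seconds)


# Figures quoted above: 5 ms fetch against a ~6 s gather wave.
default_config = inter_wave_overhead(0.005, 6.0)

# Hypothetical much smaller round: the same fetch against a 50 ms wave.
small_rounds = inter_wave_overhead(0.005, 0.05)

if __name__ == "__main__":
    print(f"default: {default_config:.3%}, small rounds: {small_rounds:.2%}")
```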

Replaces the serial `for llm_batch in llm_batches` loop in run_consolidation_job with bounded asyncio.gather using a Semaphore at config.consolidation_llm_max_concurrent. Each batch executes in its own ConsolidationPerfLog instance so timings/llm_calls/prompt_chars are not raced across concurrent gather participants; the per-task perf is merged into the parent's shared perf after gather completes.

Why this matters:
- consolidation_llm_max_concurrent (default 8) was unused for single-bank workloads: the previous serial loop bottlenecked on single-LLM-call latency. With gather + Semaphore we saturate up to N parallel LLM calls within one op for an N× speedup.
- The tag-group security boundary (memories with different tags never share an LLM call) is preserved unchanged.
- The adaptive split-on-failure protocol remains serial within each batch (correctness requirement of the split protocol).
- The cross-batch DB commit is now a single round-trip per fetch wave instead of one commit per batch (functionally equivalent, since IDs are disjoint by tag-grouping).

Cancellation granularity changes from per-batch to per-gather-wave; acceptable since waves are bounded by consolidation_batch_size memories (default 50) and complete in seconds.

New module surface:
- ConsolidationPerfLog.merge(): aggregate per-task perf into shared
- _BatchExecutionResult dataclass: self-contained per-batch result
- _execute_one_llm_batch(): extracted per-batch work, used as gather task

Address reviewer feedback on the prior commit before opening upstream PR:
- ConsolidationPerfLog.merge() -> __iadd__: matches the codebase's TokenUsage.__add__ accumulator idiom (used heavily in fact_extraction). Callers now write `perf += batch_result.perf`.
- Extract _resolve_obs_tags_list() so observation_scopes parsing happens once per llm_batch instead of once per sub_batch (all sub_batches share the same parent batch's tags by tag-grouping invariant).
- Extract _apply_action_to_stats() so the action-vocabulary mapping has one definition; per-batch counters and aggregate stats now come from a single pass over batch_result.results instead of two.
- Plumb operation_id through _execute_one_llm_batch with a per-sub-batch cancellation check via _check_op_alive; this restores the per-batch cancellation granularity that the prior commit traded for per-wave only.
- Tighten three docstrings (merge, _BatchExecutionResult, _execute_one_llm_batch) to contracts; drop refactor-narrating paragraphs.
- Inline _resolve_obs_tags_list and the shared-tags assignment at the top of _execute_one_llm_batch, replacing the per-sub-batch redundant computation and the for-memory tag-tracking loop (all memories in a batch share tags).
- Comment in run_consolidation_job notes the new semaphore stacks on top of the global LLM semaphore in llm_wrapper.py: effective concurrency is min(this, the global cap).
- Rename loop variable b -> batch and br -> batch_result for readability.
- Drop dead `denom = max(1, br.memories_count)` guard (memories_count is always >= 1 for non-empty batches by construction); use `or 1` inline.

No behavior changes intended. Smoke-tested end-to-end on a deployed image.

Address remaining findings before opening upstream PR:
- Introduce _ResultDelta dataclass with __iadd__; _record_result returns _ResultDelta instead of an opaque dict. Caller now writes `batch_counters += _record_result(stats, result)` (matches the codebase's TokenUsage.__add__ accumulator idiom).
- Track 'merged' in per-batch counters and per-batch log line. Pre-patch silently dropped merged actions from the per-batch log; the new helper is the right place to fix this.
- Extract _merge_pass_result(existing, new) -> dict from the dense inline block in _execute_one_llm_batch's multi-pass loop. Centralizes the skipped-is-weak / non-skipped-combine-into-multiple action vocabulary that previously duplicated knowledge across two sites.
- Extract _is_op_cancelled(memory_engine, operation_id) -> bool predicate. Used at both the per-sub-batch check inside _execute_one_llm_batch and the end-of-wave checkpoint in run_consolidation_job; the implicit `operation_id is not None` short-circuit is now in one place.
- Replace `start_num = llm_batch_num` capture pattern with `enumerate(llm_batches, start=llm_batch_num + 1)` to match the codebase's idiom (memory_engine.py:3234, search/tracer.py:325, etc.).
- Drop dead `(memories_count or 1)` defensive guard. memories_count is always >= 1 by tag-group construction (range slice over non-empty group).
- Rename _apply_action_to_stats -> _record_result. Old name suggested one-way write; new name covers the mutate-and-return shape clearly.

Copilot

Pull request overview

This PR parallelizes consolidation LLM-batch execution within each DB fetch “wave” by replacing a serial per-batch loop with bounded asyncio.gather() concurrency, while refactoring the batch execution into an async helper and making per-batch performance/log aggregation safe under parallelism.

Changes:

Execute llm_batches concurrently with a per-wave asyncio.Semaphore bounded by config.consolidation_llm_max_concurrent.
Extract per-batch execution into _execute_one_llm_batch() and return a _BatchExecutionResult for parent aggregation.
Centralize stats/log accounting via _record_result() / _ResultDelta and merge per-task perf into the parent via ConsolidationPerfLog.__iadd__.


Address Copilot review on vectorize-io#1604:

1. Coalesce None on consolidation_llm_max_concurrent
   The field is Optional[int] in HindsightConfig and only set when HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT is in env. Without the coalesce, max(1, None) raises TypeError on first wave. Fall back to the global llm_max_concurrent cap (which always has a default).

2. Use return_exceptions=True on the gather wave
   Without it, gather's first-exception-cancels-siblings semantic would skip the post-wave UPDATE, leaving observation rows already inserted by successful batches without their consolidated_at marker. Those memories would be re-consolidated on the next run, producing duplicate observations. Now: partition gather results into successes vs the first exception, apply DB markers for successful batches, then re-raise so the worker poller's standard exception handling kicks in. Additional exceptions in the same wave are logged via exc_info.

3. Revert obs_tags caching in _execute_one_llm_batch
   Round-2 hoisted observation_scopes parsing out of the while-pending loop on the assumption tag_groups upstream guaranteed uniform scopes per batch. tag_groups actually only keys on tags, so adaptive split sub_batches could legitimately have different observation_scopes than the parent llm_batch. Parse per sub_batch to preserve scope semantics.

4. Add tests/test_consolidation_parallelism.py
   - test_consolidation_honors_max_concurrent: pin max=4, ingest 8 memories with distinct tag groups, mock _process_memory_batch with a slot counter; assert 1 < max_in_flight <= 4.
   - test_partial_failure_preserves_succeeded_markers: ingest 6 memories, mock raises on the 2nd call; assert 5 succeeded markers present, 1 absent, exception re-raised.
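Items 1 and 2 of the commit above can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: `effective_limit`, `run_wave`, `process`, and `mark_consolidated` are hypothetical names, and the "DB marker" step is just a list append.

```python
import asyncio


def effective_limit(per_op_limit, global_limit):
    """Coalesce an unset (None) per-op limit to the global cap, with a floor of 1."""
    return max(1, per_op_limit if per_op_limit is not None else global_limit)


async def run_wave(batch_ids, process, mark_consolidated):
    # return_exceptions=True lets every sibling finish; failures come back as values.
    results = await asyncio.gather(
        *(process(b) for b in batch_ids), return_exceptions=True
    )
    succeeded = [b for b, r in zip(batch_ids, results) if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    mark_consolidated(succeeded)  # runs even when a sibling batch failed
    if errors:
        raise errors[0]  # surface the first failure to the caller
    return succeeded


async def demo():
    marked = []

    async def process(batch_id):
        await asyncio.sleep(0)
        if batch_id == 2:
            raise RuntimeError("batch 2 failed")
        return batch_id

    try:
        await run_wave([1, 2, 3], process, marked.extend)
    except RuntimeError as exc:
        return marked, str(exc)
    return marked, None


if __name__ == "__main__":
    print(asyncio.run(demo()), effective_limit(None, 8))
```

Marking the successful batches before re-raising is what keeps a single failing batch from causing its siblings to be reprocessed on the next run.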

connorblack added 3 commits May 12, 2026 19:49

Copilot AI review requested due to automatic review settings May 13, 2026 02:09

Copilot started reviewing on behalf of connorblack May 13, 2026 02:09

Copilot AI reviewed May 13, 2026

connorblack wants to merge 4 commits into vectorize-io:main from connorblack:feat/consolidation-parallelism
