Merged
Reviewed by ankrgyl and clutchski; approved by CLowbrow (Mar 2, 2026)
resolves #1394
AI Summary
Problem
Starting in v1.0.0, the braintrust SDK introduced a severe latency regression during Eval runs. A workload of 40 examples × 8 spans that completed in ~2s on v0.4.10 took ~18s on v3.2.0, an 8.2x slowdown.
Two changes caused this:
Serial chunk uploads:
`flushOnce()` was changed to process items through a `while` loop that `await`ed each chunk of 25 items sequentially, replacing the v0.4.10 behavior of sending all items at once as parallel batches via `Promise.all`.
Per-task blocking flush:
`framework.ts` added `await experiment.flush()` after every single eval task when `maxConcurrency` is set, creating a synchronous network round-trip per task. v0.4.10 had no per-task flush; it let items accumulate and flushed in the background.
Approaches attempted
Final fix (3 changes)
1. `logger.ts`: remove the chunking loop in `flushOnce()`. Send all drained items to `flushWrappedItemsChunk` at once, which internally batches by item count and byte size and sends batches in parallel via `Promise.all`. This matches v0.4.10 behavior.
2. `framework.ts`: byte-based backpressure threshold instead of a per-task flush. Replace the unconditional `await experiment.flush()` after every task with a check against pending in-flight bytes. Only flush when serialized data exceeds 10MB (configurable via `BRAINTRUST_FLUSH_BACKPRESSURE_BYTES`). This lets items accumulate into larger, parallelizable batches for normal workloads while still bounding memory for large ones.
3. `logger.ts`: track pending bytes and deprecate `BRAINTRUST_LOG_FLUSH_CHUNK_SIZE`. Adds `_pendingBytes` (bytes serialized but not yet uploaded) in `flushWrappedItemsChunk`, exposed via `pendingFlushBytes()` on the `BackgroundLogger` interface. Adds `flushBackpressureBytes()`, configurable via the `BRAINTRUST_FLUSH_BACKPRESSURE_BYTES` env var. Warns when `BRAINTRUST_LOG_FLUSH_CHUNK_SIZE` is set, since it no longer has any effect.
Results
Benchmark: 40 examples × 8 spans, 128-byte payloads, maxConcurrency=4
CPU Benchmarking
The CPU profiles revealed a key insight: this was never a CPU problem. Here's what they showed:
CPU idle time tells the whole story
The broken v3.2.0 spent 97.1% of its 18-second runtime doing nothing, just waiting on network I/O. The CPU was idle for ~17.5 seconds. No function in the top 30 by self-time was flush-related; `flushOnce` and `flushWrappedItemsChunk` registered just 1 sample each. The bottleneck was purely accumulated network round-trip latency from sequential requests.
Request timeline analysis confirmed the pattern
The most useful data came not from the CPU profiles themselves but from the HTTP request timelines captured by our fetch instrumentation. In the v0.4.10 profile:
10 overlapping request pairs — requests 2-6 all started within 2ms of each other. The upload phase completed in 1.7s despite 4.3s of cumulative server time.
In the broken v3.2.0: zero overlapping request pairs. 50 back-to-back requests with 3ms average gaps between them, stretching to 17.4s.
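The kind of fetch instrumentation described above can be sketched as follows. This is an illustrative reconstruction, not the SDK's actual code: the names (`RequestSpan`, `instrumentFetch`, `countOverlappingPairs`) are hypothetical. It wraps `fetch` to record each request's start and end time, then counts overlapping request pairs, the same statistic quoted in the timelines.

```typescript
// Hypothetical sketch of fetch instrumentation for request-timeline capture.
interface RequestSpan {
  url: string;
  start: number; // ms since profiling began
  end: number;
}

const spans: RequestSpan[] = [];

// Wrap a fetch implementation so every request records a RequestSpan.
function instrumentFetch(realFetch: typeof fetch): typeof fetch {
  const wrapped = async (input: any, init?: any) => {
    const start = performance.now();
    try {
      return await realFetch(input, init);
    } finally {
      spans.push({ url: String(input), start, end: performance.now() });
    }
  };
  return wrapped as typeof fetch;
}

// Two requests overlap if neither finishes before the other starts.
// Serial dispatch yields 0 overlapping pairs; parallel dispatch yields many.
function countOverlappingPairs(all: RequestSpan[]): number {
  let pairs = 0;
  for (let i = 0; i < all.length; i++) {
    for (let j = i + 1; j < all.length; j++) {
      if (all[i].start < all[j].end && all[j].start < all[i].end) pairs++;
    }
  }
  return pairs;
}
```

On this definition, 5 requests that all run concurrently produce C(5,2) = 10 overlapping pairs, which matches the "requests 2-6" figure above.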
The profiles pointed us away from the wrong fix
Without the profiles, you might assume the fix is to optimize the flush code itself — make serialization faster, reduce GC, etc. The profiles showed that CPU work (serialization, GC, JSON.stringify) was negligible — under 3% of runtime. The entire regression was architectural: how requests were dispatched (sequential vs parallel) and when flushes were triggered (every task vs threshold-based).
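The sequential-vs-parallel dispatch difference can be sketched in isolation. This is a minimal illustration with a simulated 50ms round-trip, not SDK code; `chunk`, `uploadChunk`, `flushSerial`, and `flushParallel` are hypothetical names:

```typescript
// Split items into fixed-size chunks (25, matching the chunk size in the summary).
const chunk = <T>(items: T[], size: number): T[][] => {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
};

// Stand-in for one upload request: a fixed 50ms simulated round-trip.
const uploadChunk = (_items: number[]): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 50));

// v1.0.0-style regression: await each chunk before starting the next,
// so round-trip latencies accumulate (4 chunks ≈ 4 round-trips).
async function flushSerial(items: number[]): Promise<void> {
  for (const c of chunk(items, 25)) {
    await uploadChunk(c);
  }
}

// v0.4.10-style behavior (restored by the fix): dispatch all chunks at
// once via Promise.all, so total latency is roughly one round-trip.
async function flushParallel(items: number[]): Promise<void> {
  await Promise.all(chunk(items, 25).map(uploadChunk));
}
```

With 50 requests at a few hundred milliseconds each, this accumulation alone accounts for the 17.4s upload phase in the broken version.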
Speedscope viewing tips
For anyone opening these in speedscope.app:
- In the v0.4.10 profile, look for overlapping `fetch` calls: upload requests overlap vertically, showing parallel execution.
- In the v3.2.0 profile, look for the `fetch` → idle gap → `fetch` → idle gap pattern across the entire 18s timeline. The call stack shows `flushOnce` → `flushWrappedItemsChunk` → `submitLogsRequest` on each request, each waiting for its `await` to resolve before starting the next chunk.
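The byte-based backpressure check from fix #2 can be sketched as follows. The 10MB default and the `BRAINTRUST_FLUSH_BACKPRESSURE_BYTES` env var come from the summary above; the logger interface shape and the `maybeFlush` helper are simplified illustrations, not the SDK's actual API:

```typescript
// Default backpressure threshold: 10MB of serialized-but-unsent data.
const DEFAULT_BACKPRESSURE_BYTES = 10 * 1024 * 1024;

// Read the threshold from the env var, falling back to the default
// when it is unset or not a positive number.
function flushBackpressureBytes(): number {
  const raw = process.env.BRAINTRUST_FLUSH_BACKPRESSURE_BYTES;
  const parsed = raw ? Number(raw) : NaN;
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_BACKPRESSURE_BYTES;
}

// Simplified stand-in for the BackgroundLogger surface described above.
interface LoggerLike {
  pendingFlushBytes(): number;
  flush(): Promise<void>;
}

// Called after each eval task: block on a flush only when pending bytes
// exceed the threshold, instead of flushing unconditionally per task.
// Returns true if a flush was performed.
async function maybeFlush(logger: LoggerLike): Promise<boolean> {
  if (logger.pendingFlushBytes() > flushBackpressureBytes()) {
    await logger.flush();
    return true;
  }
  return false;
}
```

This keeps the common case (small pending payloads) free of per-task round-trips while still bounding memory for workloads that serialize large amounts of data.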