perf: Improve eval runtime performance#1459

Merged
lforst merged 4 commits into main from abhi-flush-improvements on Mar 2, 2026
Conversation


@AbhiPrasad AbhiPrasad commented Mar 2, 2026

resolves #1394

AI Summary

Problem

Starting in v1.0.0, the braintrust SDK introduced a severe latency regression during Eval runs. A workload of 40 examples × 8 spans that completed in ~2s on v0.4.10 took ~18s on v3.2.0 — an 8.2x slowdown.

Two changes caused this:

  1. Serial chunk uploads: flushOnce() was changed to process items through a while loop that awaited each chunk of 25 items sequentially, replacing the v0.4.10 behavior of sending all items at once with parallel batches via Promise.all.

  2. Per-task blocking flush: framework.ts added await experiment.flush() after every single eval task when maxConcurrency is set, creating a synchronous network round-trip per task. v0.4.10 had no per-task flush — it let items accumulate and flushed in the background.
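To make the cost of the first change concrete, here is a hypothetical sketch (not the SDK's actual code) contrasting the two dispatch strategies. With N chunks and ~t ms of round-trip latency per upload, the serial loop costs roughly N×t while `Promise.all` dispatch costs roughly t:

```typescript
// Simulated round-trip latency for one chunk upload (illustrative stand-in).
const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

const uploadChunk = async (chunk: number[]): Promise<number> => {
  await delay(50); // pretend network round-trip
  return chunk.length;
};

function chunkItems<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// v1.0.0-style flush: await each chunk of 25 sequentially (the regression).
async function flushSerial(items: number[]): Promise<number> {
  let sent = 0;
  for (const chunk of chunkItems(items, 25)) {
    sent += await uploadChunk(chunk); // blocks until the previous chunk lands
  }
  return sent;
}

// v0.4.10-style flush: dispatch all chunks concurrently via Promise.all.
async function flushParallel(items: number[]): Promise<number> {
  const counts = await Promise.all(chunkItems(items, 25).map(uploadChunk));
  return counts.reduce((a, b) => a + b, 0);
}
```

With 100 items (4 chunks of 25), `flushSerial` pays four back-to-back latencies while `flushParallel` pays roughly one.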

Approaches attempted

| Approach | Wall time | Requests | Why it didn't fully work |
| --- | --- | --- | --- |
| Parallel chunks (Promise.all over chunk loop) | 6.6s | 50 | Parallelized within each flush, but still 20 separate flush cycles from per-task blocking |
| Increase chunk size to 100 | 6.8s | 20 | Fewer requests, but each flush had only 1 chunk — no parallelism benefit |
| Remove outer chunking, send all to batchItems | 8.8s | 20 | Items fit in 1 HTTP batch per flush → 20 sequential round-trips, no parallelism |
| Microtask yield (Promise.resolve) before flush | 8.4s | 50 | Tasks run on separate macrotasks; microtask coalescing didn't help |
| setTimeout(0) debounce in flushOnce | 10.0s | 41 | Added latency to every flush, including recursive tails |
| Item count threshold (100) for per-task flush | 5.0s | 10 | Good improvement, but count is a poor proxy for memory — 100 small items ≠ 100 large items |
| Item count threshold (500) | 2.6s | 7 | Near-baseline, but still count-based |
| Item count threshold (1000) | 1.9s | 6 | Matches baseline, but effectively disables backpressure |

Final fix (3 changes)

1. logger.ts — Remove chunking loop in flushOnce()

Send all drained items to flushWrappedItemsChunk at once, which internally batches by item count and byte size and sends batches in parallel via Promise.all. This matches v0.4.10 behavior.
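The batching described above — split by item count and by serialized byte size, then upload batches in parallel — can be sketched as follows. `batchItems` is named in the PR; the signature, limits, and internals here are illustrative assumptions, not the SDK's actual implementation:

```typescript
// Illustrative sketch: split items into batches bounded by both a maximum
// item count and a maximum serialized byte size. Each resulting batch can
// then be uploaded concurrently via Promise.all.
function batchItems(
  items: string[],
  maxCount: number,
  maxBytes: number
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;
  for (const item of items) {
    const size = Buffer.byteLength(item, "utf8");
    // Start a new batch when adding this item would exceed either limit.
    if (
      current.length > 0 &&
      (current.length >= maxCount || currentBytes + size > maxBytes)
    ) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

The dual bound matters: a count limit alone under-batches small items, and a byte limit alone can produce oversized request bodies.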

2. framework.ts — Byte-based backpressure threshold instead of per-task flush

Replace the unconditional await experiment.flush() after every task with a check against pending in-flight bytes. Only flush when serialized data exceeds 10MB (configurable via BRAINTRUST_FLUSH_BACKPRESSURE_BYTES). This lets items accumulate into larger, parallelizable batches for normal workloads while still bounding memory for large ones.
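A minimal sketch of that backpressure check, assuming the `pendingFlushBytes()` accessor, the 10MB default, and the `BRAINTRUST_FLUSH_BACKPRESSURE_BYTES` env var described in this PR; the surrounding loop structure is illustrative:

```typescript
// Default backpressure threshold: 10MB of serialized-but-unsent data.
const DEFAULT_BACKPRESSURE_BYTES = 10 * 1024 * 1024;

function flushBackpressureBytes(): number {
  const env = process.env.BRAINTRUST_FLUSH_BACKPRESSURE_BYTES;
  const parsed = env ? parseInt(env, 10) : NaN;
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_BACKPRESSURE_BYTES;
}

interface FlushableLogger {
  pendingFlushBytes(): number;
  flush(): Promise<void>;
}

// Called after each eval task: flush only when pending bytes exceed the
// threshold, rather than unconditionally blocking on every task.
async function maybeFlush(logger: FlushableLogger): Promise<void> {
  if (logger.pendingFlushBytes() > flushBackpressureBytes()) {
    await logger.flush();
  }
}
```

Under this scheme, small workloads never hit the threshold and flush once at the end in large parallel batches, while large workloads still bound memory.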

3. logger.ts — Track pending bytes and deprecate BRAINTRUST_LOG_FLUSH_CHUNK_SIZE

  • Track _pendingBytes (bytes serialized but not yet uploaded) in flushWrappedItemsChunk, exposed via pendingFlushBytes() on the BackgroundLogger interface.
  • Expose the backpressure threshold via flushBackpressureBytes(), configurable via BRAINTRUST_FLUSH_BACKPRESSURE_BYTES env var.
  • Emit a deprecation warning when BRAINTRUST_LOG_FLUSH_CHUNK_SIZE is set, since it no longer has any effect.
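The pending-byte accounting can be sketched like this — bytes are added when a batch is serialized and released once its upload settles, so `pendingFlushBytes()` reflects in-flight data. The class and method names follow the PR description; the internals are an assumption:

```typescript
// Illustrative sketch of in-flight byte tracking on the background logger.
class BackgroundLogger {
  private _pendingBytes = 0;

  // Bytes serialized but not yet confirmed uploaded.
  pendingFlushBytes(): number {
    return this._pendingBytes;
  }

  async uploadBatch(
    batch: string[],
    send: (body: string) => Promise<void>
  ): Promise<void> {
    const body = JSON.stringify(batch);
    const size = Buffer.byteLength(body, "utf8");
    this._pendingBytes += size;
    try {
      await send(body);
    } finally {
      this._pendingBytes -= size; // release whether the upload succeeds or fails
    }
  }
}
```

Releasing in `finally` is the important detail: a failed upload must not leave phantom bytes pinning the backpressure check open forever.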

Results

Benchmark: 40 examples × 8 spans, 128-byte payloads, maxConcurrency=4

| Metric | v0.4.10 (baseline) | v3.2.0 (broken) | v3.2.0 (fixed) |
| --- | --- | --- | --- |
| Wall time | 2.0s | 17.9s | 2.1s |
| Upload requests | 6 | 50 | 6 |
| Upload latency (cumulative) | 4.1s | 17.3s | 3.8s |
| Regression factor | 1x | 8.2x | ~1x |

CPU Benchmarking

The CPU profiles revealed a key insight: this was never a CPU problem. Here's what they showed:

CPU idle time tells the whole story

| Profile | Duration | CPU idle | Active CPU |
| --- | --- | --- | --- |
| v0.4.10 (baseline) | 2.2s | 89.0% | 11.0% |
| v3.2.0 (broken) | 18.0s | 97.1% | 2.9% |
| v3.2.0 (fixed) | 7.0s | 93.6% | 6.4% |

The broken v3.2.0 spent 97.1% of its 18-second runtime doing nothing — just waiting on network I/O. The CPU was idle for ~17.5 seconds. No function in the top 30 by self-time was flush-related; flushOnce and flushWrappedItemsChunk registered just 1 sample each. The bottleneck was purely accumulated network round-trip latency from sequential requests.

Request timeline analysis confirmed the pattern

The most useful data came not from the CPU profiles themselves but from the HTTP request timelines captured by our fetch instrumentation. In the v0.4.10 profile:

```
req 1: |=========945ms=========|
req 2:                          |====578ms====|
req 3:                          |=====657ms=====|
req 4:                          |======689ms======|    ← 5 fire together
req 5:                          |======699ms=======|
req 6:                          |=======757ms=======|
```
10 overlapping request pairs — requests 2-6 all started within 2ms of each other. The upload phase completed in 1.7s despite 4.3s of cumulative server time.

In the broken v3.2.0: zero overlapping request pairs. 50 back-to-back requests with 3ms average gaps between them, stretching to 17.4s.

The profiles pointed us away from the wrong fix

Without the profiles, you might assume the fix is to optimize the flush code itself — make serialization faster, reduce GC, etc. The profiles showed that CPU work (serialization, GC, JSON.stringify) was negligible — under 3% of runtime. The entire regression was architectural: how requests were dispatched (sequential vs parallel) and when flushes were triggered (every task vs threshold-based).

Speedscope viewing tips

For anyone opening these in speedscope.app:

  • v0.4.10: In the timeline view, look for stacked fetch calls — upload requests overlap vertically, showing parallel execution
  • v3.2.0 (broken): Look for the repeating fetch → idle gap → fetch → idle gap pattern across the entire 18s timeline. The call stack shows flushOnce → flushWrappedItemsChunk → submitLogsRequest on each request
  • The idle gaps between requests are the smoking gun — each gap is the event loop waiting for the previous await to resolve before starting the next chunk

@AbhiPrasad AbhiPrasad self-assigned this Mar 2, 2026
@AbhiPrasad AbhiPrasad changed the title from "fix: Improve eval runtime performance" to "perf: Improve eval runtime performance" Mar 2, 2026
@AbhiPrasad AbhiPrasad marked this pull request as ready for review March 2, 2026 19:13
@AbhiPrasad AbhiPrasad requested a review from lforst March 2, 2026 19:13
@AbhiPrasad AbhiPrasad mentioned this pull request Mar 2, 2026
@lforst lforst merged commit aa95a27 into main Mar 2, 2026
37 checks passed
@lforst lforst deleted the abhi-flush-improvements branch March 2, 2026 23:55
lforst added a commit that referenced this pull request Mar 3, 2026

Successfully merging this pull request may close these issues.

Significant latency regression between TypeScript SDKs v0.4.10 and v1.0.0