perf: Improve eval runtime performance#1459

Merged
lforst merged 4 commits into main from abhi-flush-improvements on Mar 2, 2026
Conversation


@AbhiPrasad AbhiPrasad commented Mar 2, 2026

resolves #1394

AI Summary

Problem

Starting in v1.0.0, the braintrust SDK introduced a severe latency regression during Eval runs. A workload of 40 examples × 8 spans that completed in ~2s on v0.4.10 took ~18s on v3.2.0 — an 8.2x slowdown.

Two changes caused this:

  1. Serial chunk uploads: flushOnce() was changed to process items through a while loop that awaited each chunk of 25 items sequentially, replacing the v0.4.10 behavior of sending all items at once with parallel batches via Promise.all.

  2. Per-task blocking flush: framework.ts added await experiment.flush() after every single eval task when maxConcurrency is set, creating a synchronous network round-trip per task. v0.4.10 had no per-task flush — it let items accumulate and flushed in the background.
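To make the cost of the first change concrete, here is a hypothetical sketch (not the SDK's actual code) contrasting the two dispatch strategies. With N chunks and ~t ms of round-trip latency per upload, the serial loop costs roughly N×t while `Promise.all` dispatch costs roughly t:

```typescript
// Simulated round-trip latency for one chunk upload (illustrative stand-in).
const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

const uploadChunk = async (chunk: number[]): Promise<number> => {
  await delay(50); // pretend network round-trip
  return chunk.length;
};

function chunkItems<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// v1.0.0-style flush: await each chunk of 25 sequentially (the regression).
async function flushSerial(items: number[]): Promise<number> {
  let sent = 0;
  for (const chunk of chunkItems(items, 25)) {
    sent += await uploadChunk(chunk); // blocks until the previous chunk lands
  }
  return sent;
}

// v0.4.10-style flush: dispatch all chunks concurrently via Promise.all.
async function flushParallel(items: number[]): Promise<number> {
  const counts = await Promise.all(chunkItems(items, 25).map(uploadChunk));
  return counts.reduce((a, b) => a + b, 0);
}
```

With 100 items (4 chunks of 25), `flushSerial` pays four back-to-back latencies while `flushParallel` pays roughly one.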

Approaches attempted

| Approach | Wall time | Requests | Why it didn't fully work |
| --- | --- | --- | --- |
| Parallel chunks (Promise.all over chunk loop) | 6.6s | 50 | Parallelized within each flush, but still 20 separate flush cycles from per-task blocking |
| Increase chunk size to 100 | 6.8s | 20 | Fewer requests, but each flush had only 1 chunk — no parallelism benefit |
| Remove outer chunking, send all to batchItems | 8.8s | 20 | Items fit in 1 HTTP batch per flush → 20 sequential round-trips, no parallelism |
| Microtask yield (Promise.resolve) before flush | 8.4s | 50 | Tasks run on separate macrotasks; microtask coalescing didn't help |
| setTimeout(0) debounce in flushOnce | 10.0s | 41 | Added latency to every flush, including recursive tails |
| Item count threshold (100) for per-task flush | 5.0s | 10 | Good improvement, but count is a poor proxy for memory — 100 small items ≠ 100 large items |
| Item count threshold (500) | 2.6s | 7 | Near-baseline, but still count-based |
| Item count threshold (1000) | 1.9s | 6 | Matches baseline, but effectively disables backpressure |

Final fix (3 changes)

1. logger.ts — Remove chunking loop in flushOnce()

Send all drained items to flushWrappedItemsChunk at once, which internally batches by item count and byte size and sends batches in parallel via Promise.all. This matches v0.4.10 behavior.
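The batching described above — split by item count and by serialized byte size, then upload batches in parallel — can be sketched as follows. `batchItems` is named in the PR; the signature, limits, and internals here are illustrative assumptions, not the SDK's actual implementation:

```typescript
// Illustrative sketch: split items into batches bounded by both a maximum
// item count and a maximum serialized byte size. Each resulting batch can
// then be uploaded concurrently via Promise.all.
function batchItems(
  items: string[],
  maxCount: number,
  maxBytes: number
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;
  for (const item of items) {
    const size = Buffer.byteLength(item, "utf8");
    // Start a new batch when adding this item would exceed either limit.
    if (
      current.length > 0 &&
      (current.length >= maxCount || currentBytes + size > maxBytes)
    ) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

The dual bound matters: a count limit alone under-batches small items, and a byte limit alone can produce oversized request bodies.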

2. framework.ts — Byte-based backpressure threshold instead of per-task flush

Replace the unconditional await experiment.flush() after every task with a check against pending in-flight bytes. Only flush when serialized data exceeds 10MB (configurable via BRAINTRUST_FLUSH_BACKPRESSURE_BYTES). This lets items accumulate into larger, parallelizable batches for normal workloads while still bounding memory for large ones.
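A minimal sketch of that backpressure check, assuming the `pendingFlushBytes()` accessor, the 10MB default, and the `BRAINTRUST_FLUSH_BACKPRESSURE_BYTES` env var described in this PR; the surrounding loop structure is illustrative:

```typescript
// Default backpressure threshold: 10MB of serialized-but-unsent data.
const DEFAULT_BACKPRESSURE_BYTES = 10 * 1024 * 1024;

function flushBackpressureBytes(): number {
  const env = process.env.BRAINTRUST_FLUSH_BACKPRESSURE_BYTES;
  const parsed = env ? parseInt(env, 10) : NaN;
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_BACKPRESSURE_BYTES;
}

interface FlushableLogger {
  pendingFlushBytes(): number;
  flush(): Promise<void>;
}

// Called after each eval task: flush only when pending bytes exceed the
// threshold, rather than unconditionally blocking on every task.
async function maybeFlush(logger: FlushableLogger): Promise<void> {
  if (logger.pendingFlushBytes() > flushBackpressureBytes()) {
    await logger.flush();
  }
}
```

Under this scheme, small workloads never hit the threshold and flush once at the end in large parallel batches, while large workloads still bound memory.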

3. logger.ts — Track pending bytes and deprecate BRAINTRUST_LOG_FLUSH_CHUNK_SIZE

  • Track _pendingBytes (bytes serialized but not yet uploaded) in flushWrappedItemsChunk, exposed via pendingFlushBytes() on the BackgroundLogger interface.
  • Expose the backpressure threshold via flushBackpressureBytes(), configurable via BRAINTRUST_FLUSH_BACKPRESSURE_BYTES env var.
  • Emit a deprecation warning when BRAINTRUST_LOG_FLUSH_CHUNK_SIZE is set, since it no longer has any effect.
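The pending-byte accounting can be sketched like this — bytes are added when a batch is serialized and released once its upload settles, so `pendingFlushBytes()` reflects in-flight data. The class and method names follow the PR description; the internals are an assumption:

```typescript
// Illustrative sketch of in-flight byte tracking on the background logger.
class BackgroundLogger {
  private _pendingBytes = 0;

  // Bytes serialized but not yet confirmed uploaded.
  pendingFlushBytes(): number {
    return this._pendingBytes;
  }

  async uploadBatch(
    batch: string[],
    send: (body: string) => Promise<void>
  ): Promise<void> {
    const body = JSON.stringify(batch);
    const size = Buffer.byteLength(body, "utf8");
    this._pendingBytes += size;
    try {
      await send(body);
    } finally {
      this._pendingBytes -= size; // release whether the upload succeeds or fails
    }
  }
}
```

Releasing in `finally` is the important detail: a failed upload must not leave phantom bytes pinning the backpressure check open forever.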

Results

Benchmark: 40 examples × 8 spans, 128-byte payloads, maxConcurrency=4

| Metric | v0.4.10 (baseline) | v3.2.0 (broken) | v3.2.0 (fixed) |
| --- | --- | --- | --- |
| Wall time | 2.0s | 17.9s | 2.1s |
| Upload requests | 6 | 50 | 6 |
| Upload latency (cumulative) | 4.1s | 17.3s | 3.8s |
| Regression factor | 1x | 8.2x | ~1x |

CPU Benchmarking

The CPU profiles revealed a key insight: this was never a CPU problem. Here's what they showed:

CPU idle time tells the whole story

| Profile | Duration | CPU idle | Active CPU |
| --- | --- | --- | --- |
| v0.4.10 (baseline) | 2.2s | 89.0% | 11.0% |
| v3.2.0 (broken) | 18.0s | 97.1% | 2.9% |
| v3.2.0 (fixed) | 7.0s | 93.6% | 6.4% |

The broken v3.2.0 spent 97.1% of its 18-second runtime doing nothing — just waiting on network I/O. The CPU was idle for ~17.5 seconds. No function in the top 30 by self-time was flush-related; flushOnce and flushWrappedItemsChunk registered just 1 sample each. The bottleneck was purely accumulated network round-trip latency from sequential requests.

Request timeline analysis confirmed the pattern

The most useful data came not from the CPU profiles themselves but from the HTTP request timelines captured by our fetch instrumentation. In the v0.4.10 profile:

```
req 1: |=========945ms=========|
req 2:                          |====578ms====|
req 3:                          |=====657ms=====|
req 4:                          |======689ms======|    ← 5 fire together
req 5:                          |======699ms=======|
req 6:                          |=======757ms=======|
```
10 overlapping request pairs — requests 2-6 all started within 2ms of each other. The upload phase completed in 1.7s despite 4.3s of cumulative server time.

In the broken v3.2.0: zero overlapping request pairs. 50 back-to-back requests with 3ms average gaps between them, stretching to 17.4s.

The profiles pointed us away from the wrong fix

Without the profiles, you might assume the fix is to optimize the flush code itself — make serialization faster, reduce GC, etc. The profiles showed that CPU work (serialization, GC, JSON.stringify) was negligible — under 3% of runtime. The entire regression was architectural: how requests were dispatched (sequential vs parallel) and when flushes were triggered (every task vs threshold-based).

Speedscope viewing tips

For anyone opening these in speedscope.app:

  • v0.4.10: In the timeline view, look for stacked fetch calls — upload requests overlap vertically, showing parallel execution
  • v3.2.0 (broken): Look for the repeating fetch → idle gap → fetch → idle gap pattern across the entire 18s timeline. The call stack shows flushOnce → flushWrappedItemsChunk → submitLogsRequest on each request
  • The idle gaps between requests are the smoking gun — each gap is the event loop waiting for the previous await to resolve before starting the next chunk

@AbhiPrasad AbhiPrasad self-assigned this Mar 2, 2026
@AbhiPrasad AbhiPrasad changed the title from "fix: Improve eval runtime performance" to "perf: Improve eval runtime performance" Mar 2, 2026
@AbhiPrasad AbhiPrasad marked this pull request as ready for review March 2, 2026 19:13
@AbhiPrasad AbhiPrasad requested a review from lforst March 2, 2026 19:13
@AbhiPrasad AbhiPrasad mentioned this pull request Mar 2, 2026
@lforst lforst merged commit aa95a27 into main Mar 2, 2026
37 checks passed
@lforst lforst deleted the abhi-flush-improvements branch March 2, 2026 23:55
lforst added a commit that referenced this pull request Mar 3, 2026

Successfully merging this pull request may close these issues.

Significant latency regression between TypeScript SDKs v0.4.10 and v1.0.0