
Conversation


@clutchski clutchski commented Jan 2, 2026

This PR adds timeouts and retries to prevent network errors from hanging evals.

Report from customer:

We have eval jobs (500+ rows of PDF docs) running on a VM that consistently freeze ~15 minutes after reaching 100% completion and get stuck in experiment.summarize(). The hang occurs in fetch_base_experiment() on a POST to /api/base_experiment/get_id (https://github.com/braintrustdata/braintrust-sdk/blob/main/py/src/braintrust/logger.py#L3606C9-L3606C78). This only reproduces with large datasets on the VM; it does not occur with smaller datasets or when run locally.
Root Cause
We think the issue is that fetch_base_experiment() uses app_conn() (Vercel IP), which is hit at experiment start (registration) and then not used again until summarize(). During a long eval run, all other logging traffic goes through api_conn() (AWS IP), leaving app_conn() idle for 15+ minutes.
Azure NAT gateways have a ~4-minute idle timeout and silently drop idle connections. When summarize() reuses the stale connection, the TCP session has already been removed by the NAT, leading to a hang and an eventual connection failure. The customer confirmed via a network capture that TCP retransmissions fail at this point, which is consistent with a stale NAT mapping.
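
For context, a minimal sketch of the shape of the fix, assuming an adapter subclass named RetryRequestExceptionsAdapter (the name comes up in the review below); the specific timeout and retry values here are illustrative, not the ones in this diff:

```python
import requests
from requests.adapters import HTTPAdapter


class RetryRequestExceptionsAdapter(HTTPAdapter):
    """Sketch: retry transient request exceptions and apply a default read timeout."""

    def __init__(self, num_retries: int = 3, default_timeout: float = 60.0, **kwargs):
        super().__init__(**kwargs)
        self.num_retries = num_retries
        self.default_timeout = default_timeout

    def send(self, request, **kwargs):
        # A bounded timeout turns a silent hang on a dead connection into an exception.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.default_timeout
        last_exc = None
        for _ in range(self.num_retries + 1):
            try:
                return super().send(request, **kwargs)
            except requests.exceptions.RequestException as e:
                last_exc = e
                if isinstance(e, requests.exceptions.ReadTimeout):
                    # Clear pooled sockets the NAT may have silently expired so the
                    # retry opens a fresh TCP connection.
                    self.close()
        raise last_exc
```

The idea is that a bounded read timeout turns the silent hang into an exception, and clearing the pool on ReadTimeout forces the retry to open a fresh TCP connection instead of reusing one the NAT has already dropped.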

@clutchski clutchski changed the title Matt/long eval → fix hanging evals Jan 2, 2026
this tightly couples our retry logic to braintrust state, which is weird.
@clutchski clutchski marked this pull request as ready for review January 2, 2026 19:52

@manugoyal manugoyal left a comment


Generally looks good. Mainly one question about the use of .close().

# Reset connection pool on timeout errors to clear stale connections
# (e.g., NAT gateway dropped idle connections)
if isinstance(e, requests.exceptions.ReadTimeout):
    self.close()

Are we sure that closing the HTTPAdapter will work for subsequent requests? It
seems to be calling methods like
this,
and it's unclear to me whether that's okay. Another (admittedly weird) thing to
try would be to recreate the adapter, roughly self = RetryRequestExceptionsAdapter(...); reassigning self won't strictly
work, but we could reassign a member variable instead.
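
For what it's worth, a rough sketch of that member-variable idea; the Connection class and its methods here are hypothetical, not from this diff. Re-mounting a fresh adapter on the session replaces the old adapter and its connection pool entirely:

```python
import requests
from requests.adapters import HTTPAdapter


class Connection:
    """Hypothetical wrapper that rebuilds its adapter instead of calling close()."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session = requests.Session()
        self._mount_adapter()

    def _mount_adapter(self):
        # Mounting a new adapter replaces the old one (and its connection pool).
        self._adapter = HTTPAdapter()
        self.session.mount("https://", self._adapter)
        self.session.mount("http://", self._adapter)

    def get(self, path: str, **kwargs):
        try:
            return self.session.get(self.base_url + path, timeout=60, **kwargs)
        except requests.exceptions.ReadTimeout:
            # Recreate the adapter rather than reusing one whose pool may hold
            # stale connections, then let the caller decide whether to retry.
            self._mount_adapter()
            raise
```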

def ping(self) -> bool:
    try:
        resp = self.get("ping")
        _state.set_user_info_if_null(resp.json())

Why get rid of this? If we don't need it, should we get rid of the whole user_info family of functions?

    return self.session.delete(_urljoin(self.base_url, path), *args, **kwargs)

def get_json(self, object_type: str, args: Mapping[str, Any] | None = None, retries: int = 0) -> Mapping[str, Any]:
    # FIXME[matt]: this retries parameter seems to be unused, and combined with the adapter's retry logic it could double (n*2) the number of attempts

Agreed. We were really just thinking "add more retries" when introducing the
retry handler. But we shouldn't have compounding retry handlers.
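
A toy illustration of the compounding (numbers hypothetical): if get_json() loops over its own retries parameter and the adapter independently retries each send, the total number of network attempts multiplies rather than adds.

```python
# Toy model of stacked retry layers: each of the (method_retries + 1) calls the
# method makes into the adapter can itself be retried (adapter_retries + 1) times,
# so the total number of network attempts multiplies.
def total_attempts(method_retries: int, adapter_retries: int) -> int:
    return (method_retries + 1) * (adapter_retries + 1)


assert total_attempts(1, 3) == 8  # one extra retry layer quickly compounds
```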
