
Conversation


@clutchski clutchski commented Jan 2, 2026

This PR adds timeouts and retries to prevent network errors from hanging evals.

Report from customer:

We have eval jobs (500+ rows of PDF docs) running on a VM that consistently freeze ~15 minutes after reaching 100% completion and get stuck in experiment.summarize(). The hang occurs in fetch_base_experiment() on a POST to /api/base_experiment/get_id (https://github.com/braintrustdata/braintrust-sdk/blob/main/py/src/braintrust/logger.py#L3606C9-L3606C78). This only reproduces with large datasets on the VM; it does not occur with smaller datasets or when run locally.
Root Cause
We think the issue is that fetch_base_experiment() uses app_conn() (Vercel IP), which is hit at experiment start (registration) and then not used again until summarize(). During a long eval run, all other logging traffic goes through api_conn() (AWS IP), leaving app_conn() idle for 15+ minutes.
Azure NAT gateways have a ~4-minute idle timeout and silently drop idle connections. When summarize() reuses the stale connection, the TCP session has already been removed by the NAT, leading to a hang and an eventual connection failure. The customer confirmed via a network capture that TCP retransmissions fail at this point, which is consistent with a stale NAT mapping.
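
For context, a minimal sketch of the shape of the fix, assuming an adapter subclass named RetryRequestExceptionsAdapter (the name comes up in the review below); the specific timeout and retry values here are illustrative, not the ones in this diff:

```python
import requests
from requests.adapters import HTTPAdapter


class RetryRequestExceptionsAdapter(HTTPAdapter):
    """Sketch: retry transient request exceptions and apply a default read timeout."""

    def __init__(self, num_retries: int = 3, default_timeout: float = 60.0, **kwargs):
        super().__init__(**kwargs)
        self.num_retries = num_retries
        self.default_timeout = default_timeout

    def send(self, request, **kwargs):
        # A bounded timeout turns a silent hang on a dead connection into an exception.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.default_timeout
        last_exc = None
        for _ in range(self.num_retries + 1):
            try:
                return super().send(request, **kwargs)
            except requests.exceptions.RequestException as e:
                last_exc = e
                if isinstance(e, requests.exceptions.ReadTimeout):
                    # Clear pooled sockets the NAT may have silently expired so the
                    # retry opens a fresh TCP connection.
                    self.close()
        raise last_exc
```

The idea is that a bounded read timeout turns the silent hang into an exception, and clearing the pool on ReadTimeout forces the retry to open a fresh TCP connection instead of reusing one the NAT has already dropped.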

@clutchski clutchski changed the title Matt/long eval → fix hanging evals Jan 2, 2026
this tightly couples our retry logic to braintrust state, which is weird.
@clutchski clutchski marked this pull request as ready for review January 2, 2026 19:52

@manugoyal manugoyal left a comment


Generally looks good. Mainly one question about the use of .close().

# Reset connection pool on timeout errors to clear stale connections
# (e.g., NAT gateway dropped idle connections)
if isinstance(e, requests.exceptions.ReadTimeout):
    self.close()

Are we sure that closing the HTTPAdapter will work for subsequent requests? It
seems to be calling methods like
this,
and it's unclear to me whether that's okay. Another (admittedly weird) thing to
try would be to recreate the adapter, roughly self = RetryRequestExceptionsAdapter(...); reassigning self won't strictly
work, but we could reassign a member variable instead.
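
For what it's worth, a rough sketch of that member-variable idea; the Connection class and its methods here are hypothetical, not from this diff. Re-mounting a fresh adapter on the session replaces the old adapter and its connection pool entirely:

```python
import requests
from requests.adapters import HTTPAdapter


class Connection:
    """Hypothetical wrapper that rebuilds its adapter instead of calling close()."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session = requests.Session()
        self._mount_adapter()

    def _mount_adapter(self):
        # Mounting a new adapter replaces the old one (and its connection pool).
        self._adapter = HTTPAdapter()
        self.session.mount("https://", self._adapter)
        self.session.mount("http://", self._adapter)

    def get(self, path: str, **kwargs):
        try:
            return self.session.get(self.base_url + path, timeout=60, **kwargs)
        except requests.exceptions.ReadTimeout:
            # Recreate the adapter rather than reusing one whose pool may hold
            # stale connections, then let the caller decide whether to retry.
            self._mount_adapter()
            raise
```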

def ping(self) -> bool:
    try:
        resp = self.get("ping")
        _state.set_user_info_if_null(resp.json())

Why get rid of this? If we don't need it, should we get rid of the whole user_info family of functions?

    return self.session.delete(_urljoin(self.base_url, path), *args, **kwargs)

def get_json(self, object_type: str, args: Mapping[str, Any] | None = None, retries: int = 0) -> Mapping[str, Any]:
    # FIXME[matt]: this retries parameter seems to be unused, and combined with the adapter's retry logic it could double (n*2) the number of attempts

Agreed. We were really just thinking "add more retries" when introducing the
retry handler. But we shouldn't have compounding retry handlers.
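
A toy illustration of the compounding (numbers hypothetical): if get_json() loops over its own retries parameter and the adapter independently retries each send, the total number of network attempts multiplies rather than adds.

```python
# Toy model of stacked retry layers: each of the (method_retries + 1) calls the
# method makes into the adapter can itself be retried (adapter_retries + 1) times,
# so the total number of network attempts multiplies.
def total_attempts(method_retries: int, adapter_retries: int) -> int:
    return (method_retries + 1) * (adapter_retries + 1)


assert total_attempts(1, 3) == 8  # one extra retry layer quickly compounds
```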
