Conversation

@sandersaares commented Jul 22, 2025

Test to accompany bug report #853.

@seanmonstar (Member)

Thanks for providing a test case! What you describe in the issue sounds plausible, but I notice in this that flow control isn't touched, so neither side will give any more connection window space at all, so it will by design definitely not be able to make further progress.
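
"Giving more connection window" here refers to h2's explicit flow-control API. A minimal sketch of what that can look like on the receiving side, assuming the h2 0.3+ FlowControl handle; drain_body is an illustrative helper, not code from this PR:

    // A sketch (not from the PR): explicitly release received bytes back to the
    // connection window so the peer is allowed to keep sending.
    use h2::RecvStream;

    async fn drain_body(mut body: RecvStream) -> Result<(), h2::Error> {
        // FlowControl is a cloneable handle onto this stream's flow-control state.
        let mut flow = body.flow_control().clone();
        while let Some(chunk) = body.data().await {
            let chunk = chunk?;
            // Until this call (or the RecvStream is dropped), the received bytes
            // stay charged against both the stream and the connection windows.
            flow.release_capacity(chunk.len())?;
        }
        Ok(())
    }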

Review comment on this portion of the test:

    );
    }

    thread::sleep(CHECK_FOR_PROGRESS_INTERVAL);
@seanmonstar (Member)

You'd also want to use something like tokio::time::sleep(CHECK_FOR_PROGRESS_INTERVAL).await to not block any of the other tasks on this thread.

@sandersaares (Author)

Good catch - updated.

@sandersaares (Author) commented Jul 22, 2025

> What you describe in the issue sounds plausible, but I notice in this that flow control isn't touched, so neither side will give any more connection window space at all, so it will by design definitely not be able to make further progress.

It is not clear what you mean by this. Why would flow control need to be touched? The connection has a non-zero window (before it gets allocated to pending_open requests that could not possibly use it, based on the theory above), so all the requests should be transmitted through that window. Am I misunderstanding something?

@seanmonstar (Member)

In HTTP/2, there's a window on each stream and on the connection as a whole, and the default is fairly small, just 64 KB. Flow control management with h2 is manual. Without the other "side" giving more window on the connection, no stream will be allowed to send more data.

hyper will manage it for you automatically, and perhaps you do in your larger app. But it is a reason why this specific test will hang. With 10 KB bodies, it should stop around the 7th request (-ish, since prioritization/ordering of DATA frames is not specified).
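
(For reference, the arithmetic behind "around the 7th request": assuming a 10 KB body means 10,240 bytes, the default initial connection window of 65,535 bytes covers 65,535 / 10,240 ≈ 6.4 bodies, so capacity runs out partway through the seventh request.)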

@sandersaares (Author)

If this test were failing because the connection window is not being extended, it would be failing also with CONCURRENCY = 1, wouldn't it? As configured in the PR, each increment of CONCURRENCY translates to 100 requests of 10 KB, so 1 MB of data, which would exceed a 64 KB window.

Yet with lower values of CONCURRENCY the test appears to pass. Even 10 000 requests go through just fine with CONCURRENCY=1.

This suggests to me that the problem is not the lack of connection window extension.

@seanmonstar (Member)

OK, I've debugged a bit more, and was reminded that h2 will automatically release the connection window that a stream has used once all references to that particular stream are dropped, which is why there are WINDOW_UPDATE frames with stream-id=0 being sent. That clears that issue up.

Fiddling with the numbers in your test case, and adding a few more printlns, it seems there is a correlation between streams being refused and eventually the connection window not having capacity. I'll need to investigate that a little more.

@benjaminp

The lack of request_capacity() calls in this test case means it completely relies on the stream-dropping auto-release to keep the connection window open. From the client's perspective, when CONCURRENCY is sufficiently large, it's possible that all open streams have partially transmitted request bodies and all local-half-closed streams have partially transmitted response bodies. Neither side can make progress because the window only changes when a stream completes. I think this state could explain the deadlock?
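
A minimal sketch of what explicit capacity management can look like on the sending side, assuming h2's SendStream API (reserve_capacity / poll_capacity / send_data); send_chunk is an illustrative helper, not code from this PR:

    // A sketch (illustrative only): reserve connection/stream capacity and wait
    // for it to be granted before sending a body chunk.
    use bytes::Bytes;
    use futures::future::poll_fn;
    use h2::SendStream;

    async fn send_chunk(
        stream: &mut SendStream<Bytes>,
        chunk: Bytes,
        end_of_stream: bool,
    ) -> Result<(), h2::Error> {
        stream.reserve_capacity(chunk.len());
        // Wait until the stream has actually been assigned enough window
        // before handing the data to h2.
        while stream.capacity() < chunk.len() {
            match poll_fn(|cx| stream.poll_capacity(cx)).await {
                Some(Ok(_granted)) => continue,
                Some(Err(e)) => return Err(e),
                // The stream can no longer be assigned capacity (e.g. it was reset).
                None => return Err(h2::Error::from(h2::Reason::CANCEL)),
            }
        }
        stream.send_data(chunk, end_of_stream)
    }

A real sender would also cope with being granted less than a full chunk by sending the body in pieces; the point is only that someone has to either reserve and release capacity explicitly or drop finished streams for the windows to move.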

@sandersaares (Author) commented Aug 27, 2025

Sounds plausible. Forgive my ignorance, but where in the Hyper stack does this "make sure you reserve enough connection window before you try sending something" logic go? In our real scenario where we found this behavior, we were just using Hyper as the higher layer above h2, and I posted this as an h2 bug simply in order to try to minimize the repro.

Perhaps that was a mistake and I removed important machinery from the picture in my minimization attempt. If we take the full picture of a Hyper-based HTTP/2 client, where would such logic go? I was under the assumption that Hyper would itself manage the different requests on the same connection. Is that assumption false? Or should we be looking at this as a Hyper bug, whereby it is failing to do that?

@seanmonstar (Member)

hyper does manage capacity for you automatically, yes. When I investigated some last month, I do believe there's a bug in h2 that is missing either a window update or release or something.

But my current priorities have me looking into other things at the moment.

@sandersaares (Author) commented Sep 4, 2025

This is slightly beyond my depth in h2 internals, but I am staring at this block here:

    let eos = frame.is_end_stream(); // remember whether the queued frame carried END_STREAM
    let len = len as usize;          // number of payload bytes we are allowed to send now
    if frame.payload().remaining() > len {
        // only part of the payload fits, so the partial frame must not carry END_STREAM
        frame.set_end_stream(false);
    }
    (eos, len)

That if branch is taken when we only have enough connection capacity to send part of the payload, not the entire payload. Two questions immediately pop into my head:

  • We call buf.take(len) below, but what happens to the rest of the payload? The original frame appears to just be dropped after we .map() to only the head of the payload.
  • We clear the "end stream" flag because we are only sending part of the queued frame. However, we still appear to return the partial frame with the previous eos flag, indicating "end of stream == true". Meanwhile, I do not see any logic that "preserves" the "end stream" flag for the remainder that we are not sending.

It is possible that I am only seeing a partial view - perhaps this logic collaborates with some other logic that takes care of these concerns - but this code looks suspicious to me. Could it be that these "partial frames" are getting their second half lost, causing some kind of stall whereby these streams never complete?

@benjaminp

I was wondering if #860 had an effect on your production version of this issue.

@sandersaares (Author)

@benjaminp I gave it a try but was not able to see any difference in outcome. I can see the rationale behind the change - why should things get allocated capacity if they just sit in a queue - but in practice, even if it is a piece of the puzzle, it seems not to be the whole puzzle.

@benjaminp

An update to #860 now fixes another case where connection windows could be starved.

@sandersaares (Author)

I tried the latest state of that branch; still no difference in our real workload - the deadlocks occur in an equivalent manner.
