Root Cause Analysis
In short, the root cause is incorrect use of an unbuffered channel shared by multiple goroutines.
The problem is in the TSO forwarding and dispatching framework on the server side, where many streaming-process goroutines handling the Tso() gRPC API share a single handleDispatcher() goroutine per forwarded host.
Below is the sequence of events (a minimal sketch of the resulting deadlock follows the list):
1. The first streaming-process goroutine, which created the handleDispatcher() goroutine, dispatched a request, then invoked the blocking call (gRPC client stream).Recv() and waited for some time without receiving anything.
2. The handleDispatcher() goroutine failed to process the request dispatched above, and then tried to pass the error through an unbuffered channel (errCh <- error).
3. Because of step 1, the streaming-process goroutine could not move forward to the point where it consumes the error channel, so both goroutines were blocked.
4. As new TSO streaming requests came in, more streaming-process goroutines were created, and they enqueued requests into the requests channel (a buffered channel with capacity 10000). Since the handleDispatcher() goroutine, the consumer of the requests channel, was blocked at step 2, nothing drained the channel and it hit its maximum capacity. More and more streaming-process goroutines then blocked while enqueuing requests.
5. Eventually, the gRPC server could not spawn more streaming-process goroutines.
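To make the blocking concrete, here is a minimal, self-contained Go sketch of the pattern described above. The names requests, errCh and handleDispatcher follow the issue text; the tiny buffer size and helper types are made up for illustration (the real requests channel has capacity 10000).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type tsoRequest struct{ id int }

func main() {
	// A tiny buffer stands in for the real 10000-slot requests channel so the
	// pile-up is visible immediately.
	requests := make(chan tsoRequest, 4)
	errCh := make(chan error) // unbuffered: a send blocks until someone receives

	// handleDispatcher(): the single dispatcher shared by all streams of one
	// forwarded host. Forwarding "fails", and the error send blocks forever
	// because the only would-be receiver is stuck in Recv() below.
	go func() {
		<-requests
		errCh <- errors.New("forward stream broken") // blocks forever
	}()

	// First streaming-process goroutine: dispatches a request, then blocks in
	// a Recv()-like call and never reaches the point where errCh is drained.
	go func() {
		requests <- tsoRequest{id: 0}
		select {} // stands in for the blocking (gRPC client stream).Recv()
	}()

	// Later streaming-process goroutines keep enqueuing until the buffer is
	// full; after that, each new one blocks on the send as well.
	for i := 1; i <= 10; i++ {
		go func(id int) {
			requests <- tsoRequest{id: id}
			fmt.Printf("request %d enqueued\n", id)
		}(i)
	}

	time.Sleep(time.Second)
	fmt.Printf("queue len=%d cap=%d; the remaining producers are blocked\n",
		len(requests), cap(requests))
}
```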
Besides the issue above, the TSO forwarding and dispatching framework may have several other problems (see the sketch after this list):
- All gRPC client streams for the same forwarded host share the same handleDispatcher() goroutine; if Send() fails on one gRPC client stream, the handleDispatcher() goroutine exits and the requests dispatched by every gRPC client stream become orphan requests.
- When an error happens in the handleDispatcher() goroutine, error handling is slowed down by the blocking stream.Recv() call in the corresponding streaming-process goroutine.
- After receiving the response to the batched request from the tso microservice, the dispatcher sends the responses to the client streams sequentially, which eats up the benefit of batching.
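As an illustration of the first and third points, here is a hypothetical sketch (not the actual PD code) of a shared dispatcher loop: responses are fanned out to the client streams one by one, and the first failing Send() ends the loop, orphaning the queued requests of every other stream. All type and field names below are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// clientStream stands in for a grpc.ServerStream-backed TSO client stream.
type clientStream struct {
	name   string
	broken bool
}

func (s *clientStream) Send(resp string) error {
	if s.broken {
		return errors.New("client stream closed")
	}
	fmt.Printf("%s <- %s\n", s.name, resp)
	return nil
}

// tsoRequest carries the originating stream and, for brevity, the response
// already obtained from the tso microservice for the batched request.
type tsoRequest struct {
	stream *clientStream
	resp   string
}

// handleDispatcher is shared by every client stream of one forwarded host.
func handleDispatcher(requests <-chan *tsoRequest) {
	for req := range requests {
		if err := req.stream.Send(req.resp); err != nil {
			fmt.Println("dispatcher exiting:", err)
			return // requests still in the channel are orphaned
		}
	}
}

func main() {
	healthy := &clientStream{name: "stream-A"}
	broken := &clientStream{name: "stream-B", broken: true}

	requests := make(chan *tsoRequest, 8)
	requests <- &tsoRequest{stream: broken, resp: "ts#1"}  // this Send fails
	requests <- &tsoRequest{stream: healthy, resp: "ts#2"} // never delivered
	requests <- &tsoRequest{stream: healthy, resp: "ts#3"} // never delivered
	close(requests)

	handleDispatcher(requests)
}
```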
… gRPC stream (#6572)
close #6549, ref #6565
Simplify tso proxy implementation by using one forward stream for one grpc.ServerStream.
#6565 is a longer-term solution for both follower batching and the tso microservice.
It's well implemented but needs more time to bake, and we need a short-term workable solution for now.
Signed-off-by: Bin Shi <[email protected]>
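For context, here is a rough sketch of the direction the fix takes as described in the commit message (one forward stream for one grpc.ServerStream): with a dedicated forward stream, an error only fails the stream that owns it instead of wedging a dispatcher shared with other streams. All types and helpers below are illustrative stand-ins, not the real PD API.

```go
package main

import (
	"errors"
	"fmt"
)

// serverStream stands in for the incoming grpc.ServerStream of one Tso() call.
type serverStream struct{ name string }

func (s *serverStream) Recv() (string, error) { return s.name + "-req", nil }
func (s *serverStream) Send(resp string) error {
	fmt.Printf("%s <- %s\n", s.name, resp)
	return nil
}

// forwardStream stands in for the dedicated stream to the TSO server.
type forwardStream struct{ healthy bool }

func (f *forwardStream) forward(req string) (string, error) {
	if !f.healthy {
		return "", errors.New("forward stream broken")
	}
	return "ts for " + req, nil
}

// handleTsoStream owns one forward stream for exactly one incoming stream, so
// any error is returned to this handler alone.
func handleTsoStream(in *serverStream, out *forwardStream) error {
	for i := 0; i < 3; i++ { // a real handler loops until the stream ends
		req, err := in.Recv()
		if err != nil {
			return err
		}
		resp, err := out.forward(req)
		if err != nil {
			return err // only this stream fails
		}
		if err := in.Send(resp); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(handleTsoStream(&serverStream{name: "stream-A"}, &forwardStream{healthy: true}))
	fmt.Println(handleTsoStream(&serverStream{name: "stream-B"}, &forwardStream{healthy: false}))
}
```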
rleungx pushed a commit to rleungx/pd that referenced this issue on Aug 2, 2023.
Enhancement Task
What did you do?
It happened over time while PD was running in a staging cluster.
What did you expect to see?
No tso forwarding stuck issue.
What did you see instead?
The API leader got stuck at dispatchTSORequest when forwarding TSO requests to the TSO servers.
What version of PD are you using (pd-server -V)?
tidbcloud/pd-cse release-6.6-keyspace 9e1e2de