Root Cause Analysis
In short, the root cause is incorrect use of an unbuffered channel shared by multiple goroutines.
The problem is in the TSO forwarding and dispatching framework on the server side, where many streaming-process goroutines handling the Tso() gRPC API share a single handleDispatcher() goroutine per forwarded host.
Below is the sequence of events (a minimal sketch of the resulting deadlock follows the list):
1. The first streaming-process goroutine, which created the handleDispatcher() goroutine, dispatched a request, then invoked the blocking call (gRPC client stream).Recv() and waited for some time without receiving anything.
2. The handleDispatcher() goroutine failed to process the request dispatched above, and then tried to pass the error through an unbuffered channel (errCh <- error).
3. Because of step 1, the streaming-process goroutine could not move forward to the point where it consumes the error channel, so both goroutines were blocked.
4. As new TSO streaming requests came in, more streaming-process goroutines were created, and they enqueued requests into the requests channel (a buffered channel with capacity 10000). Since the handleDispatcher() goroutine, the consumer of the requests channel, was blocked at step 2, nothing drained the channel and it hit its maximum capacity. More and more streaming-process goroutines then blocked while enqueuing requests.
5. Eventually, the gRPC server could not spawn more streaming-process goroutines.
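To make the blocking concrete, here is a minimal, self-contained Go sketch of the pattern described above. The names requests, errCh and handleDispatcher follow the issue text; the tiny buffer size and helper types are made up for illustration (the real requests channel has capacity 10000).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type tsoRequest struct{ id int }

func main() {
	// A tiny buffer stands in for the real 10000-slot requests channel so the
	// pile-up is visible immediately.
	requests := make(chan tsoRequest, 4)
	errCh := make(chan error) // unbuffered: a send blocks until someone receives

	// handleDispatcher(): the single dispatcher shared by all streams of one
	// forwarded host. Forwarding "fails", and the error send blocks forever
	// because the only would-be receiver is stuck in Recv() below.
	go func() {
		<-requests
		errCh <- errors.New("forward stream broken") // blocks forever
	}()

	// First streaming-process goroutine: dispatches a request, then blocks in
	// a Recv()-like call and never reaches the point where errCh is drained.
	go func() {
		requests <- tsoRequest{id: 0}
		select {} // stands in for the blocking (gRPC client stream).Recv()
	}()

	// Later streaming-process goroutines keep enqueuing until the buffer is
	// full; after that, each new one blocks on the send as well.
	for i := 1; i <= 10; i++ {
		go func(id int) {
			requests <- tsoRequest{id: id}
			fmt.Printf("request %d enqueued\n", id)
		}(i)
	}

	time.Sleep(time.Second)
	fmt.Printf("queue len=%d cap=%d; the remaining producers are blocked\n",
		len(requests), cap(requests))
}
```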
Besides the issue above, the TSO forwarding and dispatching framework may have several other problems (see the sketch after this list):
- All gRPC client streams for the same forwarded host share the same handleDispatcher() goroutine; if Send() fails on one gRPC client stream, the handleDispatcher() goroutine exits and the requests dispatched by every gRPC client stream become orphan requests.
- When an error happens in the handleDispatcher() goroutine, error handling is slowed down by the blocking stream.Recv() call in the corresponding streaming-process goroutine.
- After receiving the response to the batched request from the tso microservice, the dispatcher sends the responses to the client streams sequentially, which eats up the benefit of batching.
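As an illustration of the first and third points, here is a hypothetical sketch (not the actual PD code) of a shared dispatcher loop: responses are fanned out to the client streams one by one, and the first failing Send() ends the loop, orphaning the queued requests of every other stream. All type and field names below are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// clientStream stands in for a grpc.ServerStream-backed TSO client stream.
type clientStream struct {
	name   string
	broken bool
}

func (s *clientStream) Send(resp string) error {
	if s.broken {
		return errors.New("client stream closed")
	}
	fmt.Printf("%s <- %s\n", s.name, resp)
	return nil
}

// tsoRequest carries the originating stream and, for brevity, the response
// already obtained from the tso microservice for the batched request.
type tsoRequest struct {
	stream *clientStream
	resp   string
}

// handleDispatcher is shared by every client stream of one forwarded host.
func handleDispatcher(requests <-chan *tsoRequest) {
	for req := range requests {
		if err := req.stream.Send(req.resp); err != nil {
			fmt.Println("dispatcher exiting:", err)
			return // requests still in the channel are orphaned
		}
	}
}

func main() {
	healthy := &clientStream{name: "stream-A"}
	broken := &clientStream{name: "stream-B", broken: true}

	requests := make(chan *tsoRequest, 8)
	requests <- &tsoRequest{stream: broken, resp: "ts#1"}  // this Send fails
	requests <- &tsoRequest{stream: healthy, resp: "ts#2"} // never delivered
	requests <- &tsoRequest{stream: healthy, resp: "ts#3"} // never delivered
	close(requests)

	handleDispatcher(requests)
}
```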
… gRPC stream (#6572)
close #6549, ref #6565
Simplify tso proxy implementation by using one forward stream for one grpc.ServerStream.
#6565 is a longer-term solution for both follower batching and the tso microservice.
It's well implemented but needs more time to bake, and we need a short-term workable solution for now.
Signed-off-by: Bin Shi <[email protected]>
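For context, here is a rough sketch of the direction the fix takes as described in the commit message (one forward stream for one grpc.ServerStream): with a dedicated forward stream, an error only fails the stream that owns it instead of wedging a dispatcher shared with other streams. All types and helpers below are illustrative stand-ins, not the real PD API.

```go
package main

import (
	"errors"
	"fmt"
)

// serverStream stands in for the incoming grpc.ServerStream of one Tso() call.
type serverStream struct{ name string }

func (s *serverStream) Recv() (string, error) { return s.name + "-req", nil }
func (s *serverStream) Send(resp string) error {
	fmt.Printf("%s <- %s\n", s.name, resp)
	return nil
}

// forwardStream stands in for the dedicated stream to the TSO server.
type forwardStream struct{ healthy bool }

func (f *forwardStream) forward(req string) (string, error) {
	if !f.healthy {
		return "", errors.New("forward stream broken")
	}
	return "ts for " + req, nil
}

// handleTsoStream owns one forward stream for exactly one incoming stream, so
// any error is returned to this handler alone.
func handleTsoStream(in *serverStream, out *forwardStream) error {
	for i := 0; i < 3; i++ { // a real handler loops until the stream ends
		req, err := in.Recv()
		if err != nil {
			return err
		}
		resp, err := out.forward(req)
		if err != nil {
			return err // only this stream fails
		}
		if err := in.Send(resp); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(handleTsoStream(&serverStream{name: "stream-A"}, &forwardStream{healthy: true}))
	fmt.Println(handleTsoStream(&serverStream{name: "stream-B"}, &forwardStream{healthy: false}))
}
```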
rleungx pushed a commit to rleungx/pd that referenced this issue on Aug 2, 2023.
Enhancement Task
What did you do?
It happened over time while PD was running in a staging cluster.
What did you expect to see?
No tso forwarding stuck issue.
What did you see instead?
The API leader got stuck at dispatchTSORequest when forwarding TSO requests to the TSO servers.
What version of PD are you using (pd-server -V)?
tidbcloud/pd-cse release-6.6-keyspace 9e1e2de