-
Notifications
You must be signed in to change notification settings - Fork 14.5k
KAFKA-18660: Transactions Version 2 doesn't handle epoch overflow correctly #18730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -408,13 +408,13 @@ class TransactionCoordinator(txnConfig: TransactionConfig, | |||
|
|||
// generate the new transaction metadata with added partitions | |||
txnMetadata.inLock { | |||
if (txnMetadata.producerId != producerId) { | |||
if (txnMetadata.pendingTransitionInProgress) { | |||
// return a retriable exception to let the client backoff and retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment here that explains the significance of ordering this check prior to others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// This check is performed first so that the pending transition can complete before subsequent checks. | ||
// With TV2, we may be transitioning over a producer epoch overflow, and the producer may be using the | ||
// new producer ID that is still only in pending state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To understand: L419 will return producer_fenced which for epoch overflow which we don't want. Hence, we moved the pending state check here and we are applying this logic in both addPartitionsToTxn and endTransaction phases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were hitting the invalid producer ID mapping in the overflow case. Let me explain briefly.
For EndTxn, we don't return until the PrepareX transition has completed on the state machine. For TV2 in both epoch overflow and normal case, this will be the previous epoch + 1. (In the overflow case, this is max short)
At this point, metadata is pending the CompleteX state. This is where the value differs depending on the epoch. If the epoch overflowed, the state will contain a new producer ID and epoch 0. Otherwise it is the same as PrepareX (same producer id and epoch + 1).
We intended to return the values of the CompleteX state to the producer so the producer can use the correct producer ID and epoch going forward, but we were accidentally returning the PrepareX state instead. This was the first bug. We would hit invalid pid mapping when the transition completed becauase the state would contain the new producer ID and the producer was still trying to use the one that had epoch overflow. Thus, producer ID mismatch.
When I fixed this bug by returning the correct values to the producer, we had the opposite problem. When the producer started using the new producer ID when the CompleteX state was still pending, we would have the opposite producer ID mismatch. In order to avoid this, we should return with a retriable error and wait for the state to complete transition rather than the fatal invalid pid mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah makes sense. thanks for the detailed clarification 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. thanks!
…rectly (apache#18730) Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly. We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID. I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (every request has a new producer id) Reviewers: Artem Livshits <[email protected]>, Jeff Kim <[email protected]>
…rectly (#18730) (#18758) Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly. We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID. I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (every request has a new producer id) Reviewers: Artem Livshits <[email protected]>, Jeff Kim <[email protected]>
…rectly (apache#18730) (apache#18758) Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly. We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID. I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (every request has a new producer id) Reviewers: Artem Livshits <[email protected]>, Jeff Kim <[email protected]>
…rectly (apache#18730) Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly. We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID. I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (every request has a new producer id) Reviewers: Artem Livshits <[email protected]>, Jeff Kim <[email protected]>
…rectly (apache#18730) Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly. We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID. I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (every request has a new producer id) Reviewers: Artem Livshits <[email protected]>, Jeff Kim <[email protected]>
Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly.
We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID.
I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (every request has a new producer id)