KAFKA-18660: Transactions Version 2 doesn't handle epoch overflow correctly #18730

Merged
merged 3 commits into from
Jan 30, 2025

@jolshan (Member) commented Jan 28, 2025

Fixed the typo that used the wrong producer ID and epoch when returning so that we handle epoch overflow correctly.

We also had to rearrange the concurrent transaction handling so that we don't self-fence when we start the new transaction with the new producer ID.

I also tested this with a modified version of the code where epoch overflow happens on the first epoch bump (so every request gets a new producer ID).

@github-actions bot added the core (Kafka Broker), tests (Test fixes, including flaky tests), and small (Small PRs) labels Jan 28, 2025
@github-actions bot removed the small (Small PRs) label Jan 29, 2025
@jolshan added the Blocker (This pull request is identified as solving a blocker for a release.) and transactions (Transactions and EOS) labels Jan 29, 2025
@jolshan marked this pull request as ready for review January 29, 2025 00:14
@@ -408,13 +408,13 @@ class TransactionCoordinator(txnConfig: TransactionConfig,

// generate the new transaction metadata with added partitions
txnMetadata.inLock {
if (txnMetadata.producerId != producerId) {
if (txnMetadata.pendingTransitionInProgress) {
// return a retriable exception to let the client backoff and retry
Contributor

Can we add a comment here that explains the significance of ordering this check prior to others?

Member Author

Sounds good. 👍

Contributor

@artemlivshits left a comment

LGTM

Comment on lines +413 to +415
// This check is performed first so that the pending transition can complete before subsequent checks.
// With TV2, we may be transitioning over a producer epoch overflow, and the producer may be using the
// new producer ID that is still only in pending state.
Contributor

To check my understanding: L419 would return producer_fenced in the epoch overflow case, which we don't want. Hence, we moved the pending-state check here, and we apply this logic in both the addPartitionsToTxn and endTransaction phases.

Member Author

We were hitting the invalid producer ID mapping in the overflow case. Let me explain briefly.

For EndTxn, we don't return until the PrepareX transition has completed on the state machine. For TV2, in both the epoch overflow and the normal case, the PrepareX state carries the previous epoch + 1. (In the overflow case, this is the max short value.)
At this point, the metadata is pending the transition to the CompleteX state. This is where the values differ depending on the epoch. If the epoch overflowed, the CompleteX state will contain a new producer ID and epoch 0. Otherwise it is the same as PrepareX (same producer ID and epoch + 1).
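
A minimal sketch of the two sets of values described above, for illustration only. The object, case class, and function names (`EpochOverflowSketch`, `ProducerIdAndEpoch`, `prepareState`, `completeState`) and the `nextProducerId` parameter are hypothetical and are not the coordinator's actual API; the sketch only models which producer ID and epoch each state would carry.

```scala
// Hypothetical model for illustration; these are not Kafka's actual classes.
object EpochOverflowSketch {
  final case class ProducerIdAndEpoch(producerId: Long, epoch: Short)

  // PrepareX under TV2: same producer ID, previous epoch + 1.
  // In the overflow case the result is Short.MaxValue.
  def prepareState(current: ProducerIdAndEpoch): ProducerIdAndEpoch = {
    require(current.epoch < Short.MaxValue, "epoch cannot be bumped past Short.MaxValue")
    current.copy(epoch = (current.epoch + 1).toShort)
  }

  // CompleteX: if the bumped epoch reached Short.MaxValue, move to a new
  // producer ID with epoch 0; otherwise keep the PrepareX values.
  // nextProducerId is an assumed input standing in for a freshly allocated ID.
  def completeState(prepared: ProducerIdAndEpoch, nextProducerId: Long): ProducerIdAndEpoch =
    if (prepared.epoch == Short.MaxValue) ProducerIdAndEpoch(nextProducerId, 0.toShort)
    else prepared
}
```

Under this model, the values the producer needs for its next transaction are those of `completeState`, not `prepareState`; the two only differ in the overflow case.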

We intended to return the values of the CompleteX state to the producer so that it can use the correct producer ID and epoch going forward, but we were accidentally returning the PrepareX values instead. This was the first bug. We would hit invalid pid mapping when the transition completed because the state would contain the new producer ID while the producer was still trying to use the one whose epoch had overflowed. Thus, a producer ID mismatch.

When I fixed this bug by returning the correct values to the producer, we had the opposite problem. If the producer started using the new producer ID while the CompleteX state was still pending, we would get the opposite producer ID mismatch. To avoid this, we should return a retriable error and wait for the state transition to complete, rather than returning the fatal invalid pid mapping.
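
The ordering of the checks can be sketched roughly as follows. This is a simplified, self-contained model and not the actual `TransactionCoordinator` code: `CheckOrderingSketch`, `TxnState`, `validate`, and the local error objects are hypothetical stand-ins (the error names mirror the Kafka error codes mentioned in this thread), and the real epoch comparison may be more involved than the plain equality assumed here.

```scala
// Simplified model of the check ordering; not Kafka's actual classes.
object CheckOrderingSketch {
  sealed trait Error
  case object ConcurrentTransactions extends Error   // retriable: client backs off and retries
  case object InvalidProducerIdMapping extends Error // fatal
  case object ProducerFenced extends Error           // fatal

  final case class TxnState(
    producerId: Long,
    producerEpoch: Short,
    pendingTransitionInProgress: Boolean
  )

  def validate(state: TxnState, producerId: Long, producerEpoch: Short): Either[Error, Unit] =
    // The pending check comes first: during an epoch overflow the producer may
    // already be using the new producer ID, which exists only in the pending state.
    if (state.pendingTransitionInProgress) Left(ConcurrentTransactions)
    else if (state.producerId != producerId) Left(InvalidProducerIdMapping)
    else if (state.producerEpoch != producerEpoch) Left(ProducerFenced) // assumed plain equality
    else Right(())
}
```

With the pending check first, a producer that arrives with the new producer ID before the transition completes gets a retriable error instead of a fatal one, and its retry succeeds once the state holds the new producer ID and epoch 0.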

Contributor

ah makes sense. thanks for the detailed clarification 👍

Contributor

@jeffkbkim left a comment

LGTM. thanks!

@jolshan jolshan merged commit ccab9eb into apache:trunk Jan 30, 2025
9 checks passed
jolshan added a commit to jolshan/kafka that referenced this pull request Jan 30, 2025
…rectly (apache#18730)
Reviewers: Artem Livshits <[email protected]>, Jeff Kim <[email protected]>
jolshan added a commit that referenced this pull request Jan 31, 2025
…rectly (#18730) (#18758)
airlock-confluentinc bot pushed a commit to confluentinc/kafka that referenced this pull request Feb 3, 2025
…rectly (apache#18730) (apache#18758)
pdruley pushed a commit to pdruley/kafka that referenced this pull request Feb 12, 2025
…rectly (apache#18730)
manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
…rectly (apache#18730)