@lucasbru lucasbru commented Oct 23, 2025

This change enhances the offset commit validation logic in streams
groups to validate against per-partition assignment epochs. When a
member attempts to commit offsets with an older member epoch, the logic
now validates that the epoch is not older than the assignment epoch for
each individual partition being committed.

The implementation adds a new createAssignmentEpochValidator method
that creates partition-level validators, checking each partition
against its assignment epoch from either assigned tasks or tasks
pending revocation.
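
As a rough illustration of the check described above, here is a minimal, dependency-free sketch. It is not the code from this PR: the class `AssignmentEpochSketch`, the `StaleEpochException` stand-in, the string keys, and the rule that an epoch ahead of the member epoch is rejected are all assumptions made for the example. It only models the idea that a commit carrying an older member epoch is accepted for a partition when that epoch is not older than the partition's assignment epoch, looked up first in the assigned tasks and then in the tasks pending revocation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the per-partition check; not the code from this PR.
class AssignmentEpochSketch {

    // Stand-in for Kafka's StaleMemberEpochException, so the sketch has no dependencies.
    static class StaleEpochException extends RuntimeException {
        StaleEpochException(String message) {
            super(message);
        }
    }

    // Called once per partition contained in the commit request.
    interface PartitionValidator {
        void validate(String topicName, int partitionId);
    }

    static String key(String topicName, int partitionId) {
        return topicName + "-" + partitionId;
    }

    static PartitionValidator createValidator(
        int memberEpoch,
        int commitEpoch,
        Map<String, Integer> assignedEpochs,
        Map<String, Integer> pendingRevocationEpochs
    ) {
        if (commitEpoch > memberEpoch) {
            // Assumption of this sketch: an epoch ahead of the member epoch is rejected.
            throw new StaleEpochException("Commit epoch " + commitEpoch + " is ahead of member epoch " + memberEpoch);
        }
        if (commitEpoch == memberEpoch) {
            // Current epoch: no per-partition check is needed in this sketch.
            return (topicName, partitionId) -> { };
        }
        // Older epoch: accept only partitions whose assignment epoch is not newer than the commit epoch.
        return (topicName, partitionId) -> {
            String k = key(topicName, partitionId);
            Integer assignmentEpoch = assignedEpochs.get(k);
            if (assignmentEpoch == null) {
                assignmentEpoch = pendingRevocationEpochs.get(k);
            }
            if (assignmentEpoch == null) {
                throw new StaleEpochException(k + " is not owned by this member");
            }
            if (commitEpoch < assignmentEpoch) {
                throw new StaleEpochException(k + " was assigned at epoch " + assignmentEpoch
                    + ", after commit epoch " + commitEpoch);
            }
        };
    }

    // Convenience helper: true if the validator accepts the partition.
    static boolean accepts(PartitionValidator validator, String topicName, int partitionId) {
        try {
            validator.validate(topicName, partitionId);
            return true;
        } catch (StaleEpochException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> assigned = new HashMap<>();
        assigned.put(key("input", 0), 3);
        assigned.put(key("input", 1), 7);
        Map<String, Integer> pendingRevocation = new HashMap<>();
        pendingRevocation.put(key("input", 2), 2);

        // Member is at epoch 8; the commit request still carries epoch 5.
        PartitionValidator validator = createValidator(8, 5, assigned, pendingRevocation);
        System.out.println(accepts(validator, "input", 0)); // true: assigned at 3 <= 5
        System.out.println(accepts(validator, "input", 1)); // false: assigned at 7 > 5
        System.out.println(accepts(validator, "input", 2)); // true: pending revocation, assigned at 2
        System.out.println(accepts(validator, "input", 3)); // false: not owned
    }
}
```

In this model, a delayed commit is fenced only for partitions that were assigned after the commit's epoch, so commits for partitions the member has owned throughout still go through.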

We extend the SmokeTestDriverIntegrationTest to detect if we have
processed more records than needed, which, in this restricted scenario,
should only happen when offset commits are failing.
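
To illustrate why a processed-record count above the expected count points at failed offset commits, here is a toy simulation. It is a sketch under stated assumptions, not the test code: `ReprocessingSketch` and `processAll` are hypothetical names, and the model assumes a single restart after which processing resumes from the last committed offset.

```java
// Hypothetical toy model; not the SmokeTestDriverIntegrationTest code.
class ReprocessingSketch {

    // Processes numRecords records with one restart after restartAt records.
    // Returns the total number of records processed, counting any reprocessing.
    static int processAll(int numRecords, int restartAt, boolean commitSucceeds) {
        if (restartAt < 0 || restartAt > numRecords) {
            throw new IllegalArgumentException("restartAt must be within [0, numRecords]");
        }
        int processed = 0;
        int committedOffset = 0;

        // First run: process records [0, restartAt), then try to commit.
        for (int offset = 0; offset < restartAt; offset++) {
            processed++;
        }
        if (commitSucceeds) {
            committedOffset = restartAt;
        }

        // After the restart, processing resumes from the last committed offset.
        for (int offset = committedOffset; offset < numRecords; offset++) {
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processAll(10, 4, true));  // 10: nothing is reprocessed
        System.out.println(processAll(10, 4, false)); // 14: the first 4 records are processed twice
    }
}
```

Comparing the processed count against the expected count therefore detects a rejected commit in this model, as long as nothing else in the scenario forces reprocessing.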

We re-enable the previously flaky test in EosIntegrationTest, which
failed due to previously failing offset commits.

Both tests have been run 100x in their streams protocol variation to
validate that they are not flaky anymore.

This is a stacked PR. Only review the last commit.

@github-actions github-actions bot added the streams, build (Gradle build or GitHub Actions), and group-coordinator labels Oct 23, 2025
return (topicName, topicId, partitionId) -> {
    final StreamsGroupTopologyValue.Subtopology subtopology = streamsTopology.sourceTopicMap().get(topicName);
    if (subtopology == null) {
        throw new StaleMemberEpochException("Topic " + topicName + " is not in the topology.");
Member Author

This case is actually impossible right now because we do not allow updating the topology yet. But I think this would be the correct behavior once we allow changing the topology: We are trying to commit for a subtopology that does not exist anymore, so we should fence the member.

) {
    // Retrieve topology once for all partitions - not per partition!
    final StreamsTopology streamsTopology = topology.get().orElseThrow(() ->
        new StaleMemberEpochException("Topology is not available for offset commit validation."));
Member Author

We do not allow removing the topology, so I think this is almost impossible. We'd have to recreate a group with the same name and get the same member ID back to reach this point. If that ever happened, I think fencing the member would be okay.

@lucasbru lucasbru requested a review from Copilot October 23, 2025 15:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances offset commit validation in streams groups by implementing per-partition epoch validation. Instead of just checking the member epoch, the system now validates that each partition being committed has an assignment epoch that is not newer than the commit request's epoch, enabling proper fencing of zombie commit requests.

Key changes:

  • Introduction of TasksTupleWithEpochs to track assignment epochs per partition for active tasks
  • Addition of createAssignmentEpochValidator method for partition-level validation
  • Extension of SmokeTestDriverIntegrationTest to detect excessive record processing
  • Re-enabling of previously flaky EosIntegrationTest
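
The first key change, tracking an assignment epoch per partition, could be modeled roughly as follows. This is a hypothetical sketch and not the actual `TasksTupleWithEpochs` or `CurrentAssignmentBuilder` code: the nested-map layout, the `withEpochs` helper, and the rule that a partition kept across a reassignment preserves its earlier epoch while a newly added partition receives the new member epoch are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: subtopologyId -> (partition -> assignment epoch).
class TaskEpochsSketch {

    // Builds the epoch-annotated assignment for a new target assignment.
    // Assumption: retained partitions keep their previous epoch, new ones get newEpoch.
    static Map<String, Map<Integer, Integer>> withEpochs(
        Map<String, Map<Integer, Integer>> previous,
        Map<String, Set<Integer>> target,
        int newEpoch
    ) {
        Map<String, Map<Integer, Integer>> result = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> entry : target.entrySet()) {
            Map<Integer, Integer> previousEpochs =
                previous.getOrDefault(entry.getKey(), Collections.emptyMap());
            Map<Integer, Integer> epochs = new HashMap<>();
            for (int partition : entry.getValue()) {
                epochs.put(partition, previousEpochs.getOrDefault(partition, newEpoch));
            }
            result.put(entry.getKey(), epochs);
        }
        return result;
    }

    public static void main(String[] args) {
        // Subtopology "0": partitions 0 and 1 were assigned at epoch 3.
        Map<Integer, Integer> previousEpochs = new HashMap<>();
        previousEpochs.put(0, 3);
        previousEpochs.put(1, 3);
        Map<String, Map<Integer, Integer>> previous = new HashMap<>();
        previous.put("0", previousEpochs);

        // New target at member epoch 5: partition 0 is dropped, partition 2 is added.
        Map<String, Set<Integer>> target = new HashMap<>();
        target.put("0", new HashSet<>(Arrays.asList(1, 2)));

        Map<String, Map<Integer, Integer>> updated = withEpochs(previous, target, 5);
        System.out.println(updated.get("0").get(1)); // 3: kept from the earlier assignment
        System.out.println(updated.get("0").get(2)); // 5: newly assigned
    }
}
```

Under these assumptions, an offset commit carrying epoch 4 would still be valid for partition 1 but not for partition 2.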

Reviewed Changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| TasksTupleWithEpochs.java | New class storing active tasks with per-partition assignment epochs |
| TasksTuple.java | Updated to work with TasksTupleWithEpochs for containment checks |
| StreamsGroup.java | Implements per-partition epoch validation via createAssignmentEpochValidator |
| StreamsGroupMember.java | Changed to use TasksTupleWithEpochs for assigned tasks |
| CurrentAssignmentBuilder.java | Updated to preserve and assign epochs when building assignments |
| StreamsCoordinatorRecordHelpers.java | Serialization support for assignment epochs |
| StreamsGroupCurrentMemberAssignmentValue.json | Added AssignmentEpochs field to schema |
| SmokeTestClient.java | Tracks total data records processed |
| SmokeTestDriverIntegrationTest.java | Validates no excessive record reprocessing occurs |
| EosIntegrationTest.java | Removed @Flaky annotation from test |
| Various test files | Updated to work with epoch-aware task assignments |

// We check that we did not have to reprocess any records, which would indicate a bug since everything
// runs locally in this test.
assertEquals(expectedRecords, numDataRecordsProcessed,
    String.format("It seems we had to reprocess records, with %d processed records expected, but %d records processed.",
Copilot AI Oct 23, 2025

The assertion message format is backwards - it states 'expected, but actual' when the actual parameters to assertEquals are (expected, actual). The message should say 'expected %d records but processed %d records' to match the parameter order.

Suggested change
- String.format("It seems we had to reprocess records, with %d processed records expected, but %d records processed.",
+ String.format("It seems we had to reprocess records, expected %d records but processed %d records.",

@lucasbru lucasbru changed the title from "KAFKA-19779: Add per-partition epoch validation to streams groups [5/N]" to "KAFKA-19779: Add per-partition epoch validation to streams groups [4/N]" Oct 23, 2025