Conversation

@lucasbru lucasbru commented Sep 5, 2025

This is actually fixing a difference between the old and the new
assignor. Given the assignment ordering, the legacy assignor has a
preference for range-style assignments built in, that is, assigning

C1: 0_0, 1_0
C2: 0_1, 1_1

instead of

C1: 0_0, 0_1
C2: 1_0, 1_1

We add tests to both assignors to check for this behavior, and improve
the new assignor by enforcing the corresponding orderings.

Reviewers: Bill Bejeck [email protected]

@lucasbru lucasbru changed the title from "KAFKA-19661 [5/N]: Prefer range-style assignment" to "KAFKA-19661 [4/N]: Prefer range-style assignment" Sep 5, 2025
@lucasbru lucasbru requested review from bbejeck and Copilot September 5, 2025 10:14

private void assignActive(final LinkedList<TaskId> activeTasks) {

    // Assuming our current assignment pairs same partitions (range-based), we want to sort by partition first.
    activeTasks.sort(Comparator.comparing(TaskId::partition).thenComparing(TaskId::subtopologyId));
Member Author

I added the sorting here. The old assignor did not do this sorting explicitly, but happened to run into the "good case".

The point is this:
Normally, we want to assign active tasks like a range assignor: when we have two subtopologies with two partitions each and two clients, we will assign

Client1: 0_0, 1_0
Client2: 0_1, 1_1

The reason is that, heuristically, if we had the assignment

Client1: 0_0, 0_1
Client2: 1_0, 1_1

and the first subtopology has large state and the second subtopology has small state, then one client gets most of the state.

The sorting here also helps to achieve this kind of range assignment when scaling up. Assume that all tasks are currently assigned to the first member:

Client1: 0_0, 0_1, 1_0, 1_1
Client2: -

Now, we first reassign the previously owned tasks. We want to start with all partition-0 tasks before the partition-1 tasks, until Client1 fills up:

Client1: 0_0, 1_0
Client2: -

Then we fill up Client2 the usual way:

Client1: 0_0, 1_0
Client2: 0_1, 1_1

This is a corner case, but it seems like a useful improvement.
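
To make this concrete, here is a minimal, self-contained sketch of the idea. The simplified Task record, the hard-coded quota of 2, and the class name are assumptions for illustration only; this is not the actual Kafka TaskId or assignor code.

import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;

public class RangeSortSketch {

    // Hypothetical stand-in for the real TaskId: printed as "subtopologyId_partition", e.g. 0_1.
    record Task(int subtopologyId, int partition) {
        @Override
        public String toString() {
            return subtopologyId + "_" + partition;
        }
    }

    public static void main(String[] args) {
        // Client1 previously owned all four tasks; Client2 just joined.
        final LinkedList<Task> previouslyOwned = new LinkedList<>(List.of(
            new Task(0, 0), new Task(0, 1), new Task(1, 0), new Task(1, 1)));

        // Sort by partition first, then subtopology, mirroring the sort above.
        previouslyOwned.sort(Comparator.comparing(Task::partition)
            .thenComparing(Task::subtopologyId));

        // With an (assumed) quota of 2 active tasks per client, Client1 keeps the
        // first two tasks in this order (0_0 and 1_0), leaving 0_1 and 1_1 for
        // Client2 -- the range-style split described above.
        final int quota = 2;
        System.out.println("Client1 keeps: " + previouslyOwned.subList(0, quota));
        System.out.println("Client2 gets:  " + previouslyOwned.subList(quota, previouslyOwned.size()));
    }
}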

for (final TaskId task : standbyTasks) {

// Assuming our current assignment is range-based, we want to sort by partition first.
standbyTasks.sort(Comparator.comparing(TaskId::partition).thenComparing(TaskId::subtopologyId).reversed());
Member Author

We want to assign standby tasks in reverse.

The reason why we want to traverse the standby tasks in reverse order is the example that I added to the unit tests of both the LegacyStickyTaskAssignor and the new StickyTaskAssignor.

Assume we have
Node 1: Active task 0,1, Standby task 2,3
Node 2: Active task 2,3, Standby task 0,1
Node 3: - (new)

Then we don't want to assign active tasks and standby tasks in the same order.
Suppose we assign active tasks in increasing order; we will get:

Node 1: Active task 0,1
Node 2: Active task 2
Node 3: Active task 3

Task 3 is the last active task we assign, and at that point Node 1 and Node 2 have already reached their active-task quotas, so it can only be assigned to Node 3.

Suppose we now assign standby tasks in the same order; we will get this:

Node 1: Active task 0,1, Standby task 2, 3
Node 2: Active task 2, Standby task 0, 1
Node 3: Active task 3

The reason is that we first assign standby tasks 0, 1, and 2, which can all be assigned to the previous member that owned them. Finally, we want to assign standby task 3, but it cannot be placed on Node 3, because Node 3 already owns the active task 3, so we have to assign it to Node 1 or Node 2.

Using reverse order means that, when there are new nodes, they will get the numerically last few active tasks and the numerically first standby tasks, which should avoid this problem.
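
To illustrate this, here is a toy model of the example above. The greedy placement rule, the per-node cap of 3 total tasks, and the class name are assumptions made only for this sketch; it is not the real StickyTaskAssignor logic, but it shows why forward traversal strands standby task 3 while reverse traversal balances out.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StandbyOrderSketch {

    // Fixed active placement from the example: Node1:{0,1}, Node2:{2}, Node3:{3}.
    // Previous standby owners: Node1:{2,3}, Node2:{0,1}.
    static Map<String, List<Integer>> assignStandbys(final List<Integer> traversalOrder) {
        final Map<String, List<Integer>> active = new LinkedHashMap<>();
        active.put("Node1", List.of(0, 1));
        active.put("Node2", List.of(2));
        active.put("Node3", List.of(3));

        final Map<String, List<Integer>> standby = new LinkedHashMap<>();
        active.keySet().forEach(node -> standby.put(node, new ArrayList<>()));

        final Map<Integer, String> previousStandbyOwner = Map.of(
            2, "Node1", 3, "Node1", 0, "Node2", 1, "Node2");

        for (final int task : traversalOrder) {
            // Prefer the previous standby owner; otherwise any node that can host the task.
            final String previous = previousStandbyOwner.get(task);
            final String target = canHost(active, standby, previous, task) ? previous
                : active.keySet().stream()
                    .filter(node -> canHost(active, standby, node, task))
                    .findFirst().orElse(null);
            if (target != null) {
                standby.get(target).add(task);
            } else {
                System.out.println("  standby " + task + " can only go to an already-full node");
            }
        }
        return standby;
    }

    // A node can host a standby if it does not own the same active task and
    // stays within an (assumed) cap of 3 total tasks.
    static boolean canHost(final Map<String, List<Integer>> active, final Map<String, List<Integer>> standby,
                           final String node, final int task) {
        return !active.get(node).contains(task)
            && active.get(node).size() + standby.get(node).size() < 3;
    }

    public static void main(String[] args) {
        System.out.println("Increasing order:");
        System.out.println("  " + assignStandbys(List.of(0, 1, 2, 3)));
        System.out.println("Reverse order:");
        System.out.println("  " + assignStandbys(List.of(3, 2, 1, 0)));
    }
}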

Member

Using reverse order means, when I have new nodes, they will get the numerically last few active tasks, and the numerically first standby tasks,

I was going to ask about this working with the existing HA assignor, but I don't think that it applies anymore for KIP-1071, correct?

and the numerically first standby tasks

If I'm understanding your example correctly, previous ownership will take priority when assigning standbys?

Member Author

I was going to ask about this working with the existing HA assignor, but I don't think that it applies anymore for KIP-1071, correct?

Yes

If I'm understanding your example correctly, previous ownership will take priority when assigning standbys?

Yes

Member

@bbejeck bbejeck left a comment

Thanks for the PR @lucasbru - I've left some comments - overall lgtm


// To achieve an initially range-based assignment, sort by subtopology
activeTasks.sort(Comparator.comparing(TaskId::subtopologyId).thenComparing(TaskId::partition));

Member

So we do the second sort here by subtopologyId, then partition, to get the range-style assignment that distributes state across sub-topologies.

Member Author

Yes. I should clarify in the comment that this mostly applies to the case where the number of partitions is a multiple of the number of nodes, and in particular to the common case where the number of partitions equals the number of nodes.

We assume we start from a fairly balanced assignment (all processes have roughly equal load). Then the by-load assignment below is effectively a round-robin assignment in most situations:

  • If we start fresh, all processes have 0 load and we will do a complete round-robin assignment
  • If we scale down, all processes will have roughly the same N load and we will do roughly round-robin assignment
  • If we scale up, we will assign all the tasks that we didn't assign above to the new nodes, doing a round-robin assignment among the new nodes (see the sketch below).
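
As a rough illustration of the fresh-start case, here is a small sketch. The simplified Task record, the client names, and the explicit least-loaded loop are assumptions for illustration, not the actual assignor code; it only shows that the subtopology-first sort plus least-loaded placement produces the range-style layout.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FreshAssignmentSketch {

    // Hypothetical stand-in for the real TaskId.
    record Task(int subtopologyId, int partition) {
        @Override
        public String toString() {
            return subtopologyId + "_" + partition;
        }
    }

    public static void main(String[] args) {
        final List<Task> tasks = new ArrayList<>(List.of(
            new Task(1, 1), new Task(0, 1), new Task(1, 0), new Task(0, 0)));

        // Sort by subtopology, then partition: 0_0, 0_1, 1_0, 1_1.
        tasks.sort(Comparator.comparing(Task::subtopologyId).thenComparing(Task::partition));

        // Fresh start: both clients have zero load, so "assign to the least-loaded
        // client" degenerates to round-robin over the sorted tasks.
        final Map<String, List<Task>> assignment = new LinkedHashMap<>();
        assignment.put("Client1", new ArrayList<>());
        assignment.put("Client2", new ArrayList<>());

        for (final Task task : tasks) {
            String leastLoaded = null;
            for (final Map.Entry<String, List<Task>> entry : assignment.entrySet()) {
                if (leastLoaded == null || entry.getValue().size() < assignment.get(leastLoaded).size()) {
                    leastLoaded = entry.getKey();
                }
            }
            assignment.get(leastLoaded).add(task);
        }

        // Expected: {Client1=[0_0, 1_0], Client2=[0_1, 1_1]} -- the range-style layout.
        System.out.println(assignment);
    }
}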

assertTrue(getAllActiveTaskIds(result, "member2").size() + getAllStandbyTaskIds(result, "member2").size() <= 3);

assertTrue(getAllActiveTaskIds(result, "member3").size() >= 1 && getAllActiveTaskIds(result, "member3").size() <= 2);
assertTrue(getAllActiveTaskIds(result, "member3").size() + getAllStandbyTaskIds(result, "member3").size() <= 3);
Member

Should we also assert the distribution of task ownership, in addition to the owned count?

Member Author

what do you mean by distribution of task ownership?

Member

We're confirming the size, i.e. the number of tasks, rather than which sub-topology the tasks come from, but the test below confirms that already.


assertThat(node3.activeTasks().size(), greaterThanOrEqualTo(1));
assertThat(node3.activeTasks().size(), lessThanOrEqualTo(2));
assertThat(node3.activeTasks().size() + node3.standbyTasks().size(), lessThanOrEqualTo(3));
Member

same question about membership vs. task count - but I'm not sure if that applies in this case

Member Author

I'm not sure I understand the question

Member

@bbejeck bbejeck Sep 8, 2025

same as my comment above - this is covered by another test

List.of(APP_ID, "", "", "", "ACTIVE:", "0:[0,1];"),
List.of(APP_ID, "", "", "", "ACTIVE:", "1:[0,1];"));
List.of(APP_ID, "", "", "", "ACTIVE:", "0:[1];", "1:[1];"),
List.of(APP_ID, "", "", "", "ACTIVE:", "0:[0];", "1:[0];"));
Member

this is confirming the subtopology_partition task ids right?

Member Author

yes

Member

@bbejeck bbejeck left a comment

Thanks for the PR @lucasbru - LGTM

@lucasbru lucasbru requested a review from Copilot September 8, 2025 14:57
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a range-style assignment preference in the Kafka Streams task assignors to maintain consistency between the legacy and new assignor implementations. The change ensures that tasks are assigned in a range-based pattern (e.g., C1: 0_0, 1_0; C2: 0_1, 1_1) rather than grouping all partitions of one subtopology on the same client (e.g., C1: 0_0, 0_1; C2: 1_0, 1_1).

  • Updates the new StickyTaskAssignor to prefer range-style assignments through explicit sorting
  • Adds comprehensive test coverage for range-style assignment behavior in both legacy and new assignors
  • Updates existing test expectations to reflect the new assignment pattern

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File / Description:

  • StickyTaskAssignor.java: Implements range-style assignment sorting for both active and standby tasks
  • StickyTaskAssignorTest.java: Adds new tests to verify range-style assignment behavior
  • LegacyStickyTaskAssignorTest.java: Adds tests to verify existing range-style assignment behavior
  • DescribeStreamsGroupTest.java: Updates test expectations to match the new assignment pattern


@lucasbru lucasbru merged commit 620a01b into apache:trunk Sep 9, 2025
46 of 48 checks passed
@lucasbru lucasbru added the KIP-1071 (PRs related to KIP-1071) label Oct 23, 2025