Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINOR: Speed up JoinGroupRequestTest, ListGroupsRequestTest, SyncGroupRequestTest by decreased rebalance deplay from 3s to 1s #16810

Merged
merged 3 commits into from
Aug 22, 2024

Conversation

m1a2st
Copy link
Contributor

@m1a2st m1a2st commented Aug 6, 2024

Due to this test is too slow, it can't test in 30 seconds. Thus I add a config to speed up this test, let it can completed in 30s

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ExtendWith(value = Array(classOf[ClusterTestExtensions]))
class JoinGroupRequestTest(cluster: ClusterInstance) extends GroupCoordinatorBaseRequestTest(cluster) {
@ClusterTest(types = Array(Type.KRAFT), serverProperties = Array(
new ClusterConfigProperty(key = "group.coordinator.new.enable", value = "true"),
new ClusterConfigProperty(key = "offsets.topic.num.partitions", value = "1"),
new ClusterConfigProperty(key = "offsets.topic.replication.factor", value = "1")
new ClusterConfigProperty(key = "offsets.topic.replication.factor", value = "1"),
new ClusterConfigProperty(key = "group.initial.rebalance.delay.ms", value = "0"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting finding, makes sense to me for the test. I expect it would it be helpful to add it also in the other test in this file, right?

Copy link
Contributor Author

@m1a2st m1a2st Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, but we test these tests without add group.initial.rebalance.delay.ms and find a problem that Kraft mode is obviously slower than Zookeeper mode, thus we guess that root cause is the metadata synchronization speed.
螢幕快照 2024-08-06 21-14-26 (1)

@lianetm lianetm added the tests Test fixes (including flaky tests) label Aug 7, 2024
@dajac
Copy link
Member

dajac commented Aug 8, 2024

@m1a2st Thanks for the patch. I understand the motivation but I wonder if it is a good choice. If we change the configuration, it means that the integration test won’t cover the initial rebalance delay which is the default behavior. I think that we should rather keep it as it is.

@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 8, 2024

Thanks for your comments, however I watch the kafka document

A new config, group.initial.rebalance.delay.ms was introduced. This config specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance. The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms. The default value for this is 3 seconds. During development and testing it might be desirable to set this to 0 in order to not delay test execution time.

According to the document, thus I consider that these tests are suitable for this config. WDYT?

@chia7712
Copy link
Member

chia7712 commented Aug 8, 2024

The test sends many JoinGroupRequest and each request needs to wait delay ("3" seconds). Also, the zk mode runs so fast due to initial_delay=0 (see following link)

if (!props.containsKey(GroupCoordinatorConfig.GROUP_INITIAL_REBALANCE_DELAY_MS_CONFIG))

Maybe we can reduce the delay from 3 seconds to 1.5 seconds? That not only keep the initial rebalance delay but also speed up the test.

@m1a2st m1a2st force-pushed the MINOR-JoinGroupRequestTEst branch 2 times, most recently from 1230c8f to 39e2dbb Compare August 10, 2024 06:32
@chia7712
Copy link
Member

@m1a2st Could you please share the "before" and "after"?

@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 11, 2024

  • KafkaServerKRaftRegistrationTest

    • Before:
      CleanShot 2024-08-11 at 10 10 58@2x
    • After:
      CleanShot 2024-08-11 at 11 05 23@2x
  • ConsumerProtocolMigrationTest

    • Before:
      CleanShot 2024-08-11 at 10 47 07@2x
    • After:
      CleanShot 2024-08-11 at 11 08 06@2x
  • JoinGroupRequestTest

    • Before:
      CleanShot 2024-08-11 at 10 50 12@2x
    • After:
      CleanShot 2024-08-11 at 11 08 54@2x
  • ListGroupsRequestTest

    • Before:
      CleanShot 2024-08-11 at 10 54 41@2x
    • After:
      CleanShot 2024-08-11 at 11 09 39@2x
  • OffsetFetchRequestTest

    • Before:
      CleanShot 2024-08-11 at 10 56 28@2x
    • After:
      CleanShot 2024-08-11 at 11 11 44@2x
  • SyncGroupRequestTest

    • Before:
      CleanShot 2024-08-11 at 10 58 48@2x
    • After:
      CleanShot 2024-08-11 at 11 12 24@2x

@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 11, 2024

According to the above test results, I consider we should modify JoinGroupRequestTest, ListGroupsRequestTest and SyncGroupRequestTest by add group.initial.rebalance.delay.ms config, and others should reduce the timeout from 120s to 30s

@m1a2st m1a2st force-pushed the MINOR-JoinGroupRequestTEst branch from 39e2dbb to b25234f Compare August 11, 2024 03:20
@chia7712
Copy link
Member

According to the above test results, I consider we should modify JoinGroupRequestTest, ListGroupsRequestTest and SyncGroupRequestTest by add group.initial.rebalance.delay.ms config

that makes sense to me

@chia7712
Copy link
Member

@m1a2st could you please revise the title? thanks!

@m1a2st m1a2st changed the title MINOR: Speed up JoinGroupRequestTest#testJoinGroupWithOldConsumerGroupProtocolAndNewGroupCoordinator MINOR: Speed up JoinGroupRequestTest, ListGroupsRequestTest, SyncGroupRequestTest by decreased rebalance deplay from 3s to 1s Aug 11, 2024
@m1a2st m1a2st force-pushed the MINOR-JoinGroupRequestTEst branch from b25234f to 9aedd6a Compare August 11, 2024 11:56
Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m1a2st thanks for updated PR

@@ -41,7 +41,7 @@ import scala.jdk.CollectionConverters._
* failure paths is to use timeouts. See {@link unit.kafka.server.BrokerRegistrationRequestTest} for integration test
* of just the broker registration path.
*/
@Timeout(120)
@Timeout(30)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert this unrelated change

@@ -28,7 +28,7 @@ import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Timeout
import org.junit.jupiter.api.extension.ExtendWith

@Timeout(120)
@Timeout(30)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert this unrelated change

@@ -29,7 +29,7 @@ import org.junit.jupiter.api.extension.ExtendWith

import scala.jdk.CollectionConverters._

@Timeout(120)
@Timeout(30)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert this unrelated change

@m1a2st m1a2st force-pushed the MINOR-JoinGroupRequestTEst branch from 9aedd6a to f36ac96 Compare August 12, 2024 00:22
@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 12, 2024

@chia7712, Thanks for your comments, revert it.

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m1a2st thanks for this patch. please attach the results again as the updated config is incorrect

@@ -51,7 +52,8 @@ class JoinGroupRequestTest(cluster: ClusterInstance) extends GroupCoordinatorBas
@ClusterTest(serverProperties = Array(
new ClusterConfigProperty(key = "group.coordinator.new.enable", value = "false"),
new ClusterConfigProperty(key = "offsets.topic.num.partitions", value = "1"),
new ClusterConfigProperty(key = "offsets.topic.replication.factor", value = "1")
new ClusterConfigProperty(key = "offsets.topic.replication.factor", value = "1"),
new ClusterConfigProperty(key = "group.initial.rebalance.delay.ms", value = "1"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be "1000" rather than "1"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the unit is ms, I fix it.

@chia7712
Copy link
Member

@m1a2st could you please share the results again?

@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 19, 2024

The test results:

  • JoinGroupRequestTest
    CleanShot 2024-08-19 at 09 03 25@2x
  • ListGroupsRequestTest
    CleanShot 2024-08-19 at 09 05 31@2x
  • SyncGroupRequestTest
    CleanShot 2024-08-19 at 09 08 34@2x

@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 20, 2024

  • JoinGroupRequestTest

    test name type time(Before) time(After)
    JoinGroupRequestTest 3m 21s 1m 43s
    testJoinGroupWithOldConsumerGroupProtocolAndNewGroupCoordinator [1] Type=Raft-Isolated, MetadataVersion=4.0-IV1, Security=PLAINTEXT 33s 328ms 13s 557ms
    testJoinGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator [1] Type=ZK, MetadataVersion=4.0-IV1, Security=PLAINTEXT 2s 597ms 13s 49ms
    testJoinGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator [2] Type=Raft-Isolated, MetadataVersion=4.0-IV1, Security=PLAINTEXT 32s 596ms 12s 324ms
    testJoinGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator [3] Type=Raft-Combined, MetadataVersion=4.0-IV1, Security=PLAINTEXT 31s 986ms 12s 329ms
  • ListGroupsRequestTest

    test name type time(Before) time(After)
    ListGroupsRequestTest 7m 31s 3m 18s
    testListGroupsWithNewConsumerGroupProtocolAndNewGroupCoordinator [1] Type=Raft-Isolated, MetadataVersion=4.0-IVO, Security=PLAINTEXT 57s 215ms 21s 428ms
    testListGroupsWithOldConsumerGroupProtocolAndNewGroupCoordinator [1] Type=Raft-Isolated, MetadataVersion=4.0-IVO, Security=PLAINTEXT 55s 929ms 19s 735ms
    testListGroupsWithOldConsumerGroupProtocolAndOldGroupCoordinator [1] Type=ZK, MetadataVersion=4.0-IVO, Security=PLAINTEXT 1s 731ms 20s 32ms
    testListGroupsWithOldConsumerGroupProtocolAndOldGroupCoordinator [2] Type=Raft-Isolated, MetadataVersion=4.0-IVO, Security=PLAINTEXT 55s 362ms 19s 34ms
    testListGroupsWithOldConsumerGroupProtocolAndOldGroupCoordinator [3] Type=Raft-Combined, MetadataVersion=4.0-IVO, Security=PLAINTEXT 55s 3ms 18s 945ms
  • SyncGroupRequestTest

    test name type time(Before) time(After)
    SyncGroupRequestTest 2m 3s 1m 4s
    testSyncGroupWithOldConsumerGroupProtocolAndNewGroupCoordinator [1] Type=Raft-Isolated, MetadataVersion=4.0-IVO, Security=PLAINTEXT 20s 411ms 8s 587ms
    testSyncGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator [1] Type=ZK, MetadataVersion=4.0-IVO, Security=PLAINTEXT 1s 849ms 8s 248ms
    testSyncGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator [2] Type=Raft-Isolated, MetadataVersion=4.0-IVO, Security=PLAINTEXT 19s 753ms 7s 728ms
    testSyncGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator [3] Type=Raft-Combined, MetadataVersion=4.0-IVO, Security=PLAINTEXT 19s 555ms 7s 374ms

@chia7712
Copy link
Member

@m1a2st there are some tests getting slower. testJoinGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator, testListGroupsWithOldConsumerGroupProtocolAndOldGroupCoordinator, and testSyncGroupWithOldConsumerGroupProtocolAndOldGroupCoordinator. Do you know the root cause?

@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 20, 2024

Becasue It is Zookeeper mode test, the config group.initial.rebalance.delay.ms in zookeeper mode default is 0 in

if (!props.containsKey(GroupCoordinatorConfig.GROUP_INITIAL_REBALANCE_DELAY_MS_CONFIG))
thus In these cases that we transfer to 1000 is slower than before. In KRaft mode it doesn't init the config value, thus the defulat value is 3000.

@chia7712
Copy link
Member

thus In these cases that we transfer to 1000 is slower than before. In KRaft mode it doesn't init the config value, thus the defulat value is 3000.

makes sense. LGTM. will merge it later

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chia7712
Copy link
Member

@m1a2st could you please rebase code to trigger QA again? there are many failed tests. I feel they are unrelated, but double-check is not bad thing :)

@m1a2st m1a2st force-pushed the MINOR-JoinGroupRequestTEst branch from b77077a to d3bb650 Compare August 20, 2024 02:11
@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 20, 2024

I have rebased it :)

@chia7712
Copy link
Member

@m1a2st please fix conflicts

@m1a2st m1a2st force-pushed the MINOR-JoinGroupRequestTEst branch from d3bb650 to 3f2d0e7 Compare August 22, 2024 05:28
@m1a2st
Copy link
Contributor Author

m1a2st commented Aug 22, 2024

Thanks for your comments. I resolve it.

@chia7712 chia7712 merged commit 61a661e into apache:trunk Aug 22, 2024
5 of 6 checks passed
@chia7712
Copy link
Member

With merging this PR we can keep testing the global timeout for all IT (#16957)

LoganZhuZzz pushed a commit to LoganZhuZzz/kafka that referenced this pull request Aug 28, 2024
…pRequestTest by decreased rebalance deplay from 3s to 1s (apache#16810)

Reviewers: Chia-Ping Tsai <[email protected]>
bboyleonp666 pushed a commit to bboyleonp666/kafka that referenced this pull request Sep 4, 2024
…pRequestTest by decreased rebalance deplay from 3s to 1s (apache#16810)

Reviewers: Chia-Ping Tsai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Test fixes (including flaky tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants