KAFKA-18654[2/2]: Transaction V2 retry add partitions on the server side when handling produce request. #18810

Open
wants to merge 8 commits into trunk
Conversation

CalvinConfluent
Contributor

During the transaction commit phase, it is normal to hit a CONCURRENT_TRANSACTIONS error before the transaction markers are fully propagated. Instead of letting the client retry the produce request, it is better to retry on the server side.
https://issues.apache.org/jira/browse/KAFKA-18654
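As a rough illustration of the approach (a minimal sketch, not the PR's actual code: the helper names are hypothetical, and the real broker would schedule the retry rather than block a request-handler thread with sleep), the broker keeps retrying the add-partitions call while the coordinator reports CONCURRENT_TRANSACTIONS, backing off between attempts, and only returns the error once a configured deadline is exhausted:

import java.util.concurrent.TimeUnit;

public final class ServerSideAddPartitionsRetrySketch {

    // Result codes for the illustrative add-partitions call.
    enum TxnError { NONE, CONCURRENT_TRANSACTIONS, OTHER }

    // One add-partitions-to-txn round trip (hypothetical interface).
    interface AddPartitionsCall {
        TxnError attempt();
    }

    // Retries while the coordinator reports CONCURRENT_TRANSACTIONS (markers from
    // the previous transaction are still being written), up to maxTimeoutMs,
    // sleeping backoffMs between attempts. With the defaults discussed below
    // (100 ms timeout, 20 ms backoff) this allows a handful of attempts before
    // the error is finally returned to the client.
    static TxnError addPartitionsWithRetry(AddPartitionsCall call,
                                           long maxTimeoutMs,
                                           long backoffMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxTimeoutMs;
        TxnError error = call.attempt();
        while (error == TxnError.CONCURRENT_TRANSACTIONS
                && System.currentTimeMillis() + backoffMs < deadline) {
            TimeUnit.MILLISECONDS.sleep(backoffMs);
            error = call.attempt();
        }
        return error;
    }
}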

@github-actions github-actions bot added triage PRs from the community core Kafka Broker transactions Transactions and EOS labels Feb 5, 2025
requestLocal
)

val retryTimeoutMs = config.addPartitionsToTxnConfig.addPartitionsToTxnMaxTimeoutMs()
Member

Do we want this value to be separate from the request timeout? And if so should it be strictly smaller than that value?

Contributor Author

I think using the request timeout is good enough.

public final class AddPartitionsToTxnConfig {
// The default config values for the server-side add partition to transaction operations.
public static final String ADD_PARTITIONS_TO_TXN_MAX_TIMEOUT_MS_CONFIG = "add.partitions.to.txn.max.timeout.ms";
public static final int ADD_PARTITIONS_TO_TXN_MAX_TIMEOUT_MS_DEFAULT = 100;
Member

Why did we choose this value?

I'm also wondering, if someone wants to turn this feature off, what should the value be? 0 I suppose?

Contributor Author (@CalvinConfluent, Feb 5, 2025)

I'm also wondering, if someone wants to turn this feature off, what should the value be? 0 I suppose?

I don't think we would disable only this feature; it is a critical part of TV2, so if it does not work, we may need to disable TV2 altogether.

Contributor

The timeout should be small enough not to exceed the client request timeout (which is controlled by the client, so we cannot make assumptions about it on the broker), yet large enough to cover the typical time to commit a transaction and to avoid adding latency to the overall call by forcing the client to retry (the default client backoff is 100 ms, so for the outlier case where a transaction couldn't complete within 100 ms, it seems fine for the client to do another 100 ms backoff).

Setting it to 0 will effectively turn off the retries on the broker.

Member

We currently don't allow 0, since this config uses atLeast(1), which is why I asked. Though 1 is probably effectively the same as 0.

Contributor Author

Thanks for the comments; updated to atLeast(0).
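For reference, a minimal sketch of how such a config could be declared with Kafka's ConfigDef using an atLeast(0) validator (the class name and doc text here are illustrative, not the PR's exact definition); a value of 0 would effectively disable the server-side retries:

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Range;
import org.apache.kafka.common.config.ConfigDef.Type;

public final class AddPartitionsToTxnConfigSketch {
    // Names and defaults mirror the snippet above; the doc string is illustrative.
    public static final String MAX_TIMEOUT_MS_CONFIG = "add.partitions.to.txn.max.timeout.ms";
    public static final int MAX_TIMEOUT_MS_DEFAULT = 100;

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
        .define(MAX_TIMEOUT_MS_CONFIG,
                Type.INT,
                MAX_TIMEOUT_MS_DEFAULT,
                Range.atLeast(0),          // 0 effectively disables server-side retries
                Importance.LOW,
                "Maximum time the broker spends retrying the add-partitions call on "
                    + "CONCURRENT_TRANSACTIONS before returning the error to the client. "
                    + "It will not be effective if it is larger than request.timeout.ms.");
}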

@jolshan jolshan removed the triage PRs from the community label Feb 6, 2025
"It will not be effective if it is larger than request.timeout.ms";
public static final String ADD_PARTITIONS_TO_TXN_RETRY_BACKOFF_MS_CONFIG = "add.partitions.to.txn.retry.backoff.ms";
public static final int ADD_PARTITIONS_TO_TXN_RETRY_BACKOFF_MS_DEFAULT = 20;
public static final String ADD_PARTITIONS_TO_TXN_RETRY_BACKOFF_MS_DOC = "The retry backoff when the server attempts" +
Member

nit: should we just mention this is a server-side backoff?

Contributor Author

Added.


if (error != Errors.CONCURRENT_TRANSACTIONS) {
assertEquals(Errors.NOT_ENOUGH_REPLICAS, result.assertFired.error)
return
Member

This return is for the NOT_COORDINATOR case? I wonder if putting it in an if/else format would be a little easier to read than this early return.

val result = handleProduceAppend(replicaManager, tp0, transactionalRecords, origin = AppendOrigin.CLIENT,
transactionalId = transactionalId, transactionSupportedOperation = addPartition)
val appendCallback = ArgumentCaptor.forClass(classOf[AddPartitionsToTxnManager.AppendCallback])
verify(addPartitionsToTxnManager, times(1)).addOrVerifyTransaction(
Member

Does the times here account for the previous verify? In other words, should this be 2, or does the counter reset after the first verify is called?

Contributor Author

For the NOT_COORDINATOR case, addOrVerifyTransaction is only called once.
For the CONCURRENT_TRANSACTIONS case, the first verify is consumed with the error; addOrVerifyTransaction is then called a second time, which matches the second verify.

Member

Right, I'm just wondering whether the value in the append-callback verify should be 2, or whether calling the first verify resets the counter.
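For context on Mockito's counting semantics (a standalone sketch, not the PR's test): times(n) checks the cumulative number of invocations matching the given arguments since the mock was created or last reset; an earlier verify does not consume or reset them, so a second identical verify after a second call needs times(2), while verifies with different argument matchers can each pass with times(1).

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;

import java.util.List;

public final class MockitoTimesSketch {
    public static void main(String[] args) {
        @SuppressWarnings("unchecked")
        List<String> calls = mock(List.class);

        calls.add("first");
        verify(calls, times(1)).add("first");   // passes: one matching invocation so far

        calls.add("first");
        // The earlier verify did not reset anything; there are now two matching calls.
        verify(calls, times(2)).add("first");   // passes
        // verify(calls, times(1)).add("first"); // would now fail: TooManyActualInvocations
    }
}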
