KAFKA-18723; Better handle invalid records during replication #18852

Open · wants to merge 14 commits into base: trunk

Conversation

@jsancio (Member) commented Feb 10, 2025

For the KRaft implementation there is a race between the network thread, which reads bytes from the log segments, and the KRaft driver thread, which truncates the log and appends records to it. This race can cause the network thread to send corrupted records or inconsistent records. The corrupted records case is handled by catching and logging the CorruptRecordException. The inconsistent records case is handled by only appending record batches whose partition leader epoch is less than or equal to the fetching replica's epoch, and only if the epoch didn't change between the request and the response.

For the ISR implementation there is also a race between the network thread and the replica fetcher thread, which truncates the log and appends records to the log. This race can cause the network thread to send corrupted records or inconsistent records. The replica fetcher thread already handles the corrupted records case. The inconsistent records case is handled by only appending record batches whose partition leader epoch is less than or equal to the leader epoch in the FETCH request.
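Both fixes reduce to the same guard on the append path: skip any record batch whose partition leader epoch is greater than the leader epoch the FETCH was issued under. A minimal Scala sketch of that predicate (hedged; the PR's actual helper, discussed below under the name hasHigherPartitionLeaderEpoch, additionally checks that the append origin is replication):

```scala
import org.apache.kafka.common.record.RecordBatch

// Sketch: a batch written under a newer leader epoch than the epoch used for
// the FETCH request must come from a stale response, so it is skipped rather
// than appended.
def hasHigherPartitionLeaderEpoch(batch: RecordBatch, leaderEpoch: Int): Boolean =
  batch.partitionLeaderEpoch() != RecordBatch.NO_PARTITION_LEADER_EPOCH &&
    batch.partitionLeaderEpoch() > leaderEpoch
```

Once the predicate fires for one batch, the remaining batches in the response are skipped as well, since they were read from the log segments after the truncation that the stale response raced with.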

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@github-actions github-actions bot added core Kafka Broker kraft build Gradle build or GitHub Actions clients labels Feb 10, 2025
@jsancio jsancio changed the title KAFKA-18723; Better handling invalid records during replication KAFKA-18723; Better handle invalid records during replication Feb 10, 2025
@junrao (Contributor) left a comment

@jsancio : Thanks for the PR. Made a pass of non-testing files. Left a few comments.

s"which exceeds the maximum configured value of ${config.maxMessageSize}.")
}
/* During replication of uncommitted data it is possible for the remote replica to send record batches after it lost
* leadership. This can happend if sending FETCH responses is slowed because there is a race between sending the FETCH
Contributor:

typo happend

Contributor:

The part about sending FETCH responses being slow can be read in an inaccurate way - the current wording seems to suggest the response is slow because of the race condition. What about instead:
This can happen if sending FETCH responses is slow. There is a race...

Member Author:

Fixed both suggestions.

@@ -333,7 +336,9 @@ abstract class AbstractFetcherThread(name: String,
// In this case, we only want to process the fetch response if the partition state is ready for fetch and
// the current offset is the same as the offset requested.
val fetchPartitionData = sessionPartitions.get(topicPartition)
if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
if (fetchPartitionData != null &&
Contributor:

It's possible that the fetch response is for an old leader epoch. It would be useful to further validate if the leader epoch in the fetch request matches the leader epoch in the current fetch state.

Member Author:

Yeah. I thought about this when I was implementing the PR. I think we have two options:

  1. Always append up to the currentLeaderEpoch, the FETCH request's currentLeaderEpoch if the request version supports it or the locally recorded currentLeaderEpoch if the FETCH request version doesn't support the currentLeaderEpoch field. This is what this PR implements.
  2. Only append records up to the currentLeaderEpoch if the local replica's currentLeaderEpoch still matches the leader epoch when the FETCH request was created and sent.

I think they are both correct. Option 1 accepts and handles a superset of the FETCH responses that option 2 can handle. I figured that if both are correct, it is better to progress faster and with fewer FETCH RPCs. What do you think?
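To make option 1 concrete, here is a hedged sketch of the epoch selection (hypothetical helper; the FETCH request's currentLeaderEpoch is modeled as optional because old request versions do not carry the field):

```scala
import java.util.Optional

// Option 1 sketched: bound the append by the epoch carried in the FETCH
// request when the version supports it, otherwise fall back to the epoch
// recorded locally for the replica.
def epochBoundForAppend(
  requestCurrentLeaderEpoch: Optional[Integer], // absent on old FETCH versions
  locallyRecordedEpoch: Int
): Int =
  if (requestCurrentLeaderEpoch.isPresent) requestCurrentLeaderEpoch.get
  else locallyRecordedEpoch
```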

def appendRecordsToFollowerOrFutureReplica(
records: MemoryRecords,
isFuture: Boolean,
maxEpoch: Int
Contributor:

maxEpoch => leaderEpochForReplica?

Member Author:

Fixed.

@@ -1159,6 +1177,25 @@ class UnifiedLog(@volatile var logStartOffset: Long,
validBytesCount, lastOffsetOfFirstBatch, Collections.emptyList[RecordError], LeaderHwChange.NONE)
}

/**
* Return true if the record batch should not be appending to the log.
Contributor:

Return true if the record batch should not be appending to the log => Return true if the record batch has a higher leader epoch than the replica?

Member Author:

Fixed.

@@ -1786,8 +1788,16 @@ private boolean handleFetchResponse(
}
} else {
Records records = FetchResponse.recordsOrFail(partitionResponse);
if (records.sizeInBytes() > 0) {
try {
// TODO: make sure to test corrupted records in kafka metadata log
Contributor:

Should this be removed?

appendAsFollower(records);
} catch (CorruptRecordException | InvalidRecordException e) {
// TODO: this should log up to 265 bytes from the records
Contributor:

Is this done yet?

Comment on lines 1186 to 1187
* @return true if the append reason is replication and the partition leader epoch is greater
* than the leader epoch, otherwise false
Contributor:

distinction between partition leader epoch and leader epoch not very clear

Member Author:

Fair. I improved the wording.

*/
skipRemainingBatches = skipRemainingBatches || hasInvalidPartitionLeaderEpoch(batch, origin, leaderEpoch);
if (skipRemainingBatches) {
info(s"Skipping batch $batch because origin is $origin and leader epoch is $leaderEpoch")
Contributor:

perhaps the log message should also include batch.partitionLeaderEpoch() (e.g. so an operator can compare the epochs)

Member Author:

Yes. $batch should include the partition leader epoch. This is why I updated the toString implementation for DefaultRecordBatch to include the leader epoch of the batch.

https://github.com/apache/kafka/pull/18852/files#diff-c11736eb30dd10f1b56fb894c3efaf2bc724a9306004a71e8b3bd46d46f26ee5R505

Contributor:

ah missed that you changed the toString impl to include epoch, thanks!

}

@ParameterizedTest
@ArgumentsSource(classOf[InvalidMemoryRecordsProvider])
Contributor:

maybe I'm missing something - should we also have a test here where we call log.appendAsFollower(records, epoch) where epoch is less than one of the epochs in the records batch? it could be a malformed batch and we could check that the test does not throw CorruptRecordException

Member Author:

Yes. You are correct, we need to test the case when the leader epoch is invalid. I added tests for that case to MockLogTest, KafkaMetadataLogTest and UnifiedLogTest.
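A hedged sketch of what such a test can look like (modeled on the UnifiedLogTest helpers quoted elsewhere in this PR; createLog, logDir, and logConfig are assumed fixtures):

```scala
import org.apache.kafka.common.compress.Compression
import org.apache.kafka.common.record.{MemoryRecords, SimpleRecord}
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

@Test
def testAppendAsFollowerSkipsHigherEpochBatches(): Unit = {
  val log = createLog(logDir, logConfig) // assumed test fixtures
  val followerEpoch = 1

  // The batch claims a partition leader epoch newer than the follower's.
  val records = MemoryRecords.withRecords(
    0L,
    Compression.NONE,
    followerEpoch + 1, // the batch's partition leader epoch
    new SimpleRecord("record from a stale response".getBytes)
  )

  log.appendAsFollower(records, followerEpoch)

  // The batch is skipped: nothing is appended and nothing throws.
  assertEquals(0L, log.logEndOffset)
}
```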

return buffer;
}

private static ByteBuffer largeMagic() {
Contributor:

nit: should this just be incorrectMagic

Member Author:

A large magic is one kind of incorrect magic; a negative number is another. That is why I added negativeMagic and largeMagic.
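For context, a hedged sketch of how the two cases can be constructed (illustrative; the real provider builds complete batch buffers, and the batch header layout places the magic byte at offset 16, after the base offset, batch length, and partition leader epoch):

```scala
import java.nio.ByteBuffer
import org.apache.kafka.common.record.DefaultRecordBatch

// Build a minimal batch header whose magic byte is invalid.
def bufferWithMagic(magic: Byte): ByteBuffer = {
  val buffer = ByteBuffer.allocate(DefaultRecordBatch.RECORD_BATCH_OVERHEAD)
  buffer.putLong(57)                    // non-zero base offset, unlikely to match the LEO
  buffer.putInt(buffer.capacity() - 12) // batch length
  buffer.putInt(-1)                     // partition leader epoch
  buffer.put(magic)                     // the invalid magic under test
  buffer.position(0)
  buffer
}

val negativeMagic = bufferWithMagic((-1).toByte) // below the valid range
val largeMagic = bufferWithMagic(Byte.MaxValue)  // above the valid range
```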

Comment on lines 1189 to 1190
* @return true if the append reason is replication and the batch's partition leader epoch is
* greater than the leader epoch, otherwise false
Contributor:

nit: maybe this could be further adjusted to
and the batch's partition leader epoch is greater than specified leaderEpoch, otherwise false

Member Author:

Done.

@junrao (Contributor) left a comment

@jsancio : Thanks for the updated PR. A few more comments.

/* During replication of uncommitted data it is possible for the remote replica to send record batches after it lost
* leadership. This can happen if sending FETCH responses is slow. There is a race between sending the FETCH
* response and the replica truncating and appending to the log. The replicating replica resolves this issue by only
* persisting up to the partition leader epoch of the leader when the FETCH request was handled. See KAFKA-18723 for
Contributor:

persisting up to the partition leader epoch of the leader when the FETCH request was handled => persisting up to the current leader epoch used in the fetch request

Member Author:

Done.

* @return true if the append reason is replication and the batch's partition leader epoch is
* greater than the leader epoch, otherwise false
*/
private def hasInvalidPartitionLeaderEpoch(
Contributor:

hasInvalidPartitionLeaderEpoch => hasHigherPartitionLeaderEpoch ?

Member Author:

Done.


Optional<LogAppendInfo> appendInfo = Optional.empty();
try {
appendInfo = Optional.of(log.appendAsFollower(records, quorum.epoch()));
Contributor:

quorum.epoch() could change between when the fetch request is issued and when the fetch response is received, right? If so, we need to use the epoch used when creating the fetch request.

@jsancio (Member Author), Feb 20, 2025:

Yes, the epoch can change between the request and the response. If that happens, the KRaft replica transitions state. All state transitions in KRaft reset the request manager (resetConnections). By resetting the request manager, any RPC response, including FETCH, that doesn't match the set of pending requests will be ignored.

The other case is that the epoch changed because of the FETCH response being handled. This is handled here. When the epoch in the response is greater than the epoch of the replica, the replica transitions and skips the rest of the FETCH response handling, including appending the contained records.
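In other words, the append path can safely use quorum.epoch() because any epoch change invalidates the response before the append runs. A hedged sketch of the invariant (illustrative names, not the KafkaRaftClient API):

```scala
// Either the epoch change reset the request manager, so the response no longer
// matches a pending request, or the response itself carries a higher epoch,
// which forces a state transition before any records are appended.
def shouldAppend(
  matchesPendingRequest: Boolean,
  responseEpoch: Int,
  quorumEpoch: Int
): Boolean =
  matchesPendingRequest && responseEpoch <= quorumEpoch
```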

kafkaRaftMetrics.updateLogEnd(endOffset);
logger.trace("Follower end offset updated to {} after append", endOffset);

appendInfo.ifPresent(
info -> kafkaRaftMetrics.updateFetchedRecords(info.lastOffset - info.firstOffset + 1)
Contributor:

Could we move this inside the try/catch where appendInfo is created? This avoids the need to make appendInfo an Optional.

Member Author:

Done.
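For reference, the shape of the suggested refactor in a hedged Scala sketch (the real change is in Java inside KafkaRaftClient; all names here are illustrative):

```scala
final case class LogAppendInfo(firstOffset: Long, lastOffset: Long)

// With the metrics update moved inside the try block, the append result can
// stay a plain value instead of an Optional that is consumed later.
def handleRecords(
  appendAsFollower: () => LogAppendInfo,
  updateFetchedRecords: Long => Unit,
  logError: Throwable => Unit
): Unit = {
  try {
    val info = appendAsFollower()
    // The success path owns the metrics update.
    updateFetchedRecords(info.lastOffset - info.firstOffset + 1)
  } catch {
    case e: RuntimeException => logError(e) // e.g. CorruptRecordException
  }
}
```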

import java.util.stream.Stream;

public final class InvalidMemoryRecordsProvider implements ArgumentsProvider {
// Use a baseOffset that not zero so that is less likely to match the LEO
Contributor:

that not zero => that's not zero
so that is less likely => so that it is less likely

Member Author:

Fixed.

@@ -115,6 +125,11 @@ class MockFetcherThread(val mockLeader: MockLeaderEndPoint,
batches.headOption.map(_.lastOffset).getOrElse(-1)))
}

private def hasInvalidPartitionLeaderEpoch(batch: RecordBatch, leaderEpoch: Int): Boolean = {
Contributor:

hasInvalidPartitionLeaderEpoch => hasHigherPartitionLeaderEpoch?

Member Author:

Fixed.

Contributor:

Has this been fixed?

Member Author:

Fixed. I missed that this was for a different file and implementation.

.setControllerEpoch(0)
.setLeader(2)
.setLeaderEpoch(1)
.setLeaderEpoch(prevLeaderEpoch + 1)
Contributor:

If we change leaderEpoch, the partition epoch should also change, right?

Member Author:

Yes, good catch. Fixed it.

new SimpleRecord("first message".getBytes)
),
isFuture = false,
partitionLeaderEpoch = Int.MaxValue
Contributor:

Should we just use 0 as the leader epoch?

Member Author:

Sure. Fixed it. I wanted to make it explicit that this test was "ignoring" the skip higher leader epoch logic.

0L,
Compression.NONE,
pid,
epoch,
Contributor:

epoch => producerEpoch

Member Author:

Done.

val log = createLog(logDir, logConfig)
val previousEndOffset = log.logEndOffsetMetadata.messageOffset

// Depedning on the random corruption, unified log sometimes throws and sometimes returns an
Contributor:

typo Depedning

Member Author:

Fixed.

@junrao (Contributor) left a comment

@jsancio : Thanks for the updated PR. Just one comment.

@@ -115,6 +125,11 @@ class MockFetcherThread(val mockLeader: MockLeaderEndPoint,
batches.headOption.map(_.lastOffset).getOrElse(-1)))
}

private def hasInvalidPartitionLeaderEpoch(batch: RecordBatch, leaderEpoch: Int): Boolean = {
Contributor:

Has this been fixed?

@junrao (Contributor) commented Feb 20, 2025

@jsancio : The following failed tests seem related to this PR?

FAILED ❌ AbstractFetcherThreadTest > testRetryAfterUnknownLeaderEpochInLatestOffsetFetch()
FAILED ❌ AbstractFetcherThreadTest > testFollowerFetchOutOfRangeLow()
FAILED ❌ AbstractFetcherThreadWithIbp26Test > testRetryAfterUnknownLeaderEpochInLatestOffsetFetch()
FAILED ❌ AbstractFetcherThreadWithIbp26Test > testFollowerFetchOutOfRangeLow()

@jsancio (Member Author) commented Feb 21, 2025

> @jsancio : The following failed tests seem related to this PR?
>
> FAILED ❌ AbstractFetcherThreadTest > testRetryAfterUnknownLeaderEpochInLatestOffsetFetch()
> FAILED ❌ AbstractFetcherThreadTest > testFollowerFetchOutOfRangeLow()
> FAILED ❌ AbstractFetcherThreadWithIbp26Test > testRetryAfterUnknownLeaderEpochInLatestOffsetFetch()
> FAILED ❌ AbstractFetcherThreadWithIbp26Test > testFollowerFetchOutOfRangeLow()

@junrao Fixed. Also added tests for testing that batches with a "higher partition leader epoch" are not replicated.

@junrao (Contributor) left a comment

@jsancio : Thanks for the updated PR. Just a minor comment.

}

@Test
void testReplicationOfInvalidPartitionLeaderEpoch() throws Exception {
Contributor:

testReplicationOfInvalidPartitionLeaderEpoch => testReplicationOfHigherPartitionLeaderEpoch ?

Member Author:

Fixed.

@junrao (Contributor) left a comment

@jsancio : Thanks for the updated PR. The code LGTM. Are the test failures related?

Labels: build (Gradle build or GitHub Actions), clients, core (Kafka Broker), kraft, performance

3 participants