KAFKA-18723; Better handle invalid records during replication #18852
base: trunk
Conversation
@jsancio : Thanks for the PR. Made a pass of non-testing files. Left a few comments.
s"which exceeds the maximum configured value of ${config.maxMessageSize}.") | ||
} | ||
/* During replication of uncommitted data it is possible for the remote replica to send record batches after it lost | ||
* leadership. This can happend if sending FETCH responses is slowed because there is a race between sending the FETCH |
typo happend
The part about sending FETCH responses being slow can be read in an inaccurate way - the current wording seems to suggest the response is slow because of the race condition. What about instead:
This can happen if sending FETCH responses is slow. There is a race...
Fixed both suggestions.
@@ -333,7 +336,9 @@ abstract class AbstractFetcherThread(name: String,
// In this case, we only want to process the fetch response if the partition state is ready for fetch and
// the current offset is the same as the offset requested.
val fetchPartitionData = sessionPartitions.get(topicPartition)
if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
if (fetchPartitionData != null && |
It's possible that the fetch response is for an old leader epoch. It would be useful to further validate if the leader epoch in the fetch request matches the leader epoch in the current fetch state.
Yeah. I thought about this when I was implementing the PR. I think we have two options:
- Always append up to the currentLeaderEpoch: the FETCH request's currentLeaderEpoch if the request version supports it, or the locally recorded currentLeaderEpoch if the FETCH request version doesn't support the currentLeaderEpoch field. This is what this PR implements.
- Only append records up to the currentLeaderEpoch if the local replica's currentLeaderEpoch still matches the leader epoch when the FETCH request was created and sent.
I think they are both correct. Option 1 accepts and handles a superset of the FETCH responses that option 2 can handle. I figured that if they are both correct, it is better to progress faster and with fewer FETCH RPCs. What do you think?
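For concreteness, a minimal sketch of how option 1 might pick the epoch bound (the helper name and the version cutoff are assumptions for illustration, not the PR's actual fetch-handling code):

```scala
// Hypothetical helper illustrating option 1. Assumes the FETCH request carries
// currentLeaderEpoch starting with version 9 and uses -1 when the field is unset.
def epochForFollowerAppend(
  fetchRequestVersion: Short,
  requestCurrentLeaderEpoch: Int,   // from the FETCH request, -1 if not supported/not set
  locallyRecordedLeaderEpoch: Int   // the epoch the replica recorded for the partition
): Int = {
  if (fetchRequestVersion >= 9 && requestCurrentLeaderEpoch >= 0)
    requestCurrentLeaderEpoch
  else
    locallyRecordedLeaderEpoch
}
```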
def appendRecordsToFollowerOrFutureReplica(
records: MemoryRecords,
isFuture: Boolean,
maxEpoch: Int |
maxEpoch => leaderEpochForReplica?
Fixed.
@@ -1159,6 +1177,25 @@ class UnifiedLog(@volatile var logStartOffset: Long,
validBytesCount, lastOffsetOfFirstBatch, Collections.emptyList[RecordError], LeaderHwChange.NONE)
}

/**
* Return true if the record batch should not be appending to the log. |
Return true if the record batch should not be appending to the log => Return true if the record batch has a higher leader epoch than the replica?
Fixed.
@@ -1786,8 +1788,16 @@ private boolean handleFetchResponse(
}
} else {
Records records = FetchResponse.recordsOrFail(partitionResponse);
if (records.sizeInBytes() > 0) {
try {
// TODO: make sure to test corrupted records in kafka metadata log |
Should this be removed?
appendAsFollower(records);
} catch (CorruptRecordException | InvalidRecordException e) {
// TODO: this should log up to 265 bytes from the records |
Is this done yet?
* @return true if the append reason is replication and the partition leader epoch is greater
* than the leader epoch, otherwise false |
The distinction between partition leader epoch and leader epoch is not very clear.
Fair. I improved the wording.
*/
skipRemainingBatches = skipRemainingBatches || hasInvalidPartitionLeaderEpoch(batch, origin, leaderEpoch);
if (skipRemainingBatches) {
info(s"Skipping batch $batch because origin is $origin and leader epoch is $leaderEpoch") |
perhaps the log message should also include batch.partitionLeaderEpoch() (e.g. so an operator can compare the epochs)
Yes. $batch should include the partition leader epoch. This is why I updated the toString implementation for DefaultRecordBatch to include the leader epoch of the batch.
ah missed that you changed the toString impl to include epoch, thanks!
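To make the skip logic shown in the diff above concrete, here is a condensed sketch (names follow this thread's discussion; the actual change in UnifiedLog.scala may differ in detail):

```scala
import scala.jdk.CollectionConverters._
import org.apache.kafka.common.record.{MemoryRecords, RecordBatch}

def hasHigherPartitionLeaderEpoch(batch: RecordBatch, leaderEpoch: Int): Boolean =
  batch.partitionLeaderEpoch() != RecordBatch.NO_PARTITION_LEADER_EPOCH &&
    batch.partitionLeaderEpoch() > leaderEpoch

// During replication, append only the batches up to the first one whose partition
// leader epoch is higher than the epoch the fetching replica knew about; that batch
// and everything after it is skipped.
def batchesToAppend(records: MemoryRecords, leaderEpoch: Int): Iterable[RecordBatch] =
  records.batches.asScala.takeWhile(batch => !hasHigherPartitionLeaderEpoch(batch, leaderEpoch))
```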
}

@ParameterizedTest
@ArgumentsSource(classOf[InvalidMemoryRecordsProvider]) |
Maybe I'm missing something - should we also have a test here where we call log.appendAsFollower(records, epoch) where epoch is less than one of the epochs in the records batch? It could be a malformed batch and we could check that the test does not throw CorruptRecordException.
Yes. You are correct, we need to test the case when the leader epoch is invalid. I added tests for that case to MockLogTest, KafkaMetadataLogTest and UnifiedLogTest.
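Roughly, such a test could take the following shape (helper names like createLog/logDir/logConfig are assumed from the existing test class, and the MemoryRecords overload is illustrative, not the PR's exact code):

```scala
import org.apache.kafka.common.compress.Compression
import org.apache.kafka.common.record.{MemoryRecords, SimpleRecord}
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

@Test
def testAppendAsFollowerSkipsBatchesWithHigherEpoch(): Unit = {
  val log = createLog(logDir, logConfig)  // assumed helper from the test class
  val followerEpoch = 5

  // The batch carries a partition leader epoch above the epoch passed to the append.
  val records = MemoryRecords.withRecords(
    0L,
    Compression.NONE,
    followerEpoch + 1,
    new SimpleRecord("higher epoch record".getBytes)
  )

  log.appendAsFollower(records, followerEpoch)

  // The batch is skipped rather than treated as corruption: no exception, nothing appended.
  assertEquals(0L, log.logEndOffset)
}
```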
return buffer;
}

private static ByteBuffer largeMagic() { |
nit: should this just be incorrectMagic
Large magic is one kind of incorrect magic; another incorrect magic is a negative number. That's why I added both negativeMagic and largeMagic.
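For reference, a minimal sketch of the two cases (the actual provider is Java; the offsets assume the v2 record batch header layout of baseOffset, length, partitionLeaderEpoch, magic, and the constants and lengths used here are illustrative):

```scala
import java.nio.ByteBuffer
import org.apache.kafka.common.record.DefaultRecordBatch

def recordsWithMagic(magic: Byte): ByteBuffer = {
  val buffer = ByteBuffer.allocate(DefaultRecordBatch.RECORD_BATCH_OVERHEAD)
  buffer.putLong(57L)                  // baseOffset: non-zero so it is less likely to match the LEO
  buffer.putInt(buffer.capacity - 12)  // batchLength (capacity minus the log overhead, illustrative)
  buffer.putInt(-1)                    // partitionLeaderEpoch
  buffer.put(magic)                    // magic: deliberately out of range
  buffer.position(0)
  buffer
}

val largeMagic = recordsWithMagic(Byte.MaxValue)    // larger than any supported magic
val negativeMagic = recordsWithMagic((-1).toByte)   // negative is also invalid
```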
* @return true if the append reason is replication and the batch's partition leader epoch is
* greater than the leader epoch, otherwise false |
nit: maybe this could be further adjusted to "and the batch's partition leader epoch is greater than specified leaderEpoch, otherwise false"
Done.
@jsancio : Thanks for the updated PR. A few more comments.
/* During replication of uncommitted data it is possible for the remote replica to send record batches after it lost
* leadership. This can happen if sending FETCH responses is slow. There is a race between sending the FETCH
* response and the replica truncating and appending to the log. The replicating replica resolves this issue by only
* persisting up to the partition leader epoch of the leader when the FETCH request was handled. See KAFKA-18723 for |
persisting up to the partition leader epoch of the leader when the FETCH request was handled => persisting up to the current leader epoch used in the fetch request
Done.
* @return true if the append reason is replication and the batch's partition leader epoch is
* greater than the leader epoch, otherwise false
*/
private def hasInvalidPartitionLeaderEpoch( |
hasInvalidPartitionLeaderEpoch => hasHigherPartitionLeaderEpoch ?
Done.
Optional<LogAppendInfo> appendInfo = Optional.empty();
try {
appendInfo = Optional.of(log.appendAsFollower(records, quorum.epoch())); |
quorum.epoch() could change between when the fetch request is issued and when the fetch response is received, right? If so, we need to use the epoch used when creating the fetch request.
Yes, the epoch can change between the request and the response. If that happens, the KRaft replica transitions states. All state transitions in KRaft reset the request manager (resetConnections). By resetting the request manager, any RPC response, including FETCH, that doesn't match the set of pending requests will be ignored.
The other case is that the epoch changed because of the FETCH response being handled. This is handled here: when the epoch in the response is greater than the epoch of the replica, the replica transitions and skips the rest of the FETCH response handling, including appending the contained records.
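Putting the two protections together, the order of checks looks roughly like this (purely illustrative names, not KafkaRaftClient's actual API):

```scala
// Hypothetical summary of the guards described above.
final case class FetchResponseContext(
  matchesPendingRequest: Boolean, // false once a state transition reset the request manager
  responseEpoch: Int
)

def shouldAppendFetchedRecords(ctx: FetchResponseContext, replicaEpoch: Int): Boolean = {
  if (!ctx.matchesPendingRequest) {
    // The epoch changed between request and response: the transition reset the
    // request manager, so the stale FETCH response is dropped.
    false
  } else if (ctx.responseEpoch > replicaEpoch) {
    // The response itself reports a higher epoch: the replica transitions and skips
    // the rest of the response handling, including appending the records.
    false
  } else {
    // Safe to append using the replica's current epoch as the upper bound.
    true
  }
}
```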
kafkaRaftMetrics.updateLogEnd(endOffset);
logger.trace("Follower end offset updated to {} after append", endOffset);

appendInfo.ifPresent(
info -> kafkaRaftMetrics.updateFetchedRecords(info.lastOffset - info.firstOffset + 1) |
Could we move this inside the try/catch where appendInfo is created? This avoids the need to make appendInfo an Optional.
Done.
import java.util.stream.Stream;

public final class InvalidMemoryRecordsProvider implements ArgumentsProvider {
// Use a baseOffset that not zero so that is less likely to match the LEO |
that not zero => that's not zero
so that is less likely => so that it is less likely
Fixed.
@@ -115,6 +125,11 @@ class MockFetcherThread(val mockLeader: MockLeaderEndPoint,
batches.headOption.map(_.lastOffset).getOrElse(-1)))
}

private def hasInvalidPartitionLeaderEpoch(batch: RecordBatch, leaderEpoch: Int): Boolean = { |
hasInvalidPartitionLeaderEpoch => hasHigherPartitionLeaderEpoch?
Fixed.
Has this been fixed?
Fixed. I missed that this was for a different file and implementation.
.setControllerEpoch(0)
.setLeader(2)
.setLeaderEpoch(1)
.setLeaderEpoch(prevLeaderEpoch + 1) |
If we change leaderEpoch, the partition epoch should also change, right?
Yes, good catch. Fixed it.
new SimpleRecord("first message".getBytes) | ||
), | ||
isFuture = false, | ||
partitionLeaderEpoch = Int.MaxValue |
Should we just use 0 as the leader epoch?
Sure. Fixed it. I wanted to make it explicit that this test was "ignoring" the skip higher leader epoch logic.
0L,
Compression.NONE,
pid,
epoch, |
epoch => producerEpoch
Done.
val log = createLog(logDir, logConfig)
val previousEndOffset = log.logEndOffsetMetadata.messageOffset

// Depedning on the random corruption, unified log sometimes throws and sometimes returns an |
typo Depedning
Fixed.
@jsancio : Thanks for the updated PR. Just one comment.
@@ -115,6 +125,11 @@ class MockFetcherThread(val mockLeader: MockLeaderEndPoint,
batches.headOption.map(_.lastOffset).getOrElse(-1)))
}

private def hasInvalidPartitionLeaderEpoch(batch: RecordBatch, leaderEpoch: Int): Boolean = { |
Has this been fixed?
@jsancio : The following failed tests seem related to this PR?
@junrao Fixed. Also added tests verifying that batches with a "higher partition leader epoch" are not replicated.
@jsancio : Thanks for the updated PR. Just a minor comment.
}

@Test
void testReplicationOfInvalidPartitionLeaderEpoch() throws Exception { |
testReplicationOfInvalidPartitionLeaderEpoch => testReplicationOfHigherPartitionLeaderEpoch ?
Fixed.
@jsancio : Thanks for the updated PR. The code LGTM. Are the test failures related?
For the KRaft implementation there is a race between the network thread, which reads bytes from the log segments, and the KRaft driver thread, which truncates the log and appends records to it. This race can cause the network thread to send corrupted records or inconsistent records. The corrupted records case is handled by catching and logging the CorruptRecordException. The inconsistent records case is handled by only appending record batches whose partition leader epoch is less than or equal to the fetching replica's epoch, and only when the epoch didn't change between the request and the response.
For the ISR implementation there is also a race between the network thread and the replica fetcher thread, which truncates the log and appends records to it. This race can cause the network thread to send corrupted records or inconsistent records. The replica fetcher thread already handles the corrupted records case. The inconsistent records case is handled by only appending record batches whose partition leader epoch is less than or equal to the leader epoch in the FETCH request.
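As a rough sketch of the corrupted-records handling described above (a hypothetical wrapper; the real catch sites live in the KRaft client and the replica fetcher, and the UnifiedLog import path is an assumption):

```scala
import kafka.log.UnifiedLog // assumed import path
import org.apache.kafka.common.InvalidRecordException
import org.apache.kafka.common.errors.CorruptRecordException
import org.apache.kafka.common.record.MemoryRecords

def appendFetchedRecords(log: UnifiedLog, records: MemoryRecords, leaderEpoch: Int): Unit = {
  try {
    // Batches whose partition leader epoch exceeds leaderEpoch are skipped inside the append.
    log.appendAsFollower(records, leaderEpoch)
  } catch {
    case e @ (_: CorruptRecordException | _: InvalidRecordException) =>
      // Corrupted bytes produced by a racing truncation are logged and dropped, not fatal.
      println(s"Ignoring invalid records from the leader: ${e.getMessage}")
  }
}
```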
Committer Checklist (excluded from commit message)