Description
Background
In our production environment, we discovered the following issues with primary promotion in segment replication and attempted to address them. This RFC mainly describes our solution and also aims to gather suggestions from the community.
- During primary promotion of vanilla segment replication, data loss may occur.
- Primary promotion may take up to 15 minutes or even longer. During this period, write throughput drops significantly.
The purpose of this RFC is to discuss solutions to the data loss in vanilla segment replication. The other issue will be discussed in #20131.
Reproduction
Case of data loss
I introduced `SegmentReplicationIT#testPrimaryStopped_ReplicaPromoted_reproduction_data_loss`, which can reproduce the data loss. I have submitted the code to a branch.
The execution process is described as follows.
1. Start two nodes.
2. Create an index with `1` primary shard and `1` replica shard, enable segment replication, and disable automatic refresh.
3. Write `doc1`. The `nextSeqNo` of the primary shard is updated to `1`, and the `processedCheckpoint` and `maxSeqNo` are updated to `0`. The `nextSeqNo` of the replica shard is updated to `1`.
4. Write `doc2`. The `nextSeqNo` of the primary shard is updated to `2`.
5. Before the primary shard executes `InternalEngine#indexIntoLucene` on `doc2`, add a lock to block it.
6. Perform a flush operation. The primary shard builds a segment and persists the index files to disk, with the `local_checkpoint` in userData being `0` and the `max_seq_no` being `1`.
7. Wait for the segment replication to finish. Both the primary and the replica contain `doc1`. The replica shard updates its `processedCheckpoint` to `1`.
8. Release the lock from Step 5 to allow the write operation on `doc2` to complete. Both the primary shard and the replica shard contain the translog entry for `doc2`.
9. Shut down the node where the primary shard is located.
10. The replica is promoted to primary. First, the `NRTReplicationEngine` is closed, persisting the index files to disk with the `local_checkpoint` in userData being `1` and the `max_seq_no` being `1`. Then the shard switches to `InternalEngine`, starts translog recovery from `processedCheckpoint + 1`, and skips the translog entry for `doc2`.
11. After the replica is promoted to primary, `doc2` is lost.
Analysis
The cause of data loss
During primary promotion, the replica first closes the engine, records `LocalCheckpointTracker#processedCheckpoint` in userData, and persists the index files. It then switches to `InternalEngine` and starts recovering the translog from `LocalCheckpointTracker#processedCheckpoint + 1`.
In vanilla segment replication, during the finalize phase of segment replication, the replica advances `LocalCheckpointTracker#processedCheckpoint` to `infos.userData.get(MAX_SEQ_NO)`. This `max_seq_no` is recorded by the primary shard during the flush operation, so it can be ahead of the operations actually present in the flushed segments (as with `doc2` above).
This also means that a doc whose sequence number lies between `LocalCheckpointTracker#processedCheckpoint` and `LocalCheckpointTracker#nextSeqNo` may be lost after the replica is promoted.
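For illustration, here is a minimal, self-contained sketch of the bookkeeping that produces the loss. It is not OpenSearch code; the names below are simplified stand-ins that mirror the values from the reproduction scenario.

```java
// Simplified model of the checkpoint bookkeeping described above.
// All names are illustrative stand-ins, not the actual OpenSearch classes.
import java.util.List;

public class CheckpointLossSketch {
    public static void main(String[] args) {
        // At flush time on the primary: doc1 (seqNo 0) is in the segment,
        // doc2 (seqNo 1) is only in the translog, yet max_seq_no is already 1.
        long flushedMaxSeqNo = 1;

        // After step 8, the replica's translog holds both operations.
        List<Long> replicaTranslogSeqNos = List.of(0L, 1L);

        // Finalize phase of segment replication: the replica advances its
        // processed checkpoint to max_seq_no from the copied segment infos,
        // even though seqNo 1 (doc2) is not present in those segments.
        long replicaProcessedCheckpoint = flushedMaxSeqNo; // becomes 1

        // Promotion: translog recovery replays only ops above the checkpoint.
        long recoverFrom = replicaProcessedCheckpoint + 1; // 2
        long replayed = replicaTranslogSeqNos.stream()
                .filter(seqNo -> seqNo >= recoverFrom)
                .count();

        System.out.println("ops replayed from translog: " + replayed); // prints 0
        System.out.println("doc2 lost: " + (replayed == 0));           // prints true
    }
}
```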
Solution
Avoid data loss
During segment replication, when `InternalEngine#getSegmentInfosSnapshot` is invoked, record `InternalEngine.LastRefreshedCheckpointListener#refreshedCheckpoint` in `segmentInfos.userData`, and use that value, rather than `max_seq_no`, to update the replica shard's `LocalCheckpointTracker#processedCheckpoint`.
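A rough sketch of the intended effect, continuing the simplified model above. The userData key name `refreshed_checkpoint` and the class below are hypothetical and only illustrate the idea; the actual change would live in the segment replication finalize path.

```java
// With the fix, the replica advances its processed checkpoint only to the
// checkpoint that was actually refreshed into the copied segments, so the
// translog entry for doc2 is still replayed on promotion.
import java.util.List;
import java.util.Map;

public class CheckpointFixSketch {
    public static void main(String[] args) {
        // Hypothetical userData written by the primary when the segment infos
        // snapshot is taken ("refreshed_checkpoint" is an illustrative key name).
        Map<String, String> segmentInfosUserData = Map.of(
                "max_seq_no", "1",
                "refreshed_checkpoint", "0" // only seqNo 0 (doc1) was refreshed into segments
        );

        List<Long> replicaTranslogSeqNos = List.of(0L, 1L);

        // Proposed behavior: use the refreshed checkpoint, not max_seq_no,
        // to advance the replica's processed checkpoint.
        long replicaProcessedCheckpoint =
                Long.parseLong(segmentInfosUserData.get("refreshed_checkpoint")); // 0

        long recoverFrom = replicaProcessedCheckpoint + 1; // 1
        long replayed = replicaTranslogSeqNos.stream()
                .filter(seqNo -> seqNo >= recoverFrom)
                .count();

        System.out.println("ops replayed from translog: " + replayed); // prints 1 (doc2)
    }
}
```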
Evaluation
No data loss
With the new logic in `NRTReplicationEngine#updateSegments` on the branch, the test `SegmentReplicationIT#testPrimaryStopped_ReplicaPromoted_reproduction_data_loss` passes.
Related component
No response
Describe alternatives you've considered
No response
Additional context
No response