[RFC] Avoid data loss in vanilla segment replication #20118

@guojialiang92

Description

Background

In our production environment, we discovered the following issues with primary promotion under segment replication and set out to address them. This RFC mainly describes our solution and also aims to gather suggestions from the community.

  1. During primary promotion under vanilla segment replication, data loss may occur.
  2. Primary promotion may take up to 15 minutes or even longer, and write throughput drops significantly during this period.

The purpose of this RFC is to discuss a solution to the data loss in vanilla segment replication.
The promotion-latency issue is discussed in #20131.

Reproduction

Case of data loss

I introduced SegmentReplicationIT#testPrimaryStopped_ReplicaPromoted_reproduction_data_loss, which reproduces the data loss, and pushed the code to a branch.
The test proceeds as follows.

  1. Start two nodes.
  2. Create an index with 1 primary shard and 1 replica shard, enable segment replication, and disable automatic refresh.
  3. Write doc1. The primary shard's nextSeqNo is updated to 1, and its processedCheckpoint and maxSeqNo are updated to 0. The replica shard's nextSeqNo is updated to 1.
  4. Write doc2. The primary shard's nextSeqNo is updated to 2.
  5. Before the primary shard executes InternalEngine#indexIntoLucene for doc2, acquire a lock to block the operation.
  6. Perform a flush. The primary shard builds a segment and persists the index files to disk, with local_checkpoint set to 0 and max_seq_no set to 1 in userData.
  7. Wait for segment replication to finish. Both the primary and the replica now contain doc1, and the replica shard has advanced its processedCheckpoint to 1.
  8. Release the lock from step 5 so the write of doc2 can complete. Both the primary shard and the replica shard now hold the translog operation for doc2.
  9. Shut down the node hosting the primary shard.
  10. The replica is promoted to primary. It first closes the NRTReplicationEngine, persisting the index files to disk with local_checkpoint set to 1 and max_seq_no set to 1 in userData. It then switches to InternalEngine, starts translog recovery from processedCheckpoint + 1, and skips the translog operation for doc2 (see the sketch after this list).
  11. After the replica is promoted to primary, doc2 is lost.
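
To make steps 7 and 10 concrete, here is a minimal, self-contained sketch of the recovery arithmetic. It is illustrative only (the class and variable names are hypothetical, not OpenSearch code): because the finalize phase already advanced processedCheckpoint to the primary's flush-time max_seq_no, translog recovery starts past doc2's seqNo.

```java
// Hypothetical standalone sketch of step 10's translog recovery; not OpenSearch code.
public class TranslogSkipSketch {
    public static void main(String[] args) {
        // The finalize phase advanced processedCheckpoint to the flush-time
        // MAX_SEQ_NO (1), although doc2 (seqNo 1) never reached a replicated segment.
        long processedCheckpoint = 1;
        long recoverFrom = processedCheckpoint + 1; // recovery starts at seqNo 2

        long[] translogSeqNos = {0, 1}; // doc1 = seqNo 0, doc2 = seqNo 1
        for (long seqNo : translogSeqNos) {
            if (seqNo < recoverFrom) {
                // doc2 lands here: its operation is never replayed into Lucene.
                System.out.println("seqNo " + seqNo + ": skipped");
            } else {
                System.out.println("seqNo " + seqNo + ": replayed into InternalEngine");
            }
        }
    }
}
```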

Analysis

The cause of data loss

During primary promotion, the replica first closes the engine, records LocalCheckpointTracker#processedCheckpoint in userData, and persists the index files. It then switches to InternalEngine and starts recovering the translog from LocalCheckpointTracker#processedCheckpoint + 1.
Under vanilla segment replication, during the finalize phase the replica advances LocalCheckpointTracker#processedCheckpoint to infos.userData.get(MAX_SEQ_NO). That value is recorded by the primary shard during the flush operation, so it can be ahead of what the primary has actually indexed into Lucene.
This means that any doc with a seqNo between LocalCheckpointTracker#processedCheckpoint and LocalCheckpointTracker#nextSeqNo may be lost after the replica is promoted.
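
The following is a simplified sketch of the finalize-phase behavior described above, approximating NRTReplicationEngine#updateSegments; surrounding logic is elided and details may differ from the actual implementation.

```java
// Approximate sketch of the replica's finalize phase today; not the exact code.
synchronized void updateSegments(SegmentInfos infos) throws IOException {
    // MAX_SEQ_NO in userData was recorded by the primary at flush time, so it
    // can cover operations (like doc2) not yet indexed into the primary's Lucene.
    final long maxSeqNo = Long.parseLong(infos.userData.get(MAX_SEQ_NO));
    // This over-advances processedCheckpoint past unreplicated operations.
    localCheckpointTracker.fastForwardProcessedSeqNo(maxSeqNo);
    // ... commit and refresh handling elided ...
}
```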

Solution

Avoid data loss

During segment replication, when InternalEngine#getSegmentInfosSnapshot is invoked on the primary, record InternalEngine.LastRefreshedCheckpointListener#refreshedCheckpoint in segmentInfos.userData, and use that value instead of the flush-time MAX_SEQ_NO to update the replica shard's LocalCheckpointTracker#processedCheckpoint. The refreshed checkpoint covers only operations already visible in the refreshed reader, so the replica never advances past operations that have not been indexed into Lucene on the primary.
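
A hedged sketch of the proposed wiring is shown below; the key name PROCESSED_CHECKPOINT_KEY and the exact call sites are assumptions for illustration, not the final patch.

```java
// PROCESSED_CHECKPOINT_KEY is a hypothetical key name for illustration.
static final String PROCESSED_CHECKPOINT_KEY = "processed_checkpoint";

// Primary side (sketch), inside InternalEngine#getSegmentInfosSnapshot:
// stamp the refreshed checkpoint into the snapshot's userData. Unlike the
// flush-time MAX_SEQ_NO, it reflects only operations already indexed into Lucene.
infos.userData.put(PROCESSED_CHECKPOINT_KEY, Long.toString(lastRefreshedCheckpoint()));

// Replica side (sketch), inside NRTReplicationEngine#updateSegments:
final String processed = infos.userData.get(PROCESSED_CHECKPOINT_KEY);
if (processed != null) {
    // Advance only as far as the primary actually indexed, so translog
    // recovery after promotion still replays doc2.
    localCheckpointTracker.fastForwardProcessedSeqNo(Long.parseLong(processed));
}
```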

Evaluation

No data loss

On the branch, with the new logic in NRTReplicationEngine#updateSegments, the test SegmentReplicationIT#testPrimaryStopped_ReplicaPromoted_reproduction_data_loss passes.

Related component

No response

Describe alternatives you've considered

No response

Additional context

No response
