Description
Background
In our production environment, we discovered the following issues with primary promotion in segment replication and attempted to address them. This RFC mainly describes our solution and also aims to gather suggestions from the community.
- During primary promotion of vanilla segment replication, data loss may occur.
- Primary promotion may take up to 15 minutes or even longer. During this period, write throughput drops significantly.
The purpose of this RFC is to discuss solutions to the data loss in vanilla segment replication. The other issue will be discussed in #20131.
Reproduction
Case of data loss
I introduced `SegmentReplicationIT#testPrimaryStopped_ReplicaPromoted_reproduction_data_loss`, which can reproduce the data loss. I have submitted the code to a branch.
The execution process is described as follows.
1. Start two nodes.
2. Create an index with `1` primary shard and `1` replica shard, enable segment replication, and disable automatic refresh.
3. Write `doc1`. The `nextSeqNo` of the primary shard is updated to `1`, and the `processedCheckpoint` and `maxSeqNo` are updated to `0`. The `nextSeqNo` of the replica shard is updated to `1`.
4. Write `doc2`. The `nextSeqNo` of the primary shard is updated to `2`.
5. Before the primary shard executes `InternalEngine#indexIntoLucene` on `doc2`, add a lock to block it.
6. Perform a flush operation. The primary shard builds a segment and persists the index files to disk, with the `local_checkpoint` in userData being `0` and the `max_seq_no` being `1`.
7. Wait for the segment replication to finish. Both the primary and the replica contain `doc1`. The replica shard updates its `processedCheckpoint` to `1`.
8. Release the lock from Step 5 to allow the write operation on `doc2` to complete. Both the primary shard and the replica shard contain the translog entry for `doc2`.
9. Shut down the node where the primary shard is located.
10. The replica is promoted to primary. First, the `NRTReplicationEngine` is closed, persisting the index files to disk with the `local_checkpoint` in userData being `1` and the `max_seq_no` being `1`. Then the shard switches to `InternalEngine`, starts translog recovery from `processedCheckpoint + 1`, and skips the translog entry for `doc2`.
11. After the replica is promoted to primary, `doc2` is lost.
Analysis
The cause of data loss
During primary promotion, the replica first closes the engine, records `LocalCheckpointTracker#processedCheckpoint` in userData, and persists the index files. It then switches to `InternalEngine` and starts recovering the translog from `LocalCheckpointTracker#processedCheckpoint + 1`.
In vanilla segment replication, during the finalize phase of segment replication, the replica advances `LocalCheckpointTracker#processedCheckpoint` to `infos.userData.get(MAX_SEQ_NO)`. This `max_seq_no` is recorded by the primary shard during the flush operation, so it can be ahead of the operations actually present in the flushed segments (as with `doc2` above).
This also means that a doc whose sequence number lies between `LocalCheckpointTracker#processedCheckpoint` and `LocalCheckpointTracker#nextSeqNo` may be lost after the replica is promoted.
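For illustration, here is a minimal, self-contained sketch of the bookkeeping that produces the loss. It is not OpenSearch code; the names below are simplified stand-ins that mirror the values from the reproduction scenario.

```java
// Simplified model of the checkpoint bookkeeping described above.
// All names are illustrative stand-ins, not the actual OpenSearch classes.
import java.util.List;

public class CheckpointLossSketch {
    public static void main(String[] args) {
        // At flush time on the primary: doc1 (seqNo 0) is in the segment,
        // doc2 (seqNo 1) is only in the translog, yet max_seq_no is already 1.
        long flushedMaxSeqNo = 1;

        // After step 8, the replica's translog holds both operations.
        List<Long> replicaTranslogSeqNos = List.of(0L, 1L);

        // Finalize phase of segment replication: the replica advances its
        // processed checkpoint to max_seq_no from the copied segment infos,
        // even though seqNo 1 (doc2) is not present in those segments.
        long replicaProcessedCheckpoint = flushedMaxSeqNo; // becomes 1

        // Promotion: translog recovery replays only ops above the checkpoint.
        long recoverFrom = replicaProcessedCheckpoint + 1; // 2
        long replayed = replicaTranslogSeqNos.stream()
                .filter(seqNo -> seqNo >= recoverFrom)
                .count();

        System.out.println("ops replayed from translog: " + replayed); // prints 0
        System.out.println("doc2 lost: " + (replayed == 0));           // prints true
    }
}
```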
Solution
Avoid data loss
During segment replication, when `InternalEngine#getSegmentInfosSnapshot` is invoked, record `InternalEngine.LastRefreshedCheckpointListener#refreshedCheckpoint` in `segmentInfos.userData`, and use that value, rather than `max_seq_no`, to update the replica shard's `LocalCheckpointTracker#processedCheckpoint`.
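A rough sketch of the intended effect, continuing the simplified model above. The userData key name `refreshed_checkpoint` and the class below are hypothetical and only illustrate the idea; the actual change would live in the segment replication finalize path.

```java
// With the fix, the replica advances its processed checkpoint only to the
// checkpoint that was actually refreshed into the copied segments, so the
// translog entry for doc2 is still replayed on promotion.
import java.util.List;
import java.util.Map;

public class CheckpointFixSketch {
    public static void main(String[] args) {
        // Hypothetical userData written by the primary when the segment infos
        // snapshot is taken ("refreshed_checkpoint" is an illustrative key name).
        Map<String, String> segmentInfosUserData = Map.of(
                "max_seq_no", "1",
                "refreshed_checkpoint", "0" // only seqNo 0 (doc1) was refreshed into segments
        );

        List<Long> replicaTranslogSeqNos = List.of(0L, 1L);

        // Proposed behavior: use the refreshed checkpoint, not max_seq_no,
        // to advance the replica's processed checkpoint.
        long replicaProcessedCheckpoint =
                Long.parseLong(segmentInfosUserData.get("refreshed_checkpoint")); // 0

        long recoverFrom = replicaProcessedCheckpoint + 1; // 1
        long replayed = replicaTranslogSeqNos.stream()
                .filter(seqNo -> seqNo >= recoverFrom)
                .count();

        System.out.println("ops replayed from translog: " + replayed); // prints 1 (doc2)
    }
}
```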
Evaluation
No data loss
With the new logic in `NRTReplicationEngine#updateSegments` on the branch, the test `SegmentReplicationIT#testPrimaryStopped_ReplicaPromoted_reproduction_data_loss` passes.
Related component
No response
Describe alternatives you've considered
No response
Additional context
No response