Pauseless Consumption #3: Disaster Recovery with Reingestion #14920
base: master
Conversation
1. Changing FSM
2. Changing the 3 steps performed during the commit protocol to update ZK and Ideal state
1. Changes in the commit protocol to start the segment commit before the build
2. Changes in the BaseTableDataManager to ensure that the locally built segment is replaced by a downloaded one only when the CRC is present in the ZK metadata (a sketch of this CRC gate follows the commit list below)
3. Changes in the download segment method to allow waiting for the download in case of pauseless consumption
…segment commit end metadata call. Refactoring code for readability
… ingestion by moving it out of streamConfigMap
…auseless ingestion in RealtimeSegmentValidationManager
…d by RealtimeSegmentValidationManager to fix commit protocol failures
…g commit protocol
…ption is enabled or not
…eepstore path with fallbacks
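As a rough illustration of the CRC gate referenced in point 2 above, the locally built segment would only be swapped for the downloaded copy once the committed CRC has been written to the segment's ZK metadata. This is a hedged sketch: the method names and the non-positive-CRC convention are assumptions, not the PR's actual code.

// Hypothetical sketch of the CRC gate described in the commit list above.
void maybeReplaceWithDownloadedCopy(String segmentName, SegmentZKMetadata zkMetadata) {
  // During pauseless consumption the commit may still be in flight, in which case the CRC
  // has not yet been written to ZK and the locally built segment must be kept.
  if (zkMetadata.getCrc() <= 0) {
    LOGGER.info("CRC not yet present in ZK metadata for segment: {}, keeping locally built copy", segmentName);
    return;
  }
  // CRC is present, so the committed copy is authoritative: download and replace.
  downloadAndReplaceSegment(segmentName, zkMetadata); // hypothetical helper
}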
So this check is only performed when the RealtimeSegmentValidationManager job runs, right? If yes, then should this logic be part of a separate dedicated job with a higher frequency?
@@ -2195,6 +2198,139 @@ URI createSegmentPath(String rawTableName, String segmentName) {
    return URIUtils.getUri(_controllerConf.getDataDir(), rawTableName, URIUtils.encode(segmentName));
  }

  /**
   * Re-ingests segments that are in DONE status with a missing download URL, but also
Shouldn't this be: "Re-ingests segments that are in ONLINE status with a missing download URL, but also"
LOGGER.info(
    "Segment {} in table {} is in ERROR state with download URL present. Resetting segment to ONLINE state.",
    segmentName, tableNameWithType);
_helixResourceManager.resetSegment(tableNameWithType, segmentName, null);
Reset segment does not work when the SegmentDataManager is missing on the server. Consider the following scenario:
1. A segment has a missing download URL. The server hosting the segment restarts and the segment goes into ERROR state in the EV.
2. Re-ingestion updates the ZK metadata and the reset segment message is sent.
3. The server does not have any SegmentDataManager instance for the segment, hence the reset does not work.
protected void doReplaceSegment(String segmentName)
throws Exception {
SegmentDataManager segmentDataManager = _segmentDataManagerMap.get(segmentName);
if (segmentDataManager != null) {
SegmentZKMetadata zkMetadata = fetchZKMetadata(segmentName);
IndexLoadingConfig indexLoadingConfig = fetchIndexLoadingConfig();
indexLoadingConfig.setSegmentTier(zkMetadata.getTier());
replaceSegmentIfCrcMismatch(segmentDataManager, zkMetadata, indexLoadingConfig);
} else {
_logger.warn("Failed to find segment: {}, skipping replacing it", segmentName);
}
}
I ran the above code and found the following error:
[upsertMeetupRsvp_with_dr_2_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_40] Failed to find segment: upsertMeetupRsvp_with_dr_2__0__57__20250127T0745Z, skipping replacing it
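One possible direction, shown only as a hypothetical sketch (the download fallback below does not correspond to an existing Pinot method), would be to load the committed copy when no SegmentDataManager is present instead of skipping the replace:

protected void doReplaceSegment(String segmentName)
    throws Exception {
  SegmentZKMetadata zkMetadata = fetchZKMetadata(segmentName);
  IndexLoadingConfig indexLoadingConfig = fetchIndexLoadingConfig();
  indexLoadingConfig.setSegmentTier(zkMetadata.getTier());
  SegmentDataManager segmentDataManager = _segmentDataManagerMap.get(segmentName);
  if (segmentDataManager != null) {
    replaceSegmentIfCrcMismatch(segmentDataManager, zkMetadata, indexLoadingConfig);
  } else {
    // Hypothetical fallback: no SegmentDataManager exists locally (e.g. after a server
    // restart), so load the committed segment instead of silently skipping the replace.
    _logger.warn("No SegmentDataManager for segment: {}, downloading committed copy instead", segmentName);
    downloadAndLoadSegment(segmentName, zkMetadata, indexLoadingConfig); // hypothetical helper
  }
}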
45f6f29 to 609942d
…can only be updated after a fixed time has elapsed. Reduced the time requirements by creating a FakePauselessLLCRealtimeSegmentManager.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files
@@ Coverage Diff @@
##            master   #14920     +/-   ##
============================================
+ Coverage     61.75%   63.36%    +1.61%
- Complexity      207     1376     +1169
============================================
  Files          2436     2714      +278
  Lines        133233   152474    +19241
  Branches      20636    23521     +2885
============================================
+ Hits          82274    96615    +14341
- Misses        44911    48584     +3673
- Partials       6048     7275     +1227

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
partial review
// Grab start/end offsets
String startOffsetStr = segmentZKMetadata.getStartOffset();
String endOffsetStr = segmentZKMetadata.getEndOffset();
if (startOffsetStr == null || endOffsetStr == null) {
(minor) Maybe let's break up all these similar validations into a separate method (will be useful for unit testing as well).
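For example, the offset check above could move into a small helper along these lines (an illustrative sketch; the method name is an assumption, not the PR's code):

// Illustrative sketch: pulls the offset presence check out of the main flow so it can be
// unit tested in isolation.
private static void validateOffsetsPresent(SegmentZKMetadata segmentZKMetadata) {
  String startOffsetStr = segmentZKMetadata.getStartOffset();
  String endOffsetStr = segmentZKMetadata.getEndOffset();
  if (startOffsetStr == null || endOffsetStr == null) {
    throw new IllegalStateException(
        "Missing start/end offset in ZK metadata for segment: " + segmentZKMetadata.getSegmentName());
  }
}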
} catch (Exception e) {
  LOGGER.error("Error during async re-ingestion for job {} (segment={})", jobId, segmentName, e);
} finally {
  isIngesting.set(false);
Won't there be a race condition here when the same segment is scheduled to be re-ingested again?
No, because it will get returned when trying to set this to true right?
I was talking about the instant just after the flag is set to false. (Just looking at it from a server API perspective, ignoring how frequently this API is called.)
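For reference, the usual way to close that window is to claim the flag atomically before doing any work, roughly like the generic sketch below (not the PR's actual code; reingest() is a hypothetical worker method):

// Generic sketch of the guard being discussed: only the caller that wins compareAndSet
// runs the re-ingestion; everyone else is rejected, and the flag is cleared in finally.
if (!isIngesting.compareAndSet(false, true)) {
  throw new WebApplicationException("Re-ingestion already running for segment: " + segmentName,
      Response.Status.CONFLICT);
}
try {
  reingest(segmentName); // hypothetical worker method
} finally {
  isIngesting.set(false);
}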
pinot-server/src/main/java/org/apache/pinot/server/api/resources/ReIngestionResource.java (outdated comment, resolved)
String tableNameWithType = request.getTableNameWithType();
String segmentName = request.getSegmentName();

if (RUNNING_JOBS.size() >= MAX_PARALLEL_REINGESTIONS) {
These states are in-memory, hence I'm curious about the case when the controller sends a re-ingest request to another server for the same segment.
That's a good point. Currently it will get executed. However, that scenario is unlikely as reingestion runs extremely infrequently (once every few hours).
One possible solution could be ZK.
Another would be to keep track on the controller side instead of the server side.
Will check.
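If the tracking moved to the controller, a minimal sketch could look like this (names and structure are assumptions; a real version would also need to clear entries when jobs finish, e.g. via a callback or TTL):

// Hypothetical controller-side tracking: at most one in-flight re-ingestion per segment,
// regardless of which server the request ends up on.
private final ConcurrentMap<String, Long> _reingestionJobStartTimes = new ConcurrentHashMap<>();

boolean tryTriggerReingestion(String tableNameWithType, String segmentName) {
  // putIfAbsent is atomic, so two validation runs cannot both claim the same segment.
  if (_reingestionJobStartTimes.putIfAbsent(segmentName, System.currentTimeMillis()) != null) {
    return false;
  }
  sendReingestRequest(tableNameWithType, segmentName); // hypothetical helper
  return true;
}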
reingestion runs extremely infrequently (once per few hours)
Isn't this a concern as well? Users might not be OK with data missing for a long time.
/**
 * Simplified Segment Data Manager for ingesting data from a start offset to an end offset.
 */
public class SimpleRealtimeSegmentDataManager extends SegmentDataManager {
Won't this create code debt in the future, maintaining both SimpleRealtimeSegmentDataManager and RealtimeSegmentDataManager?
Maybe we should leverage RealtimeSegmentDataManager methods in this class
Unfortunately that is not possible, as these methods diverge a lot from the main ones.
This PR adds support for disaster recovery for pauseless ingestion, along with reingestion. These changes help solve the scenario where real-time segments permanently fail to transition out of ERROR state, leading to data gaps. With reingestion, Pinot can recover such segments, ensuring availability and correctness of real-time data.
During pauseless ingestion, an ONLINE segment can wind up in an ERROR state if its commit fails due to a server restart and there are no other replicas. Currently in Pinot, there is no way to recover from such failures.
Reingestion Flow
Segments that fail to commit or end up in ERROR state can now be re-ingested by calling a new endpoint (/reingestSegment) on the server. The ReIngestionResource reconstructs the segment from the stream, builds it, and commits it, ensuring that offline peers and the deep store get updated properly.
If successful, the re-ingested segment transitions from ERROR to ONLINE.
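At a high level the server-side flow looks roughly like the sketch below. This is illustrative only; the helper methods are assumptions, not the actual ReIngestionResource code, though the start/end offsets do come from the segment's ZK metadata as shown in the review snippets above.

// Rough outline of the re-ingestion flow described above (all helpers are illustrative).
void reingestSegment(String tableNameWithType, String segmentName) throws Exception {
  SegmentZKMetadata zkMetadata = fetchZKMetadata(tableNameWithType, segmentName);
  // 1. Re-consume exactly the offset range the original segment covered.
  String startOffset = zkMetadata.getStartOffset();
  String endOffset = zkMetadata.getEndOffset();
  File segmentDir = consumeAndBuildSegment(tableNameWithType, segmentName, startOffset, endOffset);
  // 2. Upload the rebuilt segment to the deep store so the missing download URL can be filled in.
  String downloadUrl = uploadToDeepStore(segmentDir);
  // 3. Complete the commit so ZK metadata is updated and the segment can be reset from ERROR to ONLINE.
  commitSegmentEndMetadata(tableNameWithType, segmentName, downloadUrl);
}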
New APIs introduced:

Get Running Re-ingestion Jobs
GET /reingestSegment/jobs
Returns all currently running re-ingestion jobs with their status information.

Re-ingest Segment
POST /reingestSegment
Asynchronously re-ingests a segment with updated configurations.
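For example, a client could trigger a re-ingestion like this (a sketch: the host and port are assumptions, the JSON field names follow the request object seen in the review snippets above, and the table and segment names are taken from the log line quoted earlier):

// Sketch of calling the new server endpoint with Java's built-in HTTP client.
HttpClient client = HttpClient.newHttpClient();
String body = "{\"tableNameWithType\": \"upsertMeetupRsvp_with_dr_2_REALTIME\", "
    + "\"segmentName\": \"upsertMeetupRsvp_with_dr_2__0__57__20250127T0745Z\"}";
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://server-host:8097/reingestSegment"))  // assumed server admin port
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build();
// The endpoint is asynchronous: the call returns once the job is scheduled, not when it completes.
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.statusCode() + " " + response.body());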
Reingestion data flow
Reingestion design diagram