Pauseless Consumption #3: Disaster Recovery with Reingestion #14920
base: master
When there are `ONLINE` replicas, ideally we should reset the `ERROR` replica. Do we rely on the validation manager for that?
> When there are ONLINE replica, ideally we should reset the ERROR replica. Do we rely on validation manager for that?

Yes, that is already part of the validation manager: https://github.com/apache/pinot/pull/14217/files
Can we pull that logic out and handle it differently for pauseless tables? For a pauseless table, reset doesn't really work. Also, that shouldn't be a segment-level validation.
Do we need to wait for the segment to be in `ERROR` state on all the servers? What would happen in this case:
Server 1 stays in ERROR state as it's not reset.
That can happen even in the non-pauseless scenario, right? It should already be handled. Ideally we simply need to reset the segment in this scenario, without reingestion.
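To make the distinction in this thread concrete, here is a minimal sketch of the decision being discussed: reset when at least one replica is still `ONLINE`, and fall back to reingestion only when every replica is in `ERROR`. The class `RecoveryDecision`, the `Action` enum, and the map of server name to state are hypothetical names for illustration, not Pinot's actual validation code, and the "wait until all replicas are in ERROR" rule is an assumption taken from the question above.

```java
import java.util.Map;

// Hypothetical helper, not part of Pinot: picks a recovery action for one segment
// from the per-server states that would be read from the external view.
final class RecoveryDecision {
  enum Action { NONE, RESET_ERROR_REPLICAS, REINGEST }

  static Action decide(Map<String, String> stateByServer) {
    // Nothing to recover if no replica is in ERROR.
    if (!stateByServer.containsValue("ERROR")) {
      return Action.NONE;
    }
    // A healthy replica exists, so a plain reset of the ERROR replicas should suffice.
    if (stateByServer.containsValue("ONLINE")) {
      return Action.RESET_ERROR_REPLICAS;
    }
    // Assumption: reingest only once every replica is in ERROR; otherwise keep waiting.
    boolean allError = stateByServer.values().stream().allMatch("ERROR"::equals);
    return allError ? Action.REINGEST : Action.NONE;
  }
}
```

With this shape, the case where Server 1 stays in `ERROR` while another replica is `ONLINE` resolves to a reset rather than a reingestion.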
This already happens in segment level validation
https://github.com/apache/pinot/pull/14217/files
Reset segment does not work when the SegmentDataManager is missing on the server. Consider the following scenario:
A segment has a missing URL. The server hosting the segment restarts, and the segment goes into ERROR state in the EV.
The re-ingestion updates the ZKMetadata, and the reset segment message is sent.
The server does not have a SegmentDataManager instance for the segment, so the reset does not work.
I ran the above code and found the following error:
[upsertMeetupRsvp_with_dr_2_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_40] Failed to find segment: upsertMeetupRsvp_with_dr_2__0__57__20250127T0745Z, skipping replacing it
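The failure mode in the scenario above can be illustrated with a small stand-in. This is only a sketch: `TableDataManagerSketch` and its methods are hypothetical and do not reproduce `RealtimeTableDataManager`; a map of segment name to CRC plays the role of the registered segment data managers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a server-side table data manager.
final class TableDataManagerSketch {
  // Segment name -> CRC of the currently loaded copy (stands in for a SegmentDataManager).
  private final Map<String, Long> loadedSegments = new ConcurrentHashMap<>();

  void addSegment(String segmentName, long crc) {
    loadedSegments.put(segmentName, crc);
  }

  // Returns true if the segment was replaced, false if it was skipped because
  // no manager is registered for it (the "Failed to find segment" case).
  boolean replaceSegment(String segmentName, long newCrc) {
    if (!loadedSegments.containsKey(segmentName)) {
      // A replica that restarted into ERROR never registered a manager,
      // so a replace-based reset has nothing to act on.
      return false;
    }
    loadedSegments.put(segmentName, newCrc);
    return true;
  }
}
```

In this sketch a replace only succeeds for segments that were loaded at least once, which is why a reset that relies on replace semantics becomes a no-op for the restarted server.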
I am not able to come up with a better approach to achieve this, but this does not seem like the right way to cater to re-ingestion.
@noob-se7en Can you please review this part once? Feel free to post any questions regarding the context.
This check was added in the following PR - https://github.com/apache/pinot/pull/12886/files
What problems do you see with this?
When will the CRC match? We shouldn't replace a segment multiple times (e.g. if somehow 2 servers try to re-ingest).
Yeah, I am also not happy with this. What I am trying to solve for is basically making segment refresh succeed for reingestion.
Without this check it fails on the following line:
Segment refresh is triggered whenever a segment is uploaded which is what reingestion is doing.
We probably need a new API to handle re-ingested segments. It is not a regular segment push. With a new API we can also set the status to `DONE`, and then trigger the reset from the API.
Cool, added a new API called `/segment/completeReingestion` that handles this.
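For readers following the thread, the flow proposed here (mark the re-ingested segment `DONE`, then trigger the reset, and do it at most once even if two servers re-ingest) could look roughly like the sketch below. Everything in it is an assumption for illustration: `ReingestionCompleter`, its methods, and the `IN_PROGRESS` status are hypothetical names, and this is not the code behind `/segment/completeReingestion`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical controller-side helper, not Pinot's implementation.
final class ReingestionCompleter {
  // Only DONE is named in the thread; IN_PROGRESS is a placeholder for "not yet completed".
  enum Status { IN_PROGRESS, DONE }

  private final Map<String, Status> statusBySegment = new ConcurrentHashMap<>();
  // Records which segments a reset was requested for (stands in for sending a reset message).
  private final List<String> resetRequests = new CopyOnWriteArrayList<>();

  void register(String segmentName, Status status) {
    statusBySegment.put(segmentName, status);
  }

  // Returns true only for the first caller that completes a known, not-yet-DONE segment.
  boolean completeReingestion(String segmentName) {
    // Atomic compare-and-set: a second uploader for the same segment loses the race.
    if (!statusBySegment.replace(segmentName, Status.IN_PROGRESS, Status.DONE)) {
      return false;
    }
    resetRequests.add(segmentName);
    return true;
  }

  List<String> resetRequests() {
    return List.copyOf(resetRequests);
  }
}
```

The compare-and-set on the status is one way to address the earlier concern about replacing a segment multiple times: only the first completion flips the status and issues the reset.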