-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC][CELEBORN-894] Add support for end to end Integrity Checks #3062
Conversation
Glad to see your PR. If it's ready you can remove the draft label. |
Hi @FMX |
Code review will be done within this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gauravkm Looks like this PR is incomplete. Are there more PRs for this Jira Ticket?
@@ -377,7 +377,8 @@ private void close() throws IOException, InterruptedException { | |||
updateRecordsWrittenMetrics(); | |||
|
|||
long waitStartTime = System.nanoTime(); | |||
shuffleClient.mapperEnd(shuffleId, mapId, encodedAttemptId, numMappers); | |||
int bytesWritten = shuffleClient.mapperEnd(shuffleId, mapId, encodedAttemptId, numMappers, numMappers); | |||
writeMetrics.incBytesWritten(bytesWritten); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should not add the bytes written to spark metrics. Because the write metric has the correct value of written bytes.
@@ -32,6 +33,7 @@ public class PushState { | |||
private final int pushBufferMaxSize; | |||
public AtomicReference<IOException> exception = new AtomicReference<>(); | |||
private final InFlightRequestTracker inFlightRequestTracker; | |||
private final ConcurrentHashMap<Integer, CommitMetadata> commitMetadataMap = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This map is always empty. Seems that you forget to update this maps.
int bytes = 0; | ||
|
||
for (int partitionId = 0; partitionId < numPartitions; partitionId++) { | ||
CommitMetadata metadata = metadataMap.getOrDefault(partitionId, new CommitMetadata()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here will always get empty commit meta data.
gentle ping @gauravkm |
@gaoyajun02 Yes. I am still working on this. We (at Stripe) realized that the implementation needs to be a lot more comprehensive and thorough for the checks to be meaningful and provide confidence. We are internally testing out the new implementation which I will then open for review in OSS as well. |
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
This issue was closed because it has been staled for 10 days with no activity. |
gentle ping @gauravkm Since this PR has been closed without updates and this feature is important to us, due to the pressure from our production applications, we hope to accelerate the progress of consistency validation work. We plan to submit a new PR with our implementation. I wanted to check with you one last time before we proceed. We have an improved solution and implementation ready to submit to the community. However, as it's an enhancement based on your initial approach, we wanted to align with you first. Please let us know if you have any concerns or if you'd prefer to continue with your implementation. Thanks for your understanding. |
@gaoyajun02 Apologies that the PR got marked stale and was closed. We do plan to contribute our implementation and are actively testing it out. I am going to ahead and put up a new PR with the updated implementation next week More detailed answer: We would prefer to continue contributing our implementation. I will put up the new PR next week. Please be rest assured that we are very interested in contributing back our implementation to the community With respect to the improvements/optimizations you mentioned, would be great if you could contribute them on top of our implementation. Let me try to reach out to you on slack/email to better understand these. Please feel free to comment about the proposed improvements on the design doc as well Thanks for checking with me and your patience here. |
@gaoyajun02 Could you please drop me an email to gauravkm (at gmail.com) and I am happy to coordinate a 1:1 discussion if you are interested. I am unable to find you on the Celelborn slack or find an email from your github profile. |
Great, I'm really looking forward to seeing the new implementation. If you have plans for next week, I can provide feedback and discussion on your CIP document or new PR in combination with our implementation. @gauravkm |
Thanks! Maybe I should update the design doc as a first step so that you can take a look at the updated design |
@gaoyajun02 Not sure if you have had a chance to look at the design as yet. Please do take a look and feel free to leave comments. The PR is still a work in progress. I could not find enough time to get into a reviewable state today and I am OOO tomorrow I will make progress on putting up the PR next week. Thanks for your continued patience! |
I've reviewed the design document, but I'm still not entirely clear about some details. I'd like to see your PR earlier - could you have it ready today or tomorrow? @gauravkm |
Thanks for reviewing @gaoyajun02! I am in PST timezone. I won't be able to get it to a polished state today, but I will get it into the state by tomorrow where you can get the details you are looking for in code. I am having to apply a lot of changes in the patch by hand. So its taking time. |
@gaoyajun02 Here you go - https://github.com/apache/celeborn/pull/3204/files
|
|
What changes were proposed in this pull request?
CELEBORN-894
End to End integrity checks reference implementation. Not looking to merge this yet. Looking for high level feedback on the approach
Why are the changes needed?
https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0
Does this PR introduce any user-facing change?
How was this patch tested?