Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC][CELEBORN-894] Add support for end to end Integrity Checks #3062

Closed
wants to merge 1 commit into from

Conversation

gauravkm
Copy link

@gauravkm gauravkm commented Jan 10, 2025

What changes were proposed in this pull request?

CELEBORN-894
End to End integrity checks reference implementation. Not looking to merge this yet. Looking for high level feedback on the approach

Why are the changes needed?

https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0

Does this PR introduce any user-facing change?

How was this patch tested?

@gauravkm gauravkm changed the title [CELEBORN-894] Add support for end to end Integrity Checks [Draft][RFC][CELEBORN-894] Add support for end to end Integrity Checks Jan 10, 2025
@FMX
Copy link
Contributor

FMX commented Jan 12, 2025

Glad to see your PR. If it's ready you can remove the draft label.

@gauravkm gauravkm changed the title [Draft][RFC][CELEBORN-894] Add support for end to end Integrity Checks [RFC][CELEBORN-894] Add support for end to end Integrity Checks Jan 14, 2025
@gauravkm
Copy link
Author

gauravkm commented Feb 3, 2025

Hi @FMX
Could you please take a look? I removed the draft tag.

@FMX
Copy link
Contributor

FMX commented Feb 5, 2025

the

Code review will be done within this week.

Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gauravkm Looks like this PR is incomplete. Are there more PRs for this Jira Ticket?

@@ -377,7 +377,8 @@ private void close() throws IOException, InterruptedException {
updateRecordsWrittenMetrics();

long waitStartTime = System.nanoTime();
shuffleClient.mapperEnd(shuffleId, mapId, encodedAttemptId, numMappers);
int bytesWritten = shuffleClient.mapperEnd(shuffleId, mapId, encodedAttemptId, numMappers, numMappers);
writeMetrics.incBytesWritten(bytesWritten);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not add the bytes written to spark metrics. Because the write metric has the correct value of written bytes.

@@ -32,6 +33,7 @@ public class PushState {
private final int pushBufferMaxSize;
public AtomicReference<IOException> exception = new AtomicReference<>();
private final InFlightRequestTracker inFlightRequestTracker;
private final ConcurrentHashMap<Integer, CommitMetadata> commitMetadataMap = new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is always empty. Seems that you forget to update this maps.

int bytes = 0;

for (int partitionId = 0; partitionId < numPartitions; partitionId++) {
CommitMetadata metadata = metadataMap.getOrDefault(partitionId, new CommitMetadata());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here will always get empty commit meta data.

@gaoyajun02
Copy link

gentle ping @gauravkm
Thank you for your PR. This PR is very valuable to us, but it seems incomplete. Are you still working on it?

@gauravkm
Copy link
Author

gauravkm commented Feb 26, 2025

@gaoyajun02 Yes. I am still working on this. We (at Stripe) realized that the implementation needs to be a lot more comprehensive and thorough for the checks to be meaningful and provide confidence. We are internally testing out the new implementation which I will then open for review in OSS as well.
We also found that the current approach introduces a lot of overhead for apps that provision a high partition count but only write to a few of them. So we have altered the design to be able to accommodate such apps.

Copy link

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Mar 18, 2025
Copy link

This issue was closed because it has been staled for 10 days with no activity.

@github-actions github-actions bot closed this Mar 28, 2025
@gaoyajun02
Copy link

gaoyajun02 commented Mar 28, 2025

gentle ping @gauravkm

Since this PR has been closed without updates and this feature is important to us, due to the pressure from our production applications, we hope to accelerate the progress of consistency validation work. We plan to submit a new PR with our implementation. I wanted to check with you one last time before we proceed.

We have an improved solution and implementation ready to submit to the community. However, as it's an enhancement based on your initial approach, we wanted to align with you first.

Please let us know if you have any concerns or if you'd prefer to continue with your implementation.

Thanks for your understanding.

@gauravkm
Copy link
Author

gauravkm commented Mar 28, 2025

@gaoyajun02 Apologies that the PR got marked stale and was closed. We do plan to contribute our implementation and are actively testing it out. I am going to ahead and put up a new PR with the updated implementation next week

More detailed answer:
As i mentioned we have been testing out a new optimized implementation internally and it has now been subject to 1000's of Spark apps and multiple petabytes of shuffle running everyday. It also has optimizations around handling for empty partitions to avoid scenarios where commit metadata size overhead was higher than actual shuffle.

We would prefer to continue contributing our implementation. I will put up the new PR next week. Please be rest assured that we are very interested in contributing back our implementation to the community

With respect to the improvements/optimizations you mentioned, would be great if you could contribute them on top of our implementation. Let me try to reach out to you on slack/email to better understand these. Please feel free to comment about the proposed improvements on the design doc as well

Thanks for checking with me and your patience here.

@gauravkm
Copy link
Author

gauravkm commented Mar 28, 2025

@gaoyajun02 Could you please drop me an email to gauravkm (at gmail.com) and I am happy to coordinate a 1:1 discussion if you are interested. I am unable to find you on the Celelborn slack or find an email from your github profile.

@gaoyajun02
Copy link

@gaoyajun02 Could you please drop me an email to gauravkm (at gmail.com) and I am happy to coordinate a 1:1 discussion if you are interested. I am unable to find you on the Celelborn slack or find an email from your github profile.

Great, I'm really looking forward to seeing the new implementation. If you have plans for next week, I can provide feedback and discussion on your CIP document or new PR in combination with our implementation. @gauravkm

@gauravkm
Copy link
Author

gauravkm commented Apr 2, 2025

Thanks!
Working on it this week. Rebased to begin with but having some trouble pushing. Will figure this out tomorrow.

Maybe I should update the design doc as a first step so that you can take a look at the updated design

@gauravkm
Copy link
Author

gauravkm commented Apr 2, 2025

Updated the design here - https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0#heading=h.spmltvz1rcvt

@gauravkm
Copy link
Author

gauravkm commented Apr 4, 2025

@gaoyajun02 Not sure if you have had a chance to look at the design as yet. Please do take a look and feel free to leave comments.

The PR is still a work in progress. I could not find enough time to get into a reviewable state today and I am OOO tomorrow

I will make progress on putting up the PR next week. Thanks for your continued patience!

@gaoyajun02
Copy link

gaoyajun02 commented Apr 7, 2025

I've reviewed the design document, but I'm still not entirely clear about some details. I'd like to see your PR earlier - could you have it ready today or tomorrow? @gauravkm

@gauravkm
Copy link
Author

gauravkm commented Apr 7, 2025

Thanks for reviewing @gaoyajun02!

I am in PST timezone. I won't be able to get it to a polished state today, but I will get it into the state by tomorrow where you can get the details you are looking for in code. I am having to apply a lot of changes in the patch by hand. So its taking time.

@gauravkm
Copy link
Author

gauravkm commented Apr 7, 2025

@gaoyajun02 Here you go - https://github.com/apache/celeborn/pull/3204/files

  • I accidentally opened up this PR in the main repo. For now i have closed it.
  • Still requires a ton of cleanup and some test related fixes.
  • But for the purposes of having better understanding the proposed solution this should be good enough
  • I am going to continue to iterate on the changes in the PR to get them into a more polished state and then put it up for official review

@gauravkm
Copy link
Author

gauravkm commented Apr 7, 2025

  • I will also incorporate feedback from the current PR into the new one as well. (There was some feedback around counters)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants