feat(backend, config)!: add code to prevent race condition from ever putting the db into a weird state#6678
Open
anna-parker wants to merge 4 commits into
Open
feat(backend, config)!: add code to prevent race condition from ever putting the db into a weird state#6678anna-parker wants to merge 4 commits into
anna-parker wants to merge 4 commits into
Conversation
…putting the db into a weird state - make maxOrphanAge minutes and rename as orphanRetentionPeriod - make the polling frequency of the gc task configurable for integration testing
Member
There was a problem hiding this comment.
an alternative would be just to delete the entry before we delete the S3 file? worst case scenario one would get truly orphaned files in a crash, but one could add a job to find those if it became a real problem, which seems unlikely
Member
There was a problem hiding this comment.
(but no strong feelings, I don't know the code well)
Contributor
Author
There was a problem hiding this comment.
Im not sure I understand how we would accomplish your suggestion as the db tables dont have a link from the s3 file to the table entries - we have to read all jsonb to find the ones linked so I dont think your suggestion is possible... but please correct me if I misunderstand!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
claude helped me with this
resolves #6543 (comment)
The Exact Race Window
The current code has two separate steps with no lock between them:
validateFileIdsExist checks the DB, then the submission writes to sequence_entries later. Between those two events, GC
can delete the file. The window is roughly: time to delete all the S3 objects + DB rows in the batch.
Additional improvements
Alternatives considered
Unresolved Issues
As discussed with @corneliusroemer async there is still a very small window between when the files are validated and are uploaded to the db during a submission. When only one backend is running this can be ignored, however on PPX we have two backends if both start this job simultaneously we can have a submission occur during deletion and have one backend attempt to delete a file within this interval. Leading to the issue we are attempting to resolve still occuring.
We should probably prevent multiple garbage collection tasks from running simultaneously to prevent this from occuring. The alternative is to only delete files a certain time period after they have been marked as ready for deletion - this would prevent this race condition.
Screenshot
PR Checklist
🚀 Preview: Add
previewlabel to enable