
feat(backend, config)!: add code to prevent race condition from ever putting the db into a weird state#6678

Open
anna-parker wants to merge 4 commits into s3-garbage-collection from gc-anya

Conversation

@anna-parker anna-parker commented Jun 15, 2026

Contributor

claude helped me with this

resolves #6543 (comment)

The Exact Race Window

The current code has two separate steps with no lock between them:

  GC:         getOrphanedFileIds() ──────────────────── deleteFile(F) + deleteFileEntry(F)
                    ↑                                          ↑
  User:             │  validateFileIdsExist(F) → submit() ────┤
                    │  (passes, F still in DB)                │
                    └──────────────────────────────── F deleted, but submission committed

validateFileIdsExist checks the DB, then the submission writes to sequence_entries later. Between those two events, GC
can delete the file. The window is roughly: time to delete all the S3 objects + DB rows in the batch.
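
To make the interleaving in the diagram concrete, here is a minimal in-memory sketch in Python. It is not the backend code: the `files` set and `sequence_entries` list are hypothetical stand-ins for the real tables, and the function bodies are assumptions that only borrow the names used above.

```python
# Hypothetical in-memory stand-ins for the files table and sequence_entries.
files = {"F"}            # file IDs known to the DB
sequence_entries = []    # each entry is the list of file IDs it references


def get_orphaned_file_ids():
    """Return file IDs that no sequence entry references."""
    referenced = {fid for entry in sequence_entries for fid in entry}
    return files - referenced


def validate_file_ids_exist(file_ids):
    """Raise if any submitted file ID is missing from the DB."""
    missing = set(file_ids) - files
    if missing:
        raise ValueError(f"unknown file IDs: {sorted(missing)}")


# The interleaving from the diagram, replayed step by step.
orphans = get_orphaned_file_ids()   # GC: F is orphaned at this point
validate_file_ids_exist(["F"])      # User: passes, F is still in the DB
files -= orphans                    # GC: deletes F (S3 object + DB row)
sequence_entries.append(["F"])      # User: submission commits

# The committed submission now references a file that no longer exists.
dangling = {fid for entry in sequence_entries for fid in entry} - files
print(dangling)  # {'F'}
```

Both steps are individually correct; the problem is only that nothing orders the GC's read-then-delete against the user's validate-then-commit.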

Additional improvements

  • express maxOrphanAge in minutes and rename it to orphanRetentionPeriod
  • make the polling frequency of the GC task configurable for integration testing

Alternatives considered

  • Use a single transaction for both finding the files to delete and deleting them. This transaction would need table locks or a stricter isolation level to stop the file validation function (called when submitting) from reading a file as existing while we are deleting it, which is the same issue as above. We also tested that just reading all the jsonb on PPX takes 3 min, and this transaction time will only grow once we add actual files, so locking tables for the duration of the transaction is not feasible.

Unresolved Issues

As discussed async with @corneliusroemer, there is still a very small window during a submission between when the files are validated and when they are written to the DB. When only one backend is running this can be ignored. However, on PPX we have two backends: if both start this job simultaneously, a submission can occur during deletion and one backend can attempt to delete a file within this interval, so the issue we are trying to resolve can still occur.

We should probably prevent multiple garbage collection tasks from running simultaneously to prevent this from occurring. The alternative is to only delete files a certain time period after they have been marked as ready for deletion, which would prevent this race condition.
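
One possible reading of the second alternative (mark first, delete later) is sketched below in Python. This is an illustration under stated assumptions, not what this PR implements: `GRACE_PERIOD`, `gc_pass`, and the dict-based `files` store are hypothetical, and the sketch additionally assumes that validation rejects files already marked for deletion and that the grace period is longer than the time between validation and commit.

```python
from datetime import datetime, timedelta

# Hypothetical value; assumed to exceed the validate-to-commit time.
GRACE_PERIOD = timedelta(minutes=10)


def validate_file_ids_exist(files, file_ids):
    """Accept only file IDs that exist and are not marked for deletion."""
    return all(fid in files and files[fid] is None for fid in file_ids)


def gc_pass(files, referenced_ids, now):
    """Run one GC pass over `files` (file ID -> marked-at time, or None).

    Returns the IDs deleted in this pass.
    """
    deleted = []
    for file_id, marked_at in list(files.items()):
        if file_id in referenced_ids:
            files[file_id] = None      # referenced after all: clear the mark
        elif marked_at is None:
            files[file_id] = now       # first seen as an orphan: mark only
        elif now - marked_at >= GRACE_PERIOD:
            del files[file_id]         # orphaned for the whole grace period
            deleted.append(file_id)
    return deleted


t0 = datetime(2026, 1, 1, 12, 0)
files = {"F": None, "G": None}

user_validated_f = validate_file_ids_exist(files, ["F"])  # User: F passes validation
first_pass = gc_pass(files, set(), t0)                    # GC: marks F and G, deletes nothing
g_usable = validate_file_ids_exist(files, ["G"])          # new submissions cannot use marked G
# The user's submission referencing F commits during the grace period.
second_pass = gc_pass(files, {"F"}, t0 + timedelta(minutes=15))
```

In this replay the second pass deletes only G: F was referenced before its grace period expired, so its mark is cleared instead. The guarantee depends entirely on the two assumptions named above.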

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

…putting the db into a weird state

- make maxOrphanAge minutes and rename as orphanRetentionPeriod
- make the polling frequency of the gc task configurable for integration testing

@claude (Bot) added labels on Jun 15, 2026: backend (related to the loculus backend component), deployment (Code changes targeting the deployment infrastructure)

Member

an alternative would be just to delete the entry before we delete the S3 file? worst case scenario one would get truly orphaned files in a crash, but one could add a job to find those if it became a real problem, which seems unlikely

Member

(but no strong feelings, I don't know the code well)

@anna-parker anna-parker Jun 16, 2026

Contributor Author

I'm not sure I understand how we would accomplish your suggestion, as the DB tables don't have a link from the S3 file to the table entries. We have to read all the jsonb to find the linked ones, so I don't think your suggestion is possible... but please correct me if I misunderstand!

2 participants