feat(backend): S3 garbage collection#6543
Conversation
|
This PR may be related to: #3935 (Build a garbage collector for file sharing) |
|
I believe the changes here: #6323 will still be required to ensure submissions that were discarded on the website are properly deleted |
ff145ac to
1eb50e1
Compare
|
Claude finished @maverbiest's task in 2m 23s —— View job Code Review
Good implementation overall. The approach of scanning Bugs (inline comments posted):
Minor notes:
|
Looking for file references in the |
| fun task() { | ||
| // `maxOrphanAge` must be at least 1 or files produced by preprocessing may be | ||
| // garbage collected before they're attached to sequence entries | ||
| val maxOrphanAge = max(maxOrphanAge, 1) |
There was a problem hiding this comment.
this should really just be configurable to allow for testing
There was a problem hiding this comment.
Values lower than 1 will brick a deployment (GC will delete files out from under the prepro pipeline), so I do think clamping is warranted here. For testing, I'd prefer to create backdated entries that look older than they are so that the GC will pick them up
|
|
||
| fun getOrphanedFileIds(threshold: LocalDateTime): Set<FileId> { | ||
| val sql = """ | ||
| -- check for files for which an upload was requested > threshold days ago |
There was a problem hiding this comment.
Note: we're reading through the whole database for this to find referenced files.
If this becomes perf problematic or otherwise we might want to consider normalizing files with proper foreign keys etc to allow full power of SQL.
| chunk.filterNot { it in existingIds } | ||
| }, 1).toSet() | ||
|
|
||
| fun getOrphanedFileIds(threshold: LocalDateTime): Set<FileId> { |
There was a problem hiding this comment.
We ran this on staging to do some perf (added one entry in the files table to actually run). Scanning every entry is definitely not teh greatest, so in future we might need to refactor but this should work for now :-)
Details
⏺ What This Query Plan Does
The logical query: Find rows in files where the file's id does NOT appear as a fileId in any sequence entry's JSON data — i.e., find orphaned files.
---
Why It Finished in 2ms Despite Terrifying Cost Estimates
The plan executed a Merge Anti Join, which works like this:
1. Sort the left side (files)
2. Sort the right side (all fileIds extracted from JSON in sequence entries)
3. Walk both sorted lists; emit left-side rows that have no match on the right
The critical detail: the left side (files filtered by upload_requested_at < now()) returned 0 rows. Once the left side is exhausted, the right side is never executed. Every node tagged (never executed) was skipped entirely.
---
The Right Side Would Have Been Catastrophic
If any files had matched the filter, the right side would need to:
sequence_entries (309k rows)
→ jsonb_each(metadata) → explode into category/value pairs ~30M rows
→ jsonb_array_elements(fil) → explode arrays of file refs ~3B rows
sequence_entries_preprocessed_data (614k rows, joined via parallel hash join)
→ jsonb_each → jsonb_array_elements ~6B rows
Total estimated: 9.2 billion rows to sort and deduplicate
PostgreSQL has almost no statistics about what jsonb_each and jsonb_array_elements return — it assumes 100 rows per call by default, which compounds wildly across nested loops. The 2.6 trillion cost estimate is the planner's honest (if terrible) guess.
---
Key Observations
┌───────────────────────────────────────────┬─────────────────────────────────────────────────────┐
│ Thing │ What it means │
├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│ rows=0 on Seq Scan of files │ The filter found nothing — query was trivially fast │
├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│ (never executed) everywhere on right side │ Short-circuited; never touched │
├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│ Cost 2645430954 │ Planner's estimate, not actual work done │
├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│ Actual time=0.010..0.013 on top node │ 13 microseconds of real work │
├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│ Workers Launched: 0 (planned 2) │ Parallel workers weren't even started │
└───────────────────────────────────────────┴─────────────────────────────────────────────────────┘
---
The Real Risk
This query has a latent performance bomb. If the files table ever has rows matching the filter, PostgreSQL will attempt to sort and deduplicate ~9 billion JSON-extracted values. The fix would be one of:
- A materialized CTE or temp table to pre-extract fileIds with proper indexing
- An index on a normalized file_references table instead of JSON scanning
- Rewriting as NOT EXISTS (SELECT 1 FROM sequence_entries WHERE metadata @> ...) with a GIN index on the JSONB column
✻ Baked for 25s
❯ how can we insert a few rows just to trigger the right side
⏺ Let me find the files table schema and the query itself.
Searched for 6 patterns, read 2 files (ctrl+o to expand)
⎿ Loaded backend/CLAUDE.md
⎿ Loaded backend/AGENTS.md
⏺ Now I have everything. Here's the minimal insert to trigger the right side:
INSERT INTO files (id, upload_requested_at, uploader, group_id, size, multipart_completed, multipart_upload_id)
VALUES (
gen_random_uuid(),
now() - interval '1 day', -- must be < now() to pass the filter
'test-user',
1, -- any group_id that exists
NULL,
false,
NULL
);
The filter is upload_requested_at < now(), so any timestamp in the past works. The group_id needs to be a valid integer but there's no FK constraint enforced at the DB level based on the schema, so 1 should work regardless.
After inserting, re-run the EXPLAIN ANALYZE and you'll see the right side execute — and likely see it sorting/deduplicating hundreds of millions of rows (or running out of memory/time). That's the performance cliff.
To clean up afterwards:
DELETE FROM files WHERE uploader = 'test-user';
✻ Baked for 26s
❯ guess how logn it will take
⏺ Hard to say precisely without knowing actual JSON density, but here's the reasoning:
The 9.2 billion row estimate is almost certainly fiction. PostgreSQL blindly assumes jsonb_each returns 100 rows and jsonb_array_elements returns another 100 — multiplied across 309K + 614K base rows. In reality, most sequences probably have 0 or a handful of file references (file sharing looks like a newer/optional feature in this codebase).
What actually drives the time:
1. Seq scan of sequence_entries — 309K rows × 966 bytes each ≈ 300MB to read
2. Seq scan + hash join of sequence_entries_preprocessed_data — 614K rows × 783 bytes ≈ 480MB
3. JSON expansion — if most submitted_data->'files' are null/empty, jsonb_each('{}') emits 0 rows and it collapses fast
4. Sort + Unique — only expensive if expansion actually produces many rows
My guess:
- If file sharing is sparsely used (most rows have no files): 10–60 seconds, dominated by the table scans
- If sequences have ~1–5 file refs each: 2–10 minutes, sort of a few million rows
- If the 9.2B estimate were anywhere close to real: hours, and it would probably spill to disk
You can tell which world you're in after inserting the test row by watching whether the Sort node in EXPLAIN ANALYZE says Memory: Xkb (fast, fits in work_mem) or Sort Method: external merge Disk: XMB (slow, spilling).
✻ Baked for 32s
❯ there is just one file and this came back "Merge Anti Join (cost=2645430954.56..2837916997.28 rows=105 width=16) (actual time=165112.005..165115.272 rows=1 loops=1)"
" Merge Cond: (f.id = (((fil.value ->> 'fileId'::text))::uuid))"
" Buffers: shared hit=1497697 read=232824"
" I/O Timings: shared read=156340.591"
" -> Sort (cost=27.55..28.07 rows=210 width=16) (actual time=0.019..0.020 rows=1 loops=1)"
" Sort Key: f.id"
" Sort Method: quicksort Memory: 25kB"
" Buffers: shared hit=1"
" -> Seq Scan on files f (cost=0.00..19.45 rows=210 width=16) (actual time=0.015..0.016 rows=1 loops=1)"
" Filter: (upload_requested_at < now())"
" Buffers: shared hit=1"
" -> Unique (cost=2645430927.01..2691627577.01 rows=9239330000 width=16) (actual time=165111.984..165115.249 rows=0 loops=1)"
" Buffers: shared hit=1497696 read=232824"
" I/O Timings: shared read=156340.591"
" -> Sort (cost=2645430927.01..2668529252.01 rows=9239330000 width=16) (actual time=165111.983..165115.248 rows=0 loops=1)"
" Sort Key: (((fil.value ->> 'fileId'::text))::uuid)"
" Sort Method: quicksort Memory: 25kB"
" Buffers: shared hit=1497696 read=232824"
" I/O Timings: shared read=156340.591"
" -> Append (cost=0.02..326590763.34 rows=9239330000 width=16) (actual time=165111.981..165115.245 rows=0 loops=1)"
" Buffers: shared hit=1497696 read=232824"
" I/O Timings: shared read=156340.591"
" -> Nested Loop (cost=0.02..62874337.85 rows=3090840000 width=16) (actual time=36509.936..36509.938 rows=0 loops=1)"
" Buffers: shared hit=238372 read=53174"
" I/O Timings: shared read=34721.468"
" -> Nested Loop (cost=0.01..671081.85 rows=30908400 width=32) (actual time=36509.935..36509.936 rows=0 loops=1)"
" Buffers: shared hit=238372 read=53174"
" I/O Timings: shared read=34721.468"
" -> Seq Scan on sequence_entries (cost=0.00..52913.84 rows=309084 width=966) (actual time=0.002..177.216 rows=309742 loops=1)"
" Buffers: shared hit=49823"
" -> Function Scan on jsonb_each cat (cost=0.01..1.01 rows=100 width=32) (actual time=0.117..0.117 rows=0 loops=309742)"
" Buffers: shared hit=188549 read=53174"
" I/O Timings: shared read=34721.468"
" -> Memoize (cost=0.01..1.01 rows=100 width=32) (never executed)"
" Cache Key: cat.v"
" Cache Mode: binary"
" -> Function Scan on jsonb_array_elements fil (cost=0.00..1.00 rows=100 width=32) (never executed)"
" -> Nested Loop (cost=11917.49..125126475.49 rows=6148490000 width=16) (actual time=128602.042..128605.303 rows=0 loops=1)"
" Buffers: shared hit=1259324 read=179650"
" I/O Timings: shared read=121619.123"
" -> Nested Loop (cost=11917.48..1388013.24 rows=61484900 width=32) (actual time=128602.041..128605.300 rows=0 loops=1)"
" Buffers: shared hit=1259324 read=179650"
" I/O Timings: shared read=121619.123"
" -> Gather (cost=11917.47..158315.23 rows=614849 width=764) (actual time=61.485..366.351 rows=617729 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" Buffers: shared hit=133642"
" -> Parallel Hash Join (cost=10917.47..95830.33 rows=256187 width=764) (actual time=56.310..279.466 rows=205910 loops=3)"
" Hash Cond: ((sepd.accession = se.accession) AND (sepd.version = se.version))"
" Buffers: shared hit=133642"
" -> Parallel Seq Scan on sequence_entries_preprocessed_data sepd (cost=0.00..83567.87 rows=256187 width=783) (actual time=0.007..87.833 rows=205910 loops=3)"
" Buffers: shared hit=81006"
" -> Parallel Hash (cost=8985.69..8985.69 rows=128785 width=19) (actual time=55.291..55.291 rows=103247 loops=3)"
" Buckets: 524288 Batches: 1 Memory Usage: 21120kB"
" Buffers: shared hit=52490"
" -> Parallel Index Only Scan using sequence_entries_pkey on sequence_entries se (cost=0.42..8985.69 rows=128785 width=19) (actual time=0.082..20.626 rows=103247 loops=3)"
" Heap Fetches: 1924"
" Buffers: shared hit=52490"
" -> Function Scan on jsonb_each cat_1 (cost=0.01..1.01 rows=100 width=32) (actual time=0.207..0.207 rows=0 loops=617729)"
" Buffers: shared hit=1125682 read=179650"
" I/O Timings: shared read=121619.123"
" -> Memoize (cost=0.01..1.01 rows=100 width=32) (never executed)"
" Cache Key: cat_1.v"
" Cache Mode: binary"
" -> Function Scan on jsonb_array_elements fil_1 (cost=0.00..1.00 rows=100 width=32) (never executed)"
"Planning:"
" Buffers: shared hit=310"
"Planning Time: 0.403 ms"
"Execution Time: 165135.771 ms"
⏺ 165 seconds, almost entirely I/O. Here's what actually happened:
The Real Story
Zero actual file references exist in any sequence. Both jsonb_each calls returned 0 rows across ~930K invocations:
jsonb_each cat → rows=0 loops=309,742 (every sequence_entry)
jsonb_each cat_1 → rows=0 loops=617,729 (every preprocessed_data row)
Yet it still took 165 seconds because PostgreSQL had to read the entire JSONB blob for every row just to discover there's no files key. You can see it in the I/O:
┌───────────────────────────────────────────────┬──────────────────────────┬──────────┐
│ Branch │ Blocks read from disk │ I/O time │
├───────────────────────────────────────────────┼──────────────────────────┼──────────┤
│ sequence_entries jsonb_each │ 53,174 blocks (~415 MB) │ 34.7s │
├───────────────────────────────────────────────┼──────────────────────────┼──────────┤
│ sequence_entries_preprocessed_data jsonb_each │ 179,650 blocks (~1.4 GB) │ 121.6s │
├───────────────────────────────────────────────┼──────────────────────────┼──────────┤
│ Total │ ~1.8 GB │ 156.3s │
└───────────────────────────────────────────────┴──────────────────────────┴──────────┘
The data wasn't in shared_buffers — so 156 of the 165 seconds were just waiting on disk reads. CPU work was trivial (sorting 0 rows is instant).
The Structural Problem
The query has no way to skip a row without reading its JSONB. To check submitted_data->'files', it must load the whole submitted_data blob. With 309K rows at ~966 bytes each (compressed; uncompressed likely much larger), that's a guaranteed full-scan of every JSONB column every time this runs.
A targeted fix would be to add a GIN index and pre-filter before the lateral join:
CREATE INDEX ON sequence_entries USING GIN (submitted_data jsonb_path_ops);
-- then in the CTE:
FROM sequence_entries
WHERE submitted_data ? 'files' -- index hit, skips rows with no files key
LATERAL jsonb_each(...)
That would reduce the 309K jsonb_each calls to however many sequences actually have a files key — likely near zero right now, making the whole query instant.
There was a problem hiding this comment.
@corneliusroemer and I also discussed one potential race condition. If a user submits a file, waits for like a month (or however long the grace period is) and then submits data with the files at the same exact time this transaction is run the fileId will get into the list of files to delete but the submission will not be rejected.
There was a problem hiding this comment.
we dont foresee this happening but it might be good to somehow mark columns as "marked for deletion" after running this job, stopping submission of files that are marked for deletion and then rerunning this transaction again before deleting
| insertFile(old, groupId, daysAgo(10)) | ||
| insertFile(recent, groupId, daysAgo(1)) | ||
|
|
||
| val orphans = filesDatabaseService.getOrphanedFileIds(daysAgo(5)) |
There was a problem hiding this comment.
Use constants somewhere at top for "young" and "old" then reuse those instead of using magic values that only are clear from extended context by referencing defaults (that might be changed easily)
|
|
||
| @Suppress("ktlint:standard:max-line-length") | ||
| @Test | ||
| fun `GIVEN a file only referenced in archive_of_submitted_data THEN it is orphaned, but submitted_data is protected`() { |
There was a problem hiding this comment.
We should probably not delete files referenced in archive
| val sql = """ | ||
| -- check for files for which an upload was requested > threshold days ago | ||
| -- but are not referenced by a submission. For this, check the submitted_data | ||
| -- and processed_data jsonb objects (but not archive_of_submitted_data) |
There was a problem hiding this comment.
lets check the file is also not in the archive for consistency
There was a problem hiding this comment.
Alright I'll make it scan the archive as well, although I expect it will take the query take even longer unfortunately.
The more I think about it the more I become convinced we should take files out of the jsonb and add join tables from files to se and sepd, storing the files type (raw_reads, annotation, etc.) as a column on the files table. That's a pretty big refactor though so maybe something for a follow-up PR
| private fun insertSequenceEntry( | ||
| accession: String, | ||
| version: Long, | ||
| archive: SubmittedData<CompressedSequence>?, | ||
| submitted: SubmittedData<CompressedSequence>?, | ||
| ) = transaction { | ||
| SequenceEntriesTable.insert { | ||
| it[accessionColumn] = accession | ||
| it[versionColumn] = version | ||
| it[organismColumn] = DEFAULT_ORGANISM | ||
| it[submissionIdColumn] = "submission-$accession-$version" | ||
| it[submitterColumn] = "testuser" | ||
| it[groupIdColumn] = groupId | ||
| it[submittedAtTimestampColumn] = dateProvider.getCurrentDateTime() | ||
| it[archiveOfSubmittedDataColumn] = archive | ||
| it[submittedDataColumn] = submitted | ||
| } | ||
| } | ||
|
|
||
| private fun insertPreprocessedData( | ||
| accession: String, | ||
| version: Long, | ||
| pipelineVersion: Long, | ||
| processed: ProcessedData<CompressedSequence>, | ||
| ) = transaction { | ||
| SequenceEntriesPreprocessedDataTable.insert { | ||
| it[accessionColumn] = accession | ||
| it[versionColumn] = version | ||
| it[pipelineVersionColumn] = pipelineVersion | ||
| it[processedDataColumn] = processed | ||
| it[processingStatusColumn] = "PROCESSED" | ||
| it[startedProcessingAtColumn] = dateProvider.getCurrentDateTime() | ||
| } | ||
| } | ||
|
|
||
| private fun makeUnprocessedData(fileId: UUID?): SubmittedData<CompressedSequence> = SubmittedData( | ||
| metadata = emptyMap(), | ||
| unalignedNucleotideSequences = emptyMap(), | ||
| files = fileId?.let { mapOf("rawReads" to listOf(FileIdAndName(it, "raw.fastq"))) }, | ||
| ) | ||
|
|
||
| private fun makeProcessedData(fileId: UUID): ProcessedData<CompressedSequence> = ProcessedData( | ||
| metadata = emptyMap(), | ||
| unalignedNucleotideSequences = emptyMap(), | ||
| alignedNucleotideSequences = emptyMap(), | ||
| nucleotideInsertions = emptyMap(), | ||
| alignedAminoAcidSequences = emptyMap(), | ||
| aminoAcidInsertions = emptyMap(), | ||
| files = mapOf("processedOutput" to listOf(FileIdAndName(fileId, "aligned.bam"))), | ||
| ) | ||
| } |
There was a problem hiding this comment.
I think we usually have a different pattern for creating test data in the backend tests. I wonder if the existing pattern could be used instead of doing direct db operations.
If the existing test helpers don't allow customizing file id and insertion date they could be generalized to support these features.
The helpers seem quite ok except creating very minimal submissions. The data inserted is very simple, there are no sequences in there, no metadata etc. This shouldn't matter but more realistic data could catch some things we miss.
Things that are missing in tests:
- What if multiple files are referenced by a single sequence, right now it seems to be only one file per sequence per datatype
| .createNewGroup(group = DEFAULT_GROUP, jwt = jwtForDefaultUser) | ||
| .andGetGroupId() | ||
| val orphan = UUID.randomUUID() | ||
| insertFile(orphan, groupId, daysAgo(2)) |
There was a problem hiding this comment.
This is a bit magic now, might be better to call this TIME_LONGER_THAN_GC or something like this.
| assertThat(filesDatabaseService.getNonExistentFileIds(setOf(orphan)), `is`(emptySet())) | ||
| } | ||
|
|
||
| @Suppress("ktlint:standard:max-line-length") |
There was a problem hiding this comment.
Might be clearer here to explicitly set dry run to false - otherwise it's easy to miss that the config from above no longer applies - this also isolates against default changes.
| } | ||
| } | ||
|
|
||
| private fun insertPreprocessedData( |
There was a problem hiding this comment.
we have the PreparedProcessedData object I wonder if we could reuse it here instead
| assertThat(orphans.isEmpty(), `is`(true)) | ||
| } | ||
|
|
||
| private fun insertSequenceEntry( |
There was a problem hiding this comment.
we also have prepareUnprocessedSequenceEntry maybe this could be used here? just trying to avoid code duplications
| s3Service.deleteFile(fileId) | ||
| filesDatabaseService.deleteFileEntry(fileId) |
There was a problem hiding this comment.
Transactionally this might be some of the most critical code here - shoudl we really wrap it in a single try/catch? No retry, no recovery if deletion fails at step 1? Or at step 2?
Off the top of my head, for example, we might want to handle s3 deletion failure by setting a flag in our db about it. Vice versa, if s3 deletion succeeds but then db deletion doesn't what do we do? In the future deletion might fail every time as the file is no longer there? So we end in unrecoverable state?
There was a problem hiding this comment.
An actual failure mode that can happen is that we have 2 backends in production - they might happen to cross and run at the same time racing to deletion. We might want to add a lock for this, maybe a table that tracks last cleanup time and if cleanup is in progress so we don't cleanup at the same time twice.
There was a problem hiding this comment.
would be good to check this out... chatGPT tells me the s3 deletion should not raise an error if the file is missing which means we would still try to remove the file from the db:
Usually deleting a missing object from S3 does not raise an error.
DeleteObject in S3 is idempotent: if the key does not exist, S3 still returns a successful response (typically HTTP 204 No Content).
But then Im not sure why we have code to handle NoSuchKey
There was a problem hiding this comment.
Update: NoSuchKey is only thrown when trying to get a file that is missing
There was a problem hiding this comment.
Yes trying to delete a deleted file should be fine as far as I'm aware. The NoSuchKey errors happen when trying to access a deleted file, which came up when GC was deleting files generated by prepro before they were attached to files
| s3Service.deleteFile(fileId) | ||
| filesDatabaseService.deleteFileEntry(fileId) | ||
| } catch (e: Exception) { | ||
| log.warn("Failed to delete $fileId", e) |
There was a problem hiding this comment.
Thinking about it we definitely at the very least want different error messages for failing to delete file from s3 and from postgres db.
There was a problem hiding this comment.
Any reason this is under service/submission? To me this seems more service/files related.
| * Runs once daily (with an initial delay of 15 minutes) and deletes S3 objects older than | ||
| * `loculus.s3.max-orphan-age-days` that are not referenced in submitted_data or processed_data | ||
| */ | ||
| @Scheduled(initialDelay = 15, fixedDelay = 60 * 24, timeUnit = TimeUnit.MINUTES) |
There was a problem hiding this comment.
Initial delay should also be configurable and also jitter. Otherwise if 2 backends come up at the same time they'll run at the same time.
We have 2 backends in prod, so they would quite possibly either end up in deadlock or cause unnecessary high load.
| val threshold = dateProvider.getCurrentInstant() | ||
| .minus(maxOrphanAge, DateTimeUnit.DAY, DateProvider.timeZone) | ||
| .toLocalDateTime(DateProvider.timeZone) | ||
| val orphans = filesDatabaseService.getOrphanedFileIds(threshold) |
There was a problem hiding this comment.
I'm not 100% certain but this whole task looks like it has not transaction isolation at all. Considering the 2 backends racing this is probably not a good idea.
| @@ -0,0 +1 @@ | |||
| CREATE INDEX IF NOT EXISTS files_upload_requested_at_idx ON files (upload_requested_at); | |||
There was a problem hiding this comment.
The real index we'd need is knowing fileIds referenced by se and sepd - this one here doesn't harm but is not very helpful either..
There was a problem hiding this comment.
we do select based on upload_requested_at so it should help some :-)
| SequenceEntriesTable.insert { | ||
| it[accessionColumn] = "A1" | ||
| it[versionColumn] = 1 | ||
| it[organismColumn] = DEFAULT_ORGANISM | ||
| it[submissionIdColumn] = "submission-A1" | ||
| it[submitterColumn] = "testuser" | ||
| it[groupIdColumn] = groupId | ||
| it[submittedAtTimestampColumn] = dateProvider.getCurrentDateTime() | ||
| it[submittedDataColumn] = SubmittedData( | ||
| metadata = emptyMap(), | ||
| unalignedNucleotideSequences = emptyMap(), | ||
| files = mapOf("rawReads" to listOf(FileIdAndName(referenced, "raw.fastq"))), | ||
| ) |
There was a problem hiding this comment.
Isn't this essentially the same function from the other test file?
| @EndpointTest( | ||
| properties = [ | ||
| "${BackendSpringProperty.S3_ENABLED}=true", | ||
| "${BackendSpringProperty.S3_GC_DRY_RUN}=true", | ||
| "${BackendSpringProperty.S3_MAX_ORPHAN_AGE_DAYS}=1", | ||
| ], | ||
| ) | ||
| class S3GarbageCollectionTaskTest( | ||
| @Autowired val s3GarbageCollectionTask: S3GarbageCollectionTask, |
There was a problem hiding this comment.
Might be better to not configure some defaults at top level, autowire and then override for 1 of the 2 tests - instead explicitly configure in each of the 2 tests to make behavior more local.
| val threshold = dateProvider.getCurrentInstant() | ||
| .minus(maxOrphanAge, DateTimeUnit.DAY, DateProvider.timeZone) | ||
| .toLocalDateTime(DateProvider.timeZone) |
There was a problem hiding this comment.
Isn't this the same function daysAgo that's in testHelpers?
There was a problem hiding this comment.
Almost:
fun daysAgo(days: Long): LocalDateTime = Clock.System.now()
.minus(days, DateTimeUnit.DAY, DateProvider.timeZone)
.toLocalDateTime(DateProvider.timeZone)
Can probably be unified
There was a problem hiding this comment.
We might be using dateprovider for a reason - so maybe shouldn't be replaced by Clock.System.now() in tests
| "by part number.", | ||
| ) | ||
|
|
||
| "NoSuchKey" -> UnprocessableEntityException( |
There was a problem hiding this comment.
This reminds me there are no tests for non-happy-path cases like if we're trying to delete something that doesn't exist.
resolves #3935
(alternative to #6323)
This PR introduces a scheduled task in the backend that finds and deletes orphaned files on S3 older than
s3.maxOrphanAgeDays.Orphan files are defined as files that have been uploaded to S3 by users or a prepocessing pipeline that are not referenced by any sequence submission in the database. Concretely, orphan files are identified by comparing file ids from the
filestable tofilesreferenced bysequence_entries.submitted_dataandsequence_entries_preprocessed_data.processed_data. Orphan files older thans3.maxOrphanAgeDays(currently set to 7 invalues.yaml) are deleted ifs3.gcDryRunis true or logged ifs3.gcDryRunis false (which is the default).The main difference with #6323 is that this PR does not introduce a
submitted_atcolumn to thefilestable to identify orphan files. Instead, thesubmitted_dataandprocessed_datajsonb columns are scanned directly for file references. This avoids the bookkeeping that would be needed to ensure thesubmitted_atcolumn stays in sync with the actual state in the sequence_entries tables (especially regarding deletions, multiple sequence entries referencing the same S3 file etc.).Manual testing
For testing, I made the GC run every 5 minutes and set
maxOrphanAgeDaysto 0, so everything that is not referenced by a sequence is considered an orphan.PUTting files to S3
After logging in on the website as testuser, making a group with group id 2, and getting an auth token for the testuser:
Orphan files are deleted from S3
Directly after I upload a file to S3, I can stat it in the preview deployment:
5 minutes later, after I see that the GC task has run, the file is no longer there:
Orphan files are deleted from the DB
Similarly, I can follow the lifecycle of an orphan file in the database though phAdmin by portforwarding to the database pod in the preview deployment:
Files attached to submissions are protected from GC
I also uploaded a test file for
Test organism (with files)through the website. After submission, I monitored the backend logs to wait until I see a GC run. There are no logs of anything being delted, and I can also still see the file in the DB:I can also still
statthe file on S3, confirming it was protected from garbage collection both in the database and on S3:Open issues/weirdness (update: solved, see below)
When I had
initialDelayandfixedDelayboth at 5 minutes, I saw logs pop up in the backend saying >70 files are deleted by the GC even though I wasn't uploading files.I also saw the following S3 errors in the backend logs:
Both of the above were caused by GC running while a batch of sequences was in the middle of being preprocessed: GC detected EMBL flatfiles produced by the prepro pipeline as orphans because I put the orphan threshold at 0 for testing:
create_embl_file: trueby defaultNoSuchKeyerrorThis shouldn't be an issue with a more realistic orphan threshold, but it's worth keeping in mind. To address this on the preview, I put the
initialDelayto 90 minutes just so all the inital preprocessing of ingested sequences is finished before the garbage collection task starts to run. After testing, I added code to clamp maxOrphanAgeDays to a minimum value, so the GC task usesmax(maxOrphanAgeDays, 1).PR Checklist
🚀 Preview: Add
previewlabel to enable