Skip to content

feat(backend): S3 garbage collection#6543

Open
maverbiest wants to merge 40 commits into
mainfrom
s3-garbage-collection
Open

feat(backend): S3 garbage collection#6543
maverbiest wants to merge 40 commits into
mainfrom
s3-garbage-collection

Conversation

@maverbiest

@maverbiest maverbiest commented May 31, 2026

Copy link
Copy Markdown
Contributor

resolves #3935

(alternative to #6323)

This PR introduces a scheduled task in the backend that finds and deletes orphaned files on S3 older than s3.maxOrphanAgeDays.

Orphan files are defined as files that have been uploaded to S3 by users or a prepocessing pipeline that are not referenced by any sequence submission in the database. Concretely, orphan files are identified by comparing file ids from the files table to files referenced by sequence_entries.submitted_data and sequence_entries_preprocessed_data.processed_data. Orphan files older than s3.maxOrphanAgeDays (currently set to 7 in values.yaml) are deleted if s3.gcDryRun is true or logged if s3.gcDryRun is false (which is the default).

The main difference with #6323 is that this PR does not introduce a submitted_at column to the files table to identify orphan files. Instead, the submitted_data and processed_data jsonb columns are scanned directly for file references. This avoids the bookkeeping that would be needed to ensure the submitted_at column stays in sync with the actual state in the sequence_entries tables (especially regarding deletions, multiple sequence entries referencing the same S3 file etc.).

Manual testing

For testing, I made the GC run every 5 minutes and set maxOrphanAgeDays to 0, so everything that is not referenced by a sequence is considered an orphan.

PUTting files to S3

After logging in on the website as testuser, making a group with group id 2, and getting an auth token for the testuser:

RESP=$(curl -s -X POST "$BACKEND/files/request-upload?groupId=$GROUP_ID&numberFiles=1" -H "Authorization: Bearer $TOKEN")
FILE_ID=$(echo "$RESP" | jq -r '.[0].fileId')
URL=$(echo "$RESP" | jq -r '.[0].url')
echo "orphan test content" > /tmp/orphan.txt
curl -X PUT --upload-file /tmp/orphan.txt "$URL"

Orphan files are deleted from S3

Directly after I upload a file to S3, I can stat it in the preview deployment:

➜  loculus git:(s3-garbage-collection) kubectl exec -n "$NS" "$POD" -- mc stat "local/loculus-preview-private/files/c49bd8da-6661-4063-8718-2d6703a6c8f9"
Name      : c49bd8da-6661-4063-8718-2d6703a6c8f9
Date      : 2026-06-09 12:17:01 UTC
Size      : 20 B
ETag      : 415ab09be58b32f2bd922650fa08e3d4
Type      : file
Metadata  :
  Content-Type: binary/octet-stream

5 minutes later, after I see that the GC task has run, the file is no longer there:

➜  loculus git:(s3-garbage-collection) kubectl exec -n "$NS" "$POD" -- mc stat "local/loculus-preview-private/files/c49bd8da-6661-4063-8718-2d6703a6c8f9"
mc: <ERROR> Unable to stat `local/loculus-preview-private/files/c49bd8da-6661-4063-8718-2d6703a6c8f9`. Object does not exist.
command terminated with exit code 1

Orphan files are deleted from the DB

Similarly, I can follow the lifecycle of an orphan file in the database though phAdmin by portforwarding to the database pod in the preview deployment:

kubectl port-forward loculus-database-659b795cbd-7f65x --namespace prev-s3-garbage-collection 5432:5432
Screenshot 2026-06-10 at 10 49 58 Screenshot 2026-06-10 at 10 50 24 Screenshot 2026-06-10 at 10 50 36

Files attached to submissions are protected from GC

I also uploaded a test file for Test organism (with files) through the website. After submission, I monitored the backend logs to wait until I see a GC run. There are no logs of anything being delted, and I can also still see the file in the DB:

grafik

I can also still stat the file on S3, confirming it was protected from garbage collection both in the database and on S3:

➜  loculus git:(s3-garbage-collection) kubectl exec -n "$NS" minio-6c565c64bf-wdks6 -- mc stat "local/loculus-preview-private/files/d2b5debc-dca9-4279-8c5d-9bd81bd046ba"
Name      : d2b5debc-dca9-4279-8c5d-9bd81bd046ba
Date      : 2026-06-10 09:20:36 UTC
Size      : 25 B
ETag      : 671ee2e3e61346d33e76305c0862bdc8-1
Type      : file
Metadata  :
  Content-Type       : binary/octet-stream
  X-Amz-Tagging-Count: 1

Open issues/weirdness (update: solved, see below)

When I had initialDelay and fixedDelay both at 5 minutes, I saw logs pop up in the backend saying >70 files are deleted by the GC even though I wasn't uploading files.

2026-06-09 12:20:39,598 INFO [scheduling-1] [] [] - org.loculus.backend.service.submission.S3GarbageCollectionTask: Deleted 80 orphans that were not referenced by a submission after 0 days

I also saw the following S3 errors in the backend logs:

2026-06-09 12:24:33,991 ERROR [http-nio-8079-exec-3] [0d016c50-67d4-4f94-8b79-3c8c8666859e] [cchf] - org.loculus.backend.controller.ExceptionHandler: Caught unexpected exception: Unexpected S3 error: NoSuchKey
java.lang.RuntimeException: Unexpected S3 error: NoSuchKey

Both of the above were caused by GC running while a batch of sequences was in the middle of being preprocessed: GC detected EMBL flatfiles produced by the prepro pipeline as orphans because I put the orphan threshold at 0 for testing:

  • Prepro config has create_embl_file: true by default
  • prepro creates and uploads a flatfiles for a batch to S3
  • GC runs; identifies the flatfiles as orphans (they are not yet referenced by entries in processed_data), deletes them -> logs that it deleted a bunch of files
  • once the batch is processed, prepro attempts to set files to public that no longer exist -> logs the NoSuchKey error

This shouldn't be an issue with a more realistic orphan threshold, but it's worth keeping in mind. To address this on the preview, I put the initialDelay to 90 minutes just so all the inital preprocessing of ingested sequences is finished before the garbage collection task starts to run. After testing, I added code to clamp maxOrphanAgeDays to a minimum value, so the GC task usesmax(maxOrphanAgeDays, 1).

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude Bot added the backend related to the loculus backend component label May 31, 2026
@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

This PR may be related to: #3935 (Build a garbage collector for file sharing)

@anna-parker

Copy link
Copy Markdown
Contributor

I believe the changes here: #6323 will still be required to ensure submissions that were discarded on the website are properly deleted

@maverbiest maverbiest force-pushed the s3-garbage-collection branch from ff145ac to 1eb50e1 Compare June 5, 2026 13:47
@maverbiest maverbiest marked this pull request as ready for review June 8, 2026 13:18
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @maverbiest's task in 2m 23s —— View job


Code Review

  • Read and analyze all changed files
  • Review code quality, security, performance, testing, and documentation
  • Post findings

Good implementation overall. The approach of scanning unprocessed_data/processed_data directly to find orphans rather than adding a submitted_at bookkeeping column is sound. The SQL query is well-constructed and the test coverage for the orphan-detection logic is thorough. Two bugs to fix before merge:

Bugs (inline comments posted):

  1. Missing space in audit log string (S3GarbageCollectionTask.kt:58–59): "referenced by a" + "submission" concatenates to "referenced by asubmission".
  2. Incorrect deletion count in log.info (S3GarbageCollectionTask.kt:53): Reports orphans.size deleted even when some failed — should be orphans.size - deleteFailures.

Minor notes:

  • The error-handling catch block (lines 45–48) has a subtle but benign retry behaviour: if S3 deletion succeeds but deleteFileEntry throws, the DB entry survives and the next GC run will call deleteObject again — AWS S3's deleteObject is idempotent, so this self-heals. Worth a comment.
  • No index exists on files.upload_requested_at. For large files tables the WHERE upload_requested_at < ? filter will full-scan. Probably fine for a daily task at current scale, but worth a migration if the table grows.
  • The failure path (deleteFailures > 0) has no test coverage. A test that makes s3Service.deleteFile throw for one file and verifies the other orphan is still cleaned up would improve confidence in the error path.

@maverbiest

maverbiest commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

I believe the changes here: #6323 will still be required to ensure submissions that were discarded on the website are properly deleted

Looking for file references in the unprocessed_data and processed_data jsonb columns should take care of this, I think! (sorry @anna-parker not sure what state this PR was in when you commented)

@maverbiest maverbiest requested review from anna-parker and tombch June 8, 2026 13:56
@maverbiest maverbiest added update_db_schema preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Jun 9, 2026
fun task() {
// `maxOrphanAge` must be at least 1 or files produced by preprocessing may be
// garbage collected before they're attached to sequence entries
val maxOrphanAge = max(maxOrphanAge, 1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should really just be configurable to allow for testing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Values lower than 1 will brick a deployment (GC will delete files out from under the prepro pipeline), so I do think clamping is warranted here. For testing, I'd prefer to create backdated entries that look older than they are so that the GC will pick them up


fun getOrphanedFileIds(threshold: LocalDateTime): Set<FileId> {
val sql = """
-- check for files for which an upload was requested > threshold days ago

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we're reading through the whole database for this to find referenced files.

If this becomes perf problematic or otherwise we might want to consider normalizing files with proper foreign keys etc to allow full power of SQL.

chunk.filterNot { it in existingIds }
}, 1).toSet()

fun getOrphanedFileIds(threshold: LocalDateTime): Set<FileId> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ran this on staging to do some perf (added one entry in the files table to actually run). Scanning every entry is definitely not teh greatest, so in future we might need to refactor but this should work for now :-)

Details

⏺ What This Query Plan Does

  The logical query: Find rows in files where the file's id does NOT appear as a fileId in any sequence entry's JSON data — i.e., find orphaned files.

  ---
  Why It Finished in 2ms Despite Terrifying Cost Estimates

  The plan executed a Merge Anti Join, which works like this:

  1. Sort the left side (files)
  2. Sort the right side (all fileIds extracted from JSON in sequence entries)
  3. Walk both sorted lists; emit left-side rows that have no match on the right

  The critical detail: the left side (files filtered by upload_requested_at < now()) returned 0 rows. Once the left side is exhausted, the right side is never executed. Every node tagged (never executed) was skipped entirely.

  ---
  The Right Side Would Have Been Catastrophic
  
  If any files had matched the filter, the right side would need to:

  sequence_entries (309k rows)
    → jsonb_each(metadata) → explode into category/value pairs      ~30M rows
      → jsonb_array_elements(fil) → explode arrays of file refs     ~3B rows

  sequence_entries_preprocessed_data (614k rows, joined via parallel hash join)
    → jsonb_each → jsonb_array_elements                             ~6B rows

  Total estimated: 9.2 billion rows to sort and deduplicate

  PostgreSQL has almost no statistics about what jsonb_each and jsonb_array_elements return — it assumes 100 rows per call by default, which compounds wildly across nested loops. The 2.6 trillion cost estimate is the planner's honest (if terrible) guess.

  ---
  Key Observations

  ┌───────────────────────────────────────────┬─────────────────────────────────────────────────────┐
  │                   Thing                   │                    What it means                    │
  ├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
  │ rows=0 on Seq Scan of files               │ The filter found nothing — query was trivially fast │
  ├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
  │ (never executed) everywhere on right side │ Short-circuited; never touched                      │
  ├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
  │ Cost 2645430954                           │ Planner's estimate, not actual work done            │
  ├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
  │ Actual time=0.010..0.013 on top node      │ 13 microseconds of real work                        │
  ├───────────────────────────────────────────┼─────────────────────────────────────────────────────┤
  │ Workers Launched: 0 (planned 2)           │ Parallel workers weren't even started               │
  └───────────────────────────────────────────┴─────────────────────────────────────────────────────┘

  ---
  The Real Risk
  
  This query has a latent performance bomb. If the files table ever has rows matching the filter, PostgreSQL will attempt to sort and deduplicate ~9 billion JSON-extracted values. The fix would be one of:

  - A materialized CTE or temp table to pre-extract fileIds with proper indexing
  - An index on a normalized file_references table instead of JSON scanning
  - Rewriting as NOT EXISTS (SELECT 1 FROM sequence_entries WHERE metadata @> ...) with a GIN index on the JSONB column

✻ Baked for 25s

❯ how can we insert a few rows just to trigger the right side                                                                                                                                                                                                                                                                                                                                                                                                                                         

⏺ Let me find the files table schema and the query itself.

  Searched for 6 patterns, read 2 files (ctrl+o to expand)
  ⎿  Loaded backend/CLAUDE.md
  ⎿  Loaded backend/AGENTS.md

⏺ Now I have everything. Here's the minimal insert to trigger the right side:

  INSERT INTO files (id, upload_requested_at, uploader, group_id, size, multipart_completed, multipart_upload_id)
  VALUES (
      gen_random_uuid(),
      now() - interval '1 day',  -- must be < now() to pass the filter
      'test-user',
      1,                          -- any group_id that exists
      NULL,
      false,
      NULL
  );
  
  The filter is upload_requested_at < now(), so any timestamp in the past works. The group_id needs to be a valid integer but there's no FK constraint enforced at the DB level based on the schema, so 1 should work regardless.

  After inserting, re-run the EXPLAIN ANALYZE and you'll see the right side execute — and likely see it sorting/deduplicating hundreds of millions of rows (or running out of memory/time). That's the performance cliff.

  To clean up afterwards:
  
  DELETE FROM files WHERE uploader = 'test-user';

✻ Baked for 26s

❯ guess how logn it will take                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

⏺ Hard to say precisely without knowing actual JSON density, but here's the reasoning:

  The 9.2 billion row estimate is almost certainly fiction. PostgreSQL blindly assumes jsonb_each returns 100 rows and jsonb_array_elements returns another 100 — multiplied across 309K + 614K base rows. In reality, most sequences probably have 0 or a handful of file references (file sharing looks like a newer/optional feature in this codebase).

  What actually drives the time:
  
  1. Seq scan of sequence_entries — 309K rows × 966 bytes each ≈ 300MB to read
  2. Seq scan + hash join of sequence_entries_preprocessed_data — 614K rows × 783 bytes ≈ 480MB
  3. JSON expansion — if most submitted_data->'files' are null/empty, jsonb_each('{}') emits 0 rows and it collapses fast
  4. Sort + Unique — only expensive if expansion actually produces many rows

  My guess:

  - If file sharing is sparsely used (most rows have no files): 10–60 seconds, dominated by the table scans
  - If sequences have ~1–5 file refs each: 2–10 minutes, sort of a few million rows
  - If the 9.2B estimate were anywhere close to real: hours, and it would probably spill to disk

  You can tell which world you're in after inserting the test row by watching whether the Sort node in EXPLAIN ANALYZE says Memory: Xkb (fast, fits in work_mem) or Sort Method: external merge  Disk: XMB (slow, spilling).

✻ Baked for 32s

❯ there is just one file and this came back "Merge Anti Join  (cost=2645430954.56..2837916997.28 rows=105 width=16) (actual time=165112.005..165115.272 rows=1 loops=1)"                                                                                                                                                                                                                                                                                                                              
  "  Merge Cond: (f.id = (((fil.value ->> 'fileId'::text))::uuid))"                                                                                                                                                                                                                                                                                                                                                                                                                                   
  "  Buffers: shared hit=1497697 read=232824"                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  "  I/O Timings: shared read=156340.591"                                                                                                                                                                                                                                                                                                                                                                                                                                                             
  "  ->  Sort  (cost=27.55..28.07 rows=210 width=16) (actual time=0.019..0.020 rows=1 loops=1)"                                                                                                                                                                                                                                                                                                                                                                                                       
  "        Sort Key: f.id"                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
  "        Sort Method: quicksort  Memory: 25kB"                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  "        Buffers: shared hit=1"                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
  "        ->  Seq Scan on files f  (cost=0.00..19.45 rows=210 width=16) (actual time=0.015..0.016 rows=1 loops=1)"                                                                                                                                                                                                                                                                                                                                                                                   
  "              Filter: (upload_requested_at < now())"                                                                                                                                                                                                                                                                                                                                                                                                                                               
  "              Buffers: shared hit=1"                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  "  ->  Unique  (cost=2645430927.01..2691627577.01 rows=9239330000 width=16) (actual time=165111.984..165115.249 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                                    
  "        Buffers: shared hit=1497696 read=232824"                                                                                                                                                                                                                                                                                                                                                                                                                                                   
  "        I/O Timings: shared read=156340.591"                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  "        ->  Sort  (cost=2645430927.01..2668529252.01 rows=9239330000 width=16) (actual time=165111.983..165115.248 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                                
  "              Sort Key: (((fil.value ->> 'fileId'::text))::uuid)"                                                                                                                                                                                                                                                                                                                                                                                                                                  
  "              Sort Method: quicksort  Memory: 25kB"                                                                                                                                                                                                                                                                                                                                                                                                                                                
  "              Buffers: shared hit=1497696 read=232824"                                                                                                                                                                                                                                                                                                                                                                                                                                             
  "              I/O Timings: shared read=156340.591"                                                                                                                                                                                                                                                                                                                                                                                                                                                 
  "              ->  Append  (cost=0.02..326590763.34 rows=9239330000 width=16) (actual time=165111.981..165115.245 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                                  
  "                    Buffers: shared hit=1497696 read=232824"                                                                                                                                                                                                                                                                                                                                                                                                                                       
  "                    I/O Timings: shared read=156340.591"                                                                                                                                                                                                                                                                                                                                                                                                                                           
  "                    ->  Nested Loop  (cost=0.02..62874337.85 rows=3090840000 width=16) (actual time=36509.936..36509.938 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                          
  "                          Buffers: shared hit=238372 read=53174"                                                                                                                                                                                                                                                                                                                                                                                                                                   
  "                          I/O Timings: shared read=34721.468"                                                                                                                                                                                                                                                                                                                                                                                                                                      
  "                          ->  Nested Loop  (cost=0.01..671081.85 rows=30908400 width=32) (actual time=36509.935..36509.936 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                        
  "                                Buffers: shared hit=238372 read=53174"                                                                                                                                                                                                                                                                                                                                                                                                                             
  "                                I/O Timings: shared read=34721.468"                                                                                                                                                                                                                                                                                                                                                                                                                                
  "                                ->  Seq Scan on sequence_entries  (cost=0.00..52913.84 rows=309084 width=966) (actual time=0.002..177.216 rows=309742 loops=1)"                                                                                                                                                                                                                                                                                                                                    
  "                                      Buffers: shared hit=49823"                                                                                                                                                                                                                                                                                                                                                                                                                                   
  "                                ->  Function Scan on jsonb_each cat  (cost=0.01..1.01 rows=100 width=32) (actual time=0.117..0.117 rows=0 loops=309742)"                                                                                                                                                                                                                                                                                                                                           
  "                                      Buffers: shared hit=188549 read=53174"                                                                                                                                                                                                                                                                                                                                                                                                                       
  "                                      I/O Timings: shared read=34721.468"                                                                                                                                                                                                                                                                                                                                                                                                                          
  "                          ->  Memoize  (cost=0.01..1.01 rows=100 width=32) (never executed)"                                                                                                                                                                                                                                                                                                                                                                                                       
  "                                Cache Key: cat.v"                                                                                                                                                                                                                                                                                                                                                                                                                                                  
  "                                Cache Mode: binary"                                                                                                                                                                                                                                                                                                                                                                                                                                                
  "                                ->  Function Scan on jsonb_array_elements fil  (cost=0.00..1.00 rows=100 width=32) (never executed)"                                                                                                                                                                                                                                                                                                                                                               
  "                    ->  Nested Loop  (cost=11917.49..125126475.49 rows=6148490000 width=16) (actual time=128602.042..128605.303 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                   
  "                          Buffers: shared hit=1259324 read=179650"                                                                                                                                                                                                                                                                                                                                                                                                                                 
  "                          I/O Timings: shared read=121619.123"                                                                                                                                                                                                                                                                                                                                                                                                                                     
  "                          ->  Nested Loop  (cost=11917.48..1388013.24 rows=61484900 width=32) (actual time=128602.041..128605.300 rows=0 loops=1)"                                                                                                                                                                                                                                                                                                                                                 
  "                                Buffers: shared hit=1259324 read=179650"                                                                                                                                                                                                                                                                                                                                                                                                                           
  "                                I/O Timings: shared read=121619.123"                                                                                                                                                                                                                                                                                                                                                                                                                               
  "                                ->  Gather  (cost=11917.47..158315.23 rows=614849 width=764) (actual time=61.485..366.351 rows=617729 loops=1)"                                                                                                                                                                                                                                                                                                                                                    
  "                                      Workers Planned: 2"                                                                                                                                                                                                                                                                                                                                                                                                                                          
  "                                      Workers Launched: 2"                                                                                                                                                                                                                                                                                                                                                                                                                                         
  "                                      Buffers: shared hit=133642"                                                                                                                                                                                                                                                                                                                                                                                                                                  
  "                                      ->  Parallel Hash Join  (cost=10917.47..95830.33 rows=256187 width=764) (actual time=56.310..279.466 rows=205910 loops=3)"                                                                                                                                                                                                                                                                                                                                   
  "                                            Hash Cond: ((sepd.accession = se.accession) AND (sepd.version = se.version))"                                                                                                                                                                                                                                                                                                                                                                          
  "                                            Buffers: shared hit=133642"                                                                                                                                                                                                                                                                                                                                                                                                                            
  "                                            ->  Parallel Seq Scan on sequence_entries_preprocessed_data sepd  (cost=0.00..83567.87 rows=256187 width=783) (actual time=0.007..87.833 rows=205910 loops=3)"                                                                                                                                                                                                                                                                                         
  "                                                  Buffers: shared hit=81006"                                                                                                                                                                                                                                                                                                                                                                                                                       
  "                                            ->  Parallel Hash  (cost=8985.69..8985.69 rows=128785 width=19) (actual time=55.291..55.291 rows=103247 loops=3)"                                                                                                                                                                                                                                                                                                                                      
  "                                                  Buckets: 524288  Batches: 1  Memory Usage: 21120kB"                                                                                                                                                                                                                                                                                                                                                                                              
  "                                                  Buffers: shared hit=52490"                                                                                                                                                                                                                                                                                                                                                                                                                       
  "                                                  ->  Parallel Index Only Scan using sequence_entries_pkey on sequence_entries se  (cost=0.42..8985.69 rows=128785 width=19) (actual time=0.082..20.626 rows=103247 loops=3)"                                                                                                                                                                                                                                                                      
  "                                                        Heap Fetches: 1924"                                                                                                                                                                                                                                                                                                                                                                                                                        
  "                                                        Buffers: shared hit=52490"                                                                                                                                                                                                                                                                                                                                                                                                                 
  "                                ->  Function Scan on jsonb_each cat_1  (cost=0.01..1.01 rows=100 width=32) (actual time=0.207..0.207 rows=0 loops=617729)"                                                                                                                                                                                                                                                                                                                                         
  "                                      Buffers: shared hit=1125682 read=179650"                                                                                                                                                                                                                                                                                                                                                                                                                     
  "                                      I/O Timings: shared read=121619.123"                                                                                                                                                                                                                                                                                                                                                                                                                         
  "                          ->  Memoize  (cost=0.01..1.01 rows=100 width=32) (never executed)"                                                                                                                                                                                                                                                                                                                                                                                                       
  "                                Cache Key: cat_1.v"                                                                                                                                                                                                                                                                                                                                                                                                                                                
  "                                Cache Mode: binary"                                                                                                                                                                                                                                                                                                                                                                                                                                                
  "                                ->  Function Scan on jsonb_array_elements fil_1  (cost=0.00..1.00 rows=100 width=32) (never executed)"                                                                                                                                                                                                                                                                                                                                                             
  "Planning:"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  "  Buffers: shared hit=310"                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  "Planning Time: 0.403 ms"                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  "Execution Time: 165135.771 ms"                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

⏺ 165 seconds, almost entirely I/O. Here's what actually happened:

  The Real Story

  Zero actual file references exist in any sequence. Both jsonb_each calls returned 0 rows across ~930K invocations:

  jsonb_each cat   → rows=0  loops=309,742   (every sequence_entry)
  jsonb_each cat_1 → rows=0  loops=617,729   (every preprocessed_data row)

  Yet it still took 165 seconds because PostgreSQL had to read the entire JSONB blob for every row just to discover there's no files key. You can see it in the I/O:

  ┌───────────────────────────────────────────────┬──────────────────────────┬──────────┐
  │                    Branch                     │  Blocks read from disk   │ I/O time │
  ├───────────────────────────────────────────────┼──────────────────────────┼──────────┤
  │ sequence_entries jsonb_each                   │ 53,174 blocks (~415 MB)  │ 34.7s    │
  ├───────────────────────────────────────────────┼──────────────────────────┼──────────┤
  │ sequence_entries_preprocessed_data jsonb_each │ 179,650 blocks (~1.4 GB) │ 121.6s   │
  ├───────────────────────────────────────────────┼──────────────────────────┼──────────┤
  │ Total                                         │ ~1.8 GB                  │ 156.3s   │
  └───────────────────────────────────────────────┴──────────────────────────┴──────────┘
  
  The data wasn't in shared_buffers — so 156 of the 165 seconds were just waiting on disk reads. CPU work was trivial (sorting 0 rows is instant).

  The Structural Problem
  
  The query has no way to skip a row without reading its JSONB. To check submitted_data->'files', it must load the whole submitted_data blob. With 309K rows at ~966 bytes each (compressed; uncompressed likely much larger), that's a guaranteed full-scan of every JSONB column every time this runs.

  A targeted fix would be to add a GIN index and pre-filter before the lateral join:

  CREATE INDEX ON sequence_entries USING GIN (submitted_data jsonb_path_ops);

  -- then in the CTE:
  FROM sequence_entries
  WHERE submitted_data ? 'files'   -- index hit, skips rows with no files key
  LATERAL jsonb_each(...)

  That would reduce the 309K jsonb_each calls to however many sequences actually have a files key — likely near zero right now, making the whole query instant.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corneliusroemer and I also discussed one potential race condition. If a user submits a file, waits for like a month (or however long the grace period is) and then submits data with the files at the same exact time this transaction is run the fileId will get into the list of files to delete but the submission will not be rejected.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont foresee this happening but it might be good to somehow mark columns as "marked for deletion" after running this job, stopping submission of files that are marked for deletion and then rerunning this transaction again before deleting

insertFile(old, groupId, daysAgo(10))
insertFile(recent, groupId, daysAgo(1))

val orphans = filesDatabaseService.getOrphanedFileIds(daysAgo(5))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use constants somewhere at top for "young" and "old" then reuse those instead of using magic values that only are clear from extended context by referencing defaults (that might be changed easily)


@Suppress("ktlint:standard:max-line-length")
@Test
fun `GIVEN a file only referenced in archive_of_submitted_data THEN it is orphaned, but submitted_data is protected`() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably not delete files referenced in archive

val sql = """
-- check for files for which an upload was requested > threshold days ago
-- but are not referenced by a submission. For this, check the submitted_data
-- and processed_data jsonb objects (but not archive_of_submitted_data)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets check the file is also not in the archive for consistency

@maverbiest maverbiest Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I'll make it scan the archive as well, although I expect it will take the query take even longer unfortunately.

The more I think about it the more I become convinced we should take files out of the jsonb and add join tables from files to se and sepd, storing the files type (raw_reads, annotation, etc.) as a column on the files table. That's a pretty big refactor though so maybe something for a follow-up PR

Comment on lines +143 to +193
private fun insertSequenceEntry(
accession: String,
version: Long,
archive: SubmittedData<CompressedSequence>?,
submitted: SubmittedData<CompressedSequence>?,
) = transaction {
SequenceEntriesTable.insert {
it[accessionColumn] = accession
it[versionColumn] = version
it[organismColumn] = DEFAULT_ORGANISM
it[submissionIdColumn] = "submission-$accession-$version"
it[submitterColumn] = "testuser"
it[groupIdColumn] = groupId
it[submittedAtTimestampColumn] = dateProvider.getCurrentDateTime()
it[archiveOfSubmittedDataColumn] = archive
it[submittedDataColumn] = submitted
}
}

private fun insertPreprocessedData(
accession: String,
version: Long,
pipelineVersion: Long,
processed: ProcessedData<CompressedSequence>,
) = transaction {
SequenceEntriesPreprocessedDataTable.insert {
it[accessionColumn] = accession
it[versionColumn] = version
it[pipelineVersionColumn] = pipelineVersion
it[processedDataColumn] = processed
it[processingStatusColumn] = "PROCESSED"
it[startedProcessingAtColumn] = dateProvider.getCurrentDateTime()
}
}

private fun makeUnprocessedData(fileId: UUID?): SubmittedData<CompressedSequence> = SubmittedData(
metadata = emptyMap(),
unalignedNucleotideSequences = emptyMap(),
files = fileId?.let { mapOf("rawReads" to listOf(FileIdAndName(it, "raw.fastq"))) },
)

private fun makeProcessedData(fileId: UUID): ProcessedData<CompressedSequence> = ProcessedData(
metadata = emptyMap(),
unalignedNucleotideSequences = emptyMap(),
alignedNucleotideSequences = emptyMap(),
nucleotideInsertions = emptyMap(),
alignedAminoAcidSequences = emptyMap(),
aminoAcidInsertions = emptyMap(),
files = mapOf("processedOutput" to listOf(FileIdAndName(fileId, "aligned.bam"))),
)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we usually have a different pattern for creating test data in the backend tests. I wonder if the existing pattern could be used instead of doing direct db operations.

If the existing test helpers don't allow customizing file id and insertion date they could be generalized to support these features.

The helpers seem quite ok except creating very minimal submissions. The data inserted is very simple, there are no sequences in there, no metadata etc. This shouldn't matter but more realistic data could catch some things we miss.

Things that are missing in tests:

  • What if multiple files are referenced by a single sequence, right now it seems to be only one file per sequence per datatype

.createNewGroup(group = DEFAULT_GROUP, jwt = jwtForDefaultUser)
.andGetGroupId()
val orphan = UUID.randomUUID()
insertFile(orphan, groupId, daysAgo(2))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit magic now, might be better to call this TIME_LONGER_THAN_GC or something like this.

assertThat(filesDatabaseService.getNonExistentFileIds(setOf(orphan)), `is`(emptySet()))
}

@Suppress("ktlint:standard:max-line-length")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be clearer here to explicitly set dry run to false - otherwise it's easy to miss that the config from above no longer applies - this also isolates against default changes.

}
}

private fun insertPreprocessedData(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the PreparedProcessedData object I wonder if we could reuse it here instead

assertThat(orphans.isEmpty(), `is`(true))
}

private fun insertSequenceEntry(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also have prepareUnprocessedSequenceEntry maybe this could be used here? just trying to avoid code duplications

Comment on lines +59 to +60
s3Service.deleteFile(fileId)
filesDatabaseService.deleteFileEntry(fileId)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transactionally this might be some of the most critical code here - shoudl we really wrap it in a single try/catch? No retry, no recovery if deletion fails at step 1? Or at step 2?

Off the top of my head, for example, we might want to handle s3 deletion failure by setting a flag in our db about it. Vice versa, if s3 deletion succeeds but then db deletion doesn't what do we do? In the future deletion might fail every time as the file is no longer there? So we end in unrecoverable state?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An actual failure mode that can happen is that we have 2 backends in production - they might happen to cross and run at the same time racing to deletion. We might want to add a lock for this, maybe a table that tracks last cleanup time and if cleanup is in progress so we don't cleanup at the same time twice.

@anna-parker anna-parker Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to check this out... chatGPT tells me the s3 deletion should not raise an error if the file is missing which means we would still try to remove the file from the db:

Usually deleting a missing object from S3 does not raise an error.

DeleteObject in S3 is idempotent: if the key does not exist, S3 still returns a successful response (typically HTTP 204 No Content).

But then Im not sure why we have code to handle NoSuchKey

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: NoSuchKey is only thrown when trying to get a file that is missing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes trying to delete a deleted file should be fine as far as I'm aware. The NoSuchKey errors happen when trying to access a deleted file, which came up when GC was deleting files generated by prepro before they were attached to files

s3Service.deleteFile(fileId)
filesDatabaseService.deleteFileEntry(fileId)
} catch (e: Exception) {
log.warn("Failed to delete $fileId", e)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it we definitely at the very least want different error messages for failing to delete file from s3 and from postgres db.

@corneliusroemer corneliusroemer Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is under service/submission? To me this seems more service/files related.

* Runs once daily (with an initial delay of 15 minutes) and deletes S3 objects older than
* `loculus.s3.max-orphan-age-days` that are not referenced in submitted_data or processed_data
*/
@Scheduled(initialDelay = 15, fixedDelay = 60 * 24, timeUnit = TimeUnit.MINUTES)

@corneliusroemer corneliusroemer Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial delay should also be configurable and also jitter. Otherwise if 2 backends come up at the same time they'll run at the same time.

We have 2 backends in prod, so they would quite possibly either end up in deadlock or cause unnecessary high load.

val threshold = dateProvider.getCurrentInstant()
.minus(maxOrphanAge, DateTimeUnit.DAY, DateProvider.timeZone)
.toLocalDateTime(DateProvider.timeZone)
val orphans = filesDatabaseService.getOrphanedFileIds(threshold)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% certain but this whole task looks like it has not transaction isolation at all. Considering the 2 backends racing this is probably not a good idea.

@@ -0,0 +1 @@
CREATE INDEX IF NOT EXISTS files_upload_requested_at_idx ON files (upload_requested_at);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real index we'd need is knowing fileIds referenced by se and sepd - this one here doesn't harm but is not very helpful either..

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do select based on upload_requested_at so it should help some :-)

Comment on lines +71 to +83
SequenceEntriesTable.insert {
it[accessionColumn] = "A1"
it[versionColumn] = 1
it[organismColumn] = DEFAULT_ORGANISM
it[submissionIdColumn] = "submission-A1"
it[submitterColumn] = "testuser"
it[groupIdColumn] = groupId
it[submittedAtTimestampColumn] = dateProvider.getCurrentDateTime()
it[submittedDataColumn] = SubmittedData(
metadata = emptyMap(),
unalignedNucleotideSequences = emptyMap(),
files = mapOf("rawReads" to listOf(FileIdAndName(referenced, "raw.fastq"))),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this essentially the same function from the other test file?

Comment on lines +28 to +36
@EndpointTest(
properties = [
"${BackendSpringProperty.S3_ENABLED}=true",
"${BackendSpringProperty.S3_GC_DRY_RUN}=true",
"${BackendSpringProperty.S3_MAX_ORPHAN_AGE_DAYS}=1",
],
)
class S3GarbageCollectionTaskTest(
@Autowired val s3GarbageCollectionTask: S3GarbageCollectionTask,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to not configure some defaults at top level, autowire and then override for 1 of the 2 tests - instead explicitly configure in each of the 2 tests to make behavior more local.

Comment on lines +46 to +48
val threshold = dateProvider.getCurrentInstant()
.minus(maxOrphanAge, DateTimeUnit.DAY, DateProvider.timeZone)
.toLocalDateTime(DateProvider.timeZone)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same function daysAgo that's in testHelpers?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost:

fun daysAgo(days: Long): LocalDateTime = Clock.System.now()
    .minus(days, DateTimeUnit.DAY, DateProvider.timeZone)
    .toLocalDateTime(DateProvider.timeZone)

Can probably be unified

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be using dateprovider for a reason - so maybe shouldn't be replaced by Clock.System.now() in tests

"by part number.",
)

"NoSuchKey" -> UnprocessableEntityException(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me there are no tests for non-happy-path cases like if we're trying to delete something that doesn't exist.

Comment thread backend/src/main/kotlin/org/loculus/backend/service/files/FilesDatabaseService.kt Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend related to the loculus backend component update_db_schema

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Build a garbage collector for file sharing

5 participants