
fix: improve scheduler startup resilience on low-resource Postgres #43

Merged
Theauxm merged 1 commit into main from fix/startup-resilience-low-resource-postgres on Apr 1, 2026

Theauxm commented Apr 1, 2026

Summary

  • Orphan manifest pruning timed out on a 2 vCPU Postgres instance because EF Core inlined 5000+ expected external IDs as string parameters in a NOT IN (...) clause, overwhelming the query planner. The fix loads all manifest ID pairs with a lightweight SELECT, computes the orphan set in C#, and then deletes the orphans in batches of 500 by integer PK.
  • RecoverStuckJobs now uses batched ExecuteUpdateAsync instead of loading all stuck metadata into memory, matching the reap junction pattern.
  • New MaxWorkQueueEntriesPerCycle config (default: 200, null = unlimited) caps how many work queue entries CreateWorkQueueEntriesJunction creates per manifest manager cycle, preventing write bursts after extended downtime.
  • Applied the same server-side filtering pattern to PruneStaleManifestsAsync in TraxScheduler for consistency.
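
The orphan-pruning change in the first bullet can be sketched roughly as follows. This is a language-neutral illustration in Python against an in-memory SQLite database, not the C#/EF Core code from this PR. The table name `manifests`, its columns, and the helper names `prune_orphans` and `demo` are assumptions made for the example; only the batch size of 500 and the "load ID pairs, diff client-side, delete by integer PK" shape come from the summary.

```python
import sqlite3

BATCH_SIZE = 500  # batch size quoted in the summary


def prune_orphans(conn, expected_external_ids, batch_size=BATCH_SIZE):
    """Delete manifests whose external ID is not expected; return batches used."""
    expected = set(expected_external_ids)
    # Lightweight read: only the two ID columns, with no large NOT IN list.
    pairs = conn.execute("SELECT id, external_id FROM manifests").fetchall()
    # The orphan set is computed client-side instead of by the database.
    orphan_pks = [pk for pk, ext in pairs if ext not in expected]
    batches = 0
    for start in range(0, len(orphan_pks), batch_size):
        chunk = orphan_pks[start:start + batch_size]
        placeholders = ",".join("?" * len(chunk))
        # Each delete is bounded and keyed on the integer primary key.
        conn.execute(f"DELETE FROM manifests WHERE id IN ({placeholders})", chunk)
        conn.commit()
        batches += 1
    return batches


def demo():
    """Hypothetical data: 100 expected manifests plus 510 orphans."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE manifests (id INTEGER PRIMARY KEY, external_id TEXT)")
    rows = [(f"keep-{i}",) for i in range(100)] + [(f"orphan-{i}",) for i in range(510)]
    conn.executemany("INSERT INTO manifests (external_id) VALUES (?)", rows)
    batches = prune_orphans(conn, [f"keep-{i}" for i in range(100)])
    remaining = conn.execute("SELECT COUNT(*) FROM manifests").fetchone()[0]
    return batches, remaining
```

With this shape the database never receives the full expected-ID list; each statement carries at most `batch_size` integer keys, so the cost per statement stays bounded regardless of how many IDs are expected.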

Test plan

  • All 650 integration tests pass
  • All 113 unit tests pass
  • New integration tests for large expected ID sets (100 expected + 5 orphans, 50 expected + orphan with FK data)
  • New integration test for orphan count exceeding batch size (510 orphans)
  • New integration test for stuck job count exceeding batch size (510 stuck jobs)
  • New integration tests for MaxWorkQueueEntriesPerCycle (limit=3 with 10 due, null with 10 due)
  • Stress test: 5000 orphans pruned in 226 ms across 10 batches (500 expected kept)
  • Stress test: large expected set scenario (450 expected, 50 pruned)
  • Zero build warnings, CSharpier formatted
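
The counts in the batch-size tests appear chosen to cross the 500-row batch boundary: 510 rows need one full batch plus a partial batch of 10, and 5000 rows split into exactly 10 full batches. A small sketch of that arithmetic, with `batch_sizes` as a hypothetical helper name:

```python
def batch_sizes(total, batch_size=500):
    """Return the size of each batch needed to process `total` rows."""
    full, rest = divmod(total, batch_size)
    # `full` complete batches, plus one partial batch if rows remain.
    return [batch_size] * full + ([rest] if rest else [])
```

Under these assumptions, `batch_sizes(510)` gives `[500, 10]` and `batch_sizes(5000)` gives ten batches of 500, which matches the "10 batches" figure reported for the stress test.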

Orphan manifest pruning timed out on a 2 vCPU Postgres instance because
EF Core inlined 5000+ expected external IDs as NOT IN(...) string
parameters, overwhelming the query planner. The fix loads all manifest
ID pairs with a lightweight SELECT and computes the orphan set in C#,
then deletes in batches of 500 by integer PK.

Also batches RecoverStuckJobs via ExecuteUpdateAsync (same pattern as
the reap junctions) and adds MaxWorkQueueEntriesPerCycle (default 200)
to cap work queue creation per manifest manager cycle.
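
The two changes in the paragraph above can be illustrated with a minimal sketch. It is written in Python with SQLite rather than the C#/EF Core used in this PR, and the `jobs` table, the `'stuck'` and `'failed'` state values, and both function names are assumptions for the example. Only the batching idea, the default cap of 200, and the "null = unlimited" behaviour come from the PR text.

```python
import sqlite3


def recover_stuck_jobs(conn, batch_size=500):
    """Update stuck jobs in fixed-size batches; return total rows updated."""
    total = 0
    while True:
        # Set-based update of at most `batch_size` rows; nothing is loaded
        # into application memory.
        cur = conn.execute(
            "UPDATE jobs SET state = 'failed' WHERE id IN ("
            "SELECT id FROM jobs WHERE state = 'stuck' ORDER BY id LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def entries_for_cycle(due_entries, max_per_cycle=200):
    """Cap the entries created in one cycle; None means unlimited."""
    due = list(due_entries)
    return due if max_per_cycle is None else due[:max_per_cycle]


def demo():
    """Hypothetical data: 510 stuck jobs, one more than fits in a batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
    conn.executemany("INSERT INTO jobs (state) VALUES (?)", [("stuck",)] * 510)
    updated = recover_stuck_jobs(conn)
    still_stuck = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE state = 'stuck'"
    ).fetchone()[0]
    return updated, still_stuck
```

Entries beyond the cap are simply left for a later cycle, which is what spreads the write burst after extended downtime over several cycles instead of one.
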

Theauxm merged commit 8de147a into main on Apr 1, 2026
1 check passed
Theauxm deleted the fix/startup-resilience-low-resource-postgres branch on April 1, 2026, 16:26

traxsharp Bot commented Apr 1, 2026

This PR is included in version 1.24.1
