fix: improve scheduler startup resilience on low-resource Postgres by Theauxm · Pull Request #43 · TraxSharp/Trax.Scheduler

Theauxm · 2026-04-01T16:24:27Z

Summary

Orphan manifest pruning timed out on a 2 vCPU Postgres instance because EF Core inlined 5000+ expected external IDs as string parameters in a NOT IN (...) clause, overwhelming the query planner. The fix loads all manifest ID pairs with a lightweight SELECT, computes the orphan set in C#, and then deletes the orphans in batches of 500 by integer PK.
RecoverStuckJobs now uses batched ExecuteUpdateAsync instead of loading all stuck metadata into memory, matching the pattern already used by the reap junctions.
New MaxWorkQueueEntriesPerCycle config (default: 200, null = unlimited) caps how many work queue entries CreateWorkQueueEntriesJunction creates per manifest manager cycle, preventing write bursts after extended downtime.
Applied the same server-side filtering pattern to PruneStaleManifestsAsync in TraxScheduler for consistency.
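
To make the pruning approach above concrete, here is a minimal sketch of the same idea. It is written in Python against an in-memory SQLite database rather than the project's C#/EF Core/Postgres stack, so it is not the code from this PR. The table name `manifest`, its columns, and the helpers `chunked`, `prune_orphans`, and `demo` are all hypothetical names chosen for illustration. Only the overall shape (lightweight SELECT of ID pairs, orphan set computed in application code, deletes in batches of 500 by integer PK) comes from the summary.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 500  # batch size stated in the PR summary


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def prune_orphans(conn, expected_external_ids, batch_size=BATCH_SIZE):
    """Delete rows whose external_id is not expected; return the batch count."""
    expected = set(expected_external_ids)
    # Lightweight SELECT: fetch only the (primary key, external ID) pairs.
    pairs = conn.execute("SELECT id, external_id FROM manifest").fetchall()
    # Compute the orphan set in application code instead of sending a
    # NOT IN (...) clause with thousands of string parameters to the database.
    orphan_pks = [pk for pk, ext in pairs if ext not in expected]
    batches = 0
    for batch in chunked(orphan_pks, batch_size):
        # Each DELETE filters on at most `batch_size` integer primary keys.
        placeholders = ",".join("?" * len(batch))
        conn.execute(f"DELETE FROM manifest WHERE id IN ({placeholders})", batch)
        batches += 1
    conn.commit()
    return batches


def demo(expected_count, orphan_count):
    """Seed a hypothetical manifest table, prune it, report (batches, rows left)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE manifest (id INTEGER PRIMARY KEY, external_id TEXT)")
    expected = [f"keep-{i}" for i in range(expected_count)]
    orphans = [f"orphan-{i}" for i in range(orphan_count)]
    conn.executemany(
        "INSERT INTO manifest (external_id) VALUES (?)",
        [(e,) for e in expected + orphans],
    )
    batches = prune_orphans(conn, expected)
    remaining = conn.execute("SELECT COUNT(*) FROM manifest").fetchone()[0]
    conn.close()
    return batches, remaining
```

With this sketch, `demo(500, 5000)` prunes 5000 orphans in 10 batches and keeps the 500 expected rows, and `demo(100, 510)` needs 2 batches because 510 exceeds the batch size by 10. The point of the design is that every statement the database sees is small and keyed on integers, regardless of how many external IDs are expected.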

Test plan

All 650 integration tests pass
All 113 unit tests pass
New integration tests for large expected ID sets (100 expected + 5 orphans, 50 expected + orphan with FK data)
New integration test for orphan count exceeding batch size (510 orphans)
New integration test for stuck job count exceeding batch size (510 stuck jobs)
New integration tests for MaxWorkQueueEntriesPerCycle (limit=3 with 10 due, null with 10 due)
Stress test: 5000 orphans pruned in 226ms across 10 batches (500 expected kept)
Stress test: large expected set scenario (450 expected, 50 pruned)
Zero build warnings, CSharpier formatted
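
The MaxWorkQueueEntriesPerCycle cases in the test plan (limit=3 with 10 due, null with 10 due) can be illustrated with a small sketch. This is Python, not the project's C#, and `select_for_cycle` is a hypothetical helper, not a function in Trax.Scheduler. It assumes that due manifests arrive in some fixed order and that the cap simply takes the first N; which entries the real junction picks, and how the remainder is handled in later cycles, is not stated in this PR.

```python
from typing import List, Optional, Sequence

DEFAULT_MAX_PER_CYCLE = 200  # default stated in the PR summary


def select_for_cycle(
    due: Sequence[str],
    max_entries: Optional[int] = DEFAULT_MAX_PER_CYCLE,
) -> List[str]:
    """Return the due items to enqueue this cycle.

    `None` means unlimited, mirroring the "null = unlimited" setting.
    Taking the first `max_entries` items is an assumption of this sketch.
    """
    if max_entries is None:
        return list(due)
    return list(due[:max_entries])


due_manifests = [f"manifest-{i}" for i in range(10)]
capped = select_for_cycle(due_manifests, max_entries=3)
uncapped = select_for_cycle(due_manifests, max_entries=None)
```

Here `capped` holds 3 entries and `uncapped` holds all 10, matching the two integration test scenarios. Bounding the per-cycle count is what turns a single large write burst after downtime into several smaller ones.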

traxsharp · 2026-04-01T16:29:10Z

This PR is included in version 1.24.1

Theauxm mentioned this pull request Apr 1, 2026

docs: document startup resilience improvements TraxSharp/Trax.Docs#64 (merged, 2 tasks)

Theauxm merged commit 8de147a into main Apr 1, 2026
1 check passed

Theauxm deleted the fix/startup-resilience-low-resource-postgres branch April 1, 2026 16:26
