diff --git a/scheduler/admin-trains/manifest-manager.md b/scheduler/admin-trains/manifest-manager.md
index f13c15c..e3603ad 100644
--- a/scheduler/admin-trains/manifest-manager.md
+++ b/scheduler/admin-trains/manifest-manager.md
@@ -24,6 +24,8 @@ Projects all enabled manifests into lightweight `ManifestDispatchView` records u
 
 The projection uses `AsNoTracking()` — the results are read-only snapshots used for scheduling decisions only. No unbounded child collections are loaded into memory.
 
+> **Scaling note:** LoadManifestsJunction loads all enabled manifests in a single query. For typical production deployments (up to 10K manifests), this is efficient with proper indexes. The junction cannot be paginated without breaking the reap junctions' ability to identify stale jobs across all manifests.
+
 ### CancelTimedOutJobsJunction
 
 Finds InProgress metadata that has exceeded `DefaultJobTimeout` and requests cooperative cancellation. Sets `CancellationRequested = true` in the database (picked up by `CancellationCheckProvider` at the next junction boundary) and attempts same-server instant cancellation via the `CancellationRegistry`.
@@ -80,6 +82,8 @@ For dependent manifests, `DependentPriorityBoost` is still added on top of the g
 
 Each entry is saved individually. If one fails (e.g., a serialization issue for a specific manifest), the others still get queued. Errors are logged per-manifest.
 
+The number of entries created per cycle is limited by `MaxWorkQueueEntriesPerCycle` (default: 200). When more manifests are due than the limit allows, excess manifests are deferred to the next polling cycle. This prevents a burst of DB writes from saturating low-resource database instances, particularly after extended downtime when many manifests become due simultaneously via the `FireOnceNow` misfire policy. Set to `null` for unlimited.
+
 ## Concurrency Model: Two-Layer Defense
 
 The ManifestManager uses a layered approach to prevent duplicate work queue entries. Each layer addresses a different failure mode.
diff --git a/scheduler/orphan-manifest-cleanup.md b/scheduler/orphan-manifest-cleanup.md
index d2bc98c..3c1a4ed 100644
--- a/scheduler/orphan-manifest-cleanup.md
+++ b/scheduler/orphan-manifest-cleanup.md
@@ -31,6 +31,8 @@ At startup, after seeding all configured manifests via upsert, the scheduler com
 
 If deleting an orphaned manifest would break a `DependsOnManifestId` foreign key on another manifest, that reference is set to `null` before deletion.
 
+Orphan pruning deletes manifests in batches (500 per batch) to keep SQL `IN(...)` clauses small and avoid command timeouts on large prune operations. Each batch clears FK references, then deletes in FK-safe order: WorkQueues, DeadLetters, Metadata, and finally the manifests themselves. This makes the operation resilient to restarts — each batch commits independently, so partial progress is preserved.
+
 After manifest pruning, any `ManifestGroup` with no remaining manifests is also deleted.
 
 ## Configuration
diff --git a/sdk-reference/scheduler-api/add-scheduler.md b/sdk-reference/scheduler-api/add-scheduler.md
index 0250f31..d2a1970 100644
--- a/sdk-reference/scheduler-api/add-scheduler.md
+++ b/sdk-reference/scheduler-api/add-scheduler.md
@@ -97,6 +97,7 @@ These methods are available on the `SchedulerConfigurationBuilder` passed to the
 | `MaxDispatchAttempts(int)` | maxAttempts | 5 | Max dispatch attempts before permanently failing a work queue entry. When dispatch fails, the entry is requeued for the next cycle. Set to 0 to disable requeuing (fail immediately). See [Failure Handling](/docs/scheduler/remote-execution#failure-handling) |
 | `MaxActiveJobs(int?)` | maxJobs | 10 | Max concurrent active jobs (Pending + InProgress) globally. `null` = unlimited. Per-group limits can also be set from the dashboard on each ManifestGroup |
 | `MaxQueuedJobsPerCycle(int?)` | limit | 100 | Max queued work queue entries loaded per JobDispatcher cycle. Prevents unbounded memory usage when the queue is large. `null` = unlimited. Provides headroom beyond `MaxActiveJobs` for per-group limit skipping |
+| `MaxWorkQueueEntriesPerCycle(int?)` | limit | 200 | Max work queue entries created per ManifestManager cycle. Prevents write bursts after extended downtime when many manifests become due simultaneously. Excess manifests are deferred to the next cycle. `null` = unlimited |
 | `ExcludeFromMaxActiveJobs()` | — | — | Excludes a train type from the MaxActiveJobs count |
 | `DefaultMaxRetries(int)` | maxRetries | 3 | Retry attempts before dead-lettering |
 | `DefaultRetryDelay(TimeSpan)` | delay | 5 minutes | Base delay between retries |
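
Reviewer note on the `MaxWorkQueueEntriesPerCycle` row added above: the capping-and-deferral behavior the docs describe can be sketched roughly as follows. This is a minimal, language-agnostic illustration in Python, not the scheduler's implementation. `plan_cycle` and its parameters are hypothetical names, and the real ManifestManager may select or order due manifests differently.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def plan_cycle(due: Sequence[T], max_entries: Optional[int] = 200) -> tuple[list[T], list[T]]:
    """Split due manifests into (queued this cycle, deferred to the next cycle).

    Hypothetical helper: max_entries mirrors MaxWorkQueueEntriesPerCycle,
    where None means unlimited.
    """
    if max_entries is None:
        return list(due), []
    return list(due[:max_entries]), list(due[max_entries:])


# Example: 450 manifests become due at once after extended downtime.
backlog = list(range(450))
cycles = []
while backlog:
    # Whatever is deferred becomes the backlog for the next polling cycle.
    queued, backlog = plan_cycle(backlog, max_entries=200)
    cycles.append(len(queued))
print(cycles)  # [200, 200, 50]
```

Under these assumptions, a backlog of 450 due manifests drains over three polling cycles rather than producing 450 writes in a single cycle.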