Skip to content

feat(backend): ensure a task is only executed once regardless of how many backend instances are run#6711

Open
anna-parker wants to merge 24 commits into
mainfrom
shedLock_alternative
Open

feat(backend): ensure a task is only executed once regardless of how many backend instances are run#6711
anna-parker wants to merge 24 commits into
mainfrom
shedLock_alternative

Conversation

@anna-parker

@anna-parker anna-parker commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

resolves #6704

Alternative to #6705 (shedLock seems quite easy to implement so one less dependency that we fully understand)

Background

Today every @Scheduled task runs independently on each backend replica, so the effective execution frequency scales with the replica count (e.g. with 3 replicas the cleanup task runs ~3× as often, and the CrossRef citation fetch hits the external service 3×).

Overview

This PR adds a new table called task_lock with the columns task_name, started_at and locked_util. When a task attempts to start it will try to get a lock on this table for that task, this is only possible if there is no row for task_name or locked_util is less than NOW. When it gets a lock it populates the row with task_name, started_at=NOW and locked_until=NOW + maxLockTime

If a task finishes it will delete the row if NOW is past locked_until or update lock_until with started_at + minLockTime

minLockTime and maxLockTime are set to be a factor of the desired frequency a task runs at where the frequency needs to be set as higher than the anticipated run time of a task. The task can run up to maxLockTime without fear of parallel executions, maxLockTime is by default 5*frequency. minLockTime is set to be 0.9 of the expected frequency to prevent subsequent tasks being blocked from starting due to a previous lock being delayed in its removal due to clock skew etc.

Note we need to use the DB time for the task_lock and now comparisons.

Changes in behavior

  • Now only one backend instance will run each task (was previously multiple on PPX prod) -> this is intended.
  • On previews the db will always be deleted and tasks will start on backend start up as configured (after initialDelay) -> same behavior as currently.
  • On persistent instances when the backend restarts e.g. after a rollout the task will not only get run after initialStart up but after the defined task run frequency -> this is a change in behavior, which I think is ok.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://shedlock-alternative.loculus.org

@claude claude Bot added the backend related to the loculus backend component label Jun 17, 2026
@anna-parker anna-parker marked this pull request as ready for review June 17, 2026 21:58
@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @anna-parker's task in 2m 50s —— View job


Code Review

  • Gather context
  • Read changed files
  • Analyze implementation
  • Post review feedback

The approach is clean and correct: a PostgreSQL INSERT ... ON CONFLICT DO UPDATE WHERE ... RETURNING upsert is atomic at the row level, so the mutual-exclusion guarantee holds across replicas. Good test coverage for the core service, and the lock-reset pattern in multi-call tests is sensible.

A few things worth addressing before merge (see inline comments for details):

  1. started_at IS NULL is dead code — no code path ever sets the column to NULL, so the branch in the upsert WHERE clause is never triggered. Either remove it or add a releaseLock helper.

  2. CleanUpAuxTableTask hardcodes the lock interval — the 1L hours value is duplicated from the @Scheduled annotation. If the schedule changes, the lock interval won't follow. Consider injecting it via @Value like the other tasks do.

  3. started_at should be NOT NULL in the migration — the column is always set to NOW() on write; allowing NULL only makes sense if there's a release mechanism (see point 1).

  4. Raw TRUNCATE TABLE task_lock in tests — used in both CitationEndpointsTest and UseNewerProcessingPipelineVersionTaskTest. A small helper would avoid coupling tests to the table name.

  5. StatementType.SELECT needs a comment — it's a non-obvious Exposed workaround for CTEs that contain DML; a brief inline note would help future readers.

@anna-parker anna-parker changed the title feat(backend): ensure a task is only executed once regardless of how … feat(backend): ensure a task is only executed once regardless of how many backend instances are run Jun 17, 2026
Comment thread backend/src/main/kotlin/org/loculus/backend/service/scheduler/TaskLockService.kt Outdated
Comment thread backend/src/main/resources/db/migration/V1.31__add_task_lock_table.sql Outdated
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jun 18, 2026
@anna-parker anna-parker force-pushed the shedLock_alternative branch from 6e197bb to 3909681 Compare June 19, 2026 14:37
@anna-parker

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Claude finished @anna-parker's task in 3m 46s —— View job


Code Review

  • Gather context
  • Read changed files
  • Analyze implementation
  • Post review feedback

The core mechanism is solid: the PostgreSQL INSERT ... ON CONFLICT DO UPDATE WHERE ... RETURNING upsert is atomic and correctly provides the mutual-exclusion guarantee. Good to see the previous review's schema issues (NOT NULL on started_at, dead IS NULL branch) have been addressed. A few new issues to consider before merge:


Bug: Missing try/finally in three tasks

SeqSetCrossRefCitationsTask correctly wraps its body in try/finally to guarantee releaseLock is called even when an exception is thrown. The other three tasks do not:

CleanUpAuxTableTask.kt:

if (!taskLockService.acquireLock("clean-up-aux-table")) return
val deletedCount = uploadDatabaseService.deleteAuxTableEntriesOlderThan(...) // exception here?
// ...
taskLockService.releaseLock("clean-up-aux-table") // never reached

Same pattern in CleanUpStaleSequencesInProcessingTask and UseNewerProcessingPipelineVersionTask.

If the task body throws, releaseLock is skipped and the lock is held for maxDuration = 5 × frequencyInterval. For CleanUpAuxTableTask with the default 24h interval, that's 5 days of being blocked. All three tasks should use the same try/finally pattern as SeqSetCrossRefCitationsTask:

if (!taskLockService.acquireLock("my-task")) return
try {
    // task body
} finally {
    taskLockService.releaseLock("my-task")
}

Fix this →


CleanUpAuxTableTask: System property inconsistency and misleading comment

This task reads its frequency from a JVM system property:

frequencyIntervalSeconds = TimeUnit.HOURS.toSeconds(
    System.getProperty("loculus.maintenance.clean-up-aux-table.run-every-hours", "24").toLong(),
)

This is inconsistent with the other three tasks, which all use Spring @Value injection from application.properties. Setting a JVM -D flag is a different mechanism from application.properties / env vars and isn't obvious. Consider using @Value("\${loculus.maintenance.clean-up-aux-table.run-every-hours:24}") with a property added to application.properties.

Additionally, the KDoc comment says "Runs every hour and deletes auxTable entries older than 24 hours", but with the default 24h lock interval (minLockFactor × 24h = 21.6h), the task actually does work at most once every ~21.6 hours. The @Scheduled(fixedDelay = 1, timeUnit = HOURS) controls the polling frequency, not the execution frequency. The comment should clarify this.

Fix this →


SQL injection in test helper

TaskLockServiceTest.lockDeltaSeconds interpolates taskName directly into raw SQL:

// backend/src/test/kotlin/.../TaskLockServiceTest.kt:110
"SELECT EXTRACT(EPOCH FROM (locked_until - started_at)) FROM task_lock WHERE task_name = '$taskName'"

Low risk since this is test-only code with controlled inputs, but still bad practice. Use a parameterized exec(sql, args) call instead.

Fix this →


Remaining from previous review

Two items flagged in the prior review are still open:

  • TRUNCATE TABLE task_lock in tests — still appears verbatim in CitationEndpointsTest.kt:267 and UseNewerProcessingPipelineVersionTaskTest.kt:48,54,149,160. A small test helper (or constant for the table name) would decouple tests from the raw table name.
  • explicitStatementType commentsStatementType.SELECT in acquireLock and StatementType.UPDATE in releaseLock are non-obvious Exposed quirks (a CTE starting with INSERT/DELETE requires explicitly overriding the default statement type to get a ResultSet). A brief inline comment would help future readers.

What looks good

  • The core PostgreSQL upsert atomic CTE (INSERT ... ON CONFLICT DO UPDATE WHERE ... RETURNING) is correct and will safely arbitrate between replicas.
  • The two-phase releaseLock (DELETE if past minDuration, otherwise shorten locked_until) is a neat design.
  • SeqSetCrossRefCitationsTask correctly uses try/finally — that's the pattern to follow across all tasks.
  • TaskLockServiceTest has solid coverage of the lock lifecycle, including the early-release and late-release paths.
  • EndpointTestExtension.clearDatabaseStatement now includes task_lock in the global truncation, so test isolation is handled correctly.
  • The schema migration has NOT NULL DEFAULT NOW() on both timestamp columns — correct and self-documenting.

@theosanderson

Copy link
Copy Markdown
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b95ac625fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/src/main/kotlin/org/loculus/backend/service/scheduler/TaskLockService.kt Outdated
anna-parker and others added 8 commits June 22, 2026 17:30
…TaskLockService.kt

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
## Summary

This PR replaces the repeated task-lock acquire/release boilerplate in
scheduled backend tasks with a reusable `@TaskLock` annotation. The
annotation keeps the lock name and schedule interval at the method
callsite, while `TaskLockAspect` wraps the method execution with the
existing `TaskLockService` acquire/release behavior.

The affected scheduled tasks now keep their bodies focused on task
logic: cleaning stale sequences, checking for newer preprocessing
pipeline versions, and updating SeqSet CrossRef citations. The aspect
resolves Spring property placeholders from the annotation, converts
non-second intervals when needed, skips the task body if the lock is
already held, and releases the lock in a `finally` block after
execution.

This also adds Spring Boot AOP support and locks the new AOP
dependencies in `gradle.lockfile`. A focused unit test covers the aspect
behavior for skipped execution, successful execution, exception
handling, and minute-to-second interval conversion.

## Validation

Ran with Docker-backed test infrastructure:

```bash
./gradlew test --tests org.loculus.backend.service.submission.CleanUpStaleSequencesInProcessingTaskTest --tests org.loculus.backend.service.submission.UseNewerProcessingPipelineVersionTaskTest --tests org.loculus.backend.service.scheduler.TaskLockAspectTest --console=plain
./gradlew ktlintFormat --console=plain
```

Both commands completed successfully.

🚀 Preview: Add `preview` label to enable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend related to the loculus backend component preview Triggers a deployment to argocd update_db_schema

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make scheduled tasks not scale in frequency with number of backend replicas

4 participants