Skip to content

fix: address semaphore/mutex unsoundness for Initalize#16160

Open
isubasinghe wants to merge 2 commits into
argoproj:mainfrom
pipekit:fix-semaphore-init-soundness
Open

fix: address semaphore/mutex unsoundness for Initalize#16160
isubasinghe wants to merge 2 commits into
argoproj:mainfrom
pipekit:fix-semaphore-init-soundness

Conversation

@isubasinghe
Copy link
Copy Markdown
Member

@isubasinghe isubasinghe commented Jun 1, 2026

Motivation

On restart the controller rebuilds its in-memory lock map from the holders that Running workflows record in their status. The old Initialize silently dropped any holder it couldn't re-establish. A racing workflow's
tryAcquire then finds the lock absent, makes a fresh one, and acquires a lock that persisted state says is already held. For a mutex that means two workflows running at once under the same mutex.

Modifications

  • Add reacquire to the semaphore interface. It re-establishes a recorded holder ignoring the current limit, so if the limit was lowered below the live holder count, new acquirers wait for the count to drain rather than a
    holder getting dropped.
  • Initialize now returns an error, and the controller fails closed (won't start) when a holder can't be recovered at all: an undecodable lock name, or a database-backed hold with no DB session.
  • For cases that are recoverable but unresolvable right now (say a transient ConfigMap read failure), install a poisonedLock that refuses every acquire and reports a message on the waiting node's sync status. It lives in
    memory only and clears on the next restart, when Initialize re-evaluates.

Verification

go test ./workflow/sync/.... Unit tests in sync_manager_test.go cover the fatal, poison, and reacquire paths, including InitializeLoweredLimitForceRegistersAllHolders for the in-memory (ConfigMap) semaphore and
TestDatabaseInitializeLoweredLimitAfterRestart for the database-backed one (Postgres and MySQL via testcontainers): two workflows acquire under limit 2, the limit drops to 1, the controller restarts, both keep their locks,
and a new contender waits until the holders drain below the limit.

Documentation

Not needed. This is internal controller restart behavior with no user-facing API, CRD, or config change.

AI

Claude Code (Opus 4.8) was used to analyze the restart soundness, write the testcontainers test, and draft this description.

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe marked this pull request as ready for review June 2, 2026 03:31
@isubasinghe isubasinghe requested a review from a team as a code owner June 2, 2026 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant