TOCTOU Race Condition in Custom Metric Initialization Causes Workflow to Fail with Panic #16106

@panicboat

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When a custom metric with realtime: true is defined (e.g. via workflowDefaults) and the controller processes two or more workflows at the same time for the first time after a restart, a TOCTOU (time-of-check to time-of-use) race condition in ensureBaseMetric causes a panic with the following message:

internal error: unexpected userdata on custom metric <metric-name>

The workflow immediately transitions to the Error phase without executing any steps.

Root cause

ensureBaseMetric has a window between createCustomMetric and inst.SetUserdata in which another goroutine can find the newly created Instrument via matchExistingMetric, call getOrCreateValue, and reach customUserData(inst, true), which panics because inst.GetUserdata() is still nil at that point.

Goroutine 1: matchExistingMetric → nil
Goroutine 1: createCustomMetric   (userdata = nil)
Goroutine 2: matchExistingMetric → inst found (userdata = nil)
Goroutine 2: getOrCreateValue → customUserData(inst, true) → PANIC
Goroutine 1: inst.SetUserdata(newUserData())
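
The interleaving above can be replayed deterministically in a toy model. This is only an illustrative sketch: the `instrument` and `registry` types and the `matchExisting`, `create`, `userData`, and `replay` names are hypothetical stand-ins, not the controller's real code. The two "goroutines" are run step by step on a single goroutine so the ordering is fixed.

```go
package main

import "fmt"

// instrument is a hypothetical stand-in for the controller's Instrument.
type instrument struct {
	name     string
	userdata any // stays nil until it is explicitly set
}

// registry is a hypothetical stand-in for the metrics registry.
type registry struct {
	instruments map[string]*instrument
}

// matchExisting mimics the lookup step; it returns nil when nothing is registered.
func (r *registry) matchExisting(name string) *instrument {
	return r.instruments[name]
}

// create mimics the creation step; the instrument becomes visible with nil userdata.
func (r *registry) create(name string) *instrument {
	inst := &instrument{name: name}
	r.instruments[name] = inst
	return inst
}

// userData mimics a strict accessor that panics when userdata is missing.
func userData(inst *instrument) map[string]float64 {
	ud, ok := inst.userdata.(map[string]float64)
	if !ok {
		panic("internal error: unexpected userdata on custom metric " + inst.name)
	}
	return ud
}

// replay executes the steps in the order listed in the issue and returns
// the recovered panic message, or "" if nothing panicked.
func replay() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	reg := &registry{instruments: map[string]*instrument{}}
	const name = "exec_duration_gauge"

	if reg.matchExisting(name) == nil { // worker 1: lookup misses
		inst1 := reg.create(name) // worker 1: publishes with nil userdata

		inst2 := reg.matchExisting(name) // worker 2: finds the instrument
		_ = userData(inst2)              // worker 2: panics here

		inst1.userdata = map[string]float64{} // worker 1: too late, never reached
	}
	return ""
}

func main() {
	fmt.Println(replay())
}
```

In this model the panic disappears as soon as the userdata is attached before the instrument becomes visible to the lookup.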

This only occurs when the metric is being initialized for the first time (i.e., after a controller restart), so it is rare in practice.
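
One possible direction for a fix, offered as an assumption and not as the project's actual patch, is to make lookup, creation, and userdata initialization a single critical section, so no other goroutine can observe a half-initialized instrument. The sketch below uses hypothetical `instrument`, `registry`, `ensure`, and `countInitialised` names and a plain `sync.Mutex`; the real controller may need a different locking scheme.

```go
package main

import (
	"fmt"
	"sync"
)

// instrument is a hypothetical stand-in for the controller's Instrument.
type instrument struct {
	name     string
	userdata any
}

// registry is a hypothetical registry whose map is guarded by a mutex.
type registry struct {
	mu          sync.Mutex
	instruments map[string]*instrument
}

// ensure returns the instrument for name. Lookup, creation, and userdata
// initialization all happen under one lock, so callers never see nil userdata.
func (r *registry) ensure(name string) *instrument {
	r.mu.Lock()
	defer r.mu.Unlock()
	if inst, ok := r.instruments[name]; ok {
		return inst
	}
	// Attach userdata before the instrument is published in the map.
	inst := &instrument{name: name, userdata: map[string]float64{}}
	r.instruments[name] = inst
	return inst
}

// countInitialised starts n workers that call ensure concurrently and
// returns how many of them observed initialized userdata.
func countInitialised(n int) int {
	reg := &registry{instruments: map[string]*instrument{}}
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		seen int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			inst := reg.ensure("exec_duration_gauge")
			if _, ok := inst.userdata.(map[string]float64); ok {
				mu.Lock()
				seen++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return seen
}

func main() {
	// Every worker should see initialized userdata.
	fmt.Println(countInitialised(64) == 64)
}
```

The trade-off is that every first-time lookup serializes on the lock. Since initialization happens once per metric after a restart, that cost is likely small, but this has not been measured against the real controller.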

Version(s)

v4.0.5

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: metrics-race-
spec:
  entrypoint: main
  metrics:
    prometheus:
      - name: exec_duration_gauge
        help: "Duration of execution"
        labels:
          - key: workflow_uid
            value: "{{ workflow.uid }}"
          - key: status
            value: "{{ workflow.status }}"
        gauge:
          realtime: true
          value: "{{ workflow.duration }}"
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command: [sh, -c]
        args: ["sleep 5"]

Logs from the workflow controller

time=2026-05-16T15:05:00.019Z level=ERROR msg="Recovered from panic" r="internal error: unexpected userdata on custom metric exec_duration_gauge" component=workflow_worker workflow=<workflow-name> namespace=<namespace> stack="goroutine 358 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:26 +0x64
github.com/argoproj/argo-workflows/v4/workflow/controller.(*wfOperationCtx).operate.func2()
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:201 +0x50
panic({0x229e2c0?, 0x400103e680?})
	/usr/local/go/src/runtime/panic.go:783 +0x120
github.com/argoproj/argo-workflows/v4/workflow/metrics.customUserData(0x4000fc8480?, 0x24?)
	/go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:73 +0xa8
github.com/argoproj/argo-workflows/v4/workflow/metrics.getOrCreateValue(0x4000e36660?, {0x40009428c0, 0x134}, {0x40008fa380, 0x9, 0x10})
	/go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:80 +0x44
github.com/argoproj/argo-workflows/v4/workflow/metrics.(*Metrics).UpsertCustomMetric(0x4000700c80, {0x2ced3b0, 0x4001581080}, 0x4000e36660, {0x4000fc9b00, 0x24}, 0x400103e350)
	/go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:188 +0x84
github.com/argoproj/argo-workflows/v4/workflow/controller.(*wfOperationCtx).computeMetrics(0x400141c840, {0x2ced3b0, 0x4001581080}, {0x4000dca120, 0x2, 0x1a?}, 0x4000f82e10, 0x4000f82e70, 0x1)
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:4122 +0x5e0
github.com/argoproj/argo-workflows/v4/workflow/controller.(*wfOperationCtx).operate(0x400141c840, {0x2ced3b0, 0x4001581080})
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:286 +0xa34
..."
time=2026-05-16T15:05:00.019Z level=INFO msg="updated phase" workflow=<workflow-name> namespace=<namespace> component=workflow_worker fromPhase="" toPhase=Error

Logs from your workflow's wait container

N/A — the workflow fails before any pod is scheduled (`fromPhase=""`)
