Pre-requisites
What happened? What did you expect to happen?
When a custom metric with `realtime: true` is defined (e.g. via `workflowDefaults`) and two or more workflows are processed by the controller simultaneously for the first time after a controller restart, a TOCTOU race condition in `ensureBaseMetric` causes a panic:
internal error: unexpected userdata on custom metric <metric-name>
The workflow immediately transitions to Error phase without executing any steps.
Root cause
`ensureBaseMetric` leaves a window between `createCustomMetric` and `inst.SetUserdata` in which another goroutine can find the newly created Instrument via `matchExistingMetric`,
call `getOrCreateValue`, and then call `customUserData(inst, true)`, which panics because `inst.GetUserdata()` is still nil at that point.
Goroutine 1: matchExistingMetric → nil
Goroutine 1: createCustomMetric (userdata = nil)
Goroutine 2: matchExistingMetric → inst found (userdata = nil)
Goroutine 2: getOrCreateValue → customUserData(inst, true) → PANIC
Goroutine 1: inst.SetUserdata(newUserData())
This only occurs when the metric is being initialized for the first time (i.e., after a controller restart), so it is rare in practice.
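The ordering problem in the timeline above can be illustrated with a minimal, self-contained Go sketch. It is not the controller's code and not a proposed patch: `instrument`, `customData`, `registry`, `ensure`, `replayUnsafeWindow` and `countNilUserdata` are hypothetical names made up for this example. It assumes one possible remedy, namely that the instrument is only published after its userdata has been set, with creation and publication done under a single lock.

```go
package main

import (
	"fmt"
	"sync"
)

// instrument is a hypothetical stand-in for the controller's metric instrument.
type instrument struct {
	name     string
	userdata any // nil until SetUserdata is called
}

func (i *instrument) SetUserdata(d any) { i.userdata = d }
func (i *instrument) GetUserdata() any  { return i.userdata }

// customData is a hypothetical stand-in for the per-metric realtime state.
type customData struct {
	values map[string]float64
}

// registry holds instruments by name; mu guards the map and initialization.
type registry struct {
	mu          sync.Mutex
	instruments map[string]*instrument
}

func newRegistry() *registry {
	return &registry{instruments: map[string]*instrument{}}
}

// replayUnsafeWindow replays the interleaving from the timeline on a single
// goroutine: the instrument is published before its userdata is set, so a
// lookup in between observes nil. It reports whether the lookup saw nil.
func replayUnsafeWindow(name string) bool {
	r := newRegistry()
	inst := &instrument{name: name} // caller 1: create (userdata = nil)
	r.instruments[name] = inst      // caller 1: instrument becomes visible
	found := r.instruments[name]    // caller 2: lookup succeeds
	sawNil := found.GetUserdata() == nil
	inst.SetUserdata(&customData{values: map[string]float64{}}) // caller 1: too late
	return sawNil
}

// ensure returns the instrument for name, creating it if needed. Lookup,
// creation, userdata assignment and publication all happen under one lock,
// so no caller can observe an instrument whose userdata is still nil.
func (r *registry) ensure(name string) *instrument {
	r.mu.Lock()
	defer r.mu.Unlock()
	if inst, ok := r.instruments[name]; ok {
		return inst
	}
	inst := &instrument{name: name}
	inst.SetUserdata(&customData{values: map[string]float64{}})
	r.instruments[name] = inst // publish only after userdata is set
	return inst
}

// countNilUserdata calls ensure from n goroutines at once and reports how
// many of them received an instrument without userdata.
func countNilUserdata(r *registry, name string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	nilSeen := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if r.ensure(name).GetUserdata() == nil {
				mu.Lock()
				nilSeen++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return nilSeen
}

func main() {
	fmt.Println("unsafe ordering exposed nil userdata:", replayUnsafeWindow("exec_duration_gauge"))
	r := newRegistry()
	fmt.Println("callers that saw nil userdata with locked ensure:", countNilUserdata(r, "exec_duration_gauge", 64))
}
```

Running the sketch prints `true` for the unsafe ordering and `0` for the locked variant. Whether a lock, a different publication order, or tolerating nil userdata in the reader is the right fix for the real `ensureBaseMetric` depends on how the underlying instrument registry is shared, which this example does not model.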
Version(s)
v4.0.5
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: metrics-race-
spec:
  entrypoint: main
  metrics:
    prometheus:
      - name: exec_duration_gauge
        help: "Duration of execution"
        labels:
          - key: workflow_uid
            value: "{{ workflow.uid }}"
          - key: status
            value: "{{ workflow.status }}"
        gauge:
          realtime: true
          value: "{{ workflow.duration }}"
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command: [sh, -c]
        args: ["sleep 5"]
Logs from the workflow controller
time=2026-05-16T15:05:00.019Z level=ERROR msg="Recovered from panic" r="internal error: unexpected userdata on custom metric exec_duration_gauge" component=workflow_worker workflow=<workflow-name> namespace=<namespace> stack="goroutine 358 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:26 +0x64
github.com/argoproj/argo-workflows/v4/workflow/controller.(*wfOperationCtx).operate.func2()
/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:201 +0x50
panic({0x229e2c0?, 0x400103e680?})
/usr/local/go/src/runtime/panic.go:783 +0x120
github.com/argoproj/argo-workflows/v4/workflow/metrics.customUserData(0x4000fc8480?, 0x24?)
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:73 +0xa8
github.com/argoproj/argo-workflows/v4/workflow/metrics.getOrCreateValue(0x4000e36660?, {0x40009428c0, 0x134}, {0x40008fa380, 0x9, 0x10})
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:80 +0x44
github.com/argoproj/argo-workflows/v4/workflow/metrics.(*Metrics).UpsertCustomMetric(0x4000700c80, {0x2ced3b0, 0x4001581080}, 0x4000e36660, {0x4000fc9b00, 0x24}, 0x400103e350)
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:188 +0x84
github.com/argoproj/argo-workflows/v4/workflow/controller.(*wfOperationCtx).computeMetrics(0x400141c840, {0x2ced3b0, 0x4001581080}, {0x4000dca120, 0x2, 0x1a?}, 0x4000f82e10, 0x4000f82e70, 0x1)
/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:4122 +0x5e0
github.com/argoproj/argo-workflows/v4/workflow/controller.(*wfOperationCtx).operate(0x400141c840, {0x2ced3b0, 0x4001581080})
/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:286 +0xa34
..."
time=2026-05-16T15:05:00.019Z level=INFO msg="updated phase" workflow=<workflow-name> namespace=<namespace> component=workflow_worker fromPhase="" toPhase=Error
Logs from your workflow's wait container
N/A — the workflow fails before any pod is scheduled (`fromPhase=""`)