Skip to content

feat: wire hub OTel metrics + pipeline health observability (Phases 1-3)#239

Open
ptone wants to merge 8 commits into
mainfrom
scion/metrics-delivery
Open

feat: wire hub OTel metrics + pipeline health observability (Phases 1-3)#239
ptone wants to merge 8 commits into
mainfrom
scion/metrics-delivery

Conversation

@ptone

@ptone ptone commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

Phase 1 — Wire Hub OTel Recorders

  • Create hubmetrics.NewMeterProvider() with GCP Cloud Monitoring exporter
  • Wire dbmetrics and dispatchmetrics recorders in server_foreground.go
  • Add metric group toggles via env vars

Phase 2 — Bridge Hub In-Memory Counters to OTel

  • Add OTelMetricsRecorder implementing MetricsRecorder with OTel instruments + atomic snapshots
  • Add OTelGCPTokenMetrics wrapping GCP token methods with OTel instruments
  • Wire in server_foreground.go with graceful fallback

Phase 3 — Pipeline Health Observability

  • Add scion.telemetry.pipeline.status gauge that self-reports via cloud exporter every 60s
  • Add scion.telemetry.export.errors counter with signal/error_type attributes
  • Upgrade pipeline credential logging to structured slog fields
  • Fix slog handler to render key=value attributes

Closes #241

Test plan

  • Hub metrics export enabled with valid GCPProjectID
  • Graceful degradation when GCPProjectID is empty
  • OTelMetricsRecorder dual-writes to OTel + atomic snapshot
  • /api/metrics JSON endpoint backward compatible
  • Pipeline health gauge registers with cloud exporter, cancels on Stop()
  • Export error counter increments on failure with correct error_type
  • classifyError buckets timeout/auth/quota/other correctly
  • All telemetry package tests pass (53 tests)
  • Verify scion.telemetry.pipeline.status appears in Cloud Monitoring

The hub's dbmetrics and dispatchmetrics recorders were created with
NewDisabled() — all OTel instruments recorded to no-op sinks. This
wires them to a real GCP Cloud Monitoring MeterProvider during server
startup, lighting up ~20 existing instruments (pg LISTEN/NOTIFY
latency, notifications published/delivered/dropped, subscriber lag,
dispatch lifecycle, pool stats).

New package pkg/observability/hubmetrics provides NewMeterProvider()
which creates a PeriodicReader (60s) with the GCP metric exporter.
Metric groups (db-notify, db-pool, dispatch, hub-auth, hub-gcp) can
be independently disabled via env vars using OTel View Drop().

Graceful degradation: when GCPProjectID is empty or the exporter
fails, the hub logs a warning and continues with disabled recorders
— identical to the previous behavior.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates OpenTelemetry metrics export to GCP Cloud Monitoring for the hub server, introducing database and dispatch metrics recorders. It adds a new hubmetrics package to manage the MeterProvider, which supports disabling specific metric groups via environment variables. The review feedback suggests two important improvements: utilizing a context with a timeout during MeterProvider shutdown to prevent potential hangs, and refactoring the TestIsGroupDisabled test to correctly use its defined table fields instead of hardcoded checks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread cmd/server_foreground.go Outdated
if mpErr != nil {
log.Printf("WARNING: hub metrics export disabled: %v", mpErr)
} else {
defer mp.Shutdown(context.Background())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using context.Background() directly in mp.Shutdown can cause the application shutdown to hang indefinitely if the OpenTelemetry exporter is unable to flush metrics (e.g., due to network issues or GCP API latency). It is recommended to use a context with a reasonable timeout (e.g., 5 seconds) to ensure graceful degradation and prevent hanging container terminations.

				defer func() {
					shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
					defer cancel()
					if err := mp.Shutdown(shutdownCtx); err != nil {
						log.Printf("WARNING: failed to shutdown meter provider: %v", err)
					}
				}()

Comment on lines +56 to +90
func TestIsGroupDisabled(t *testing.T) {
tests := []struct {
value string
disabled bool
}{
{"", false},
{"true", false},
{"1", false},
{"false", false},
{"0", false},
}

for _, tc := range tests {
t.Run(tc.value, func(t *testing.T) {
envVar := "SCION_METRICS_TEST_GROUP"
if tc.value != "" {
os.Setenv(envVar, tc.value)
t.Cleanup(func() { os.Unsetenv(envVar) })
} else {
os.Unsetenv(envVar)
}
got := isGroupDisabled(envVar)
// Re-check directly since we test the env var we set
if tc.value == "false" || tc.value == "0" {
if !got {
t.Errorf("isGroupDisabled(%q=%q) = false, want true", envVar, tc.value)
}
} else {
if got {
t.Errorf("isGroupDisabled(%q=%q) = true, want false", envVar, tc.value)
}
}
})
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The table-driven test TestIsGroupDisabled defines a disabled field in its test cases but completely ignores it during execution. Instead, it hardcodes the expected outcomes based on tc.value == "false" || tc.value == "0". Additionally, the test cases for "false" and "0" incorrectly set disabled to false in the struct definition. Refactoring the test to use the disabled field directly makes the test cases accurate and the test logic cleaner.

func TestIsGroupDisabled(t *testing.T) {
	tests := []struct {
		value    string
		disabled bool
	}{
		{"", false},
		{"true", false},
		{"1", false},
		{"false", true},
		{"0", true},
	}

	for _, tc := range tests {
		t.Run(tc.value, func(t *testing.T) {
			envVar := "SCION_METRICS_TEST_GROUP"
			if tc.value != "" {
				os.Setenv(envVar, tc.value)
				t.Cleanup(func() { os.Unsetenv(envVar) })
			} else {
				os.Unsetenv(envVar)
			}
			got := isGroupDisabled(envVar)
			if got != tc.disabled {
				t.Errorf("isGroupDisabled(%q=%q) = %t, want %t", envVar, tc.value, got, tc.disabled)
			}
		})
	}
}

Scion Agent (metrics-phase1-dev) added 3 commits June 11, 2026 05:24
Use the table's 'want' field directly instead of re-deriving the
expected value in the assertion body.
Prevents indefinite hang if the GCP metric exporter is unresponsive
during server shutdown.
Add OTelMetricsRecorder and OTelGCPTokenMetrics that implement the
existing MetricsRecorder and new GCPTokenMetricsRecorder interfaces
using OTel instruments for Cloud Monitoring export. Both use a
dual-write pattern — OTel instruments for cloud export plus embedded
atomic structs for the /api/metrics JSON snapshot endpoint.

New metrics: scion.hub.auth.*, scion.hub.registration.count,
scion.hub.join.*, scion.hub.rotation.count, scion.hub.dispatch.*,
scion.hub.brokers.connected, scion.hub.gcp.token.*, scion.hub.gcp.iam.duration.

Closes #240
@ptone ptone changed the title feat: wire hub OTel metric recorders to Cloud Monitoring feat: wire hub OTel metrics to Cloud Monitoring (Phase 1 + Phase 2) Jun 11, 2026
Scion Agent (metrics-phase2-dev) added 2 commits June 11, 2026 05:48
M1: Add operation attribute to RecordDispatch OTel instruments so
dispatch failures can be broken down by operation type.

M2: Expand hub-auth metric group to cover all broker lifecycle
instruments (registration, join, rotation, brokers, dispatch) —
not just scion.hub.auth.*.

M3: Gate SetMetrics(otelMetrics) on broker auth being enabled,
preventing /api/metrics from showing an all-zeros broker block
when auth is disabled.
… logging

Phase 3 of metrics-delivery: add observability to the agent-side telemetry
pipeline so we can confirm it's working end-to-end in production.

- Add scion.telemetry.pipeline.status gauge (Int64, value=1) that self-reports
  via the cloud exporter on a 60s ticker, confirming the pipeline is alive
- Add scion.telemetry.export.errors counter with signal and error_type
  attributes, incrementing on metric/span/log export failures
- Add classifyError() to bucket errors into timeout/auth/quota/other
- Upgrade credential logging at pipeline startup from log.Info format strings
  to structured slog.Info with credentials_file, source, project_id, provider,
  and cloud_configured fields
- Add structured slog.Warn when cloud export is not configured, including the
  env var and well-known path that were checked
- Fix slog handler in pkg/sciontool/log to render key=value attributes instead
  of silently dropping them
- Add pipeline_health_test.go with tests for health gauge lifecycle, nil-safe
  export error recording, and classifyError bucketing
@ptone ptone changed the title feat: wire hub OTel metrics to Cloud Monitoring (Phase 1 + Phase 2) feat: wire hub OTel metrics + pipeline health observability (Phases 1-3) Jun 11, 2026
Scion Agent (metrics-phase3-dev) added 2 commits June 11, 2026 13:03
…trics

initSelfMetrics() creates providers via NewProviders() but only needs the
MeterProvider. The TracerProvider and LoggerProvider were never shut down,
leaking goroutines. Shut them down immediately after creation.
- Replace os.Setenv+cleanup with t.Setenv in hubmetrics tests
- Wrap standalone os.Unsetenv calls with error checks
- Handle Shutdown errors on TracerProvider, LoggerProvider, MeterProvider
- Check pipeline.Stop error in test cleanup
- Convert switch to tagged form (staticcheck QF1002)
- Fix MeterProvider leak on early return in startHealthGauge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

metrics-delivery Phase 3: pipeline health observability

1 participant