Skip to content

fix: resolve CI failures — gofmt and errcheck lint#245

Open
ptone wants to merge 9 commits into
mainfrom
scion/ci-main-fix
Open

fix: resolve CI failures — gofmt and errcheck lint#245
ptone wants to merge 9 commits into
mainfrom
scion/ci-main-fix

Conversation

@ptone

@ptone ptone commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

CI Failures Fixed

  1. Format Check: gofmt -l . flagged 6 files: caching_skill_resolver_test.go, skill_resolver.go, skill_uri.go, skill_handlers.go, skills.go, skill_store.go
  2. golangci-lint errcheck: 3 unchecked os.Setenv return values in pipeline_health_test.go lines 19-21

Test plan

  • gofmt -l . returns no output
  • go build ./... passes
  • go vet ./... passes
  • go test ./pkg/sciontool/telemetry/... passes

ptone and others added 9 commits June 11, 2026 12:00
…oogleCloudPlatform#403)

* Add `scion build` command for local harness-config image builds

Introduces a top-level `scion build <harness-config-name>` CLI command
that builds a container image from a Dockerfile bundled in a harness-config
directory. Supports --tag, --base-image, --push, --platform, and --dry-run
flags. After a successful build, updates the harness-config's config.yaml
image field to reference the built image.

Also fixes Dockerfile content hashing: Dockerfiles were previously excluded
from ComputeHarnessConfigRevision, so changes to them did not trigger
re-sync to the Hub. Removes Dockerfile from the skipBasenames exclusion list.

Extracts detectContainerRuntime() from pkg/hub/maintenance_executors.go
into a shared pkg/runtime/container.go so both the new build command and
the Hub executor can use it.

* Address PR review: yaml.Node config update, error handling, dedup settings

H1: Replace destructive yaml.Unmarshal/Marshal round-trip with targeted
yaml.Node edit for config.yaml image update. Preserves comments, field
order, and unknown fields.

M1: Handle all os.Stat errors on Dockerfile, not just IsNotExist.

M2: Load settings once instead of twice when both --base-image is unset
and --push is set.

L3: Remove extra blank line in maintenance_executors.go.

* Guard against empty harness-config path and non-mapping YAML nodes

Add empty-path check after FindHarnessConfigDir to prevent synthetic
harness-configs (e.g. 'generic') from resolving Dockerfile against CWD.

Verify yaml.MappingNode kind before manipulating doc.Content[0] to
handle malformed config.yaml gracefully.

---------

Co-authored-by: Scion Agent (harness-local-build-p1) <agent@scion.dev>
* fix: test-login hardening, agent CLI access, and UI nits

- M1: Distinguish store.ErrNotFound from transient DB errors in
  GetUserByEmail — return 500 for unexpected failures instead of
  silently creating duplicate users.
- L1: Add http.MaxBytesReader(4096) body size limit.
- L2: Validate email contains "@" before processing.
- L3: Track whether displayName was explicitly provided so the default
  (email) doesn't overwrite an existing user's display name.
- Add harness-config subcommands to the agentAllowed map so agents
  can manage harness configs via the CLI.

Fixes 3 (file-browser badge) and 4 (page title) were already resolved
in the current codebase.

* style: fix gofmt alignment in cli_mode.go

* ci: retry after flaky TestPipeline_LogHandlerRegistered port conflict

* test: update cli_mode tests to expect harness-config in agentAllowed

---------

Co-authored-by: Scion Agent (dev-followup-pr) <agent@scion.dev>
…GoogleCloudPlatform#401)

* feat: wire hub OTel metric recorders to Cloud Monitoring

The hub's dbmetrics and dispatchmetrics recorders were created with
NewDisabled() — all OTel instruments recorded to no-op sinks. This
wires them to a real GCP Cloud Monitoring MeterProvider during server
startup, lighting up ~20 existing instruments (pg LISTEN/NOTIFY
latency, notifications published/delivered/dropped, subscriber lag,
dispatch lifecycle, pool stats).

New package pkg/observability/hubmetrics provides NewMeterProvider()
which creates a PeriodicReader (60s) with the GCP metric exporter.
Metric groups (db-notify, db-pool, dispatch, hub-auth, hub-gcp) can
be independently disabled via env vars using OTel View Drop().

Graceful degradation: when GCPProjectID is empty or the exporter
fails, the hub logs a warning and continues with disabled recorders
— identical to the previous behavior.

* test: clean up TestIsGroupDisabled table-driven test

Use the table's 'want' field directly instead of re-deriving the
expected value in the assertion body.

* fix: add 10s timeout to hub MeterProvider shutdown

Prevents indefinite hang if the GCP metric exporter is unresponsive
during server shutdown.

* feat: replace hub in-memory counters with OTel instruments (Phase 2)

Add OTelMetricsRecorder and OTelGCPTokenMetrics that implement the
existing MetricsRecorder and new GCPTokenMetricsRecorder interfaces
using OTel instruments for Cloud Monitoring export. Both use a
dual-write pattern — OTel instruments for cloud export plus embedded
atomic structs for the /api/metrics JSON snapshot endpoint.

New metrics: scion.hub.auth.*, scion.hub.registration.count,
scion.hub.join.*, scion.hub.rotation.count, scion.hub.dispatch.*,
scion.hub.brokers.connected, scion.hub.gcp.token.*, scion.hub.gcp.iam.duration.

Closes #240

* fix: address Phase 2 review findings (M1, M2, M3)

M1: Add operation attribute to RecordDispatch OTel instruments so
dispatch failures can be broken down by operation type.

M2: Expand hub-auth metric group to cover all broker lifecycle
instruments (registration, join, rotation, brokers, dispatch) —
not just scion.hub.auth.*.

M3: Gate SetMetrics(otelMetrics) on broker auth being enabled,
preventing /api/metrics from showing an all-zeros broker block
when auth is disabled.

* feat: add pipeline health gauge, export error counter, and structured logging

Phase 3 of metrics-delivery: add observability to the agent-side telemetry
pipeline so we can confirm it's working end-to-end in production.

- Add scion.telemetry.pipeline.status gauge (Int64, value=1) that self-reports
  via the cloud exporter on a 60s ticker, confirming the pipeline is alive
- Add scion.telemetry.export.errors counter with signal and error_type
  attributes, incrementing on metric/span/log export failures
- Add classifyError() to bucket errors into timeout/auth/quota/other
- Upgrade credential logging at pipeline startup from log.Info format strings
  to structured slog.Info with credentials_file, source, project_id, provider,
  and cloud_configured fields
- Add structured slog.Warn when cloud export is not configured, including the
  env var and well-known path that were checked
- Fix slog handler in pkg/sciontool/log to render key=value attributes instead
  of silently dropping them
- Add pipeline_health_test.go with tests for health gauge lifecycle, nil-safe
  export error recording, and classifyError bucketing

* fix: shut down unused TracerProvider and LoggerProvider in initSelfMetrics

initSelfMetrics() creates providers via NewProviders() but only needs the
MeterProvider. The TracerProvider and LoggerProvider were never shut down,
leaking goroutines. Shut them down immediately after creation.

* fix(metrics-delivery): address errcheck and staticcheck CI failures

- Replace os.Setenv+cleanup with t.Setenv in hubmetrics tests
- Wrap standalone os.Unsetenv calls with error checks
- Handle Shutdown errors on TracerProvider, LoggerProvider, MeterProvider
- Check pipeline.Stop error in test cleanup
- Convert switch to tagged form (staticcheck QF1002)
- Fix MeterProvider leak on early return in startHealthGauge

---------

Co-authored-by: Scion Agent (metrics-phase1-dev) <agent@scion.dev>
Run gofmt on 6 skill-related files from PR GoogleCloudPlatform#399 and replace os.Setenv
with t.Setenv in pipeline_health_test.go from PR GoogleCloudPlatform#401 to satisfy the
errcheck linter.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new build command for container images, integrates OpenTelemetry metrics export to Cloud Monitoring for the Hub server, adds self-monitoring metrics to the telemetry pipeline, and enhances the test-login endpoint with input validation and size limits. The review feedback suggests performing atomic writes when updating config.yaml to prevent corruption, passing context.Context through the GCP token metrics recorder to align with OpenTelemetry best practices, and avoiding the use of high-cardinality broker_id as a metric attribute to prevent cardinality explosion.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread cmd/build.go
Comment on lines +165 to +167
if err := os.WriteFile(configPath, updatedData, 0644); err != nil {
return fmt.Errorf("failed to write updated config.yaml: %w", err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Writing directly to config.yaml using os.WriteFile can leave the file truncated or corrupted if the write operation is interrupted or fails (e.g., due to a full disk or process termination).

To ensure robustness and prevent file corruption, perform an atomic write by writing the updated data to a temporary file in the same directory first, and then renaming it to the target path using os.Rename.

		tmpConfigPath := configPath + ".tmp"
		if err := os.WriteFile(tmpConfigPath, updatedData, 0644); err != nil {
			return fmt.Errorf("failed to write temporary config.yaml: %w", err)
		}
		if err := os.Rename(tmpConfigPath, configPath); err != nil {
			_ = os.Remove(tmpConfigPath)
			return fmt.Errorf("failed to update config.yaml atomically: %w", err)
		}

Comment thread pkg/hub/gcp_metrics.go
Comment on lines +24 to +29
type GCPTokenMetricsRecorder interface {
RecordAccessTokenRequest(success bool, latency time.Duration)
RecordIDTokenRequest(success bool, latency time.Duration)
RecordRateLimitRejection()
GetSnapshot() *GCPTokenMetricsSnapshot
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To align with OpenTelemetry best practices, the GCPTokenMetricsRecorder interface methods should accept a context.Context as their first parameter.

Passing the active context allows the OpenTelemetry SDK to associate measurements with the active span/trace (enabling exemplars) and propagates baggage or other context-scoped attributes.

Suggested change
type GCPTokenMetricsRecorder interface {
RecordAccessTokenRequest(success bool, latency time.Duration)
RecordIDTokenRequest(success bool, latency time.Duration)
RecordRateLimitRejection()
GetSnapshot() *GCPTokenMetricsSnapshot
}
type GCPTokenMetricsRecorder interface {
RecordAccessTokenRequest(ctx context.Context, success bool, latency time.Duration)
RecordIDTokenRequest(ctx context.Context, success bool, latency time.Duration)
RecordRateLimitRejection(ctx context.Context)
GetSnapshot() *GCPTokenMetricsSnapshot
}

Comment on lines +94 to +121
func (r *OTelGCPTokenMetrics) RecordAccessTokenRequest(success bool, latency time.Duration) {
ctx := context.Background()
r.accessRequests.Add(ctx, 1)
if success {
r.accessSuccesses.Add(ctx, 1)
} else {
r.accessFailures.Add(ctx, 1)
}
r.iamDuration.Record(ctx, float64(latency.Milliseconds()))
r.snap.RecordAccessTokenRequest(success, latency)
}

func (r *OTelGCPTokenMetrics) RecordIDTokenRequest(success bool, latency time.Duration) {
ctx := context.Background()
r.idRequests.Add(ctx, 1)
if success {
r.idSuccesses.Add(ctx, 1)
} else {
r.idFailures.Add(ctx, 1)
}
r.iamDuration.Record(ctx, float64(latency.Milliseconds()))
r.snap.RecordIDTokenRequest(success, latency)
}

func (r *OTelGCPTokenMetrics) RecordRateLimitRejection() {
r.rateLimitRejects.Add(context.Background(), 1)
r.snap.RecordRateLimitRejection()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the OTelGCPTokenMetrics recorder implementation to accept and propagate the active context.Context instead of using context.Background(). This enables proper trace-metric correlation (exemplars) and context propagation.

func (r *OTelGCPTokenMetrics) RecordAccessTokenRequest(ctx context.Context, success bool, latency time.Duration) {
	r.accessRequests.Add(ctx, 1)
	if success {
		r.accessSuccesses.Add(ctx, 1)
	} else {
		r.accessFailures.Add(ctx, 1)
	}
	r.iamDuration.Record(ctx, float64(latency.Milliseconds()))
	r.snap.RecordAccessTokenRequest(success, latency)
}

func (r *OTelGCPTokenMetrics) RecordIDTokenRequest(ctx context.Context, success bool, latency time.Duration) {
	r.idRequests.Add(ctx, 1)
	if success {
		r.idSuccesses.Add(ctx, 1)
	} else {
		r.idFailures.Add(ctx, 1)
	}
	r.iamDuration.Record(ctx, float64(latency.Milliseconds()))
	r.snap.RecordIDTokenRequest(success, latency)
}

func (r *OTelGCPTokenMetrics) RecordRateLimitRejection(ctx context.Context) {
	r.rateLimitRejects.Add(ctx, 1)
	r.snap.RecordRateLimitRejection()
}

Comment thread pkg/hub/otel_metrics.go

func (r *OTelMetricsRecorder) RecordAuthAttempt(brokerID string, success bool, latency time.Duration) {
ctx := context.Background()
attrs := metric.WithAttributes(attribute.String("broker_id", brokerID))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using broker_id (which is a high-cardinality unique identifier like a UUID) as a metric attribute/dimension can lead to a cardinality explosion in the metrics backend (e.g., Google Cloud Monitoring). This can significantly increase ingestion costs, degrade query performance, or cause metrics to be dropped due to backend limits.

Consider whether tracking individual broker_ids is strictly necessary at the metrics layer. If individual broker tracking is required, it is generally better suited for logs or distributed traces. For metrics, consider aggregating by a lower-cardinality attribute (such as broker version, region, or status) or removing the broker_id dimension entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant