Add HPA diagnosis insights #916

Merged
nadaverell merged 5 commits into main from hpa-diagnosis-insights
Jun 14, 2026

@nadaverell nadaverell commented Jun 13, 2026
Contributor

Summary

HPAs can fail quietly: the target workload may still have healthy-looking pods while autoscaling is capped, unable to read metrics, pinned by configuration, or paused at zero replicas. This PR makes HPA diagnosis a first-class Radar insight so operators can understand autoscaling state directly from Radar instead of reconstructing it from raw HPA YAML and conditions.

The feature is deliberately conservative about where it creates noise. Broad scan surfaces only promote high-signal autoscaling failures, while detail drawers keep the richer context for states like partial metric gaps, min-bound scaling, stale status, pinned replica bounds, and stabilization windows.

What Changed

Shared HPA Diagnosis Engine

  • Added pkg/hpadiag, a shared analyzer for autoscaling/v2 HPAs.
  • The analyzer produces a structured diagnosis with normalized state, operator-facing summary, target reference, replica bounds, metric rows, and condition-backed evidence.
  • Added fixture coverage for maxed, metrics unavailable, partial metrics missing, unable to scale, disabled, pinned, scaling, stale, min-limited, stabilized, stable, and "at max without controller limit condition" cases.

Signal Policy

  • Maxed now requires controller evidence: ScalingLimited=True with TooManyReplicas.
  • An HPA merely sitting at current == desired == maxReplicas is treated as normal unless Kubernetes says it wanted more replicas and was capped.
  • ScalingActive=False is classified as metrics unavailable unless it is the intentional zero-replica ScalingDisabled case.
  • AbleToScale=False is classified separately as unable to scale.
  • Partial metric gaps, min bounds, stale observed generation, pinned min=max, disabled zero-replica targets, and scale-down stabilization remain drawer/detail context rather than dashboard issue spam.
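The high-signal rules above can be sketched as a small classifier over HPA conditions. This is a minimal illustration, not the code in pkg/hpadiag: the `Condition` struct is a local stand-in for the autoscaling/v2 condition fields, the `State` names are hypothetical, and the precedence order (AbleToScale, then ScalingActive, then ScalingLimited) is an assumption.

```go
package main

// Condition is a local stand-in for the autoscaling/v2 HPA condition fields
// that the policy refers to.
type Condition struct {
	Type   string // e.g. "ScalingLimited", "ScalingActive", "AbleToScale"
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "TooManyReplicas", "ScalingDisabled"
}

// State names are hypothetical; the identifiers in pkg/hpadiag may differ.
type State string

const (
	StateMaxed              State = "maxed"
	StateMetricsUnavailable State = "metrics_unavailable"
	StateUnableToScale      State = "unable_to_scale"
	StateDisabled           State = "disabled"
	StateStable             State = "stable"
)

// find returns the first condition of the given type, if any.
func find(conds []Condition, typ string) (Condition, bool) {
	for _, c := range conds {
		if c.Type == typ {
			return c, true
		}
	}
	return Condition{}, false
}

// Classify applies the rules in an assumed precedence order.
func Classify(conds []Condition) State {
	if c, ok := find(conds, "AbleToScale"); ok && c.Status == "False" {
		return StateUnableToScale
	}
	if c, ok := find(conds, "ScalingActive"); ok && c.Status == "False" {
		// Intentional zero-replica pause, not a metrics failure.
		if c.Reason == "ScalingDisabled" {
			return StateDisabled
		}
		return StateMetricsUnavailable
	}
	// Maxed requires controller evidence, not just current == max.
	if c, ok := find(conds, "ScalingLimited"); ok && c.Status == "True" && c.Reason == "TooManyReplicas" {
		return StateMaxed
	}
	return StateStable
}
```

Note that replica counts never enter `Classify`: an HPA sitting at `maxReplicas` with no `ScalingLimited=True` / `TooManyReplicas` condition falls through to the stable state.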

Backend Surfaces

  • Resource detail responses include optional hpaDiagnosis for HPA resources.
  • Resource context includes an HPA summary so MCP/AI workflows can reason about autoscaler state without fetching raw YAML first.
  • AI summary output uses the shared analyzer for HPA issue text.
  • Dashboard/problem detection delegates HPA state decisions to the shared analyzer instead of maintaining separate heuristics.
  • Topology resource wrapper types now carry HPA diagnosis through existing resource detail response paths.
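One way to keep `hpaDiagnosis` optional on a detail response is a pointer field with `omitempty`, so non-HPA resources serialize exactly as before. The sketch below assumes that approach; `ResourceDetail` and the trimmed `HPADiagnosis` struct are hypothetical stand-ins, and only the `hpaDiagnosis` field name comes from this PR.

```go
package main

import "encoding/json"

// HPADiagnosis is a trimmed, hypothetical shape; the real diagnosis also
// carries target reference, replica bounds, metric rows, and evidence.
type HPADiagnosis struct {
	State   string `json:"state"`
	Summary string `json:"summary"`
}

// ResourceDetail is a hypothetical stand-in for a resource detail response.
type ResourceDetail struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
	// A nil pointer is omitted from the JSON, so the field is additive.
	HPADiagnosis *HPADiagnosis `json:"hpaDiagnosis,omitempty"`
}

// Encode marshals a detail response to a JSON string.
func Encode(d ResourceDetail) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}
```

With this shape, consumers that do not know about the field see no change, and consumers that do can treat its absence as "not an HPA, or no diagnosis available".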

Frontend / UX

  • The shared HPA renderer has a Diagnosis section with summary, state badge, replica bounds, evidence rows, and normalized metric rows.
  • The HPA Metrics section prefers backend diagnosis metrics when present and falls back to existing raw status metrics otherwise.
  • HPA table status uses conservative classification so broad lists highlight only scan-worthy states.
  • Workload details show compact autoscaler context when the workload is controlled by an HPA.
  • Manual replica scaling is disabled for HPA/KEDA-controlled workloads, with a custom tooltip explaining which scaler owns replicas.
  • Workload autoscaler summaries are compact in the narrow status row and avoid mid-word wrapping; short controller refs stay inline as hpa/name, with wrapping only at the separator when space is tight.

Shared UI / Types

  • Added shared HPA diagnosis TypeScript types to @skyhook-io/k8s-ui.
  • Added resource-utils-hpa for table-state classification, label/tone mapping, and status badge generation.
  • Wired the HPA renderer through the existing renderer override path so Radar's app can inject Prometheus charts while the shared renderer stays host-agnostic.

Reviewer Focus

  • HPA state policy in pkg/hpadiag: whether each condition/state maps to the right Radar severity and surface.
  • Product signal policy: whether list/dashboard signals are conservative enough. This PR intentionally does not promote metrics_incomplete, limited_min, stale, or stabilized into table warnings.
  • Drawer UX: whether the summary plus evidence rows give enough context to act without making raw Kubernetes condition text the primary interface.
  • API shape: whether optional hpaDiagnosis on resource detail responses is the right contract for Radar app consumers.
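For reviewers weighing the signal policy, the table-versus-drawer split can be pictured as an allow-list. The sketch below is illustrative only: `metrics_incomplete`, `limited_min`, `stale`, and `stabilized` are the state names used in this PR, while the surfaced state names, the `Surface` type, and the `Visible` function are assumptions, and the shipped classification lives in the frontend's resource-utils-hpa rather than in Go.

```go
package main

// Surface identifies where a diagnosis state might be shown.
type Surface int

const (
	Table  Surface = iota // broad list view: scan-worthy states only
	Drawer                // detail view: the complete diagnosis
)

// tableStates is an assumed allow-list of states promoted to the HPA table.
// The state identifiers are hypothetical.
var tableStates = map[string]bool{
	"maxed":               true,
	"metrics_unavailable": true,
	"unable_to_scale":     true,
	"disabled":            true,
	"pinned":              true,
}

// Visible reports whether a state is shown on the given surface.
func Visible(surface Surface, state string) bool {
	if surface == Drawer {
		return true // the drawer always shows the full diagnosis
	}
	return tableStates[state]
}
```

An allow-list means any state added later stays quiet in the table until someone deliberately promotes it.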

Testing

Automated:

  • go test ./hpadiag ./resourcecontext ./ai/context from pkg/
  • go test ./... from pkg/
  • go test ./internal/k8s ./internal/server
  • make test
  • make tsc
  • npm --workspace @skyhook-io/k8s-ui run tsc
  • npm --workspace @skyhook-io/k8s-ui test -- --run src/components/resources/renderers/HPARenderer.test.tsx src/components/resources/renderers/WorkloadRenderer.test.tsx src/components/resources/resource-utils-hpa.test.ts
  • npm --workspace @skyhook-io/k8s-ui test -- --run src/components/resources/renderers/WorkloadRenderer.test.tsx
  • make build

Live visual test:

  • Cluster: kind-radar-gitops-demo
  • Namespace: radar-hpa-visual-test
  • Fixture HPAs: hpa-vt-disabled, hpa-vt-maxed, hpa-vt-metrics-incomplete, hpa-vt-metrics-unavailable, hpa-vt-min-limited, hpa-vt-pinned, hpa-vt-scaling-up, hpa-vt-stabilized, hpa-vt-stable, hpa-vt-stale, hpa-vt-unable-to-scale

Screens covered:

  • HPA list: verified all 11 fixtures and the conservative table policy. The Maxed, Metrics unavailable, Unable to scale, Disabled, and Pinned fixtures surfaced; the min-bound, stale, stabilized, stable, and scaling-up fixtures stayed quiet.
  • Maxed HPA drawer: verified synthesized summary, max-bound badge, replica bounds, ScalingLimited / TooManyReplicas evidence, CPU metric row, and amber condition rendering.
  • Metrics-unavailable HPA drawer: verified missing-request summary, condition evidence, missing metric row, no bogus unknown status_only metric row, and only ScalingActive counted as failing.
  • Workload drawer for Deployment/hpa-vt-maxed: verified compact HPA autoscaler context and disabled manual Scale action.
  • Workload drawer for Deployment/hpa-vt-metrics-incomplete: verified compact missing-metrics copy, inline hpa/name controller badge, and word-boundary wrapping.
  • Workload drawer for Deployment/hpa-vt-pinned: verified compact Fixed at 5 replicas copy, inline hpa/name controller badge, and disabled manual Scale action.
  • Disabled Scale tooltip: verified custom Tooltip rendering with role="tooltip", explanatory aria-label, and no native title on the Scale button.

Not covered by live visual test:

  • MCP/AI context responses were covered by Go tests, not by driving an MCP client against the fixture cluster.
  • Dashboard/problem detection was covered by analyzer/server tests and HPA list classification, not by a separate dashboard screenshot.
  • Prometheus charts were not part of live validation; diagnosis is based on Kubernetes HPA spec/status/conditions.

Notes / Tradeoffs

  • This does not add live Prometheus metric diagnosis beyond the existing HPA charts.
  • Metric row normalization is best-effort across resource, container resource, pods, object, and external metrics.
  • The table and drawer intentionally do not use identical state visibility. The drawer is the complete diagnosis surface; the table is for scan-worthy operational signal.
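To make "best-effort" normalization concrete, here is a rough sketch of mapping the five autoscaling/v2 metric source types onto one row shape. The `Metric` and `Row` structs and the label format are hypothetical and are not taken from pkg/hpadiag; only the five type names come from the Kubernetes API.

```go
package main

// Metric is a simplified stand-in for an autoscaling/v2 metric entry.
type Metric struct {
	Type      string // "Resource", "ContainerResource", "Pods", "Object", "External"
	Name      string // resource name (e.g. "cpu") or metric name
	Container string // set only for ContainerResource metrics
}

// Row is a hypothetical normalized metric row.
type Row struct {
	Source string // normalized source kind
	Label  string // display label
	Known  bool   // false when the metric type was not recognized
}

// Normalize maps a metric onto a Row, degrading to an "unknown" row
// instead of failing when the type is not recognized.
func Normalize(m Metric) Row {
	switch m.Type {
	case "Resource":
		return Row{Source: "resource", Label: m.Name, Known: true}
	case "ContainerResource":
		return Row{Source: "container", Label: m.Container + "/" + m.Name, Known: true}
	case "Pods":
		return Row{Source: "pods", Label: m.Name, Known: true}
	case "Object":
		return Row{Source: "object", Label: m.Name, Known: true}
	case "External":
		return Row{Source: "external", Label: m.Name, Known: true}
	default:
		return Row{Source: "unknown", Label: m.Name, Known: false}
	}
}
```

Returning a row marked `Known: false` rather than an error keeps one unfamiliar metric type from hiding the rest of the diagnosis.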

Note

Medium Risk
Changes when HPAs are flagged as maxed in problem detection (fewer false positives) and adds new optional API fields consumed by UI; scope is autoscaling/observability rather than auth or data paths, with broad fixture and test coverage.

Overview
Introduces pkg/hpadiag as the single place that interprets autoscaling/v2 HPAs (state, summary, bounds, metrics, condition-backed reasons), and wires it through backend detection, resource/AI context, and the k8s-ui drawers.

Detection policy tightens: “maxed” problems now require controller evidence (ScalingLimited=True / TooManyReplicas), not merely current == desired == maxReplicas. Metrics and scale failures still surface as separate cannot-scale issues; min-bound, stale, stabilized, and pinned cases stay out of broad scan noise.

API & context: HPA GET responses optionally include hpaDiagnosis on ResourceWithRelationships; resource context gains hpaSummary; AI summary/minify uses the same analyzer for HPA issue text.

UI: HPA detail gets a Diagnosis section (replacing ad-hoc condition heuristics); HPA list status uses conservative table classification; workload views show compact inline autoscaler diagnosis when scale is HPA-blocked and fetch sibling HPAs for that context. Shared types, condition warning tones, and layout tweaks support the new surfaces.

Reviewed by Cursor Bugbot for commit 2b353cf. Bugbot is set up for automated code reviews on this repo.

@nadaverell nadaverell requested a review from hisco as a code owner June 13, 2026 20:54
Comment thread web/src/components/resources/renderers/WorkloadRenderer.tsx Outdated
@nadaverell nadaverell force-pushed the hpa-diagnosis-insights branch from cb0d80a to 2db4ac8 Compare June 13, 2026 21:02
Comment thread pkg/hpadiag/diagnosis.go
@nadaverell nadaverell force-pushed the hpa-diagnosis-insights branch 3 times, most recently from 347883b to 255cb83 Compare June 13, 2026 22:58
Comment thread internal/k8s/detect_workload.go
Comment thread pkg/hpadiag/diagnosis.go
@nadaverell nadaverell force-pushed the hpa-diagnosis-insights branch from 255cb83 to 32c9045 Compare June 13, 2026 23:05

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4801a0b.

Comment thread pkg/hpadiag/diagnosis.go
@nadaverell nadaverell merged commit 181f920 into main Jun 14, 2026
9 checks passed
@nadaverell nadaverell deleted the hpa-diagnosis-insights branch June 14, 2026 10:29

1 participant