Skip to content

feat(resourcecontext): Build generator + /api/ai/resources/* GET wiring (T6)#721

Merged
nadaverell merged 17 commits into
mainfrom
feat/t6-resourcecontext-build
May 21, 2026
Merged

feat(resourcecontext): Build generator + /api/ai/resources/* GET wiring (T6)#721
nadaverell merged 17 commits into
mainfrom
feat/t6-resourcecontext-build

Conversation

@nadaverell
Copy link
Copy Markdown
Contributor

@nadaverell nadaverell commented May 17, 2026

Wave 2 / Phase 2b. Implements resourcecontext.Build — the canonical generator that produces a ResourceContext for AI surfaces — and wires it into /api/ai/resources/{kind}/{ns}/{name} GET. Plus the cross-handler RBAC preflight, fetch + relationship lookup correctness for cross-group CRDs, and a deliberate decision to drop prose Hints from v1.

What lands

pkg/resourcecontext/build.goBuild(ctx, obj, opts) *ResourceContext. Populates per-field:

  • ManagedBy: reads topology.Relationships.ManagedBy (T23-synthesized). When the caller has a topology, calls topology.GetRelationshipsWithObject(...) so synthesis is group-aware. Without topology, falls back to SynthesizeManagedBy(obj, …) for label/annotation signals only.
  • Exposes / SelectedBy / Uses / RunsOn / ScaledBy: walked from the topology graph (or pod.Spec.* for K8s-spec fields). RunsOn + Uses.ServiceAccount consume rel.Node / rel.ServiceAccount with a pod.Spec.NodeName fallback any time rel.Node is missing.
  • IssueSummary / AuditSummary: caller pre-computes via internal/issues + pkg/audit; passed in as DTOs so pkg/resourcecontext doesn't import internal/*.
  • PolicySummary.Kyverno: counts on the basic tier when opts.PolicyReports is set.

pkg/topology/pseudokinds.go (new) — exported KindForGVK(kind, group) maps the three known cross-group CRD collisions (Knative {Service, Configuration, Revision, Route}, CAPI Cluster, Istio Gateway) to their topology pseudo-kind label. Used by the handler to feed group-aware kinds into the relationship lookup.

Cross-handler RBAC preflight

Extracted Server.preflightResourceGet in server.go. Both handleGetResource and handleAIGetResource call it — gates run BEFORE fetch in lockstep, closing a real bypass where /api/ai/resources/* was missing namespaces SAR, cluster-scoped SAR, and per-namespace Secret SAR.

Group-aware fetch + relationship lookup

Two related fixes for cross-group CRDs:

  • GET fetch: fetchAIResource was returning the core/v1 Service when called with services?group=serving.knative.dev because typed cache hit before group check. Now routes group != "" directly to GetDynamicWithGroup, matching /api/resources/ GET.
  • Relationship lookup: even with the right object fetched, GetRelationshipsWithObject(URL_kind, ns, name, obj, ...) resolved against the WRONG topology node (service/... instead of knativeservice/...). Now derives gvk := obj.GetObjectKind().GroupVersionKind() and passes KindForGVK(gvk.Kind, gvk.Group) so the right topology node is found.

Regression tests verified to FAIL on the pre-fix code (handler diff stashed, tests re-run, bug-shape failures confirmed).

Audit + issue summary correctness

  • Audit: previous keying matched only (namespace, name), dropping the Kind component — a Deployment web in prod inherited findings from a co-named Service. Now uses bpaudit.ResourceKey(canonicalKind, ns, name) direct lookup with canonicalKind from obj's TypeMeta. Deterministic top-finding selection via sort by (severity desc, CheckID asc).
  • Issue summary: URL-plural kind ("deployments") never matched applyFilters's case-folded wantKind["deployments"] against Issue.Kind→"deployment". Now passes canonicalKind (Pascal singular from TypeMeta).

Hints dropped from v1

The earlier prototype shipped a Hints []string prose projection (deterministic Go-template summary lines like "Managed by Application storefront", "3 issues (critical): CrashLoopBackOff"). After implementation review, dropped because:

  • Our dominant agent consumer (Claude-class) composes triage prose from the structured fields trivially — Hints added wire bytes + prompt tokens with no net signal for that consumer.
  • Once shipped, agents pattern-matching on Hint substrings would have ossified the wording.
  • Asymmetric cost of being wrong: easy to add back later as a separate explain_resource tool if a real consumer emerges; hard to remove or evolve once depended on.

~430 LoC removed. Doc comment on ResourceContext records the decision so future readers don't reintroduce the projection without revisiting the tradeoff.

Speculative wire surface trimmed

Same YAGNI logic applied to the rest of the v1 types. Deleted fields/enum values with no v1 consumer and no concrete plan to populate them in T7/T10/T13:

  • ContextRef.Reason + RefReason enum (10 values): structurally redundant with the parent field name (selectedBy → selector match, runsOn → node binding, etc.). The ConfigMap/Secret refs were tagged with a single Reason that was inaccurate for the env-from vs volume-mount half of each set — actively misleading on the wire.
  • ContextRef.Source + RefSource enum (5 values): internal provenance describing which radar subsystem produced the fact. No agent or UI consumer branches on it; planned policy_report/audit_engine sources never get a ContextRef anyway (T10 emits PolicySummary/AuditSummary rollups).
  • ContextRef.Confidence: reserved field, never set. Defer until fuzzy joins (Trivy/ConfigAudit) actually land.
  • Options.MaxTokens: reserved field, never enforced. Add when there's budgeting code.
  • ResourceContext.Truncated: never set by Build — no truncation path exists in the generator (RBAC drops are reported via Omitted).
  • OmittedKindUnsupported, OmittedProviderDisabled: unused enum values.

Kept OmittedRBACDenied / OmittedBudgetExceeded / OmittedCacheCold / OmittedNotInstalled — all four are populated today or planned for T10's policy-report diagnostic per the explicit TODO(T10) in internal/k8s/policy_reports.go. ~170 LoC removed from types + tests.

/api/ai/* outside OpenAPI

Top-of-file doc on internal/server/ai_handlers.go declares /api/ai/* intentionally outside the OpenAPI spec. The original OpenAPI-first discipline in radar exists for SPA ↔ backend type safety (regenerated TS client from one spec). /api/ai/* has no SPA consumer; agents read MCP tool descriptions or in-prompt instructions, not OpenAPI specs. The doc records the opt-out + revisit triggers (surface stability + public-SDK commitment) so a future maintainer doesn't reflexively bring /api/ai/* under the spec without revisiting why.

Verification

go build ./... + full go test -count=1 ./... green on both root + pkg sub-modules. Golden + RBAC + group + issue-summary + pseudo-kind tests pin each fix.

Out of scope

  • MCP get_resource does not yet emit resourceContext. That's T7 — small follow-on calling Build from handleGetResource. MCP get_resource also has the same group-blind fetch as the original fetchAIResource; bundle that fix with T7.
  • handleAIListResources and /api/search — T89's territory (PR feat(resourcecontext): summaryContext on list_resources + search hits (T8+T9) #722).
  • /api/resources/* UI primary endpoint — intentionally untouched; UI adopts opportunistically via T14.

Note

Medium Risk
Touches authorization and single-resource fetch paths for the new /api/ai/resources/* surface, plus changes audit/issue indexing keys to include API group; mistakes could cause unintended data exposure or missing/incorrect context until exercised by tests.

Overview
Adds a new pkg/resourcecontext.Build generator and wires it into /api/ai/resources/{kind}/{namespace}/{name} so AI clients receive {resource, resourceContext} (with context=none to return just the minified resource).

Hardens correctness and security on the AI GET path: it now runs the same centralized preflightResourceGet RBAC gates as the REST GET, fetches CRDs group-first via the dynamic cache to avoid cross-group kind collisions, and uses topology.KindForGVK so relationship lookups don’t leak core-object edges into same-kind CRDs.

Makes issue/audit/policy enrichment group-aware and deterministic: issues/events/audit findings now populate/resolve Group (with a shared built-in Kind→Group fallback), pkg/audit.ResourceKey/indexes include group to prevent cross-group collisions, audit severity is normalized (dangercritical), and summary “top” selection is stabilized via sorting. Adds test helpers for dynamic cache/discovery fakes and extensive regression tests for RBAC, group routing, and summary behavior.

Reviewed by Cursor Bugbot for commit 7a1f006. Bugbot is set up for automated code reviews on this repo. Configure here.

@nadaverell nadaverell requested a review from hisco as a code owner May 17, 2026 07:51
Comment thread internal/server/ai_handlers.go Outdated
Comment thread pkg/resourcecontext/build.go Outdated
Comment thread internal/server/ai_handlers.go Outdated
Comment thread pkg/resourcecontext/hints.go Outdated
@nadaverell nadaverell force-pushed the feat/t6-resourcecontext-build branch from 9207f7c to 23e5db8 Compare May 17, 2026 09:13
Comment thread internal/server/ai_handlers.go Outdated
if cache == nil {
return nil
}
results := audit.RunFromCache(cache, []string{namespace}, nil)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full audit engine runs on every single GET

Medium Severity

computeAuditSummaryForResource calls audit.RunFromCache, which runs the entire best-practice check suite against all resources in the namespace, on every single /api/ai/resources/{kind}/{ns}/{name} GET request. The results are not cached or memoized, so each request re-runs all checks just to filter down to findings for one resource.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 23e5db8. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged and deferred. The full audit suite per /api/ai/resources GET is real (~10–50ms on a moderate cluster) but the fix is a caching layer that adds non-trivial scope. Tracked as a post-wave follow-up — see internal task list. Not blocking T6 merge because the cost is bounded and the rest of the GET path (relationships, issues, RBAC preflight) is already non-trivial overhead too. If we want to bring it down, the right move is one audit-result cache shared across the AI surface, not a Build-internal optimization.

nadaverell added a commit that referenced this pull request May 17, 2026
…k + selectedBy mislabel

Reviewer's critical (#721): computeAuditSummaryForResource dropped the
Kind from its signature and matched only (namespace, name) by iterating
the audit index. A Deployment "web" in "prod" silently inherited audit
findings from a Service or ConfigMap of the same name in the same
namespace. The loop's map-iteration order also made TopFinding
non-deterministic across runs for tied severities — breaking the
determinism guarantee SynthesizeHints downstream advertises.

Fix derives the canonical kind from obj's TypeMeta in
buildAIResourceContext (Pascal singular — exactly what audit's check
runner writes into Finding.Kind) and threads it into the audit lookup.
The lookup is now a single map access via bpaudit.ResourceKey instead
of an O(n) scan. TopFinding selection sorts by (severity desc, CheckID
asc) so ties resolve identically every run.

Important fixes also addressed:

- RunsOn fallback gap (pkg/resourcecontext/build.go:197): the `else if
  rel == nil` guard meant pod.Spec.NodeName was used only when there
  was no topology at all. When topology was present but rel.Node was
  nil (Node informer cold, node not yet indexed), RunsOn stayed empty
  even though the Pod spec named a node. Now falls back any time
  rel.Node is missing.

- selectedByHint mislabel (pkg/resourcecontext/hints.go:124): every
  non-PDB ref was rendered as NetworkPolicy, including any future kind
  added to SelectedBy. Now explicit-match each known kind
  (PodDisruptionBudget, NetworkPolicy) and drop unrecognized kinds
  through summarizeKindsCounts instead of mislabeling.
Comment thread pkg/resourcecontext/hints.go Outdated
nadaverell added a commit that referenced this pull request May 17, 2026
…ting

Two residual findings on the T89 work:

1. Search summaryContext dropped issueCount for cluster-scoped hits.
   handleSearch passed scanNamespaces (a strictly-namespaced filter)
   into the single issue index used for every hit. Search returns
   MIXED-kind hits (a query can return both namespaced Pods and
   cluster-scoped Nodes), and cluster-scoped problems live at
   namespace="" — the per-namespace filter dropped them, silently
   zeroing issueCount on every Node/PV/cluster-scoped-CRD hit.

   Fix: dual issue index per search request. namespacedIdx is scoped
   to scanNamespaces; clusterIdx is composed cluster-wide (nil filter)
   so namespace="" issues surface. The summaryContextBuilder closure
   dispatches per-hit via k8s.ClassifyKindScope(kind, group), so a
   single response correctly counts both Pod and Node issues.
   CanReadClusterScoped already gates which cluster-scoped kinds the
   user can see — the cluster-wide index doesn't expose unauthorized
   rows. Mirrored on the MCP search path.

2. AI list with group ignored group and tried typed cache first.
   handleAIListResources called k8s.FetchResourceList(cache, kind,
   namespaces) before consulting group, so kind=services&group=
   serving.knative.dev silently returned core Services. Same fix
   shape PR #721 applied to GET: group != "" short-circuits straight
   to aiListDynamic. Mirrored on MCP handleListResources.

Tests:
- server: TestSummaryContextBuilderFromIndexes_DispatchesByScope and
  TestNewSearchSummaryContextBuilder_BuildsDualIndex pin the per-hit
  scope-based dispatch and the dual-index construction shape.
- server: TestAI_ListServices_WithGroup_RoutesToDynamicCache pins the
  group routing on /api/ai/resources/{kind} (typed core Service for
  no group; dynamic-cache path for group=serving.knative.dev).
- mcp: same two builder tests plus
  TestHandleListResources_GroupRoutesToDynamic for the MCP list path.
Comment thread pkg/resourcecontext/build.go
nadaverell added a commit that referenced this pull request May 18, 2026
…, 500 on unknown

Three Bugbot findings on PR #721:

1. Build's relationship fallback used `ident.Kind` directly (the raw GVK
   kind like "Service") when opts.Relationships was nil but opts.Topology
   was set. For Knative serving.knative.dev/Service this resolved to the
   CORE Service topology node, leaking core relationships into the CRD's
   resourceContext — defeating the KindForGVK fix that the handler-side
   pre-computation already applies. Mirror the handler's resolution here
   so the fallback agrees with the primary path.

2. buildUsesFromPod had the dominant-pattern Reason labels reversed:
   ConfigMaps got ReasonEnvVarRef, Secrets got ReasonVolumeMount.
   ConfigMaps surface primarily via volume mounts (config files);
   Secrets primarily via env (SecretKeyRef). Swapped to match the
   common case. Doc comment clarifies the label is best-effort per
   kind — both paths feed both kinds in practice.

3. writeAIFetchError default branch returned 404 NOT_FOUND for any
   unrecognized error. Unknown errors are server-side (e.g. "resource
   discovery not initialized") and should surface as 500, not be
   masked as missing-resource. One-line fix.

No test changes — the fixes are correctness-only and existing tests
remain valid.
nadaverell added a commit that referenced this pull request May 18, 2026
Mirror the T6 (#721) wire-surface cleanup so types.go stays in sync
across the in-flight wave-2 PRs. No T89 consumer touches the deleted
fields — ResourceSummaryContext / ManagedByRef changes are unaffected.

Deletes:
- ContextRef.Reason / RefReason enum
- ContextRef.Source / RefSource enum
- ContextRef.Confidence
- ResourceContext.Truncated
- OmittedKindUnsupported, OmittedProviderDisabled
nadaverell added a commit that referenced this pull request May 18, 2026
Mirror the T6 (#721) wire-surface cleanup so types.go stays in sync
across the in-flight wave-2 PRs. T11 doesn't reference any of the
deleted fields — Kyverno issue source changes are unaffected.

Deletes:
- ContextRef.Reason / RefReason enum
- ContextRef.Source / RefSource enum
- ContextRef.Confidence
- ResourceContext.Truncated
- OmittedKindUnsupported, OmittedProviderDisabled

Retained OmittedCacheCold and OmittedNotInstalled per the explicit
TODO(T10) in internal/k8s/policy_reports.go that plans to populate
these when the diagnostic-tier Kyverno consumer arrives.
nadaverell added a commit that referenced this pull request May 18, 2026
Mirror the T6 (#721) wire-surface cleanup so types.go stays in sync
across the in-flight wave-2 PRs.

Deletes:
- ContextRef.Reason / RefReason enum
- ContextRef.Source / RefSource enum
- ContextRef.Confidence
- ResourceContext.Truncated
- OmittedKindUnsupported, OmittedProviderDisabled

Note: OmittedBudgetExceeded stays on the wire — c2f8d49 already
removed the redundant emission alongside truncated:true. The enum
value remains for non-truncation budget overflows (e.g. Kyverno
warmup cap, planned in T10).
Comment thread internal/server/ai_handlers.go
Comment thread internal/server/ai_handlers.go
Comment thread internal/server/ai_handlers.go Outdated
nadaverell added a commit that referenced this pull request May 18, 2026
…k + selectedBy mislabel

Reviewer's critical (#721): computeAuditSummaryForResource dropped the
Kind from its signature and matched only (namespace, name) by iterating
the audit index. A Deployment "web" in "prod" silently inherited audit
findings from a Service or ConfigMap of the same name in the same
namespace. The loop's map-iteration order also made TopFinding
non-deterministic across runs for tied severities — breaking the
determinism guarantee SynthesizeHints downstream advertises.

Fix derives the canonical kind from obj's TypeMeta in
buildAIResourceContext (Pascal singular — exactly what audit's check
runner writes into Finding.Kind) and threads it into the audit lookup.
The lookup is now a single map access via bpaudit.ResourceKey instead
of an O(n) scan. TopFinding selection sorts by (severity desc, CheckID
asc) so ties resolve identically every run.

Important fixes also addressed:

- RunsOn fallback gap (pkg/resourcecontext/build.go:197): the `else if
  rel == nil` guard meant pod.Spec.NodeName was used only when there
  was no topology at all. When topology was present but rel.Node was
  nil (Node informer cold, node not yet indexed), RunsOn stayed empty
  even though the Pod spec named a node. Now falls back any time
  rel.Node is missing.

- selectedByHint mislabel (pkg/resourcecontext/hints.go:124): every
  non-PDB ref was rendered as NetworkPolicy, including any future kind
  added to SelectedBy. Now explicit-match each known kind
  (PodDisruptionBudget, NetworkPolicy) and drop unrecognized kinds
  through summarizeKindsCounts instead of mislabeling.
nadaverell added a commit that referenced this pull request May 18, 2026
…, 500 on unknown

Three Bugbot findings on PR #721:

1. Build's relationship fallback used `ident.Kind` directly (the raw GVK
   kind like "Service") when opts.Relationships was nil but opts.Topology
   was set. For Knative serving.knative.dev/Service this resolved to the
   CORE Service topology node, leaking core relationships into the CRD's
   resourceContext — defeating the KindForGVK fix that the handler-side
   pre-computation already applies. Mirror the handler's resolution here
   so the fallback agrees with the primary path.

2. buildUsesFromPod had the dominant-pattern Reason labels reversed:
   ConfigMaps got ReasonEnvVarRef, Secrets got ReasonVolumeMount.
   ConfigMaps surface primarily via volume mounts (config files);
   Secrets primarily via env (SecretKeyRef). Swapped to match the
   common case. Doc comment clarifies the label is best-effort per
   kind — both paths feed both kinds in practice.

3. writeAIFetchError default branch returned 404 NOT_FOUND for any
   unrecognized error. Unknown errors are server-side (e.g. "resource
   discovery not initialized") and should surface as 500, not be
   masked as missing-resource. One-line fix.

No test changes — the fixes are correctness-only and existing tests
remain valid.
@nadaverell nadaverell force-pushed the feat/t6-resourcecontext-build branch from f3cb5d5 to 396ee23 Compare May 18, 2026 15:22
nadaverell added a commit that referenced this pull request May 18, 2026
…ting

Two residual findings on the T89 work:

1. Search summaryContext dropped issueCount for cluster-scoped hits.
   handleSearch passed scanNamespaces (a strictly-namespaced filter)
   into the single issue index used for every hit. Search returns
   MIXED-kind hits (a query can return both namespaced Pods and
   cluster-scoped Nodes), and cluster-scoped problems live at
   namespace="" — the per-namespace filter dropped them, silently
   zeroing issueCount on every Node/PV/cluster-scoped-CRD hit.

   Fix: dual issue index per search request. namespacedIdx is scoped
   to scanNamespaces; clusterIdx is composed cluster-wide (nil filter)
   so namespace="" issues surface. The summaryContextBuilder closure
   dispatches per-hit via k8s.ClassifyKindScope(kind, group), so a
   single response correctly counts both Pod and Node issues.
   CanReadClusterScoped already gates which cluster-scoped kinds the
   user can see — the cluster-wide index doesn't expose unauthorized
   rows. Mirrored on the MCP search path.

2. AI list with group ignored group and tried typed cache first.
   handleAIListResources called k8s.FetchResourceList(cache, kind,
   namespaces) before consulting group, so kind=services&group=
   serving.knative.dev silently returned core Services. Same fix
   shape PR #721 applied to GET: group != "" short-circuits straight
   to aiListDynamic. Mirrored on MCP handleListResources.

Tests:
- server: TestSummaryContextBuilderFromIndexes_DispatchesByScope and
  TestNewSearchSummaryContextBuilder_BuildsDualIndex pin the per-hit
  scope-based dispatch and the dual-index construction shape.
- server: TestAI_ListServices_WithGroup_RoutesToDynamicCache pins the
  group routing on /api/ai/resources/{kind} (typed core Service for
  no group; dynamic-cache path for group=serving.knative.dev).
- mcp: same two builder tests plus
  TestHandleListResources_GroupRoutesToDynamic for the MCP list path.
Comment thread internal/server/ai_handlers.go Outdated
…ET (T6)

Adds the basic-tier resourceContext builder and wires it into the
/api/ai/resources/{kind}/{ns}/{name} GET handler. The package layer is
pure-Go and depends only on pkg/topology — callers in internal/* pre-
compute IssueSummary and AuditSummary so this package doesn't reach
into internal/issues or internal/audit.

What ships:

- pkg/resourcecontext/build.go: Build(ctx, obj, opts) walks topology
  relationships + pod spec to populate ManagedBy (Argo tracking-id /
  Flux labels / owner ref), Exposes, SelectedBy (PDB + NetworkPolicy
  split), Uses (ConfigMaps/Secrets/PVCs/SA), RunsOn, ScaledBy, and
  PolicySummary. Every ContextRef passes through opts.AccessChecker;
  denials are dropped and recorded in Omitted with rbac_denied reason.

- pkg/resourcecontext/hints.go: SynthesizeHints renders 5–8 short,
  deterministic prose lines for AI consumers. EmitHints=false on UI
  callers leaves Hints empty.

- internal/server/rc_rbac.go: requestScopedChecker wraps Server.canRead
  with a per-request (verb,group,kind,namespace) memoization layer.
  Collapses the ~30 ref-checks per response into ~5 SAR calls without
  changing the underlying PermissionCache TTL.

- internal/server/ai_handlers.go: handleAIGetResource now returns
  { resource, resourceContext } by default. ?context=none keeps the
  pre-T6 bare shape. IssueSummary uses issues.ComposeWithStats scoped
  to the resource; AuditSummary uses pkg/audit.IndexByResource.
  handleAIListResources is unchanged — T8/T9 territory.

Golden tests cover Pod (full enrichment + RBAC denial + JSON shape),
Deployment (Flux HelmRelease label precedence), Service (Ingress
Exposes), NetworkPolicy (outgoing EdgeProtects intentionally not
surfaced), ConfigMap (owner-chain only), PolicyReports rollup, HPA
identity, and the EmitHints=false path. Hints determinism is pinned
by a separate golden in hints_test.go.
Reviewer caught that buildAIResourceContext never set
opts.PolicyReports, so policySummary.kyverno stayed absent on the REST
get-resource path even after T5 (index) and T11 (composer source). The
Build generator already supports it via PolicyReportLookup; the handler
just wasn't passing it.

Adds a narrow policyReportLookupAdapter that wraps
internal/k8s.GetPolicyReportIndex() — translating the richer
pkg/policyreports.Finding shape (Severity + Category) into the
agent-facing resourcecontext.KyvernoFinding shape (Policy / Rule /
Result / Message). Keeping the projection narrow at the adapter layer
means future additions to policyreports.Finding don't perturb the wire
contract.

When Kyverno isn't installed, GetPolicyReportIndex() returns nil and
opts.PolicyReports stays unset — Build emits no policySummary, which
is the correct degraded behavior.
Reviewer caught a plan-vs-code mismatch: the locked v1 contract says
basic tier emits Kyverno counts only (fail/warn/pass), with Top[]
findings reserved for diagnostic tier. The code emitted Top[] on both
tiers, inflating basic-tier wire size with details that belong on the
deeper agent investigation path.

Fix: buildPolicySummary now takes ContextTier. Basic emits counts only;
diagnostic adds the Top[] (cap 3, ordered fail > warn > error > pass).
Existing TestBuild_PolicyReports_RolledUp split into two cases that
pin both tier outputs.
… Pod hygiene fields (T23 dedup)

T23 (#720) made topology.Relationships the canonical projection for
server-synthesized ManagedBy (owner-chain + GitOps detection) and Pod
hygiene fields (ServiceAccount, Node). Drop the parallel owner-walk
and label/annotation scanning inside resourcecontext.Build — read
Relationships.{ManagedBy,ServiceAccount,Node} instead.

Build now calls topology.GetRelationshipsWithObject when Topology is
set, passing the fetched obj so kind/group disambiguation works
(Knative serving.knative.dev/Service vs core/v1 Service). Single-
resource callers (REST GET /api/ai/resources) pass idx=nil — the per-
call O(E) scan is fine for one walk per request; bulk callers should
pre-build via topology.IndexByResource(topo) and pass through
Options.RelIndex or Options.Relationships.

No deprecation aliases: buildManagedBy, parseArgoTrackingID,
pickControllerOwner, groupFromAPIVersion, readPair, and the
gitops/flux label constants are deleted outright (per saved feedback
preference). Their tests are removed — the same logic is exercised
in pkg/topology/managedby_test.go.
handleAIGetResource skipped the user-RBAC gates that handleGetResource
runs, so a user with namespace access but no per-namespace get-secrets
SAR could read Secret values via the AI endpoint, and a user without
cluster-scoped node SAR could read Node objects. The AI surface returns
the same resource bytes (just minified + wrapped) as the REST surface,
so it must enforce the same gates.

Extract the gate block into Server.preflightResourceGet and call it
from both handlers. Single helper keeps the two endpoints in lockstep
so future RBAC adjustments touch one place. Tests cover the three deny
arms (per-ns get-secrets, cluster-scoped get-node, namespace access)
plus a passing-case sanity check on the AI envelope shape.
…k + selectedBy mislabel

Reviewer's critical (#721): computeAuditSummaryForResource dropped the
Kind from its signature and matched only (namespace, name) by iterating
the audit index. A Deployment "web" in "prod" silently inherited audit
findings from a Service or ConfigMap of the same name in the same
namespace. The loop's map-iteration order also made TopFinding
non-deterministic across runs for tied severities — breaking the
determinism guarantee SynthesizeHints downstream advertises.

Fix derives the canonical kind from obj's TypeMeta in
buildAIResourceContext (Pascal singular — exactly what audit's check
runner writes into Finding.Kind) and threads it into the audit lookup.
The lookup is now a single map access via bpaudit.ResourceKey instead
of an O(n) scan. TopFinding selection sorts by (severity desc, CheckID
asc) so ties resolve identically every run.

Important fixes also addressed:

- RunsOn fallback gap (pkg/resourcecontext/build.go:197): the `else if
  rel == nil` guard meant pod.Spec.NodeName was used only when there
  was no topology at all. When topology was present but rel.Node was
  nil (Node informer cold, node not yet indexed), RunsOn stayed empty
  even though the Pod spec named a node. Now falls back any time
  rel.Node is missing.

- selectedByHint mislabel (pkg/resourcecontext/hints.go:124): every
  non-PDB ref was rendered as NetworkPolicy, including any future kind
  added to SelectedBy. Now explicit-match each known kind
  (PodDisruptionBudget, NetworkPolicy) and drop unrecognized kinds
  through summarizeKindsCounts instead of mislabeling.
… kind collisions

fetchAIResource was calling FetchResource (typed cache) before consulting
the group qualifier. For /api/ai/resources/services/ns/name?group=
serving.knative.dev, the typed cache returns the core/v1 Service and the
?group= is silently dropped, leaking the wrong object via the AI surface.

Branch on group != "" first and route directly to GetDynamicWithGroup,
mirroring handleGetResource's dispatch in server.go. The typed-first path
remains correct (and faster) when no group is passed.

Same bug class as T12's group-blind root lookup, on the single-resource
GET path.
…lationship lookup

Two residual bugs on the AI GET resourceContext path were dropping signal
silently:

1) computeIssueSummaryForResource was called with the URL-plural kind
   ("deployments"), which the issues composer's Filters.Kinds matcher
   case-folds but does NOT plural-to-singular convert. Issue.Kind is the
   canonical Pascal singular ("Deployment"), so every issue got filtered
   out and IssueSummary.Count silently collapsed to 0 (Build then omits
   the field entirely). Fix: pass canonicalKind (derived from obj.GVK)
   into the function — same pattern computeAuditSummaryForResource was
   already using.

2) GetRelationshipsWithObject was called with the URL-plural kind, which
   buildNodeID resolved to the wrong topology node for cross-group CRDs
   whose Kind collides with a core kind. For a Knative serving.knative.dev
   Service request, "services" → "service/ns/name" picked the CORE
   Service node instead of "knativeservice/ns/name", so relationship
   walks returned the core Service's edges. Same shape for CAPI Cluster
   ("capicluster") and Istio Gateway ("istiogateway").

   Added topology.KindForGVK as a small exported helper that maps
   (kind, group) → pseudo-kind for the three known cross-group
   collisions; the handler now funnels gvk through it before the
   relationship lookup. Non-colliding kinds (core, apps, batch, Gateway
   API, etc.) pass through unchanged so buildNodeID's existing kindMap
   handles them.

Tests:
- pkg/topology/pseudokinds_test.go: table tests covering every remap
  case plus the pass-throughs (including the wrong-group-same-kind
  guards: Service under argoproj.io, Route under route.openshift.io).
- internal/server/ai_handlers_group_test.go:
  * TestAIGetResource_GroupRoutesRelationshipsToKnative — seeds a
    Knative Service in the dynamic cache co-named with the core Service
    plus an Ingress backend-ref'd to the core Service. Asserts the
    knative-routed response does NOT leak the core Service's Ingress
    into resourceContext.exposes (locked the regression to fail
    without the KindForGVK funnel).
  * TestAIGetResource_IssueSummaryCountsURLPluralKind — asserts a
    broken Deployment (UnavailableReplicas=3) surfaces with count > 0
    in resourceContext.issueSummary when fetched via the URL plural.

Fixture additions in TestMain:
- broken/stuck-app Deployment (UnavailableReplicas=3): seeds a known
  problem for the issue-summary regression test, in its own namespace
  so it doesn't perturb default-namespace smoke tests.
- default/nginx-ingress Ingress routing to Service "nginx": the
  distinguishing edge between core Service and Knative Service
  topology nodes.
Removes the prose `Hints []string` field, the `EmitHints` option, the
SynthesizeHints generator, and all associated tests. ~430 LoC of net
removal.

Rationale: our dominant agent consumer (Claude-class) composes triage
prose from the structured fields trivially. The pre-baked Hints array
added wire bytes + prompt tokens with no net signal for that consumer.
Once shipped, agents pattern-matching on Hint substrings would have
ossified the wording. The structured fields (ManagedBy, Exposes,
SelectedBy, Uses, RunsOn, ScaledBy, IssueSummary, AuditSummary,
PolicySummary, Omitted) carry every fact a derived prose line would
encode — agents that need narrative can compose it.

If a real consumer emerges that needs deterministic prose, add it as a
separate `explain_resource` MCP tool. Keeping it inline was a premature
bet against asymmetric costs: easy to add later, hard to remove or
evolve once agents depend on the strings.

The doc comment on ResourceContext records the decision so future
readers don't re-introduce the projection without revisiting the
tradeoff.
Radar is OpenAPI-first per CLAUDE.md, but that discipline was adopted
to keep the Go backend and the TypeScript SPA client in sync — one
spec, regenerated as server stubs + TS client. /api/ai/* has no SPA
consumer; agents read MCP tool descriptions or in-prompt instructions,
not OpenAPI specs.

Without this doc, a future maintainer might reflexively bring
/api/ai/* under openapi.yaml on the assumption that radar's
OpenAPI-first stance covers every endpoint. That would pay the
spec-authoring tax during agent-surface evolution without earning
the SDK-generation benefit. The wave-2 wire shapes already evolved
across three review rounds; locking them down in YAML during that
churn would have produced spec/code drift, not contract clarity.

Top-of-file doc records the intentional opt-out + the revisit
triggers (surface stability + public-SDK commitment). Future
readers see explicit reasoning instead of accidental omission.
…, 500 on unknown

Three Bugbot findings on PR #721:

1. Build's relationship fallback used `ident.Kind` directly (the raw GVK
   kind like "Service") when opts.Relationships was nil but opts.Topology
   was set. For Knative serving.knative.dev/Service this resolved to the
   CORE Service topology node, leaking core relationships into the CRD's
   resourceContext — defeating the KindForGVK fix that the handler-side
   pre-computation already applies. Mirror the handler's resolution here
   so the fallback agrees with the primary path.

2. buildUsesFromPod had the dominant-pattern Reason labels reversed:
   ConfigMaps got ReasonEnvVarRef, Secrets got ReasonVolumeMount.
   ConfigMaps surface primarily via volume mounts (config files);
   Secrets primarily via env (SecretKeyRef). Swapped to match the
   common case. Doc comment clarifies the label is best-effort per
   kind — both paths feed both kinds in practice.

3. writeAIFetchError default branch returned 404 NOT_FOUND for any
   unrecognized error. Unknown errors are server-side (e.g. "resource
   discovery not initialized") and should surface as 500, not be
   masked as missing-resource. One-line fix.

No test changes — the fixes are correctness-only and existing tests
remain valid.
Remove fields and enum values with no consumer in v1 and no concrete
plan to populate them in T7/T10/T13:

- ContextRef.Reason (RefReason enum, 10 values): structurally redundant
  with the parent field name (selectedBy → selector match, runsOn →
  node binding, etc.). The ConfigMap/Secret refs were tagged with a
  single Reason that was inaccurate for the env-from vs volume-mount
  half of each set — actively misleading on the wire.
- ContextRef.Source (RefSource enum, 5 values): internal provenance
  describing which radar subsystem produced the fact. No agent or UI
  consumer branches on it; planned policy_report/audit_engine sources
  never get a ContextRef anyway (T10 emits PolicySummary/AuditSummary
  rollups).
- ContextRef.Confidence: reserved field, never set. Defer until fuzzy
  joins (Trivy/ConfigAudit) actually land.
- Options.MaxTokens: reserved field, never enforced. Add when there's
  budgeting code.
- ResourceContext.Truncated: never set by Build — no truncation path
  exists in the generator (RBAC drops are reported via Omitted).
- OmittedKindUnsupported + OmittedProviderDisabled: unused enum values.
  Kept rbac_denied / budget_exceeded / cache_cold / not_installed —
  all four are populated today or planned for T10's policy-report
  diagnostic per the TODO in internal/k8s/policy_reports.go.
- Drop the handler-side pre-compute of Relationships in
  buildAIResourceContext. Build's own fallback already does the
  identical KindForGVK pseudo-kind resolution; pre-computing here
  doubled the work whenever the lookup returned nil (handler call
  returned nil → opts.Relationships=nil → Build's `rel==nil &&
  opts.Topology!=nil` branch re-ran the same scan).

- Sort issue-summary rows by (severity desc, Reason asc) before
  picking topReason. The sibling computeAuditSummaryForResource
  already sorts (severity desc, CheckID asc) for determinism;
  the issue path's iteration-order pick could vary across runs
  when multiple rows tie on severity.
…ped lookups

computeAuditSummaryForResource unconditionally passed []string{namespace}
to audit.RunFromCache, even when namespace=="" (cluster-scoped
resources). That filters audit to literally namespace="" resources
instead of scanning all namespaces. computeIssueSummaryForResource
already guards with `if namespace != ""` — match it.

Latent today since the audit suite doesn't currently cover
cluster-scoped kinds, but the inconsistency would silently miss
findings the moment a cluster-scoped check lands.
Two T6 fixes required after rebasing onto post-#723 (Kyverno) and
post-#726 (RBAC reverse-lookup + Crossplane) main:

- T11 (#723) threaded API group through the PolicyReport lookup,
  changing policyreports.Index.FindingsFor from (kind, ns, name) to
  (group, kind, ns, name). Updated resourcecontext.PolicyReportLookup
  interface, the build.go call site (passing ident.Group), the
  policyReportLookupAdapter in ai_handlers.go, and the test mock.
  Group threading is a strict improvement — two CRDs sharing
  kind+ns+name across API groups now get disjoint findings instead
  of one inheriting the other's.

- internal/server/server_smoke_test.go acquired a `rbacv1` import on
  main (#726) at the same line our `networkingv1` import lands on
  this branch. Conflict resolution: keep both, sorted.
@nadaverell nadaverell force-pushed the feat/t6-resourcecontext-build branch from 351f100 to 3742ffa Compare May 18, 2026 19:45
nadaverell added a commit that referenced this pull request May 18, 2026
…ting

Two residual findings on the T89 work:

1. Search summaryContext dropped issueCount for cluster-scoped hits.
   handleSearch passed scanNamespaces (a strictly-namespaced filter)
   into the single issue index used for every hit. Search returns
   MIXED-kind hits (a query can return both namespaced Pods and
   cluster-scoped Nodes), and cluster-scoped problems live at
   namespace="" — the per-namespace filter dropped them, silently
   zeroing issueCount on every Node/PV/cluster-scoped-CRD hit.

   Fix: dual issue index per search request. namespacedIdx is scoped
   to scanNamespaces; clusterIdx is composed cluster-wide (nil filter)
   so namespace="" issues surface. The summaryContextBuilder closure
   dispatches per-hit via k8s.ClassifyKindScope(kind, group), so a
   single response correctly counts both Pod and Node issues.
   CanReadClusterScoped already gates which cluster-scoped kinds the
   user can see — the cluster-wide index doesn't expose unauthorized
   rows. Mirrored on the MCP search path.

2. AI list with group ignored group and tried typed cache first.
   handleAIListResources called k8s.FetchResourceList(cache, kind,
   namespaces) before consulting group, so kind=services&group=
   serving.knative.dev silently returned core Services. Same fix
   shape PR #721 applied to GET: group != "" short-circuits straight
   to aiListDynamic. Mirrored on MCP handleListResources.

Tests:
- server: TestSummaryContextBuilderFromIndexes_DispatchesByScope and
  TestNewSearchSummaryContextBuilder_BuildsDualIndex pin the per-hit
  scope-based dispatch and the dual-index construction shape.
- server: TestAI_ListServices_WithGroup_RoutesToDynamicCache pins the
  group routing on /api/ai/resources/{kind} (typed core Service for
  no group; dynamic-cache path for group=serving.knative.dev).
- mcp: same two builder tests plus
  TestHandleListResources_GroupRoutesToDynamic for the MCP list path.
Cross-group kind collisions (e.g. core/Service vs serving.knative.dev/Service
sharing namespace + name) previously caused audit/issue summaries
to silently inherit findings from the wrong resource. The fetch and
relationship paths in T6 were already group-aware; the summary
lookups were not.

pkg/audit:
- Finding and ResourceGroup gain a Group field (omitempty on the wire).
- ResourceKey signature changes to (group, kind, ns, name) — encoded
  "group|Kind|ns|name" with "|" delimiter matching the issue-source
  key in internal/summarycontext.
- Exported GroupForBuiltinKind(kind) — single source of truth for the
  built-in (Kind→Group) map. buildResults populates Finding.Group via
  this helper so the 20 per-check emission sites stay terse.

internal/server:
- buildAIResourceContext threads canonicalGroup from obj GVK through
  to both computeIssueSummaryForResource and computeAuditSummaryForResource.
- Audit lookup uses the new group-aware ResourceKey.
- Issue summary adds a strict row.Group == group check inside the
  match loop. The previous kind-only filter could silently pull in
  a colliding core kind's issues for a CRD lookup.
- UI audit drill-down (handleResourceAudit) explicitly passes group=""
  since the URL doesn't carry group today — comment points to #35
  for the CRD-aware drill-down work.

internal/issues:
- Centralised Group resolution via resolveGroup(group, kind): use the
  explicit group if set, else fall back to audit.GroupForBuiltinKind.
- Applied at fromProblem (legacy k8s.DetectProblems sites still emit
  Group="" for built-in workloads), fromAudit (passes through
  audit.Finding.Group which buildResults populates), fromWarningEvent
  (split apiVersion → group, with "v1"-shape → core).

Pin tests:
- TestResourceKey_GroupAware: distinct keys across groups.
- TestIndexByResource_NoCrossGroupCollision: lookup per-group returns
  only its own findings.
- TestGroupForBuiltinKind: the table.

Closes #35.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9d6658a. Configure here.

Comment thread internal/server/ai_handlers.go Outdated
… the wire

auditSummary.highestSeverity used to emit raw audit vocabulary
("danger" / "warning") while sibling issueSummary.highestSeverity
emits issue vocabulary ("critical" / "warning"). Two fields under the
same resourceContext disagreeing on what "highest severity" means is
a real wire-shape footgun — consumers parsing one will mis-handle the
other.

Mirrors the same mapping internal/issues.fromAudit applies when audit
findings flow through the unified issue stream, so the two paths now
agree. Empty / future audit severities pass through unchanged so the
contract stays explicit if pkg/audit grows new values.

Pinned by TestNormalizeAuditSeverity.
@nadaverell nadaverell merged commit 058552c into main May 21, 2026
8 checks passed
@nadaverell nadaverell deleted the feat/t6-resourcecontext-build branch May 21, 2026 12:07
nadaverell added a commit that referenced this pull request May 21, 2026
…ting

Two residual findings on the T89 work:

1. Search summaryContext dropped issueCount for cluster-scoped hits.
   handleSearch passed scanNamespaces (a strictly-namespaced filter)
   into the single issue index used for every hit. Search returns
   MIXED-kind hits (a query can return both namespaced Pods and
   cluster-scoped Nodes), and cluster-scoped problems live at
   namespace="" — the per-namespace filter dropped them, silently
   zeroing issueCount on every Node/PV/cluster-scoped-CRD hit.

   Fix: dual issue index per search request. namespacedIdx is scoped
   to scanNamespaces; clusterIdx is composed cluster-wide (nil filter)
   so namespace="" issues surface. The summaryContextBuilder closure
   dispatches per-hit via k8s.ClassifyKindScope(kind, group), so a
   single response correctly counts both Pod and Node issues.
   CanReadClusterScoped already gates which cluster-scoped kinds the
   user can see — the cluster-wide index doesn't expose unauthorized
   rows. Mirrored on the MCP search path.

2. AI list with group ignored group and tried typed cache first.
   handleAIListResources called k8s.FetchResourceList(cache, kind,
   namespaces) before consulting group, so kind=services&group=
   serving.knative.dev silently returned core Services. Same fix
   shape PR #721 applied to GET: group != "" short-circuits straight
   to aiListDynamic. Mirrored on MCP handleListResources.

Tests:
- server: TestSummaryContextBuilderFromIndexes_DispatchesByScope and
  TestNewSearchSummaryContextBuilder_BuildsDualIndex pin the per-hit
  scope-based dispatch and the dual-index construction shape.
- server: TestAI_ListServices_WithGroup_RoutesToDynamicCache pins the
  group routing on /api/ai/resources/{kind} (typed core Service for
  no group; dynamic-cache path for group=serving.knative.dev).
- mcp: same two builder tests plus
  TestHandleListResources_GroupRoutesToDynamic for the MCP list path.
nadaverell added a commit that referenced this pull request May 24, 2026
…hape, RBAC count leak

Cursor Bugbot + multi-agent review fixes:

- RBAC: PodSelectionSummary.Total now counts visible refs only, matching the
  ReferencedBy.Total fix in cf03039 (pre-RBAC len(pods) leaked the hidden
  pod count to callers without Pod read access).
- Silent failures: attachResourceExtras surfaces eventsError/metricsError/
  logsError instead of dropping errors, matching diagnose's wire contract;
  handleGetWorkloadLogs surfaces logsError when kube client is missing.
- Issue summary: /api/ai/resources GET now passes issues.NoLimit (not
  MaxLimit) to the per-resource composer, matching the MCP sibling — a
  hard cap before the per-resource filter could silently zero issueSummary
  on clusters with >1000 issues.
- Wire shape: get_changes and get_events empty-allowed-namespaces early
  exits now wrap in the response struct so capped + uncapped + denied all
  agree on shape.
- Log cap hint: handleGetPodLogs + handleGetWorkloadLogs track raw line
  count pre-grep (new countLines helper + RawLines on podLogEntry) so the
  truncation hint fires correctly even when grep filters heavily.
- get_changes hint: cites queryOpts.Limit (actual store cap) instead of
  user-facing limit, which is 10x lower on the name-filter path.
- top_resources: pod/node list errors now fall back to metrics-only rows
  instead of returning nil, matching the cache.Pods()==nil branch above.

Cleanup:
- Drop dead `var _ = runtime.Object(nil)` + runtime import.
- Drop dead `var _ = context.Background` + context import in test.
- Strip "PR #721" reference from handleGetResource comment.
- Drop podTotalRestarts redundant docstring with "Used by X" caller ref.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant