Skip to content

feat(resourcecontext): summaryContext on list_resources + search hits (T8+T9)#722

Merged
nadaverell merged 16 commits into
mainfrom
feat/t89-summarycontext
May 23, 2026
Merged

feat(resourcecontext): summaryContext on list_resources + search hits (T8+T9)#722
nadaverell merged 16 commits into
mainfrom
feat/t89-summarycontext

Conversation

@nadaverell
Copy link
Copy Markdown
Contributor

@nadaverell nadaverell commented May 17, 2026

Wave 2 / Phase 5 (T8 + T9 combined). Attaches a compact ResourceSummaryContext to every row of list_resources and search hits — three cheap fields (managedBy, health, issueCount) at ≤60 bytes/row that eliminate most per-row follow-up calls on agent journeys. Plus tightens AI list-handler RBAC, fixes group-disambiguation + cluster-scope correctness bugs, and centralises the per-row builder behind one shared package.

What lands

pkg/resourcecontext/summary.goBuildSummary(obj, opts) *ResourceSummaryContext with deterministic per-kind heuristics:

  • managedBy: reads Relationships.ManagedBy[0] (T23-synthesized) with Relationships.Owner as fallback. Dropped the local Deployment-grandparent shortcut.
  • health: tiny per-kind rollup matching the UI's status-tone scheme. Deployment + ReplicaSet use Spec.Replicas (desired) not Status.Replicas (current) — the latter inflates during scale-down and falsely reports degraded.
  • issueCount: total issues at the per-resource bucket.

ManagedByFromOwner classifies source as argocd / flux / helm (native HelmRelease + empty group) / native.

The wire type is ResourceSummaryContext (renamed from SummaryContext) — explicitly the row-tier companion of ResourceContext. JSON tag stays summaryContext so external consumers see no wire shape change.

Shared core: internal/summarycontext

REST and MCP previously duplicated ~200 LoC of identical logic (issueIndex + count + issueIndexKey, canonicalSingular, buildIssueIndex, managedByFromRelationships, the dispatch closure). Lifted into one shared package. REST and MCP wrappers (~30 LoC each) supply the protocol-specific topology source — REST passes s.broadcaster.GetCachedTopology(), MCP supplies a 5-second memoized build — and delegate to summarycontext.BuilderFromIndexes. Eliminates the maintenance hazard of two copies drifting.

The four attach helpers (groupFromObject, groupFromUnstructured, AttachToTypedList, AttachToUnstructuredList) also live in this package — REST and MCP both call the same implementation rather than carrying byte-identical local copies.

Search dual-index dispatch

A single search query can return both namespaced Pods and cluster-scoped Nodes. A single issue index can't be right for both: scoped to user's namespaces it zeros issueCount on Node/PV/cluster CRD hits (whose issues live at namespace=""); composed cluster-wide it would leak rows the namespace-restricted user shouldn't see. Fix: build TWO indexes per request — namespacedIdx scoped to scanNamespaces, clusterIdx composed cluster-wide. Closure dispatches per-hit via k8s.ClassifyKindScope(kind, group). Skipped when scanNamespaces == nil (one Compose suffices).

Search builder runs after truncation

SummaryBuilder previously ran inside buildHit for every matched candidate, BEFORE sort + truncate to opts.Limit. Broad queries paid topology lookups for thousands of matches, shipped 50. Refactored to buffer hits with their source obj in a private pendingHit slice, sort and truncate THAT, then call SummaryBuilder only on the kept hits. Dead summaryBuilder parameter dropped from buildHit for clarity.

Group-aware AI list routing

handleAIListResources was calling FetchResourceList(kind, namespaces) first, ignoring the ?group=… query param. For services?group=serving.knative.dev, this returned core Services. Now short-circuits to aiListDynamic when group != "", matching the GET-handler pattern in PR #721. Same shape applied to MCP handleListResources.

RBAC preflight on AI list

Extracted Server.preflightResourceList helper. Both handleListResources (REST /api/resources/) and handleAIListResources call it BEFORE fetch. Closes the gap where the AI list path was missing cluster-scoped SAR, namespaces SAR, and per-namespace list secrets SAR. REST converts deny → 200 [] (preserves SPA non-leaking shape); AI returns explicit 403.

Cluster-scoped issueCount for cluster-scoped list rows

handleAIListResources for kind=nodes previously passed user's namespaced list to buildIssueIndex, filtering out Node issues at namespace="". Now detects cluster-scoped kinds via ClassifyKindScope (helper issueIndexNamespaces) and passes nil to the index builder. Same fix in MCP.

Issue source correctness

  • Group-aware key: issueIndexKey is now "group|kind|ns|name" so CRDs sharing kind+ns+name across API groups get independent counts.
  • Uncapped count: added issues.NoLimit = -1 sentinel; ComposeWithStats skips truncation when feeding the per-resource index.
  • Native Helm source: sourceForOwner returns "helm" for kind=HelmRelease, group="".
  • Problem.Group population: built-in workload sites in k8s.DetectProblems now set Group (apps/batch/autoscaling).
  • CRD scan perf: detectGenericCRDIssues now skips ListDynamic for GVRs whose kind isn't in f.Kinds upfront.
  • CRD plural no longer zeroes issueCount: BuildIssueIndex used to push the URL kind through CanonicalSingular into filters.Kinds on the Compose call. CanonicalSingular only knows built-in plurals, so a CRD listed by its plural (e.g. ArgoCD "applications") fell through unchanged and silently mismatched the canonical singular "Application" the issue engine emits — every CRD row's issueCount came back zero. Dropped the Kinds filter entirely: bucketing on issueIndexKey(group, kind, ns, name) is already correct since the lookup side runs the row's singular Kind through the same normalization. Pinned by TestBuildIssueIndex_CRDPlural_NonZeroCount.

MCP topology memoization

MCP has no broadcaster cache, so every list_resources / search previously paid a full topology build. Added a package-level summaryCtxTopoMemo (5s TTL — matches REST broadcaster cadence) so bursty agent traffic amortises the cost.

Wire-surface alignment with T6

Mirrors PR #721's speculative-surface trim of pkg/resourcecontext/types.go — drops ContextRef.Reason/Source/Confidence, the RefReason/RefSource enums, ResourceContext.Truncated, and two unused OmittedReason values. None of T89's consumers reference the deleted fields; this keeps the shared types in sync across the wave-2 PRs so they don't conflict on merge regardless of order. (Auto-dropped on rebase post-#724 merge — patch already upstream.)

Verification

go vet ./... clean. go build ./... clean. go test -count=1 ./... green across all root packages.

Out of scope

  • /api/resources/* UI primary endpoint — intentionally untouched; UI adopts opportunistically via T14.
  • Diagnostic-tier top[] lists on summary rows — T10's territory.

Note

Medium Risk
Adds a new per-row summaryContext enrichment and refactors list/search execution and RBAC gating; correctness depends on scope/group disambiguation and issue-index behavior, and it can affect performance on large clusters.

Overview
Adds a compact per-row summaryContext (managedBy, health, issueCount) to AI /api/ai/resources lists, REST /api/search hits, and MCP list_resources/search, with a context=none opt-out.

Introduces shared internal/summarycontext + pkg/resourcecontext builders to compute managed-by, health heuristics, and group-aware per-resource issue counts, including dual issue indexes for search (namespaced vs cluster-wide) and deferring summary computation until after search sort/limit truncation.

Fixes several correctness/perf gaps: group-qualified list routing (typed vs dynamic) for ambiguous kinds, AI list RBAC preflight shared with REST list (explicit 403 for AI), correct cluster-scoped issueCount (namespace=""), uncapped issue composition via issues.NoLimit, populates Problem.Group for built-ins, and skips CRD ListDynamic scans when a kind filter excludes the GVR.

Reviewed by Cursor Bugbot for commit 8ef09d8. Bugbot is set up for automated code reviews on this repo. Configure here.

@nadaverell nadaverell requested a review from hisco as a code owner May 17, 2026 07:51
Comment thread internal/mcp/summary_context.go Outdated
@nadaverell nadaverell force-pushed the feat/t89-summarycontext branch from 9a3e952 to 266e057 Compare May 17, 2026 09:07
Comment thread pkg/resourcecontext/summary.go Outdated
Comment thread internal/search/search.go Outdated
nadaverell added a commit that referenced this pull request May 17, 2026
…Helm classification

Three reviewer findings on PR #722:

1. issueIndexKey was group-blind: two CRDs sharing kind+ns+name in
   different API groups (Knative Service vs corev1 Service, two
   operators each shipping a "Cluster" CRD) collided on the same
   bucket and inherited each other's count. Threaded group through
   the builder + key — format is now "group|kind|ns|name", "|" is
   illegal in K8s API groups so it's a safe delimiter.

2. buildIssueIndex used Limit:MaxLimit (1000), so on >1000-issue
   clusters resources whose issues fell in the post-sort tail
   silently got issueCount=0. Picked Option B: added an explicit
   NoLimit sentinel in internal/issues. Fewer files touched than
   Option A (a separate CountByResource API), and the index is the
   only caller that wants the uncapped set — public /api/issues +
   issues_list keep their tight caps. ComposeStats.TotalMatched
   already reported the true total regardless of truncation.

3. sourceForOwner returned "native" for {Kind:"HelmRelease", Group:""}
   — but topology.detectManagedByFromMeta emits exactly that shape
   for native Helm installs (distinguished from Flux's HelmRelease
   CR which lives at helm.toolkit.fluxcd.io). Added a "helm" branch
   ahead of the group-based GitOps switch.

Tests:
- pkg/resourcecontext: native Helm classification.
- internal/server + internal/mcp: group-aware index (two CRDs same
  kind+ns+name), tail-of-MaxLimit overflow.

Signature change: summaryContextBuilder and search.SummaryBuilderFunc
now take `group` as a parameter. All callers updated to source group
from typed object GVK (SetTypeMeta path) or unstructured apiVersion.
Comment thread internal/mcp/summary_context.go Outdated
nadaverell added a commit that referenced this pull request May 17, 2026
…lter comment

Reviewer's important (#722): MCP list_resources and search build a full
topology snapshot on every invocation. REST reuses the broadcaster's
cached snapshot but MCP has no equivalent shared cache, so on a
multi-thousand-resource cluster every agent list/search paid a
multi-second topology build cost.

Fix: introduce a package-level topology.Memoizer (5s TTL — matches the
REST broadcaster's cadence) and route buildSummaryContextTopology
through it. Bursty agent traffic now amortizes the build cost over
many calls instead of paying it per request. Other MCP tools
(handleGetResource, get_neighborhood) still build inline; threading
them through is a separate follow-up.

Suggestion: also fix the misleading comment in
internal/server/summary_context.go that claimed "Sources are restricted
to 'problem' + 'condition'". Filters.Sources was never set — the
exclusion of audit/event sources comes from the false-by-default
IncludeAudit / IncludeEvents flags. Updated the comment to reference
the actual mechanism so future readers don't assume there's an explicit
Sources allowlist.
nadaverell added a commit that referenced this pull request May 18, 2026
…uncation

Two Bugbot findings on PR #722:

1. deriveHealth for Deployment and ReplicaSet compared ReadyReplicas against
   Status.Replicas (current pod count) instead of *Spec.Replicas (desired).
   During scale-down or rolling updates Status.Replicas can exceed Spec —
   surplus pods draining — and ReadyReplicas trails. The previous code
   reported "degraded" even when all DESIRED replicas were ready, which is
   the steady-state-healthy condition. StatefulSet/DaemonSet cases were
   already correct. Two new regression tests pin the scale-down scenario.

2. search.go::Search called SummaryBuilder inline inside buildHit for every
   matched candidate, BEFORE sort + truncate to opts.Limit. A broad query
   matching thousands of objects paid topology lookups for all of them,
   shipped at most Limit=50. Refactored to buffer hits + their source obj
   in a private pendingHit slice, sort and truncate that, then run
   SummaryBuilder only on the kept hits. The deferred topology lookups
   now scale with output size, not match size.
Comment thread internal/search/search.go Outdated
Comment thread internal/search/search.go
Comment thread internal/mcp/tools.go Outdated
Comment thread internal/server/ai_handlers.go Outdated
Comment thread internal/summarycontext/summarycontext.go
nadaverell added a commit that referenced this pull request May 18, 2026
…Helm classification

Three reviewer findings on PR #722:

1. issueIndexKey was group-blind: two CRDs sharing kind+ns+name in
   different API groups (Knative Service vs corev1 Service, two
   operators each shipping a "Cluster" CRD) collided on the same
   bucket and inherited each other's count. Threaded group through
   the builder + key — format is now "group|kind|ns|name", "|" is
   illegal in K8s API groups so it's a safe delimiter.

2. buildIssueIndex used Limit:MaxLimit (1000), so on >1000-issue
   clusters resources whose issues fell in the post-sort tail
   silently got issueCount=0. Picked Option B: added an explicit
   NoLimit sentinel in internal/issues. Fewer files touched than
   Option A (a separate CountByResource API), and the index is the
   only caller that wants the uncapped set — public /api/issues +
   issues_list keep their tight caps. ComposeStats.TotalMatched
   already reported the true total regardless of truncation.

3. sourceForOwner returned "native" for {Kind:"HelmRelease", Group:""}
   — but topology.detectManagedByFromMeta emits exactly that shape
   for native Helm installs (distinguished from Flux's HelmRelease
   CR which lives at helm.toolkit.fluxcd.io). Added a "helm" branch
   ahead of the group-based GitOps switch.

Tests:
- pkg/resourcecontext: native Helm classification.
- internal/server + internal/mcp: group-aware index (two CRDs same
  kind+ns+name), tail-of-MaxLimit overflow.

Signature change: summaryContextBuilder and search.SummaryBuilderFunc
now take `group` as a parameter. All callers updated to source group
from typed object GVK (SetTypeMeta path) or unstructured apiVersion.
@nadaverell nadaverell force-pushed the feat/t89-summarycontext branch from ea43c4e to cecc4f4 Compare May 18, 2026 15:23
nadaverell added a commit that referenced this pull request May 18, 2026
…lter comment

Reviewer's important (#722): MCP list_resources and search build a full
topology snapshot on every invocation. REST reuses the broadcaster's
cached snapshot but MCP has no equivalent shared cache, so on a
multi-thousand-resource cluster every agent list/search paid a
multi-second topology build cost.

Fix: introduce a package-level topology.Memoizer (5s TTL — matches the
REST broadcaster's cadence) and route buildSummaryContextTopology
through it. Bursty agent traffic now amortizes the build cost over
many calls instead of paying it per request. Other MCP tools
(handleGetResource, get_neighborhood) still build inline; threading
them through is a separate follow-up.

Suggestion: also fix the misleading comment in
internal/server/summary_context.go that claimed "Sources are restricted
to 'problem' + 'condition'". Filters.Sources was never set — the
exclusion of audit/event sources comes from the false-by-default
IncludeAudit / IncludeEvents flags. Updated the comment to reference
the actual mechanism so future readers don't assume there's an explicit
Sources allowlist.
nadaverell added a commit that referenced this pull request May 18, 2026
…uncation

Two Bugbot findings on PR #722:

1. deriveHealth for Deployment and ReplicaSet compared ReadyReplicas against
   Status.Replicas (current pod count) instead of *Spec.Replicas (desired).
   During scale-down or rolling updates Status.Replicas can exceed Spec —
   surplus pods draining — and ReadyReplicas trails. The previous code
   reported "degraded" even when all DESIRED replicas were ready, which is
   the steady-state-healthy condition. StatefulSet/DaemonSet cases were
   already correct. Two new regression tests pin the scale-down scenario.

2. search.go::Search called SummaryBuilder inline inside buildHit for every
   matched candidate, BEFORE sort + truncate to opts.Limit. A broad query
   matching thousands of objects paid topology lookups for all of them,
   shipped at most Limit=50. Refactored to buffer hits + their source obj
   in a private pendingHit slice, sort and truncate that, then run
   SummaryBuilder only on the kept hits. The deferred topology lookups
   now scale with output size, not match size.
nadaverell added a commit that referenced this pull request May 18, 2026
…Helm classification

Three reviewer findings on PR #722:

1. issueIndexKey was group-blind: two CRDs sharing kind+ns+name in
   different API groups (Knative Service vs corev1 Service, two
   operators each shipping a "Cluster" CRD) collided on the same
   bucket and inherited each other's count. Threaded group through
   the builder + key — format is now "group|kind|ns|name", "|" is
   illegal in K8s API groups so it's a safe delimiter.

2. buildIssueIndex used Limit:MaxLimit (1000), so on >1000-issue
   clusters resources whose issues fell in the post-sort tail
   silently got issueCount=0. Picked Option B: added an explicit
   NoLimit sentinel in internal/issues. Fewer files touched than
   Option A (a separate CountByResource API), and the index is the
   only caller that wants the uncapped set — public /api/issues +
   issues_list keep their tight caps. ComposeStats.TotalMatched
   already reported the true total regardless of truncation.

3. sourceForOwner returned "native" for {Kind:"HelmRelease", Group:""}
   — but topology.detectManagedByFromMeta emits exactly that shape
   for native Helm installs (distinguished from Flux's HelmRelease
   CR which lives at helm.toolkit.fluxcd.io). Added a "helm" branch
   ahead of the group-based GitOps switch.

Tests:
- pkg/resourcecontext: native Helm classification.
- internal/server + internal/mcp: group-aware index (two CRDs same
  kind+ns+name), tail-of-MaxLimit overflow.

Signature change: summaryContextBuilder and search.SummaryBuilderFunc
now take `group` as a parameter. All callers updated to source group
from typed object GVK (SetTypeMeta path) or unstructured apiVersion.
nadaverell added a commit that referenced this pull request May 18, 2026
…lter comment

Reviewer's important (#722): MCP list_resources and search build a full
topology snapshot on every invocation. REST reuses the broadcaster's
cached snapshot but MCP has no equivalent shared cache, so on a
multi-thousand-resource cluster every agent list/search paid a
multi-second topology build cost.

Fix: introduce a package-level topology.Memoizer (5s TTL — matches the
REST broadcaster's cadence) and route buildSummaryContextTopology
through it. Bursty agent traffic now amortizes the build cost over
many calls instead of paying it per request. Other MCP tools
(handleGetResource, get_neighborhood) still build inline; threading
them through is a separate follow-up.

Suggestion: also fix the misleading comment in
internal/server/summary_context.go that claimed "Sources are restricted
to 'problem' + 'condition'". Filters.Sources was never set — the
exclusion of audit/event sources comes from the false-by-default
IncludeAudit / IncludeEvents flags. Updated the comment to reference
the actual mechanism so future readers don't assume there's an explicit
Sources allowlist.
nadaverell added a commit that referenced this pull request May 18, 2026
…uncation

Two Bugbot findings on PR #722:

1. deriveHealth for Deployment and ReplicaSet compared ReadyReplicas against
   Status.Replicas (current pod count) instead of *Spec.Replicas (desired).
   During scale-down or rolling updates Status.Replicas can exceed Spec —
   surplus pods draining — and ReadyReplicas trails. The previous code
   reported "degraded" even when all DESIRED replicas were ready, which is
   the steady-state-healthy condition. StatefulSet/DaemonSet cases were
   already correct. Two new regression tests pin the scale-down scenario.

2. search.go::Search called SummaryBuilder inline inside buildHit for every
   matched candidate, BEFORE sort + truncate to opts.Limit. A broad query
   matching thousands of objects paid topology lookups for all of them,
   shipped at most Limit=50. Refactored to buffer hits + their source obj
   in a private pendingHit slice, sort and truncate that, then run
   SummaryBuilder only on the kept hits. The deferred topology lookups
   now scale with output size, not match size.
@nadaverell nadaverell force-pushed the feat/t89-summarycontext branch from 7de5afd to feb0073 Compare May 18, 2026 19:45
…ch hits (T8+T9)

Add pkg/resourcecontext.BuildSummary — a tiny per-row enrichment helper that
returns ManagedBy + Health + IssueCount (≤ ~60 bytes/row). Wire it through:

- REST /api/ai/resources/{kind} list handler (handleAIListResources +
  aiListDynamic)
- MCP list_resources tool (handleListResources + listDynamicResources)
- REST /api/search executor (Hit.SummaryContext)
- MCP search tool (handleSearch)

Each handler builds the per-namespace issue index (via
internal/issues.Compose) and topology snapshot once per request, then
invokes BuildSummary per row. pkg/resourcecontext stays free of
internal/* and pkg/topology imports — callers pre-compute ManagedBy
via the new ManagedByFromOwner helper.

All four surfaces honor a context=none opt-out (query param for REST,
input arg for MCP) that returns bare rows.

Tests:

- Golden-file BuildSummary across Pod phases, Deployment replica states,
  NetworkPolicy (no health heuristic), CRD Ready/Available conditions,
  Health override, nil-object safety.
- ManagedByFromOwner source classification (argocd / flux / native).
- attach* helpers wired end-to-end with stub builders; defensive length-
  mismatch handling.
- managedByFromRelationships prefers Deployment grandparent shortcut over
  noisy ReplicaSet owner.
- Search executor invokes SummaryBuilder per kept hit when set, leaves
  SummaryContext nil when unset.
…+ shared index)

Now that T23 (#720) merged the server-synthesized Relationships.ManagedBy
chain and the RelationshipsIndex inverted-edges lookup, the per-row
SummaryContext builder no longer needs to re-detect ownership and no
longer pays O(E) per call.

- Read rel.ManagedBy[0] for the ManagedByRef (ArgoCD > Flux > Helm >
  topmost K8s owner). Falls back to rel.Owner when synthesis declines.
  Drops the local Deployment-grandparent shortcut — ManagedBy walks past
  the noisy hash-suffixed ReplicaSet for us.
- Build topology.IndexByResource(topo) once per request and thread it
  into GetRelationshipsWithObject on every row. Without the index, the
  list_resources + search hot paths were O(N x E).
- Use GetRelationshipsWithObject (not -WithIndex) so synthesis is
  group-aware when the row has the typed/unstructured object handy —
  avoids kind/plural collisions like Knative Service vs corev1 Service.
- Update server tests to pin the ManagedBy preference and the new
  Owner fallback shape.
…Helm classification

Three reviewer findings on PR #722:

1. issueIndexKey was group-blind: two CRDs sharing kind+ns+name in
   different API groups (Knative Service vs corev1 Service, two
   operators each shipping a "Cluster" CRD) collided on the same
   bucket and inherited each other's count. Threaded group through
   the builder + key — format is now "group|kind|ns|name", "|" is
   illegal in K8s API groups so it's a safe delimiter.

2. buildIssueIndex used Limit:MaxLimit (1000), so on >1000-issue
   clusters resources whose issues fell in the post-sort tail
   silently got issueCount=0. Picked Option B: added an explicit
   NoLimit sentinel in internal/issues. Fewer files touched than
   Option A (a separate CountByResource API), and the index is the
   only caller that wants the uncapped set — public /api/issues +
   issues_list keep their tight caps. ComposeStats.TotalMatched
   already reported the true total regardless of truncation.

3. sourceForOwner returned "native" for {Kind:"HelmRelease", Group:""}
   — but topology.detectManagedByFromMeta emits exactly that shape
   for native Helm installs (distinguished from Flux's HelmRelease
   CR which lives at helm.toolkit.fluxcd.io). Added a "helm" branch
   ahead of the group-based GitOps switch.

Tests:
- pkg/resourcecontext: native Helm classification.
- internal/server + internal/mcp: group-aware index (two CRDs same
  kind+ns+name), tail-of-MaxLimit overflow.

Signature change: summaryContextBuilder and search.SummaryBuilderFunc
now take `group` as a parameter. All callers updated to source group
from typed object GVK (SetTypeMeta path) or unstructured apiVersion.
…em.Group for built-ins

Extracts preflightResourceList as a shared helper between handleListResources
and handleAIListResources. The AI list path previously skipped the
cluster-scoped SAR (Node, PVs, cluster-scoped CRDs), the list-namespaces
SAR (kind=namespaces), and the per-namespace / cluster-wide list-secrets
SAR — bypassing the gates the REST list path enforces. Same gates now run
in the same order on both paths; REST keeps its 200-with-[] legacy shape
for denies, AI returns the explicit 403 so agents see the failure instead
of confusing "empty cluster" output.

Populates Problem.Group on all built-in DetectProblems sites (Deployment,
StatefulSet, DaemonSet → "apps"; HPA → "autoscaling"; Job, CronJob →
"batch"). The new group-aware summary_context index keys per-resource
issue counts as "group|kind|ns|name", so empty Group was zeroing
issueCount for every broken workload — a regression vs the pre-group-aware
behavior. CAPI sites already set Group, verified.

Tests:
- internal/server/ai_handlers_rbac_test.go: 403 on per-namespace secrets
  deny, cluster-scope nodes deny, list-namespaces deny; happy-path 200
  asserts summaryContext envelope.
- internal/k8s/problems_test.go: fake clientset with broken Deployment +
  StatefulSet + DaemonSet + maxed HPA + stuck Job, asserts each emitted
  Problem.Group matches its canonical API group.
…lter comment

Reviewer's important (#722): MCP list_resources and search build a full
topology snapshot on every invocation. REST reuses the broadcaster's
cached snapshot but MCP has no equivalent shared cache, so on a
multi-thousand-resource cluster every agent list/search paid a
multi-second topology build cost.

Fix: introduce a package-level topology.Memoizer (5s TTL — matches the
REST broadcaster's cadence) and route buildSummaryContextTopology
through it. Bursty agent traffic now amortizes the build cost over
many calls instead of paying it per request. Other MCP tools
(handleGetResource, get_neighborhood) still build inline; threading
them through is a separate follow-up.

Suggestion: also fix the misleading comment in
internal/server/summary_context.go that claimed "Sources are restricted
to 'problem' + 'condition'". Filters.Sources was never set — the
exclusion of audit/event sources comes from the false-by-default
IncludeAudit / IncludeEvents flags. Updated the comment to reference
the actual mechanism so future readers don't assume there's an explicit
Sources allowlist.
Three correctness/perf fixes on the summaryContext list path:

1. Cluster-scoped issues silently filtered out for cluster-scoped rows.
   handleAIListResources / handleListResources passed the user's
   namespaced-access set straight into the issue index. Cluster-scoped
   issues (Node, PV, cluster-scoped CRDs) live at namespace="" — the
   namespaced filter dropped them all, zeroing issueCount on every Node
   row even when the user had cluster-scoped Node access. Now: detect
   cluster-scoped kinds via k8s.ClassifyKindScope and pass nil for the
   issue index (cluster-wide compose). RBAC was already enforced upstream.

2. detectGenericCRDIssues scanned every watched CRD before kindFilter
   applied. A pods-only list_resources request still iterated every
   watched CRD GVR and called ListDynamic on each; applyFilters then
   discarded the rows. On clusters with hundreds of CRDs this dominated
   the per-row issue index build. Push f.Kinds awareness down — match
   the GVR's kind via p.KindForGVR (lowercase comparison mirrors
   applyFilters) and skip the ListDynamic call entirely for
   non-matching GVRs.

3. Trailing blank line at EOF in internal/mcp/summary_context.go.

Tests:
- internal/server: issueIndexNamespaces helper drops the namespace
  filter for cluster-scoped kinds and preserves it for namespaced kinds.
  buildIssueIndex end-to-end: a Node issue at namespace="" surfaces
  when the index is built without a namespace filter and disappears
  when a namespace slice is passed (matches CacheProvider's per-ns walk).
- internal/mcp: mirror of the server test against the MCP buildIssueIndex.
- internal/issues: countingProvider asserts detectGenericCRDIssues does
  NOT call ListDynamic for GVRs whose kind is filtered out, and DOES
  call it for matching GVRs and when Kinds is empty.

Fake provider's DetectProblems in both server + MCP tests updated to
mirror CacheProvider.flattenNamespacedProblems (drop cluster-scoped
rows under a namespace filter) so the regression test pins the actual
production behavior.
…ting

Two residual findings on the T89 work:

1. Search summaryContext dropped issueCount for cluster-scoped hits.
   handleSearch passed scanNamespaces (a strictly-namespaced filter)
   into the single issue index used for every hit. Search returns
   MIXED-kind hits (a query can return both namespaced Pods and
   cluster-scoped Nodes), and cluster-scoped problems live at
   namespace="" — the per-namespace filter dropped them, silently
   zeroing issueCount on every Node/PV/cluster-scoped-CRD hit.

   Fix: dual issue index per search request. namespacedIdx is scoped
   to scanNamespaces; clusterIdx is composed cluster-wide (nil filter)
   so namespace="" issues surface. The summaryContextBuilder closure
   dispatches per-hit via k8s.ClassifyKindScope(kind, group), so a
   single response correctly counts both Pod and Node issues.
   CanReadClusterScoped already gates which cluster-scoped kinds the
   user can see — the cluster-wide index doesn't expose unauthorized
   rows. Mirrored on the MCP search path.

2. AI list with group ignored group and tried typed cache first.
   handleAIListResources called k8s.FetchResourceList(cache, kind,
   namespaces) before consulting group, so kind=services&group=
   serving.knative.dev silently returned core Services. Same fix
   shape PR #721 applied to GET: group != "" short-circuits straight
   to aiListDynamic. Mirrored on MCP handleListResources.

Tests:
- server: TestSummaryContextBuilderFromIndexes_DispatchesByScope and
  TestNewSearchSummaryContextBuilder_BuildsDualIndex pin the per-hit
  scope-based dispatch and the dual-index construction shape.
- server: TestAI_ListServices_WithGroup_RoutesToDynamicCache pins the
  group routing on /api/ai/resources/{kind} (typed core Service for
  no group; dynamic-cache path for group=serving.knative.dev).
- mcp: same two builder tests plus
  TestHandleListResources_GroupRoutesToDynamic for the MCP list path.
…uncation

Two Bugbot findings on PR #722:

1. deriveHealth for Deployment and ReplicaSet compared ReadyReplicas against
   Status.Replicas (current pod count) instead of *Spec.Replicas (desired).
   During scale-down or rolling updates Status.Replicas can exceed Spec —
   surplus pods draining — and ReadyReplicas trails. The previous code
   reported "degraded" even when all DESIRED replicas were ready, which is
   the steady-state-healthy condition. StatefulSet/DaemonSet cases were
   already correct. Two new regression tests pin the scale-down scenario.

2. search.go::Search called SummaryBuilder inline inside buildHit for every
   matched candidate, BEFORE sort + truncate to opts.Limit. A broad query
   matching thousands of objects paid topology lookups for all of them,
   shipped at most Limit=50. Refactored to buffer hits + their source obj
   in a private pendingHit slice, sort and truncate that, then run
   SummaryBuilder only on the kept hits. The deferred topology lookups
   now scale with output size, not match size.
After the deferred-summary refactor, every buildHit call site passes nil
for summaryBuilder; the actual attachment happens in Search's post-
truncation loop. The parameter + the `if summaryBuilder != nil` branch
inside buildHit are dead code and a maintenance trap — a future caller
passing non-nil would defeat the post-truncation optimization the
refactor exists to enable.

Drop the parameter, remove the inner branch, and tighten the doc to
explicitly say SummaryContext attachment is Search's responsibility
after sort + Limit truncation. No behavior change; existing tests cover
both code paths.
Commit 9ba47cb landed a partial rename of SummaryContext →
ResourceSummaryContext and ManagedByRef.Ref → ResourceSummaryRef in
summary.go / summary_test.go / search.go, but types.go still defines
the flat SummaryContext + ManagedByRef shape that every consumer in
internal/server, internal/mcp, internal/search, and pkg/ai/context
references. The half-applied rename left the tree in a broken-build
state — `go build ./...` failed with `undefined: ResourceSummaryContext`
and `unknown field Ref in struct literal of type ManagedByRef`.

Restore the original SummaryContext / ManagedByRef shape in summary.go
+ search.go + the summary_test.go golden. Spec.Replicas health fix from
9ba47cb stays.
The REST and MCP SummaryContext builders carried ~200 LoC of duplicated
implementation — the issue-index map, group-aware key arithmetic,
canonicalSingular kind normalization, BuildIssueIndex compose loop,
managedByFromRelationships extraction, and the per-row dispatch
closure body that picks namespaced vs cluster-wide index by scope. The
only substantive difference was the topology source (REST: broadcaster
cache; MCP: 5s memoizer).

Lift the shared core into a new internal/summarycontext package that
exports:
- Builder (the per-request closure type)
- IssueIndex with Count(group, kind, ns, name) int
- CanonicalSingular(kind) string
- BuildIssueIndex(provider, namespaces, kindFilter) IssueIndex
- ManagedByFromRelationships(rel) *ManagedByRef
- BuilderFromIndexes(topo, namespacedIdx, clusterIdx) Builder

BuilderFromIndexes takes the topology snapshot as an argument so REST
and MCP keep their respective sources unchanged — REST passes
s.broadcaster.GetCachedTopology(), MCP passes the memoized build.

internal/server/summary_context.go and internal/mcp/summary_context.go
shrink to thin wrappers that own their topology source and forward
everything else to the shared package. The MCP file additionally keeps
its local summaryCtxTopoMemo + buildSummaryContextTopology helpers
(MCP-specific, no analogue on the REST side).

Pure-function tests (issue-index key arithmetic, BuildIssueIndex over a
fake provider, CanonicalSingular, ManagedByFromRelationships, the
dual-index dispatch shape) move into the new package alongside the
shared code. The REST-side wiring tests (attachSummaryContextToList
end-to-end, issueIndexNamespaces dispatch) stay in internal/server
since they exercise handler plumbing. The MCP-side test file was
pure-function-only and is deleted outright — coverage now lives in
internal/summarycontext.

Net: ~390 LoC dropped from the tree; no wire-shape or behavior changes.
…text

Telegraph the relationship to ResourceContext: this is the row-tier
companion of the detail-tier ResourceContext. The previous "SummaryContext"
name implied a different concept; the new name makes the role explicit
(row-tier enrichment on lists + search hits).

ManagedByRef stays flat — only the type rename plus the field reference
chase across server / mcp / search / summarycontext / ai-context callers.
Wire JSON tag is unchanged ("summaryContext") so external consumers see
no shape difference.
groupFromObject, groupFromUnstructured, and the two
attachResourceSummaryContext* helpers were byte-identical between
internal/server/ai_handlers.go and internal/mcp/tools.go. The
internal/summarycontext package exists explicitly to centralize
shared per-row builder logic — moved all four there as
GroupFromObject, GroupFromUnstructured, AttachToTypedList, and
AttachToUnstructuredList.

Eliminates the maintenance hazard of two copies drifting. REST and
MCP wrappers now share one source of truth alongside the existing
Builder type.
…ount

BuildIssueIndex used to set filters.Kinds = [CanonicalSingular(kindFilter)]
on the Compose call. CanonicalSingular only knows built-in plurals, so
a CRD listed by its plural form (e.g. ArgoCD "applications") fell
through unchanged. Compose's case-insensitive Kind filter then failed
against the singular "Application" the issue engine emits, and every
CRD row's issueCount silently became 0. The MCP tool description
encourages plural forms ("e.g. pods, deployments"), so this hit the
common path for agents inspecting CRDs.

Fix: drop the Kinds filter entirely. The per-row bucketing in
issueIndexKey(group, kind, ns, name) already discriminates correctly
because the lookup side runs the row's singular Kind through
CanonicalSingular too — index and query agree without needing a
Compose-time filter. Per-namespace issue volumes are tiny (~tens to
low hundreds), so the extra bucket work is negligible.

Removed `kindFilter` from BuildIssueIndex's signature and from both
newResourceSummaryContextBuilder wrappers (REST + MCP) since neither
needs it anymore. Added TestBuildIssueIndex_CRDPlural_NonZeroCount to
pin the contract.
…vider

T11 (#723) added KyvernoFindings + KyvernoStatus to the issues.Provider
interface so the composer can route PolicyReport findings into the
unified issue stream and surface index-lifecycle status. Our test
fake didn't implement either, so summarycontext tests stopped
compiling after the rebase onto post-T11 main. Returning nil/"" is
correct for these tests: BuildIssueIndex doesn't read Kyverno
findings (kindFilter was dropped in the prior commit; the index
buckets problem+condition only), and KyvernoStatus is consumed by
the issues meta block — not by the index path under test.
@nadaverell nadaverell force-pushed the feat/t89-summarycontext branch from feb0073 to 8ef09d8 Compare May 21, 2026 12:16
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8ef09d8. Configure here.

// still resolves.
func BuilderFromIndexes(topo *topology.Topology, namespacedIdx, clusterIdx IssueIndex) Builder {
resourceProvider := k8s.NewTopologyResourceProvider(k8s.GetResourceCache())
dynamicProvider := k8s.NewTopologyDynamicProvider(k8s.GetDynamicResourceCache(), k8s.GetResourceDiscovery())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package-level global captures potentially nil cache at build time

Medium Severity

BuilderFromIndexes eagerly calls k8s.GetResourceCache() and k8s.GetDynamicResourceCache() at closure-construction time to create resourceProvider and dynamicProvider. These are captured by the returned closure. If the resource cache is nil at construction time but becomes available later (or vice versa), the closure captures a stale nil provider. More critically, k8s.NewTopologyResourceProvider(nil) is called when cache is nil — this provider is then passed to topology.GetRelationshipsWithObject inside the closure whenever topo != nil, which could panic or produce incorrect results depending on how the provider handles a nil inner cache.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8ef09d8. Configure here.

builder := topology.NewBuilder(k8s.NewTopologyResourceProvider(cache)).
WithDynamic(k8s.NewTopologyDynamicProvider(k8s.GetDynamicResourceCache(), k8s.GetResourceDiscovery()))
return builder.Build(opts)
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MCP topology memoizer ignores namespace differences in cache key

Medium Severity

buildSummaryContextTopology sets opts.Namespaces based on the caller's namespace list, then passes opts as the cache key to summaryCtxTopoMemo.Get. If the memoizer keys by the BuildOptions struct and the namespace slices differ between requests (e.g., user A lists in ["prod"], user B lists in ["staging"]), the first request's cached topology — scoped to user A's namespaces — may be returned to user B within the 5-second TTL. This would cause managedBy resolution to use topology data from the wrong namespace scope, potentially returning incorrect or missing relationship pointers for the second user.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8ef09d8. Configure here.

@nadaverell nadaverell merged commit 58e0bc1 into main May 23, 2026
8 of 9 checks passed
@nadaverell nadaverell deleted the feat/t89-summarycontext branch May 23, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant