Skip to content

feat(crs): add tracking rules for new custom resources #359

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Fral738
Copy link

@Fral738 Fral738 commented May 22, 2025

No description provided.

@Fral738 Fral738 self-assigned this May 22, 2025
@Fral738 Fral738 requested a review from ilya-lesikov May 22, 2025 09:51
Signed-off-by: Evgeniy Frolov <[email protected]>
@@ -48,6 +48,7 @@ rules:
failed:
- "False"
# ArgoCD Application
# https://github.com/argoproj/gitops-engine/blob/master/pkg/health/health.go#L12
- resourceGroup: "argoproj.io"
resourceKind: "Application"
jsonPath: "$.status.health.status"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You sure that's the best field? I also found this one, and these statuses do look much more clear and on point. But whatever you choose.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion!

I looked into status.operationState.phase, and while it indeed provides detailed insight into the current operation (e.g., sync or rollback), it’s not always present — it's only set during or immediately after an operation. Outside of that window, the field can be missing.

On the other hand, status.health.status is always available and reflects the overall state of the application from Argo CD’s perspective — including conditions like Healthy, Progressing, Degraded, and Missing. It’s also the field used by the Argo CD UI to display application status.

So for general status tracking across all phases of the application lifecycle (even when no operation is ongoing), I think status.health.status is the most reliable and consistent choice.

That said, using status.operationState.phase as an additional rule could be a good enhancement in the future for more granular operation tracking.

Let me know what you think!

- "Suspended"
failed:
- "Degraded"
- "Missing"
- "Unknown"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Unknown here? I mentioned, that there should be as little as possible false positives in failed. You sure Unknown really indicates an error? Same with Missing, and even Degraded to some extent... This is why I don't like this status.health.status field, very poor indicators, no clear meaning.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up and I appreciate your concern around false positives.

You're absolutely right that the semantics of status.health.status are a bit broad and can be vague. Here's how I approached it:

  • Unknown — According to the Argo CD health source, this indicates a failure to assess the health, not necessarily a failure of the resource itself. So technically, it's not always an error, but from an automation or alerting perspective, it means we can't reliably say things are working — which may justify treating it as failed, depending on the use case. Still, I'm open to moving it out of failed if we aim for minimal false positives.

  • Missing — This one is more concrete: it means the resource doesn't exist in the cluster. Unless the resource was intentionally deleted or part of a partial sync, this is likely a real problem — which is why I classified it under failed.

  • Degraded — Per the docs and code, this is used when health checks failed or the resource didn’t become healthy within a timeout. Again, this may or may not be fatal, but in practice, it's often a symptom of underlying issues (e.g. CrashLoopBackOff, failed rollout), so treating it as failed seems appropriate unless we're tuning for very optimistic reporting.

That said, I completely agree the health field is not perfect. It's a coarse indicator, and there's no strong signal for whether an app is actually broken or just not healthy yet. But it’s still the most consistently available field, which is why I chose to use it for now.

Let me know what you think — I can tune the mapping if we want to reduce the risk of false alarms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants