Generic metrics and alerts for the cluster data conflict (PVC/VM) - cherry-pick PR #1974 #2020

Open: 1 commit to be merged into base: main

pruthvitd
Member

#1974 (comment)

Summary:

This PR enhances the Cluster Data Conflict alerting system by introducing severity-based alerts and direct links to the OpenShift Console for better debugging and faster resolution.

Changes Introduced:

  1. Severity-Based Alerts:

    • Warning (severity: warning) → Triggered when ramen_cluster_data_conflict == 1 (Secondary Conflict).
    • Critical (severity: critical) → Triggered when ramen_cluster_data_conflict == 2 (Primary Conflict).
  2. Direct Console Links:

    • Alerts now dynamically construct a clickable OpenShift Console link pointing to the specific DRPlacementControl resource.
    • Example generated link:
      https://console-openshift.example.com/k8s/ns/<namespace>/ramendr.openshift.io~v1alpha1~DRPlacementControl/<drpc_name>
      
    • This lets users navigate directly to the DRPC without searching for it manually.
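
The two changes above can be sketched in a few lines of Python. This is only an illustration of the logic described in the summary, not the alert rules added by this PR (those would normally be Prometheus rule definitions). The function names `severity_for` and `drpc_console_url` and the console base URL are hypothetical; the value-to-severity mapping and the URL path shape are taken from the description.

```python
from typing import Optional
from urllib.parse import quote

# Mapping as stated in the PR description:
# 1 = secondary conflict (warning), 2 = primary conflict (critical).
SEVERITY_BY_VALUE = {1: "warning", 2: "critical"}

# Resource segment as it appears in the example link above.
DRPC_RESOURCE = "ramendr.openshift.io~v1alpha1~DRPlacementControl"


def severity_for(conflict_value: int) -> Optional[str]:
    """Return the alert severity for a ramen_cluster_data_conflict value.

    Returns None when the value does not correspond to an alert.
    """
    return SEVERITY_BY_VALUE.get(conflict_value)


def drpc_console_url(console_base: str, namespace: str, drpc_name: str) -> str:
    """Build a console link to one DRPlacementControl (hypothetical helper)."""
    base = console_base.rstrip("/")
    ns = quote(namespace, safe="")
    name = quote(drpc_name, safe="")
    return f"{base}/k8s/ns/{ns}/{DRPC_RESOURCE}/{name}"


if __name__ == "__main__":
    print(severity_for(2))
    print(drpc_console_url("https://console-openshift.example.com",
                           "ramen-ops", "vm-cirros-dr"))
```

The namespace and name are percent-encoded defensively; valid Kubernetes names pass through unchanged.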

@pruthvitd pruthvitd marked this pull request as ready for review April 30, 2025 09:39
@pruthvitd pruthvitd requested a review from rakeshgm April 30, 2025 09:46
@pruthvitd
Member Author

Validated upstream using Prometheus:

1. The metric is 1 and the alert does not fire while replication of the protected PVC is succeeding:

(ramen) [root@ramen1 ramen]# curl http://localhost:9090/metrics -s | grep "workload_protection_status"
Handling connection for 9090
# HELP ramen_workload_protection_status Status regarding workload protection health
# TYPE ramen_workload_protection_status gauge
ramen_workload_protection_status{obj_name="vm-cirros-dr",obj_namespace="ramen-ops",obj_type="DRPlacementControl"} 1

2. The metric is 0 and the alert fires when a conflicting resource (VM/PVC) is deployed on the primary managed cluster, resulting in an error:

# oc get drpc vm-cirros-dr -n ramen-ops --context hub -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2025-05-06T07:45:59Z",
    "message": "Initial deployment completed",
    "observedGeneration": 1,
    "reason": "Deployed",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2025-05-06T07:45:59Z",
    "message": "Ready",
    "observedGeneration": 1,
    "reason": "Success",
    "status": "True",
    "type": "PeerReady"
  },
  {
    "lastTransitionTime": "2025-05-06T08:41:34Z",
    "message": "VolumeReplicationGroup (ramen-ops/vm-cirros-dr) on cluster dr1 is progressing on protecting workload resources (Kube objects capture has not started), retrying till ClusterDataProtected condition is met",
    "observedGeneration": 1,
    "reason": "Progressing",
    "status": "False",
    "type": "Protected"
  }
]
:
:
(ramen) [root@ramen1 ramen]# curl http://localhost:9090/metrics -s | grep "workload_protection_status"
Handling connection for 9090
# HELP ramen_workload_protection_status Status regarding workload protection health
# TYPE ramen_workload_protection_status gauge
ramen_workload_protection_status{obj_name="vm-cirros-dr",obj_namespace="ramen-ops",obj_type="DRPlacementControl"} 0
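
The manual check above (grep the metrics endpoint, then inspect the DRPC conditions) could be scripted roughly as follows. This is a minimal sketch, not part of the PR: `read_metric` and `condition_status` are hypothetical helpers, and the parser handles only simple sample lines of the shape shown in the output above, not the full Prometheus text exposition format.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches `name{labels} value` with an optional trailing integer timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def read_metric(text: str, metric: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for every sample of `metric` in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue  # e.g. "Handling connection for 9090"
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((labels, float(match.group("value"))))
    return samples


def condition_status(conditions: List[dict], cond_type: str) -> Optional[str]:
    """Return the status of the condition with the given type, if present."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None
```

With the output from step 2 as input, `read_metric(text, "ramen_workload_protection_status")` yields one sample with value 0.0 for `vm-cirros-dr`, and `condition_status(conditions, "Protected")` returns `"False"` for the conditions listed by `jq`.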
