You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe what happened:
We often receive a burst of "Unexpected error, query data not found in result" errors in our logs for various metric queries all at the same timestamp. This in turn generates FailedGetExternalMetric events in Kubernetes, which fire off alerts to the engineering teams responsible for the relevant metrics. The metrics_retriever code says "this should never happen": https://github.com/DataDog/datadog-agent/blob/7.55.x/pkg/clusteragent/autoscaling/externalmetrics/metrics_retriever.go#L190-L192 - however I suspect it occurs because there can be a partial failure scenario where some metrics are successful but others are not, and the code only assumes global (ie total) errors will occur.
In order to investigate the issue further, I deployed a custom build of Datadog Cluster Agent with the following patch to our staging environment, which revealed the partial failures were due to rate-limiting:
API error 429 Too Many Requests: {"status":"error","code":429,"errors":["Too many requests"],"statuspage":"http://status.datadoghq.com","twitter":"http://twitter.com/datadogops","email":"[email protected]"}
Describe what you expected:
Datadog Cluster Agent logs the error it receives from the API in the case of partial failures, and gracefully handles this condition by retrying later without raising a FailedGetExternalMetric (or if it does, with the reason being rate limiting so we can route it differently).
Steps to reproduce the issue:
Unknown, as it requires a combination of a successful metric query and a server error. Outside of Datadog, can likely only be reproduced in test.
Agent Environment
Datadog Cluster Agent v0.55.3
Describe what happened:
We often receive a burst of "Unexpected error, query data not found in result" errors in our logs for various metric queries all at the same timestamp. This in turn generates FailedGetExternalMetric events in Kubernetes, which fire off alerts to the engineering teams responsible for the relevant metrics. The metrics_retriever code says "this should never happen": https://github.com/DataDog/datadog-agent/blob/7.55.x/pkg/clusteragent/autoscaling/externalmetrics/metrics_retriever.go#L190-L192 - however I suspect it occurs because there can be a partial failure scenario where some metrics are successful but others are not, and the code only assumes global (ie total) errors will occur.
In order to investigate the issue further, I deployed a custom build of Datadog Cluster Agent with the following patch to our staging environment, which revealed the partial failures were due to rate-limiting:
Describe what you expected:
Datadog Cluster Agent logs the error it receives from the API in the case of partial failures, and gracefully handles this condition by retrying later without raising a FailedGetExternalMetric (or if it does, with the reason being rate limiting so we can route it differently).
Steps to reproduce the issue:
Unknown, as it requires a combination of a successful metric query and a server error. Outside of Datadog, can likely only be reproduced in test.
Additional environment details (Operating System, Cloud provider, etc):
Support Ticket: https://help.datadoghq.com/hc/en-us/requests/1808765
Operating System: Ubuntu 22.04
Kubernetes Version: v1.29.6
Cloud Provider: AWS
The text was updated successfully, but these errors were encountered: