[BUG] MetricsRetriever does not handle partial metric retrieval failures gracefully #28497

adammw · 2024-08-15T08:17:46Z

Agent Environment
Datadog Cluster Agent v0.55.3

Describe what happened:
We often see bursts of "Unexpected error, query data not found in result" errors in our logs, covering various metric queries that all share the same timestamp. These in turn generate FailedGetExternalMetric events in Kubernetes, which fire alerts to the engineering teams responsible for the relevant metrics. The metrics_retriever code says "this should never happen": https://github.com/DataDog/datadog-agent/blob/7.55.x/pkg/clusteragent/autoscaling/externalmetrics/metrics_retriever.go#L190-L192. However, I suspect it occurs in a partial failure scenario, where some metric queries succeed and others do not, because the code assumes that only global (i.e. total) errors can occur.

To investigate further, I deployed a custom build of Datadog Cluster Agent with the following patch to our staging environment, which revealed that the partial failures were due to rate limiting:

API error 429 Too Many Requests: {"status":"error","code":429,"errors":["Too many requests"],"statuspage":"http://status.datadoghq.com","twitter":"http://twitter.com/datadogops","email":"[email protected]"}

Describe what you expected:
Datadog Cluster Agent should log the error it receives from the API when a partial failure occurs, and handle the condition gracefully by retrying later without raising a FailedGetExternalMetric event (or, if it does raise one, with rate limiting given as the reason so that we can route it differently).

Steps to reproduce the issue:
Unknown, as it requires a combination of a successful metric query and a server error. Outside of Datadog, it can likely only be reproduced in a test.
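
For what it's worth, the partial failure could probably be simulated with a fake API server. The sketch below uses Go's `net/http/httptest` to answer the first request with 200 and later ones with 429. `newFlakyServer`, `fetchBatches` and the `/query?batch=` path are hypothetical, and I am assuming the agent issues several requests per refresh; this does not exercise the real MetricsRetriever.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newFlakyServer returns a test server that answers the first okCount
// requests with 200 and every later request with 429 Too Many Requests.
func newFlakyServer(okCount int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) <= okCount {
			fmt.Fprint(w, `{"status":"ok"}`)
			return
		}
		w.WriteHeader(http.StatusTooManyRequests)
		fmt.Fprint(w, `{"status":"error","code":429,"errors":["Too many requests"]}`)
	}))
}

// fetchBatches issues one GET per batch, in order, and records the HTTP
// status returned for each batch.
func fetchBatches(baseURL string, batches []string) (map[string]int, error) {
	statuses := make(map[string]int, len(batches))
	for _, b := range batches {
		resp, err := http.Get(baseURL + "/query?batch=" + b)
		if err != nil {
			return statuses, err
		}
		// Drain and close the body so the connection can be reused.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		statuses[b] = resp.StatusCode
	}
	return statuses, nil
}

func main() {
	srv := newFlakyServer(1)
	defer srv.Close()

	batches := []string{"batch-1", "batch-2"}
	statuses, err := fetchBatches(srv.URL, batches)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	for _, b := range batches {
		fmt.Printf("%s -> HTTP %d\n", b, statuses[b])
	}
}
```

Pointing a retriever under test at a server like this should give one successful batch and one rate-limited batch in the same refresh, which is the combination described above.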

Additional environment details (Operating System, Cloud provider, etc):
Support Ticket: https://help.datadoghq.com/hc/en-us/requests/1808765
Operating System: Ubuntu 22.04
Kubernetes Version: v1.29.6
Cloud Provider: AWS

adammw added the team/triage label Aug 15, 2024
