Datadog Cluster Agent errors after release 7.35.0 #12110
I also observe this behaviour with cluster-agent 1.19.0 and agent 7.35.2 running on a GKE cluster with version 1.22.8. My Helm config looks like this:

```yaml
# Helm values for datadog agent chart
# Source: https://github.com/DataDog/helm-charts/blob/main/charts/datadog/values.yaml
datadog:
  apiKey: ${api_key}
  site: datadoghq.eu
  clusterName: ${cluster_name}
  logLevel: critical
  # kube-state-metrics doesn't seem to play nicely with 1.22
  # and is considered a legacy service which is not needed
  # going forward
  # https://docs.datadoghq.com/integrations/kubernetes_state_core
  kubeStateMetricsEnabled: false
  kubeStateMetricsCore:
    enabled: true
  logs:
    enabled: true
    containerCollectAll: true
  kubelet:
    tlsVerify: false
  env:
    # The logs from GKE system pods can be noisy and contain lots of errors.
    # We don't really care about them as we trust GKE will work.
    - name: DD_CONTAINER_EXCLUDE
      value: "kube_namespace:kube-system"
```
I'm seeing this as well. Is it causing your pods to crashloop?
I'm seeing this issue too; it started about 4 days ago. Any quick remediation steps?
It's not crashing the pods or anything like that, just printing the error in the logs. I am not sure what impact the error has; I have not noticed any differences so far.
We have one pod in a crashloop, caused by the trace-agent error. The other containers in the pod are fine.
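For anyone debugging the same thing, a rough sketch of how the failing container can be inspected; the `datadog` namespace and the pod name are placeholders, not details from the original comment:

```sh
# Sketch: check the previous logs of the crashing trace-agent container
# and the pod's restart reason. Namespace and pod name are assumptions.
kubectl -n datadog get pods
kubectl -n datadog logs <datadog-agent-pod> -c trace-agent --previous
kubectl -n datadog describe pod <datadog-agent-pod>   # look at Last State / exit reason
```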
Are we talking about the same error here, @sharonx? In my case the error is printed by
That is the error I am seeing. I had to increase the memory limit for my datadog-cluster-agent pod because it had been getting OOMKilled since the errors started appearing.
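In case it helps others, a minimal sketch of bumping that limit through the chart values; the `clusterAgent.resources.*` keys follow the chart's values layout and the sizes are placeholders to adjust:

```sh
# Sketch: raise the cluster agent memory request/limit via Helm.
# The 256Mi/512Mi sizes are assumptions; pick values that fit your cluster.
helm upgrade datadog datadog/datadog --reuse-values \
  --set clusterAgent.resources.requests.memory=256Mi \
  --set clusterAgent.resources.limits.memory=512Mi
```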
I see the same log in
I don't see any logs printed from the
Update on my own issue: I've confirmed that my issue is NOT related. It's a misconfiguration of k8s, and the actual cause of my issue is
Sorry for the confusion.
Memory consumption is less than half for me on both the cluster and standard agents, so all looks good, but the error prints quite often.
Seeing this as well. Between this and issue #11126 (the 1.20 release with the change is not out yet), the cluster agent is quite noisy at the moment 😄
Seeing this as well. Are there quick remediation steps?
Hello @sharonx, have you fixed this issue?
Any news about this issue?
Any update on this?
Are we just SOL here?
Any update?
Hi, I'm facing the same issue with the Datadog agent v7.55.1 running in ECS.
How did you fix it?
Hi,
After the release of version 2.33.4 of the Datadog Helm Chart for Kubernetes, we started to get this error in the logs:
This happened around 45,000 times in the space of one week.
The call itself relates to port 5005 of the cluster agent, when the agent communicates with the cluster agent over TLS.
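A quick sanity check (sketch only; the default chart naming for the namespace, service, and deployment is assumed here) is to confirm the cluster agent service exposing 5005 exists and to skim what the cluster agent itself is logging:

```sh
# Sketch: verify the cluster agent service on port 5005 and check its recent errors.
# "datadog" namespace and "datadog-cluster-agent" names are assumptions based on default chart naming.
kubectl -n datadog get svc datadog-cluster-agent
kubectl -n datadog logs deploy/datadog-cluster-agent --tail=200 | grep -i error
```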
This is printed by the `datadog-cluster-agent`. However, after doing some tests changing the image tag names for both the agent and the cluster agent, it seems that the problem occurs only with 7.35.+ versions of the standard agent image. The Cluster Agent is on 1.19.0 and the agent on 7.35.2.
It might be some sort of regression issue?
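If it is a regression in the 7.35 agent, one stopgap to try (sketch only; the `agents.image.tag` key and the exact 7.34.x tag are assumptions to verify against the chart's values.yaml) is pinning the node agent image back to the previous minor version:

```sh
# Sketch: temporarily pin the node agent image back to a 7.34.x tag while the regression is investigated.
# Key name and tag are assumptions; confirm them against the chart's values.yaml.
helm upgrade datadog datadog/datadog --reuse-values \
  --set agents.image.tag=7.34.0
```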
Thanks.