
Datadog Cluster Agent errors after release 7.35.0 #12110

Open
michelzanini opened this issue May 19, 2022 · 19 comments
Comments

@michelzanini

Hi,

After the release of version 2.33.4 of the Datadog Helm Chart for Kubernetes, we started to get this error in the logs:

Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF

This happened around 45,000 times over the space of one week.
The call itself relates to port 5005 of the cluster agent, which the agent uses to communicate with the cluster agent over TLS.

The error is printed by the datadog-cluster-agent. However, after some tests changing the image tags for both the agent and the cluster agent, it seems that the problem only occurs with 7.35.x versions of the standard agent image.

Cluster Agent is on 1.19.0 and agent on 7.35.2.

Could this be some sort of regression?
Thanks.
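
For reference, the version isolation described above can be reproduced by pinning image tags in the Helm values. A minimal sketch, assuming the chart's agents.image.tag and clusterAgent.image.tag keys; the specific tags below are only examples:

agents:
  image:
    # Hypothetical pre-7.35 tag for comparison against 7.35.x
    tag: "7.34.0"
clusterAgent:
  image:
    # Cluster agent version held fixed while the node agent version is varied
    tag: "1.19.0"

Swapping only the agents tag between 7.34.x and 7.35.x while keeping the cluster agent fixed reproduces the comparison described above.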

@th0masb

th0masb commented May 24, 2022

I also observe this behaviour with cluster-agent 1.19.0 and agent 7.35.2 running on a GKE cluster on version 1.22.8. My Helm config looks like this:

# Helm values for the Datadog agent chart
# Source: https://github.com/DataDog/helm-charts/blob/main/charts/datadog/values.yaml
datadog:
  apiKey: ${api_key}
  site: datadoghq.eu
  clusterName: ${cluster_name}
  logLevel: critical
  # kube-state-metrics doesn't seem to play nicely with 1.22
  # and is considered a legacy service which is not needed
  # going forward:
  # https://docs.datadoghq.com/integrations/kubernetes_state_core
  kubeStateMetricsEnabled: false
  kubeStateMetricsCore:
    enabled: true
  logs:
    enabled: true
    containerCollectAll: true
  kubelet:
    tlsVerify: false
  env:
    # The logs from GKE system pods can be noisy and contain lots of errors.
    # We don't really care about them as we trust GKE to work.
    - name: DD_CONTAINER_EXCLUDE
      value: "kube_namespace:kube-system"

@nextanner

I'm seeing this as well, is it causing your pods to crashloop?

@sharonx

sharonx commented May 24, 2022

I'm seeing this issue too; it started about 4 days ago. Any quick remediation steps?

@michelzanini
Author

> I'm seeing this as well, is it causing your pods to crashloop?

It's not crashing the pods or anything like that, just printing the error in the logs. I am not sure what impact the error has; I have not noticed any differences so far.

@sharonx

sharonx commented May 24, 2022

> I'm seeing this as well, is it causing your pods to crashloop?

We have one pod that's in a crash loop, caused by the trace-agent erroring. The other containers in the pod are fine:

  - containerID: docker://d5cfac4026a0279810700b165d5d9f8601ed2025f724d536cef25760e29b140d
    image: gcr.io/datadoghq/agent:7.32.3
    imageID: docker-pullable://gcr.io/datadoghq/agent@sha256:706a9e65a8c82872e8f8165830387c0af5d9785774cebf44dd29121a5907e3ea
    lastState:
      terminated:
        containerID: docker://d5cfac4026a0279810700b165d5d9f8601ed2025f724d536cef25760e29b140d
        exitCode: 143
        finishedAt: "2022-05-24T17:16:46Z"
        reason: Error
        startedAt: "2022-05-24T17:15:52Z"
    name: trace-agent
    ready: false
    restartCount: 11
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=trace-agent pod=datadog-agent-m7ljq_datadog(ae681201-11fb-4a75-9f3c-558c2f8d5c05)
        reason: CrashLoopBackOff

@michelzanini
Author

Are we talking about the same error here, @sharonx?

In my case the error is printed by the datadog-cluster-agent, not the trace-agent. The log line looks like this:

Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF

@nextanner

> Are we talking about the same error here, @sharonx?
>
> In my case the error is printed by the datadog-cluster-agent, not the trace-agent. The log line looks like this:
>
> Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF

That is the error I am seeing. I had to increase the memory limit for my datadog-cluster-agent pod because it had been getting OOMKilled since the errors started appearing.
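
For reference, the memory limit bump mentioned above can be done through the chart's clusterAgent.resources block. A minimal sketch using the standard Kubernetes resources schema; the numbers are only examples, not recommended values:

clusterAgent:
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      # Hypothetical bump over the default to avoid the OOMKills described above
      memory: 512Mi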

@sharonx

sharonx commented May 24, 2022

> Error from the agent http API server: http: TLS handshake error

I see the same log in the agent, not the cluster-agent:

2022-05-24 17:31:08 UTC | CORE | INFO | (pkg/util/log/log.go:610 in func1) | runtime: final GOMAXPROCS value is: 2
2022-05-24 17:31:08 UTC | CORE | INFO | (pkg/util/log/log.go:610 in func1) | Features detected from environment: kubernetes,docker
2022-05-24 17:31:11 UTC | CORE | INFO | (cmd/agent/app/run.go:248 in StartAgent) | Starting Datadog Agent v7.32.3
2022-05-24 17:31:25 UTC | CORE | INFO | (cmd/agent/app/run.go:317 in StartAgent) | Hostname is: i-02478c6f339a3f62e
2022-05-24 17:31:52 UTC | CORE | INFO | (cmd/agent/app/run.go:357 in StartAgent) | GUI server port -1 specified: not starting the GUI.
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/forwarder/forwarder.go:205 in NewDefaultForwarder) | Retry queue storage on disk is disabled
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/forwarder/forwarder.go:304 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://7-32-3-app.agent.datadoghq.com" (1 api key(s))
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:275 in CheckConnectivity) | Checking HTTP connectivity...
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:281 in CheckConnectivity) | Sending HTTP connectivity request to https://agent-http-intake.logs.datadoghq.com/api/v2/logs...
2022-05-24 17:31:55 UTC | CORE | INFO | (pkg/dogstatsd/listeners/uds_common.go:142 in Listen) | dogstatsd-uds: starting to listen on /var/run/datadog/dsd.socket
2022-05-24 17:32:02 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [healthcheck dogstatsd-main aggregator]
2022-05-24 17:32:04 UTC | CORE | WARN | (pkg/logs/client/http/destination.go:284 in CheckConnectivity) | HTTP connectivity failure: Post "https://agent-http-intake.logs.datadoghq.com/api/v2/logs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-05-24 17:32:05 UTC | CORE | WARN | (pkg/logs/config/config.go:128 in BuildEndpointsWithConfig) | You are currently sending Logs to Datadog through TCP (either because logs_config.use_tcp or logs_config.socks5_proxy_address is set or the HTTP connectivity test has failed) To benefit from increased reliability and better network performances, we strongly encourage switching over to compressed HTTPS which is now the default protocol.
2022-05-24 17:32:05 UTC | CORE | INFO | (pkg/dogstatsd/listeners/udp.go:95 in Listen) | dogstatsd-udp: starting to listen on [::]:8125
2022-05-24 17:32:05 UTC | CORE | ERROR | (/goroot/src/net/http/server.go:1829 in serve) | Error from the agent http API server: http: TLS handshake error from 127.0.0.1:57796: read tcp 127.0.0.1:5001->127.0.0.1:57796: read: connection reset by peer
2022-05-24 17:32:05 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [healthcheck aggregator dogstatsd-main healthcheck forwarder]
2022-05-24 17:32:05 UTC | CORE | INFO | (pkg/logs/logs.go:112 in start) | Starting logs-agent...
2022-05-24 17:32:10 UTC | CORE | INFO | (pkg/logs/logs.go:122 in start) | logs-agent started

I don't see any logs printed by the trace-agent. When I describe the pod, it says the trace-agent is in a crash loop.

@sharonx

sharonx commented May 24, 2022

Update for my own issue.

I've confirmed that my issue is NOT related. It's a Kubernetes misconfiguration, and the actual cause of my issue is:

error pulling from collector "kubelet": couldn't fetch "podlist": error performing kubelet query 
https://<redacted>.compute.internal/.:10250/pods: 
Get "https://<redacted>.compute.internal/.:10250/pods": context deadline exceeded

Sorry for the confusion.
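
For anyone who hits this kubelet error rather than the cluster agent TLS one: kubelet query failures are often addressed by pointing the agent at the node IP and, if needed, relaxing kubelet TLS verification. A hedged sketch using the chart's datadog.kubelet settings (whether this is appropriate depends on how the cluster's kubelet certificates are set up):

datadog:
  kubelet:
    # Query the kubelet via the node IP instead of a hostname that may not resolve in-cluster
    host:
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    # Only if the kubelet serves a certificate the agent cannot verify
    tlsVerify: false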

@michelzanini
Author

> > Are we talking about the same error here, @sharonx?
> >
> > In my case the error is printed by the datadog-cluster-agent, not the trace-agent. The log line looks like this:
> >
> > Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF
>
> That is the error I am seeing. I had to increase the memory limit for my datadog-cluster-agent pod because it had been getting OOMKilled since the errors started appearing.

Memory consumption is less than half for me on both the cluster and standard agents, so all looks good there, but the error is printed quite often.

@tcd156

tcd156 commented May 25, 2022

Seeing this as well. Between this and issue #11126 (the 1.20 release with the change is not out yet), the cluster agent is quite noisy at the moment 😄
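
Until a release with the change ships, one possible way to cut the noise is raising the cluster agent's log level. A sketch assuming the chart's clusterAgent.env passthrough; note that this also hides other error-level messages:

clusterAgent:
  env:
    # 'critical' should suppress ERROR-level lines, including the TLS handshake messages
    - name: DD_LOG_LEVEL
      value: "critical"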

@omamoo

omamoo commented Jun 27, 2022

Seeing this as well. Are there any quick remediation steps?

@vietnq6

vietnq6 commented Nov 17, 2022

> Update for my own issue.
>
> I've confirmed that my issue is NOT related. It's a Kubernetes misconfiguration, and the actual cause of my issue is:
>
> error pulling from collector "kubelet": couldn't fetch "podlist": error performing kubelet query
> https://<redacted>.compute.internal/.:10250/pods:
> Get "https://<redacted>.compute.internal/.:10250/pods": context deadline exceeded
>
> Sorry for the confusion.

Hello @sharonx, have you fixed this issue?

@newb1e

newb1e commented May 3, 2023

Any news about this issue?

@rpriyanshu9
Contributor

Any update on this?

@gertnerbot

Are we just SOL here?

@vl-kp

vl-kp commented Sep 12, 2023

Any update?

@marcossv9

Hi, facing the same issue with Datadog agent v7.55.1 running in ECS:

2024-07-16 14:40:31 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:55972: EOF

@leantorres73

> Hi, facing the same issue with Datadog agent v7.55.1 running in ECS:
>
> 2024-07-16 14:40:31 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:55972: EOF

How did you fix it?
