
Datadog Cluster Agent errors after release 7.35.0 #12110

Open
michelzanini opened this issue May 19, 2022 · 19 comments
Comments

@michelzanini

Hi,

After the release of version 2.33.4 of the Datadog Helm Chart for Kubernetes, we started to get this error in the logs:

Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF

This happened around 45,000 times over the space of one week.
The call itself relates to port 5005 of the cluster agent, which the agent uses to communicate with the cluster agent over TLS.

The error is printed by the datadog-cluster-agent. However, after some tests changing the image tags for both the agent and the cluster agent, it seems that the problem only occurs with 7.35.x versions of the standard agent image.

Cluster Agent is on 1.19.0 and agent on 7.35.2.

Could this be some sort of regression?
Thanks.
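
For reference, the version isolation described above can be reproduced by pinning image tags in the Helm values. A minimal sketch, assuming the chart's agents.image.tag and clusterAgent.image.tag keys; the specific tags below are only examples:

agents:
  image:
    # Hypothetical pre-7.35 tag for comparison against 7.35.x
    tag: "7.34.0"
clusterAgent:
  image:
    # Cluster agent version held fixed while the node agent version is varied
    tag: "1.19.0"

Swapping only the agents tag between 7.34.x and 7.35.x while keeping the cluster agent fixed reproduces the comparison described above.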

@th0masb

th0masb commented May 24, 2022

I also observe this behaviour with cluster-agent 1.19.0 and agent 7.35.2 running on a GKE cluster on version 1.22.8. My Helm config looks like this:

# Helm values for the Datadog agent chart
# Source: https://github.com/DataDog/helm-charts/blob/main/charts/datadog/values.yaml
datadog:
  apiKey: ${api_key}
  site: datadoghq.eu
  clusterName: ${cluster_name}
  logLevel: critical
  # kube-state-metrics doesn't seem to play nicely with 1.22
  # and is considered a legacy service which is not needed
  # going forward:
  # https://docs.datadoghq.com/integrations/kubernetes_state_core
  kubeStateMetricsEnabled: false
  kubeStateMetricsCore:
    enabled: true
  logs:
    enabled: true
    containerCollectAll: true
  kubelet:
    tlsVerify: false
  env:
    # The logs from GKE system pods can be noisy and contain lots of errors.
    # We don't really care about them as we trust GKE to work.
    - name: DD_CONTAINER_EXCLUDE
      value: "kube_namespace:kube-system"

@nextanner

I'm seeing this as well, is it causing your pods to crashloop?

@sharonx

sharonx commented May 24, 2022

I'm seeing this issue too; it started about 4 days ago. Any quick remediation steps?

@michelzanini
Author

> I'm seeing this as well, is it causing your pods to crashloop?

It's not crashing the pods or anything like that, just printing the error in the logs. I am not sure what impact the error has; I have not noticed any differences so far.

@sharonx

sharonx commented May 24, 2022

> I'm seeing this as well, is it causing your pods to crashloop?

We have one pod that's in a crash loop, caused by the trace-agent erroring. The other containers in the pod are fine:

  - containerID: docker://d5cfac4026a0279810700b165d5d9f8601ed2025f724d536cef25760e29b140d
    image: gcr.io/datadoghq/agent:7.32.3
    imageID: docker-pullable://gcr.io/datadoghq/agent@sha256:706a9e65a8c82872e8f8165830387c0af5d9785774cebf44dd29121a5907e3ea
    lastState:
      terminated:
        containerID: docker://d5cfac4026a0279810700b165d5d9f8601ed2025f724d536cef25760e29b140d
        exitCode: 143
        finishedAt: "2022-05-24T17:16:46Z"
        reason: Error
        startedAt: "2022-05-24T17:15:52Z"
    name: trace-agent
    ready: false
    restartCount: 11
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=trace-agent pod=datadog-agent-m7ljq_datadog(ae681201-11fb-4a75-9f3c-558c2f8d5c05)
        reason: CrashLoopBackOff

@michelzanini
Author

Are we talking about the same error here, @sharonx?

In my case the error is printed by the datadog-cluster-agent, not the trace-agent. The log line looks like this:

Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF

@nextanner

> Are we talking about the same error here, @sharonx?
>
> In my case the error is printed by the datadog-cluster-agent, not the trace-agent. The log line looks like this:
>
> Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF

That is the error I am seeing. I had to increase the memory limit for my datadog-cluster-agent pod because it had been getting OOMKilled since the errors started appearing.
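
For reference, the memory limit bump mentioned above can be done through the chart's clusterAgent.resources block. A minimal sketch using the standard Kubernetes resources schema; the numbers are only examples, not recommended values:

clusterAgent:
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      # Hypothetical bump over the default to avoid the OOMKills described above
      memory: 512Mi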

@sharonx

sharonx commented May 24, 2022

> Error from the agent http API server: http: TLS handshake error

I see the same log in the agent, not the cluster-agent:

2022-05-24 17:31:08 UTC | CORE | INFO | (pkg/util/log/log.go:610 in func1) | runtime: final GOMAXPROCS value is: 2
2022-05-24 17:31:08 UTC | CORE | INFO | (pkg/util/log/log.go:610 in func1) | Features detected from environment: kubernetes,docker
2022-05-24 17:31:11 UTC | CORE | INFO | (cmd/agent/app/run.go:248 in StartAgent) | Starting Datadog Agent v7.32.3
2022-05-24 17:31:25 UTC | CORE | INFO | (cmd/agent/app/run.go:317 in StartAgent) | Hostname is: i-02478c6f339a3f62e
2022-05-24 17:31:52 UTC | CORE | INFO | (cmd/agent/app/run.go:357 in StartAgent) | GUI server port -1 specified: not starting the GUI.
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/forwarder/forwarder.go:205 in NewDefaultForwarder) | Retry queue storage on disk is disabled
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/forwarder/forwarder.go:304 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://7-32-3-app.agent.datadoghq.com" (1 api key(s))
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:275 in CheckConnectivity) | Checking HTTP connectivity...
2022-05-24 17:31:54 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:281 in CheckConnectivity) | Sending HTTP connectivity request to https://agent-http-intake.logs.datadoghq.com/api/v2/logs...
2022-05-24 17:31:55 UTC | CORE | INFO | (pkg/dogstatsd/listeners/uds_common.go:142 in Listen) | dogstatsd-uds: starting to listen on /var/run/datadog/dsd.socket
2022-05-24 17:32:02 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [healthcheck dogstatsd-main aggregator]
2022-05-24 17:32:04 UTC | CORE | WARN | (pkg/logs/client/http/destination.go:284 in CheckConnectivity) | HTTP connectivity failure: Post "https://agent-http-intake.logs.datadoghq.com/api/v2/logs": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-05-24 17:32:05 UTC | CORE | WARN | (pkg/logs/config/config.go:128 in BuildEndpointsWithConfig) | You are currently sending Logs to Datadog through TCP (either because logs_config.use_tcp or logs_config.socks5_proxy_address is set or the HTTP connectivity test has failed) To benefit from increased reliability and better network performances, we strongly encourage switching over to compressed HTTPS which is now the default protocol.
2022-05-24 17:32:05 UTC | CORE | INFO | (pkg/dogstatsd/listeners/udp.go:95 in Listen) | dogstatsd-udp: starting to listen on [::]:8125
2022-05-24 17:32:05 UTC | CORE | ERROR | (/goroot/src/net/http/server.go:1829 in serve) | Error from the agent http API server: http: TLS handshake error from 127.0.0.1:57796: read tcp 127.0.0.1:5001->127.0.0.1:57796: read: connection reset by peer
2022-05-24 17:32:05 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [healthcheck aggregator dogstatsd-main healthcheck forwarder]
2022-05-24 17:32:05 UTC | CORE | INFO | (pkg/logs/logs.go:112 in start) | Starting logs-agent...
2022-05-24 17:32:10 UTC | CORE | INFO | (pkg/logs/logs.go:122 in start) | logs-agent started

I don't see any logs printed by the trace-agent. When I describe the pod, it says the trace-agent is in a crash loop.

@sharonx

sharonx commented May 24, 2022

Update for my own issue.

I've confirmed that my issue is NOT related. It's a Kubernetes misconfiguration, and the actual cause of my issue is:

error pulling from collector "kubelet": couldn't fetch "podlist": error performing kubelet query 
https://<redacted>.compute.internal/.:10250/pods: 
Get "https://<redacted>.compute.internal/.:10250/pods": context deadline exceeded

Sorry for the confusion.
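
For anyone who hits this kubelet error rather than the cluster agent TLS one: kubelet query failures are often addressed by pointing the agent at the node IP and, if needed, relaxing kubelet TLS verification. A hedged sketch using the chart's datadog.kubelet settings (whether this is appropriate depends on how the cluster's kubelet certificates are set up):

datadog:
  kubelet:
    # Query the kubelet via the node IP instead of a hostname that may not resolve in-cluster
    host:
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    # Only if the kubelet serves a certificate the agent cannot verify
    tlsVerify: false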

@michelzanini
Author

> > Are we talking about the same error here, @sharonx?
> >
> > In my case the error is printed by the datadog-cluster-agent, not the trace-agent. The log line looks like this:
> >
> > Error from the agent http API server: http: TLS handshake error from 10.0.XXX.XXX:XXXX: EOF
>
> That is the error I am seeing. I had to increase the memory limit for my datadog-cluster-agent pod because it had been getting OOMKilled since the errors started appearing.

Memory consumption is less than half for me on both the cluster and standard agents, so all looks good there, but the error is printed quite often.

@tcd156

tcd156 commented May 25, 2022

Seeing this as well. Between this and issue #11126 (the 1.20 release with the change is not out yet), the cluster agent is quite noisy at the moment 😄
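
Until a release with the change ships, one possible way to cut the noise is raising the cluster agent's log level. A sketch assuming the chart's clusterAgent.env passthrough; note that this also hides other error-level messages:

clusterAgent:
  env:
    # 'critical' should suppress ERROR-level lines, including the TLS handshake messages
    - name: DD_LOG_LEVEL
      value: "critical"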

@omamoo

omamoo commented Jun 27, 2022

Seeing this as well. Are there any quick remediation steps?

@vietnq6

vietnq6 commented Nov 17, 2022

> Update for my own issue.
>
> I've confirmed that my issue is NOT related. It's a Kubernetes misconfiguration, and the actual cause of my issue is:
>
> error pulling from collector "kubelet": couldn't fetch "podlist": error performing kubelet query
> https://<redacted>.compute.internal/.:10250/pods:
> Get "https://<redacted>.compute.internal/.:10250/pods": context deadline exceeded
>
> Sorry for the confusion.

Hello @sharonx, have you fixed this issue?

@newb1e

newb1e commented May 3, 2023

Any news about this issue?

@rpriyanshu9
Contributor

Any update on this?

@gertnerbot

Are we just SOL here?

@vl-kp

vl-kp commented Sep 12, 2023

Any update?

@marcossv9

Hi, facing the same issue with Datadog agent v7.55.1 running in ECS:

2024-07-16 14:40:31 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:55972: EOF

@leantorres73

> Hi, facing the same issue with Datadog agent v7.55.1 running in ECS:
>
> 2024-07-16 14:40:31 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:55972: EOF

How did you fix it?
