Datadog Cluster Agent TLS Handshake Error #12540

Open
omamoo opened this issue Jun 27, 2022 · 8 comments

omamoo commented Jun 27, 2022

Output of the info page (if this is a bug)

===============================
Datadog Cluster Agent (v1.19.0)
===============================

  Status date: 2022-06-27 09:32:43.046 UTC (1656322363046)
  Agent start: 2022-06-27 09:11:56.09 UTC (1656321116090)
  Pid: 1
  Go Version: go1.17.6
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System time: 2022-06-27 09:32:43.046 UTC (1656322363046)

  Hostnames
  =========
    ec2-hostname: ip-10-200-70-115.us-east-2.compute.internal
    hostname: i-01646694ab9dedea3
    instance-id: i-01646694ab9dedea3
    socket-fqdn: datadog-cluster-agent-5b6d6c676-dxxsl
    socket-hostname: datadog-cluster-agent-5b6d6c676-dxxsl
    hostname provider: aws
    unused hostname providers:
      azure: azure_hostname_style is set to 'os'
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: GCE metadata API error: status code 401 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-5b6d6c676-dxxsl
  Last Acquisition of the lease: Mon, 27 Jun 2022 09:13:15 UTC
  Renewed leadership: Mon, 27 Jun 2022 09:32:31 UTC
  Number of leader transitions: 7 transitions

Custom Metrics Server
=====================
  Disabled: The external metrics provider is not enabled on the Cluster Agent

Cluster Checks Dispatching
==========================
  Status: Leader, serving requests
  Active agents: 6
  Check Configurations: 0
    - Dispatched: 0
    - Unassigned: 0

Admission Controller
====================
  Disabled: The admission controller is not enabled on the Cluster Agent
  

=========
Collector
=========

  Running Checks
  ==============
    
    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 83
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 2, Total: 2
      Service Checks: Last Run: 5, Total: 385
      Average Execution Time : 1.98s
      Last Execution Date : 2022-06-27 09:32:29 UTC (1656322349000)
      Last Successful Execution Date : 2022-06-27 09:32:29 UTC (1656322349000)
      
    
    orchestrator
    ------------
      Instance ID: orchestrator:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/conf.yaml.default
      Total Runs: 125
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 9ms
      Last Execution Date : 2022-06-27 09:32:37 UTC (1656322357000)
      Last Successful Execution Date : 2022-06-27 09:32:37 UTC (1656322357000)
      
=========
Forwarder
=========

  Transactions
  ============
    Cluster: 7
    ClusterRole: 7
    ClusterRoleBinding: 7
    CronJob: 0
    DaemonSet: 7
    Deployment: 7
    Dropped: 0
    HighPriorityQueueFull: 0
    Ingress: 0
    Job: 0
    Node: 37
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 7
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 7
    RoleBinding: 7
    Service: 7
    ServiceAccount: 7
    StatefulSet: 0

  Transaction Successes
  =====================
    Total number: 275
    Successes By Endpoint:
      check_run_v1: 83
      intake: 2
      orchestrator: 107
      series_v1: 83

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 3ce93

=====================
Orchestrator Explorer
=====================
  Collection Status: The collection is at least partially running since the cache has been populated.
  Cluster Name: devuse2dep2eks1
  Cluster ID: ad9d7b7b-034d-4be5-a513-e965f1c9be12
  Container scrubbing: enabled

  ======================
  Orchestrator Endpoints
  ======================
    https://orchestrator.datadoghq.com - API Key ending with: 3ce93

  ===========
  Cache Stats
  ===========
    Elements in the cache: 439

    ClusterRoleBinding
      Last Run: (Hits: 77 Miss: 0) | Total: (Hits: 8470 Miss: 539)

    ClusterRole
      Last Run: (Hits: 94 Miss: 0) | Total: (Hits: 10340 Miss: 658)

    Cluster
      Last Run: (Hits: 1 Miss: 0) | Total: (Hits: 110 Miss: 7)

    CronJob
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    DaemonSet
      Last Run: (Hits: 4 Miss: 0) | Total: (Hits: 440 Miss: 28)

    Deployment
      Last Run: (Hits: 20 Miss: 0) | Total: (Hits: 2200 Miss: 140)

    Job
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Node
      Last Run: (Hits: 6 Miss: 0) | Total: (Hits: 654 Miss: 48)

    PersistentVolumeClaim
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    PersistentVolume
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Pod
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    ReplicaSet
      Last Run: (Hits: 99 Miss: 0) | Total: (Hits: 10890 Miss: 693)

    RoleBinding
      Last Run: (Hits: 21 Miss: 0) | Total: (Hits: 2310 Miss: 147)

    Role
      Last Run: (Hits: 20 Miss: 0) | Total: (Hits: 2200 Miss: 140)

    ServiceAccount
      Last Run: (Hits: 65 Miss: 0) | Total: (Hits: 7150 Miss: 455)

    Service
      Last Run: (Hits: 14 Miss: 0) | Total: (Hits: 1540 Miss: 98)

    StatefulSet
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

Describe what happened:
We can see the following logs on the cluster agent.

2022-06-27 09:34:42 UTC | CLUSTER | ERROR | (/goroot/src/net/http/server.go:3158 in logf) | Error from the agent http API server: http: TLS handshake error from 10.200.80.202:38838: EOF

This message is spamming our logs, and we cannot upgrade our environments to the latest version until this problem is solved.
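
To see which client is triggering these messages, it can help to count the errors per source address. The following is a minimal Python sketch, assuming the pipe-delimited log layout shown in this thread; the name `count_handshake_errors` is illustrative and not part of the Datadog Agent, and the two sample lines are the log lines reported in this issue.

```python
import re
from collections import Counter

# Matches the tail of a Go net/http handshake error, e.g.
# "http: TLS handshake error from 10.200.80.202:38838: EOF".
# Assumption: peers are IPv4 addresses or hostnames (no bracketed IPv6).
HANDSHAKE_RE = re.compile(
    r"TLS handshake error from (?P<host>[^\s:]+):(?P<port>\d+): (?P<reason>.+)$"
)


def count_handshake_errors(lines):
    """Return a Counter keyed by (source host, reason) for matching log lines."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line.strip())
        if match:
            counts[(match.group("host"), match.group("reason"))] += 1
    return counts


# Sample input: the two log lines quoted in this issue.
sample = [
    "2022-06-27 09:34:42 UTC | CLUSTER | ERROR | (/goroot/src/net/http/server.go:3158 in logf) | "
    "Error from the agent http API server: http: TLS handshake error from 10.200.80.202:38838: EOF",
    "2024-07-18 17:52:54 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | "
    "Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:58894: EOF",
]

for (host, reason), n in count_handshake_errors(sample).items():
    print(f"{host}\t{reason}\t{n}")
```

On a cluster, the resulting source addresses could then be compared against pod IPs (for example from `kubectl get pods -A -o wide`) to identify which workload is opening the connections.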

Describe what you expected:
This error should not be logged.

Steps to reproduce the issue:
Install the latest Cluster Agent (1.19.0).

Additional environment details (Operating System, Cloud provider, etc):
Cloud Provider: AWS & EKS

@evanharmon1

We are getting this error as well, although in the agent status, ours says:

unused hostname providers:
      azure: azure_hostname_style is set to 'os'
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

We are not using GCE. We are on AWS EKS. I think this issue has more info: "Datadog Agent tries to fetch hostname from GCE when provider is AWS" (Issue #3566, DataDog/datadog-agent).

@ericbuehl

Still seeing this issue with the latest Cluster Agent (7.44.1).

@ericbuehl

Just got a response from support claiming this may be fixed in 7.45 (released 3 days ago)

@rpriyanshu9
Contributor

I am getting these errors in 7.45.0 as well.

@clairefinnie

I am getting this error in v7.51.

It would be so useful if GitHub issues were updated with solutions.

@marcossv9

marcossv9 commented Jul 18, 2024

Hi folks, I'm getting a similar error too:

2024-07-18 17:52:54 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:58894: EOF

Running the agent in AWS ECS Fargate, using the latest version available, 7.55.1.
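
For context on what `EOF` means in this message: Go's `net/http` server typically logs it when the peer closes the connection before the TLS handshake completes, which is how a plain TCP connect-and-close (for example a port-liveness probe) appears from the server side. The sketch below illustrates that with plain Python sockets. It is not Agent code and not a confirmed diagnosis for this issue; `probe_and_close` and `first_read_after_probe` are hypothetical names.

```python
import socket
import threading


def probe_and_close(host, port):
    """Open a TCP connection and close it without sending any bytes."""
    with socket.create_connection((host, port), timeout=5):
        pass  # no TLS ClientHello is ever sent


def first_read_after_probe():
    """Accept one probe connection and return what the server reads first."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        host, port = listener.getsockname()

        client = threading.Thread(target=probe_and_close, args=(host, port))
        client.start()

        conn, _peer = listener.accept()
        with conn:
            conn.settimeout(5)
            # A TLS server would expect a ClientHello here.
            data = conn.recv(1024)
        client.join()
        return data


# An empty read means end of stream: the peer closed before sending anything.
print(first_read_after_probe())
```

If something like this is the cause, the source address in the log line (here `127.0.0.1`) points at the process doing the probing, so a local health check or sidecar would be one place to look.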

@leantorres73

> Hi folks, I'm getting a similar error too:
>
> 2024-07-18 17:52:54 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:58894: EOF
>
> Running the agent in AWS ECS Fargate, using the latest version available, 7.55.1.

How did you fix it?

@marcossv9

> > Hi folks, I'm getting a similar error too:
> >
> > 2024-07-18 17:52:54 UTC | CORE | ERROR | (/usr/local/go/src/net/http/server.go:1900 in serve) | Error from the Agent HTTP server 'CMD API Server': http: TLS handshake error from 127.0.0.1:58894: EOF
> >
> > Running the agent in AWS ECS Fargate, using the latest version available, 7.55.1.
>
> How did you fix it?

Sorry, just saw this. We ended up migrating the agent to EKS and running the latest version available. All good for now.
