ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570

xsc · 2020-10-15T09:56:28Z

Output of the info page (if this is a bug)

Sorry, access to this is limited right now; hopefully the below description still helps.

Describe what happened:

The datadog agent stopped working after a panic has been encountered:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                                                                                 message                                                                                                                                                  |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1603746245218 | 2020-10-26 21:04:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746294953 | 2020-10-26 21:04:54 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received                                                                                                                                                                                                    |
| 1603746295226 | 2020-10-26 21:04:55 UTC | PROCESS | INFO | (collector.go:161 in runCheck) | Finish container check #4200 in 4.243073ms                                                                                                                                                                                   |
| 1603746305218 | 2020-10-26 21:05:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746308234 | 2020-10-26 21:05:08 UTC | CORE | INFO | (pkg/serializer/serializer.go:356 in sendMetadata) | Sent metadata payload, size (raw/compressed): 467/287 bytes.                                                                                                                                                |
| 1603746308250 | 2020-10-26 21:05:08 UTC | CORE | INFO | (pkg/serializer/serializer.go:376 in SendJSONToV1Intake) | Sent processes metadata payload, size: 392 bytes.                                                                                                                                                     |
| 1603746354953 | 2020-10-26 21:05:54 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received                                                                                                                                                                                                    |
| 1603746365218 | 2020-10-26 21:06:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746424952 | 2020-10-26 21:07:04 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received                                                                                                                                                                                                    |
| 1603746425218 | 2020-10-26 21:07:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746485218 | 2020-10-26 21:08:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746485758 | 2020-10-26 21:08:05 UTC | CORE | ERROR | (pkg/autodiscovery/listeners/ecs.go:117 in refreshServices) | failed to get task metadata, not refreshing services - Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                             |
| 1603746487758 | 2020-10-26 21:08:07 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider ecs: Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                        |
| 1603746487758 | 2020-10-26 21:08:07 UTC | CORE | ERROR | (pkg/autodiscovery/listeners/ecs.go:117 in refreshServices) | failed to get task metadata, not refreshing services - Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                             |
| 1603746488833 | 2020-10-26 21:08:08 UTC | TRACE | WARN | (pkg/tagger/tagger.go:211 in pull) | Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                                                                                             |
| 1603746488856 | 2020-10-26 21:08:08 UTC | CORE | WARN | (pkg/tagger/tagger.go:211 in pull) | Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                                                                                              |
| 1603746488875 | 2020-10-26 21:08:08 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:118 in LogMessage) | ecs_fargate:d884b5186b651429 | (ecs_fargate.py:90) | Error fetching Fargate http://169.254.170.2/v2/metadata endpoint                                                                               |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn                                                                                                                                                                                           |
| 1603746488875 |     conn = connection.create_connection(                                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection                                                                                                                                                                               |
| 1603746488875 |     raise err                                                                                                                                                                                                                                                                                            |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection                                                                                                                                                                               |
| 1603746488875 |     sock.connect(sa)                                                                                                                                                                                                                                                                                     |
| 1603746488875 | ConnectionRefusedError: [Errno 111] Connection refused                                                                                                                                                                                                                                                   |
| 1603746488875 | During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                      |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen                                                                                                                                                                                         |
| 1603746488875 |     httplib_response = self._make_request(                                                                                                                                                                                                                                                               |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request                                                                                                                                                                                   |
| 1603746488875 |     conn.request(method, url, **httplib_request_kw)                                                                                                                                                                                                                                                      |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1255, in request                                                                                                                                                                                                                 |
| 1603746488875 |     self._send_request(method, url, body, headers, encode_chunked)                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1301, in _send_request                                                                                                                                                                                                           |
| 1603746488875 |     self.endheaders(body, encode_chunked=encode_chunked)                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1250, in endheaders                                                                                                                                                                                                              |
| 1603746488875 |     self._send_output(message_body, encode_chunked=encode_chunked)                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1010, in _send_output                                                                                                                                                                                                            |
| 1603746488875 |     self.send(msg)                                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 950, in send                                                                                                                                                                                                                     |
| 1603746488875 |     self.connect()                                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect                                                                                                                                                                                             |
| 1603746488875 |     conn = self._new_conn()                                                                                                                                                                                                                                                                              |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn                                                                                                                                                                                           |
| 1603746488875 |     raise NewConnectionError(                                                                                                                                                                                                                                                                            |
| 1603746488875 | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f97799bac70>: Failed to establish a new connection: [Errno 111] Connection refused                                                                                                                                |
| 1603746488875 | During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                      |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send                                                                                                                                                                                                 |
| 1603746488875 |     resp = conn.urlopen(                                                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen                                                                                                                                                                                         |
| 1603746488875 |     retries = retries.increment(                                                                                                                                                                                                                                                                         |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 446, in increment                                                                                                                                                                                           |
| 1603746488875 |     raise MaxRetryError(_pool, url, error or ResponseError(cause))                                                                                                                                                                                                                                       |
| 1603746488875 | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='169.254.170.2', port=80): Max retries exceeded with url: /v2/metadata (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f97799bac70>: Failed to establish a new connection: [Errno 111] Connection refused'))    |
| 1603746488875 | During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                      |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ecs_fargate/ecs_fargate.py", line 79, in check                                                                                                                                                                            |
| 1603746488875 |     request = self.http.get(metadata_endpoint)                                                                                                                                                                                                                                                           |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 283, in get                                                                                                                                                                                     |
| 1603746488875 |     return self._request('get', url, options)                                                                                                                                                                                                                                                            |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 332, in _request                                                                                                                                                                                |
| 1603746488875 |     return getattr(requests, method)(url, **new_options)                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get                                                                                                                                                                                                        |
| 1603746488875 |     return request('get', url, params=params, **kwargs)                                                                                                                                                                                                                                                  |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request                                                                                                                                                                                                    |
| 1603746488875 |     return session.request(method=method, url=url, **kwargs)                                                                                                                                                                                                                                             |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request                                                                                                                                                                                              |
| 1603746488875 |     resp = self.send(prep, **send_kwargs)                                                                                                                                                                                                                                                                |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send                                                                                                                                                                                                 |
| 1603746488875 |     r = adapter.send(request, **kwargs)                                                                                                                                                                                                                                                                  |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send                                                                                                                                                                                                 |
| 1603746488875 |     raise ConnectionError(e, request=request)                                                                                                                                                                                                                                                            |
| 1603746488875 | requests.exceptions.ConnectionError: HTTPConnectionPool(host='169.254.170.2', port=80): Max retries exceeded with url: /v2/metadata (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f97799bac70>: Failed to establish a new connection: [Errno 111] Connection refused')) |
| 1603746488940 | 2020-10-26 21:08:08 UTC | PROCESS | WARN | (pkg/tagger/tagger.go:211 in pull) | Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                                                                                           |
| 1603746489761 | panic: runtime error: slice bounds out of range [:12] with length 0                                                                                                                                                                                                                                      |
| 1603746489761 | goroutine 220 [running]:                                                                                                                                                                                                                                                                                 |
| 1603746489761 | github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).createService(0xc00079b2c0, 0xc000742250, 0x3, 0xc000836180, 0xc0008700a0, 0x47, 0x0, 0x0, 0x0, 0x0, ...)                                                                                                                    |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:207 +0x829                                                                                                                                                                                          |
| 1603746489761 | github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).refreshServices(0xc00079b2c0, 0xc000eb5f00)                                                                                                                                                                                  |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:146 +0x871                                                                                                                                                                                          |
| 1603746489761 | github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).Listen.func1(0xc00079b2c0)                                                                                                                                                                                                   |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:100 +0x114                                                                                                                                                                                          |
| 1603746489761 | created by github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).Listen                                                                                                                                                                                                            |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:91 +0x5c                                                                                                                                                                                            |
| 1603746489766 | AGENT EXITED WITH CODE 2, SIGNAL 0, KILLING CONTAINER                                                                                                                                                                                                                                                    |
| 1603746489773 | 2020-10-26 21:08:09 UTC | TRACE | INFO | (main.go:23 in handleSignal) | received signal 15 (terminated)                                                                                                                                                                                                  |
| 1603746489773 | 2020-10-26 21:08:09 UTC | TRACE | INFO | (pkg/trace/agent/agent.go:132 in loop) | Exiting...                                                                                                                                                                                                             |
| 1603746489773 | 2020-10-26 21:08:09 UTC | TRACE | INFO | (pkg/trace/stats/concentrator.go:105 in Run) | Exiting concentrator, computing remaining stats                                                                                                                                                                  |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.                                                                                                                                                     |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:280 in Stop) | stopping the Forwarder                                                                                                                                                                                             |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | INFO | (pkg/forwarder/domain_forwarder.go:233 in Stop) | domainForwarder stopped                                                                                                                                                                                     |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:280 in Stop) | stopping the Forwarder                                                                                                                                                                                             |
| 1603746489773 | trace-agent exited with code 0, disabling                                                                                                                                                                                                                                                                |
| 1603746489775 | process-agent exited with code 0, disabling                                                                                                                                                                                                                                                              |
| 1603746489776 | [cont-finish.d] executing container finish scripts...                                                                                                                                                                                                                                                    |
| 1603746489777 | [cont-finish.d] done.                                                                                                                                                                                                                                                                                    |
| 1603746489777 | [s6-finish] waiting for services.                                                                                                                                                                                                                                                                        |
| 1603746489981 | [s6-finish] sending all processes the TERM signal.                                                                                                                                                                                                                                                       |
| 1603746492986 | [s6-finish] sending all processes the KILL signal and exiting.                                                                                                                                                                                                                                           |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

However, the defined health check still seems to be succeeding since the task is not restarted and the container is still marked as HEALTHY.

          ...
          HealthCheck:
            Command:
              - CMD-SHELL
              - /probe.sh
            Interval: 120
            Retries: 2
            Timeout: 5
          ...

From a short dive into the code, the problem seems to be here:

datadog-agent/pkg/autodiscovery/listeners/ecs.go

Lines 206 to 208 in b1324f0

    
           if err != nil { 
        
           	log.Errorf("Failed to extract tags for container %s - %s", c.DockerID[:12], err) 
        
           }

There seem to be assumptions around c.DockerID that are not met.

Describe what you expected:

At the very least, the health check to fail.

Steps to reproduce the issue:

Unsure. This happens very sporadically, but the root cause seems to be a failed HTTP request to the metadata endpoint.

Additional environment details (Operating System, Cloud provider, etc):

Cloud Provider: AWS
Orchestration: ECS Fargate
Agent Image: datadog/agent:latest-jmx
Environment Variables:

          Environment:
            - Name: ECS_FARGATE
              Value: "true"
            - Name: DD_TAGS
              Value: !Sub "accountid:${AWS::AccountId} env:${ServiceEnvironment} service_name:${ServiceName} service_version:${ImageTag}"
            - Name: DD_APM_ENABLED
              Value: "true"
            - Name: DD_JMXFETCH_ENABLED
              Value: "true"
            - Name: DD_HISTOGRAM_AGGREGATES
              Value: max median avg count sum
            # See: https://github.com/DataDog/datadog-agent/issues/3159
            - Name: DD_DOGSTATSD_TAG_CARDINALITY
              Value: orchestrator

Update (27.10.2020): Provided more comprehensive agent log.

The text was updated successfully, but these errors were encountered:

Victormeriqui · 2020-10-28T14:34:41Z

+1

bitlejac · 2020-10-29T10:12:47Z

+1

alexmarqs · 2020-10-29T10:13:43Z

+1

kwuerl · 2020-11-12T10:30:20Z

+1

aweraw · 2020-11-19T05:18:10Z

+1

houqp · 2020-12-25T08:36:50Z

We are running into the same issue. Looks like ECS issue to me, should we add a graceful retry in datadog agent to workaround it?

peter-featherstone · 2021-01-25T15:38:19Z

Not sure if it's an ECS issue if the agent health check itself is still passing which is what I assume ECS would be using to determine if the container needs restarting? Having the same issue here.

houqp · 2021-01-25T18:43:51Z

yeah, we should also fail agent health check in case of this kind of unrecoverable error.

ravikashyap1 · 2021-02-02T16:59:03Z

Is there an update to this bug fix? Seeing the same in ECS Fargate.

lockwobr · 2021-04-05T22:02:29Z

I am seeing this same issue in AWS ECS Fargate, but seems like an s6-overlay issue. I have this agent as an essential sidecar to my application, (both containers are essential, which if either exits the task should exit) this error happens but it seems task does not exiting overall, which to me make me feel that maybe the s6-overlay is not allowing the container to exit when this error is happening, and thus my application does not exit either. Perhaps that is to be expected? To me that would be unexpected behavior.

Simwar · 2021-05-05T13:32:50Z

Hey all,

The issue has been fixed with this PR: #6938 and is in the version 7.25 of the Datadog Agent.

Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570

ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570

xsc commented Oct 15, 2020 •

edited

Loading

Victormeriqui commented Oct 28, 2020

bitlejac commented Oct 29, 2020

alexmarqs commented Oct 29, 2020

kwuerl commented Nov 12, 2020

aweraw commented Nov 19, 2020

houqp commented Dec 25, 2020

peter-featherstone commented Jan 25, 2021

houqp commented Jan 25, 2021

ravikashyap1 commented Feb 2, 2021

lockwobr commented Apr 5, 2021

Simwar commented May 5, 2021

ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570

ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570

Comments

xsc commented Oct 15, 2020 • edited Loading

Victormeriqui commented Oct 28, 2020

bitlejac commented Oct 29, 2020

alexmarqs commented Oct 29, 2020

kwuerl commented Nov 12, 2020

aweraw commented Nov 19, 2020

houqp commented Dec 25, 2020

peter-featherstone commented Jan 25, 2021

houqp commented Jan 25, 2021

ravikashyap1 commented Feb 2, 2021

lockwobr commented Apr 5, 2021

Simwar commented May 5, 2021

xsc commented Oct 15, 2020 •

edited

Loading