ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570
Comments
+1
+1
+1
+1
+1
We are running into the same issue. It looks like an ECS issue to me; should we add a graceful retry in the Datadog agent to work around it?
I'm not sure it's an ECS issue if the agent health check itself is still passing, since I assume that is what ECS uses to determine whether the container needs restarting. Having the same issue here.
Yeah, we should also fail the agent health check in case of this kind of unrecoverable error.
Is there an update on this bug fix? Seeing the same in ECS Fargate.
I am seeing this same issue in AWS ECS Fargate, but it seems like an s6-overlay issue. I have this agent as an essential sidecar to my application (both containers are essential, so if either exits the task should exit). This error happens, but the task does not seem to exit overall, which makes me think that s6-overlay may not be allowing the container to exit when this error occurs, and thus my application does not exit either. Perhaps that is to be expected? To me that would be unexpected behavior.
Hey all, the issue has been fixed with this PR: #6938 and the fix is in version 7.25 of the Datadog Agent. Thanks!
Output of the info page (if this is a bug)
Sorry, access to this is limited right now; hopefully the below description still helps.
Describe what happened:
The Datadog agent stopped working after encountering a panic:
However, the defined health check still seems to be succeeding, since the task is not restarted and the container is still marked as HEALTHY.
From a short dive into the code, the problem seems to be here:
datadog-agent/pkg/autodiscovery/listeners/ecs.go
Lines 206 to 208 in b1324f0
There seem to be assumptions around c.DockerID that are not met.
Describe what you expected:
At the very least, the health check should fail.
Steps to reproduce the issue:
Unsure. This happens very sporadically, but the root cause seems to be a failed HTTP request to the metadata endpoint.
Additional environment details (Operating System, Cloud provider, etc):
Cloud Provider: AWS
Orchestration: ECS Fargate
Agent Image: datadog/agent:latest-jmx
Environment Variables:
Update (27.10.2020): Provided more comprehensive agent log.