Under-reporting of count metrics when using a sidecar in AWS Fargate with DogStatsD and multiple tasks per service #3159
Comments
on a regular aws ecs host the hostname seems to be set at the
@mfpierre i'm wondering if this commit 5867aae has the unintended consequence of having DogStatsD-relayed metrics show up with no hostname. we have confirmed that if the
@mfpierre Please see the text below which describes the condition i think this commit 5867aae inadvertently triggered, especially for
Hey @danbf thanks for the report, indeed the removal of the hostname looks problematic in your use case. One thing you could do instead of setting the hostname is to add a task-level tag to the relayed metrics. The other solution would be to set the agent's dogstatsd tagger cardinality, configured in datadog-agent/pkg/config/config.go lines 357 to 363 (b3b64f9).
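For concreteness, a minimal sketch of the first suggestion, assuming a custom entrypoint on the agent sidecar; the `task_id` tag name and the jq-based lookup are illustrative, not built-in agent behavior:

```bash
#!/bin/bash
# Tag every DogStatsD metric relayed by this sidecar with a per-task identifier,
# so counts from different task instances stay as separate time series.
# Requires curl and jq in the agent image (see the Dockerfile changes later in this thread).
task_id=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
export DD_DOGSTATSD_TAGS="${DD_DOGSTATSD_TAGS} task_id:${task_id}"
exec /init  # hand off to the stock agent entrypoint
```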
@mfpierre here's the thing. it seems the default behavior of the datadog agent in fargate is to drop metrics that would get reported by the same version datadog agent running on an ecs ec2 cluster. and that is exactly what we saw when we moved a service from an ecs ec2 cluster to fargate. what other things are affected by, say, backing out 5867aae and letting the
Hi @danbf That includes core Datadog agent metrics like

Another reason why we disable the hostname and host tags is to avoid tasks showing up on your bill as hosts. I assure you it wouldn't be in your favor 😄 The solution @mfpierre recommends is our solution for this use case; please let us know if it doesn't work for you, we would be interested to understand why.
@hkaj is this documented anywhere in the fargate datadog agent deployment guide? at the least i think it should be. i'm not seeing it here: https://www.datadoghq.com/blog/monitor-aws-fargate/ or here: https://docs.datadoghq.com/integrations/ecs_fargate/ but i fundamentally disagree. implementing PR #1182 did a bunch of things in disabling the host level checks. i think it went just one step too far, and disabled the
The point of Fargate is to abstract the host away and focus on the task. We respected this when building the integration, and removed the host tag. Even if we had kept the host tag, there's no information about the EC2 instance available via the API or otherwise. It also doesn't make sense to surface anything host-related about a Fargate workload, even if we could retrofit the task name into the host field (which we could, you're right). It would make the product more confusing, since Fargate users don't expect to have to care about hosts. The recommended way to differentiate tasks is to use the solution that mfpierre suggested. If this is not satisfactory, we're open to other suggestions for how we can make the user experience better here, but using the host tag for a task is not it.
@hkaj Will @mfpierre's proposed solution be a problem if we deploy multiple times a day, yielding a lot of different task IDs and therefore lots of different tags? The documentation [1] states that the number of metrics is limited, so we tried to avoid tags like that until now and also used the hostname tag (which has worked for our use cases so far). [1] https://docs.datadoghq.com/developers/metrics/custom_metrics/#how-many-custom-metrics-am-i-allowed
@tom-mi you're right, it will impact billing, because it creates one time series per task instance. It really depends on what level of granularity you need. If you don't need per-task visibility into your custom metrics, I'd suggest not setting any task-level tag, to reduce the number of time series. If you need to aggregate them by task, you will need to add the task ARN in there.
Frustrated DataDog customer here. 👋 Between this issue and #2288, I'd say the current DataDog agent behavior is going to be problematic for the large majority of Fargate users. It's unintuitive, confusing, and pretty much undocumented. Basic stuff like making sure there isn't any one task instance that's low on memory, or counting the number of requests served by all tasks, isn't possible without custom configuration!
Hi @jfirebaugh
The only caveat with the current setup is with dogstatsd and using multiple instances of the same task. Please reach out to our support team ([email protected]) if you'd like to open another feature request that you think is relevant.

Simon
Sure, but using multiple instances is what everyone who wants redundancy or to scale horizontally will be doing. It's one of the main attractions of containerization.
That's great to hear! I think it will resolve the issue to everyone's satisfaction. It's almost exactly what I've implemented manually as a workaround, only I send just the task ID (last part of the ARN). My variant of @danbf's entrypoint script:

```bash
#!/bin/bash
if [[ -n "${ECS_FARGATE}" ]]; then
  # Pull the task ID (the last path segment of the TaskARN) from the ECS metadata endpoint
  task_id=$(curl --silent 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk -F\" '{print $1}')
  # Tag both host-level and DogStatsD metrics with the task ID
  export DD_TAGS="$DD_TAGS task_id:$task_id"
  export DD_DOGSTATSD_TAGS="$DD_DOGSTATSD_TAGS task_id:$task_id"
fi
/init
```
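A quick way to exercise this end to end from the application container (the metric name is a placeholder; containers in the same Fargate task can reach each other over localhost):

```bash
# Send a test count to the sidecar's DogStatsD UDP port (8125) and confirm it shows up
# in Datadog with the task_id tag attached by the entrypoint above.
echo -n "example.requests:1|c" > /dev/udp/127.0.0.1/8125
```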
Update for those who may be using the above workaround themselves: I replaced the `task_id` extraction with a jq-based version:

```bash
task_id=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
```

and added jq installation to the Dockerfile:

```dockerfile
RUN apt-get update && apt-get install -y jq && rm -rf /var/lib/apt/lists/*
```
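For anyone adapting this, the jq expression can be sanity-checked locally against a stand-in payload (the ARN below is made up):

```bash
# Stand-in for the v2 metadata response, used only to verify the jq expression
echo '{"TaskARN":"arn:aws:ecs:us-east-1:123456789012:task/9f4d1a2b3c4d5e6f"}' \
  | jq --raw-output '.TaskARN | split("/") | last'
# expected output: 9f4d1a2b3c4d5e6f
```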
@Simwar do you have an ETA for when the feature request will be worked on and released?
Can this issue be solved by setting DD_DOGSTATSD_TAG_CARDINALITY=orchestrator, which seems to append the task ARN automatically? (possible billing surcharges still being an issue)
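For reference, a minimal sketch of that approach, assuming it behaves as described and is set on the agent sidecar container (note the spelling `orchestrator`):

```bash
# Ask the agent's tagger to resolve orchestrator-level tags (such as the task ARN)
# for DogStatsD metrics, instead of only low-cardinality tags.
export DD_DOGSTATSD_TAG_CARDINALITY=orchestrator
exec /init  # continue with the stock agent entrypoint
```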
Can we get a bit more clarification on the costs you mention? I somewhat understand the DD_DOGSTATSD_TAGS cost already, since that just seems to be an extra tag attached to each metric. But how does adding the tag cardinality setting compare? For context, we deploy our application multiple times a day and probably have around 70-80 task instances per deploy.
we've switched away from bash for this, but will be looking at #5324 shortly. also would like to know the cost trade-off here. our latest entrypoint.sh here:
We've tried #5324 but it seems like this does not always work. Multiple times, when AWS autoscales and starts a new task, the task does not send its task_arn. This can cause the mentioned under-reporting as soon as two tasks are started and do not send the task_arn. This has already happened to us despite having
Hi @SteffenDE, thanks for looking into the
Just opened a ticket (#389042). I hope this helps you find the cause of this issue.
I would like to +1 @SteffenDE's problem; we are encountering the same issue where task_arn is N/A. I'd appreciate it if you could bump up the priority of this, since multiple people are hitting the same issue.
Datadog is currently investigating this. For now we're using this workaround adapted from the comments above:

```bash
#!/bin/bash
set -e
set -o pipefail

if [[ -n "${ECS_FARGATE}" ]]; then
  echo "datadog agent starting up in ecs!"
  echo "trying to get task_arn from metadata endpoint..."
  # Keep polling the metadata endpoint until it returns a TaskARN
  until [ -n "${task_arn}" ]; do
    task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
  done
  echo "got it. starting up with task_arn $task_arn"
  # Give each task instance a unique hostname so its counts are not collapsed with other tasks'
  export DD_HOSTNAME=task-$task_arn
fi

/init
```

and the accompanying Dockerfile:

```dockerfile
FROM datadog/agent:7

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

# jq is needed by entrypoint.sh to parse the metadata endpoint response
RUN apt-get update \
    && apt-get install --no-install-recommends -y jq \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENTRYPOINT ["/entrypoint.sh"]
```
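To use this, the image is built and pushed like any other custom sidecar image (the registry and tag below are placeholders), then referenced in the Fargate task definition in place of the stock datadog/agent:7 image:

```bash
# Build the customized agent image and push it to your registry (names are placeholders)
docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/datadog-agent-fargate:7 .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/datadog-agent-fargate:7
```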
I found this issue by accident while trying to find the documentation on the "correct" way to work with custom metrics in Fargate, and whether sidecars were still the preferred option. Since reading through it, I admit I'm now very glad that I'm aware of the problem before any of our teams run up against it the hard way, and disappointed that I don't seem to see it mentioned in the blog posts or docs about using DD with Fargate.

I'm also not entirely clear, at this stage, on what's needed to have reliable counts for multiple tasks that are part of the same service. Do I need to now supply something like a task-level tag or the orchestrator tag cardinality setting myself? If I do this, will it add cardinality for any custom metric reported by that task, similar to how metrics from an individual EC2 host would add cardinality? Is this in the docs now, and I just missed it?
While using the workaround, I've tried a number of environment variables with the latest version, but none resolve the issue. Here are the settings I've tried:

What has changed in 7.26.0?
I recently filed #7602 for this.
Describe what happened:
Under-reporting of count metrics is observed when using a sidecar in AWS Fargate with metrics submitted via DogStatsD and multiple task instances per service. When using a `Service Type` of replica and `Number of tasks` > 1, count metrics are under-reported, coming in at 1/<Number of tasks> of the expected value (for example, with 3 task instances each submitting a count of 10 in an interval, the reported sum is 10 instead of 30). This only occurs for `Number of tasks` > 1.

This happens as a result of two behaviors: every task instance of a service ends up with the same `hostname` parameter value (this is the current AWS Fargate behavior), and count metrics that share a hostname are treated as coming from one source. As a result, the count metrics from each of the service's task instances are considered as coming from the same source (`hostname`), so only one count metric is processed for each sample interval and the rest are discarded. This reduces the summed count per interval to a single count rather than the sum of multiple counts. If each of the service's task instances had a unique hostname set by AWS Fargate, then all the count metrics would be processed and summed together for that sample interval.

While `hostname` is not set uniquely per task instance of a service, there is a parameter that is: the `TaskARN`, and it is available to the container via the Task Metadata Endpoint (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v2.html).

So incorporating something like the snippet below, which leverages the uniqueness of the `TaskARN`, into the ECS entrypoint for the Datadog agent (https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-ecs.sh#L13) would fix this by setting `DD_HOSTNAME` to something unique per task instance.

This is based off of #2288 (comment) and https://github.com/aws/amazon-ecs-agent/issues/3#issuecomment-437643239, and we have confirmed this is working by adjusting our Dockerfile and our `/entrypoint.sh` along the lines of the sketch below.
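A minimal sketch of that entrypoint change (the jq-based lookup and the `task-` hostname prefix are illustrative; a full workaround script and Dockerfile appear in the comments above):

```bash
#!/bin/bash
# Give each Fargate task instance a unique hostname derived from its TaskARN,
# so DogStatsD counts from different task instances are not collapsed together.
# Requires curl and jq in the agent image.
task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
export DD_HOSTNAME="task-${task_arn}"
exec /init  # hand off to the stock agent entrypoint
```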
Describe what you expected:
Count metrics should be counted from each task instance running in Fargate.
Steps to reproduce the issue:
Run a service with a `Service Type` of replica and `Number of tasks` > 1 that utilizes the Datadog container as a sidecar per https://www.datadoghq.com/blog/monitor-aws-fargate/

Additional environment details (Operating System, Cloud provider, etc):
AWS Fargate
DataDog agents datadog/agent:6.5.2 and datadog/agent:6.10.1 from dockerhub