Under-reporting of count metrics when using a sidecar in AWS Fargate with DogStatsD and multiple tasks per service #3159
Comments
on a regular aws ecs host the hostname seems to be set at the
@mfpierre i'm wondering if this commit 5867aae has the unintended consequence of having DogStatsD-relayed metrics show up with no hostname. we have confirmed that if the
@mfpierre Please see the text below which describes the condition i think this commit 5867aae inadvertently triggered, especially for
Hey @danbf thanks for the report, indeed the removal of the hostname looks problematic in your use case. One thing you could do instead of setting the hostname is to add a task-level tag to the relayed metrics. The other solution would be to set the agent's dogstatsd tagger cardinality, configured in datadog-agent/pkg/config/config.go lines 357 to 363 (b3b64f9).
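For concreteness, a minimal sketch of the first suggestion, assuming a custom entrypoint on the agent sidecar; the `task_id` tag name and the jq-based lookup are illustrative, not built-in agent behavior:

```bash
#!/bin/bash
# Tag every DogStatsD metric relayed by this sidecar with a per-task identifier,
# so counts from different task instances stay as separate time series.
# Requires curl and jq in the agent image (see the Dockerfile changes later in this thread).
task_id=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
export DD_DOGSTATSD_TAGS="${DD_DOGSTATSD_TAGS} task_id:${task_id}"
exec /init  # hand off to the stock agent entrypoint
```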
@mfpierre here's the thing. it seems the default behavior of the datadog agent in fargate is to drop metrics that would get reported by the same version datadog agent running on an ecs ec2 cluster. and that is exactly what we saw when we moved a service from an ecs ec2 cluster to fargate. what other things are affected by, say, backing out 5867aae and letting the
Hi @danbf That includes core Datadog agent metrics like

Another reason why we disable the hostname and host tags is to avoid tasks showing up on your bill as hosts. I assure you it wouldn't be in your favor 😄 The solution @mfpierre recommends is our solution for this use case; please let us know if it doesn't work for you, we would be interested to understand why.
@hkaj is this documented anywhere in the fargate datadog agent deployment guide? at the least i think it should be. i'm not seeing it here: https://www.datadoghq.com/blog/monitor-aws-fargate/ or here: https://docs.datadoghq.com/integrations/ecs_fargate/ but i fundamentally disagree. implementing PR #1182 did a bunch of things in disabling the host level checks. i think it went just one step too far, and disabled the
The point of Fargate is to abstract the host away and focus on the task. We respected this when building the integration, and removed the host tag. Even if we had kept the host tag, there's no information about the EC2 instance available via the API or otherwise. It also doesn't make sense to surface anything host-related about a Fargate workload, even if we could retrofit the task name into the host field (which we could, you're right). It would make the product more confusing, since Fargate users don't expect to have to care about hosts. The recommended way to differentiate tasks is to use the solution that mfpierre suggested. If this is not satisfactory, we're open to other suggestions for how we can make the user experience better here, but using the host tag for a task is not it.
@hkaj Will @mfpierre's proposed solution be a problem if we deploy multiple times a day, yielding a lot of different task IDs and therefore lots of different tags? The documentation [1] states that the number of metrics is limited, so we tried to avoid tags like that until now and also used the hostname tag (which has worked for our use cases so far). [1] https://docs.datadoghq.com/developers/metrics/custom_metrics/#how-many-custom-metrics-am-i-allowed
@tom-mi you're right, it will impact billing, because it creates one time series per task instance. It really depends on what level of granularity you need. If you don't need per-task visibility into your custom metrics, I'd suggest not setting any task-level tag, to reduce the number of time series. If you need to aggregate them by task, you will need to add the task ARN in there.
Frustrated DataDog customer here. 👋 Between this issue and #2288, I'd say the current DataDog agent behavior is going to be problematic for the large majority of Fargate users. It's unintuitive, confusing, and pretty much undocumented. Basic stuff like making sure there isn't any one task instance that's low on memory, or counting the number of requests served by all tasks, isn't possible without custom configuration!
Hi @jfirebaugh
The only caveat with the current setup is with dogstatsd and using multiple instances of the same task. Please reach out to our support team ([email protected]) if you'd like to open another feature request that you think is relevant.

Simon
Sure, but using multiple instances is what everyone who wants redundancy or to scale horizontally will be doing. It's one of the main attractions of containerization.
That's great to hear! I think it will resolve the issue to everyone's satisfaction. It's almost exactly what I've implemented manually as a workaround, only I send just the task ID (last part of the ARN). My variant of @danbf's entrypoint script:

```bash
#!/bin/bash
if [[ -n "${ECS_FARGATE}" ]]; then
  # Pull the task ID (the last path segment of the TaskARN) from the ECS metadata endpoint
  task_id=$(curl --silent 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk -F\" '{print $1}')
  # Tag both host-level and DogStatsD metrics with the task ID
  export DD_TAGS="$DD_TAGS task_id:$task_id"
  export DD_DOGSTATSD_TAGS="$DD_DOGSTATSD_TAGS task_id:$task_id"
fi
/init
```
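A quick way to exercise this end to end from the application container (the metric name is a placeholder; containers in the same Fargate task can reach each other over localhost):

```bash
# Send a test count to the sidecar's DogStatsD UDP port (8125) and confirm it shows up
# in Datadog with the task_id tag attached by the entrypoint above.
echo -n "example.requests:1|c" > /dev/udp/127.0.0.1/8125
```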
Update for those who may be using the above workaround themselves: I replaced the `task_id` extraction with a jq-based version:

```bash
task_id=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
```

and added jq installation to the Dockerfile:

```dockerfile
RUN apt-get update && apt-get install -y jq && rm -rf /var/lib/apt/lists/*
```
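For anyone adapting this, the jq expression can be sanity-checked locally against a stand-in payload (the ARN below is made up):

```bash
# Stand-in for the v2 metadata response, used only to verify the jq expression
echo '{"TaskARN":"arn:aws:ecs:us-east-1:123456789012:task/9f4d1a2b3c4d5e6f"}' \
  | jq --raw-output '.TaskARN | split("/") | last'
# expected output: 9f4d1a2b3c4d5e6f
```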
@Simwar do you have an ETA for when the feature request will be worked on and released?
Can this issue be solved by setting DD_DOGSTATSD_TAG_CARDINALITY=orchestrator, which seems to append the task ARN automatically? (possible billing surcharges still being an issue)
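For reference, a minimal sketch of that approach, assuming it behaves as described and is set on the agent sidecar container (note the spelling `orchestrator`):

```bash
# Ask the agent's tagger to resolve orchestrator-level tags (such as the task ARN)
# for DogStatsD metrics, instead of only low-cardinality tags.
export DD_DOGSTATSD_TAG_CARDINALITY=orchestrator
exec /init  # continue with the stock agent entrypoint
```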
Can we get a bit more clarification on the costs you mention? I somewhat understand the DD_DOGSTATSD_TAGS cost already, since that just seems to be an extra tag attached to each metric. But how does adding the tag cardinality setting compare? For context, we deploy our application multiple times a day and probably have around 70-80 task instances per deploy.
we've switched away from bash for this, but will be looking at #5324 shortly. also would like to know the cost trade-off here. our latest entrypoint.sh here:
We've tried #5324 but it seems like this does not always work. Multiple times, when AWS autoscales and starts a new task, the task does not send its task_arn. This can cause the mentioned under-reporting as soon as two tasks are started and do not send the task_arn. This has already happened to us despite having
Hi @SteffenDE, thanks for looking into the
Just opened a ticket (#389042). I hope this helps you find the cause of this issue.
I would like to +1 @SteffenDE's problem; we are encountering the same issue where task_arn is N/A. I'd appreciate it if you could bump up the priority of this, since multiple people are hitting the same issue.
Datadog is currently investigating this. For now we're using this workaround adapted from the comments above:

```bash
#!/bin/bash
set -e
set -o pipefail

if [[ -n "${ECS_FARGATE}" ]]; then
  echo "datadog agent starting up in ecs!"
  echo "trying to get task_arn from metadata endpoint..."
  # Keep polling the metadata endpoint until it returns a TaskARN
  until [ -n "${task_arn}" ]; do
    task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
  done
  echo "got it. starting up with task_arn $task_arn"
  # Give each task instance a unique hostname so its counts are not collapsed with other tasks'
  export DD_HOSTNAME=task-$task_arn
fi

/init
```

and the accompanying Dockerfile:

```dockerfile
FROM datadog/agent:7

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

# jq is needed by entrypoint.sh to parse the metadata endpoint response
RUN apt-get update \
    && apt-get install --no-install-recommends -y jq \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENTRYPOINT ["/entrypoint.sh"]
```
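To use this, the image is built and pushed like any other custom sidecar image (the registry and tag below are placeholders), then referenced in the Fargate task definition in place of the stock datadog/agent:7 image:

```bash
# Build the customized agent image and push it to your registry (names are placeholders)
docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/datadog-agent-fargate:7 .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/datadog-agent-fargate:7
```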
I found this issue by accident while trying to find the documentation on the "correct" way to work with custom metrics in Fargate, and whether sidecars were still the preferred option. Since reading through it, I admit I'm now very glad that I'm aware of the problem before any of our teams run up against it the hard way, and disappointed that I don't seem to see it mentioned in the blog posts or docs about using DD with Fargate.

I'm also not entirely clear, at this stage, on what's needed to have reliable counts for multiple tasks that are part of the same service. Do I need to now supply something like a task-level tag or the orchestrator tag cardinality setting myself? If I do this, will it add cardinality for any custom metric reported by that task, similar to how metrics from an individual EC2 host would add cardinality? Is this in the docs now, and I just missed it?
While using the workaround, I've tried a number of environment variables with the latest version, but none resolve the issue. Here are the settings I've tried:

What has changed in 7.26.0?
I recently filed #7602 for this.
Describe what happened:
Under-reporting of count metrics is observed when using a sidecar in AWS Fargate with metrics submitted via DogStatsD and multiple task instances per service. When using a `Service Type` of replica and `Number of tasks` > 1, count metrics are under-reported, coming in at 1/<Number of tasks> of the expected value (for example, with 3 task instances each submitting a count of 10 in an interval, the reported sum is 10 instead of 30). This only occurs for `Number of tasks` > 1.

This happens as a result of two behaviors: every task instance of a service ends up with the same `hostname` parameter value (this is the current AWS Fargate behavior), and count metrics that share a hostname are treated as coming from one source. As a result, the count metrics from each of the service's task instances are considered as coming from the same source (`hostname`), so only one count metric is processed for each sample interval and the rest are discarded. This reduces the summed count per interval to a single count rather than the sum of multiple counts. If each of the service's task instances had a unique hostname set by AWS Fargate, then all the count metrics would be processed and summed together for that sample interval.

While `hostname` is not set uniquely per task instance of a service, there is a parameter that is: the `TaskARN`, and it is available to the container via the Task Metadata Endpoint (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v2.html).

So incorporating something like the snippet below, which leverages the uniqueness of the `TaskARN`, into the ECS entrypoint for the Datadog agent (https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-ecs.sh#L13) would fix this by setting `DD_HOSTNAME` to something unique per task instance.

This is based off of #2288 (comment) and https://github.com/aws/amazon-ecs-agent/issues/3#issuecomment-437643239, and we have confirmed this is working by adjusting our Dockerfile and our `/entrypoint.sh` along the lines of the sketch below.
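A minimal sketch of that entrypoint change (the jq-based lookup and the `task-` hostname prefix are illustrative; a full workaround script and Dockerfile appear in the comments above):

```bash
#!/bin/bash
# Give each Fargate task instance a unique hostname derived from its TaskARN,
# so DogStatsD counts from different task instances are not collapsed together.
# Requires curl and jq in the agent image.
task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
export DD_HOSTNAME="task-${task_arn}"
exec /init  # hand off to the stock agent entrypoint
```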
Describe what you expected:
Count metrics should be counted from each task instance running in Fargate.
Steps to reproduce the issue:
Run a service with a `Service Type` of replica and `Number of tasks` > 1 that utilizes the Datadog container as a sidecar per https://www.datadoghq.com/blog/monitor-aws-fargate/

Additional environment details (Operating System, Cloud provider, etc):
AWS Fargate
DataDog agents datadog/agent:6.5.2 and datadog/agent:6.10.1 from dockerhub