
High CPU utilization causing Kubernetes pod scaling with ddtrace > 2.3.0 #9447

Open
hemantgir opened this issue May 30, 2024 · 9 comments

@hemantgir

Summary of problem

We have noticed that upgrading ddtrace to any version above 2.3.0 results in a significant increase in CPU utilization, which leads to the maximum number of replicas being deployed.

For instance, our Kubernetes application is configured with an auto-scaling limit of 36 maximum replicas. Prior to the upgrade, our stage environment would typically use only 6-8 pods while idle. However, post-upgrade, we are reaching the upper limit of 36 replicas.

This unexpected behavior suggests that there may be a spike in resource usage introduced in versions above 2.3.0. We would like to understand the cause of this increased resource consumption and seek a solution to optimize it.

Additionally, we updated to datadog_lambda==5.83.0 to be compatible with ddtrace==2.3.0.

(Maybe a red herring: we also noticed that calls to POST /telemetry/proxy/api/v2/apmtelemetry increase on versions above 2.3.0.)
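
If it helps narrow this down, here is a rough diagnostic sketch (our own helper, not part of ddtrace; it assumes psutil is installed and Python 3.8+ for Thread.native_id) that prints CPU time per thread, so a busy ddtrace background thread such as the telemetry writer would stand out:

```python
# Hypothetical diagnostic, not part of ddtrace: print CPU time per thread,
# labelled with the Python thread name where one exists, so that a busy
# ddtrace background thread (telemetry writer, profiler, etc.) stands out.
# Assumes `pip install psutil` and Python 3.8+ (for Thread.native_id).
import threading
import psutil

def cpu_by_thread():
    names = {t.native_id: t.name for t in threading.enumerate()}
    for t in sorted(psutil.Process().threads(),
                    key=lambda t: t.user_time + t.system_time,
                    reverse=True):
        label = names.get(t.id, "<native thread>")
        print(f"{label}: user={t.user_time:.2f}s system={t.system_time:.2f}s")

if __name__ == "__main__":
    cpu_by_thread()
```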

Datadog screenshots (Kubernetes pods are in idle state):

On ddtrace 2.7.5:
sum:kubernetes_state.deployment.replicas_available{env:... ,service:...}
[screenshot]

APM POST /telemetry/proxy/api/v2/apmtelemetry
[screenshot]

On ddtrace 2.3.0:
sum:kubernetes_state.deployment.replicas_available{env:... ,service:...}
[screenshot]

APM POST /telemetry/proxy/api/v2/apmtelemetry
[screenshot]

Which version of dd-trace-py are you using?

We originally bumped to 2.7.5, but have now downgraded to 2.3.0. We have also tried the latest, 2.8.5.

Which version of pip are you using?

pip 24.0

Spike with:

Any version above ddtrace 2.3.0

pip freeze

aioboto3==9.5.0
aiobotocore==2.2.0
aiodns==3.0.0
aiohttp==3.9.5
aiohttp-retry==2.4.5
aioitertools==0.8.0
aioredis==1.3.1
aioredis-cluster==1.5.2
aiosignal==1.2.0
ansible==9.1.0
ansible-core==2.16.4
asgiref==3.8.0
asn1crypto==1.5.1
async-kinesis==1.1.5
async-timeout==4.0.2
asyncio-throttle==1.0.2
atomicwrites==1.4.0
attrs==20.3.0
aws-kinesis-agg==1.1.3
aws-xray-sdk==2.6.0
awscli==1.22.76
bcrypt==3.2.0
black==24.4.2
blinker==1.7.0
boto==2.45.0
boto3==1.21.21
botocore==1.24.21
Brotli==1.0.9
brotlipy==0.7.0
bytecode==0.15.1
CacheControl==0.12.6
cachetools==4.1.1
cattrs==22.2.0
certifi==2023.7.22
cffi==1.16.0
chardet==3.0.4
charset-normalizer==2.0.8
cityhash==0.4.7
click==8.1.7
colorama==0.4.1
coverage==7.0.4
cryptography==42.0.5
dal-admin-filters==1.1.0
datadog==0.41.0
datadog_lambda==5.91.0
ddsketch==2.0.4
ddtrace==2.7.4
decorator==4.4.2
defusedxml==0.7.1
Deprecated==1.2.14
deprecation==2.1.0
Django==4.2.11
django-auditlog==3.0.0
django-autocomplete-light==3.11.0
django-cleanup==6.0.0
django-cors-headers==3.7.0
django-csp==3.7
django-discover-runner==1.0
django-extensions==3.1.5
django-filter==2.4.0
django-health-check==3.18.1
django-hosts==5.1
django-json-widget==2.0.1
django-nested-admin==3.4.0
django-redis==4.11.0
django-rest-serializer-field-permissions==4.1.0
django-role-permissions==2.2.0
django-rq==2.10.2
django-ses==3.5.0
django-snowflake==4.2.2
django-storages==1.12.3
django-webpack-loader==0.5.0
django_reverse_admin==2.9.6
djangorestframework==3.14.0
djangorestframework-csv==2.1.0
djangorestframework-gis==0.18
dnspython==2.6.1
docutils==0.15.2
dogslow==1.2
drf-flex-fields==0.9.8
drf-jwt==1.19.2
elementpath==2.2.3
envier==0.5.1
et-xmlfile==1.1.0
execnet==1.9.0
fakeredis==2.7.1
filelock==3.12.2
frozenlist==1.4.1
future==0.18.3
geojson==2.4.1
googleapis-common-protos==1.53.0
grpcio==1.62.0
grpcio-health-checking==1.62.0
grpcio-reflection==1.62.0
grpcio-status==1.62.0
gunicorn==22.0.0
hiredis==2.3.2
httplib2==0.19.0
idna==3.7
importlib-metadata==6.11.0
importlib-resources==5.8.0
iniconfig==2.0.0
intervaltree==3.1.0
isort==5.13.2
Jinja2==3.1.3
jmespath==0.10.0
json-stream==2.3.2
json-stream-rs-tokenizer==0.4.25
jsonpickle==3.0.3
jsonschema==4.5.1
magicattr==0.1.5
MarkupSafe==2.1.1
more-itertools==8.6.0
msgpack==1.0.0
multidict==5.1.0
mypy-extensions==1.0.0
nplusone==1.0.0
openpyxl==3.0.7
opentelemetry-api==1.23.0
orjson==3.9.15
packaging==24.0
paramiko==3.4.0
pathspec==0.12.1
pillow==10.3.0
platformdirs==3.8.1
pluggy==1.0.0
protobuf==4.21.7
psycopg2==2.9.9
psycopg2-binary==2.9.9
py-dateutil==2.2
pyasn1==0.4.8
pycares==4.2.0
pycodestyle==2.5.0
pycountry==22.3.5
pycparser==2.20
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pytest==7.2.0
pytest-cov==4.0.0
pytest-django==4.5.2
pytest-shard==0.1.2
pytest-xdist==3.1.0
python-dateutil==2.8.0
python-json-logger==0.1.8
python-memcached==1.59
python-monkey-business==1.0.0
pytz==2020.4
PyYAML==5.3.1
redis==3.5.3
redis-py-cluster==2.1.3
requests==2.31.0
resolvelib==0.5.4
rq==1.14.0
rsa==4.7
s3transfer==0.5.0
setproctitle==1.1.10
Shapely==1.6.4
simplejson==3.14.0
six==1.16.0
snowflake-connector-python==3.7.1
sortedcontainers==2.4.0
splunk-handler==2.0.7
sqlparse==0.5.0
tenacity==6.2.0
tomlkit==0.12.1
typing_extensions==4.7.1
unicodecsv==0.14.1
urllib3==1.26.18
Werkzeug==3.0.1
whitenoise==6.0.0
wrapt==1.14.0
xmlschema==1.2.5
xmltodict==0.13.0
yarl==1.9.4
zipp==3.18.1

How can we reproduce your problem?

I'm not sure how you can replicate the issue on your end. We are using Datadog tooling, and we have metrics that continuously monitor the pods and report results whether they are idle or serving traffic.
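
That said, a rough comparison harness might look like the sketch below (an assumption on our part rather than our actual setup; it needs psutil and uses ddtrace-run to start the tracer while the process sits idle):

```python
# Hypothetical idle-CPU comparison, not our production setup. Install a given
# ddtrace version plus psutil, then run the script under ddtrace-run, e.g.:
#   pip install "ddtrace==2.3.0" psutil && ddtrace-run python idle_cpu.py
#   pip install "ddtrace==2.7.5" psutil && ddtrace-run python idle_cpu.py
import time
import ddtrace
import psutil

proc = psutil.Process()
proc.cpu_percent(interval=None)   # prime the counter
print("ddtrace", ddtrace.__version__)
time.sleep(60)                    # stay idle; only ddtrace background threads run
print("idle CPU over 60s:", proc.cpu_percent(interval=None), "%")
```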

What is the result that you get?

High CPU utilization causing Kubernetes pod scaling up to the maximum replicas even when idle, on ddtrace > 2.3.0.

What is the result that you expected?

CPU utilization and Kubernetes pod scaling only as much as required, on ddtrace > 2.3.0.

@emmettbutler
Collaborator

Thank you for reporting this, @hemantgir. Could you share all relevant environment variables set in the app environment? This will help us understand what bits of Datadog functionality are enabled and disabled in this case.

@hemantgir
Author

> Thank you for reporting this, @hemantgir. Could you share all relevant environment variables set in the app environment? This will help us understand what bits of Datadog functionality are enabled and disabled in this case.

Thank you for your response. Please find the list of environment variables below:

DD_DBM_PROPAGATION_MODE : disabled
DD_DJANGO_USE_HANDLER_RESOURCE_FORMAT : True
DD_ENV : stage
DD_LOGS_INJECTION : True
DD_SERVICE : Django
DD_TRACE_SAMPLE_RATE : 1
DD_TRACE_SAMPLING_RULES : [{"sample_rate": 1}]
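
For completeness, here is a small sketch of how we could confirm at runtime which Datadog-related variables the pod actually sees (just os.environ lookups plus the installed version, not a ddtrace API):

```python
# Rough runtime check (our own helper, not a ddtrace API): print the installed
# ddtrace version and every DD_* environment variable visible to the process.
import os
import ddtrace

print("ddtrace", ddtrace.__version__)
for name in sorted(k for k in os.environ if k.startswith("DD_")):
    print(f"{name}={os.environ[name]}")
```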

@github-actions github-actions bot added the stale label Aug 5, 2024
@joshverma

Did you ever figure this out?
@hemantgir

@github-actions github-actions bot removed the stale label Oct 5, 2024
@kousiksundara

Hi there,

I am impacted by this issue as well - a Python service running on Kubernetes.
We are upgrading from 2.7.2, and we were able to go up to 2.8.0 without the CPU spike hitting us.

We tried 2.14.2, 2.10.0, and 2.9.2 - all of these versions caused the initial CPU spike.

Any updates on this? It pretty much blocks us from upgrading ddtrace any further.

@hemantgir
Author

I accidentally closed this issue and I don't have permission to reopen it.
Can someone please reopen it? @emmettbutler @DataDog @Kyle-Verhoog

@sanchda sanchda reopened this Oct 25, 2024
@taegyunkim
Contributor

What Python version do you use?

@kousiksundara

We are using Python 3.10.14.

We were seeing very minor CPU spikes until we upgraded from 2.7.2 to 2.14.2, 2.10.0, and 2.9.2, after which the spike was much bigger and lasted much longer.

2.8.0 and 2.8.1 sent it back to 2.7.2 levels.

@github-actions github-actions bot added the stale label Jan 1, 2025
@fb-justin

Also seeing this after going from 2.7.4 to 2.21.0.

@delfick

delfick commented Feb 19, 2025

In case it's connected, I filed this bug the other day: #12370

@github-actions github-actions bot removed the stale label Feb 20, 2025