
High CPU utilization causing Kubernetes pod scaling with ddtrace > 2.3.0 #9447

Open
hemantgir opened this issue May 30, 2024 · 9 comments

@hemantgir

Summary of problem

We have noticed that upgrading ddtrace to any version above 2.3.0 results in a significant increase in CPU utilization, which leads to the maximum number of replicas being deployed.

For instance, our Kubernetes application is configured with an auto-scaling limit of 36 maximum replicas. Prior to the upgrade, our stage environment would typically use only 6-8 pods while idle. However, post-upgrade, we are reaching the upper limit of 36 replicas.

This unexpected behavior suggests that there may be a spike in resource usage introduced in versions above 2.3.0. We would like to understand the cause of this increased resource consumption and seek a solution to optimize it.

Additionally, we updated to datadog_lambda==5.83.0 to be compatible with ddtrace==2.3.0.

(Maybe a red herring: we also noticed that calls to POST /telemetry/proxy/api/v2/apmtelemetry increase on versions above 2.3.0.)
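
If it helps narrow this down, here is a rough diagnostic sketch (our own helper, not part of ddtrace; it assumes psutil is installed and Python 3.8+ for Thread.native_id) that prints CPU time per thread, so a busy ddtrace background thread such as the telemetry writer would stand out:

```python
# Hypothetical diagnostic, not part of ddtrace: print CPU time per thread,
# labelled with the Python thread name where one exists, so that a busy
# ddtrace background thread (telemetry writer, profiler, etc.) stands out.
# Assumes `pip install psutil` and Python 3.8+ (for Thread.native_id).
import threading
import psutil

def cpu_by_thread():
    names = {t.native_id: t.name for t in threading.enumerate()}
    for t in sorted(psutil.Process().threads(),
                    key=lambda t: t.user_time + t.system_time,
                    reverse=True):
        label = names.get(t.id, "<native thread>")
        print(f"{label}: user={t.user_time:.2f}s system={t.system_time:.2f}s")

if __name__ == "__main__":
    cpu_by_thread()
```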

Datadog screenshots (Kubernetes pods are in idle state):

On ddtrace 2.7.5:
sum:kubernetes_state.deployment.replicas_available{env:... ,service:...}
[screenshot]

APM POST /telemetry/proxy/api/v2/apmtelemetry
[screenshot]

On ddtrace 2.3.0:
sum:kubernetes_state.deployment.replicas_available{env:... ,service:...}
[screenshot]

APM POST /telemetry/proxy/api/v2/apmtelemetry
[screenshot]

Which version of dd-trace-py are you using?

We originally bumped to 2.7.5, but have now downgraded to 2.3.0. We have also tried the latest, 2.8.5.

Which version of pip are you using?

pip 24.0

Spike with:

Any version above ddtrace 2.3.0

pip freeze

aioboto3==9.5.0
aiobotocore==2.2.0
aiodns==3.0.0
aiohttp==3.9.5
aiohttp-retry==2.4.5
aioitertools==0.8.0
aioredis==1.3.1
aioredis-cluster==1.5.2
aiosignal==1.2.0
ansible==9.1.0
ansible-core==2.16.4
asgiref==3.8.0
asn1crypto==1.5.1
async-kinesis==1.1.5
async-timeout==4.0.2
asyncio-throttle==1.0.2
atomicwrites==1.4.0
attrs==20.3.0
aws-kinesis-agg==1.1.3
aws-xray-sdk==2.6.0
awscli==1.22.76
bcrypt==3.2.0
black==24.4.2
blinker==1.7.0
boto==2.45.0
boto3==1.21.21
botocore==1.24.21
Brotli==1.0.9
brotlipy==0.7.0
bytecode==0.15.1
CacheControl==0.12.6
cachetools==4.1.1
cattrs==22.2.0
certifi==2023.7.22
cffi==1.16.0
chardet==3.0.4
charset-normalizer==2.0.8
cityhash==0.4.7
click==8.1.7
colorama==0.4.1
coverage==7.0.4
cryptography==42.0.5
dal-admin-filters==1.1.0
datadog==0.41.0
datadog_lambda==5.91.0
ddsketch==2.0.4
ddtrace==2.7.4
decorator==4.4.2
defusedxml==0.7.1
Deprecated==1.2.14
deprecation==2.1.0
Django==4.2.11
django-auditlog==3.0.0
django-autocomplete-light==3.11.0
django-cleanup==6.0.0
django-cors-headers==3.7.0
django-csp==3.7
django-discover-runner==1.0
django-extensions==3.1.5
django-filter==2.4.0
django-health-check==3.18.1
django-hosts==5.1
django-json-widget==2.0.1
django-nested-admin==3.4.0
django-redis==4.11.0
django-rest-serializer-field-permissions==4.1.0
django-role-permissions==2.2.0
django-rq==2.10.2
django-ses==3.5.0
django-snowflake==4.2.2
django-storages==1.12.3
django-webpack-loader==0.5.0
django_reverse_admin==2.9.6
djangorestframework==3.14.0
djangorestframework-csv==2.1.0
djangorestframework-gis==0.18
dnspython==2.6.1
docutils==0.15.2
dogslow==1.2
drf-flex-fields==0.9.8
drf-jwt==1.19.2
elementpath==2.2.3
envier==0.5.1
et-xmlfile==1.1.0
execnet==1.9.0
fakeredis==2.7.1
filelock==3.12.2
frozenlist==1.4.1
future==0.18.3
geojson==2.4.1
googleapis-common-protos==1.53.0
grpcio==1.62.0
grpcio-health-checking==1.62.0
grpcio-reflection==1.62.0
grpcio-status==1.62.0
gunicorn==22.0.0
hiredis==2.3.2
httplib2==0.19.0
idna==3.7
importlib-metadata==6.11.0
importlib-resources==5.8.0
iniconfig==2.0.0
intervaltree==3.1.0
isort==5.13.2
Jinja2==3.1.3
jmespath==0.10.0
json-stream==2.3.2
json-stream-rs-tokenizer==0.4.25
jsonpickle==3.0.3
jsonschema==4.5.1
magicattr==0.1.5
MarkupSafe==2.1.1
more-itertools==8.6.0
msgpack==1.0.0
multidict==5.1.0
mypy-extensions==1.0.0
nplusone==1.0.0
openpyxl==3.0.7
opentelemetry-api==1.23.0
orjson==3.9.15
packaging==24.0
paramiko==3.4.0
pathspec==0.12.1
pillow==10.3.0
platformdirs==3.8.1
pluggy==1.0.0
protobuf==4.21.7
psycopg2==2.9.9
psycopg2-binary==2.9.9
py-dateutil==2.2
pyasn1==0.4.8
pycares==4.2.0
pycodestyle==2.5.0
pycountry==22.3.5
pycparser==2.20
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pytest==7.2.0
pytest-cov==4.0.0
pytest-django==4.5.2
pytest-shard==0.1.2
pytest-xdist==3.1.0
python-dateutil==2.8.0
python-json-logger==0.1.8
python-memcached==1.59
python-monkey-business==1.0.0
pytz==2020.4
PyYAML==5.3.1
redis==3.5.3
redis-py-cluster==2.1.3
requests==2.31.0
resolvelib==0.5.4
rq==1.14.0
rsa==4.7
s3transfer==0.5.0
setproctitle==1.1.10
Shapely==1.6.4
simplejson==3.14.0
six==1.16.0
snowflake-connector-python==3.7.1
sortedcontainers==2.4.0
splunk-handler==2.0.7
sqlparse==0.5.0
tenacity==6.2.0
tomlkit==0.12.1
typing_extensions==4.7.1
unicodecsv==0.14.1
urllib3==1.26.18
Werkzeug==3.0.1
whitenoise==6.0.0
wrapt==1.14.0
xmlschema==1.2.5
xmltodict==0.13.0
yarl==1.9.4
zipp==3.18.1

How can we reproduce your problem?

I'm not sure how you can replicate the issue on your end. We are using Datadog tooling, and we have metrics that continuously monitor the pods and report results whether they are idle or serving traffic.
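
That said, a rough comparison harness might look like the sketch below (an assumption on our part rather than our actual setup; it needs psutil and uses ddtrace-run to start the tracer while the process sits idle):

```python
# Hypothetical idle-CPU comparison, not our production setup. Install a given
# ddtrace version plus psutil, then run the script under ddtrace-run, e.g.:
#   pip install "ddtrace==2.3.0" psutil && ddtrace-run python idle_cpu.py
#   pip install "ddtrace==2.7.5" psutil && ddtrace-run python idle_cpu.py
import time
import ddtrace
import psutil

proc = psutil.Process()
proc.cpu_percent(interval=None)   # prime the counter
print("ddtrace", ddtrace.__version__)
time.sleep(60)                    # stay idle; only ddtrace background threads run
print("idle CPU over 60s:", proc.cpu_percent(interval=None), "%")
```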

What is the result that you get?

High CPU utilization causing Kubernetes pod scaling up to the maximum replicas even when idle, on ddtrace > 2.3.0.

What is the result that you expected?

CPU utilization and Kubernetes pod scaling only as much as required, on ddtrace > 2.3.0.

@emmettbutler
Collaborator

Thank you for reporting this, @hemantgir. Could you share all relevant environment variables set in the app environment? This will help us understand what bits of Datadog functionality are enabled and disabled in this case.

@hemantgir
Author

> Thank you for reporting this, @hemantgir. Could you share all relevant environment variables set in the app environment? This will help us understand what bits of Datadog functionality are enabled and disabled in this case.

Thank you for your response. Please find the list of environment variables below:

DD_DBM_PROPAGATION_MODE : disabled
DD_DJANGO_USE_HANDLER_RESOURCE_FORMAT : True
DD_ENV : stage
DD_LOGS_INJECTION : True
DD_SERVICE : Django
DD_TRACE_SAMPLE_RATE : 1
DD_TRACE_SAMPLING_RULES : [{"sample_rate": 1}]
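
For completeness, here is a small sketch of how we could confirm at runtime which Datadog-related variables the pod actually sees (just os.environ lookups plus the installed version, not a ddtrace API):

```python
# Rough runtime check (our own helper, not a ddtrace API): print the installed
# ddtrace version and every DD_* environment variable visible to the process.
import os
import ddtrace

print("ddtrace", ddtrace.__version__)
for name in sorted(k for k in os.environ if k.startswith("DD_")):
    print(f"{name}={os.environ[name]}")
```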

@github-actions github-actions bot added the stale label Aug 5, 2024
@joshverma

Did you ever figure this out?
@hemantgir

@github-actions github-actions bot removed the stale label Oct 5, 2024
@kousiksundara

Hi there,

I am impacted by this issue as well - a Python service running on Kubernetes.
We are upgrading from 2.7.2, and we were able to go up to 2.8.0 without the CPU spike hitting us.

We tried 2.14.2, 2.10.0, and 2.9.2 - all of these versions caused the initial CPU spike.

Any updates on this? It pretty much blocks us from upgrading ddtrace any further.

@hemantgir
Author

I accidentally closed this issue and I don't have permission to reopen it.
Can someone please reopen it? @emmettbutler @DataDog @Kyle-Verhoog

@sanchda sanchda reopened this Oct 25, 2024
@taegyunkim
Contributor

What Python version do you use?

@kousiksundara

We are using Python 3.10.14.

We were seeing very minor CPU spikes until we upgraded from 2.7.2 to 2.14.2, 2.10.0, and 2.9.2, after which the spike was much bigger and lasted much longer.

2.8.0 and 2.8.1 sent it back to 2.7.2 levels.

@github-actions github-actions bot added the stale label Jan 1, 2025
@fb-justin

Also seeing this after going from 2.7.4 to 2.21.0.

@delfick

delfick commented Feb 19, 2025

In case it's connected, I filed this bug the other day: #12370

@github-actions github-actions bot removed the stale label Feb 20, 2025