-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Template errors result in long development/production response times #10954
Comments
@humitos Have you noticed this at all? |
Yes, I've experienced this as well. My guess is because how we are formatting the exceptions in the console using structlog. I've seen the console outputting a lot of text due that all the traceback is formatted with nice colors, all the context variables and more. I found myself this pretty useful, but sometimes pretty distracting and annoying to understand the error... There is a trade-off there. |
@agjohnson are you able to reproduce the issue after commenting this line? https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/settings/base.py#L866 |
Okay, well at least I'm not the only one 😅
I really like the formatting and especially the additional context/locals information, but the stack trace is way to long to be useful usually, at least when working with templates. It's hard to find something useful in the stack trace when the exception starts in the template rendering (which was also always sort of the case with Django though).
I am, though without the structlog rendering of the stack trace, the default stack trace isn't as delayed. But it still takes a long time for a response and trace, and the database is still getting hit hard with queries. Edit: actually, the structlog change does seem to help some too, I get 504 responses very frequently instead without that change. I think the culprit is probably the DB CPU ultimately though. |
Ah I learned a new thing about debug toolbar! You can inspect/see the code and stack around the query in the debug toolbar SQL pane. This is a clue. When working properly, the query shows the template code that calls the query. When there is an exception raised, this is the stack around the query being executed repeatedly: I disabled cacheops and the same thing happens, so it's not that layer. So seems next candidate is Django debug output? Edit: disabling DEBUG does not fix this either. Now I'm really confused. |
Yet another clue:
|
I found this discussion, and hoped it was related, but so far nothing there seems to help: https://forum.djangoproject.com/t/changing-queryset-repr-to-avoid-unexpected-queries/21272/22?page=2 I've disabled debug toolbar, logging, and debug mode entirely. This must be something related to |
I believe #11061 is replicating this behavior in production, but waiting to confirm. If confirmed, we should probably get a fix for this issue on the roadmap. |
Noted here: The PR fixing 11061 confirms (to me) that this bug is affecting production as well. At very least the beta instance is affected by the bug. I haven't tested the current dashboard in production yet. I'll update the description here, but the bug is that a template that is evaluated via I've escalated this to a bug on our roadmap. Priority is medium -- the effect is really bad but it's only triggered when templates trigger exceptions. |
I'm not sure if it's related or not, but I think it's worth to note that we are using readthedocs.org/readthedocs/settings/base.py Lines 397 to 411 in 505ee4c
|
I tried this by removing the Without using It seems I'm not able to reproduce this issue. Maybe there is something else going on here. I'm using just |
Your results sounds about right for reproducing locally. What does the page response time look like without errors? In production, as the database is larger, response times from this are closer to 30-60s. Locally it's more like 2-4s, for me. I noted a template to test here I doubt it will give different response times but probably worth testing the same thing. Also, this is less just about the response time. Noted above, but the the underlying issue seems to be a loop in Django logic and in SQL queries. You can see this on DDT: The SQL query seems to execute over and over, I see this locally and in production |
I performed the test you mentioned and I have However, without a template raising an exception: |
This is an example to show that disabling the nice formatter, reduces the response time and SQL queries executed when there is a `TemplateSyntaxError`. See #10954
I found the issue. It's what I guessed it could be in #10954 (comment) Disabling the "nice logging output", I get a similar CPU times and only 12 SQL queries more when it returns 500 (I suppose these are performed to show the traceback in the Django 500 debug page). I pushed this changes to 592c108 to show how to disable this nice logger. |
Nice! I had tried a similar fix that you suggested (here though the permalink looks off), but it didn't help much. Maybe your fix above affects things at a deeper level. We can test this rather easily. On web-extra we can replicate #11061 or something similar quite easily. With your fix, I would at least expect the response time to be better. |
Production uses a completely different set of loggers, tho. Mainly However, they may be being affected by the same issue. Also note that this could be a readthedocs.org/requirements/pip.in Lines 164 to 174 in 8b66ce9
We may want to read their changelogs and give it another try to upgrade those packages before going deeper and deeper on debugging this. It may have fixed already. Related #10933 |
I created an initial PR for this change at #11166 -- we can use it to do some local QA and see if it works better. |
Thinking a little more about this, I think all those SQL queries are executed by the |
This can be tested further on web-extra with a template change and dropping the settings for sentry et al |
I tried removing I re-deployed this instance and gave it another try and I failed again. I may leave this for you @agjohnson in case you know how to test this in production.
I'm getting
Reverting the changes in the template, the instance works again. |
Yeah, I've encountered this, the package is not installed as editable. All that should be required is:
I think that should be enough to unblock you, so I'll leave you with assignment. But let me know if I can point you in the right direction if you're stuck. |
Yeah, that's what I did and it didn't work. |
I just tested this and it worked. 🤷 I was able to edit the base template and the changes showed in the beta instance. |
I may not be sure what I'm looking for here, then. With "it didn't work" I meant that I still got So, let's rewind a little bit here and assume that the steps from #10954 (comment) are correct, settings and templates changes are picked correctly. Then, what I should do? or what I should be looking at? |
I think I'm just confused by your reply. I understood "it didn't work" to mean that the application was still not starting or the new dashboard was not used. Do you have a working instance with the new dashboard when you make the change I posted above? As long as the 502 response is a template error in the application logs, then it sounds like you have replicated the issue, no? I would also expect a very long response time before eventual timeout and Cloudflare returning the 502. Have you had look into the issue I noted above? That issue has the same response you are describing. The underlying cause should be an excessive response time.
Why util01.org? I assume this is incorrect and you're testing this on ext or ext-theme, correct?
I'm not quite following your updates here, but it seems to me that you have replicated the issue I've described here and removing Does this not match what you are noticing? |
Oh, I see. The instance was starting and using the new dashboard properly. It was just giving 502, tho.
No, I was doing these changes on the only instance we have in the
Yes, this was a "brain typo" 😄 . I'm testing on ext-theme. |
Settings diff I'm using: docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/readthedocs.org$ diff -u readthedocs/settings/managed.py.orig readthedocs/settings/managed.py
--- readthedocs/settings/managed.py.orig 2024-03-05 11:12:44.246182303 +0000
+++ readthedocs/settings/managed.py 2024-03-05 11:13:21.966164143 +0000
@@ -443,33 +443,6 @@
'formatter': 'newrelic',
}
- # Sentry logging
- logging['formatters']['sentry'] ={
- "()": structlog.stdlib.ProcessorFormatter,
- "processors": [
- structlog.stdlib.ProcessorFormatter.remove_processors_meta,
- SentryProcessor(
- event_level=ERROR,
- as_context=False,
- tag_keys="__all__",
- ignore_loggers=[
- 'django_structlog.celery.receivers',
- 'django_structlog.middlewares.request',
- 'django.security.DisallowedHost',
- 'celery.worker.strategy',
- 'celery.worker.consumer.mingle',
- 'celery.worker.consumer.connection',
- ],
- ),
- ],
- "foreign_pre_chain": shared_processors,
- }
- logging['handlers']['sentry'] = {
- 'level': 'ERROR',
- 'class': 'logging.handlers.SysLogHandler',
- 'formatter': 'sentry',
- }
-
logging['handlers']['syslog'] = {
'level': 'INFO',
'class': 'logging.handlers.SysLogHandler',
@@ -481,7 +454,7 @@
}
logging['loggers'] = {
'': {
- 'handlers': ['console', 'syslog', 'sentry', 'newrelic'],
+ 'handlers': ['console', 'syslog', 'newrelic'],
'level': 'DEBUG',
},
'django_structlog.celery.receivers': {
docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/readthedocs.org$ Theme diff: docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/ext-theme$ git diff
WARNING: terminal is not fully functional
Press RETURN to continue
diff --git a/readthedocsext/theme/templates/projects/partials/project_list.html b/readthedocsext/theme/templates/projects/partials/project_list.html
index 86a22f3..0e3d716 100644
--- a/readthedocsext/theme/templates/projects/partials/project_list.html
+++ b/readthedocsext/theme/templates/projects/partials/project_list.html
@@ -1,6 +1,5 @@
{% extends "includes/crud/table_list.html" %}
-{% load i18n %}
{% load humanize %}
{% load privacy_tags %}
docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/ext-theme$ With this configuration and running |
I left only I ran out of ideas here 😕 . Something else may be happening here. |
Just to clarify, when you say 502, are you noticing a long response time too? I think I debugged the build list issue with pyinstrument in production, or if not, it might be worth a shot. It's easy to plug in to Django, though I forget the exact steps. |
Yes, I'm getting the Cloudflare 502 error page and it takes some time to display that page. However, I can't say for sure that's more time that just loading the dashboard when it works, since it takes ~6s from my side to get the initial HTML page. |
Hah yeah, that page response is pretty bad already. Is there a difference in load time if you don't apply your fix versus if you do apply your fix? |
Waiting on a meeting to discuss possible issues. |
We have a theory now for why this is happening, but it's still proving really hard to verify. @humitos noticed that many of the model I am assuming that if this is the error that we should be able to produce the timeout behavior in the normal dashboard as well, on What I'm a little confused by is that I can no longer reproduce the error this PR solved: https://github.com/readthedocs/ext-theme/pull/272/files -- accessing a |
Ah, nevermind. I was not testing the fix exactly and So, speaking to my comment above, I am not seeing a huge difference on the legacy dashboard between a view with an exception and a view without an exception. The extra queries seem very likely to be linked to our |
And finally, with the above clues and some temporary patches to various Without an exceptionWith an exceptionWith the The SQL queries are still deeply duplicated: And theses queries still come from So it seems that both the current |
So! This conversation looked very related: https://groups.google.com/g/django-developers/c/WzRZ0IfBipc It mentions much of the same issues, and got me thinking that we didn't try patching any of these QuerySet/Manager repr methods yet. So, I have two patches live in beta currently: diff --git a/readthedocsext/theme/templates/includes/elements/chips/build.html b/readthedocsext/theme/templates/includes/elements/chips/build.html
index 0d48789..7d90511 100644
--- a/readthedocsext/theme/templates/includes/elements/chips/build.html
+++ b/readthedocsext/theme/templates/includes/elements/chips/build.html
@@ -30,7 +30,12 @@
{% comment %}
This block is the same as the chip_text block above.
{% endcomment %}
- {% firstof text build.get_version_name %} #{{ build.id }}
+ {% if request.user and request.user.is_staff %}
+ {# This is an intentional exception #}
+ {{ text|default:build.version.verbose_name }}
+ {% else %}
+ {% firstof text build.get_version_name %}
+ {% endif %}
{% endblock popupcard_header %}
{% block popupcard_right %} diff --git a/readthedocs/projects/querysets.py b/readthedocs/projects/querysets.py
index cd612797d..92a21befd 100644
--- a/readthedocs/projects/querysets.py
+++ b/readthedocs/projects/querysets.py
@@ -14,6 +14,9 @@ class ProjectQuerySetBase(models.QuerySet):
use_for_related_fields = True
+ def __repr__(self):
+ return "ProjectQuerySetBase"
+
def _add_user_projects(self, queryset, user, admin=False, member=False):
"""Add projects from where `user` is an `admin` or a `member`."""
projects = AdminPermission.projects( The response times and query count from above were viewing the project listing where one of the project's latest build has With the patches above in on the beta instance, I get the expected 500 error page from our application, with a response time of 800ms! Locally, the CLI logging still eats up ~8s though, but the queries were reduced from 800 to 18: I'm guessing that this is all due to the beta templates using includes and extends more, especially around listing views. Overall, I think that the fix that should come out of this debugging is that:
@humitos what do you feel about the above change? Is there a better way to fix this without most likely removing some amount of project queryset debugging information in Sentry? |
I'm 👍🏼 on this plan. I'd like to have a comment on the code clearly explaining why we are overriding these Also, I think we could have a setting (enabled on local development and disabled on production) to decide whether or not override these methods. This way we can still have good debugging information locally but simplified on production. Also also, in production, we could detect if we are running under a Django shell and enable the debugging option so we can see nice rendered objects in the terminal 👍🏼
I'm not sure. Ideally, Sentry should collect all this data in an async way after returning a 500 response to the user --but I don't think that's possible. I saw that Sentry has a queue to send the messages after collecting the data, tho. At least, with the setting/configuration approach that I mentioned above, we can still have debugging information while debugging 😄 |
Yeah, I figured we'd want some conditional sort of output on these. All of that seems reasonable, though it might also make sense to start minimal to fix the bug and see what Sentry/CLI/etc do and then jump back after to add the fancy conditional output after. With debugging complete here, I'll hand this back for backend fixing, I won't jump into any further fixes here. |
This commit removes some debugging functionability in favor of production DB. `__str__` and `__repr__` methods won't be so descriptive now since we are removing some information from their rendering. This is because to render them properly we need to hit the DB multiple times in the worst case --generating 500 on some user requests that need to be logged in Sentry/New Relic. There are better ways for this: disabling logging on production and enabling it on DEBUG + Django Shell, but that requires more extra work that doesn't seems super priority right now. We can come back later and add them as we need them if we want. Closes #10954
* Adapt `__str__` and `__repr__` methods for DB This commit removes some debugging functionability in favor of production DB. `__str__` and `__repr__` methods won't be so descriptive now since we are removing some information from their rendering. This is because to render them properly we need to hit the DB multiple times in the worst case --generating 500 on some user requests that need to be logged in Sentry/New Relic. There are better ways for this: disabling logging on production and enabling it on DEBUG + Django Shell, but that requires more extra work that doesn't seems super priority right now. We can come back later and add them as we need them if we want. Closes #10954 * Make test to represent the changes
When a template that was included via
{% include %}
raises an exception, the request response time explodes. I originally noticed this in local development, but I believe that in #11061 we might have confirmed that this is a bug in production as well (so far maybe only affecting the beta instance).In local development, I noticed:
There is nothing special about my set up though, and my database is 27M.
In debugging #11061 I noticed:
To reproduce:
{% include %}
load i18n
from the template but add/retain a translated stringproject.foo.bar
The text was updated successfully, but these errors were encountered: