Tackling the Big, Bad post_view query #5555
I think there are a bunch of "nibble around the edges" changes that could be made. Clearly there will be difficulties around the boundaries of the tranches - maybe daily tranches are better? Any time restriction that excludes most of the Post records in a fast way is the key idea; a sketch of that is below.

EDIT: The things I was saying about pagination and joins now seem more relevant, and I'm going to look there first.
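A minimal sketch of the tranche idea, assuming a `post` table with an indexed `published` timestamp (the names here loosely follow Lemmy's schema and are not taken from the actual query):

```sql
-- Restrict the scan to a recent window first, so an index on published can
-- discard most Post rows cheaply before any ranking or join work happens.
-- If a page comes back short, the application would retry with a wider tranche.
SELECT p.id, p.name
FROM post p
WHERE p.published > now() - interval '1 day'  -- the "daily tranche"
ORDER BY p.published DESC
LIMIT 20;
```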
Now is definitely the perfect time to make breaking changes. And database optimization is something we can use a lot of help with, so thanks for looking into this!
Schema changes are fine. Migrations are already written in raw SQL (see the migrations folder). There are also a bunch of triggers. We used materialized views in the past but got rid of them (#908). Tables are already denormalized in some cases to improve performance, so that's okay.
Makes sense to me; post_view is very complex, so it would be easier to maintain as several smaller files.
@dessalines is already working on removing all the leftover limit/offset pagination in favor of cursors (#5424, #5429).
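For readers unfamiliar with the cursor approach those PRs move toward, here is a rough sketch of keyset pagination; the `(score, id)` sort key and the `:last_*` placeholders are illustrative, not the actual columns Lemmy sorts on:

```sql
-- Keyset (cursor) pagination: instead of OFFSET n, carry the last row's sort
-- key forward. The row-wise comparison (a, b) < (x, y) matches the DESC, DESC
-- ordering, and an index on (score DESC, id DESC) makes the seek cheap.
SELECT p.id, p.score
FROM post p
WHERE (p.score, p.id) < (:last_score, :last_id)  -- cursor from the previous page
ORDER BY p.score DESC, p.id DESC
LIMIT 20;
```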
This definitely sounds useful. @dullbananas already created a small performance test: https://github.com/LemmyNet/lemmy/tree/main/crates/db_perf
Lemmy uses keyset pagination already (except for legacy support). The ranking functions are all indexed where possible (and denormalized). We already replaced most of the joins with subqueries, as far as I remember. Afaik it is not possible to optimize the post query much further, so I wouldn't get my hopes up too much.

The problem is not solvable in general: a user has subscribed to an arbitrary subset of all communities, and depending on which ones, the front page may contain posts from very different time spans. If a user subscribes to many popular communities, the time frame will only be a few hours; a user that subscribes to few communities may see a time frame of multiple weeks. The time frame to query differs for every community based on its respective popularity.

PostgreSQL sometimes chooses to filter communities first and apply ranking second; sometimes it filters by ranking first, then communities second. You can see this in the query plans. The first option is basically to loop through every community the user subscribes to individually, get the corresponding page (based on ranking), then join. The other is to get posts from all communities, then filter out the ones not subscribed to. PG does all this internally, based on global statistics.

This decision should really be made per user, though, which the PG planner does not do - this is something we could do manually in theory, or maybe by tweaking the PG statistics. It would require multivariate statistics: the planner would need to know, for every user ID, at least the number of communities to expect, and to do well it would also need the size of each of those communities (to estimate the number of posts to expect per ranking). I doubt it's possible, honestly. Making the decision manually could work, but it would make the code a lot more complicated, potentially incomprehensible, and I don't think the benefit would be large.

The only "real" improvement afaik would be to cache the front page per user. That's what reddit does (did): cache the front page view of each user for an hour or so, including the next 10-20 pages.
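The "loop through every community individually" plan shape described above can be forced by hand with a LATERAL join. This is only a sketch of the idea under assumed names: `community_follower` and a sort on `published` stand in for the real follower table and ranking column.

```sql
-- Top-N posts per followed community, then rank the union. This pins the
-- "filter communities first" strategy regardless of what the planner guesses.
SELECT top.*
FROM community_follower cf
CROSS JOIN LATERAL (
    SELECT p.id, p.community_id, p.published
    FROM post p
    WHERE p.community_id = cf.community_id
    ORDER BY p.published DESC  -- stand-in for the ranking column
    LIMIT 20                   -- one page per community is enough
) AS top
WHERE cf.person_id = :person_id
ORDER BY top.published DESC
LIMIT 20;
```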
For joins, the important part (afair) is to keep the number of joins below the geqo limit (default 12 joins). Afaik this is the case currently. Below that limit, I don't think there's a performance difference between a JOIN and a subquery, since the planner will treat them the same (as long as the statistics don't tell it to treat the join as an early filter).
The ranking-type scores are updated periodically and cached.

I think the best place to start would be to expand upon @dullbananas's db_perf tests. Predictably measuring the performance of all our queries is the best first step, then making improvements / tweaks and comparing the before / after. Or if there are some postgres analyzer visualizers that can help us identify the join / query bottlenecks.

I see the most potential gains coming from tweaking indices, and possibly moving some things to
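On the analyzer/visualizer point: one common workflow (an assumption on my part, not something the project has adopted) is to capture plans as JSON and paste them into a visualizer such as https://explain.dalibo.com/:

```sql
-- ANALYZE executes the query for real timings; BUFFERS adds page-access
-- counts, which is usually where join bottlenecks show up. The query below
-- is just a placeholder for the post_view query under test.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT p.id FROM post p ORDER BY p.published DESC LIMIT 20;
```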
If this ends up being our only choice (for the subscribed query at least), then I wouldn't be opposed to adding a postgres table for this (similar to our combined tables). Should be a last resort though.
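A hypothetical sketch of what such a table could look like; the name `front_page_cache` and its columns are invented here, not part of Lemmy's schema:

```sql
-- Refreshed periodically per user (e.g. by a background job), rather than
-- computed per request.
CREATE TABLE front_page_cache (
    person_id int NOT NULL,
    post_id   int NOT NULL,
    rank      int NOT NULL,  -- position within the cached pages
    PRIMARY KEY (person_id, rank)
);

-- Serving a page then becomes a primary-key range scan:
SELECT post_id
FROM front_page_cache
WHERE person_id = :person_id AND rank BETWEEN 0 AND 19
ORDER BY rank;
```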
@Nutomic @phiresky @dessalines Thank you for your responses. I still have a trick up my sleeve that I think will work, but I will have to prove it out in code. I agree in principle that having the performance tests as the top priority is the way to go, but my brain just can't let go of this query optimization idea. My goal is to have a (semi-?)functional draft by Monday, which could be wildly optimistic.
There's also from_collapse_limit and join_collapse_limit, which are 8 by default: https://www.postgresql.org/docs/17/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
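For anyone reproducing this locally, these settings can be inspected and tweaked per session (defaults as of recent PostgreSQL versions):

```sql
SHOW geqo_threshold;           -- default 12: at this many FROM items, GEQO kicks in
SHOW from_collapse_limit;      -- default 8
SHOW join_collapse_limit;      -- default 8
SET join_collapse_limit = 12;  -- let the planner reorder larger join trees
```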
After some conversation back and forth, I was asked to continue the discussion here
Context links:
https://lemmy.ml/post/27659153/17534521
https://matrix.to/#/!tUnhsBRCsiePXfsIGe:matrix.org/$OufmGMnF3ndZ2gP4Xo52qZyktt2kk8wE0hA5y1prQ10?via=matrix.org&via=mozilla.org&via=tchncs.de
====== Lemmy comment text ======
Good evening Dessalines, I have started looking at the posts query.
The lowest hanging fruit, I think, would be replacing some of the joins with `WHERE EXISTS`, which can have a huge impact on query time. It seems this is supported in Diesel: https://stackoverflow.com/a/74300447

This is my first time looking at the codebase, so I can't tell yet which joins are purely for filtering (in which case they can be replaced by `WHERE EXISTS`) and which joins need to be left in because some of their columns end up in the final `SELECT`.
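To make the intended rewrite concrete, here is a sketch of the join-to-`WHERE EXISTS` transformation; the `community_follower` filter is an illustrative example, not lifted from the actual post_view query:

```sql
-- Before: a join used purely for filtering (no cf.* columns in the SELECT)
SELECT p.id
FROM post p
JOIN community_follower cf
  ON cf.community_id = p.community_id AND cf.person_id = :person_id;

-- After: the same filter as a semi-join. EXISTS stops at the first match and
-- can never duplicate post rows, which a filtering JOIN can.
SELECT p.id
FROM post p
WHERE EXISTS (
    SELECT 1
    FROM community_follower cf
    WHERE cf.community_id = p.community_id AND cf.person_id = :person_id
);
```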
I can't tell for sure yet, but it also looks like this might be using `LIMIT ... OFFSET` pagination? That can be a real drag on performance but isn't as easy to fix.

EDIT:
Looking some more, and reading some linked GitHub discussion - I think really getting this out of the performance pits will require some denormalization, like a materialized view or manual cache tables populated by triggers. I really like the ranking algorithm, but so far I'm finding it difficult to optimize from a query perspective.
====== Matrix Message Text ======
Good day Lemmy dev team,
Following up on this thread: https://lemmy.ml/post/27659153/17538040
I want to help out and I think there are huge performance gains to be had but I want to get a pulse check from the maintainers before I go off making big changes.
Some of my preliminary questions:
Regarding the `LIMIT ... OFFSET` pagination: can we do that breaking change as part of the push to 1.0?

I didn't even mention the question of testing. I think ultimately the project should have a way to bulk seed a local postgres with semi-realistic fake data; a sketch of one approach is below. I would be open to working on that as well.
cc:
@Nutomic @dessalines @dullbananas
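On the seeding idea, a minimal sketch using `generate_series`; the column list is loose (the real `post` table has more required columns and foreign keys that would need valid rows first):

```sql
-- Bulk-insert a million fake posts spread over 90 days across 100 fake
-- communities and 1000 fake creators, so query plans resemble production.
INSERT INTO post (name, community_id, creator_id, published)
SELECT 'seed post ' || i,
       1 + (i % 100),                            -- fake community id
       1 + (i % 1000),                           -- fake creator id
       now() - (random() * interval '90 days')   -- spread publish times
FROM generate_series(1, 1000000) AS i;
```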