-
Notifications
You must be signed in to change notification settings - Fork 7
Optimize prefixes for queries #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Oh, as a Map, it's basically an inverted index now! I'll review and think about this. |
After my talk with @staltz I did some tests where I tried to create bitsets on top of the prefix arrays and using the bitsets to narrow down the number of rows we need to run through. Note that modern CPUs are blazingly fast at running through arrays so we are up against 10ms for 1.5million. The bitsets for one example query looks like this:
the first number being the number of number of matches, the second the size of the bitset in bytes. Using the 8 "best" bitsets we get the following results:
Meaning it used 4 of those bitsets to narrow down the result to 45k instead of 1.5m. Sadly the union operation is about the same speed as just running over the array, so I'm not sure if that idea is worth persuing more. I'll fix whatever is making the benchmark tests not run and see if we can save this to storage in a better way. There is the possibility of doing this prefix map as another type, so we have the existing prefix also. |
Another idea we discussed was not saving the 0 (key) buckets for the map implementation. Going to test that. |
I actually worked on this branch for 10mins just to put the perf.js inside benchmark/index.js, and it worked. I didn't commit and push because I had to go out, but the code is ready. I'll later rebase onto this branch, so you don't need to do it. :) @arj03 |
Benchmark results
|
Sweet. Thank you :) |
Aaaactually that was on db2, not in jitdb. But still, probably useful. |
So turns out the skip 0 is a really good idea for things where not all messages have values and you don't need to query for unknown. So like votes. Before:
After:
|
That translates of course into a bit faster initial load on votes if used. I'm not exactly sure where to go next with this. For votes this really makes a lot of sense. The indexes are smaller, and much faster. Not really for author so maybe add a new index type? Existing prefix:
|
Benchmark results
|
Yep, I'm all for making different types of indexes and tailoring the operators to use different types of indexes. |
Benchmark results
|
Benchmark results
|
Nice! I checked the code and looks neat so far. I have a few more things I'd like to review. Other question: have you tried BIPF instead of JSON.parse/stringify? |
Right I forgot to say that. I did. Sadly its not faster, especially for the saving part and we don't get any of the "I can just get this part", we really need to decode the whole thing. |
@arj03 I added a commit to benchmark prefix map indexes in CI. If I understood correctly, the 1st run of a prefix map is slightly slower than the 1st run of a prefix non-map index, but the 2nd run is noticeably faster (from ~16ms down to ~4ms). Is that your experience as well? (CI will report bench results soon) |
Exactly |
Benchmark results
|
index.js
Outdated
function updatePrefixMapIndex(opData, index, buffer, seq, offset) { | ||
if (seq > index.count - 1) { | ||
const fieldStart = opData.seek(buffer) | ||
if (fieldStart) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arj03 Thanks. Did you also see my comment about ~fieldStart
? Or is this intentionally testing for "field must not be at the beginning of the buffer"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was keeping it in line with normal prefix indexes. It seems you are correct in that both have a bug if its at position 0 in the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't realize that normal prefix indexes had that too! Oops
Note, |
Benchmark results
|
Pushed up a new commit. I'm away for the rest of the day, but feel free to merge if you want to try this out in manyverse :) And thanks for the review |
Benchmark results
|
Benchmark results
|
Benchmark results
|
This took a while to wrap my head around. Basically before we were making it easy to create prefix indexes (just insert into tarr), but slow to query them loop. Instead lets optimize for the query because prefix indexes are used for very often used queries (key, votes, hasRoot etc.). The natural choice was to store prefixes as maps instead of arrays. Mapping the lookup key to sequences. Besides a bit slower index creation (25-70%) the size has almost doubled. I did not find a better solution than just storing them JSON stringified, but I'm sure there is a better way because we are just storing numbers here.
Query numbers for already created indexes using perf script:
old
new