-
Notifications
You must be signed in to change notification settings - Fork 582
LMDB Cache: partially reuse GCed IDs so to save cycles on cleaning LMDB #17469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BTW, somehow I have designed similar structure in an earlier version of Work Partitioner's ID generator. I don't know if there's value factor out some common patterns. Probably for now it's not needed. If we run into these multiple time we'll consider that. |
| Some reused_id -> | ||
[%log debug] "Reusing LMDB key %d for a new KV pair" reused_id | ||
~metadata:[ ("index", `Int reused_id) ] ; | ||
Hash_set.remove garbage reused_id ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hash_set.remove garbage reused_id ; |
in some code below we'll overwrite the value anyway:
Rw.set ~env db idx x ;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think being inside the hashset and being cleared from LMDB cache has different semantic. If we don't remove them from the garbage hashset here. It will be reassigned by some other invocations to put
, or being erased when there's too many garbages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you're right @glyh. This is Hash_set.remove
here and not Rw.remove
- the reused ID needs to be removed from the hash set tracking if we're going to reuse it.
00f7243
to
97a1a48
Compare
c841ecc
to
8a787ef
Compare
I'll force push later to adapt to earlier PR forced pushed in the train |
rebased |
| None -> | ||
incr counter ; !counter - 1 | ||
| Some reused_id -> | ||
[%log spam] "Reusing LMDB key %d for a new KV pair" reused_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this log please?
!ci-build-me |
!ci-bypass-changelog |
there's likely a merge conflict. I'll rebase later |
rebased so there's no merge conflicts |
!ci-build-me |
!ci-build-me |
let idx = !counter in | ||
incr counter ; | ||
let idx = | ||
match Hash_set.find garbage ~f:(const true) with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you are trying to achieve by using ~f:(const true)
.
This will always return either a random element (or the first, I don't remember if find
on Hashtbl
is deterministic) if the set is empty (which is the case until the GC has been executed at least once on a value) or it will increase the counter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ensure that we reached the garbage size limit, otherwise, it will be potentially re-using some index whose values have not been removed yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selecting a random element from hashset and use it as new key is exactly the expected behavior. Any key in that set is abandoned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selecting a random element from hashset and use it as new key is exactly the expected behavior. Any key in that set is abandoned
That also seems correct to me. If an int is in the set then the associated id
was judged unreachable and finalized by the GC, so there shouldn't be any live objects still using the id
and we're safe to reuse the int.
I think the rationale for the PR is that whereas before we would go through a cycle where we accumulate unused indices until we hit garbage_size_limit
and then do a full cleanup of the lmdb cache, now we avoid some of those cleanups by reusing cache slots.
As title. This is a follow up of using hashset to track GCed IDs.
NOTE: there's still IDs that can't be reused, if there's too many garbages( >= garbage_size_limit) appearing at the same time.