Skip to content

LMDB Cache: partially reuse GCed IDs so to save cycles on cleaning LMDB #17469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

glyh
Copy link
Member

@glyh glyh commented Jul 4, 2025

As title. This is a follow up of using hashset to track GCed IDs.

NOTE: there's still IDs that can't be reused, if there's too many garbages( >= garbage_size_limit) appearing at the same time.

@glyh glyh requested a review from a team as a code owner July 4, 2025 09:43
@glyh
Copy link
Member Author

glyh commented Jul 4, 2025

BTW, somehow I have designed similar structure in an earlier version of Work Partitioner's ID generator. I don't know if there's value factor out some common patterns.

Probably for now it's not needed. If we run into these multiple time we'll consider that.

| Some reused_id ->
[%log debug] "Reusing LMDB key %d for a new KV pair" reused_id
~metadata:[ ("index", `Int reused_id) ] ;
Hash_set.remove garbage reused_id ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Hash_set.remove garbage reused_id ;

in some code below we'll overwrite the value anyway:

    Rw.set ~env db idx x ;

Copy link
Member Author

@glyh glyh Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think being inside the hashset and being cleared from LMDB cache has different semantic. If we don't remove them from the garbage hashset here. It will be reassigned by some other invocations to put, or being erased when there's too many garbages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you're right @glyh. This is Hash_set.remove here and not Rw.remove - the reused ID needs to be removed from the hash set tracking if we're going to reuse it.

@glyh glyh added the enhancement Not big enough to be a feature, but is a smaller improvement label Jul 7, 2025
@glyh glyh force-pushed the lyh/fix-lmdb-deadlock branch from 00f7243 to 97a1a48 Compare July 7, 2025 12:26
@glyh glyh force-pushed the lyh/reuse-lmdb-keys branch from 97a341e to eaa1814 Compare July 7, 2025 13:06
@dannywillems dannywillems force-pushed the lyh/fix-lmdb-deadlock branch from c841ecc to 8a787ef Compare July 7, 2025 14:16
Base automatically changed from lyh/fix-lmdb-deadlock to compatible July 7, 2025 17:01
@glyh
Copy link
Member Author

glyh commented Jul 7, 2025

I'll force push later to adapt to earlier PR forced pushed in the train

@glyh glyh force-pushed the lyh/reuse-lmdb-keys branch from ed6a7c7 to 5af9c08 Compare July 8, 2025 03:04
@glyh
Copy link
Member Author

glyh commented Jul 8, 2025

rebased

@glyh glyh closed this Jul 9, 2025
@glyh glyh reopened this Jul 9, 2025
@glyh glyh force-pushed the lyh/reuse-lmdb-keys branch from 5af9c08 to 4abc6bc Compare July 9, 2025 02:22
| None ->
incr counter ; !counter - 1
| Some reused_id ->
[%log spam] "Reusing LMDB key %d for a new KV pair" reused_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this log please?

@dannywillems
Copy link
Member

!ci-build-me

@dannywillems
Copy link
Member

!ci-bypass-changelog

@glyh
Copy link
Member Author

glyh commented Jul 9, 2025

there's likely a merge conflict. I'll rebase later

@glyh glyh force-pushed the lyh/reuse-lmdb-keys branch from b4fb6eb to 3e5b7b4 Compare July 9, 2025 07:56
@glyh
Copy link
Member Author

glyh commented Jul 9, 2025

rebased so there's no merge conflicts

@glyh
Copy link
Member Author

glyh commented Jul 9, 2025

!ci-build-me

@glyh glyh force-pushed the lyh/reuse-lmdb-keys branch from 3e5b7b4 to 5174b79 Compare July 9, 2025 08:00
@glyh
Copy link
Member Author

glyh commented Jul 9, 2025

!ci-build-me

let idx = !counter in
incr counter ;
let idx =
match Hash_set.find garbage ~f:(const true) with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you are trying to achieve by using ~f:(const true).
This will always return either a random element (or the first, I don't remember if find on Hashtbl is deterministic) if the set is empty (which is the case until the GC has been executed at least once on a value) or it will increase the counter

Copy link
Member

@dannywillems dannywillems Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ensure that we reached the garbage size limit, otherwise, it will be potentially re-using some index whose values have not been removed yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selecting a random element from hashset and use it as new key is exactly the expected behavior. Any key in that set is abandoned

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selecting a random element from hashset and use it as new key is exactly the expected behavior. Any key in that set is abandoned

That also seems correct to me. If an int is in the set then the associated id was judged unreachable and finalized by the GC, so there shouldn't be any live objects still using the id and we're safe to reuse the int.

I think the rationale for the PR is that whereas before we would go through a cycle where we accumulate unused indices until we hit garbage_size_limit and then do a full cleanup of the lmdb cache, now we avoid some of those cleanups by reusing cache slots.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Not big enough to be a feature, but is a smaller improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants