Skip to content

Conversation

@cuviper
Copy link
Member

@cuviper cuviper commented Oct 23, 2025

On HashTable<T, A>:

pub fn num_buckets(&self) -> usize;
pub fn find_bucket_index(&self, hash: u64, eq: impl FnMut(&T) -> bool) -> Option<usize>;
pub fn get_bucket(&self, index: usize) -> Option<&T>;
pub fn get_bucket_mut(&mut self, index: usize) -> Option<&mut T>;
pub fn get_bucket_entry(&mut self, index: usize) -> Option<OccupiedEntry<'_, T, A>>;
pub unsafe fn get_bucket_unchecked(&self, index: usize) -> &T;
pub unsafe fn get_bucket_unchecked_mut(&mut self, index: usize) -> &mut T;

On OccupiedEntry<'_, T, A>:

pub fn bucket_index(&self) -> usize;

Closes #613, although I chose different names.

@cuviper cuviper requested a review from Amanieu October 23, 2025 01:29
@cuviper
Copy link
Member Author

cuviper commented Oct 23, 2025

although I chose different names.

I'm still willing to change if other names are preferred.

Copy link
Contributor

@clarfonthey clarfonthey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that get_bucket and get_bucket_mut are good, but buckets should be renamed to num_buckets to make it clear it's returning a count and not an iterator.

In terms of find_bucket_index and get_bucket_entry, I left some comments. But if you wanted to merge the smaller set first, I'm fine with that.

Copy link
Contributor

@clarfonthey clarfonthey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not required at all, but since a lot of code using the HashTable API is going to be unsafe anyway, it might be worth adding get_bucket_unchecked and get_bucket_unchecked_mut as well, which don't return Option.

@moulins
Copy link
Contributor

moulins commented Oct 23, 2025

Does get_bucket_entry really need to take a hash? The Entry structs currently store the full hash, but I suspect the code could be refactored to only store a raw::control::Tag, which can then be fetched from the hashtable storage when constructing an OccupiedEntry.

@cuviper
Copy link
Member Author

cuviper commented Oct 23, 2025

Also, not required at all, but since a lot of code using the HashTable API is going to be unsafe anyway, it might be worth adding get_bucket_unchecked and get_bucket_unchecked_mut as well, which don't return Option.

I personally think HashTable should emphasize safe usage, but I guess that line has already been breached with get_disjoint_unchecked_mut. These would be easy to add if desired.

Does get_bucket_entry really need to take a hash? The Entry structs currently store the full hash, but I suspect the code could be refactored to only store a raw::control::Tag, which can then be fetched from the hashtable storage when constructing an OccupiedEntry.

Ooh, I like that! We couldn't support the full Option<Entry> suggested above, but this avoids all my concerns about breaking the hash probe sequence if there's no new hash involved. I think we also only need this Tag in VacantEntry, because OccupiedEntry::remove can just get it from the control byte.

@cuviper
Copy link
Member Author

cuviper commented Oct 23, 2025

Calling out this additional benefit of the Tag change, as I wrote in the commit:

Also, since OccupiedEntry is now smaller, enum Entry will be the
same size as VacantEntry by using a niche for the discriminant.
(Although this is not guaranteed by the compiler.)

@clarfonthey
Copy link
Contributor

re: unsafe methods, I guess the main concern is whether the bounds checks can easily be optimised out. While there isn't the same robust codegen testing as the compiler here, maybe it would be worth poking around on godbolt and seeing how find_bucket_index followed by one of the relevant methods is handled.

@cuviper
Copy link
Member Author

cuviper commented Oct 23, 2025

I've added those two unchecked methods.

seeing how find_bucket_index followed by one of the relevant methods is handled.

Using cargo asm I can confirm that the optimizer does not see that the checks are redundant after find_bucket_index -- comparing get_bucket_unchecked(index) to get_bucket(index).unwrap() adds:

+       inc rdi
+       cmp r10, rdi
+       jae .LBB6_11
+       cmp byte ptr [rcx + r10], 0
+       js .LBB6_11
        add rax, -8
        pop rcx
        .cfi_def_cfa_offset 8
        ret

... where .LBB6_11 calls unwrap_failed.

@clarfonthey
Copy link
Contributor

clarfonthey commented Oct 23, 2025

I guess that another benefit to bumping MSRV would be the ability to add assert_unchecked here to potentially help, but I'm not 100% sure it actually would in this case. It's very finicky in cases like this.

Even without an MSRV bump, I guess you could use the nightly feature as a temporary measure.

@cuviper
Copy link
Member Author

cuviper commented Oct 24, 2025

I assume you mean adding that in find_bucket_index?

    pub fn find_bucket_index(&self, hash: u64, eq: impl FnMut(&T) -> bool) -> Option<usize> {
        match self.raw.find(hash, eq) {
            Some(bucket) => Some(unsafe {
                let index = self.raw.bucket_index(&bucket);
                core::hint::assert_unchecked(index < self.raw.buckets());
                core::hint::assert_unchecked(self.raw.is_bucket_full(index));
                index
            }),
            None => None,
        }
    }

It does remove the unwrap in my simple back-to-back test, but it also pessimizes the unchecked case a little, recomputing the final address. Different variations of that assertion had the same effect.

+        neg r9
+        lea rax, [rax + 8*r9]
         add rax, -8
         pop rcx
         .cfi_def_cfa_offset 8
         ret

In a real scenario I expect you'd also have other interim code, and who knows if the assertion could still be propagated. I don't think we should bother until there's a real measured benefit, and anyway even the unhinted checked case is still much less than a full hash probe that you would have needed before!

@cuviper
Copy link
Member Author

cuviper commented Oct 24, 2025

RE MSRV, I recently bumped indexmap to 1.82, which is just over a year old.

`VacantEntry` now stores a `Tag` instead of a full `hash: u64`. This
means that `OccupiedEntry` doesn't need to store anything, because it
can get that tag from the control byte for `remove -> VacantEntry`.
The `get_bucket_entry` method doesn't need a hash argument either.

Also, since `OccupiedEntry` is now smaller, `enum Entry` will be the
same size as `VacantEntry` by using a niche for the discriminant.
(Although this is not _guaranteed_ by the compiler.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bucket-based API for HashTable

3 participants