Skip to content

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Oct 7, 2025

Towards #625, specifically #625 (comment) items 1 and 2.

🦖 Optimization

When a standard functor, either transparent or integer-specialized, is passed to transform, along with all vector<bool> iterators, map that functor to a bitwise one to operate on the underlying type.

The mapping is done via template specialization, and not via if constexpr to make the dispatch working fine without <functional> included and functors defined.

Only do this for zero offset. Supporting all possible offset combination is much complexity for a little gain. Remember copy.

Extract pointers from iterators to help the compiler auto-vectorize. Yes, it does not auto-vectorize when using the whole iterators. Auto-vecotrization needs simplest ways of implementing loops.

Don't call transform again, to avoid unnecessary recursion, the operation is simple.

Don't process tails explicitly, yield to the existing loop for now.
Actually lets go for it, it is not that hard. Process tails with applying bit mask.

Don't do ranges yet. Other vector<bool> optimizations don't do them either. It is getting complicated, so instead of doing ranges separately, need to look into #1754 at last.

🏁 Benchmark

Feed the randomizer with some seed to make the inputs different 🐦

Since (auto-)vectorization is (expected to be) engaged, use alignment controlling allocator.

⏱️ Benchmark results

Benchmark Before After Speedup
transform_two_inputs_aligned<logical_and<>>/64 108 ns 2.55 ns 42.4
transform_two_inputs_aligned<logical_and<>>/4096 13869 ns 9.44 ns 1470
transform_two_inputs_aligned<logical_and<>>/65536 416424 ns 115 ns 3620
transform_two_inputs_aligned<logical_or<>>/64 123 ns 2.59 ns 47.40
transform_two_inputs_aligned<logical_or<>>/4096 14377 ns 9.07 ns 1590
transform_two_inputs_aligned<logical_or<>>/65536 409012 ns 112 ns 3650
transform_one_input_aligned<logical_not<>>/64 83.7 ns 2.14 ns 39.10
transform_one_input_aligned<logical_not<>>/4096 6891 ns 7.28 ns 947
transform_one_input_aligned<logical_not<>>/65536 264957 ns 82.7 ns 3200


template <class _Ty>
struct _Map_vb_functor<equal_to<_Ty>> {
using _Type = conditional_t<_Is_vbool_functor_arg<_Ty>, _Bit_xnor, void>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we can map to _Map_vb_functor itself and have operator() right here to save one struct.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Oct 8, 2025
@StephanTLavavej StephanTLavavej self-assigned this Oct 8, 2025
}

template <class _VbIt, class _OutIt, class _Mapped_fn>
_CONSTEXPR20 _OutIt _Transform_vbool_aligned(
Copy link
Contributor Author

@AlexGuteniev AlexGuteniev Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this out to <vector> from <algorithms> because other algorithms are moved out.
However, I'm not sure if it is useful.

For accessing vector<bool> representation it is not strictly necessary. Most of things are template-dependent member functions and datas. The only exception is _Vbase, which can be still deduced from iterators.

For throughput it does not look useful either. <vector> is more frequent than <algorithm> so it appears more useful to off-load <vector> instead.

For reference, 0f24d45 is the commit where this movement was made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

Status: Initial Review

Development

Successfully merging this pull request may close these issues.

2 participants