fix(boxcar): current bucket pre-allocation code is over-allocating #73

Open · wants to merge 1 commit into master
Conversation

alexpasmantier (Contributor) commented Feb 3, 2025

Fixes #75

From my understanding, this code's intent is to eagerly allocate the next bucket if we're about to write the first entry of the last 1/8th of the current bucket.

                       7/8             1/8  
         <---------------------------><---->
...--]  [========== current bucket =========]  [-------------------------- next bucket...
                                      ↑
                          first entry of the last 1/8th
                          of the current bucket

The code currently checks if:

if index == (location.bucket_len - (location.bucket_len >> 3)) {...}
//                                  -------------------------
//                                 i.e. 1/8 * location.bucket_len

Here, index is (if I'm not mistaken) a global vector index; the check should really use location.entry instead, which corresponds to the relative position inside the current bucket:

if location.entry == (location.bucket_len - (location.bucket_len >> 3)) {...}

NOTE: in practice, this means the current code over-allocates, reserving memory ahead of time for buckets that may never be needed.
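To make the difference between the two checks concrete, here is a simplified model of boxcar's bucket layout. Everything here (the Location struct, the locate helper, SKIP = 5) is an illustrative reconstruction based on the numbers in this thread, not the crate's actual internals:

```rust
// A simplified model of boxcar's bucket layout, for illustration only
// (Location, locate, and SKIP = 5 are reconstructions based on the
// numbers in this thread, not the crate's actual internals).
const SKIP: u32 = 5; // bucket 0 holds 2^SKIP = 32 entries

struct Location {
    entry: usize,      // offset inside the bucket
    bucket_len: usize, // length of the bucket
}

/// Map a global index to its bucket-relative location. Bucket N has
/// length 2^(N + SKIP) and starts at global index 2^(N + SKIP) - 2^SKIP.
fn locate(index: usize) -> Location {
    let shifted = index + (1 << SKIP);
    let bucket = usize::BITS - shifted.leading_zeros() - 1 - SKIP;
    let bucket_start = (1usize << (bucket + SKIP)) - (1 << SKIP);
    Location {
        entry: index - bucket_start,
        bucket_len: 1usize << (bucket + SKIP),
    }
}

/// Original check: compares the *global* index to a per-bucket threshold.
fn should_prealloc_buggy(index: usize) -> bool {
    let location = locate(index);
    index == location.bucket_len - (location.bucket_len >> 3)
}

/// Fixed check: compares the *in-bucket* offset to the same threshold.
fn should_prealloc_fixed(index: usize) -> bool {
    let location = locate(index);
    location.entry == location.bucket_len - (location.bucket_len >> 3)
}
```

In this model the fixed check fires exactly once per bucket, at the 7/8 mark; the buggy check fires only when the global index happens to coincide with that per-bucket threshold.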

@alexpasmantier alexpasmantier changed the title fix(boxcar): pre-allocation of buckets now checks position inside the current bucket fix(boxcar): current bucket pre-allocation code is over-allocating Feb 3, 2025
alexpasmantier added a commit to alexpasmantier/nucleo that referenced this pull request Feb 7, 2025
pascalkuthe (Member)

I am inclined to agree this is a bug and that this is the correct fix.

The same code is in the original boxcar library I vendored this from. You seem to have done some benchmarking; I was wondering if this change had any performance impact (good or bad). This allocation is supposed to reduce contention when allocating the next shard, so it's meant as an optimization.

alexpasmantier (Contributor, Author) commented Feb 13, 2025

> I was wondering if this change had any performance impact (good or bad). This allocation is supposed to reduce contention when allocating the next shard so it's supposed to be an optimization

Looking deeper into it, I believe this bug actually causes:

  • allocating the next bucket early for the first four buckets
  • never pre-allocating buckets beyond that point because the absolute index just gets too big compared to 7/8*bucket_len.

For the second point above, it's easy to convince yourself that this is indeed the case by looking at an example:

  • say you're pushing index = 480 (which corresponds to the first position of bucket 4)
  • for that specific bucket:
(location.bucket_len - (location.bucket_len >> 3)) = 7/8 * bucket_4_len = 7/8 * 512 = 448

which means no position inside bucket 4 (ranging from absolute index 480 to 991) can actually satisfy the condition.

In a more general way:

  • bucket N has len 2^(N + SKIP) and so the alloc position for bucket N is 7/8 * 2^(N + SKIP)
  • bucket N starts at absolute index 2^(N + SKIP) - 2^SKIP = (2^SKIP)*(2^N - 1)

and for N >= 4, the following holds:


7/8 * 2^N < 2^N - 1

which is equivalent to

7/8 * 2^(N + SKIP) < (2^SKIP)*(2^N - 1)

-----------------     ------------------
 alloc position          bucket start

And that phenomenon just gets worse as you increase the bucket number.

What this means in the end is that the pre-allocation condition is never met from bucket 4 onward.
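The argument above is easy to sanity-check numerically. The helper names here are illustrative, with SKIP = 5 so that bucket 4 starts at index 480, matching the example earlier in this thread:

```rust
// Numeric check of the argument above (helper names are illustrative).
// With SKIP = 5, bucket N has length 2^(N + SKIP) and starts at
// absolute index 2^SKIP * (2^N - 1).
const SKIP: u32 = 5;

/// Absolute index the buggy check fires on for bucket n: 7/8 * bucket_len.
fn alloc_position(n: u32) -> usize {
    let len = 1usize << (n + SKIP);
    len - (len >> 3)
}

/// Absolute index at which bucket n starts: 2^SKIP * (2^n - 1).
fn bucket_start(n: u32) -> usize {
    ((1usize << n) - 1) << SKIP
}
```

For buckets 0 through 3 the threshold (28, 56, 112, 224) lands inside the bucket, at in-bucket offsets 28, 24, 16 and 0 respectively, i.e. at the right spot for bucket 0 and then progressively earlier; from bucket 4 on (448 < 480) it falls before the bucket even starts, so the condition can never be satisfied.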

pascalkuthe (Member)

Yeah, I agree. I was just curious if you had done any benchmarking to see what the performance impact is.

pascalkuthe pushed a commit that referenced this pull request Feb 14, 2025
* feat(injector): add an `extend` method to Nucleo's injector

* Update lib.rs

Co-authored-by: Michael Davis <[email protected]>

* remove benches and #73 patch

* updates following pascalkuthe's review

* simplification

* adding tests

* remove unused method

---------

Co-authored-by: Michael Davis <[email protected]>
alexpasmantier (Contributor, Author)

So, I did a bit of benchmarking.

Surprisingly, the location.entry version performs worse than the index version (I saw performance decreases across the board, both sequential and threaded).

I'm assuming this is because the location.entry version does more work pre-allocating buckets (as explained above).

This led me to question whether that pre-allocation is doing anything useful at all, so I benchmarked a version with no pre-allocation whatsoever to see how it performed.

Removing the pre-allocation code altogether actually improved performance, as shown below:

     Running benches/main.rs (target/release/deps/main-0bbb4234313fcfae)
Gnuplot not found, using plotters backend
grow_boxcar/push/100    time:   [4.2707 µs 4.2760 µs 4.2861 µs]
                        change: [-11.076% -9.8177% -8.6602%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
grow_boxcar/push/1000   time:   [7.0592 µs 7.0629 µs 7.0691 µs]
                        change: [-15.297% -14.940% -14.601%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  5 (5.00%) high severe
grow_boxcar/push/50000  time:   [220.13 µs 220.25 µs 220.42 µs]
                        change: [-3.2073% -2.5681% -1.9701%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  7 (7.00%) high severe
grow_boxcar/push/500000 time:   [2.9720 ms 2.9822 ms 2.9951 ms]
                        change: [-1.9094% -1.3285% -0.7802%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

grow_boxcar_push_threaded/push/100
                        time:   [21.779 µs 21.834 µs 21.896 µs]
                        change: [-10.092% -9.1059% -8.1001%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high severe
grow_boxcar_push_threaded/push/1000
                        time:   [56.159 µs 56.277 µs 56.394 µs]
                        change: [-13.812% -13.476% -13.128%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
grow_boxcar_push_threaded/push/50000
                        time:   [1.9849 ms 1.9992 ms 2.0138 ms]
                        change: [-9.1503% -8.2907% -7.3842%] (p = 0.00 < 0.05)
                        Performance has improved.
grow_boxcar_push_threaded/push/500000
                        time:   [23.296 ms 23.608 ms 23.909 ms]
                        change: [+0.5817% +2.3489% +4.2617%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild

Here is the benchmark code (uses criterion):

use std::{sync::Arc, thread::available_parallelism};

use criterion::{BenchmarkId, Criterion};
use nucleo::boxcar;
use rayon::prelude::*;

const TINY_LINE_COUNT: u32 = 100;
const SMALL_LINE_COUNT: u32 = 1_000;
const MEDIUM_LINE_COUNT: u32 = 50_000;
const LARGE_LINE_COUNT: u32 = 500_000;
const XLARGE_LINE_COUNT: u32 = 5_000_000;
const XXLARGE_LINE_COUNT: u32 = 20_000_000;

fn grow_boxcar(c: &mut Criterion) {
    let mut group = c.benchmark_group("grow_boxcar");
    for line_count in [
        TINY_LINE_COUNT,
        SMALL_LINE_COUNT,
        MEDIUM_LINE_COUNT,
        LARGE_LINE_COUNT,
        //XLARGE_LINE_COUNT,
        //XXLARGE_LINE_COUNT,
    ] {
        // generate random strings
        let lines = random_lines(line_count);

        group.bench_with_input(BenchmarkId::new("push", line_count), &lines, |b, lines| {
            b.iter(move || {
                let v = Arc::new(boxcar::Vec::with_capacity(2 * 1024, 1));
                for line in lines {
                    v.push(line, |_, _cols| {});
                }
            });
        });
    }
}

fn grow_boxcar_threaded(c: &mut Criterion) {
    let mut group = c.benchmark_group("grow_boxcar_push_threaded");
    for line_count in [
        TINY_LINE_COUNT,
        SMALL_LINE_COUNT,
        MEDIUM_LINE_COUNT,
        LARGE_LINE_COUNT,
        //XLARGE_LINE_COUNT,
        //XXLARGE_LINE_COUNT,
    ] {
        // generate random strings
        let lines = random_lines(line_count);
        let available_parallelism = available_parallelism().unwrap();
        let batch_size = lines.len() / usize::from(available_parallelism);

        group.bench_with_input(BenchmarkId::new("push", line_count), &lines, |b, lines| {
            b.iter(|| {
                let v = Arc::new(boxcar::Vec::with_capacity(2 * 1024, 1));
                lines
                    .chunks(batch_size)
                    .par_bridge()
                    .for_each(|batch| {
                        for line in batch {
                            v.push(line, |_, _cols| {});
                        }
                    });
            });
        });
    }
}

fn random_lines(count: u32) -> Vec<String> {
    let count = i64::from(count);
    let word_count = 1;
    (0..count)
        .map(|_| fakeit::words::sentence(word_count))
        .collect()
}

criterion::criterion_group!(benches, grow_boxcar, grow_boxcar_threaded);
criterion::criterion_main!(benches);

🤔

Successfully merging this pull request may close these issues.

Current bucket pre-allocation implementation is over-allocating