fix(boxcar): current bucket pre-allocation code is over-allocating #73

Open · wants to merge 1 commit into master
Conversation

alexpasmantier (Contributor) commented Feb 3, 2025

Fixes #75

From my understanding, this code's intent is to eagerly allocate the next bucket if we're about to write the first entry of the last 1/8th of the current bucket.

                       7/8             1/8  
         <---------------------------><---->
...--]  [========== current bucket =========]  [-------------------------- next bucket...
                                      ↑
                          first entry of the last 1/8th
                          of the current bucket

The code currently checks if:

if index == (location.bucket_len - (location.bucket_len >> 3)) {...}
//                                  -------------------------
//                                 i.e. 1/8 * location.bucket_len

Here, index is (if I'm not mistaken) a global vector index; the check should really use location.entry instead, which corresponds to the relative position inside the current bucket:

if location.entry == (location.bucket_len - (location.bucket_len >> 3)) {...}

NOTE: in practice, this means the current code over-allocates, reserving memory ahead of time for buckets that may never be needed.
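To make the difference between the two checks concrete, here is a simplified model of boxcar's bucket layout. Everything here (the Location struct, the locate helper, SKIP = 5) is an illustrative reconstruction based on the numbers in this thread, not the crate's actual internals:

```rust
// A simplified model of boxcar's bucket layout, for illustration only
// (Location, locate, and SKIP = 5 are reconstructions based on the
// numbers in this thread, not the crate's actual internals).
const SKIP: u32 = 5; // bucket 0 holds 2^SKIP = 32 entries

struct Location {
    entry: usize,      // offset inside the bucket
    bucket_len: usize, // length of the bucket
}

/// Map a global index to its bucket-relative location. Bucket N has
/// length 2^(N + SKIP) and starts at global index 2^(N + SKIP) - 2^SKIP.
fn locate(index: usize) -> Location {
    let shifted = index + (1 << SKIP);
    let bucket = usize::BITS - shifted.leading_zeros() - 1 - SKIP;
    let bucket_start = (1usize << (bucket + SKIP)) - (1 << SKIP);
    Location {
        entry: index - bucket_start,
        bucket_len: 1usize << (bucket + SKIP),
    }
}

/// Original check: compares the *global* index to a per-bucket threshold.
fn should_prealloc_buggy(index: usize) -> bool {
    let location = locate(index);
    index == location.bucket_len - (location.bucket_len >> 3)
}

/// Fixed check: compares the *in-bucket* offset to the same threshold.
fn should_prealloc_fixed(index: usize) -> bool {
    let location = locate(index);
    location.entry == location.bucket_len - (location.bucket_len >> 3)
}
```

In this model the fixed check fires exactly once per bucket, at the 7/8 mark; the buggy check fires only when the global index happens to coincide with that per-bucket threshold.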

@alexpasmantier alexpasmantier changed the title fix(boxcar): pre-allocation of buckets now checks position inside the current bucket fix(boxcar): current bucket pre-allocation code is over-allocating Feb 3, 2025
alexpasmantier added a commit to alexpasmantier/nucleo that referenced this pull request Feb 7, 2025
pascalkuthe (Member)

I am inclined to agree this is a bug and that this is the correct fix.

The same code is in the original boxcar library I vendored this from. You seem to have done some benchmarking; I was wondering if this change had any performance impact (good or bad). This allocation is supposed to reduce contention when allocating the next shard, so it's meant as an optimization.

alexpasmantier (Contributor, Author) commented Feb 13, 2025

> I was wondering if this change had any performance impact (good or bad). This allocation is supposed to reduce contention when allocating the next shard so it's supposed to be an optimization

Looking deeper into it, I believe this bug actually causes:

  • allocating the next bucket early for the first four buckets
  • never pre-allocating buckets beyond that point because the absolute index just gets too big compared to 7/8*bucket_len.

For the second point above, it's easy to convince yourself that this is indeed the case by looking at an example:

  • say you're pushing index = 480 (which corresponds to the first position of bucket 4)
  • for that specific bucket:
(location.bucket_len - (location.bucket_len >> 3)) = 7/8 * bucket_4_len = 7/8 * 512 = 448

which means no position inside bucket 4 (ranging from absolute index 480 to 991) can actually satisfy the condition.

In a more general way:

  • bucket N has len 2^(N + SKIP) and so the alloc position for bucket N is 7/8 * 2^(N + SKIP)
  • bucket N starts at absolute index 2^(N + SKIP) - 2^SKIP = (2^SKIP)*(2^N - 1)

and for N >= 4, the following holds:


7/8 * 2^N < 2^N - 1

which is equivalent to

7/8 * 2^(N + SKIP) < (2^SKIP)*(2^N - 1)

-----------------     ------------------
 alloc position          bucket start

And that phenomenon just gets worse as you increase the bucket number.

What this means in the end is that the pre-allocation condition is never met from bucket 4 onward.
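The argument above is easy to sanity-check numerically. The helper names here are illustrative, with SKIP = 5 so that bucket 4 starts at index 480, matching the example earlier in this thread:

```rust
// Numeric check of the argument above (helper names are illustrative).
// With SKIP = 5, bucket N has length 2^(N + SKIP) and starts at
// absolute index 2^SKIP * (2^N - 1).
const SKIP: u32 = 5;

/// Absolute index the buggy check fires on for bucket n: 7/8 * bucket_len.
fn alloc_position(n: u32) -> usize {
    let len = 1usize << (n + SKIP);
    len - (len >> 3)
}

/// Absolute index at which bucket n starts: 2^SKIP * (2^n - 1).
fn bucket_start(n: u32) -> usize {
    ((1usize << n) - 1) << SKIP
}
```

For buckets 0 through 3 the threshold (28, 56, 112, 224) lands inside the bucket, at in-bucket offsets 28, 24, 16 and 0 respectively, i.e. at the right spot for bucket 0 and then progressively earlier; from bucket 4 on (448 < 480) it falls before the bucket even starts, so the condition can never be satisfied.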

pascalkuthe (Member)

Yeah, I agree. I was just curious if you had done any benchmarking to see what the performance impact is.

pascalkuthe pushed a commit that referenced this pull request Feb 14, 2025
* feat(injector): add an `extend` method to Nucleo's injector

* Update lib.rs

Co-authored-by: Michael Davis <[email protected]>

* remove benches and #73 patch

* updates following pascalkuthe's review

* simplification

* adding tests

* remove unused method

---------

Co-authored-by: Michael Davis <[email protected]>
alexpasmantier (Contributor, Author)

So, I did a bit of benchmarking.

Surprisingly, the location.entry version performs worse than the index version (I saw performance decreases across the board, both sequential and threaded).

I'm assuming this is because the location.entry version does more work pre-allocating buckets (as explained above).

This led me to question whether that pre-allocation is doing anything useful at all, so I benchmarked a version with no pre-allocation whatsoever to see how it performed.

Removing the pre-allocation code altogether actually improved performance, as shown below:

     Running benches/main.rs (target/release/deps/main-0bbb4234313fcfae)
Gnuplot not found, using plotters backend
grow_boxcar/push/100    time:   [4.2707 µs 4.2760 µs 4.2861 µs]
                        change: [-11.076% -9.8177% -8.6602%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
grow_boxcar/push/1000   time:   [7.0592 µs 7.0629 µs 7.0691 µs]
                        change: [-15.297% -14.940% -14.601%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  5 (5.00%) high severe
grow_boxcar/push/50000  time:   [220.13 µs 220.25 µs 220.42 µs]
                        change: [-3.2073% -2.5681% -1.9701%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  7 (7.00%) high severe
grow_boxcar/push/500000 time:   [2.9720 ms 2.9822 ms 2.9951 ms]
                        change: [-1.9094% -1.3285% -0.7802%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

grow_boxcar_push_threaded/push/100
                        time:   [21.779 µs 21.834 µs 21.896 µs]
                        change: [-10.092% -9.1059% -8.1001%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high severe
grow_boxcar_push_threaded/push/1000
                        time:   [56.159 µs 56.277 µs 56.394 µs]
                        change: [-13.812% -13.476% -13.128%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
grow_boxcar_push_threaded/push/50000
                        time:   [1.9849 ms 1.9992 ms 2.0138 ms]
                        change: [-9.1503% -8.2907% -7.3842%] (p = 0.00 < 0.05)
                        Performance has improved.
grow_boxcar_push_threaded/push/500000
                        time:   [23.296 ms 23.608 ms 23.909 ms]
                        change: [+0.5817% +2.3489% +4.2617%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild

Here is the benchmark code (uses criterion):

use std::{sync::Arc, thread::available_parallelism};

use criterion::{BenchmarkId, Criterion};
use nucleo::boxcar;
use rayon::prelude::*;

const TINY_LINE_COUNT: u32 = 100;
const SMALL_LINE_COUNT: u32 = 1_000;
const MEDIUM_LINE_COUNT: u32 = 50_000;
const LARGE_LINE_COUNT: u32 = 500_000;
const XLARGE_LINE_COUNT: u32 = 5_000_000;
const XXLARGE_LINE_COUNT: u32 = 20_000_000;

fn grow_boxcar(c: &mut Criterion) {
    let mut group = c.benchmark_group("grow_boxcar");
    for line_count in [
        TINY_LINE_COUNT,
        SMALL_LINE_COUNT,
        MEDIUM_LINE_COUNT,
        LARGE_LINE_COUNT,
        //XLARGE_LINE_COUNT,
        //XXLARGE_LINE_COUNT,
    ] {
        // generate random strings
        let lines = random_lines(line_count);

        group.bench_with_input(BenchmarkId::new("push", line_count), &lines, |b, lines| {
            b.iter(move || {
                let v = Arc::new(boxcar::Vec::with_capacity(2 * 1024, 1));
                for line in lines {
                    v.push(line, |_, _cols| {});
                }
            });
        });
    }
}

fn grow_boxcar_threaded(c: &mut Criterion) {
    let mut group = c.benchmark_group("grow_boxcar_push_threaded");
    for line_count in [
        TINY_LINE_COUNT,
        SMALL_LINE_COUNT,
        MEDIUM_LINE_COUNT,
        LARGE_LINE_COUNT,
        //XLARGE_LINE_COUNT,
        //XXLARGE_LINE_COUNT,
    ] {
        // generate random strings
        let lines = random_lines(line_count);
        let available_parallelism = available_parallelism().unwrap();
        let batch_size = lines.len() / usize::from(available_parallelism);

        group.bench_with_input(BenchmarkId::new("push", line_count), &lines, |b, lines| {
            b.iter(|| {
                let v = Arc::new(boxcar::Vec::with_capacity(2 * 1024, 1));
                lines
                    .chunks(batch_size)
                    .par_bridge()
                    .for_each(|batch| {
                        for line in batch {
                            v.push(line, |_, _cols| {});
                        }
                    });
            });
        });
    }
}

fn random_lines(count: u32) -> Vec<String> {
    let count = i64::from(count);
    let word_count = 1;
    (0..count)
        .map(|_| fakeit::words::sentence(word_count))
        .collect()
}

criterion::criterion_group!(benches, grow_boxcar, grow_boxcar_threaded);
criterion::criterion_main!(benches);

🤔

Successfully merging this pull request may close these issues.

Current bucket pre-allocation implementation is over-allocating