Skip to content

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 29, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Mar 27, 2025

…ng on a single Utf8View )

Which issue does this PR close?

Rationale for this change

First we look at why Q3 case slow:

Here is source code:

 pub unsafe fn compare_unchecked(
        left: &GenericByteViewArray<T>,
        left_idx: usize,
        right: &GenericByteViewArray<T>,
        right_idx: usize,
    ) -> std::cmp::Ordering {
        let l_view = left.views().get_unchecked(left_idx);
        let l_len = *l_view as u32;

        let r_view = right.views().get_unchecked(right_idx);
        let r_len = *r_view as u32;

        if l_len <= 12 && r_len <= 12 {
            let l_data = unsafe { GenericByteViewArray::<T>::inline_value(l_view, l_len as usize) };
            let r_data = unsafe { GenericByteViewArray::<T>::inline_value(r_view, r_len as usize) };
            return l_data.cmp(r_data);
        }

        // one of the string is larger than 12 bytes,
        // we then try to compare the inlined data first
        let l_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(l_view, 4) };
        let r_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(r_view, 4) };
        if r_inlined_data != l_inlined_data {
            return l_inlined_data.cmp(r_inlined_data);
        }

        // unfortunately, we need to compare the full data
        let l_full_data: &[u8] = unsafe { left.value_unchecked(left_idx).as_ref() };
        let r_full_data: &[u8] = unsafe { right.value_unchecked(right_idx).as_ref() };

        l_full_data.cmp(r_full_data)
    }

We will mostly go to the last compare case: unfortunately, we need to compare the full data
due to the l_comments dataset will have many same prefix when compare. But compare with Utf8 full data compare, the Utf8View full data compare is slow, mostly i think is the stringView full data compare can't be optimized well by compiler due to many possible reasons:

  1. The stringview data buffers are going larger when we construct the cursor values to compare.
  2. The compare data are not data locality for stringview, but it is data locality for stringarray because they are copied to the only one buffer.
  3. The fulldata of stringarray enable better inlining and vectorized operations.

Solution:

Introduce GC() for stringview type when create cursor values.

Summary
By applying GC() to the stringview type when creating cursor values, we ensure that:

Only essential data is kept: Reducing the buffer size minimizes memory overhead.

Data becomes more cache-friendly: Smaller buffers allow for better CPU cache utilization, leading to faster string comparisons.

Compiler optimizations are more effective: Improved data locality and reduced buffer size enable better inlining and vectorized operations, contributing to overall performance gains.

This change is well-founded, as it addresses both memory efficiency and execution speed, especially in scenarios with intensive string comparison operations.

   /// # Garbage Collection
    ///
    /// Before GC:
    /// ```text
    ///                                        ┌──────┐
    ///                                        │......│
    ///                                        │......│
    /// ┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer
    /// │       View 1       │─ ─ ─ ─          │......│  with data that
    /// ├────────────────────┤                 │......│ is not referred
    /// │       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
    /// └────────────────────┘                 │......│      View 2
    ///                                        │......│
    ///    2 views, refer to                   │......│
    ///   small portions of a                  └──────┘
    ///      large buffer
    /// ```
    ///
    /// After GC:
    ///
    /// ```text
    /// ┌────────────────────┐                 ┌─────┐    After gc, only
    /// │       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is
    /// ├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by
    /// │       View 2       │─ ─ ─ ─          └─────┘     the views is
    /// └────────────────────┘                                 left
    ///
    ///
    ///         2 views
    /// ```

And of course gc() has some overhead, we need to copy the data, but we get huge performance than loss from the actual testing result.

What changes are included in this PR?

Introduce GC() for stringview type when create cursor values.

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single Utf8View )

Are these changes tested?

Yes, the Q3 will have huge improvement for about 40% performance.

python3 ./compare.py ./results/main/sort_tpch.json ./results/issue_15403/sort_tpch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ issue_15403 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1153.67ms │    145.64ms │ +1.06x faster │
│ Q2129.33ms │    132.22ms │     no change │
│ Q31341.97ms │    964.69ms │ +1.39x faster │
│ Q4224.86ms │    224.15ms │     no change │
│ Q5410.67ms │    406.52ms │     no change │
│ Q6425.51ms │    420.21ms │     no change │
│ Q7651.09ms │    644.93ms │     no change │
│ Q8458.20ms │    463.22ms │     no change │
│ Q9476.74ms │    472.85ms │     no change │
│ Q10679.90ms │    667.38ms │     no change │
│ Q11414.58ms │    427.22ms │     no change │
└──────────────┴───────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)5366.53ms │
│ Total Time (issue_15403)4969.04ms │
│ Average Time (main)487.87ms │
│ Average Time (issue_15403)451.73ms │
│ Queries Faster2 │
│ Queries Slower0 │
│ Queries with No Change9 │
└────────────────────────────┴───────────┘

Are there any user-facing changes?

No

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Mar 27, 2025

Very nice catch 💯 Thank you @zhuqi-lucas. Can you also run the other benchmarks? --might be any case where copy overhead dominates the gain?

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 27, 2025

Thank you @berkaysynnada for review. Currently only sort-tpch q3 will have the case that string larger than 12bytes for most cases and also high percentage of same 4 bytes prefix, so that we will meet the performance regression about stringview.

For other testing about sort stringview, it will not affect too much, see q11, it also for stringview sort merge, but with short string less than 12bytes, let's see the result has some overhead due to gc(), but it does not affect too much.

Q11414.58ms │    427.22ms │     no change

And the main data to compare for Q11 already improved 20% after we merged single column stringview compare:
#15348 (comment)

And our code already has batch level gc(), for exampe when we organize_stringview_arrays for spilling:

let new_array = string_view_array.gc();

I think it will not cause too much overhead.

Also updated the 10G sort tpch for Q3, the similar result:

This PR

cargo run --profile release-nonlto --target aarch64-apple-darwin   --bin dfbench -- sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1
    Finished `release-nonlto` profile [optimized] target(s) in 1m 25s
     Running `target/aarch64-apple-darwin/release-nonlto/dfbench sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1`
Q3 iteration 0 took 7879.6 ms and returned 59986052 rows
Q3 avg time: 7879.60 ms

Main:

cargo run --profile release-nonlto --target aarch64-apple-darwin   --bin dfbench -- sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1
    Finished `release-nonlto` profile [optimized] target(s) in 0.26s
     Running `target/aarch64-apple-darwin/release-nonlto/dfbench sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1`
Q3 iteration 0 took 10178.2 ms and returned 59986052 rows
Q3 avg time: 10178.20 ms

@zhuqi-lucas
Copy link
Contributor Author

After thinking more, i think the better way is to improve it in arrow-rs, so we will benefit more about the utf8view regression cases, created the ticket for arrow-rs:

apache/arrow-rs#7350

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed response. I think this is a beneficial change and mergable, but there could be more thoughts people want to share, so let's wait a little more.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 28, 2025

Thank you @berkaysynnada , at the same time i will track and try to improve it in arrow-rs because the sort merge is one of the regression cases. cc @alamb for more input.

First of all, i create the reproducer benchmark PR for arrow-rs, and it's also part of the regression not the full set:

apache/arrow-rs#7351

Benchmark Utf8 Time (µs) Utf8View Time (µs) Slower (%)
eq custom 97.426 - 97.940 210.93 - 214.13 116%
neq custom 100.27 - 102.04 211.39 - 212.46 111%
gt custom 84.338 - 84.634 196.90 - 197.52 133%
gt custom scalar 99.384 - 99.817 181.69 - 182.25 83%
gt custom scalar long string 100.32 - 100.65 212.39 - 212.92 112%
like custom scalar 89.111 - 90.811 177.80 - 180.97 100%
nlike custom scalar 87.976 - 89.682 98.116 - 101.02 12%
ilike custom scalar 199.82 - 200.68 312.25 - 314.25 56%
nilike custom scalar 198.95 - 199.67 232.68 - 234.71 17%

@comphead comphead merged commit e2b7919 into apache:main Mar 29, 2025
39 checks passed
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single Utf8View )
4 participants