Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

zhuqi-lucas · 2025-03-27T02:22:07Z

…ng on a single Utf8View )

Which issue does this PR close?

Closes #15403

Rationale for this change

First we look at why Q3 case slow:

Here is source code:

 pub unsafe fn compare_unchecked(
        left: &GenericByteViewArray<T>,
        left_idx: usize,
        right: &GenericByteViewArray<T>,
        right_idx: usize,
    ) -> std::cmp::Ordering {
        let l_view = left.views().get_unchecked(left_idx);
        let l_len = *l_view as u32;

        let r_view = right.views().get_unchecked(right_idx);
        let r_len = *r_view as u32;

        if l_len <= 12 && r_len <= 12 {
            let l_data = unsafe { GenericByteViewArray::<T>::inline_value(l_view, l_len as usize) };
            let r_data = unsafe { GenericByteViewArray::<T>::inline_value(r_view, r_len as usize) };
            return l_data.cmp(r_data);
        }

        // one of the string is larger than 12 bytes,
        // we then try to compare the inlined data first
        let l_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(l_view, 4) };
        let r_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(r_view, 4) };
        if r_inlined_data != l_inlined_data {
            return l_inlined_data.cmp(r_inlined_data);
        }

        // unfortunately, we need to compare the full data
        let l_full_data: &[u8] = unsafe { left.value_unchecked(left_idx).as_ref() };
        let r_full_data: &[u8] = unsafe { right.value_unchecked(right_idx).as_ref() };

        l_full_data.cmp(r_full_data)
    }

We will mostly go to the last compare case: unfortunately, we need to compare the full data
due to the l_comments dataset will have many same prefix when compare. But compare with Utf8 full data compare, the Utf8View full data compare is slow, mostly i think is the stringView full data compare can't be optimized well by compiler due to many possible reasons:

The stringview data buffers are going larger when we construct the cursor values to compare.
The compare data are not data locality for stringview, but it is data locality for stringarray because they are copied to the only one buffer.
The fulldata of stringarray enable better inlining and vectorized operations.

Solution:

Introduce GC() for stringview type when create cursor values.

Summary
By applying GC() to the stringview type when creating cursor values, we ensure that:

Only essential data is kept: Reducing the buffer size minimizes memory overhead.

Data becomes more cache-friendly: Smaller buffers allow for better CPU cache utilization, leading to faster string comparisons.

Compiler optimizations are more effective: Improved data locality and reduced buffer size enable better inlining and vectorized operations, contributing to overall performance gains.

This change is well-founded, as it addresses both memory efficiency and execution speed, especially in scenarios with intensive string comparison operations.

   /// # Garbage Collection
    ///
    /// Before GC:
    /// ```text
    ///                                        ┌──────┐
    ///                                        │......│
    ///                                        │......│
    /// ┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer
    /// │       View 1       │─ ─ ─ ─          │......│  with data that
    /// ├────────────────────┤                 │......│ is not referred
    /// │       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
    /// └────────────────────┘                 │......│      View 2
    ///                                        │......│
    ///    2 views, refer to                   │......│
    ///   small portions of a                  └──────┘
    ///      large buffer
    /// ```
    ///
    /// After GC:
    ///
    /// ```text
    /// ┌────────────────────┐                 ┌─────┐    After gc, only
    /// │       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is
    /// ├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by
    /// │       View 2       │─ ─ ─ ─          └─────┘     the views is
    /// └────────────────────┘                                 left
    ///
    ///
    ///         2 views
    /// ```

And of course gc() has some overhead, we need to copy the data, but we get huge performance than loss from the actual testing result.

What changes are included in this PR?

Introduce GC() for stringview type when create cursor values.

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single Utf8View )

Are these changes tested?

Yes, the Q3 will have huge improvement for about 40% performance.

python3 ./compare.py ./results/main/sort_tpch.json ./results/issue_15403/sort_tpch.json
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ issue_15403 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  153.67ms │    145.64ms │ +1.06x faster │
│ Q2           │  129.33ms │    132.22ms │     no change │
│ Q3           │ 1341.97ms │    964.69ms │ +1.39x faster │
│ Q4           │  224.86ms │    224.15ms │     no change │
│ Q5           │  410.67ms │    406.52ms │     no change │
│ Q6           │  425.51ms │    420.21ms │     no change │
│ Q7           │  651.09ms │    644.93ms │     no change │
│ Q8           │  458.20ms │    463.22ms │     no change │
│ Q9           │  476.74ms │    472.85ms │     no change │
│ Q10          │  679.90ms │    667.38ms │     no change │
│ Q11          │  414.58ms │    427.22ms │     no change │
└──────────────┴───────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)          │ 5366.53ms │
│ Total Time (issue_15403)   │ 4969.04ms │
│ Average Time (main)        │  487.87ms │
│ Average Time (issue_15403) │  451.73ms │
│ Queries Faster             │         2 │
│ Queries Slower             │         0 │
│ Queries with No Change     │         9 │
└────────────────────────────┴───────────┘

Are there any user-facing changes?

No

…ng on a single Utf8View )

berkaysynnada · 2025-03-27T11:57:34Z

Very nice catch 💯 Thank you @zhuqi-lucas. Can you also run the other benchmarks? --might be any case where copy overhead dominates the gain?

zhuqi-lucas · 2025-03-27T12:52:58Z

Thank you @berkaysynnada for review. Currently only sort-tpch q3 will have the case that string larger than 12bytes for most cases and also high percentage of same 4 bytes prefix, so that we will meet the performance regression about stringview.

For other testing about sort stringview, it will not affect too much, see q11, it also for stringview sort merge, but with short string less than 12bytes, let's see the result has some overhead due to gc(), but it does not affect too much.

Q11          │  414.58ms │    427.22ms │     no change

And the main data to compare for Q11 already improved 20% after we merged single column stringview compare:
#15348 (comment)

And our code already has batch level gc(), for exampe when we organize_stringview_arrays for spilling:

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Line 493 in 18feb8b

let new_array = string_view_array.gc();

I think it will not cause too much overhead.

Also updated the 10G sort tpch for Q3, the similar result:

This PR

cargo run --profile release-nonlto --target aarch64-apple-darwin   --bin dfbench -- sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1
    Finished `release-nonlto` profile [optimized] target(s) in 1m 25s
     Running `target/aarch64-apple-darwin/release-nonlto/dfbench sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1`
Q3 iteration 0 took 7879.6 ms and returned 59986052 rows
Q3 avg time: 7879.60 ms

Main:

cargo run --profile release-nonlto --target aarch64-apple-darwin   --bin dfbench -- sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1
    Finished `release-nonlto` profile [optimized] target(s) in 0.26s
     Running `target/aarch64-apple-darwin/release-nonlto/dfbench sort-tpch -p /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 -q 3 -i 1`
Q3 iteration 0 took 10178.2 ms and returned 59986052 rows
Q3 avg time: 10178.20 ms

zhuqi-lucas · 2025-03-28T09:57:12Z

After thinking more, i think the better way is to improve it in arrow-rs, so we will benefit more about the utf8view regression cases, created the ticket for arrow-rs:

apache/arrow-rs#7350

berkaysynnada

Thank you for the detailed response. I think this is a beneficial change and mergable, but there could be more thoughts people want to share, so let's wait a little more.

zhuqi-lucas · 2025-03-28T10:26:53Z

Thank you @berkaysynnada , at the same time i will track and try to improve it in arrow-rs because the sort merge is one of the regression cases. cc @alamb for more input.

First of all, i create the reproducer benchmark PR for arrow-rs, and it's also part of the regression not the full set:

apache/arrow-rs#7351

Benchmark	Utf8 Time (µs)	Utf8View Time (µs)	Slower (%)
eq custom	97.426 - 97.940	210.93 - 214.13	116%
neq custom	100.27 - 102.04	211.39 - 212.46	111%
gt custom	84.338 - 84.634	196.90 - 197.52	133%
gt custom scalar	99.384 - 99.817	181.69 - 182.25	83%
gt custom scalar long string	100.32 - 100.65	212.39 - 212.92	112%
like custom scalar	89.111 - 90.811	177.80 - 180.97	100%
nlike custom scalar	87.976 - 89.682	98.116 - 101.02	12%
ilike custom scalar	199.82 - 200.68	312.25 - 314.25	56%
nilike custom scalar	198.95 - 199.67	232.68 - 234.71	17%

…ng on a single Utf8View ) (apache#15447)

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi…

5057921

…ng on a single Utf8View )

alamb added the performance Make DataFusion faster label Mar 27, 2025

zhuqi-lucas mentioned this pull request Mar 28, 2025

StringArrayView(Utf8View) slower cases compare to StringArray(Utf8) apache/arrow-rs#7350

Open

berkaysynnada approved these changes Mar 28, 2025

View reviewed changes

comphead merged commit e2b7919 into apache:main Mar 29, 2025
39 checks passed

goldmedal mentioned this pull request Mar 30, 2025

custom_datasource example panicked during RepartitionExec planning #15493

Closed

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi…

1092611

…ng on a single Utf8View ) (apache#15447)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

zhuqi-lucas commented Mar 27, 2025 •

edited

Loading

berkaysynnada commented Mar 27, 2025 •

edited

Loading

zhuqi-lucas commented Mar 27, 2025 •

edited

Loading

zhuqi-lucas commented Mar 28, 2025

berkaysynnada left a comment

zhuqi-lucas commented Mar 28, 2025 •

edited

Loading

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

Conversation

zhuqi-lucas commented Mar 27, 2025 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

berkaysynnada commented Mar 27, 2025 • edited Loading

zhuqi-lucas commented Mar 27, 2025 • edited Loading

zhuqi-lucas commented Mar 28, 2025

berkaysynnada left a comment

Choose a reason for hiding this comment

zhuqi-lucas commented Mar 28, 2025 • edited Loading

zhuqi-lucas commented Mar 27, 2025 •

edited

Loading

berkaysynnada commented Mar 27, 2025 •

edited

Loading

zhuqi-lucas commented Mar 27, 2025 •

edited

Loading

zhuqi-lucas commented Mar 28, 2025 •

edited

Loading