Skip to content

Support Utf8View for Avro #7434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kumarlokesh
Copy link

@kumarlokesh kumarlokesh commented Apr 22, 2025

Which issue does this PR close?

Closes #7262.

Rationale for this change

What changes are included in this PR?

This PR adds support for Utf8View to the arrow-avro reader, enabling the creation of StringViewArray instead of regular StringArray for string data.

  • Added Utf8View variant to Codec enum and with_utf8view method for codec conversion
  • Updated Decoder enum with a StringView variant to handle StringViewArray creation
  • Added a configuration option use_utf8view to ReadOptions struct
  • Added benchmarks
  • Created an example application demonstrating how to use the feature
  • Updated documentation throughout the codebase

Benchmark output:

array_creation/string_array_10_chars                                                                            
                        time:   [9.5921 µs 9.6520 µs 9.7296 µs]
                        change: [-7.4725% -5.8906% -4.2024%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_view_10_chars                                                                            
                        time:   [10.053 µs 10.128 µs 10.207 µs]
                        change: [-3.1158% -0.9311% +1.0989%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
array_creation/string_array_100_chars                                                                            
                        time:   [16.242 µs 16.372 µs 16.576 µs]
                        change: [-1.3374% +0.4015% +2.4089%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_view_100_chars                                                                            
                        time:   [13.027 µs 13.290 µs 13.592 µs]
                        change: [-16.763% -12.893% -9.4265%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_array_1000_chars                                                                           
                        time:   [76.218 µs 77.540 µs 78.691 µs]
                        change: [-8.4455% -6.3045% -4.1053%] (p = 0.00 < 0.05)
                        Performance has improved.
array_creation/string_view_1000_chars                                                                            
                        time:   [45.964 µs 46.399 µs 46.770 µs]
                        change: [-8.7387% -3.9085% -0.5912%] (p = 0.07 > 0.05)
                        No change in performance detected.

string_operations/string_array_value_10_chars                                                                            
                        time:   [546.50 ns 565.27 ns 585.60 ns]
                        change: [-16.303% -10.503% -5.2608%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
string_operations/string_view_value_10_chars                                                                            
                        time:   [973.27 ns 992.09 ns 1.0101 µs]
                        change: [-14.527% -5.2942% +1.9214%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) low mild
  2 (10.00%) high mild
string_operations/string_array_value_100_chars                                                                            
                        time:   [548.03 ns 559.51 ns 573.23 ns]
                        change: [+1.1381% +3.0271% +5.0339%] (p = 0.01 < 0.05)
                        Performance has regressed.
string_operations/string_view_value_100_chars                                                                            
                        time:   [987.22 ns 1.0068 µs 1.0248 µs]
                        change: [-8.6431% +0.4588% +6.7220%] (p = 0.93 > 0.05)
                        No change in performance detected.
string_operations/string_array_value_1000_chars                                                                            
                        time:   [545.70 ns 556.69 ns 571.33 ns]
                        change: [-13.942% -8.4282% -3.7589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe
string_operations/string_view_value_1000_chars                                                                            
                        time:   [947.71 ns 960.19 ns 974.77 ns]
                        change: [+0.9469% +4.6877% +8.4201%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild

avro_reader/string_array_10_chars                                                                           
                        time:   [149.00 µs 152.79 µs 155.90 µs]
                        change: [+2.5473% +5.4637% +8.5654%] (p = 0.00 < 0.05)
                        Performance has regressed.
avro_reader/string_view_10_chars                                                                           
                        time:   [141.75 µs 142.96 µs 144.64 µs]
                        change: [-14.307% -5.2067% +1.3228%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
avro_reader/string_array_100_chars                                                                           
                        time:   [173.20 µs 177.79 µs 183.34 µs]
                        change: [-4.6247% +1.1333% +5.8015%] (p = 0.72 > 0.05)
                        No change in performance detected.
avro_reader/string_view_100_chars                                                                           
                        time:   [174.64 µs 178.95 µs 183.48 µs]
                        change: [-10.183% -0.2004% +7.0467%] (p = 0.97 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
avro_reader/string_array_1000_chars                                                                           
                        time:   [404.78 µs 415.51 µs 428.94 µs]
                        change: [-12.609% -2.9688% +4.7088%] (p = 0.64 > 0.05)
                        No change in performance detected.
avro_reader/string_view_1000_chars                                                                           
                        time:   [392.38 µs 403.15 µs 416.30 µs]
                        change: [+4.5005% +11.040% +18.788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe

From the above benchmark results, we can draw following conclusions:

1. Array Creation Performance
1.1 Small Strings (10 chars):
StringArray: 9.65 µs
StringViewArray: 10.13 µs
StringArray is about 5% faster for small strings

1.2 Medium Strings (100 chars):
StringArray: 16.37 µs
StringViewArray: 13.29 µs
StringViewArray is about 19% faster

1.3 Large Strings (1000 chars):
StringArray: 77.54 µs
StringViewArray: 46.40 µs
StringViewArray is about 40% faster

2. String Value Access Performance
Value Access (all string sizes):
StringArray: ~560 ns
StringViewArray: ~980 ns
StringArray is consistently about 43% faster for value access

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Utf8View for avro
1 participant