Skip to content

Optimize numeric-to-string coercion in StringArrayDecoder #7274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

ndemir
Copy link

@ndemir ndemir commented Mar 12, 2025

Which issue does this PR close?

I see an issue opened here: #7273
I realized that can be optimized and I worked on that (not in the way that was suggested in the issue).

Closes #7273

Rationale for this change

This PR introduces performance improvements for string coercion.

Performance Metrics:

Author of the #7273 recommends to use cargo bench --bench json_reader, we will go with that.

Step#1: Save Baseline Metrics

> git checkout main
> cargo bench --bench json_reader -- --save-baseline before        

Step#2: Compare the changes with Baseline Metrics

> git checkout ndemir/string-coerce-optimization
> cargo bench --bench json_reader --  --baseline before

small_bench_primitive   time:   [8.5790 µs 8.5890 µs 8.6029 µs]
                        change: [-3.0652% -2.4764% -1.9676%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

large_bench_primitive   time:   [3.2421 ms 3.2951 ms 3.3615 ms]
                        change: [+0.9382% +2.6439% +4.5593%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

small_bench_list        time:   [17.444 µs 17.655 µs 18.076 µs]
                        change: [-6.4619% -4.0195% -2.1687%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  3 (3.00%) high severe

What changes are included in this PR?

  • Shared buffer for numeric -> string: Added a number_buffer to reuse for numeric->string conversions.
  • Performance: This reduces overhead by writing numeric values to a single buffer and then converting to &str.
  • Safety comment: I added a comment and explain why from_utf8_unchecked is valid.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 12, 2025
@ndemir
Copy link
Author

ndemir commented Mar 12, 2025

I see that the performance is consistently BETTER even though it is not hitting the refactored code.
Will investigate.

@ndemir ndemir closed this Mar 12, 2025
write!(&mut self.number_buffer, "{}", n).unwrap();
// SAFETY: We only write ASCII characters (digits, signs, decimal points,
// exponent symbols) into `number_buffer`, which are guaranteed valid UTF-8.
unsafe { std::str::from_utf8_unchecked(&self.number_buffer) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid unsafe here by using

    number_buffer: String,

This is similar to how @zhuqi-lucas did it here:

        // Temporary buffer to avoid per-iteration allocation for numeric types
        let mut tmp_buf = String::new();
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON Reader Faster Coercion of Primitives to String
2 participants