[TST] More benchmark queries for regex #4910

Sicheng-Pan · 2025-06-21T00:12:18Z

Description of changes

Summarize the changes made by this PR.

Improvements & Bug fixes
- This PR adds more regex patterns to the benchmark. The benchmark also serves as an integration test for regex, since it compares the results against brute-force evaluation.
- Updates a few dependencies. Verified that there should be no breaking changes.
- Updates some wal3 tests because the fragment size changed after the dependency update. Existing fragments should remain compatible, and the manifest should still be valid.
New functionality
- N/A
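
To illustrate the "compare against brute-force evaluation" idea from the description, here is a minimal, std-only sketch. It is not the benchmark code from this PR: `TrigramIndex`, `brute_force`, and `check_against_brute_force` are hypothetical names, and a toy trigram index over literal queries stands in for the real regex and fulltext index.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy trigram index (hypothetical, for illustration only): maps each
/// 3-byte window to the ids of the documents that contain it.
struct TrigramIndex<'a> {
    docs: &'a [&'a str],
    postings: HashMap<[u8; 3], BTreeSet<usize>>,
}

impl<'a> TrigramIndex<'a> {
    fn build(docs: &'a [&'a str]) -> Self {
        let mut postings: HashMap<[u8; 3], BTreeSet<usize>> = HashMap::new();
        for (id, doc) in docs.iter().enumerate() {
            for w in doc.as_bytes().windows(3) {
                postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
            }
        }
        Self { docs, postings }
    }

    /// Index-backed search: intersect the postings of every trigram in the
    /// query, then verify each remaining candidate with a real substring check.
    fn search(&self, literal: &str) -> BTreeSet<usize> {
        let bytes = literal.as_bytes();
        if bytes.len() < 3 {
            // Too short to form a trigram; fall back to scanning.
            return brute_force(self.docs, literal);
        }
        let mut candidates: Option<BTreeSet<usize>> = None;
        for w in bytes.windows(3) {
            let ids = match self.postings.get(&[w[0], w[1], w[2]]) {
                Some(ids) => ids,
                // A trigram that appears nowhere means no document can match.
                None => return BTreeSet::new(),
            };
            candidates = Some(match candidates {
                None => ids.clone(),
                Some(c) => c.intersection(ids).copied().collect(),
            });
        }
        candidates
            .unwrap_or_default()
            .into_iter()
            .filter(|&id| self.docs[id].contains(literal))
            .collect()
    }
}

/// Reference implementation: scan every document.
fn brute_force(docs: &[&str], literal: &str) -> BTreeSet<usize> {
    docs.iter()
        .enumerate()
        .filter(|(_, d)| d.contains(literal))
        .map(|(i, _)| i)
        .collect()
}

/// Integration-style check: the indexed result must equal the brute-force
/// result for every query, otherwise the first mismatch is reported.
fn check_against_brute_force(docs: &[&str], queries: &[&str]) -> Result<(), String> {
    let index = TrigramIndex::build(docs);
    for q in queries {
        let expected = brute_force(docs, q);
        let actual = index.search(q);
        if actual != expected {
            return Err(format!(
                "mismatch for {q:?}: index={actual:?}, brute force={expected:?}"
            ));
        }
    }
    Ok(())
}
```

A benchmark harness could call `check_against_brute_force` once per query list before timing `search`, so that a fast but incorrect index fails loudly instead of producing a misleading number.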

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

github-actions · 2025-06-21T00:12:32Z

Sicheng-Pan · 2025-06-21T00:12:35Z

[TST] More benchmark queries for regex #4910 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

propel-code-bot · 2025-06-21T00:13:14Z

Benchmark Expansion for Regex Patterns and Rust Dataset; Arrow/Parquet Ecosystem Updates

This PR makes major changes to the benchmarking/integration suite for regex and fulltext search functionality by expanding the coverage and diversity of regular expressions used in tests and benchmarks, primarily focusing on realistic Rust code patterns. It introduces and wires up a new Rust dataset based on the BigCode 'the-stack-dedup' corpus, reworks the relevant dataset loader logic for async streaming from HuggingFace/hf-hub, and adapts the benchmark routines to leverage this dataset. There is also a substantial update to the Arrow and Parquet-related dependencies (from 52.x to 55.1) across the workspace, resulting in a large Cargo.lock diff and a need to update some file size assertions in integration tests. Additionally, dependency and dataset wiring is improved for future extensibility.

Key Changes:
• Adds comprehensive lists of realistic fulltext and regex query patterns to rust/index/benches/literal.rs and rust/worker/benches/regex.rs.
• Replaces ad-hoc filesystem dataset loading with a new loader for the BigCode The Stack deduplicated Rust subset (TheStackDedupRust), integrating async streaming of content.
• Adapts benchmarks and test frameworks to use the new Rust dataset for higher-fidelity evaluation.
• Upgrades Arrow, Parquet, and related dependencies to version 55.1 throughout the workspace and synchronizes code, test, and Cargo.lock for these changes.
• Corrects integration test fixtures involving file sizes/fragments for k8s/WAL tests to reflect new Parquet logic and block sizes.
• General dependency hygiene and cargo workspace corrections (e.g., moving to async Parquet APIs, resolving breaking changes, updating dataset utilities).

Affected Areas:
• Benchmark and test harnesses for regex/fulltext (rust/index/benches/literal.rs, rust/worker/benches/regex.rs)
• Dataset infrastructure (rust/benchmark/src/datasets/rust.rs, wiring in mod.rs and other dataset utilities)
• Dependency versions and build system (Cargo.toml, Cargo.lock, workspace settings)
• WAL3 integration/k8s tests expecting specific Parquet file sizes
• Dependency wiring for async Parquet/Arrow IO

This summary was automatically generated by @propel-code-bot

Sicheng-Pan marked this pull request as ready for review June 21, 2025 00:12

Sicheng-Pan requested a review from sanketkedia June 21, 2025 00:13

sicheng added 3 commits June 20, 2025 18:59

[TST] More benchmark queries for regex

7c530ff

Fix cargo lock

ddacc38

Fix lint

99c96f1

Sicheng-Pan force-pushed the sicheng/06-20-more-regex-bench branch from 4d57d3b to 99c96f1 Compare June 21, 2025 01:59

Update k8s tests for updated parquet dep

b967f30

github-actions bot commented Jun 21, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality