✨ re-implement `dplyr::{filter, mutate, arrange}` using data parallelism #22

jdhoffa · 2024-12-13T00:44:22Z

See this discussion:
#8

Acceptance criteria:

Analogue to dplyr::filter() written
Analogue to dplyr::mutate() written
Analogue to dplyr::arrange() written
Back-end to all should likely involve Rust's rayon crate, and follow the guidance outlined in this chapter: https://rust-lang-nursery.github.io/rust-cookbook/concurrency/parallel.html
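As a rough illustration of what a data-parallel `mutate()` back-end could look like, here is a minimal sketch. It uses only the standard library (`std::thread::scope`) so that it compiles without dependencies; this is a stand-in for rayon, not the approach the issue settles on. With rayon, the same idea would usually be written as `column.par_iter().map(|&x| f(x)).collect()`. The name `par_mutate` and the `n_threads` parameter are hypothetical.

```rust
use std::thread;

/// Hypothetical helper: apply `f` to every element of one numeric column,
/// splitting the work across up to `n_threads` scoped threads.
/// The output has the same length and order as the input.
pub fn par_mutate<F>(column: &[f64], n_threads: usize, f: F) -> Vec<f64>
where
    F: Fn(f64) -> f64 + Sync,
{
    let mut out = vec![0.0; column.len()];
    if column.is_empty() {
        return out;
    }

    // Ceiling division so that every element lands in exactly one chunk.
    let n_threads = n_threads.max(1);
    let chunk_len = (column.len() + n_threads - 1) / n_threads;

    // Share the closure by reference; `F: Sync` makes `&F` safe to send.
    let f = &f;

    thread::scope(|s| {
        // Pair each input chunk with the matching output chunk.
        for (src, dst) in column.chunks(chunk_len).zip(out.chunks_mut(chunk_len)) {
            s.spawn(move || {
                for (x, y) in src.iter().zip(dst.iter_mut()) {
                    *y = f(*x);
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });

    out
}
```

A call such as `par_mutate(&col, 4, |x| x * 2.0)` would return a new column and leave the input untouched.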

jonocarroll · 2024-12-16T09:44:56Z

One complication I'm interested in is how to support data.frame in general - there are some really funky edge cases that should probably be avoided; you can store a matrix in a data.frame column. You can store a list in a tibble column.

The {savvy} docs do mention working with Rust lists then converting back to data.frame once safely back in the world of R. That might have consequences for filtering a data.frame based on a calculation involving just one column. Perhaps a suitable approach would be calculating a boolean vector in Rust and still doing the actual filtering on the R side.
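A minimal sketch of that split, assuming a single numeric column is passed to Rust as a slice: Rust only computes the keep/drop mask, and R would then do the subsetting itself (for example with `df[mask, , drop = FALSE]`). The function name `par_mask` and the `n_threads` parameter are hypothetical, missing-value handling is left out, and `std::thread::scope` is used in place of rayon so the block stands on its own.

```rust
use std::thread;

/// Hypothetical helper: evaluate `pred` on each element of one column and
/// return a logical mask of the same length. No rows are dropped here;
/// the caller (the R side, in the proposal above) applies the mask.
pub fn par_mask<P>(column: &[f64], n_threads: usize, pred: P) -> Vec<bool>
where
    P: Fn(f64) -> bool + Sync,
{
    let mut mask = vec![false; column.len()];
    if column.is_empty() {
        return mask;
    }

    // Ceiling division so that every element lands in exactly one chunk.
    let n_threads = n_threads.max(1);
    let chunk_len = (column.len() + n_threads - 1) / n_threads;
    let pred = &pred;

    thread::scope(|s| {
        for (src, dst) in column.chunks(chunk_len).zip(mask.chunks_mut(chunk_len)) {
            s.spawn(move || {
                for (x, keep) in src.iter().zip(dst.iter_mut()) {
                    *keep = pred(*x);
                }
            });
        }
    });

    mask
}
```

Because only a `Vec<bool>` crosses back into R, the data.frame itself (including any unusual columns) never has to be rebuilt on the Rust side.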

jdhoffa · 2024-12-16T09:58:06Z

That is a super interesting point. One easy possibility (to begin with anyway) would be to only support more "standard" data-frame col-types, and error gracefully if we are faced with the funky edge cases?

jdhoffa · 2024-12-16T10:05:05Z

Re: filter, the idea of only calculating the boolean vector in Rust makes a lot of sense.

jdhoffa · 2024-12-16T10:07:43Z

Somewhat tangentially related, but I wonder if we should try to support larger-than-memory data structures too:
#26

asbates · 2024-12-18T03:10:02Z

> That is a super interesting point. One easy possibility (to begin with anyway) would be to only support more "standard" data-frame col-types, and error gracefully if we are faced with the funky edge cases?

I think this is probably the best approach. We can do the column type check in R and just never let weird columns enter Rust.
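The check itself is proposed for the R side. Purely to illustrate the allow-list logic, here is a sketch in Rust. The `ColumnKind` enum, the `check_columns` function, and the choice of which types count as "standard" are all assumptions; the thread does not name the supported types.

```rust
/// Hypothetical description of a column's type at the R/Rust boundary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    Logical,
    Integer,
    Double,
    Character,
    Matrix, // e.g. a matrix stored in a data.frame column
    List,   // e.g. a list-column in a tibble
}

/// Return `Ok(())` if every column is on the (assumed) allow-list,
/// otherwise an error message naming the first offending column.
pub fn check_columns(columns: &[(&str, ColumnKind)]) -> Result<(), String> {
    for (name, kind) in columns {
        match kind {
            ColumnKind::Logical
            | ColumnKind::Integer
            | ColumnKind::Double
            | ColumnKind::Character => {}
            other => {
                return Err(format!(
                    "column `{name}` has unsupported type {other:?}"
                ));
            }
        }
    }
    Ok(())
}
```

Failing on the first unsupported column keeps the error message short; collecting every offending column would be an equally reasonable choice.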

jdhoffa added the feature a feature request or enhancement label Dec 13, 2024

github-project-automation bot added this to 🚀 blazr: towards the first CRAN release Dec 13, 2024

github-project-automation bot moved this to Todo in 🚀 blazr: towards the first CRAN release Dec 13, 2024
