Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using furrr instead of pbapply and letting the user set the parallelization. #17

Merged
merged 9 commits into from
Jan 6, 2025
Prev Previous commit
Next Next commit
Benchmark computation time with parallelization.
ManonSimonot committed Jan 6, 2025
commit efab186352fdd6710357476fbd55184d63e20916
91 changes: 91 additions & 0 deletions R/parallelization-vignette.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
## code to prepare `parallelization-vignette` dataset goes here

# Setup -------------------------------------------------------------------

library(flipr)
library(future)
library(parallel)
library(progressr)
library(tictoc)

ngrid_in <- 50L
nperms <- 5000
n1 <- 10
set.seed(1234)
x <- rnorm(n1, mean = 1, sd = 1)
y <- rnorm(n1, mean = 4, sd = 1)

null_spec <- function(y, parameters) {
purrr::map(y, ~ .x - parameters[1])
}

stat_functions <- list(stat_t)

stat_assignments <- list(delta = 1)

# Inference on the mean without parallelization --------------------------------

pf <- PlausibilityFunction$new(
null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
x, y
)

pf$set_point_estimate(mean(y) - mean(x))

pf$set_nperms(nperms)
pf$set_parameter_bounds(
point_estimate = pf$point_estimate,
conf_level = pf$max_conf_level
)

pf$set_grid(
parameters = pf$parameters,
npoints = ngrid_in
)

tic()
pf$evaluate_grid(grid = pf$grid)
time_without_parallelization <- toc()$callback_msg

# Inference on the mean with parallelization -----------------------------------

ncores <- 4
plan(multisession, workers = ncores)
cl <- makeCluster(ncores)
setDefaultCluster(cl)
progressr::handlers(global = TRUE)

pf <- PlausibilityFunction$new(
null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
x, y
)

pf$set_point_estimate(mean(y) - mean(x))

pf$set_nperms(nperms)
pf$set_parameter_bounds(
point_estimate = pf$point_estimate,
conf_level = pf$max_conf_level
)

pf$set_grid(
parameters = pf$parameters,
npoints = ngrid_in
)

tic()
pf$evaluate_grid(grid = pf$grid)
time_with_parallelization <- toc()$callback_msg

df_parallelization <- list(
delta = pf$grid$delta,
time_par = time_with_parallelization,
time_without_par = time_without_parallelization
)

saveRDS(df_parallelization, "data-raw/df_parallelization.rds")

Binary file modified R/sysdata.rda
Binary file not shown.
2 changes: 2 additions & 0 deletions data-raw/build-sysdata.R
Original file line number Diff line number Diff line change
@@ -6,11 +6,13 @@ df_mean <- readRDS("data-raw/df_mean.rds")
df_sd <- readRDS("data-raw/df_sd.rds")
df_fisher <- readRDS("data-raw/df_fisher.rds")
df_tippett <- readRDS("data-raw/df_tippett.rds")
df_parallelization <- readRDS("data-raw/df_parallelization.rds")
usethis::use_data(
alpha_estimates,
pfa, pfb, pfc,
df_mean, df_sd,
df_fisher, df_tippett,
df_parallelization,
overwrite = TRUE,
internal = TRUE
)
Binary file added data-raw/df_parallelization.rds
Binary file not shown.
61 changes: 58 additions & 3 deletions vignettes/parallelization.Rmd
Original file line number Diff line number Diff line change
@@ -12,18 +12,64 @@ knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
load("../R/sysdata.rda")
time_without_parallelization <- df_parallelization$time_without_par
time_with_parallelization <- df_parallelization$time_par
```

The [**flipr**](https://lmjl-alea.github.io/flipr/) package uses functions contained in the [**furrr**](https://future.futureverse.org/index.html) package for parallel processing. The setting of parallelization has to be done on the user side. We illustrate here how to achieve asynchronous evaluation. We use the [**future**](https://future.futureverse.org/index.html) package to set the plan, the **parallel** package to define a default cluster, and the [**progressr**](https://progressr.futureverse.org/index.html) package to report progress updates.

By setting the desired number of cores, we define the number of background R sessions that will be used to evaluate expressions in parallel. This number is used to set the multisession plan with the function `future::plan()` and to define a default cluster with `parallel::setDefaultCluster()`. Then, to enable the visualization of evaluation progress, we can put the code in the `progressr::with_progress()` function, or more simply set it for all the following code with the `progressr::handlers()` function. After these settings, [**flipr**](https://lmjl-alea.github.io/flipr/) functions can be used, as shown in this example.

```{r setup}
library(flipr)
```

## Computation without parallel processing

To show the benefit of parallel processing, we compare here the processing times necessary to evaluate a grid with a plausibility function. First, here is the computation without parallelization.

```{r, eval=FALSE}
set.seed(1234)
x <- rnorm(10, 1, 1)
y <- rnorm(10, 4, 1)

null_spec <- function(y, parameters) {
purrr::map(y, ~ .x - parameters[1])
}
stat_functions <- list(stat_t)
stat_assignments <- list(delta = 1)

pf <- PlausibilityFunction$new(
null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
x, y
)

pf$set_point_estimate(mean(y) - mean(x), overwrite = TRUE)
pf$set_parameter_bounds(
point_estimate = pf$point_estimate,
conf_level = pf$max_conf_level
)
pf$set_grid(
parameters = pf$parameters,
npoints = 50L
)

tictoc::tic()
pf$evaluate_grid(grid = pf$grid)
time_without_parallelization <- tictoc::toc()
```

```{r}
time_without_parallelization
```

## Computation with parallel processing

By setting the desired number of cores, we define the number of background R sessions that will be used to evaluate expressions in parallel. This number is used to set the multisession plan with the function `future::plan()` and to define a default cluster with `parallel::setDefaultCluster()`. Then, to enable the visualization of evaluation progress, we can put the code in the `progressr::with_progress()` function, or more simply set it for all the following code with the `progressr::handlers()` function. After these settings, [**flipr**](https://lmjl-alea.github.io/flipr/) functions can be used, as shown in this example.

```{r, eval=FALSE}
ncores <- 3
ncores <- 4
future::plan(multisession, workers = ncores)
cl <- parallel::makeCluster(ncores)
parallel::setDefaultCluster(cl)
@@ -55,13 +101,22 @@ pf$set_grid(
parameters = pf$parameters,
npoints = 50L
)

tictoc::tic()
pf$evaluate_grid(grid = pf$grid)
time_with_parallelization <- tictoc::toc()

parallel::stopCluster(cl)
```

It is good practice to shut down the workers with the `parallel::stopCluster()` function at the end of the code.

```{r}
time_with_parallelization
```

This experiment proves that we can save a lot of computation time when using parallel processing, as we gained approximately 33 seconds in this example to evaluate the plausibility function.

Finally, to return to a sequential plan with no progress updates, the following code can be used.

```{r, eval=FALSE}