Reduced fsd() performance in >=2.0.16 #679

@andrewGhazi

Description

Hi there, I noticed that the performance of collapse::fsd() seems to have dropped between versions 2.0.15 and 2.0.16 on a small benchmark I've been running. I tried a few combinations of R and collapse, going back to 4.2.1 and 1.8.9 respectively, and I think the issue stems from collapse. This is the script I've been testing with:

install.packages(c("bench", "waldo", "remotes"))

remotes::install_version("data.table", "1.15.4")
#remotes::install_version("collapse",   "2.0.15")
remotes::install_version("collapse",   "2.0.16")

library(data.table); setDTthreads(1)
library(collapse); set_collapse(nthreads = 1)

n = 3e5

set.seed(123)

val_dt = data.table(g = rep(1:n, each = 6),
                    x = rt(6 * n, 3))

val_dt

dt_f1 = \(val_dt) val_dt[,.(x = sd(x)), by = g]

cl_f1 = \(val_dt) val_dt |> gby(g) |> fsd()

cl_f2 = \(val_dt) val_dt |> gby(g) |> smr(x = fsd(x))

check_fun = \(x,y) length(waldo::compare(x,y, tolerance = 1e-8)) == 0

res = bench::mark(data.table = dt_f1(val_dt),
                  collapse   = cl_f1(val_dt),
                  collapse2  = cl_f2(val_dt),
                  check = check_fun)

sessionInfo()

res |> 
  slt(expression:mem_alloc, n_itr)

On the rocker/r-ver:4.4.2 image with collapse@2.0.15 that script produces:

# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
1 data.table  34.84ms   37.3ms      22.6    53.9MB    12
2 collapse     9.01ms   12.4ms      81.3    24.3MB    41
3 collapse2     8.9ms   12.1ms      75.0    24.2MB    38

But the same script on the same image with collapse@2.0.16 yields:

# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
1 data.table   35.1ms   36.6ms      22.9    53.9MB    12
2 collapse     23.8ms   27.4ms      38.0    24.3MB    19
3 collapse2    44.4ms   48.2ms      21.0    24.2MB    11

The median time for the plain fsd() call went from about 12ms to 27ms, and for the smr(x = fsd(x)) call from about 12ms to 48ms. A similar test with fmean() showed no difference, but I didn't try any of the other fast statistical functions beyond those two.
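For anyone who wants to check whether other functions are affected, here is a rough sketch of how the comparison could be extended. It is not part of the original benchmark: the choice of fvar(), fmedian() and fsum() is only an assumption about what might be worth checking, and it uses base system.time() instead of bench. The grouping is pre-computed with GRP() so that only the statistic itself is timed, which may help separate a slowdown in fsd() from one in the grouping step.

```r
library(collapse); set_collapse(nthreads = 1)

set.seed(123)
n = 3e5

# Same data shape as the script above: n groups of 6 observations each
g = rep(1:n, each = 6)
x = rt(6 * n, 3)

# Group once up front so the grouping cost is excluded from the timings
grp = GRP(g)

# fsd and fmean are the two functions from the report;
# the others are just guesses at what else to compare
funs = list(fsd = fsd, fvar = fvar, fmean = fmean,
            fmedian = fmedian, fsum = fsum)

# Median elapsed time (seconds) over 10 runs for each function
timings = sapply(funs, \(f) {
  median(replicate(10, system.time(f(x, g = grp))[["elapsed"]]))
})

timings
```

Running this under both collapse versions and comparing the two timing vectors should show whether the slowdown is specific to fsd() or shared with related functions.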

Just thought I'd flag the issue since it seems it hasn't been noticed yet, and I didn't see anything that seemed like it could be related in the NEWS for 2.0.16. I wish I could offer more to identify the root issue, but I think it's beyond my capabilities.
