adding .getbatch() method to mnist_dataset dataset generator improves performance markedly #106

Open

Description

@gavril0

The standard dataset generator for the MNIST dataset (mnist_dataset) does not include a .getbatch() method and, as a result, getting a batch is quite slow, at least on the CPU.

# dataset root directory
dir <- "./dataset"

# download dataset
train_ds <- mnist_dataset(
  dir,
  download = TRUE,
  transform = transform_to_tensor
)
# dataloader 
train_dl <- dataloader(train_ds, batch_size = 128, shuffle = TRUE)
# get a batch via the dataloader iterator
train_iter <- train_dl$.iter()
microbenchmark::microbenchmark(b <- train_iter$.next())

The timings are:

Unit: milliseconds
                    expr     min       lq     mean   median       uq     max neval
 b <- train_iter$.next() 45.5263 47.78455 52.74117 49.19825 52.95675 87.2219   100

As explained in the vignette, in the absence of a .getbatch() method the dataloader calls the .getitem() method once per element to assemble a batch.
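Roughly, the per-item path amounts to something like the following (a simplified sketch, not the package's actual collation code, which is more general):

```r
library(torch)

# Simplified sketch of batching without a .getbatch() method:
# one .getitem() call per index, then stack the per-item tensors.
get_batch_itemwise <- function(ds, indices) {
  items <- lapply(indices, function(i) ds$.getitem(i))
  list(
    x = torch_stack(lapply(items, function(it) it$x)),
    y = torch_stack(lapply(items, function(it) it$y))
  )
}
```

Each .getitem() call pays the per-call overhead (indexing, transform dispatch) separately, which is why a single vectorized call for the whole batch is so much cheaper.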

Interestingly, it seems that the .getitem() method might be used as a .getbatch() method without any change:

# mnist_dataset .getitem() method
> train_ds$.getitem
function (index) 
{
    img <- self$data[index, , ]
    target <- self$targets[index]
    if (!is.null(self$transform)) 
        img <- self$transform(img)
    if (!is.null(self$target_transform)) 
        target <- self$target_transform(target)
    list(x = img, y = target)
}
<environment: 0x000001527445be68>
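This works because base-R array indexing is vectorized: passing a vector of indices to self$data[index, , ] extracts all the corresponding slices in one call, and self$targets[index] likewise. A minimal illustration:

```r
# A vector of row indices pulls out several slices at once,
# keeping the leading (batch) dimension.
a <- array(seq_len(3 * 2 * 2), dim = c(3, 2, 2))
dim(a[c(1, 3), , ])  # 2 2 2: two 2x2 slices in one call
```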

It is easy to add a .getbatch() method to the existing mnist_dataset dataset generator:

# create a new dataset generator that extends mnist_dataset
mnist_dataset2 <- dataset(
  inherit = mnist_dataset,
  .getbatch = function(index) {
    self$.getitem(index)
  }
)

Let's measure the performance with this new dataset generator:

# create a dataset with the new dataset generator
train_ds2 <- mnist_dataset2(
  dir,
  download = TRUE,
  transform = transform_to_tensor
)
# create a dataloader with the new dataset
train_dl2 <- dataloader(train_ds2, batch_size = 128, shuffle = TRUE)
# get a batch via the dataloader
train_iter2 <- train_dl2$.iter()
microbenchmark::microbenchmark(train_iter2$.next())
Unit: milliseconds
                expr      min       lq     mean   median       uq     max neval
 train_iter2$.next() 3.995601 4.328151 5.430246 4.601451 4.965501 11.7692   100

The new dataloader is almost 10 times faster!

That said, the new dataloader cannot be used in place of train_dl in this example, which uses luz to train the network:

fitted <- mnist_module %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_adam,
    metrics = list(
      luz_metric_accuracy()
    )
  ) %>%
  fit(train_dl, epochs = 1, valid_data = test_dl)

It yields an error message:

expected input[1, 28, 128, 28] to have 1 channels, but got 28 channels instead

I don't have a PC with a GPU to test whether there is a similar improvement when the data are loaded on the GPU. I also wonder why a .getbatch() method is not implemented more often, since it seems an easy way to improve performance. Though I did not investigate the origin of the error, the luz::fit method should be able to accept a dataloader whose dataset defines a .getbatch method.
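Though untested, a plausible explanation is a shape mismatch: the per-item path yields one [1, 28, 28] tensor per image that the dataloader stacks into [128, 1, 28, 28], while the batched path hands luz a tensor without that singleton channel dimension (the [1, 28, 128, 28] in the error message hints at dimensions ending up in the wrong order). If that is the cause, a .getbatch() that inserts the channel dimension explicitly might work; this is a speculative sketch, not a verified fix:

```r
# Speculative sketch: batched dataset that adds an explicit channel
# dimension so the batch has shape [N, 1, 28, 28], matching what the
# per-item path produces. Untested against luz.
mnist_dataset3 <- dataset(
  inherit = mnist_dataset,
  .getbatch = function(index) {
    batch <- self$.getitem(index)
    batch$x <- batch$x$unsqueeze(2)  # [N, 28, 28] -> [N, 1, 28, 28]
    batch
  }
)
```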
