
Conversation

@koshtiakanksha (Contributor) commented Aug 14, 2025

Closes #230
Closes #231
Closes #232
Closes #233
Closes #234

@cregouby (Collaborator) left a comment

review is partial
praise Congratulations on the ambitious scope of this PR.
todo see inline comments

#'
#' @examples
#' \dontrun{
#' devtools::load_all()
@cregouby (Collaborator) commented Aug 20, 2025

todo {devtools} is for developers, not for end-users. Please remove (as it is unneeded)

Comment on lines 96 to 98
x <- if (ext %in% c("jpg","jpeg")) jpeg::readJPEG(img_path)
else if (ext == "png") png::readPNG(img_path)
else jpeg::readJPEG(img_path)

improvement please brace all if/else code according to the lintr/styler recommendations; see https://style.tidyverse.org/syntax.html#braced-expressions
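
A braced version of the snippet above might look like this (a minimal sketch; it keeps the original fallback to jpeg::readJPEG() for unknown extensions):

x <- if (ext %in% c("jpg", "jpeg")) {
  jpeg::readJPEG(img_path)
} else if (ext == "png") {
  png::readPNG(img_path)
} else {
  # same fallback as the original code: try to read as JPEG
  jpeg::readJPEG(img_path)
}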

Comment on lines 95 to 100
ext <- tolower(fs::path_ext(img_path))
x <- if (ext %in% c("jpg","jpeg")) jpeg::readJPEG(img_path)
else if (ext == "png") png::readPNG(img_path)
else jpeg::readJPEG(img_path)
if (length(dim(x)) == 3 && dim(x)[3] == 4) x <- x[,,1:3, drop = FALSE]
if (length(dim(x)) == 2) x <- array(rep(x, 3L), dim = c(dim(x), 3L))
@cregouby (Collaborator) commented Aug 20, 2025

improvement blocking this looks like a duplicate of

base_loader <- function(path) {
  ext <- tolower(fs::path_ext(path))
  if (ext %in% c("jpg", "jpeg"))
    img <- jpeg::readJPEG(path)
  else if (ext %in% c("png"))
    img <- png::readPNG(path)
  else if (ext %in% c("tif", "tiff"))
    img <- tiff::readTIFF(path)
  else
    runtime_error("unknown extension {ext} in path {path}")
  if (length(dim(img)) == 2)
    img <- abind::abind(img, img, img, along = 3)
  else if (length(dim(img)) == 3 && dim(img)[1] == 1)
    img <- abind::abind(img, img, img, along = 1)
  img
}

improvement blocking using base_loader() could be generalized to the parent class, allowing the child datasets to rely on the inherited .getitem() method. This would remove a lot of duplicated code (applies multiple times)
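
A sketch of what this could look like inside a child dataset's .getitem() method, assuming the image path variable is named img_path as in the snippet above:

# read the image through the shared helper instead of per-dataset branching;
# base_loader() (quoted above) already handles jpg/jpeg/png/tif/tiff and grayscale images
x <- base_loader(img_path)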

@cregouby (Collaborator) left a comment

continued review, still partial

DESCRIPTION Outdated
@@ -54,5 +55,50 @@ Suggests:
testthat,
coro,
R.matlab,
xml2
xml2,
hfhub

todo not used so please remove.

DESCRIPTION Outdated
@@ -43,6 +43,7 @@ Imports:
png,
abind,
jsonlite,
yaml,

todo not used so please remove.

"activity_diagram","signature","paper_part",
"tabular_data","paragraph"),
url = paste0(
"https://huggingface.co/datasets/akankshakoshti/rf100-doc/resolve/main/",
@cregouby (Collaborator) commented Aug 20, 2025

todo please do not multiply hosting locations for existing datasets. For the sake of lineage, consistency, and licensing, please stick to the provided dataset URL

@cregouby (Collaborator) left a comment

todo Please use #' @include dataset-rf100-underwater.R to manage inheritance, rather than using Collate in the DESCRIPTION.
improvement blocking a lot of the methods could be generalized across those datasets, and most if not all of the methods should be inherited. This would greatly reduce the code base among the rf100_* files (see

whoi_plankton_dataset <- torch::dataset(
  name = "whoi_plankton",
  inherit = whoi_small_plankton_dataset,
  archive_size = "9.1 GB",
  resources = data.frame(
    split = c(rep("test", 4), rep("train", 13), rep("val", 2)),
    url = c(
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/test-0000", 0:3, "-of-00004.parquet?download=true"),
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-0000", 0:9, "-of-00013.parquet?download=true"),
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-000", 10:12, "-of-00013.parquet?download=true"),
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/validation-0000", 0:1, "-of-00002.parquet?download=true")
    ),
    md5 = c(
      "cd41b344ec4b6af83e39c38e19f09190",
      "aa0965c0e59f7b1cddcb3c565d80edf3",
      "b2a75513f1a084724e100678d8ee7180",
      "a03c4d52758078bfb0799894926d60f6",
      "07eaff140f39868a8bcb1d3c02ebe60f",
      "87c927b9fbe0c327b7b9ae18388b4fcf",
      "456efd91901571a41c2157732880f6b8",
      "dc929fde45e3b2e38bdd61a15566cf32",
      "f92ab6cfb4a3dd7e0866f3fdf8dbc33c",
      "61c555bba39b6b3ccb4f02a5cf07e762",
      "57e03cecf2b5d97912ed37e1b8fc6263",
      "56081cc99e61c36e89db1566dbbf06c1",
      "60b7998630468cb18880660c81d1004a",
      "1fa94ceb54d4e53643a0d8cf323af901",
      "7a7be4e3dfdc39a50c8ca086a4d9a8de",
      "07194caf75805e956986cba68e6b398e",
      "0f4d47f240cd9c30a7dd786171fa40ca",
      "db827a7de8790cdcae67b174c7b8ea5e",
      "d3181d9ffaed43d0c01f59455924edca"
    ),
    size = c(rep(450e6, 4), rep(490e6, 13), rep(450e6, 2))
  )
)

DESCRIPTION Outdated
Comment on lines 61 to 104
Collate:
'conditions.R'
'dataset-caltech.R'
'dataset-cifar.R'
'dataset-coco.R'
'dataset-eurosat.R'
'dataset-fer.R'
'dataset-fgvc.R'
'dataset-flickr.R'
'dataset-flowers.R'
'dataset-lfw.R'
'dataset-mnist.R'
'dataset-oxfordiiitpet.R'
'dataset-pascal.R'
'dataset-places365.R'
'dataset-plankton.R'
'dataset-rf100-underwater.R'
'dataset-rf100-doc.R'
'dataset-rf100-electromagnetic.R'
'dataset-rf100-microscopic.R'
'dataset-rf100-peixos.R'
'extension.R'
'folder-dataset.R'
'globals.R'
'models-alexnet.R'
'models-deeplabv3.R'
'models-efficientnet.R'
'models-efficientnetv2.R'
'models-fcn.R'
'models-inception.R'
'models-mobilenetv2.R'
'models-resnet.R'
'models-vgg.R'
'models-vit.R'
'ops-box_convert.R'
'ops-boxes.R'
'tiny-imagenet-dataset.R'
'transforms-array.R'
'transforms-defaults.R'
'transforms-generics.R'
'transforms-magick.R'
'transforms-tensor.R'
'utils.R'
'vision_utils.R'

todo Collate is a bad idea to work around the inherit = object-not-found error. Please remove it.
suggestion Please rather use #' @include dataset-rf100-underwater.R in the dataset definition when you need it

#' @export
rf100_document_collection <- torch::dataset(
  name = "rf100_document_collection",
  inherit = rf100_underwater_collection,

suggestion Please rather use #' @include dataset-rf100-underwater.R to avoid an object-not-found error at build time
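
For example, the roxygen header of the definition above could carry the directive directly (a sketch; roxygen2 then derives the Collate order from @include, so no manual Collate field is needed):

#' @include dataset-rf100-underwater.R
#' @export
rf100_document_collection <- torch::dataset(
  name = "rf100_document_collection",
  inherit = rf100_underwater_collection
  # resources and methods specific to the document collection go here
)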

"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/bccd-ouzjz.zip?download=1",
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/parasites-1s07h.zip?download=1",
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/cells-uyemf.zip?download=1",
NA_character_, # liquid_crystals not present in repo

todo please use dataset from https://huggingface.co/datasets/Francesco/4-fold-defect for liquid crystals. I apologize for that as I renamed it without notice.
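
Assuming the Francesco/4-fold-defect repository follows the same dataset.tar.gz layout as the other Francesco/* URLs in this PR (to be verified), the NA_character_ entry above could become:

# liquid_crystals, hosted under Francesco/4-fold-defect (exact path assumed, please verify)
"https://huggingface.co/datasets/Francesco/4-fold-defect/resolve/main/dataset.tar.gz?download=1",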

Comment on lines 102 to 104
if (is.na(self$archive_url) || !nzchar(self$archive_url)) {
runtime_error(paste0("No download URL for dataset '", self$dataset, "'."))
}

todo this is unneeded. Please remove it after adding the correct URL

@cregouby (Collaborator) left a comment

todo please add an entry in the NEWS

@koshtiakanksha (Contributor, Author) commented

Hey @cregouby, the nested TAR structure made this a bit tricky. I tried simpler approaches but ran into errors, so this is the best I could get working. Does this look okay to you, or would you suggest another approach?

@cregouby (Collaborator) left a comment

todo please switch to the parquet version of the dataset (see inline)

Comment on lines 279 to 281
if (length(dim(x)) == 3 && dim(x)[3] == 4) {
x <- x[, , 1:3, drop = FALSE]
}

improvement please add a comment to such code, e.g. # remove alpha channel if present
question is this specific to some images in one dataset, or is it wider than that?
suggestion if it occurs in multiple datasets, it would be worth moving that code into base_loader()
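
If it does occur across datasets, here is a sketch of base_loader() extended with that handling, based on the version quoted earlier in this review (runtime_error() is the package-internal condition helper):

base_loader <- function(path) {
  ext <- tolower(fs::path_ext(path))
  if (ext %in% c("jpg", "jpeg")) {
    img <- jpeg::readJPEG(path)
  } else if (ext == "png") {
    img <- png::readPNG(path)
  } else if (ext %in% c("tif", "tiff")) {
    img <- tiff::readTIFF(path)
  } else {
    runtime_error("unknown extension {ext} in path {path}")
  }
  # remove alpha channel if present, keeping only the RGB planes
  if (length(dim(img)) == 3 && dim(img)[3] == 4) {
    img <- img[, , 1:3, drop = FALSE]
  }
  # replicate grayscale images to 3 channels
  if (length(dim(img)) == 2) {
    img <- abind::abind(img, img, img, along = 3)
  } else if (length(dim(img)) == 3 && dim(img)[1] == 1) {
    img <- abind::abind(img, img, img, along = 1)
  }
  img
}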

if (!is.null(self$transform)) x <- self$transform(x)
if (!is.null(self$target_transform)) y <- self$target_transform(y)

structure(list(x = x, y = y), class = "image_with_bounding_box")

todo we recently moved away from structure(), starting with #245, due to https://lintr.r-lib.org/reference/default_undesirable_functions.html
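
A structure()-free equivalent of the return value above (a minimal sketch):

# build the list first, then set the class explicitly instead of using structure()
out <- list(x = x, y = y)
class(out) <- "image_with_bounding_box"
out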

datasets <- c(
"tweeter_post", "tweeter_profile", "document_part",
"activity_diagram", "signature", "paper_part",
"tabular_data" #, "paragraph"

todo please include "paragraph" dataset in the tests

Comment on lines 56 to 65
url = c(
"https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/tweeter-profile/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/document-parts/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/activity-diagrams-qdobr/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/signatures-xc8up/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/paper-parts/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/tabular-data-wf9uh/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/paragraphs-co84b/resolve/main/dataset.tar.gz?download=1"
),

todo reuse / simplification (applies to the 5 dataset files): using the parquet files located in the /data/ folder of each dataset is much, much easier:

  • the resources data frame takes a bit longer to build, but you have an example in the whoi_plankton_dataset resources quoted earlier in this review (a hedged sketch for this collection follows this list)
  • you can derive almost all other methods via inherit = whoi_small_plankton_dataset,
  • the split argument only downloads the requested subset of the dataset, making time-to-data quicker and the disk footprint smaller
  • you remove the burden of OS-specific code that we don't want
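
A hedged sketch of what the resources field could look like for one of the document datasets; the shard names, shard counts, md5 sums, and sizes below are placeholders to be read off the Hugging Face /data/ folder of each repository, not verified values:

# resources field passed to torch::dataset(); shard names and counts are placeholders
resources <- data.frame(
  split = c(rep("train", 2), "test", "val"),
  url = c(
    paste0("https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/data/train-0000", 0:1, "-of-00002.parquet?download=true"),
    "https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/data/test-00000-of-00001.parquet?download=true",
    "https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/data/validation-00000-of-00001.parquet?download=true"
  ),
  md5 = rep(NA_character_, 4), # fill in the real md5 sums of the parquet shards
  size = rep(NA_real_, 4)      # and their approximate sizes in bytes
)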

@cregouby (Collaborator) commented

Hello @koshtiakanksha,

I've completed dataset-rf100-doc.R with some fixes.

> library(torchvision)
> withr::with_language("en_US", devtools::test_active_file())
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 73 ]

Can I let you update the parquet URLs for the rest of the rf100 collection?
