
Conversation

@koshtiakanksha (Contributor) commented Aug 14, 2025

Closes #230
Closes #231
Closes #232
Closes #233
Closes #234

@cregouby (Collaborator) left a comment

review is partial
praise Congratulations on the ambitious scope of this PR.
todo see inline comments

#'
#' @examples
#' \dontrun{
#' devtools::load_all()
@cregouby (Collaborator) commented Aug 20, 2025

todo {devtools} is for developers, not for end-users. Please remove (as it is unneeded)

Comment on lines 96 to 98
x <- if (ext %in% c("jpg","jpeg")) jpeg::readJPEG(img_path)
else if (ext == "png") png::readPNG(img_path)
else jpeg::readJPEG(img_path)

improvement please brace all if/else code according to the lintr/styler recommendations; see https://style.tidyverse.org/syntax.html#braced-expressions
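
A braced version of the snippet above might look like this (a minimal sketch; it keeps the original fallback to jpeg::readJPEG() for unknown extensions):

x <- if (ext %in% c("jpg", "jpeg")) {
  jpeg::readJPEG(img_path)
} else if (ext == "png") {
  png::readPNG(img_path)
} else {
  # same fallback as the original code: try to read as JPEG
  jpeg::readJPEG(img_path)
}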

Comment on lines 95 to 100
ext <- tolower(fs::path_ext(img_path))
x <- if (ext %in% c("jpg","jpeg")) jpeg::readJPEG(img_path)
else if (ext == "png") png::readPNG(img_path)
else jpeg::readJPEG(img_path)
if (length(dim(x)) == 3 && dim(x)[3] == 4) x <- x[,,1:3, drop = FALSE]
if (length(dim(x)) == 2) x <- array(rep(x, 3L), dim = c(dim(x), 3L))
@cregouby (Collaborator) commented Aug 20, 2025

improvement blocking this looks like a duplicate of

base_loader <- function(path) {
  ext <- tolower(fs::path_ext(path))
  if (ext %in% c("jpg", "jpeg"))
    img <- jpeg::readJPEG(path)
  else if (ext %in% c("png"))
    img <- png::readPNG(path)
  else if (ext %in% c("tif", "tiff"))
    img <- tiff::readTIFF(path)
  else
    runtime_error("unknown extension {ext} in path {path}")
  if (length(dim(img)) == 2)
    img <- abind::abind(img, img, img, along = 3)
  else if (length(dim(img)) == 3 && dim(img)[1] == 1)
    img <- abind::abind(img, img, img, along = 1)
  img
}

improvement blocking using base_loader() could be generalized to the parent class, allowing the child datasets to rely on the inherited .getitem() method. This would remove a lot of duplicated code (applies multiple times)
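
A sketch of what this could look like inside a child dataset's .getitem() method, assuming the image path variable is named img_path as in the snippet above:

# read the image through the shared helper instead of per-dataset branching;
# base_loader() (quoted above) already handles jpg/jpeg/png/tif/tiff and grayscale images
x <- base_loader(img_path)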

@cregouby (Collaborator) left a comment

continued review, still partial

DESCRIPTION Outdated
@@ -54,5 +55,50 @@ Suggests:
testthat,
coro,
R.matlab,
xml2
xml2,
hfhub

todo not used so please remove.

DESCRIPTION Outdated
@@ -43,6 +43,7 @@ Imports:
png,
abind,
jsonlite,
yaml,

todo not used so please remove.

"activity_diagram","signature","paper_part",
"tabular_data","paragraph"),
url = paste0(
"https://huggingface.co/datasets/akankshakoshti/rf100-doc/resolve/main/",
@cregouby (Collaborator) commented Aug 20, 2025

todo please do not multiply hosting locations for existing datasets. For the sake of lineage, consistency, and licensing, please stick to the provided dataset URL

@cregouby (Collaborator) left a comment

todo Please use #' @include dataset-rf100-underwater.R to manage inheritance, rather than using Collate in the DESCRIPTION.
improvement blocking a lot of the methods could be generalized across those datasets, and most if not all of the methods should be inherited. This would greatly reduce the code base among the rf100_* files (see

whoi_plankton_dataset <- torch::dataset(
  name = "whoi_plankton",
  inherit = whoi_small_plankton_dataset,
  archive_size = "9.1 GB",
  resources = data.frame(
    split = c(rep("test", 4), rep("train", 13), rep("val", 2)),
    url = c(
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/test-0000", 0:3, "-of-00004.parquet?download=true"),
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-0000", 0:9, "-of-00013.parquet?download=true"),
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-000", 10:12, "-of-00013.parquet?download=true"),
      paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/validation-0000", 0:1, "-of-00002.parquet?download=true")
    ),
    md5 = c(
      "cd41b344ec4b6af83e39c38e19f09190",
      "aa0965c0e59f7b1cddcb3c565d80edf3",
      "b2a75513f1a084724e100678d8ee7180",
      "a03c4d52758078bfb0799894926d60f6",
      "07eaff140f39868a8bcb1d3c02ebe60f",
      "87c927b9fbe0c327b7b9ae18388b4fcf",
      "456efd91901571a41c2157732880f6b8",
      "dc929fde45e3b2e38bdd61a15566cf32",
      "f92ab6cfb4a3dd7e0866f3fdf8dbc33c",
      "61c555bba39b6b3ccb4f02a5cf07e762",
      "57e03cecf2b5d97912ed37e1b8fc6263",
      "56081cc99e61c36e89db1566dbbf06c1",
      "60b7998630468cb18880660c81d1004a",
      "1fa94ceb54d4e53643a0d8cf323af901",
      "7a7be4e3dfdc39a50c8ca086a4d9a8de",
      "07194caf75805e956986cba68e6b398e",
      "0f4d47f240cd9c30a7dd786171fa40ca",
      "db827a7de8790cdcae67b174c7b8ea5e",
      "d3181d9ffaed43d0c01f59455924edca"
    ),
    size = c(rep(450e6, 4), rep(490e6, 13), rep(450e6, 2))
  )
)

DESCRIPTION Outdated
Comment on lines 61 to 104
Collate:
'conditions.R'
'dataset-caltech.R'
'dataset-cifar.R'
'dataset-coco.R'
'dataset-eurosat.R'
'dataset-fer.R'
'dataset-fgvc.R'
'dataset-flickr.R'
'dataset-flowers.R'
'dataset-lfw.R'
'dataset-mnist.R'
'dataset-oxfordiiitpet.R'
'dataset-pascal.R'
'dataset-places365.R'
'dataset-plankton.R'
'dataset-rf100-underwater.R'
'dataset-rf100-doc.R'
'dataset-rf100-electromagnetic.R'
'dataset-rf100-microscopic.R'
'dataset-rf100-peixos.R'
'extension.R'
'folder-dataset.R'
'globals.R'
'models-alexnet.R'
'models-deeplabv3.R'
'models-efficientnet.R'
'models-efficientnetv2.R'
'models-fcn.R'
'models-inception.R'
'models-mobilenetv2.R'
'models-resnet.R'
'models-vgg.R'
'models-vit.R'
'ops-box_convert.R'
'ops-boxes.R'
'tiny-imagenet-dataset.R'
'transforms-array.R'
'transforms-defaults.R'
'transforms-generics.R'
'transforms-magick.R'
'transforms-tensor.R'
'utils.R'
'vision_utils.R'

todo Collate is a bad idea to work around the inherit = object-not-found error. Please remove it.
suggestion Please rather use #' @include dataset-rf100-underwater.R in the dataset definition when you need it

#' @export
rf100_document_collection <- torch::dataset(
  name = "rf100_document_collection",
  inherit = rf100_underwater_collection,

suggestion Please rather use #' @include dataset-rf100-underwater.R to avoid an object-not-found error at build time
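
For example, the roxygen header of the definition above could carry the directive directly (a sketch; roxygen2 then derives the Collate order from @include, so no manual Collate field is needed):

#' @include dataset-rf100-underwater.R
#' @export
rf100_document_collection <- torch::dataset(
  name = "rf100_document_collection",
  inherit = rf100_underwater_collection
  # resources and methods specific to the document collection go here
)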

"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/bccd-ouzjz.zip?download=1",
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/parasites-1s07h.zip?download=1",
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/cells-uyemf.zip?download=1",
NA_character_, # liquid_crystals not present in repo

todo please use dataset from https://huggingface.co/datasets/Francesco/4-fold-defect for liquid crystals. I apologize for that as I renamed it without notice.
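
Assuming the Francesco/4-fold-defect repository follows the same dataset.tar.gz layout as the other Francesco/* URLs in this PR (to be verified), the NA_character_ entry above could become:

# liquid_crystals, hosted under Francesco/4-fold-defect (exact path assumed, please verify)
"https://huggingface.co/datasets/Francesco/4-fold-defect/resolve/main/dataset.tar.gz?download=1",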

Comment on lines 102 to 104
if (is.na(self$archive_url) || !nzchar(self$archive_url)) {
runtime_error(paste0("No download URL for dataset '", self$dataset, "'."))
}

todo this is unneeded. Please remove it after adding the correct URL

@cregouby (Collaborator) left a comment

todo please add an entry in the NEWS

@koshtiakanksha (Contributor, Author) commented

Hey @cregouby, the nested TAR structure made this a bit tricky. I tried simpler approaches but ran into errors, so this is the best I could get working. Does this look okay to you, or would you suggest another approach?

@cregouby (Collaborator) left a comment

todo please switch to the parquet version of the dataset (see inline)

Comment on lines 279 to 281
if (length(dim(x)) == 3 && dim(x)[3] == 4) {
x <- x[, , 1:3, drop = FALSE]
}

improvement please add a comment to such code, e.g. # remove alpha channel if present
question is this specific to some images in one dataset, or is it wider than that?
suggestion if it occurs in multiple datasets, it would be worth moving that code into base_loader()
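
If it does occur across datasets, here is a sketch of base_loader() extended with that handling, based on the version quoted earlier in this review (runtime_error() is the package-internal condition helper):

base_loader <- function(path) {
  ext <- tolower(fs::path_ext(path))
  if (ext %in% c("jpg", "jpeg")) {
    img <- jpeg::readJPEG(path)
  } else if (ext == "png") {
    img <- png::readPNG(path)
  } else if (ext %in% c("tif", "tiff")) {
    img <- tiff::readTIFF(path)
  } else {
    runtime_error("unknown extension {ext} in path {path}")
  }
  # remove alpha channel if present, keeping only the RGB planes
  if (length(dim(img)) == 3 && dim(img)[3] == 4) {
    img <- img[, , 1:3, drop = FALSE]
  }
  # replicate grayscale images to 3 channels
  if (length(dim(img)) == 2) {
    img <- abind::abind(img, img, img, along = 3)
  } else if (length(dim(img)) == 3 && dim(img)[1] == 1) {
    img <- abind::abind(img, img, img, along = 1)
  }
  img
}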

if (!is.null(self$transform)) x <- self$transform(x)
if (!is.null(self$target_transform)) y <- self$target_transform(y)

structure(list(x = x, y = y), class = "image_with_bounding_box")

todo we recently moved away from structure(), starting with #245, due to https://lintr.r-lib.org/reference/default_undesirable_functions.html
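
A structure()-free equivalent of the return value above (a minimal sketch):

# build the list first, then set the class explicitly instead of using structure()
out <- list(x = x, y = y)
class(out) <- "image_with_bounding_box"
out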

datasets <- c(
"tweeter_post", "tweeter_profile", "document_part",
"activity_diagram", "signature", "paper_part",
"tabular_data" #, "paragraph"

todo please include "paragraph" dataset in the tests

Comment on lines 56 to 65
url = c(
"https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/tweeter-profile/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/document-parts/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/activity-diagrams-qdobr/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/signatures-xc8up/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/paper-parts/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/tabular-data-wf9uh/resolve/main/dataset.tar.gz?download=1",
"https://huggingface.co/datasets/Francesco/paragraphs-co84b/resolve/main/dataset.tar.gz?download=1"
),

todo reuse / simplification (applies to the 5 dataset files): using the parquet files located in the /data/ folder of each dataset is much, much easier:

  • the resources data frame takes a bit longer to build, but you have an example in the whoi_plankton_dataset resources quoted earlier in this review (a hedged sketch for this collection follows this list)
  • you can derive almost all other methods via inherit = whoi_small_plankton_dataset,
  • the split argument only downloads the requested subset of the dataset, making time-to-data quicker and the disk footprint smaller
  • you remove the burden of OS-specific code that we don't want
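
A hedged sketch of what the resources field could look like for one of the document datasets; the shard names, shard counts, md5 sums, and sizes below are placeholders to be read off the Hugging Face /data/ folder of each repository, not verified values:

# resources field passed to torch::dataset(); shard names and counts are placeholders
resources <- data.frame(
  split = c(rep("train", 2), "test", "val"),
  url = c(
    paste0("https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/data/train-0000", 0:1, "-of-00002.parquet?download=true"),
    "https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/data/test-00000-of-00001.parquet?download=true",
    "https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/data/validation-00000-of-00001.parquet?download=true"
  ),
  md5 = rep(NA_character_, 4), # fill in the real md5 sums of the parquet shards
  size = rep(NA_real_, 4)      # and their approximate sizes in bytes
)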

@cregouby (Collaborator) commented

Hello @koshtiakanksha,

I've completed dataset-rf100-doc.R with some fixes.

> library(torchvision)
> withr::with_language("en_US", devtools::test_active_file())
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 73 ]

Can I let you update the parquet URLs for the rest of the rf100 collection?
