-
Notifications
You must be signed in to change notification settings - Fork 23
RF100 add 4 underwater datasets #239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
review is partial
praise Congratulation for the large scope ambition of this P.R.
todo, see inline comments
R/dataset-rf100-doc.R
Outdated
#' | ||
#' @examples | ||
#' \dontrun{ | ||
#' devtools::load_all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo {devtools} is for developers, not for end-users. Please remove (as it is unneeded)
R/dataset-rf100-doc.R
Outdated
x <- if (ext %in% c("jpg","jpeg")) jpeg::readJPEG(img_path) | ||
else if (ext == "png") png::readPNG(img_path) | ||
else jpeg::readJPEG(img_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improvement please brace all if
code according to linters/styler recommendations ; https://style.tidyverse.org/syntax.html#braced-expressions
R/dataset-rf100-doc.R
Outdated
ext <- tolower(fs::path_ext(img_path)) | ||
x <- if (ext %in% c("jpg","jpeg")) jpeg::readJPEG(img_path) | ||
else if (ext == "png") png::readPNG(img_path) | ||
else jpeg::readJPEG(img_path) | ||
if (length(dim(x)) == 3 && dim(x)[3] == 4) x <- x[,,1:3, drop = FALSE] | ||
if (length(dim(x)) == 2) x <- array(rep(x, 3L), dim = c(dim(x), 3L)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improvement blocking this looks like a duplicate of
torchvision/R/folder-dataset.R
Lines 129 to 148 in bd0e7d8
base_loader <- function(path) { | |
ext <- tolower(fs::path_ext(path)) | |
if (ext %in% c("jpg", "jpeg")) | |
img <- jpeg::readJPEG(path) | |
else if (ext %in% c("png")) | |
img <- png::readPNG(path) | |
else if (ext %in% c("tif", "tiff")) | |
img <- tiff::readTIFF(path) | |
else | |
runtime_error("unknown extension {ext} in path {path}") | |
if (length(dim(img)) == 2) | |
img <- abind::abind(img, img, img, along = 3) | |
else if (length(dim(img)) == 3 && dim(img)[1] == 1) | |
img <- abind::abind(img, img, img, along = 1) | |
img | |
} |
improvement blocking using the base_loader() could be generalized to the parent class and thus allow rely on the inherited .getitem
method. This would remove a lot of duplicated code ( applies multiple time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continuated review, still partial
DESCRIPTION
Outdated
@@ -54,5 +55,50 @@ Suggests: | |||
testthat, | |||
coro, | |||
R.matlab, | |||
xml2 | |||
xml2, | |||
hfhub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo not used so please remove.
DESCRIPTION
Outdated
@@ -43,6 +43,7 @@ Imports: | |||
png, | |||
abind, | |||
jsonlite, | |||
yaml, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo not used so please remove.
R/dataset-rf100-doc.R
Outdated
"activity_diagram","signature","paper_part", | ||
"tabular_data","paragraph"), | ||
url = paste0( | ||
"https://huggingface.co/datasets/akankshakoshti/rf100-doc/resolve/main/", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo please do not multiply existing datasets hosting version. For the sake of lineage and consistency and license, please stick on the provided dataset URL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo Please use the #' @include dataset-rf100-underwater.R
to manage inheritance, rather than using the Collate in description.
improvement blocking a lot of the methods could be generalized in between those datasets and much more if not most of the methods should be inherited. This would reduce a lot the code base among rf100_* files (see
torchvision/R/dataset-plankton.R
Lines 145 to 177 in bd0e7d8
whoi_plankton_dataset <- torch::dataset( | |
name = "whoi_plankton", | |
inherit = whoi_small_plankton_dataset, | |
archive_size = "9.1 GB", | |
resources = data.frame( | |
split = c(rep("test", 4), rep("train", 13), rep("val", 2)), | |
url = c(paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/test-0000",0:3,"-of-00004.parquet?download=true"), | |
paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-0000",0:9,"-of-00013.parquet?download=true"), | |
paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-000",10:12,"-of-00013.parquet?download=true"), | |
paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/validation-0000",0:1,"-of-00002.parquet?download=true")), | |
md5 = c("cd41b344ec4b6af83e39c38e19f09190", | |
"aa0965c0e59f7b1cddcb3c565d80edf3", | |
"b2a75513f1a084724e100678d8ee7180", | |
"a03c4d52758078bfb0799894926d60f6", | |
"07eaff140f39868a8bcb1d3c02ebe60f", | |
"87c927b9fbe0c327b7b9ae18388b4fcf", | |
"456efd91901571a41c2157732880f6b8", | |
"dc929fde45e3b2e38bdd61a15566cf32", | |
"f92ab6cfb4a3dd7e0866f3fdf8dbc33c", | |
"61c555bba39b6b3ccb4f02a5cf07e762", | |
"57e03cecf2b5d97912ed37e1b8fc6263", | |
"56081cc99e61c36e89db1566dbbf06c1", | |
"60b7998630468cb18880660c81d1004a", | |
"1fa94ceb54d4e53643a0d8cf323af901", | |
"7a7be4e3dfdc39a50c8ca086a4d9a8de", | |
"07194caf75805e956986cba68e6b398e", | |
"0f4d47f240cd9c30a7dd786171fa40ca", | |
"db827a7de8790cdcae67b174c7b8ea5e", | |
"d3181d9ffaed43d0c01f59455924edca"), | |
size = c(rep(450e6, 4), rep(490e6, 13), rep(450e6, 2)) | |
) | |
) | |
DESCRIPTION
Outdated
Collate: | ||
'conditions.R' | ||
'dataset-caltech.R' | ||
'dataset-cifar.R' | ||
'dataset-coco.R' | ||
'dataset-eurosat.R' | ||
'dataset-fer.R' | ||
'dataset-fgvc.R' | ||
'dataset-flickr.R' | ||
'dataset-flowers.R' | ||
'dataset-lfw.R' | ||
'dataset-mnist.R' | ||
'dataset-oxfordiiitpet.R' | ||
'dataset-pascal.R' | ||
'dataset-places365.R' | ||
'dataset-plankton.R' | ||
'dataset-rf100-underwater.R' | ||
'dataset-rf100-doc.R' | ||
'dataset-rf100-electromagnetic.R' | ||
'dataset-rf100-microscopic.R' | ||
'dataset-rf100-peixos.R' | ||
'extension.R' | ||
'folder-dataset.R' | ||
'globals.R' | ||
'models-alexnet.R' | ||
'models-deeplabv3.R' | ||
'models-efficientnet.R' | ||
'models-efficientnetv2.R' | ||
'models-fcn.R' | ||
'models-inception.R' | ||
'models-mobilenetv2.R' | ||
'models-resnet.R' | ||
'models-vgg.R' | ||
'models-vit.R' | ||
'ops-box_convert.R' | ||
'ops-boxes.R' | ||
'tiny-imagenet-dataset.R' | ||
'transforms-array.R' | ||
'transforms-defaults.R' | ||
'transforms-generics.R' | ||
'transforms-magick.R' | ||
'transforms-tensor.R' | ||
'utils.R' | ||
'vision_utils.R' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo collate is a bad idea to workaround inherit =
object not found error. Please remove.
suggestion Please rather use #' @include dataset-rf100-underwater.R
in dataset definition when you need it
R/dataset-rf100-doc.R
Outdated
#' @export | ||
rf100_document_collection <- torch::dataset( | ||
name = "rf100_document_collection", | ||
inherit = rf100_underwater_collection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion Please rather use #' @include dataset-rf100-underwater.R
to avoid object not found
error at build time
R/dataset-rf100-microscopic.R
Outdated
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/bccd-ouzjz.zip?download=1", | ||
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/parasites-1s07h.zip?download=1", | ||
"https://huggingface.co/datasets/akankshakoshti/rf100-microscopic/resolve/main/cells-uyemf.zip?download=1", | ||
NA_character_, # liquid_crystals not present in repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo please use dataset from https://huggingface.co/datasets/Francesco/4-fold-defect for liquid crystals. I apologize for that as I renamed it without notice.
R/dataset-rf100-microscopic.R
Outdated
if (is.na(self$archive_url) || !nzchar(self$archive_url)) { | ||
runtime_error(paste0("No download URL for dataset '", self$dataset, "'.")) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo this is unneeded. Please remove after adding the correct url
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo please add an entry in the NEWS
Hey @cregouby, The nested TAR structure made this a bit tricky. I tried simpler approaches but ran into errors, so this is the best I could get working. Does this look okay to you, or would you suggest another approach? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo please switch to the parquet version of the dataset (see inline)
R/dataset-rf100-doc.R
Outdated
if (length(dim(x)) == 3 && dim(x)[3] == 4) { | ||
x <- x[, , 1:3, drop = FALSE] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improvement please add a comment for such code like # remove alpha channel if exists
question is it specific to some images in one dataset or is it wider than that
suggestion if in multiple datasets, would be worth moving that code to base_loader()
R/dataset-rf100-doc.R
Outdated
if (!is.null(self$transform)) x <- self$transform(x) | ||
if (!is.null(self$target_transform)) y <- self$target_transform(y) | ||
|
||
structure(list(x = x, y = y), class = "image_with_bounding_box") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo we moved away lately from structure()
starting with #245 due to https://lintr.r-lib.org/reference/default_undesirable_functions.html
datasets <- c( | ||
"tweeter_post", "tweeter_profile", "document_part", | ||
"activity_diagram", "signature", "paper_part", | ||
"tabular_data" #, "paragraph" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo please include "paragraph" dataset in the tests
R/dataset-rf100-doc.R
Outdated
url = c( | ||
"https://huggingface.co/datasets/Francesco/tweeter-posts/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/tweeter-profile/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/document-parts/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/activity-diagrams-qdobr/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/signatures-xc8up/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/paper-parts/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/tabular-data-wf9uh/resolve/main/dataset.tar.gz?download=1", | ||
"https://huggingface.co/datasets/Francesco/paragraphs-co84b/resolve/main/dataset.tar.gz?download=1" | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo reuse simplification (applies to the 5 dataset files) using the parquet files located in the /data/ folder for each dataset is much much easier :
- the
resources
dataframe is a bit more long to build but you have an example intorchvision/R/dataset-plankton.R
Lines 149 to 174 in d227a4f
resources = data.frame( split = c(rep("test", 4), rep("train", 13), rep("val", 2)), url = c(paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/test-0000",0:3,"-of-00004.parquet?download=true"), paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-0000",0:9,"-of-00013.parquet?download=true"), paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/train-000",10:12,"-of-00013.parquet?download=true"), paste0("https://huggingface.co/datasets/nf-whoi/whoi-plankton/resolve/main/data/validation-0000",0:1,"-of-00002.parquet?download=true")), md5 = c("cd41b344ec4b6af83e39c38e19f09190", "aa0965c0e59f7b1cddcb3c565d80edf3", "b2a75513f1a084724e100678d8ee7180", "a03c4d52758078bfb0799894926d60f6", "07eaff140f39868a8bcb1d3c02ebe60f", "87c927b9fbe0c327b7b9ae18388b4fcf", "456efd91901571a41c2157732880f6b8", "dc929fde45e3b2e38bdd61a15566cf32", "f92ab6cfb4a3dd7e0866f3fdf8dbc33c", "61c555bba39b6b3ccb4f02a5cf07e762", "57e03cecf2b5d97912ed37e1b8fc6263", "56081cc99e61c36e89db1566dbbf06c1", "60b7998630468cb18880660c81d1004a", "1fa94ceb54d4e53643a0d8cf323af901", "7a7be4e3dfdc39a50c8ca086a4d9a8de", "07194caf75805e956986cba68e6b398e", "0f4d47f240cd9c30a7dd786171fa40ca", "db827a7de8790cdcae67b174c7b8ea5e", "d3181d9ffaed43d0c01f59455924edca"), size = c(rep(450e6, 4), rep(490e6, 13), rep(450e6, 2)) - you can derive almost all other methods via a
inherit = whoi_small_plankton_dataset,
- the split only downloads the split subset of the dataset making time to data quicker, and disk footprint lower
- you remove the burden of os specific code that we don't want
Hello @koshtiakanksha, I've completed the > library(torchvision)
> withr::with_language("en_US", devtools::test_active_file())
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 73 ] Can I let you update the parquet URLs for the rest of the rf100 collection ? |
Closes #230
Closes #231
Closes #232
Closes #233
Closes #234