Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/remove ood dataset #97

Open
wants to merge 25 commits into
base: master
Choose a base branch
from
Open

Feat/remove ood dataset #97

wants to merge 25 commits into from

Conversation

paulnovello
Copy link
Member

@paulnovello paulnovello commented Feb 3, 2025

This pull request includes significant changes to the oodeel library, focusing on refactoring and improving the data handling and dataset management functionalities. The major changes include the introduction of a new data handler loader, backend detection, and the deprecation of older dataset handling classes (OODDataset).

Before:

  • An OODDataset object is instantiated
  • Operations are performed on OODDatasets
  • Then OODDatasets return a tf.dataset or DataLoader using the .prepare method
ds_fit = OODDataset("cifar10", load_kwargs={"split": "train"}, input_key="image", backend=X)
ds_in = OODDataset("cifar10", load_kwargs={"split": "test"}, input_key="image", backend=X)

# 1b- Load out-of-distribution dataset: SVHN
ds_out = OODDataset("svhn_cropped", load_kwargs={"split": "test"}, backend=X)


# 2- prepare data (preprocess, shuffle, batch)
def preprocess_fn(*inputs):
    x = inputs[0] / 255
    return tuple([x] + list(inputs[1:]))


ds_fit = ds_fit.prepare(batch_size, preprocess_fn)
ds_in = ds_in.prepare(batch_size, preprocess_fn)
ds_out = ds_out.prepare(batch_size, preprocess_fn)

Problems

  • OODDataset is a misleading name since it can also designate ID dataset
  • OODDataset nature is redundant with tf.dataset and DataLoader, so it is confusing (in the example, ds_fit is an OODDataset first, but at the end is a tf.dataset or a DataLoader

After:

data_handler = load_data_handler(backend=X) #backend=X is optional since  load_data_handler automatically detects backend if only tf or torch is installed

ds_fit = data_handler.load_dataset("cifar10", load_kwargs={"split": "train"})
ds_in = data_handler.load_dataset("cifar10", load_kwargs={"split": "test"})

# 1b- Load out-of-distribution dataset: SVHN
ds_out = data_handler.load_dataset("svhn_cropped", load_kwargs={"split": "test"})


# 2- prepare data (preprocess, shuffle, batch)
def preprocess_fn(*inputs):
    x = inputs[0] / 255
    return tuple([x] + list(inputs[1:]))


ds_fit = data_handler.prepare(ds_fit, batch_size, preprocess_fn)
ds_in = data_handler.prepare(ds_in, batch_size, preprocess_fn)
ds_out = data_handler.prepare(ds_out, batch_size, preprocess_fn)

Refactoring and new functionalities:

  • oodeel/datasets/__init__.py: Replaced the import of OODDataset with load_data_handler and moved OODDataset to a deprecated module.
  • oodeel/datasets/data_handler.py: Added functions for backend detection (get_backend) and dynamic data handler loading ( load_data_handler). Introduced several new methods for dataset preparation and handling, such as split_by_class, prepare, load_dataset_from_arrays, and load_custom_dataset. [1] [2] [3] [4]

Deprecation and reorganization:

Initialization and imports:

Copy link

github-actions bot commented Feb 3, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
2921 2637 90% 70% 🟢

New Files

File Coverage Status
oodeel/datasets/deprecated/DEPRECATED_data_handler.py 100% 🟢
oodeel/datasets/deprecated/DEPRECATED_tf_data_handler.py 81% 🟢
oodeel/datasets/deprecated/DEPRECATED_torch_data_handler.py 84% 🟢
oodeel/datasets/deprecated/init.py 100% 🟢
TOTAL 91% 🟢

Modified Files

File Coverage Status
oodeel/datasets/init.py 100% 🟢
oodeel/datasets/data_handler.py 86% 🟢
oodeel/datasets/tf_data_handler.py 82% 🟢
oodeel/datasets/torch_data_handler.py 83% 🟢
oodeel/eval/metrics.py 99% 🟢
oodeel/utils/tf_training_tools.py 75% 🟢
TOTAL 88% 🟢

updated for commit: a075c63 by action🐍

@paulnovello paulnovello force-pushed the feat/remove_OODDataset branch from fa39bea to 45c0dc4 Compare February 21, 2025 17:59
@paulnovello paulnovello force-pushed the feat/remove_OODDataset branch from 45c0dc4 to eb19601 Compare March 3, 2025 15:15
@paulnovello paulnovello force-pushed the feat/remove_OODDataset branch from 65cb388 to a075c63 Compare March 3, 2025 15:32
@paulnovello paulnovello requested a review from y-prudent March 3, 2025 15:57
@paulnovello paulnovello requested a review from cofri March 3, 2025 15:57
Copy link
Member

@y-prudent y-prudent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 remarks, regarding:

  • Data handler loading => Check the comments below
  • Deprecated code: Should we retain the deprecated code? We could release a new major version (with an appropriate tag) when merging this PR, allowing users who depend on the old API to continue using an earlier version of oodeel. Moreover, since the path to the deprecated code has been changed, any updates using the old code would fail anyway.

Otherwise, LGTM !

@@ -20,6 +20,7 @@
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
import importlib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Python 3.12, when I do:

import importlib

print(importlib.util.find_spec("torch"))

I get the error:

AttributeError: module 'importlib' has no attribute 'util'

Here is a Python issue disscussing this error for versions of python > 3.11: https://discuss.python.org/t/python3-11-importlib-no-longer-exposes-util/25641. The solution seem to do the explicit import import importlib.util. By the way, get_backend function is never tested in oodeel!

Suggested change
import importlib
import importlib.util

Comment on lines 109 to +113

def test_get_feature_shape():
def test_instanciate_tf_datahandler():
handler = load_data_handler(backend="torch")
assert isinstance(handler, TorchDataHandler)

Copy link
Member

@y-prudent y-prudent Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be renamed test_instanciate_torch_datahandler.

And the case where the backend is not specified (equal to None) is not tested. It should be tested in the case where only torch is installed.

@y-prudent y-prudent self-requested a review March 4, 2025 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants