
[ARCHITECTURE UPDATE] update data organisation #241


Description

@casenave

I propose to update the dataset architecture to mimic that of HuggingFace datasets as closely as possible:

folder
├── data
│   ├── train
│   │   ├── sample_000000000
│   │   │   ├── features_000000000.cgns
│   │   │   └── features_000000001.cgns
│   │   └── sample_000000001
│   │       └── ...
│   └── test
│       └── ...
├── infos.yaml
└── problem_definitions
    ├── task_1
    │   ├── problem_infos.yaml
    │   └── split.json
    └── task_2
        └── ...
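The layout above can be resolved programmatically; here is a minimal sketch of composing a feature file path under the proposed tree (the helper name and the nine-digit zero-padding are assumptions inferred from the listing, not PLAID API):

```python
from pathlib import Path

def feature_path(root: str, split: str, sample: int, feature: int) -> Path:
    # e.g. data/train/sample_000000000/features_000000001.cgns,
    # following the proposed directory layout
    return (Path(root) / "data" / split
            / f"sample_{sample:09d}" / f"features_{feature:09d}.cgns")

print(feature_path("folder", "train", 0, 1).as_posix())
# folder/data/train/sample_000000000/features_000000001.cgns
```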

Like HF datasets, we can introduce Dataset (the actual one) and DatasetDict: dict[str, Dataset]. The split will contain the keys of the DatasetDict, with subsplit numbering local to the corresponding key (train, test, ...).
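As a sketch of that relationship (the Dataset stand-in is hypothetical, not the actual PLAID class): a DatasetDict is just a dict keyed by split name, and split.json can record, per key, sample indices numbered locally within that split.

```python
import json

class Dataset:
    # Hypothetical stand-in for the actual Dataset class
    def __init__(self, sample_ids):
        self.sample_ids = sample_ids

    def __len__(self):
        return len(self.sample_ids)

# DatasetDict is just dict[str, Dataset]
dataset_dict: dict[str, Dataset] = {
    "train": Dataset(list(range(3))),  # sample_000000000 .. sample_000000002
    "test": Dataset(list(range(2))),
}

# split.json for a task: keys of the DatasetDict, numbering local to each key
split = {name: list(range(len(ds))) for name, ds in dataset_dict.items()}
print(json.dumps(split))
# {"train": [0, 1, 2], "test": [0, 1]}
```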

Doing this will make HF dataset repos very similar to the in-memory data mapping:

[image: resulting HF dataset repo layout]

(this was obtained using hf_dataset.push_to_hub(repo_id) and our Hugging Face bridge)

The multiple problem definitions proposal will indeed enable multiple tasks defined over the same dataset:

[image: multiple problem definitions over the same dataset]

The work I did in #240 implements this for HF dataset repos of PLAID datasets. I think the problem definition can be modified as well to:

  • indicate the in and out splits concerned by the regression task
  • name the score function for the moment (maybe later we should find a way to define an implementation)
  • rely on the flattened tree keys for the in and out feature identifiers, e.g. Base_2_2/Zone/GridCoordinates/CoordinateX
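To make the three points concrete, here is a hedged sketch of what a task's problem_infos.yaml could encode, written as the equivalent Python dict (all field names and the output feature key are assumptions for illustration, not the actual PLAID schema):

```python
# Hypothetical problem definition for one task directory
problem_infos = {
    "task": "regression",
    "in_splits": ["train"],           # splits providing inputs
    "out_splits": ["test"],           # splits to predict / score on
    "score_function": "RMSE",         # named only; no implementation defined yet
    "in_features": [
        # flattened tree key, as in the example from the issue
        "Base_2_2/Zone/GridCoordinates/CoordinateX",
    ],
    "out_features": [
        "Base_2_2/Zone/PointData/Mach",  # hypothetical output field
    ],
}
```

Naming the score function as a plain string keeps the definition declarative; mapping that name to an implementation can be deferred, as the second bullet suggests.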

Labels: enhancement (New feature or request), organisation (Evolution in project organisation)
