feat: built_in/data/structured schema validation #107

Open
geritwagner opened this issue Jan 3, 2023 · 0 comments
Labels
enhancement New feature or request good first issue Good for newcomers

geritwagner (Collaborator) commented Jan 3, 2023

Feature Request

Describe the Feature Request

The structured data endpoint, which handles structured data extraction, should support schema validation. Once users start extracting data into a CSV file, a schema describing the data types and structure should be derived and checked automatically. For example, if the schema requires particular columns to contain integers or dates, an error should be raised when the actual values have different types.
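To make the requested behavior concrete, here is a minimal sketch using only the standard library. The names `validate_rows`, `SchemaValidationError`, and `CHECKERS`, as well as the type labels `"int"`, `"date"`, and `"str"`, are hypothetical and not part of the project; dates are assumed to be in ISO format.

```python
import csv
import io
from datetime import date


def _is_int(value) -> bool:
    """Return True if the raw CSV cell can be parsed as an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_date(value) -> bool:
    """Return True if the raw CSV cell is an ISO date (assumed format)."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical mapping from type label to a checker function.
CHECKERS = {"int": _is_int, "date": _is_date, "str": lambda value: True}


class SchemaValidationError(ValueError):
    """Raised when the CSV content does not match the declared schema."""


def validate_rows(csv_text: str, schema: dict[str, str]) -> None:
    """Raise SchemaValidationError on the first cell that violates the schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise SchemaValidationError(f"missing columns: {sorted(missing)}")
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for column, type_name in schema.items():
            if not CHECKERS[type_name](row[column]):
                raise SchemaValidationError(
                    f"line {line_number}, column {column!r}: "
                    f"{row[column]!r} is not a valid {type_name}"
                )
```

A call such as `validate_rows(text, {"year": "int", "published": "date"})` returns silently for conforming data and raises on the first mismatch. Collecting all errors instead of stopping at the first one would be an equally reasonable design.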

Describe Preferred Solution

As a starting point, the example repository could be used to create a test dataset (a CSV file containing the included papers and different columns). Currently, the structured data endpoint is expected to add a new row for each paper that is added to the sample. Users can manually add columns to extract data for each paper (row).

For schema validation, the pandas-schema package is an appropriate choice and comes with examples. To validate the extracted data (CSV), the validate_structured_data() method should be extended.

Once the validation works for the example dataset, the schema should be stored in the data endpoint settings (using the dataclasses module and its asdict() function). Instead of requiring users to enter the fields through the command line (see __set_fields()), the field names and their schema should be derived from the actual CSV file. To load the schema from the settings, the dacite library is recommended.
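The derive, store, and reload round trip described above could look roughly like the following sketch. `FieldSchema`, `StructuredDataSettings`, `infer_dtype`, `derive_settings`, and `load_settings` are hypothetical names, and the actual settings layout of the endpoint may differ. To stay standard-library only, the reload step rebuilds the dataclasses by hand; with dacite, `dacite.from_dict(data_class=StructuredDataSettings, data=raw)` should serve the same purpose.

```python
import csv
import io
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class FieldSchema:
    name: str
    dtype: str  # hypothetical labels: "int", "date", or "str"


@dataclass
class StructuredDataSettings:
    # Hypothetical settings container, not the project's real settings class.
    fields: list[FieldSchema] = field(default_factory=list)


def _fits(value: str, parser) -> bool:
    """Return True if parser(value) succeeds."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_dtype(values: list[str]) -> str:
    """Pick the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if non_empty and all(_fits(v, int) for v in non_empty):
        return "int"
    if non_empty and all(_fits(v, date.fromisoformat) for v in non_empty):
        return "date"
    return "str"


def derive_settings(csv_text: str) -> StructuredDataSettings:
    """Derive field names and types from the CSV header and rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return StructuredDataSettings(
        fields=[
            # Short rows yield None for missing cells; treat those as empty.
            FieldSchema(c, infer_dtype([r[c] or "" for r in rows]))
            for c in columns
        ]
    )


def load_settings(raw: dict) -> StructuredDataSettings:
    """Rebuild the dataclasses from the plain dict produced by asdict()."""
    return StructuredDataSettings(
        fields=[FieldSchema(**entry) for entry in raw["fields"]]
    )
```

Calling `asdict(derive_settings(csv_text))` yields a plain nested dict that can be written to the settings file, and `load_settings()` reverses that step.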

Optional: If the CSV changes (columns added or removed), corresponding updates to the schema should be proposed automatically.
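One possible way to detect such changes is to compare the stored field names with the current CSV header. The function name `propose_schema_updates` and the shape of its return value are assumptions made for illustration.

```python
def propose_schema_updates(
    schema_columns: list[str], csv_columns: list[str]
) -> dict[str, list[str]]:
    """Compare stored schema fields with the current CSV header.

    Returns the columns to add to the schema and the fields to remove,
    each in the order in which they appear in their source list.
    """
    known = set(schema_columns)
    actual = set(csv_columns)
    return {
        "add": [c for c in csv_columns if c not in known],
        "remove": [c for c in schema_columns if c not in actual],
    }
```

The result is only a proposal; whether to apply it automatically or ask the user for confirmation is left open here, as in the request itself.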

Expected Effort Required

2 months, 5 people.

geritwagner added the enhancement (New feature or request) label Jan 3, 2023
geritwagner added the good first issue (Good for newcomers) label Feb 3, 2023
geritwagner added this to the v0.10.0 milestone Feb 24, 2023
geritwagner removed this from the v0.10.0 milestone Apr 27, 2023