feat: built_in/data/structured schema validation #107

geritwagner · 2023-01-03T08:39:50Z

Feature Request

Describe the Feature Request

For structured data extraction (the structured data endpoint), schema validation should be supported. Once users start extracting data into a csv file, a schema describing the data types and structure should be derived and checked automatically. For example, if the schema requires particular columns to contain integers or dates, an error should be raised when the actual values have a different type.

Describe Preferred Solution

As a starting point, the example repository could be used to create a test dataset (a csv file containing the included papers and different columns). Currently, the structured data endpoint should add a new row for each paper that is added to the sample. Users can manually add columns to extract data for each paper (row).

For schema validation, the pandas-schema package provides suitable validators and examples. To validate the extracted data (the csv file), the validate_structured_data() method should be extended.
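To make the intended behavior concrete, here is a minimal sketch of deriving a per-column schema from a csv and checking rows against it. It deliberately uses only the standard library instead of pandas-schema, so it illustrates the idea and is not the proposed implementation. The names `derive_schema`, `validate`, `read_rows`, and the three type labels are hypothetical, as are the example columns.

```python
import csv
import io
from datetime import date


def _is_int(value: str) -> bool:
    """Return True if the raw csv string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_date(value: str) -> bool:
    """Return True if the raw csv string parses as an ISO date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Hypothetical type labels, ordered from most to least specific.
CHECKS = {"int": _is_int, "date": _is_date, "str": lambda value: True}


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse csv text into one dict per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def derive_schema(rows: list[dict[str, str]]) -> dict[str, str]:
    """Pick, per column, the first type that every non-empty value satisfies."""
    schema = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] != ""]
        schema[column] = next(
            name for name, check in CHECKS.items() if all(check(v) for v in values)
        )
    return schema


def validate(rows: list[dict[str, str]], schema: dict[str, str]) -> list[str]:
    """Return one message per non-empty cell that violates the schema."""
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for column, type_name in schema.items():
            value = row.get(column, "")
            if value and not CHECKS[type_name](value):
                errors.append(
                    f"line {line_no}, column {column!r}: {value!r} is not {type_name}"
                )
    return errors
```

A caller such as validate_structured_data() could raise an error whenever the returned list is non-empty. With pandas-schema, the hand-written checks would be replaced by that package's column validators.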

Once the validation works for the example dataset, the schema should be stored in the data endpoint settings (using the dataclasses library and its asdict() function). Instead of requiring users to enter the fields through the command line (see __set_fields()), the field names and their schema should be derived from the actual csv file. To load the schema from the settings, the dacite library is recommended.
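The storage round trip could look roughly like the following sketch. The dataclasses `FieldSchema` and `StructuredDataSettings`, the helper `settings_from_dict`, and the path `data/data.csv` are hypothetical and only illustrate the shape of the data. The loading step is written by hand here to keep the example dependency-free; `dacite.from_dict(data_class=..., data=...)` is meant to handle this nested reconstruction.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldSchema:
    """Hypothetical schema entry for one csv column."""

    name: str
    dtype: str


@dataclass
class StructuredDataSettings:
    """Hypothetical settings object for the structured data endpoint."""

    data_path: str
    fields: list[FieldSchema] = field(default_factory=list)


def settings_from_dict(raw: dict) -> StructuredDataSettings:
    """Rebuild the nested dataclasses from plain dicts (stand-in for dacite)."""
    return StructuredDataSettings(
        data_path=raw["data_path"],
        fields=[FieldSchema(**item) for item in raw["fields"]],
    )


settings = StructuredDataSettings(
    data_path="data/data.csv",  # hypothetical path
    fields=[FieldSchema("ID", "str"), FieldSchema("year", "int")],
)

# asdict() converts nested dataclasses into plain dicts and lists,
# which can then be serialized into the settings file.
stored = json.dumps(asdict(settings))
restored = settings_from_dict(json.loads(stored))
```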

Optional: If the csv changes (columns added or removed), corresponding updates in the schema should be proposed automatically.
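One possible way to detect such changes is to compare the stored schema with the current csv header, as in this small sketch. The function name `propose_schema_updates` and the returned "add"/"remove" keys are hypothetical.

```python
def propose_schema_updates(
    schema: dict[str, str], csv_columns: list[str]
) -> dict[str, list[str]]:
    """Compare stored schema columns with the csv header and list differences."""
    known = set(schema)
    present = set(csv_columns)
    return {
        # Columns in the csv that the schema does not describe yet.
        "add": [column for column in csv_columns if column not in known],
        # Columns in the schema that no longer exist in the csv.
        "remove": [column for column in schema if column not in present],
    }
```

The proposal could then be shown to the user for confirmation before the stored schema is changed.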

Expected Effort Required

2 months, 5 people.

geritwagner added the enhancement New feature or request label Jan 3, 2023

geritwagner added the good first issue Good for newcomers label Feb 3, 2023

geritwagner added this to the v0.10.0 milestone Feb 24, 2023

geritwagner removed this from the v0.10.0 milestone Apr 27, 2023
