Feature Request
Describe the Feature Request
For the structured data extraction (the structured data endpoint), schema validation should be supported. Once users start extracting data into a CSV file, a schema describing the data types and structure should be derived and checked automatically. For example, if the schema requires particular columns to contain integers or dates, an error should be raised when the actual values have different types.
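The following is a minimal sketch of the idea of deriving a schema from a CSV file and checking values against it. It uses only the Python standard library rather than pandas-schema, and the function names (infer_type, derive_schema, validate_rows), the sample columns, and the three supported types are assumptions made for illustration, not part of the existing code base.

```python
import csv
import io
from datetime import date

# Parsers for the (assumed) supported types; each raises ValueError on bad input.
PARSERS = {"int": int, "date": date.fromisoformat, "str": str}


def infer_type(values):
    """Return 'int', 'date', or 'str' for the non-empty values of one column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "str"
    for name in ("int", "date"):
        try:
            for value in non_empty:
                PARSERS[name](value)
            return name
        except ValueError:
            continue
    return "str"


def derive_schema(rows):
    """Map each column name to its inferred type."""
    columns = list(rows[0].keys()) if rows else []
    return {c: infer_type([r[c] for r in rows]) for c in columns}


def validate_rows(rows, schema):
    """Return a list of error messages for values that violate the schema."""
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for column, type_name in schema.items():
            value = row.get(column) or ""
            if value == "":
                continue  # empty cells are treated as "not yet extracted"
            try:
                PARSERS[type_name](value)
            except ValueError:
                errors.append(
                    f"line {line_no}, column {column!r}: "
                    f"{value!r} is not a valid {type_name}"
                )
    return errors


# Hypothetical sample data: one row per included paper.
SAMPLE = (
    "ID,sample_size,published\n"
    "Smith2020,120,2020-03-01\n"
    "Jones2021,85,2021-11-15\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
schema = derive_schema(rows)

# A later row with values that break the derived schema.
bad_rows = rows + [
    {"ID": "Lee2022", "sample_size": "many", "published": "2022-13-01"}
]
errors = validate_rows(bad_rows, schema)
```

Here the schema is derived once from rows known to be good, and later rows are checked against it; whether errors should be raised or only reported is a design choice left open in this sketch.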
Describe Preferred Solution
As a starting point, the example repository could be used to create a test dataset (a CSV file containing the included papers and different columns). In its current design, the structured data endpoint should add a new row for each paper that is added to the sample. Users can manually add columns to extract data for each paper (row).
For schema validation, the pandas-schema package provides suitable functionality and examples. To validate the data extraction (CSV), the validate_structured_data() method should be extended.
Once the validation works for the example dataset, the schema should be stored in the data endpoint settings (using the dataclasses library and its asdict() function). Instead of requiring users to enter the fields through the command line (see __set_fields()), the field names and their schema should be derived from the actual CSV file. To load the schema from the settings, the dacite library is recommended.
Optional: If the CSV changes (columns added or removed), corresponding updates to the schema should be proposed automatically.
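As a rough illustration of storing the schema in the settings and proposing updates when the columns change, the sketch below uses only the standard library. The class names (FieldSchema, StructuredDataSettings), the data_path value, and the helper functions are hypothetical. The manual settings_from_dict() helper stands in for what dacite's from_dict(data_class=..., data=...) would do generically.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FieldSchema:
    name: str
    dtype: str  # e.g. "int", "date", or "str" (assumed type labels)


@dataclass
class StructuredDataSettings:
    data_path: str
    fields: List[FieldSchema] = field(default_factory=list)


def settings_from_dict(data):
    """Rebuild the settings from a plain dict (manual stand-in for dacite)."""
    return StructuredDataSettings(
        data_path=data["data_path"],
        fields=[FieldSchema(**f) for f in data["fields"]],
    )


def propose_schema_updates(settings, csv_columns):
    """Compare the stored field names with the columns found in the CSV."""
    known = [f.name for f in settings.fields]
    return {
        "add": [c for c in csv_columns if c not in known],
        "remove": [n for n in known if n not in csv_columns],
    }


# Hypothetical settings for an endpoint with two extracted fields.
settings = StructuredDataSettings(
    data_path="data/data.csv",
    fields=[FieldSchema("ID", "str"), FieldSchema("sample_size", "int")],
)

# asdict() converts nested dataclasses to plain dicts, which serialize to JSON.
stored = json.dumps(asdict(settings))
reloaded = settings_from_dict(json.loads(stored))

# The CSV gained a 'published' column and lost 'sample_size'.
proposal = propose_schema_updates(reloaded, ["ID", "published"])
```

The proposal is returned as data rather than applied directly, so that users could confirm or reject the suggested schema changes.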
Expected Effort Required
2 months, 5 people.