hubverse-transform: add a temporary patch for schema mismatch errors #24
Comments
@matthewcornell and I have been pairing on this and have some working code in this branch: https://github.com/hubverse-org/hubverse-transform/tree/schema_patch
One thing I'm not sure about with this temporary fix is that the schema is applied after the data is read in (hubverse-transform/src/hubverse_transform/model_output.py, lines 205 to 218 in 5bedb1f).
I believe this can cause problems with values like |
Yeah, we couldn't find a way to apply the schema when reading in without knowing the column names in advance (which, of course, we will once we do the work in #14), so we had to update the schema after the data is read. That is a good point about the zeroes. We haven't finished writing all of the tests yet, but @matthewcornell, let's make sure we think this through!
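To make the concern about applying the schema after reading concrete, here is a minimal standard-library sketch. It assumes the "zeroes" in question are leading zeros in zero-padded codes, which the thread does not spell out. The `infer` helper is a hypothetical stand-in for a reader's type inference and is not code from hubverse-transform.

```python
import csv
import io

# Hypothetical sample data with zero-padded location codes.
RAW = "location,value\n01,10\n06,12\n"


def infer(cell: str):
    """Hypothetical stand-in for type inference: use int where possible."""
    try:
        return int(cell)
    except ValueError:
        return cell


rows = list(csv.DictReader(io.StringIO(RAW)))

# Path 1: let inference pick a type on read, then cast to character afterwards.
cast_after_read = [str(infer(row["location"])) for row in rows]

# Path 2: treat the column as character from the start.
read_as_string = [row["location"] for row in rows]

print(cast_after_read)  # ['1', '6']
print(read_as_string)   # ['01', '06']
```

Under these assumptions, the padding is already gone by the time the cast runs, so a test for the patch would probably want at least one zero-padded value in its fixtures.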
Background
We plan to add a feature to hubverse-transform that will enable the use of hub-specific schemas when converting incoming model-output files to parquet: #14
Progress on that feature, however, has taken a back seat to the work we're doing on Hubverse visualizations.
In the meantime, hubs that have been onboarded to the cloud (specifically last season's CDC FluSight) are paying a performance penalty due to workarounds in place to accommodate non-uniform model-output file schemas.
This issue proposes to add a temporary patch to hubverse-transform that will force our two most "problematic" model-output fields to character.
Definition of done
- location to character
- output_type_id to character