
hubverse-transform: add a temporary patch for schema mismatch errors #24

Closed
2 tasks done
bsweger opened this issue Aug 6, 2024 · 3 comments · Fixed by #26
@bsweger (Collaborator) commented Aug 6, 2024

Background

We plan to add a feature to hubverse-transform that will enable the use of hub-specific schemas when converting incoming model-output files to parquet: #14

Progress on that feature, however, has taken a backseat to the work we're doing on Hubverse visualizations.

In the meantime, hubs that have been onboarded to the cloud (specifically last season's CDC FluSight) are paying a performance penalty due to workarounds in place to accommodate non-uniform model-output file schemas.

This issue proposes to add a temporary patch to hubverse-transform that will force our two most "problematic" model-output fields to character.

Definition of done

  • parquet schema for S3 model-output files forces location to character
  • parquet schema for S3 model-output files forces output_type_id to character
@bsweger bsweger converted this from a draft issue Aug 6, 2024
@bsweger bsweger added this to the hubverse cloud sync milestone Aug 6, 2024
@bsweger bsweger moved this from In Progress to Up Next in hubverse Development overview Aug 6, 2024
@bsweger (Collaborator, Author) commented Aug 9, 2024

@matthewcornell and I have been pairing on this and have some working code in this branch: https://github.com/hubverse-org/hubverse-transform/tree/schema_patch

@annakrystalli (Member) commented Aug 9, 2024

One thing I'm not sure about with this temp fix is the fact that the schema is being applied after reading in:

    model_output_table = csv.read_csv(model_output_file, convert_options=options)
else:
    # parquet requires random access reading (because metadata),
    # so we use open_input_file instead of open_input_stream
    model_output_file = self.fs_input.open_input_file(self.input_file)
    model_output_table = pq.read_table(model_output_file)

# temporary fix: patch two known problematic fields by overriding their data type to string
schema_new = model_output_table.schema
for field_name in ["location", "output_type_id"]:
    field_idx = schema_new.get_field_index(field_name)  # -1 if not found
    if field_idx != -1:
        schema_new = schema_new.set(field_idx, pa.field(field_name, pa.string()))
model_output_table = model_output_table.cast(schema_new)

I believe this can cause problems with values like 01: they will be transformed to an integer on read, dropping the leading zero, and then written out as "1". I can't see this being explicitly tested for; I can only see the output column data type being checked, not whether the values have been altered in any way. But maybe I'm missing something, as I'm not a Python expert.

@bsweger (Collaborator, Author) commented Aug 9, 2024

> One thing I'm not sure about with this temp fix is the fact that the schema is being applied after reading in:
>
>         model_output_table = csv.read_csv(model_output_file, convert_options=options)
>     else:
>         # parquet requires random access reading (because metadata),
>         # so we use open_input_file instead of open_input_stream
>         model_output_file = self.fs_input.open_input_file(self.input_file)
>         model_output_table = pq.read_table(model_output_file)
>
>     # temporary fix: patch two known problematic fields by overriding their data type to string
>     schema_new = model_output_table.schema
>     for field_name in ["location", "output_type_id"]:
>         field_idx = schema_new.get_field_index(field_name)  # -1 if not found
>         if field_idx != -1:
>             schema_new = schema_new.set(field_idx, pa.field(field_name, pa.string()))
>     model_output_table = model_output_table.cast(schema_new)
>
> I believe this can cause problems with values like 01: they will be transformed to an integer on read, dropping the leading zero, and then written out as "1". I can't see this being explicitly tested for; I can only see the output column data type being checked, not whether the values have been altered in any way. But maybe I'm missing something, as I'm not a Python expert.

Yeah, we couldn't find a way to apply the schema when reading in w/o knowing the column names in advance (which, of course, we will once we do the work in #14).

So we had to update the schema after the data is read. That is a good point about the zeroes...we haven't finished writing all of the tests yet, but @matthewcornell let's make sure we think this through!
