
hubverse-transform: add a temporary patch for schema mismatch errors #24

Closed
2 tasks done
bsweger opened this issue Aug 6, 2024 · 3 comments · Fixed by #26
@bsweger (Collaborator) commented Aug 6, 2024

Background

We plan to add a feature to hubverse-transform that will enable the use of hub-specific schemas when converting incoming model-output files to parquet: #14

Progress on that feature, however, has taken a backseat to the work we're doing on Hubverse visualizations.

In the meantime, hubs that have been onboarded to the cloud (specifically last season's CDC FluSight) are paying a performance penalty due to workarounds in place to accommodate non-uniform model-output file schemas.

This issue proposes to add a temporary patch to hubverse-transform that will force our two most "problematic" model-output fields to character.

Definition of done

  • parquet schema for S3 model-output files forces location to character
  • parquet schema for S3 model-output files forces output_type_id to character
@bsweger bsweger converted this from a draft issue Aug 6, 2024
@bsweger bsweger added this to the hubverse cloud sync milestone Aug 6, 2024
@bsweger bsweger moved this from In Progress to Up Next in hubverse Development overview Aug 6, 2024
@bsweger (Collaborator, Author) commented Aug 9, 2024

@matthewcornell and I have been pairing on this and have some working code in this branch: https://github.com/hubverse-org/hubverse-transform/tree/schema_patch

@annakrystalli (Member) commented Aug 9, 2024

One thing I'm not sure about with this temp fix is the fact that the schema is being applied after reading in:

    model_output_table = csv.read_csv(model_output_file, convert_options=options)
else:
    # parquet requires random access reading (because metadata),
    # so we use open_input_file instead of open_input_stream
    model_output_file = self.fs_input.open_input_file(self.input_file)
    model_output_table = pq.read_table(model_output_file)

# temporary fix: patch two known problematic fields by overriding their data type to string
schema_new = model_output_table.schema
for field_name in ["location", "output_type_id"]:
    field_idx = schema_new.get_field_index(field_name)  # -1 if not found
    if field_idx != -1:
        schema_new = schema_new.set(field_idx, pa.field(field_name, pa.string()))
model_output_table = model_output_table.cast(schema_new)

I believe this can cause problems with values like 01: they will be transformed to an integer on read, dropping the leading zero, and then written out as "1". I can't see this being explicitly tested for; I can only see the output column data type being checked, not whether the values have been altered in any way. But maybe I'm missing something, as I'm not a Python expert.

@bsweger (Collaborator, Author) commented Aug 9, 2024

> One thing I'm not sure about with this temp fix is the fact that the schema is being applied after reading in:
>
>         model_output_table = csv.read_csv(model_output_file, convert_options=options)
>     else:
>         # parquet requires random access reading (because metadata),
>         # so we use open_input_file instead of open_input_stream
>         model_output_file = self.fs_input.open_input_file(self.input_file)
>         model_output_table = pq.read_table(model_output_file)
>
>     # temporary fix: patch two known problematic fields by overriding their data type to string
>     schema_new = model_output_table.schema
>     for field_name in ["location", "output_type_id"]:
>         field_idx = schema_new.get_field_index(field_name)  # -1 if not found
>         if field_idx != -1:
>             schema_new = schema_new.set(field_idx, pa.field(field_name, pa.string()))
>     model_output_table = model_output_table.cast(schema_new)
>
> I believe this can cause problems with values like 01: they will be transformed to an integer on read, dropping the leading zero, and then written out as "1". I can't see this being explicitly tested for; I can only see the output column data type being checked, not whether the values have been altered in any way. But maybe I'm missing something, as I'm not a Python expert.

Yeah, we couldn't find a way to apply the schema when reading in w/o knowing the column names in advance (which, of course, we will once we do the work in #14).

So we had to update the schema after the data is read. That is a good point about the zeroes...we haven't finished writing all of the tests yet, but @matthewcornell let's make sure we think this through!
