Fix a bug when opening a parquet file on S3
When reading parquet files from S3, hubverse-transform
does an initial read to get the schema (so we can override
it if necessary). However, that read passed the raw input
path to pq.read_schema instead of an open file handle, so
the transform process tried to open the model-output data
on the local filesystem instead of on S3.
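
For reference, a minimal sketch of the failure mode (the key below
is hypothetical). pyarrow resolves a bare string path against the
local filesystem unless it's given an open file or an explicit
filesystem, so passing the S3 key straight to pq.read_schema fails:

import pyarrow.parquet as pq

# Hypothetical S3 key; in hubverse-transform this is self.input_file.
input_file = "hubverse-bucket/raw/model-output/team-model/2024-11-09-team-model.parquet"

# The pre-fix call: pyarrow treats the bare string as a local path,
# so the read fails on the transform host instead of going to S3.
try:
    schema = pq.read_schema(input_file)
except FileNotFoundError as err:
    print(f"Schema read resolved against the local filesystem: {err}")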

I opened an issue to address the lack of S3 test cases,
which resulted in this bug hitting production:
#30
bsweger committed Nov 14, 2024
1 parent fce18e7 commit b14b2bb
Showing 1 changed file with 2 additions and 2 deletions.
src/hubverse_transform/model_output.py
@@ -208,12 +208,12 @@ def read_file(self) -> pa.table:
             model_output_table = csv.read_csv(model_output_file, convert_options=options)
         else:
             # temp fix: force location and output_type_id columns to string
-            schema_new = pq.read_schema(self.input_file)
+            model_output_file = self.fs_input.open_input_file(self.input_file)
+            schema_new = pq.read_schema(model_output_file)
             for field_name in ["location", "output_type_id"]:
                 field_idx = schema_new.get_field_index(field_name)
                 if field_idx >= 0:
                     schema_new = schema_new.set(field_idx, pa.field(field_name, pa.string()))
-            model_output_file = self.fs_input.open_input_file(self.input_file)
             model_output_table = pq.read_table(model_output_file, schema=schema_new)
 
         return model_output_table
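
A standalone sketch of the corrected read path, for anyone testing
this outside the handler (bucket, key, and region are hypothetical;
pyarrow.fs.S3FileSystem stands in for the handler's fs_input):

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # assumed region
input_file = "hubverse-bucket/raw/model-output/team-model/2024-11-09-team-model.parquet"

# Open the file on S3 first, then read the schema from the open file
# handle so pyarrow never falls back to the local filesystem.
with s3.open_input_file(input_file) as model_output_file:
    schema_new = pq.read_schema(model_output_file)

    # Force location and output_type_id to string, mirroring the temp fix.
    for field_name in ["location", "output_type_id"]:
        field_idx = schema_new.get_field_index(field_name)
        if field_idx >= 0:
            schema_new = schema_new.set(field_idx, pa.field(field_name, pa.string()))

    model_output_table = pq.read_table(model_output_file, schema=schema_new)

print(model_output_table.schema)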
