Add better S3 test cases to hubverse-transform #30

bsweger · 2024-11-14T20:18:42Z

We've hit a few errors when syncing hub files to S3 because our test suite isn't robust enough for S3-style file syncing.

Moto, my usual go-to for AWS mocking, doesn't help because pyarrow's `S3FileSystem` object isn't based on boto (though it's possible that moto's server mode could help).

We should do one of the following:

- Figure out how to mock S3 in a way that works with pyarrow
- Stand up a live S3 test hub with a "raw/model-output" folder (similar to how our R tools test S3)
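For the first option, one possible direction (a sketch under assumptions, not something tested against hubverse-transform): moto's server mode exposes S3's HTTP API on a local port, and pyarrow's `S3FileSystem` accepts an `endpoint_override`, so the two may be able to talk to each other even though pyarrow does not use boto. This assumes `moto[server]` and `pyarrow` are installed; the bucket name, port, credentials, and helper names below are all placeholders.

```python
from contextlib import contextmanager

# Assumptions: any free local address/port would do.
MOCK_HOST = "127.0.0.1"
MOCK_PORT = 5000


def mock_s3_kwargs(host=MOCK_HOST, port=MOCK_PORT):
    """Keyword arguments that point pyarrow's S3FileSystem at a local mock server."""
    return {
        "access_key": "testing",  # placeholder credentials for the mock
        "secret_key": "testing",
        "region": "us-east-1",
        "scheme": "http",  # the mock server speaks plain HTTP
        "endpoint_override": f"{host}:{port}",
        "allow_bucket_creation": True,
    }


@contextmanager
def mock_s3_filesystem(bucket="hubverse-test-hub", host=MOCK_HOST, port=MOCK_PORT):
    """Start moto in server mode and yield a pyarrow filesystem bound to it."""
    # Imported lazily so mock_s3_kwargs stays usable without these packages.
    from moto.server import ThreadedMotoServer
    from pyarrow import fs

    server = ThreadedMotoServer(ip_address=host, port=port, verbose=False)
    server.start()
    try:
        s3 = fs.S3FileSystem(**mock_s3_kwargs(host, port))
        s3.create_dir(bucket)  # a top-level directory corresponds to a bucket
        yield s3
    finally:
        server.stop()
```

A test could then wrap its body in `with mock_s3_filesystem() as s3:` and write fixture files under `hubverse-test-hub/raw/model-output` before invoking the transform.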

**Definition of Done**

all of hubverse-transform's test cases are run against S3 as well as the local filesystem
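One way this could be approached, sketched here with hypothetical names: build the list of storage targets in one place and parametrize every test over it, so the S3 variant runs whenever a test bucket is configured. The `HUB_TEST_BUCKET` environment variable and the `storage_targets` helper are illustrative assumptions, not part of hubverse-transform.

```python
import os


def storage_targets(env=None):
    """Return (label, base_uri) pairs that each test case should run against."""
    env = os.environ if env is None else env
    # The local filesystem is always exercised.
    targets = [("local", "raw/model-output")]
    # Assumption: a test bucket (live or mocked) is advertised via an env var.
    bucket = env.get("HUB_TEST_BUCKET")
    if bucket:
        targets.append(("s3", f"s3://{bucket}/raw/model-output"))
    return targets
```

With pytest, a decorator such as `@pytest.mark.parametrize("label,base_uri", storage_targets())` would then run each test case once per target.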

When reading parquet files from S3, hubverse-transform does an initial read to get the schema (so we can override it if necessary). However, that read fails because it targets the wrong location: the transform process tries to open the model-output data on the local filesystem instead of on S3. I opened an issue to address the lack of S3 test cases, which resulted in this bug hitting production: #30
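To illustrate the failure mode described above (this is a sketch, not the actual change made in #31): when a parquet file is opened by path alone, pyarrow resolves the path on the local filesystem, so a schema read for S3 data needs to go through the S3 filesystem object explicitly. The helper names below are hypothetical.

```python
def split_s3_uri(uri):
    """Turn 's3://bucket/key' into the 'bucket/key' form pyarrow filesystems expect."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    return uri[len(prefix):]


def read_parquet_schema(path, filesystem):
    """Read only the schema, resolving `path` on the given pyarrow filesystem."""
    import pyarrow.parquet as pq  # lazy import; assumes pyarrow is installed

    # Opening the file through the filesystem object keeps the read on S3
    # (or wherever `filesystem` points) instead of falling back to local disk.
    with filesystem.open_input_file(path) as f:
        return pq.read_schema(f)
```

Because `read_parquet_schema` only depends on the filesystem object it is given, the same call can be pointed at a `LocalFileSystem` or an `S3FileSystem` in tests.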

bsweger added this to hubverse Development overview Nov 14, 2024

bsweger converted this from a draft issue Nov 14, 2024

bsweger added this to the hubverse cloud sync milestone Nov 14, 2024

bsweger mentioned this issue Nov 14, 2024

Fix a bug when opening a parquet file on S3 #31

Merged

bsweger self-assigned this Feb 26, 2025
