Bsweger/get task id values #22

bsweger · 2024-06-25T19:44:29Z

Next piece of the work for #14

Here we're adding a new function that inspects a hub's tasks.json and returns a dictionary of task_ids and their possible values. This is information we'll need to determine the schema of the model_output file.

Next step:
Do the same thing, except for output_type_ids, which are configured a bit differently

Here's what it looks like in action:

In [1]: from cloudpathlib import AnyPath

In [2]: from hubverse_transform.hub_config import HubConfig

In [3]: hub_path = AnyPath('s3://bsweger-flusight-forecast')

In [4]: hc = HubConfig(hub_path)

In [5]: task_id_values = hc.get_task_id_values()

In [6]: task_id_values.keys()
Out[6]: dict_keys(['reference_date', 'target', 'horizon', 'location', 'target_end_date'])

In [7]: task_id_values['horizon']
Out[7]: {-1, 0, 1, 2, 3}

In [8]: task_id_values['target']
Out[8]: {'wk flu hosp rate change', 'wk inc flu hosp'}

This helper function augments Python's built-in type function with some logic to see if a string value is actually an ISO-formatted date. It will eventually be used to determine the data type for a list of values.

Add a function that returns (for all rounds or for a specific round) a dictionary that contains every task_id name along with its corresponding set of potential values.
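For readers who want a concrete picture, here is a minimal sketch of how such a lookup could work against an already-parsed tasks.json. It is not the code from this PR: the standalone function name collect_task_id_values and the sample config are hypothetical, and it assumes each task_id entry holds "required" and "optional" keys whose values are either a list or null.

```python
from collections import defaultdict


def collect_task_id_values(tasks: dict) -> dict[str, set]:
    """Map each task_id name to the set of values it can take in any round."""
    values: dict[str, set] = defaultdict(set)
    for round_config in tasks.get("rounds", []):
        for task_set in round_config.get("model_tasks", []):
            for task_id, spec in task_set.get("task_ids", {}).items():
                # "required" and "optional" may each be a list or None
                for key in ("required", "optional"):
                    values[task_id].update(spec.get(key) or [])
    return dict(values)


# Hypothetical two-round config, trimmed to the fields used above
sample_tasks = {
    "rounds": [
        {"model_tasks": [{"task_ids": {
            "horizon": {"required": [0, 1], "optional": [-1]},
            "target": {"required": None, "optional": ["wk inc flu hosp"]},
        }}]},
        {"model_tasks": [{"task_ids": {
            "horizon": {"required": None, "optional": [2, 3]},
            "target": {"required": ["wk flu hosp rate change"], "optional": None},
        }}]},
    ]
}

task_id_values = collect_task_id_values(sample_tasks)
```

Using sets means a value that appears in several rounds (or in both the required and optional lists) is only recorded once.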

bsweger · 2024-06-25T19:45:49Z

src/hubverse_transform/hub_config.py

+
+        return task_id_values
+
+    def _get_data_type(self, value: int | bool | str | date | float) -> type:


Got ahead of myself here: this is the helper that will eventually determine the data type of our task_ids and output_type_ids, but it's not being used yet.
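A rough sketch of what a helper like this might do, written as a standalone function rather than the method in the diff. The body below is an assumption (only the signature appears above), and it leans on date.fromisoformat, whose accepted formats are broader on Python 3.11+ than on earlier versions.

```python
from __future__ import annotations

from datetime import date


def get_data_type(value: int | bool | str | date | float) -> type:
    """Return type(value), except that ISO-formatted date strings report as date."""
    if isinstance(value, str):
        try:
            date.fromisoformat(value)
        except ValueError:
            # Not parseable as an ISO date, so treat it as a plain string
            return str
        return date
    return type(value)
```

With this approach, a value such as "2024-07-13" is reported as a date while "wk inc flu hosp" stays a str, and non-string values fall through to the built-in type.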

bsweger · 2024-06-25T19:46:25Z

test/unit/test_hub_config.py

@@ -72,7 +73,25 @@ def tasks_config() -> dict:
                    },
                ],
                "submissions_due": {"relative_to": "reference_date", "start": -6, "end": -3},
-            }
+            },
+            {


Added another round to our test data so we can test the new function for multiple rounds.

elray1

had a few questions

elray1 · 2024-06-25T20:12:39Z

src/hubverse_transform/hub_config.py

+        tasks = self.tasks
+        rounds = tasks.get("rounds", [])
+        if round_name != "all":
+            rounds = [r for r in rounds if r.get("round_name") == round_name]


In your example below with

    "round_id_from_variable": True,
    "round_id": "reference_date",
    "model_tasks": [
        {
            "task_ids": {
                "reference_date": {"required": None, "optional": ["2024-07-13", "2024-07-21"]},

will this find any results if I ask for round_name = '2024-07-13'?

No, not unless there is a round entry in tasks.json that is explicitly named 2024-07-13. See note below: #22 (comment)
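To make the behavior concrete, here is a small sketch that reuses the filter from the diff in a standalone function. The function name filter_rounds and the trimmed config are hypothetical; the config mirrors the test data quoted in the question, where the round is identified through round_id_from_variable and has no round_name key.

```python
def filter_rounds(tasks: dict, round_name: str = "all") -> list[dict]:
    """Return all rounds, or only those whose round_name matches exactly."""
    rounds = tasks.get("rounds", [])
    if round_name != "all":
        rounds = [r for r in rounds if r.get("round_name") == round_name]
    return rounds


# Round ids come from the reference_date task_id; there is no "round_name" key
tasks = {
    "rounds": [
        {
            "round_id_from_variable": True,
            "round_id": "reference_date",
            "model_tasks": [
                {
                    "task_ids": {
                        "reference_date": {
                            "required": None,
                            "optional": ["2024-07-13", "2024-07-21"],
                        },
                    },
                },
            ],
        },
    ]
}

all_rounds = filter_rounds(tasks)
matching_rounds = filter_rounds(tasks, "2024-07-13")
```

Here matching_rounds comes back empty even though "2024-07-13" is a valid round id for this hub, because the filter only compares against an explicit round_name.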

I do think we should settle on how to do this with the understanding that 95% of our use case is generating a schema that will work across all hub files to prevent parquet errors when accessing model-output files on S3.

My vote would be to switch from round_name to round_id for consistency with the R package: https://github.com/hubverse-org/hubData/blob/main/R/create_hub_schema.R#L101

I agree that we should switch from round_name to round_id.

"with the understanding that 95% of our use case is generating a schema that will work across all hub files to prevent parquet errors when accessing model-output files on S3."

But we are going to want these schemas to be correct on a per-round basis, right? It does not seem correct to generically use the same schema for all rounds.

Yeah, it's likely that I overlooked the round_id nuance you suggest when reading the corresponding R function.

To clarify my comment re: the use case: the current goal of ascertaining a hub's schema is to ensure we're applying the correct parquet schema when writing model-output files to S3. To increase the likelihood of successful downstream access, regardless of filter/query patterns, we want a schema that will work for every round and task in tasks.json. **

Thus, the plan for using a "get schema" function in the context of hubverse-transform is never to specify a round id. I realize there are broader applications (i.e., the Python version of hubData), but that's out of scope for solving #14.

I shouldn't have copied over a parameter from the R counterpart and then hand-waved it 😬 Apologies for the confusion.

My preference at this time would be to remove round_id and round_name and end up with a function that grabs all possible values for every task_id and output_type_id in every round. In other words, focus on the use case of correct schemas on S3.

** This isn't foolproof (for example, if a hub adds a new round with an incompatible schema), but it's better than our current state.

I support a focus on the use case of correct schemas [on S3 and elsewhere], but need more convincing on what the right answer is for "correct". Couple of thoughts/questions related to this:

I believe that the use of a single schema across all rounds would result in model output files that fail validation checks as performed by hubValidations if the schemas for those rounds differ, which feels "bad". (But maybe I am not understanding the setup for this correctly?) I would expect data files that live on S3 to pass the same validity checks as data files that live on GitHub or my local computer.

Do we have a general problem where if I'm trying to load model outputs from parquet files submitted across multiple rounds with different schemas, the load fails?

OK! So with this understanding, I'm on board with removing round_id and round_name arguments for this functionality so that we just retrieve an overall schema.

I think that's the best plan.

Generally, many task IDs that are covered by our schema shouldn't change data type in later rounds, as that's somewhat fixed by the schema. However, custom task IDs (which are beyond our control) and the output_type_id column have the potential to change, and this could indeed cause problems downstream. This is mainly a problem for parquet files.

To reduce the chances of this happening and mitigate the effects, we:

should improve the documentation on this, get admins to think about the issue early on, and warn them to avoid changes in data types.

should propagate the ability to fix the output_type_id column to hubValidations and consider a property in the schema where hub admins can configure and communicate this setting. This would give admins the ability to future-proof their hubs by setting the column to character if they are unsure whether they may start collecting an output type that could affect the schema.

As a future feature, once we have created functionality to inspect a hub for integrity, we could also add functionality that could repair any data type discrepancies and update files to conform to a changed schema. This could help admins in a situation where all the above fail and a breaking schema change needs to be introduced.
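As a rough illustration of the data type drift described above (not an existing hubverse feature), the sketch below scans every round of a parsed tasks.json, records the Python types seen for each task_id, and falls back to str when a task_id mixes types. That fallback loosely mirrors the "set the column to character" idea. The function name infer_column_types, the scenario_id task id, and the sample config are all hypothetical.

```python
from collections import defaultdict


def infer_column_types(tasks: dict) -> dict[str, type]:
    """Pick one type per task_id across all rounds, using str when types conflict."""
    seen: dict[str, set] = defaultdict(set)
    for round_config in tasks.get("rounds", []):
        for task_set in round_config.get("model_tasks", []):
            for task_id, spec in task_set.get("task_ids", {}).items():
                for key in ("required", "optional"):
                    for value in spec.get(key) or []:
                        seen[task_id].add(type(value))
    # A single observed type is kept; mixed types fall back to str
    return {
        task_id: next(iter(types)) if len(types) == 1 else str
        for task_id, types in seen.items()
    }


# Hypothetical hub where a custom task id switches from integers to strings
drifting_tasks = {
    "rounds": [
        {"model_tasks": [{"task_ids": {
            "horizon": {"required": [0, 1], "optional": None},
            "scenario_id": {"required": [1, 2], "optional": None},
        }}]},
        {"model_tasks": [{"task_ids": {
            "horizon": {"required": [2, 3], "optional": None},
            "scenario_id": {"required": ["A-2024"], "optional": None},
        }}]},
    ]
}

column_types = infer_column_types(drifting_tasks)
```

In this sample, horizon stays an integer column, while scenario_id is widened to str because its values change type between rounds.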

Happy to open issues regarding the above on Mon if it sounds sensible or a discussion post for more... discussion so as not to take over this PR anymore!

that all sounds good to me, i say let's just file issues for these things :)

Done! https://github.com/orgs/hubverse-org/projects/3/views/12?filterQuery=milestone%3A%22robust-hub-schema%22

Thanks for hopping into the thread, @annakrystalli! Generating a schema for the entire hub definitely makes things more straightforward 😅

src/hubverse_transform/hub_config.py

test/unit/test_hub_config.py

Per this PR convo, we are going to generate a schema for the entire hub instead of generating schemas on a per-round basis #22 (comment)

elray1

one minor change suggested, but i am approving the merge

src/hubverse_transform/hub_config.py

Co-authored-by: Evan Ray <[email protected]>

bsweger added 2 commits June 25, 2024 15:35

Add helper function to return a value's data type

bdadf63

Add get_task_id_values function

813b73c

bsweger requested a review from matthewcornell June 25, 2024 19:44

bsweger commented Jun 25, 2024

elray1 reviewed Jun 25, 2024

bsweger added 2 commits June 26, 2024 10:20

Use task_set as a more precise variable name than task

8da9a7b

Remove unecessary noise when testing for a missing round

e9adf0b

bsweger removed the request for review from matthewcornell June 26, 2024 14:21

Remove round-related params from get_task_id_values

d36ab94

elray1 approved these changes Jul 3, 2024

Update src/hubverse_transform/hub_config.py

cabfe2e

Co-authored-by: Evan Ray <[email protected]>

bsweger merged commit 5fddc1e into dev Jul 3, 2024
1 check passed

bsweger deleted the bsweger/get-task-id-values branch July 19, 2024 23:31
