Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bsweger/get task id values #22

Merged
merged 6 commits into from
Jul 3, 2024
Merged

Bsweger/get task id values #22

merged 6 commits into from
Jul 3, 2024

Conversation

bsweger
Copy link
Collaborator

@bsweger bsweger commented Jun 25, 2024

Next piece of the work for #14

Here we're adding a new function that inspects a hub's tasks.json and returns a dictionary of task_ids and their possible values. This is information we'll need to determine the schema of the model_output file.

Next step:
Do the same thing, except for output_type_ids, which are configured a bit differently

Here's what it looks like in action:

In [1]: from cloudpathlib import AnyPath

In [2]: from hubverse_transform.hub_config import HubConfig

In [3]: hub_path = AnyPath('s3://bsweger-flusight-forecast')

In [4]: hc = HubConfig(hub_path)

In [5]: task_id_values = hc.get_task_id_values()

In [6]: task_id_values.keys()
Out[6]: dict_keys(['reference_date', 'target', 'horizon', 'location', 'target_end_date'])

In [7]: task_id_values['horizon']
Out[7]: {-1, 0, 1, 2, 3}

In [8]: task_id_values['target']
Out[8]: {'wk flu hosp rate change', 'wk inc flu hosp'}

bsweger added 2 commits June 25, 2024 15:35
This helper function augments Python's built-in type function with
some logic to see if a string value is actually an iso-formatted date.
It will eventually be used to determine the date type for a list
of values.
Add a function that returns (for all rounds or for a specific
round) a dictionary that contains every task_id name along
with its corresponding set of potential values.
@bsweger bsweger requested a review from matthewcornell June 25, 2024 19:44

return task_id_values

def _get_data_type(self, value: int | bool | str | date | float) -> type:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got ahead of myself here: this is the helper that will eventually determine the data type of our task_ids and output_type_ids, but it's not being used yet.

@@ -72,7 +73,25 @@ def tasks_config() -> dict:
},
],
"submissions_due": {"relative_to": "reference_date", "start": -6, "end": -3},
}
},
{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another round to our test data so we can test the new function for multiple rounds.

Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had a few questions

tasks = self.tasks
rounds = tasks.get("rounds", [])
if round_name != "all":
rounds = [r for r in rounds if r.get("round_name") == round_name]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example below with

                "round_id_from_variable": True,
                "round_id": "reference_date",
                "model_tasks": [
                    {
                        "task_ids": {
                            "reference_date": {"required": None, "optional": ["2024-07-13", "2024-07-21"]},

will this find any results if I ask for round_name = '2024-07-13'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not unless there is a round entry in tasks.json that is explicitly named 2024-07-13. See note below: #22 (comment)

I do think we should settle on how to do this with the understanding that 95% of our use case is generating a schema that will work across all hub files to prevent parquet errors when access model-output files on S3.

My vote would be to switch from round_name to round_id for consistency with the R package: https://github.com/hubverse-org/hubData/blob/main/R/create_hub_schema.R#L101

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I agree that we should switch from round_name to round_id
  • "with the understanding that 95% of our use case is generating a schema that will work across all hub files to prevent parquet errors when access model-output files on S3." But we are going to want these schemas to be correct on a per-round basis, right? It does not seem correct to generically use the same schema for all rounds.

Copy link
Collaborator Author

@bsweger bsweger Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's likely that I overlooked the round_id nuance you suggest when reading the corresponding R function.

To clarify my comment re: the use case....the current goal of ascertaining a hub's schema is to ensure we're applying the correct parquet schema when writing model-output files to S3. To increase the likelihood of successful downstream access, regardless of filter/query patterns, we want a schema that will work for every round and task in tasks.json. **

Thus, the plan for using a "get schema" function in the context of hubverse-transform is never to specify a round id. I realize there's broader applications (i.e., the Python version of hubData), but that's out of scope for solving #14

I shouldn't have copied over a parameter from the R counterpart and then hand-wave it 😬 Apologies for the confusion.

My preference at this time would be to remove round_id and round_name and end up with a function that grabs all possible values for every task_id and output_type_id in every round. In other words, focus on the use case of correct schemas on S3.

** This isn't foolproof (for example, if a hub adds a new round with an incompatible schema), but it's better than our current state.

Copy link
Collaborator

@elray1 elray1 Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support a focus on the use case of correct schemas [on S3 and elsewhere], but need more convincing on what the right answer is for "correct". Couple of thoughts/questions related to this:

  1. I believe that the use of a single schema across all rounds would result in model output files that fail validation checks as performed by hubValidations if the schemas for those rounds differ, which feels "bad". (But maybe I am not understanding the setup for this correctly?) I would expect data files that live on S3 to pass the same validity checks as data files that live on GitHub or my local computer.
  2. Do we have a general problem where if I'm trying to load model outputs from parquet files submitted across multiple rounds with different schemas, the load fails?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! So with this understanding, I'm on board with removing round_id and round_name arguments for this functionality so that we just retrieve an overall schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the best plan.

Generally many task IDs that are covered by our schema shouldn't change data type in further rounds as that's somewhat fixed by the schema. Custom task IDs however, which are beyond our control, and the output_type_id column have the potential to change and this could indeed cause problems downstream. This is mainly a problem for parquet files.

To reduce the chances of this happening/mitigate the effects we:

  • should improve the documentation on this, get admins to think about the issue early on and warn them to avoid changes in data types.
  • Should propagate the ability to fix the output_type_id column to hubValidations and consider a property in the schema where hub admins can configure and communicate this setting. This would give admins the ability to future proof their hubs by setting the column to character if they are unsure whether they may start collecting an output type that could affect the schema.
  • As a future feature, once we have created functionality to inspect a hub for integrity, we could also add functionality that could repair any data type discrepancies and update files to conform to a changed schema. This could help admins in a situation where all the above fail and a breaking schema change needs to be introduced.

Happy to open issues regarding the above on Mon if it sounds sensible or a discussion post for more... discussion so as not to take over this PR anymore!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that all sounds good to me, i say let's just file issues for these things :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks hopping into the thread @annakrystalli--generating a schema for the entire hub definitely makes things more straightforward 😅

@bsweger bsweger removed the request for review from matthewcornell June 26, 2024 14:21
Per this PR convo, we are going to generate a schema for the
entire hub instead of generating schemas on a per-round basis
#22 (comment)
Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor change suggested, but i am approving the merge

@bsweger bsweger merged commit 5fddc1e into dev Jul 3, 2024
1 check passed
@bsweger bsweger deleted the bsweger/get-task-id-values branch July 19, 2024 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants