-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bsweger/get task id values #22
Conversation
This helper function augments Python's built-in type function with some logic to see if a string value is actually an iso-formatted date. It will eventually be used to determine the date type for a list of values.
Add a function that returns (for all rounds or for a specific round) a dictionary that contains every task_id name along with its corresponding set of potential values.
|
||
return task_id_values | ||
|
||
def _get_data_type(self, value: int | bool | str | date | float) -> type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got ahead of myself here: this is the helper that will eventually determine the data type of our task_ids and output_type_ids, but it's not being used yet.
@@ -72,7 +73,25 @@ def tasks_config() -> dict: | |||
}, | |||
], | |||
"submissions_due": {"relative_to": "reference_date", "start": -6, "end": -3}, | |||
} | |||
}, | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added another round to our test data so we can test the new function for multiple rounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had a few questions
src/hubverse_transform/hub_config.py
Outdated
tasks = self.tasks | ||
rounds = tasks.get("rounds", []) | ||
if round_name != "all": | ||
rounds = [r for r in rounds if r.get("round_name") == round_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example below with
"round_id_from_variable": True,
"round_id": "reference_date",
"model_tasks": [
{
"task_ids": {
"reference_date": {"required": None, "optional": ["2024-07-13", "2024-07-21"]},
will this find any results if I ask for round_name = '2024-07-13'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not unless there is a round entry in tasks.json
that is explicitly named 2024-07-13
. See note below: #22 (comment)
I do think we should settle on how to do this with the understanding that 95% of our use case is generating a schema that will work across all hub files to prevent parquet errors when access model-output files on S3.
My vote would be to switch from round_name
to round_id
for consistency with the R package: https://github.com/hubverse-org/hubData/blob/main/R/create_hub_schema.R#L101
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I agree that we should switch from
round_name
toround_id
- "with the understanding that 95% of our use case is generating a schema that will work across all hub files to prevent parquet errors when access model-output files on S3." But we are going to want these schemas to be correct on a per-round basis, right? It does not seem correct to generically use the same schema for all rounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's likely that I overlooked the round_id nuance you suggest when reading the corresponding R function.
To clarify my comment re: the use case....the current goal of ascertaining a hub's schema is to ensure we're applying the correct parquet schema when writing model-output files to S3. To increase the likelihood of successful downstream access, regardless of filter/query patterns, we want a schema that will work for every round and task in tasks.json
. **
Thus, the plan for using a "get schema" function in the context of hubverse-transform
is never to specify a round id. I realize there's broader applications (i.e., the Python version of hubData
), but that's out of scope for solving #14
I shouldn't have copied over a parameter from the R counterpart and then hand-wave it 😬 Apologies for the confusion.
My preference at this time would be to remove round_id
and round_name
and end up with a function that grabs all possible values for every task_id and output_type_id in every round. In other words, focus on the use case of correct schemas on S3.
** This isn't foolproof (for example, if a hub adds a new round with an incompatible schema), but it's better than our current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support a focus on the use case of correct schemas [on S3 and elsewhere], but need more convincing on what the right answer is for "correct". Couple of thoughts/questions related to this:
- I believe that the use of a single schema across all rounds would result in model output files that fail validation checks as performed by hubValidations if the schemas for those rounds differ, which feels "bad". (But maybe I am not understanding the setup for this correctly?) I would expect data files that live on S3 to pass the same validity checks as data files that live on GitHub or my local computer.
- Do we have a general problem where if I'm trying to load model outputs from parquet files submitted across multiple rounds with different schemas, the load fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! So with this understanding, I'm on board with removing round_id
and round_name
arguments for this functionality so that we just retrieve an overall schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the best plan.
Generally many task IDs that are covered by our schema shouldn't change data type in further rounds as that's somewhat fixed by the schema. Custom task IDs however, which are beyond our control, and the output_type_id
column have the potential to change and this could indeed cause problems downstream. This is mainly a problem for parquet files.
To reduce the chances of this happening/mitigate the effects we:
- should improve the documentation on this, get admins to think about the issue early on and warn them to avoid changes in data types.
- Should propagate the ability to fix the
output_type_id
column tohubValidations
and consider a property in the schema where hub admins can configure and communicate this setting. This would give admins the ability to future proof their hubs by setting the column to character if they are unsure whether they may start collecting an output type that could affect the schema. - As a future feature, once we have created functionality to inspect a hub for integrity, we could also add functionality that could repair any data type discrepancies and update files to conform to a changed schema. This could help admins in a situation where all the above fail and a breaking schema change needs to be introduced.
Happy to open issues regarding the above on Mon if it sounds sensible or a discussion post for more... discussion so as not to take over this PR anymore!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that all sounds good to me, i say let's just file issues for these things :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks hopping into the thread @annakrystalli--generating a schema for the entire hub definitely makes things more straightforward 😅
Per this PR convo, we are going to generate a schema for the entire hub instead of generating schemas on a per-round basis #22 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor change suggested, but i am approving the merge
Co-authored-by: Evan Ray <[email protected]>
Next piece of the work for #14
Here we're adding a new function that inspects a hub's
tasks.json
and returns a dictionary oftask_id
s and their possible values. This is information we'll need to determine the schema of the model_output file.Next step:
Do the same thing, except for
output_type_id
s, which are configured a bit differentlyHere's what it looks like in action: