Add boilerplate to the dashboard template about accessing hub data on S3 #4

Open
1 task
bsweger opened this issue Feb 12, 2025 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments


bsweger commented Feb 12, 2025

Background

@nickreich noted on Slack that Hubverse hubs that sync their data to AWS:

  • don't indicate anywhere that their model-output and target data are available outside the hub repo, in public S3 buckets
  • don't have any documentation about how to access the files on S3

The Hubverse should do a better job of advertising data available on S3.

Definition of done

bsweger converted this from a draft issue on Feb 12, 2025
bsweger added the "documentation" label (Improvements or additions to documentation) on Feb 12, 2025
Author

bsweger commented Feb 12, 2025

Related issue: hubverse-org/hubTemplate#22

Member

zkamvar commented Feb 12, 2025

Fun thing: I'm not sure that either of the tools in the dashboard are able to use s3 (yet).

That being said, it might really help operations to be able to do this because that way I wouldn't have to assume that a hub is on GitHub and have to fetch it.

Author

bsweger commented Feb 12, 2025

> Fun thing: I'm not sure that either of the tools in the dashboard are able to use s3 (yet).
>
> That being said, it might really help operations to be able to do this because that way I wouldn't have to assume that a hub is on GitHub and have to fetch it.

Interesting---when we first discussed dashboards at the hub retreat in June (just before you started), we had floated the idea of telling hubs they had to host their data on the cloud before using these tools.

Now that y'all have rolled up your sleeves, it would be interesting to get a sense of how much time and complexity it adds to work with the data on GitHub instead.

(all this aside, I assume that the play here is to have a template markdown page that cloud folks can use when publishing their dashboards)

Member

zkamvar commented Feb 12, 2025

> Interesting---when we first discussed dashboards at the hub retreat in June (just before you started), we had floated the idea of telling hubs they had to host their data on the cloud before using these tools.

LOL no one told me 😅

> Now that y'all have rolled up your sleeves, it would be interesting to get a sense of how much time and complexity it adds to work with the data on GitHub instead.

I mean, that's what we have right now... and it means that we have to first clone the hub before doing anything with it.

> (all this aside, I assume that the play here is to have a template markdown page that cloud folks can use when publishing their dashboards)

Oh are you talking about publishing a dashboard on s3? Because the play for incorporating s3 hubs would be to conditionally fetch the data from s3 or download the hub if we can't.
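The "fetch from S3, otherwise download the hub" idea could be sketched roughly as follows. This is only an illustration, not code from any Hubverse tool: `HubSource`, `choose_hub_source`, and the `bucket_is_readable` probe are hypothetical names, and the probe is passed in as a callable so the decision logic stays independent of any particular S3 client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HubSource:
    """Where the dashboard build should read hub data from."""
    kind: str      # "s3" or "git"
    location: str  # s3:// URI or repository URL


def choose_hub_source(
    repo_url: str,
    bucket: Optional[str],
    bucket_is_readable: Callable[[str], bool],
) -> HubSource:
    """Prefer the hub's S3 mirror; fall back to cloning the repository.

    `bucket_is_readable` is a hypothetical probe supplied by the caller
    (for example, a function that attempts to list the bucket).
    """
    if bucket:
        try:
            if bucket_is_readable(bucket):
                return HubSource("s3", f"s3://{bucket}/")
        except OSError:
            # Treat network or permission failures as "S3 not available".
            pass
    return HubSource("git", repo_url)
```

With this shape, a hub that has no bucket configured, or whose bucket cannot be reached, still takes the existing clone path, so nothing changes for hubs that live only on GitHub.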

Member

zkamvar commented Feb 12, 2025

> (all this aside, I assume that the play here is to have a template markdown page that cloud folks can use when publishing their dashboards)

And for context, the dashboards are built as static sites where all the data have been pre-computed and are called via fetch() in JavaScript to a local or remote resource (see the code for predtimechart.js), so unless they roll their own data generation, the template would not be very interesting.

Contributor

elray1 commented Feb 13, 2025

> Interesting---when we first discussed dashboards at the hub retreat in June (just before you started), we had floated the idea of telling hubs they had to host their data on the cloud before using these tools.

FWIW, we did discuss this -- it was just in the era when notes were recorded in Google Docs, so it was easy for ideas to get lost. In this case, we were using this doc -- see notes from Oct 9 and summary at the bottom. The discussion is limited, though -- more or less, "the current working solution requires clones of repos, which may be a problem if repo sizes get large. Consider S3 in the future."

If we think the time has come to reconsider supporting hubs in the cloud, there is a question of what it would take to move to pulling data from S3. I think the answers may differ for hubPredEvalsData, which computes scores for the evals tool, and hub-dashboard-predtimechart, which creates the JSON data ingested by the viz tool:

  • hubPredEvalsData has been written so that it takes (a) a hub path, which is used as the argument to hubData::connect_hub and hubUtils::read_config, both of which already support loading data from S3 buckets, and (b) a data frame of oracle outputs. Point (b) is a temporary working solution until such time as we can rely on hubs actually storing oracle output, but note that (i) our intermediate working solution is to store the oracle output on a GitHub branch for the dashboard repo, which does not require a local clone of the hub, and (ii) once hubs do include oracle output, I believe that would show up in the cloud mirror of hub data. So, conclusion: shouldn't take much work, if any, to pull data from S3 here.
  • However, hub-dashboard-predtimechart is written in Python and therefore is not using any of our R tooling that has built-in support for loading data from S3. Its hub_dir is a Path object and paths to various files are being constructed from that. I think that means it would be more of a project to switch over to S3 for the forecast viz data.
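To make the second bullet concrete: code that builds file locations from a local Path only works against a clone. One way the path construction could be made source-agnostic is sketched below. This is an assumption-laden illustration, not the actual hub-dashboard-predtimechart code; `hub_file_location` is a hypothetical helper, and the bucket and file names are examples only.

```python
from pathlib import PurePosixPath


def hub_file_location(hub_root: str, *parts: str) -> str:
    """Join path segments onto a hub root.

    `hub_root` may be a local directory (e.g. a clone of the hub) or an
    s3:// URI (e.g. the hub's cloud mirror). The same relative layout is
    assumed in both places.
    """
    relative = PurePosixPath(*parts)
    if hub_root.startswith("s3://"):
        # S3 keys always use forward slashes, so join them as plain text.
        return hub_root.rstrip("/") + "/" + relative.as_posix()
    return (PurePosixPath(hub_root) / relative).as_posix()
```

Building the location is the easy part. The larger piece of work noted above is that every place that then opens a file would need a reader that understands both kinds of location.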

[Edit: I added this comment having read the discussion thread but not the actual issue topic and now feel that I've sent the conversation even farther away from the actual topic here. If we want to discuss this further maybe we should file a separate issue or start an RFC about it?]

Author

bsweger commented Feb 13, 2025

> And for context, the dashboards are built as static sites where all the data have been pre-computed and are called via fetch() in JavaScript to a local or remote resource (see the code for predtimechart.js), so unless they roll their own data generation, the template would not be very interesting.

Right--I'm muddying the waters by using the word "template" instead of "boilerplate". I heard in our Monday meeting that we wanted some text in hub-dashboard-template that admins of cloud-hosted hubs could use as a starting point for describing how people can access the S3 data (since the dashboard essentially serves as a hub's website). Maybe as a new "data" or "data access" tab?

Same idea as the boilerplate in the README PR: hubverse-org/hubTemplate#23

(Which I agree is not very interesting, but is a step in the right direction...ultimately, I'd love to see an actual template-driven process for creating a hub)

Does that seem reasonable?

bsweger changed the title from "Add a dashboard template section that describes how to access hub data from S3" to "Add boilerplate to the dashboard template about accessing hub data on S3" on Feb 13, 2025
Author

bsweger commented Feb 13, 2025

> If we think the time has come to reconsider supporting hubs in the cloud, there is a question of what it would take to move to pulling data from S3.

Thanks for adding that context @elray1. If we didn't actually decide to rely on S3 for the hub dashboard, then I certainly don't want to derail all the work that's already been done and cause thrashing in the project!

In the long run, continuing to tightly couple hub data access to git is something we'd want to address, but if S3 was never truly a requirement for this first dashboard iteration, I'm definitely not suggesting that we rejig the POC work y'all have been doing.

Member

zkamvar commented Feb 13, 2025

A couple of things:

  1. To clarify: I was aware that S3 was a possibility; I was not aware that it was potentially a requirement.
  2. Thank you, Becky, for clarifying about boilerplate vs. template

> I heard in our Monday meeting that we wanted some text in hub-dashboard-template that admins of cloud-hosted hubs could use as a starting point for describing how people can access the S3 data (since the dashboard essentially serves as a hub's website). Maybe as a new "data" or "data access" tab?

This doesn't even have to be limited to s3 and we could actually template this similar to how we template the data generation by doing the following:

  1. include a data-access.qmd file in https://github.com/hubverse-org/hub-dash-site-builder/tree/main/static. This file would contain quarto variable shortcodes that would be used to populate the templated content.
  2. add a variables: entry in site-config.yml that would prevent manually editing the file.
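The two steps above amount to "boilerplate text with placeholders, filled in from per-hub variables". In the proposal, Quarto's variable shortcodes would do the substitution. The Python below only emulates that idea so the shape of the result is visible; the wording of the boilerplate and the variable names `hub_name` and `bucket_name` are hypothetical and not taken from hub-dash-site-builder.

```python
from string import Template

# Stand-in for the templated content of a data-access page.
DATA_ACCESS_TEMPLATE = Template(
    "Model output and target data for $hub_name are mirrored to a public "
    "S3 bucket: s3://$bucket_name/\n"
    "Individual files can also be downloaded over HTTPS from "
    "https://$bucket_name.s3.amazonaws.com/\n"
)


def render_data_access(variables: dict) -> str:
    """Fill the boilerplate from a mapping of variables.

    `variables` stands in for a hypothetical `variables:` entry in
    site-config.yml. A missing key raises KeyError, so an incomplete
    config fails loudly instead of publishing a half-filled page.
    """
    return DATA_ACCESS_TEMPLATE.substitute(variables)
```

For example, `render_data_access({"hub_name": "Example Hub", "bucket_name": "example-hub-bucket"})` returns text that a hub admin would not need to edit by hand.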

Projects
Status: Up Next
Development

No branches or pull requests

3 participants