
Update data-pipeline to pull data from archived location #13

Open · 3 tasks · titaniumbones opened this issue Jan 29, 2025 · 7 comments

@titaniumbones (Collaborator)

The current version of data-pipeline pulls data sources from US servers (on AWS). @willf has captured these and uploaded them to Zenodo. We should:

  • check whether we can already use environment variables to set the base URLs of resources (see the sketch below)
  • if not, consider parameterizing them, or at least updating the existing URLs
  • start tracking these changes in the project documentation so that we can revert or update them at a later date.
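For the first item, a minimal sketch of what an environment-variable override could look like. The variable name DATA_SOURCE_BASE_URL and the settings import path are assumptions for illustration, not anything the repo currently defines:

import os

from data_pipeline.config import settings  # assumed import path

# Hypothetical override: use DATA_SOURCE_BASE_URL if set in the environment,
# otherwise fall back to the current AWS location.
DATA_SOURCE_BASE_URL = os.environ.get(
    "DATA_SOURCE_BASE_URL", settings.AWS_JUSTICE40_DATASOURCES_URL
)

calenviroscreen_ftp_url = DATA_SOURCE_BASE_URL + "/CalEnviroScreen_4.0_2021.zip"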
@willf (Collaborator) commented Jan 29, 2025

Just as a note, they cannot be served from Zenodo.

@titaniumbones (Collaborator, Author)

Ah, OK! The download seemed to work without a login or anything, but maybe it's against the ToS?

@willf (Collaborator) commented Jan 29, 2025

No, I mean we can't just point the servers to the Zenodo deposit. This is just cold storage until we place them in the right location.

@titaniumbones (Collaborator, Author)

I think I'm being a little dense, but curl -O with the Zenodo download link seems to work for me. I think I'm right that these files are just inputs to the data pipeline, which produces all of the GIS assets used in the map. So on the assumption that data-pipeline is run rarely, and therefore these files are only pulled rarely, Zenodo looks like it would work in the short run? Though I have only tried it with the smaller datasets.


@willf (Collaborator) commented Jan 29, 2025

Yes, that seems right! Sorry to be confusing.

@ericnost (Member) commented Jan 30, 2025

> check whether we can already use environment variables to set the base URLs of resources
> if not, consider parameterizing them, or at least updating the existing URLs

I'm sure we could do this (switch settings.AWS_JUSTICE40_DATASOURCES_URL to be equal to some other URL), but I'm not sure it would work with Zenodo given the way we have the repository configured.

The data pipeline fetches specific datasets one at a time, e.g.:

# fetch
self.calenviroscreen_ftp_url = (
    settings.AWS_JUSTICE40_DATASOURCES_URL # sure, could change this...
    + "/CalEnviroScreen_4.0_2021.zip" # but we don't have this per se
)

rather than grabbing one or two big files. (CalEnviroScreen is a bad example because we also have that archived separately, so we could just point to that...)

So to do what I think you're suggesting, @titaniumbones, I think we'd either have to rearrange the Zenodo repository into many smaller .zip files or reconfigure the data-pipeline code to grab one or two big files. The latter might be a lot of work?
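To illustrate the reconfiguration idea, a hypothetical sketch of per-dataset URL overrides for the case where the archive does not mirror the per-dataset .zip layout. The record ID, the helper name source_url, and the mapping itself are all made up for illustration:

# Hypothetical per-dataset override map; the record ID below is a
# placeholder, not the real Zenodo deposit.
ZENODO_RECORD_ID = "FIXME"

DATASET_URL_OVERRIDES = {
    "CalEnviroScreen_4.0_2021.zip": (
        f"https://zenodo.org/record/{ZENODO_RECORD_ID}/files/CalEnviroScreen_4.0_2021.zip"
    ),
    # ...one entry per archived dataset
}

def source_url(filename: str, base_url: str) -> str:
    # Use the override if one exists; otherwise build the URL as the
    # pipeline does today, from the configured base URL.
    return DATASET_URL_OVERRIDES.get(filename, f"{base_url}/{filename}")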

Instead, we might just update the documentation here to say: download the five zip files from Zenodo (maybe not score, since that's output from the scoring steps?), then skip step 1 (etl-run) and instead go to score-run. Same with the Docker documentation here.

I hope this makes sense. Just going off my understanding of the issue and my read of the documentation.

@titaniumbones (Collaborator, Author)

> Instead, we might just update the documentation here to say: download the five zip files from Zenodo (maybe not score, since that's output from the scoring steps?), then skip step 1 (etl-run) and instead go to score-run. Same with the Docker documentation here.

This sounds good, at least for now. And maybe somewhere we could add a little bash script that just does all the steps (it's not harder than writing them in the markdown!).
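A minimal sketch of those steps, written here in Python for illustration (a shell script with curl -O would do the same). The record ID and file list are placeholders; only CalEnviroScreen is named in this thread:

import urllib.request

ZENODO_RECORD_ID = "FIXME"  # fill in the real record ID for the deposit
BASE_URL = f"https://zenodo.org/record/{ZENODO_RECORD_ID}/files"

# The five archived zip files; "score" is excluded because it is produced
# by the scoring steps.
FILES = [
    "CalEnviroScreen_4.0_2021.zip",
    # ...the remaining zip files from the deposit
]

for name in FILES:
    print(f"Downloading {name}")
    urllib.request.urlretrieve(f"{BASE_URL}/{name}", name)

# Then skip step 1 (etl-run) and run the scoring step (score-run).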

Since we have the data, I'd mark this as a "later" task.
