
Update data-pipeline to pull data from archived location #13

Open · 3 tasks · titaniumbones opened this issue Jan 29, 2025 · 7 comments

@titaniumbones (Collaborator)

The current version of data-pipeline pulls data sources from US servers (on AWS). @willf has captured these and uploaded them to Zenodo. We should:

  • check whether we can already use environment variables to set the base URLs of resources (see the sketch below)
  • if not, consider parameterizing them, or at least updating the existing URLs
  • start tracking these changes in the project documentation so that we can revert or update them at a later date.
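For the first item, a minimal sketch of what an environment-variable override could look like. The variable name DATA_SOURCE_BASE_URL and the settings import path are assumptions for illustration, not anything the repo currently defines:

import os

from data_pipeline.config import settings  # assumed import path

# Hypothetical override: use DATA_SOURCE_BASE_URL if set in the environment,
# otherwise fall back to the current AWS location.
DATA_SOURCE_BASE_URL = os.environ.get(
    "DATA_SOURCE_BASE_URL", settings.AWS_JUSTICE40_DATASOURCES_URL
)

calenviroscreen_ftp_url = DATA_SOURCE_BASE_URL + "/CalEnviroScreen_4.0_2021.zip"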
@willf (Collaborator) commented Jan 29, 2025

Just as a note, they cannot be served from Zenodo.

@titaniumbones (Collaborator, Author)

Ah, OK! The download seemed to work without a login or anything, but maybe it's against the ToS?

@willf (Collaborator) commented Jan 29, 2025

No, I mean we can't just point the servers to the Zenodo deposit. This is just cold storage until we place them in the right location.

@titaniumbones (Collaborator, Author)

I think I'm being a little dense, but curl -O with the Zenodo download link seems to work for me. I think I'm right that these files are just inputs to the data pipeline, which produces all of the GIS assets used in the map. So on the assumption that data-pipeline is run rarely, and therefore these files are only pulled rarely, Zenodo looks like it would work in the short run? Though I have only tried it with the smaller datasets.


@willf (Collaborator) commented Jan 29, 2025

Yes, that seems right! Sorry to be confusing.

@ericnost (Member) commented Jan 30, 2025

> check whether we can already use environment variables to set the base URLs of resources
> if not, consider parameterizing them, or at least updating the existing URLs

I'm sure we could do this (switch settings.AWS_JUSTICE40_DATASOURCES_URL to be equal to some other URL), but I'm not sure it would work with Zenodo given the way we have the repository configured.

The data pipeline fetches specific datasets one at a time, e.g.:

# fetch
self.calenviroscreen_ftp_url = (
    settings.AWS_JUSTICE40_DATASOURCES_URL # sure, could change this...
    + "/CalEnviroScreen_4.0_2021.zip" # but we don't have this per se
)

rather than grabbing one or two big files. (CalEnviroScreen is a bad example because we also have that archived separately, so we could just point to that...)

So to do what I think you're suggesting, @titaniumbones, I think we'd either have to rearrange the Zenodo repository into many smaller .zip files or reconfigure the data-pipeline code to grab one or two big files. The latter might be a lot of work?
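To illustrate the reconfiguration idea, a hypothetical sketch of per-dataset URL overrides for the case where the archive does not mirror the per-dataset .zip layout. The record ID, the helper name source_url, and the mapping itself are all made up for illustration:

# Hypothetical per-dataset override map; the record ID below is a
# placeholder, not the real Zenodo deposit.
ZENODO_RECORD_ID = "FIXME"

DATASET_URL_OVERRIDES = {
    "CalEnviroScreen_4.0_2021.zip": (
        f"https://zenodo.org/record/{ZENODO_RECORD_ID}/files/CalEnviroScreen_4.0_2021.zip"
    ),
    # ...one entry per archived dataset
}

def source_url(filename: str, base_url: str) -> str:
    # Use the override if one exists; otherwise build the URL as the
    # pipeline does today, from the configured base URL.
    return DATASET_URL_OVERRIDES.get(filename, f"{base_url}/{filename}")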

Instead, we might just update the documentation here to say: download the five zip files from Zenodo (maybe not score, since that's output from the scoring steps?), then skip step 1 (etl-run) and instead go to score-run. Same with the Docker documentation here.

I hope this makes sense. Just going off my understanding of the issue and my read of the documentation.

@titaniumbones (Collaborator, Author)

> Instead, we might just update the documentation here to say: download the five zip files from Zenodo (maybe not score, since that's output from the scoring steps?), then skip step 1 (etl-run) and instead go to score-run. Same with the Docker documentation here.

This sounds good, at least for now. And maybe somewhere we could add a little bash script that just does all the steps (it's not harder than writing them in the markdown!).
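A minimal sketch of those steps, written here in Python for illustration (a shell script with curl -O would do the same). The record ID and file list are placeholders; only CalEnviroScreen is named in this thread:

import urllib.request

ZENODO_RECORD_ID = "FIXME"  # fill in the real record ID for the deposit
BASE_URL = f"https://zenodo.org/record/{ZENODO_RECORD_ID}/files"

# The five archived zip files; "score" is excluded because it is produced
# by the scoring steps.
FILES = [
    "CalEnviroScreen_4.0_2021.zip",
    # ...the remaining zip files from the deposit
]

for name in FILES:
    print(f"Downloading {name}")
    urllib.request.urlretrieve(f"{BASE_URL}/{name}", name)

# Then skip step 1 (etl-run) and run the scoring step (score-run).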

Since we have the data, I'd mark this as a "later" task.
