
Improve Regridding performance #17

Open · mpiannucci opened this issue Apr 28, 2022 · 12 comments

@mpiannucci (Contributor)

The /image/tile API uses xESMF for regridding on the fly. This means the weights are recomputed every single time the /tile API is called, which is absolutely terrible for performance across the board.

The initial idea is to precompute the weights for every level, then only apply the grid, reproject, and clip each tile. We should probably just cache the regridded values, but I'm trying to keep things as idempotent as possible.
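Roughly what I have in mind, as a sketch only (the `target_grid` argument, the weights directory, and the helper name are placeholders, not existing code; newer xESMF can persist weights with `to_netcdf` and reload them with `reuse_weights`):

```python
import os

import xesmf as xe


def get_regridder(ds, target_grid, level, weights_dir="weights"):
    """Build the xESMF regridder for a zoom level, reusing on-disk weights when present."""
    os.makedirs(weights_dir, exist_ok=True)
    path = os.path.join(weights_dir, f"bilinear_level_{level}.nc")

    if os.path.exists(path):
        # Reload previously generated weights instead of recomputing them.
        return xe.Regridder(ds, target_grid, "bilinear", filename=path, reuse_weights=True)

    regridder = xe.Regridder(ds, target_grid, "bilinear")
    regridder.to_netcdf(path)  # persist the weights so later requests skip generation
    return regridder
```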

@abkfenris (Collaborator)

How about caching the weights on first access for a level?

@mpiannucci (Contributor, Author)

> How about caching the weights on first access for a level?

Yeah, I think that's a good compromise. I am worried about multiple tiles from a single level being requested concurrently, so we probably need a mutex in there to avoid clashing. I might try that tonight.
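Something like this is what I'm picturing for the lock (a sketch, using the hypothetical `get_regridder` helper from the earlier comment):

```python
import threading

_regridders = {}                  # level -> cached regridder
_locks = {}                       # level -> lock guarding that level's build
_locks_guard = threading.Lock()   # protects the _locks dict itself


def regridder_for_level(ds, target_grid, level):
    """Return the cached regridder for a level, building it at most once."""
    with _locks_guard:
        lock = _locks.setdefault(level, threading.Lock())

    with lock:  # concurrent tile requests for the same level wait here
        if level not in _regridders:
            # get_regridder is the hypothetical helper sketched above
            _regridders[level] = get_regridder(ds, target_grid, level)
        return _regridders[level]
```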

@mpiannucci (Contributor, Author)

In trying this last night, I found that xESMF is very slow to generate weights for the regridder once you get past zoom level one. It wasn't as bad when originally regridding to only the tile extents.

I need to rethink the best way to manage this. ncWMS seems to be able to do this very fast, so I will use that as inspiration.

@abkfenris (Collaborator)

Is it actually delaying calculation? It might need to have an explicitly instantiated dask cluster in scope before it will defer calculation.

@mpiannucci (Contributor, Author)

Not sure. I have another idea using just reprojection that I am going to test.

@jmunroe (Collaborator)

jmunroe commented Apr 29, 2022

I think @abkfenris is correct -- without a dask scheduler set up the calculation will not be lazy.
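For reference, a minimal way to put a distributed scheduler in scope before the regridding code runs (a local, in-process cluster here just for illustration):

```python
from dask.distributed import Client

# Creating a Client registers it as the default scheduler, so subsequent
# dask-backed xarray operations are deferred and executed on its workers.
client = Client(processes=False)  # threads-only local cluster; tune for a real deployment
print(client)
```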

mpiannucci mentioned this issue Apr 29, 2022
@mpiannucci (Contributor, Author)

We can get away with just reprojecting with rioxarray, and it's a lot faster than regridding with xESMF.
[Screenshot: Screen Shot 2022-04-29 at 10.23.51 AM]
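Roughly the reprojection path, as a sketch (the variable name, dimension names, and CRS codes are illustrative, not taken from the actual datasets, and `ds` is assumed to be an already-opened xarray Dataset):

```python
import rioxarray  # noqa: F401 -- registers the .rio accessor on xarray objects
from rasterio.enums import Resampling

da = ds["sea_surface_temperature"]                      # placeholder variable name
da = da.rio.set_spatial_dims(x_dim="lon", y_dim="lat")  # tell rioxarray which dims are spatial
da = da.rio.write_crs("EPSG:4326")                      # declare the source CRS
mercator = da.rio.reproject("EPSG:3857", resampling=Resampling.bilinear)
```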

@mpiannucci (Contributor, Author)

^^ But that doesn't solve the longer-term issues with lazy loading, because rioxarray currently needs the whole dataset in memory to reproject.

So it's good enough to get data to the zarr pyramid endpoints as a proof of concept, but it's not the long-term solution.

@abkfenris (Collaborator)

For checking if there are any dask clusters that have been implicitly created, distributed.client._global_client_index should give a dictionary of any dask clients currently in scope (via https://github.com/pangeo-forge/pangeo-forge-recipes/pull/350/files#r861207572).
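For quick inspection (this is a private attribute of distributed, as noted in the linked comment, so treat it as a debugging aid only and expect it to change between versions):

```python
import distributed.client

# Private internals -- just print whatever distributed has registered
# for implicitly created clients in this process.
print(distributed.client._global_client_index)
```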

@mpiannucci (Contributor, Author)

So in testing, the bottleneck for using xESMF is generating the weights, which cannot be done in parallel (with dask or otherwise) when using xESMF. The actual regridding is fast enough once the weights are generated.
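To illustrate the split (the dataset and variable names are placeholders): the slow part is constructing the regridder, while applying it afterwards is comparatively fast:

```python
import xesmf as xe

# Weight generation happens at construction time and runs serially -- this is the bottleneck.
regridder = xe.Regridder(ds_source, ds_target, "bilinear")

# Applying the regridder reuses the precomputed weights and is fast enough per tile.
tile = regridder(ds_source["sea_surface_temperature"])
```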

@jmunroe (Collaborator)

jmunroe commented Apr 30, 2022 via email

@abkfenris (Collaborator)

abkfenris commented May 1, 2022

While we could pre-compute weights ourselves, it would be nice not to have to define a single specific sidecar file. I'd still lean towards adding them to the cache.

How about adding a datasets/{dataset_id}/tree/cache route that fires off a background task to build up the weights and cache them? Then we could either try to generate them on the fly, or return an error with a message to hit that endpoint when the weights aren't cached.

Also, on the cache side of things, we might want to explore overriding the current xpublish.get_cache(). The current cache store is a dict, so it's confined to a single process (and we might want to explore running it under gunicorn to enable multiple accesses), but cachey.Cache's data store is pluggable. We could try something like redis_collections.Dict or another MutableMapping that could be swapped in its place.
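A rough sketch of what that route could look like (the exact path, the `build_all_weights` helper, and how xpublish would mount and prefix the router are all assumptions here, not existing code):

```python
import cachey
from fastapi import APIRouter, BackgroundTasks, Depends
from xpublish.dependencies import get_dataset

router = APIRouter()


def build_all_weights(dataset):
    """Placeholder: generate and cache the regridding weights for every pyramid level."""
    ...


@router.post("/datasets/{dataset_id}/tree/cache")
def warm_regridding_cache(
    dataset_id: str,
    background_tasks: BackgroundTasks,
    dataset=Depends(get_dataset),
):
    """Return immediately and build the weights in a background task."""
    background_tasks.add_task(build_all_weights, dataset)
    return {"status": "cache warming started"}


# On the cache store: cachey.Cache keeps entries in a plain MutableMapping (its
# .data attribute), so in principle it could be pointed at something shared
# across processes, e.g. a redis_collections.Dict, instead of the default dict.
shared_cache = cachey.Cache(available_bytes=1e9)
# shared_cache.data = redis_collections.Dict(key="xpublish-regrid-cache")
```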
