Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HfFileSystem] Reuse caching when downloading a file #1452

Open
Wauplin opened this issue Apr 27, 2023 · 3 comments
Open

[HfFileSystem] Reuse caching when downloading a file #1452

Wauplin opened this issue Apr 27, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@Wauplin
Copy link
Contributor

Wauplin commented Apr 27, 2023

In hffs, we implement _fetch_range which allows to retrieve bytes from a remote file without downloading it entirely (see fsspec). This is nice when downloading only parts of a file but it we want to download it entirely, it would be best to benefit from the existing hf_hub_download than using the HF cache system.

@mariosasko @lhoestq given your knowledge of fsspec, do you think it would be possible to overwrite the read method so that if read is called with length=-1, then we cache the entire file and read it from disk? And if length!=-1 we default back to the normal implementation. Do you see any weird side effect that this could cause?

Also for _fetch_range instead of always fetching from remote, we could try to find the file locally first.


Otherwise I saw that they also define a BaseCache object that we could extend. To you think it's worth trying to tweak it to use our existing cache?

@Wauplin Wauplin added the enhancement New feature or request label Apr 27, 2023
@mariosasko
Copy link
Contributor

The BaseCache cache types are used for read-buffering, which is expected to be temporary, not persistent. For the latter, we'd need to implement a new filesystem (if we want to follow the spec; see the existing implementations and the docs for it)

If we decide to implement it, what would be the best way of handling non-hffs filesystems when given as target_protocol in your opinion (e.g., throwing an error)?

@Wauplin
Copy link
Contributor Author

Wauplin commented May 22, 2023

Just to explain a bit more (and sum up an private slack convo):

Ideally I would like that

df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

caches the train.csv file locally using hf_hub_download before reading it. This is how every HF library works so we already have the logic implement.
Note from @mariosasko: this is NOT the default behavior expected by fsspec.

More precisely:

  • If I read a section of a file that is in my HF cache, read it from there
  • If I read an entire file that is in my HF cache, read it from there
  • If I read an entire file that is NOT in my HF cache, download with hf_hub_download + read it
  • If I read a section of a file that is not in my HF cache, read it from URL (not cached)

1. and 2. could be done immediately. 4. is what we are currently doing. And 3. is where it wouldn't be strictly compliant with fsspec.

Note from @mariosasko: case 4. should be considered as the corner case (i.e. let's not optimize for case 4.).


For the latter, we'd need to implement a new filesystem (if we want to follow the spec; see the existing implementations and the docs for it)

We don't want to go in that direction unless we have a strong community demand. Having a 2nd cache system would lead to more maintenance and confusion.


If we start to cache files locally before reading them, we should implement a WholeFileCacheFileSystem.

@mariosasko would it be possible to adapt the current implement to be a WholeFileCacheFileSystem ? That would be the best of both world IMO: reusing existing code + not having 2 separate implementations + have cache + still be compliant with fsspec. Would there be any drawback except the case 4. that will be less optimized (but we don't care).

@mariosasko
Copy link
Contributor

This is how every HF library works so we already have the logic implement.

HfFileSystem's primary purpose is to allow (non-HF) libraries easy integration with the Hub, so I don't think this is a good argument. It's more important what other libraries expect. Polars is a great example considering its popularity - its lazy API can process "larger than memory datasets using streaming" without storing anything on disk. So, in this scenario, we would ruin performance by caching the files.

So, I think it's best to add a method (or attribute) to HfFileSystem to toggle the caching on/off (and have it turned off by default).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants