[HfFileSystem] Reuse caching when downloading a file #1452

Wauplin · 2023-04-27T08:13:54Z

In hffs, we implement _fetch_range which allows to retrieve bytes from a remote file without downloading it entirely (see fsspec). This is nice when downloading only parts of a file but it we want to download it entirely, it would be best to benefit from the existing hf_hub_download than using the HF cache system.

@mariosasko @lhoestq given your knowledge of fsspec, do you think it would be possible to overwrite the read method so that if read is called with length=-1, then we cache the entire file and read it from disk? And if length!=-1 we default back to the normal implementation. Do you see any weird side effect that this could cause?

Also for _fetch_range instead of always fetching from remote, we could try to find the file locally first.

Otherwise I saw that they also define a BaseCache object that we could extend. To you think it's worth trying to tweak it to use our existing cache?

The text was updated successfully, but these errors were encountered:

mariosasko · 2023-05-09T17:44:41Z

The BaseCache cache types are used for read-buffering, which is expected to be temporary, not persistent. For the latter, we'd need to implement a new filesystem (if we want to follow the spec; see the existing implementations and the docs for it)

If we decide to implement it, what would be the best way of handling non-hffs filesystems when given as target_protocol in your opinion (e.g., throwing an error)?

Wauplin · 2023-05-22T08:43:00Z

Just to explain a bit more (and sum up an private slack convo):

Ideally I would like that

df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

caches the train.csv file locally using hf_hub_download before reading it. This is how every HF library works so we already have the logic implement.
Note from @mariosasko: this is NOT the default behavior expected by fsspec.

More precisely:

If I read a section of a file that is in my HF cache, read it from there
If I read an entire file that is in my HF cache, read it from there
If I read an entire file that is NOT in my HF cache, download with hf_hub_download + read it
If I read a section of a file that is not in my HF cache, read it from URL (not cached)

1. and 2. could be done immediately. 4. is what we are currently doing. And 3. is where it wouldn't be strictly compliant with fsspec.

Note from @mariosasko: case 4. should be considered as the corner case (i.e. let's not optimize for case 4.).

For the latter, we'd need to implement a new filesystem (if we want to follow the spec; see the existing implementations and the docs for it)

We don't want to go in that direction unless we have a strong community demand. Having a 2nd cache system would lead to more maintenance and confusion.

If we start to cache files locally before reading them, we should implement a WholeFileCacheFileSystem.

@mariosasko would it be possible to adapt the current implement to be a WholeFileCacheFileSystem ? That would be the best of both world IMO: reusing existing code + not having 2 separate implementations + have cache + still be compliant with fsspec. Would there be any drawback except the case 4. that will be less optimized (but we don't care).

mariosasko · 2023-06-16T17:55:17Z

This is how every HF library works so we already have the logic implement.

HfFileSystem's primary purpose is to allow (non-HF) libraries easy integration with the Hub, so I don't think this is a good argument. It's more important what other libraries expect. Polars is a great example considering its popularity - its lazy API can process "larger than memory datasets using streaming" without storing anything on disk. So, in this scenario, we would ruin performance by caching the files.

So, I think it's best to add a method (or attribute) to HfFileSystem to toggle the caching on/off (and have it turned off by default).

Wauplin added the enhancement label Apr 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HfFileSystem] Reuse caching when downloading a file #1452

[HfFileSystem] Reuse caching when downloading a file #1452

Wauplin commented Apr 27, 2023 •

edited

Loading

mariosasko commented May 9, 2023

Wauplin commented May 22, 2023 •

edited

Loading

mariosasko commented Jun 16, 2023

[HfFileSystem] Reuse caching when downloading a file #1452

[HfFileSystem] Reuse caching when downloading a file #1452

Comments

Wauplin commented Apr 27, 2023 • edited Loading

mariosasko commented May 9, 2023

Wauplin commented May 22, 2023 • edited Loading

mariosasko commented Jun 16, 2023

Wauplin commented Apr 27, 2023 •

edited

Loading

Wauplin commented May 22, 2023 •

edited

Loading