Allow dry_run for snapshot_download #1023

Open
cccntu opened this issue Aug 31, 2022 · 9 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments


cccntu commented Aug 31, 2022

The current download progress bar doesn't show the names of the files being downloaded, and doesn't show how many files will be downloaded.
With allow_patterns and ignore_patterns, having a dry-run option would be very useful.
One can also use it to add logging.
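
To illustrate the request, here is a minimal sketch of what a pattern preview could look like. It assumes the list of files in the repo is already known and that patterns are matched fnmatch-style; `preview_matches` is a hypothetical helper written for this example, not part of huggingface_hub.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional


def preview_matches(
    filenames: Iterable[str],
    allow_patterns: Optional[List[str]] = None,
    ignore_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Return the filenames a pattern-filtered download would select.

    Hypothetical helper: nothing is downloaded, we only filter names.
    """
    selected = []
    for name in filenames:
        # Skip files that match none of the allow patterns (if any are given).
        if allow_patterns is not None and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        # Skip files that match at least one ignore pattern.
        if ignore_patterns is not None and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


# Made-up file list, for illustration only.
repo_files = ["config.json", "model.safetensors", "pytorch_model.bin", "README.md"]
would_download = preview_matches(
    repo_files,
    allow_patterns=["*.json", "*.safetensors"],
    ignore_patterns=["*.bin"],
)
print(f"{len(would_download)} file(s) would be downloaded:")
for name in would_download:
    print(f"  {name}")
```

A real dry run would also need file sizes and cache state, which this sketch leaves out.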

Wauplin added the "enhancement" (New feature or request) label on Sep 1, 2022
Wauplin changed the title from "Allow dry_run for snapshot_download" to "🚩 Allow dry_run for snapshot_download" on Sep 1, 2022
Contributor

Wauplin commented Sep 1, 2022

Hi @cccntu, thanks for the feature request 👍

Would you like to try implementing it? :)
The dry-run option would most probably be implemented somewhere here. You might also need to look into the implementation of hf_hub_download to know which files are already cached from a previous revision and which are new. hf_hub_download could also have the dry-run option.

If you need any help, I'd be happy to answer any question.

Wauplin added the "good first issue" (Good for newcomers) label on Jul 4, 2023
Wauplin changed the title from "🚩 Allow dry_run for snapshot_download" to "Allow dry_run for snapshot_download" on Jul 11, 2023
Contributor

druvdub commented Feb 20, 2024

Hey, is it okay if I pick up this issue? :)

Contributor

Wauplin commented Feb 20, 2024

Hi @druvdub, thanks for offering to take this one! This issue is not yet assigned, so yes, feel free to take it! 🙏 Implementation-wise, I think you can start with a PR that adds a dry_run option to hf_hub_download first. Adding support for snapshot_download and huggingface-cli download can be done in a follow-up PR.

The biggest question is: what is the expected output of a download in dry-run mode? My expectation is that it would not download anything, but it is still ok-ish if it updates the internal refs (see these docs). As a return value, None should be fine, with logs used to print what would have been done if it weren't a dry run. I see different cases:

  • file already exists and is in the correct place => nothing to do
  • file already exists but is not in the correct place => requires only a symlink
  • file doesn't exist in cache => requires a download
  • if resume_download=True => how much is left to download?

What do you think? Would you like to start by having a look at the code and then discuss what you think is a good direction to take? Happy to discuss it further if you're interested.
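
The four cases listed above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the cache is modeled as a snapshot pointer path, a blob path, and a partial-download file, all passed in by the caller. The function name `plan_download`, its parameters, and the action labels are hypothetical and are not taken from the huggingface_hub code.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def plan_download(
    pointer_path: Path,
    blob_path: Path,
    incomplete_path: Path,
    file_size: int,
    resume_download: bool = False,
) -> Tuple[str, int]:
    """Return (action, bytes_to_download) without modifying anything on disk.

    Hypothetical helper; the three paths are assumptions about the cache layout.
    """
    if pointer_path.exists():
        # File already exists in the correct place: nothing to do.
        return "nothing", 0
    if blob_path.exists():
        # Content is cached but not linked for this revision: symlink only.
        return "symlink", 0
    if resume_download and incomplete_path.exists():
        # Partial download found: only the remainder is needed.
        already_downloaded = incomplete_path.stat().st_size
        return "resume", max(file_size - already_downloaded, 0)
    # Not in cache at all: full download.
    return "download", file_size


# Usage: simulate a 100-byte file of which 40 bytes were already fetched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    partial = root / "abc123.incomplete"
    partial.write_bytes(b"x" * 40)
    plan = plan_download(
        pointer_path=root / "snapshots" / "model.bin",
        blob_path=root / "blobs" / "abc123",
        incomplete_path=partial,
        file_size=100,
        resume_download=True,
    )
    print(plan)  # ('resume', 60)
```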

Contributor

druvdub commented Feb 20, 2024

@Wauplin Yes, I'll explore the codebase a bit first, and then we can discuss how to proceed on this issue.

Contributor

druvdub commented Feb 26, 2024

@Wauplin right, I did a bit of reading and was wondering if the output of the dry run should be something that can be parsed and used further, e.g. in JSON format, along with the logs.

Secondly, should the dry-run support legacy_cache_layout as well?

Also, considering the resume_download=True case, would we want to display all files done so far plus the remaining data left, or only the files remaining? A file might be only partially downloaded, which could complicate displaying the information. We could simply display how much is left to download and not show the files remaining.

Contributor

Wauplin commented Feb 27, 2024

> if the output should be something that can be parsed and used further for the dry-run like in a JSON format along with the logs

Let's do that, yes: return a dataclass with all the required information.

> Secondly, should the dry-run support legacy_cache_layout as well.

No, there's no need to support legacy_cache_layout.

> would we want to display all files done so far + remaining data left or the files remaining

That's a question more for snapshot_download rather than hf_hub_download. I think for each file we could have:

  • filename => path in repo on the Hub
  • destination => where it will be stored
  • commit_hash => the "etag"
  • file_size => full size on the Hub
  • download_size => size left to download (might be equal to 0, to file_size, or something in between)

WDYT?
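
As a rough sketch of the proposal above, the per-file fields could be carried in a dataclass and summed up for a snapshot-level summary. The class name `DryRunFileReport` and the helper `summarize` are hypothetical names chosen for this example, and the sample values are made up. Only the five field names come from the list in the comment.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DryRunFileReport:  # hypothetical name, not an existing huggingface_hub class
    filename: str       # path in repo on the Hub
    destination: str    # where it will be stored
    commit_hash: str    # revision identifier, as described in the comment
    file_size: int      # full size on the Hub, in bytes
    download_size: int  # bytes left to download (0 <= download_size <= file_size)


def summarize(files: List[DryRunFileReport]) -> Dict[str, int]:
    """Aggregate per-file reports into totals, e.g. for a whole snapshot."""
    return {
        "num_files": len(files),
        "files_to_download": sum(1 for f in files if f.download_size > 0),
        "total_download_size": sum(f.download_size for f in files),
    }


# Made-up values covering the three situations: cached, partial, missing.
report = [
    DryRunFileReport("config.json", "/cache/config.json", "abc123", 600, 0),
    DryRunFileReport("model.safetensors", "/cache/model.safetensors", "abc123", 5000, 2000),
    DryRunFileReport("tokenizer.json", "/cache/tokenizer.json", "abc123", 900, 900),
]
summary = summarize(report)

# A dataclass converts to JSON easily, which covers the "parsable output" request.
print(json.dumps([asdict(f) for f in report], indent=2))
print(summary)
```

Keeping the object flat and free of inheritance makes `asdict` and JSON serialization trivial.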

Contributor

druvdub commented Feb 27, 2024

Sounds good. I'll make a draft PR with the changes.

Contributor

druvdub commented Mar 4, 2024

Hey @Wauplin, going through the code I found an HfFileMetadata class here. Would that be something we'd like to return, or something else? I feel a new class might end up duplicating it, so we could either reduce it to something generic or potentially extend this class.

Contributor

Wauplin commented Mar 6, 2024

Hi @druvdub, thanks for looking into it! Actually, I think it's best to start from a separate dataclass altogether, even if that means you end up with duplicated fields. We tried to play with inheritance in the past for those situations, but it always came with drawbacks. Since we need to document all the attributes in the docstring anyway, we would not gain much. So let's have a dry-run-specific class :)

3 participants