Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting the pickle analysis data from within Python #1212

Open
JohnnyRacer opened this issue Nov 21, 2022 · 13 comments
Open

Getting the pickle analysis data from within Python #1212

JohnnyRacer opened this issue Nov 21, 2022 · 13 comments

Comments

@JohnnyRacer
Copy link

Hello, I am trying to obtain the pickle analysis data that is associated with pickled model checkpoints from Python programmatically but is unable to find anything in the docs on how to do this. Is there a clear way to do this?

A screenshot of the example of the analysis information is attached below.
pickle

@julien-c
Copy link
Member

it is not currently supported, but cc'ing @McPatate (and @krampstudio for visibility)

thanks for the feature request!

@Wauplin
Copy link
Contributor

Wauplin commented Apr 26, 2023

Hey @JohnnyRacer, we are finally getting back to you about your feature request... and with good news!

The security scan details are now returned by the server on the /tree endpoint with expand=True as parameter. For example, https://huggingface.co/api/models/distilgpt2/tree/main?expand=True returns:

2023-04-26_16-43

Results are paginated (50 files per page) which mean you need to iterate over the "next" links header.

An implementation is in progress in huggingface_hub (see PR: #1451) so you can also benefit from it by installing the package from source. Please let me know what you think about this API :)

@JohnnyRacer
Copy link
Author

@Wauplin Thanks a lot for adding it as a new feature, just tested out the new API and it works very well! I was wondering if in the future a user downloading a potential malicious Pickle file would raise a confirmation warning to make it clear to them the risks before allowing the model to load? This seems like a good addition now that the scan data is available via the API.

@Wauplin
Copy link
Contributor

Wauplin commented Apr 26, 2023

I was wondering if in the future a user downloading a potential malicious Pickle file would raise a confirmation warning to make it clear to them the risks before allowing the model to load?

That's actually a pretty good idea! I'll keep it in mind to do it properly. I think that if we do something like that, it will be a warning but without asking confirmation from the user (or at least not by default). Always a balance to find between usability and security :)

@McPatate
Copy link
Member

I think having an extra option disable_unsafe_download (defaulting to False) could be a nice to have 😄

@Wauplin
Copy link
Contributor

Wauplin commented Apr 27, 2023

Yes, I was thinking about an environment variable directly so that users can be sure to disable them globally.

@Wauplin
Copy link
Contributor

Wauplin commented May 8, 2023

@McPatate before implementing this, what's your take on how to consider a file as "unsafe". I saw that the security scan has 3 levels: innocuous, suspicious and dangerous.

I see four options here:

  1. set HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1 raise an Exception on dangerous files but not suspicious ones
  2. set HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1 raise an Exception on dangerous files AND suspicious ones
  3. set HF_HUB_DISABLE_DANGEROUS_DOWNLOAD=1 and HF_HUB_DISABLE_SUSPICIOUS_DOWNLOAD=1 separately. If "disable suspicious" is True, "disable dangerous" is automatically True as well => more explicit and flexible but it adds two env variables instead of one.
  4. set HF_HUB_UNSAFE_DOWNLOAD_LEVEL (or HF_HUB_SAFETY_LEVEL?) to either "minimal" (default), "intermediate" or "maximal" => as much flexibility as 3. but with only 1 variable (caveat: I'd prefer a better naming if anyone has a suggestion)

For me it really depends on what suspicious means and when it happens. IMO if a user sets HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1, I'd rather forbid any non-innocuous download (i.e. option 2). Otherwise if flexibility is important, I'd go for option 4 but with a better naming.

Also cc @JohnnyRacer if you are opinionated on this topic.

@JohnnyRacer
Copy link
Author

@Wauplin

I think it would be a good idea to prevent downloads of dangerous Pickle checkpoints by default, with a warning message that can be disabled via an environment variable like you had mentioned above or by calling a function like transformers.env_configs.disable_unsafe_download(True) . To toggle the ability to allow for unsafe download on or off. I think a way to disable unsafe downloads is needed as it can be used conveniently as a malware dropper, especially when models are loaded onto production environments. For suspicious models I think a toggle-able warning will suffice since most do not contain dangerous eval` which would allow for arbitrary code execution.

@julien-c
Copy link
Member

I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)

@McPatate
Copy link
Member

McPatate commented May 17, 2023

Does it cost a lot to send a request to the hub to check before starting a download if there are dangerous imports?

It could also be done asynchronously, as in no changes in the download logic, except you add a cancellation token to the download job and if the file contains a dangerous import, the asynchronous job cancels the download, wdyt @julien-c?

EDIT: nvm my comment, the solution I'm proposing doesn't change much to the additional load issue

@Wauplin
Copy link
Contributor

Wauplin commented May 17, 2023

I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)

That would definitely close this issue without additional load indeed. But that should rather live in libraries like transformers, diffusers,... isn't it? huggingface_hub is kinda agnostic on the content of the files it downloads.

@julien-c
Copy link
Member

But that should rather live in libraries

Yes, correct

@julien-c
Copy link
Member

or even in torch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants