Getting the pickle analysis data from within Python #1212

JohnnyRacer · 2022-11-21T01:41:08Z

Hello, I am trying to programmatically obtain, from Python, the pickle analysis data associated with pickled model checkpoints, but I am unable to find anything in the docs on how to do this. Is there a clear way to do this?

A screenshot of the example of the analysis information is attached below.

julien-c · 2022-11-21T11:56:52Z

it is not currently supported, but cc'ing @McPatate (and @krampstudio for visibility)

thanks for the feature request!

Wauplin · 2023-04-26T14:47:19Z

Hey @JohnnyRacer, we are finally getting back to you about your feature request... and with good news!

The security scan details are now returned by the server on the /tree endpoint when expand=True is passed as a query parameter. For example, see the response of https://huggingface.co/api/models/distilgpt2/tree/main?expand=True.

Results are paginated (50 files per page), which means you need to iterate over the "next" links in the response's Link header.

An implementation is in progress in huggingface_hub (see PR: #1451) so you can also benefit from it by installing the package from source. Please let me know what you think about this API :)
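Until that lands, the pagination described above could be followed by hand. Here is a minimal sketch using only the standard library; it assumes each page is a JSON list and that the next page is advertised in a standard Link header with rel="next". The helper names (default_fetch, next_link, iter_tree) are made up for illustration and are not part of huggingface_hub.

```python
import json
import re
import urllib.request


def default_fetch(url):
    """Fetch one page and return (parsed JSON body, raw Link header or None)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp), resp.headers.get("Link")


def next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None if absent."""
    if not link_header:
        return None
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="?([^",;\s]+)"?', link_header):
        if match.group(2) == "next":
            return match.group(1)
    return None


def iter_tree(url, fetch=default_fetch):
    """Yield every entry of a paginated listing by following "next" links."""
    while url:
        entries, link_header = fetch(url)
        yield from entries
        url = next_link(link_header)


# Possible usage (performs network requests, so it is left as a comment):
# url = "https://huggingface.co/api/models/distilgpt2/tree/main?expand=True"
# for entry in iter_tree(url):
#     print(entry)
```

The fetch argument is injectable so the loop can be exercised without hitting the server.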

JohnnyRacer · 2023-04-26T16:44:15Z

@Wauplin Thanks a lot for adding it as a new feature, I just tested out the new API and it works very well! I was wondering if, in the future, downloading a potentially malicious Pickle file could raise a confirmation warning that makes the risks clear to the user before allowing the model to load? This seems like a good addition now that the scan data is available via the API.

Wauplin · 2023-04-26T16:57:45Z

> I was wondering if in the future a user downloading a potential malicious Pickle file would raise a confirmation warning to make it clear to them the risks before allowing the model to load?

That's actually a pretty good idea! I'll keep it in mind to do it properly. I think that if we do something like that, it will be a warning but without asking confirmation from the user (or at least not by default). Always a balance to find between usability and security :)

McPatate · 2023-04-26T21:11:26Z

I think having an extra option disable_unsafe_download (defaulting to False) could be a nice to have 😄

Wauplin · 2023-04-27T05:22:06Z

Yes, I was thinking about an environment variable directly so that users can be sure to disable them globally.

Wauplin · 2023-05-08T14:27:00Z

@McPatate before implementing this, what's your take on how to decide that a file is "unsafe"? I saw that the security scan has 3 levels: innocuous, suspicious and dangerous.

I see four options here:

1. Set HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1 to raise an exception on dangerous files but not on suspicious ones.
2. Set HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1 to raise an exception on dangerous files AND suspicious ones.
3. Set HF_HUB_DISABLE_DANGEROUS_DOWNLOAD=1 and HF_HUB_DISABLE_SUSPICIOUS_DOWNLOAD=1 separately. If "disable suspicious" is True, "disable dangerous" is automatically True as well => more explicit and flexible, but it adds two env variables instead of one.
4. Set HF_HUB_UNSAFE_DOWNLOAD_LEVEL (or HF_HUB_SAFETY_LEVEL?) to either "minimal" (default), "intermediate" or "maximal" => as much flexibility as option 3 but with only 1 variable (caveat: I'd prefer a better name if anyone has a suggestion).

For me it really depends on what suspicious means and when it happens. IMO if a user sets HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1, I'd rather forbid any non-innocuous download (i.e. option 2). Otherwise if flexibility is important, I'd go for option 4 but with a better naming.
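To make the single-variable option concrete, here is a rough sketch of how one variable could drive the decision. Everything in it is an assumption for discussion: the variable name HF_HUB_SAFETY_LEVEL is only one of the candidates named above, the mapping from level to blocked scan results is my reading of "as much flexibility as 3", and is_download_blocked is a hypothetical helper, not an existing huggingface_hub function.

```python
import os

# Assumed mapping: which scan results each safety level refuses to download.
BLOCKED_BY_LEVEL = {
    "minimal": frozenset(),                            # default: block nothing
    "intermediate": frozenset({"dangerous"}),          # block dangerous only
    "maximal": frozenset({"suspicious", "dangerous"}), # block anything non-innocuous
}


def is_download_blocked(scan_result, environ=None):
    """Return True if a file with this scan result should not be downloaded.

    scan_result is one of "innocuous", "suspicious" or "dangerous".
    environ defaults to os.environ and can be replaced for testing.
    """
    environ = os.environ if environ is None else environ
    level = environ.get("HF_HUB_SAFETY_LEVEL", "minimal").lower()
    if level not in BLOCKED_BY_LEVEL:
        raise ValueError(f"Unknown HF_HUB_SAFETY_LEVEL: {level!r}")
    return scan_result in BLOCKED_BY_LEVEL[level]
```

With this shape, option 2 corresponds to the "maximal" level and option 1 to "intermediate", so a single variable covers both behaviours.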

Also cc @JohnnyRacer if you are opinionated on this topic.

JohnnyRacer · 2023-05-16T19:46:47Z

@Wauplin

I think it would be a good idea to prevent downloads of dangerous Pickle checkpoints by default, with a warning message that can be disabled via an environment variable like the one you mentioned above, or by calling a function like transformers.env_configs.disable_unsafe_download(True) to toggle unsafe downloads on or off. I think a way to disable unsafe downloads is needed because a malicious checkpoint can conveniently serve as a malware dropper, especially when models are loaded in production environments. For suspicious models, I think a toggleable warning will suffice, since most do not contain a dangerous `eval`, which would allow arbitrary code execution.

julien-c · 2023-05-17T09:34:53Z

I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)
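As a rough illustration of that idea, a loading library could emit a warning based on the file name alone, with no extra request to the Hub. This is only a sketch: warn_if_pickle is a hypothetical helper, and the suffix list is an assumption (".bin" in particular is not always a pickle file).

```python
import warnings

# Assumed set of suffixes commonly used for pickle-based checkpoints.
PICKLE_SUFFIXES = (".bin", ".ckpt", ".pkl", ".pickle", ".pt", ".pth")


def warn_if_pickle(filename):
    """Warn and return True if the file name looks like a pickle checkpoint."""
    if filename.lower().endswith(PICKLE_SUFFIXES):
        warnings.warn(
            f"{filename} looks like a pickle-based checkpoint, which can execute "
            "arbitrary code when loaded. Prefer the safetensors format if available.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Since it relies on the standard warnings module, users could silence or escalate it with the usual warning filters.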

McPatate · 2023-05-17T10:48:24Z

Does it cost a lot to send a request to the hub to check before starting a download if there are dangerous imports?

It could also be done asynchronously, with no changes to the download logic except adding a cancellation token to the download job: if the file contains a dangerous import, the asynchronous job cancels the download. wdyt @julien-c?

EDIT: nvm my comment, the solution I'm proposing doesn't change much to the additional load issue

Wauplin · 2023-05-17T11:36:59Z

> I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)

That would definitely close this issue without additional load indeed. But that should rather live in libraries like transformers, diffusers, etc., shouldn't it? huggingface_hub is kinda agnostic about the content of the files it downloads.

julien-c · 2023-05-17T11:37:38Z

> But that should rather live in libraries

Yes, correct

julien-c · 2023-05-17T11:38:10Z

or even in torch

Wauplin mentioned this issue Apr 26, 2023

Add expand parameter to list_repo_files #1451

Merged
