Deepcopy of HfFileSystem fails due to non-picklable HfHubHTTPError response arg #3576

@owenowenisme

Describe the bug

When deep-copying an HfFileSystem that has an error cached in _repo_and_revision_exists_cache, the whole deepcopy fails because the cached HfHubHTTPError does not implement __reduce_ex__ correctly.

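__reduce_ex__ returns the response in the third element of the tuple, which copy and pickle treat as object state to apply after construction, not as constructor keyword arguments. The constructor is therefore called with the message only, and because response is a keyword-only argument without a default value, this raises: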
TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'

This is fatal whenever an HfFileSystem instance needs to be serialized.

  • errors.py
class HfHubHTTPError(HTTPError, OSError):

    def __init__(
        self,
        message: str,
        *,
        response: Response,
        server_message: Optional[str] = None,
    ):
        self.request_id = response.headers.get("x-request-id") or response.headers.get("X-Amzn-Trace-Id")
        self.server_message = server_message
        self.response = response
        self.request = response.request
        super().__init__(message)

    def __reduce_ex__(self, protocol):
        """Fix pickling of Exception subclass with kwargs. We need to override __reduce_ex__ of the parent class"""
        return (self.__class__, (str(self),), {"response": self.response, "server_message": self.server_message})
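
The failure mode can be reproduced without huggingface_hub or requests. The sketch below is only an illustration, not the library's code or an agreed fix: FakeResponse, BrokenError, FixedError and _rebuild are hypothetical names. BrokenError mirrors the __reduce_ex__ shown above; FixedError shows one possible way to forward the keyword arguments, by returning a module-level helper as the callable so that the kwargs travel inside the positional args.

```python
from copy import deepcopy
from typing import Optional


class FakeResponse:
    """Hypothetical stand-in for requests.Response (assumption, not the real class)."""

    def __init__(self, status_code: int):
        self.status_code = status_code


class BrokenError(Exception):
    """Mirrors the pattern from the issue: keyword-only `response`, state in 3rd slot."""

    def __init__(self, message: str, *, response: FakeResponse, server_message: Optional[str] = None):
        self.response = response
        self.server_message = server_message
        super().__init__(message)

    def __reduce_ex__(self, protocol):
        # The third item is *state*: it is applied after cls(*args) has returned,
        # so the constructor is called as cls(message) without `response`.
        return (self.__class__, (str(self),), {"response": self.response, "server_message": self.server_message})


def _rebuild(cls, message, kwargs):
    """Hypothetical helper: rebuild the exception with its keyword arguments."""
    return cls(message, **kwargs)


class FixedError(BrokenError):
    """One possible fix: pass the kwargs through the positional args of a helper."""

    def __reduce_ex__(self, protocol):
        kwargs = {"response": self.response, "server_message": self.server_message}
        return (_rebuild, (self.__class__, str(self), kwargs))


if __name__ == "__main__":
    try:
        deepcopy(BrokenError("404 Client Error.", response=FakeResponse(404)))
    except TypeError as exc:
        print("BrokenError:", exc)

    copied = deepcopy(FixedError("404 Client Error.", response=FakeResponse(404), server_message="Not Found"))
    print("FixedError copied:", copied, copied.response.status_code, copied.server_message)
```

With this approach deepcopy also copies the kwargs dict, so the copied exception holds its own copy of the response. Whether a real requests.Response can be copied or pickled in every situation is a separate question that this sketch does not address.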

Reproduction

To keep the repro script minimal, we deepcopy only the _repo_and_revision_exists_cache, as HfFileSystem does.

# test_hf_cloudpickle_bug.py
from copy import deepcopy
from huggingface_hub import HfFileSystem
from huggingface_hub.utils import RepositoryNotFoundError
from requests import Response, Request

# Mock an error
resp = Response()
resp.status_code = 404
resp.url = "https://huggingface.co/api/datasets/rotten_tomatoes/test.parquet"
resp.request = Request("GET", "https://huggingface.co/api/datasets/rotten_tomatoes/test.parquet")
resp._content = b'{"error": "Repository Not Found"}'

err = RepositoryNotFoundError(
    "404 Client Error. Repository Not Found.",
    response=resp,
    server_message="Repository Not Found",
)

fs = HfFileSystem()
# Simulate the error in cache
fs._repo_and_revision_exists_cache = {
    ("dataset", "rotten_tomatoes/test.parquet", None): (False, err),
}

# Now try to deepcopy the cache: this is exactly what _get_instance_state does.
cache_copy = deepcopy(fs._repo_and_revision_exists_cache)  # <- expected to fail on buggy behavior

Logs

❯ python test_hf.py
Traceback (most recent call last):
  File "/Users/youchenglin/ray/test_hf.py", line 30, in <module>
    cache_copy = deepcopy(fs._repo_and_revision_exists_cache)  # <- expected to fail on buggy behavior
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 265, in _reconstruct
    y = func(*args)
TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'

System info

- huggingface_hub version: 1.1.5
- python 3.10.19

Labels

- bug