Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions docs/hub/repositories-getting-started.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,9 @@ This document shows how to handle repositories through the web interface as well

If you do not have `git` available as a CLI command yet, you will need to [install Git](https://git-scm.com/downloads) for your platform. You will also need to [install Git LFS](https://git-lfs.github.com/), which will be used to handle large files such as images and model weights.

> [!TIP]
> For improved upload and download speeds when working with large files and Git, install the [Git Xet](xet/using-xet-storage#git) extension.

To be able to push your code to the Hub, you'll need to authenticate somehow. The easiest way to do this is by installing the [`huggingface_hub` CLI](https://huggingface.co/docs/huggingface_hub/index) and running the login command:

```bash
Expand Down
2 changes: 1 addition & 1 deletion docs/hub/xet/deduplication.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## Deduplication
# Deduplication

Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded.

Expand Down
2 changes: 1 addition & 1 deletion docs/hub/xet/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Instead, on the Hub, these large files are tracked using "pointer files" and ide

Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see [Backwards Compatibility & Legacy](./legacy-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS.

### Open Source Xet Protocol
## Open Source Xet Protocol

If you are looking to understand the underlying Xet protocol or are looking to build a new client library to access Xet Storage, check out the [Xet Protocol Specification](https://huggingface.co/docs/xet/index).

Expand Down
2 changes: 1 addition & 1 deletion docs/hub/xet/legacy-git-lfs.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## Backward Compatibility with LFS
# Backward Compatibility with LFS

Uploads from legacy / non‑Xet‑aware clients still follow the standard Git LFS path, even if the repo is already Xet-backed. Once the file is uploaded to LFS, a background process automatically migrates the file to using Xet storage. The Xet architecture provides backwards compatibility for legacy clients downloading files from Xet-backed repos by offering a Git LFS bridge. While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file, a legacy client will get a single URL from the bridge which does the work of reconstructing the request file and returning the URL to the resource. This allows downloading files through a URL so that you can continue to use the Hub's web interface or `curl`. By having LFS file uploads automatically migrate and having older clients continue to download files from Xet-backed repositories, maintainers and the rest of the Hub can update their pipelines at their own pace.

Expand Down
2 changes: 1 addition & 1 deletion docs/hub/xet/overview.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## Xet History & Overview
# Xet History & Overview

[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub.

Expand Down
2 changes: 1 addition & 1 deletion docs/hub/xet/security.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## Security Model
# Security Model

Xet storage provides data deduplication over all chunks stored in Hugging Face. This is done via cryptographic hashing in a privacy sensitive way. The contents of chunks are protected and are associated with repository permissions, i.e. you can only read chunks which are required to reproduce files you have access to, and no more.

Expand Down
102 changes: 94 additions & 8 deletions docs/hub/xet/using-xet-storage.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
## Using Xet Storage
# Using Xet Storage

## Python

To access a Xet-aware version of the `huggingface_hub`, simply install the latest version:

Expand All @@ -24,19 +26,103 @@ To see more detailed usage docs, refer to the `huggingface_hub` docs for:
- [Download](https://huggingface.co/docs/huggingface_hub/guides/download#hfxet)
- [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet)

### Recommendations
## Git

Git users can access the benefits of Xet by downloading and installing the Git Xet extension. Once installed, simply use the [standard workflows for managing Hub repositories with Git](../repositories-getting-started) - no additional changes necessary.

### Prerequisites

Install [Git](https://git-scm.com/) and [Git LFS](https://git-lfs.com/).

### Install on macOS or Linux (amd64 or aarch64)

Install using an installation script with the following command in your terminal (requires `curl` and `unzip`):
```
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh
```
Or, install using [Homebrew](https://brew.sh/), with the following [tap](https://docs.brew.sh/Taps) (direct `brew install` coming soon):
```
brew tap huggingface/tap
brew install git-xet
```

To verify the installation, run:
```
git-xet --version
```
### Windows (amd64)

Using an installer:
- Download `git-xet-windows-installer-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.1.0/git-xet-windows-installer-x86_64.zip)) and unzip.
- Run the `msi` installer file and follow the prompts.

Manual installation:
- Download `git-xet-windows-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.1.0/git-xet-windows-x86_64.zip)) and unzip.
- Place the extracted `git-xet.exe` under a `PATH` directory.
- Run `git-xet install` in a terminal.

To verify the installation, run:
```
git-xet --version
```

### Using Git Xet

Once installed on your platform, using Git Xet is as simple as following the Hub's standard Git workflows. First, make sure Git LFS is initialized in your local version of the repository:

```
git lfs install
```
Then make any additions you might want, commit your changes, and `push` your commit to the Hub:

```
# Create any files you like! Then...
git add .
git commit -m "Uploading new models" # You can choose any descriptive message
git push
```
Under the hood, the [Xet protocol](https://huggingface.co/docs/xet/index) is invoked to upload large files directly to Xet storage, increasing upload speeds through the power of [chunk-level deduplication](./deduplication).

### Uninstall on macOS or Linux

Using Homebrew:
```
git-xet uninstall
brew uninstall git-xet
```
If you used the installation script (for MacOS or Linux), run the following in your terminal:
```
git-xet uninstall
sudo rm $(which git-xet)
```
### Uninstall on Windows

If you used the installer:
- Navigate to Settings -> Apps -> Installed apps
- Find "Git-Xet".
- Select the "Uninstall" option available in the context menu.

If you manually installed:
- Run `git-xet uninstall` in a terminal.
- Delete the `git-xet.exe` file from the location where it was originally placed.

## Recommendations

Xet integrates seamlessly with all of the Hub's workflows. However, there are a few steps you may consider to get the most benefits from Xet storage.

When uploading or downloading with Python:

- **Make sure `hf_xet` is installed**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to optimize downloads and uploads.

Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage:
When uploading or downloading in Git or Python:

- **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads.
- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
- **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage.
- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need.

### Current Limitations
## Current Limitations

While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository:

- **64-bit systems only**: The `hf_xet` client currently requires a 64-bit architecture; 32-bit systems are not supported.
- **Git client integration (git-xet)**: Under active development - coming soon, stay tuned!
- **64-bit systems only**: The `hf_xet` client currently requires a 64-bit architecture; 32-bit systems are not supported.