jsulz 
posted an update Dec 5, 2024
Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)
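The file-type totals in the last bullet can be approximated with a simple aggregation. This is a minimal sketch, assuming you already have (path, size in bytes) pairs for a repo; the `size_by_extension` helper and the sample listing are hypothetical, and this is not necessarily how the Space computes its numbers.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def size_by_extension(files):
    """Sum file sizes (bytes) per extension; `files` is an iterable of (path, size)."""
    totals = defaultdict(int)
    for path, size in files:
        # PurePosixPath.suffix is the last extension, e.g. ".safetensors";
        # files without one are grouped under "(none)".
        ext = PurePosixPath(path).suffix.lower() or "(none)"
        totals[ext] += size
    # Largest file types first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical listing, not taken from a real repo.
sample = [
    ("model-00001.safetensors", 5_000),
    ("model-00002.safetensors", 7_000),
    ("README.md", 300),
    ("LICENSE", 100),
]
totals = size_by_extension(sample)
```

On the Hub, a listing like `sample` could plausibly be built from `huggingface_hub`'s `HfApi.list_repo_tree(..., recursive=True)`, which yields entries with a `path` and, for files, a `size`.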

And because I'm interested in how this fits into our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co/blog/from-files-to-chunks), everything shows the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
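Converting bytes to chunks is simple arithmetic. The sketch below treats every chunk as exactly 64KB, as stated above; reading "64KB" as 65,536 bytes and rounding partial chunks up are assumptions, and `estimate_chunks` is a hypothetical name. Real content-defined chunks vary in size, so this is only an estimate.

```python
CHUNK_SIZE_BYTES = 64 * 1024  # assumption: "64KB" read as 65,536 bytes


def estimate_chunks(size_bytes, chunk_size=CHUNK_SIZE_BYTES):
    """Estimate how many fixed-size chunks cover `size_bytes`, rounding up."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Integer ceiling division avoids floating-point error on very large files.
    return (size_bytes + chunk_size - 1) // chunk_size


ten_gib = 10 * 1024**3
chunks_for_ten_gib = estimate_chunks(ten_gib)
```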

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and for fun laion/laion-audio-preview (which has nearly 10k .tar files 🤯)

Datasets are among my favorite repos to visualize because of their mixture of files and folder structures. Here's huggingface/documentation-images, where alongside documentation images we store images for the Hugging Face blog:

I also enjoy the wikimedia/wikipedia dataset. It's fascinating to see the distribution of bytes across languages.

Some datasets are actually quite difficult to visualize because the number of points in the Plotly graph causes the browser to crash on render. It's quite possible you'll run into this if you use the Space. A simple check for file count could help, but for now I find myself running it a few times just to see if I can grab the image. allenai is home to many such datasets, but I eventually found allenai/paloma, an eval dataset that I could visualize.
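One way such a file-count check could look: if a repo has too many entries to plot, merge deep paths into their parent directories until the count is manageable. This is only a sketch; the `MAX_NODES` threshold and both helper names are hypothetical, and the limit is not a measured Plotly or browser number. It counts leaf entries only, not the intermediate directory nodes a treemap also draws.

```python
from collections import defaultdict

MAX_NODES = 5_000  # hypothetical threshold, not a measured Plotly limit


def collapse_to_depth(files, depth):
    """Merge (path, size) pairs so no path has more than `depth` components."""
    merged = defaultdict(int)
    for path, size in files:
        parts = path.split("/")
        merged["/".join(parts[:depth])] += size
    return dict(merged)


def fit_to_limit(files, max_nodes=MAX_NODES):
    """Collapse deep paths until the number of entries is within `max_nodes`."""
    files = list(files)
    if not files:
        return {}
    depth = max(len(path.split("/")) for path, _ in files)
    while depth >= 1:
        merged = collapse_to_depth(files, depth)
        if len(merged) <= max_nodes:
            return merged
        depth -= 1
    # Even the top-level entries exceed the limit; return them anyway
    # and let the caller decide whether to render.
    return collapse_to_depth(files, 1)


sample = [("en/a.tar", 10), ("en/b.tar", 20), ("fr/a.tar", 5)]
```

With the sample above, a limit of 3 keeps every file, while a limit of 2 collapses the listing to one entry per top-level directory.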

For some of these larger datasets, I might run things locally and write the image out to see if there are any interesting findings.
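Running locally and writing the image to disk might look roughly like this. It is a sketch under stated assumptions rather than the Space's actual code: `treemap_nodes` and `write_treemap` are hypothetical helpers, the input is a list of (path, size) pairs, and static image export via `fig.write_image` needs the `kaleido` package installed alongside Plotly.

```python
def treemap_nodes(files):
    """Build ids, labels, parents, and values for a treemap from (path, size) pairs."""
    sizes = {}  # node id (full path) -> cumulative bytes
    for path, size in files:
        parts = path.split("/")
        # Credit the file's bytes to itself and to every ancestor directory.
        for i in range(1, len(parts) + 1):
            node = "/".join(parts[:i])
            sizes[node] = sizes.get(node, 0) + size
    ids = list(sizes)
    labels = [node.rsplit("/", 1)[-1] for node in ids]
    # Top-level nodes have the empty string as their parent.
    parents = [node.rsplit("/", 1)[0] if "/" in node else "" for node in ids]
    values = [sizes[node] for node in ids]
    return ids, labels, parents, values


def write_treemap(files, out_path="treemap.png"):
    """Render the treemap to an image file (requires plotly and kaleido)."""
    # Imported lazily so treemap_nodes() works without Plotly installed.
    import plotly.graph_objects as go

    ids, labels, parents, values = treemap_nodes(files)
    fig = go.Figure(
        go.Treemap(
            ids=ids,
            labels=labels,
            parents=parents,
            values=values,
            branchvalues="total",  # each directory's value already includes its children
            marker=dict(colors=values, colorscale="Viridis"),  # larger nodes trend yellow
        )
    )
    fig.write_image(out_path)


# Hypothetical listing, not taken from a real repo.
sample = [("en/a.tar", 10), ("en/b.tar", 20), ("README.md", 1)]
ids, labels, parents, values = treemap_nodes(sample)
```

Because the image is written to a file instead of rendered in the browser, a large point count slows the export but does not crash a tab.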

I thought big and complex repos would be fun to visualize, and they can be! This image is from blanchon/RESISC45, a repo with 31,500 images from Google Earth, each bucketed into one of 45 classes with 700 images per class:

Screenshot 2024-12-06 at 9.58.49 AM.png

But more fun is when you find a repository that is structured (naming conventions and directories) in a way that lets you see the inequity in the bytes.

This is most apparent in multilingual NLP datasets, similar to the wikimedia/wikipedia dataset. If you zoom in on any of these (or run them yourself in the Space), you'll see a directory or file naming convention that uses the language abbreviation. The closer a directory or file is to yellow, the more bytes are devoted to that language.
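The same imbalance can be put into numbers. The sketch below assumes the simplest naming convention, where the first path component is the language abbreviation; real datasets differ, so the split logic would need adapting per repo. The `language_shares` helper and the sample sizes are hypothetical.

```python
from collections import Counter


def language_shares(files):
    """Return {language: fraction of total bytes}, largest share first.

    Assumes the first path component of each (path, size) pair is the language code.
    """
    totals = Counter()
    for path, size in files:
        lang = path.split("/", 1)[0]
        totals[lang] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: size / grand_total for lang, size in totals.most_common()}


# Hypothetical sizes, not measured from any real dataset.
sample = [
    ("en/train.parquet", 750),
    ("sw/train.parquet", 50),
    ("de/train.parquet", 200),
]
shares = language_shares(sample)
```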

Here's facebook/multilingual_librispeech:

newplot (28).png

and mozilla-foundation/common_voice_17_0:

newplot (29).png

and google/xtreme:

newplot (30).png

and uonlp/CulturaX:

newplot (31).png

Each dataset shows some imbalance in the languages represented, and this pattern holds true for other types of datasets as well. However, such discrepancies can be harder to spot when folder or file naming conventions prioritize machine over human readability.

Another fun example is the nguha/legalbench dataset, designed to evaluate legal reasoning in LLMs. It provides a clear view of the types of reasoning being tested:

newplot (32).png

You might have to squint to see the labels, though, so this is one where it might be best to head over to the Space (https://huggingface.co/spaces/jsulz/repo-info) and see it for yourself ;)
