Exploring Typo Squatting Threats in the Hugging Face Ecosystem

Project Overview

This repository contains the experiments for a method that detects impersonation attacks, specifically typosquatting, in model hubs such as Hugging Face. The study covers three kinds of hub entities: models, datasets, and organizations. A different technique is applied to each, drawing on Levenshtein distance, the SequenceMatcher class from Python's difflib package, and quantitative analysis.
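As a rough illustration of the two string measures named above, the following sketch computes a Levenshtein distance and a `SequenceMatcher` ratio for a pair of names. It is not the code used in this repository: the helper names (`levenshtein`, `similarity_ratio`, `looks_like_typosquat`) and the thresholds (`max_distance=2`, `min_ratio=0.85`) are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """difflib similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def looks_like_typosquat(candidate: str, target: str,
                         max_distance: int = 2,
                         min_ratio: float = 0.85) -> bool:
    """Flag `candidate` if it is close to, but not the same as, `target`.
    The thresholds are illustrative assumptions, not values from the study."""
    if candidate == target:
        return False
    a, b = candidate.lower(), target.lower()
    return levenshtein(a, b) <= max_distance or similarity_ratio(a, b) >= min_ratio
```

For example, `looks_like_typosquat("bert-base-uncsaed", "bert-base-uncased")` flags the transposed name, while an unrelated name such as `"gpt2"` is not flagged. A flagged name is only a candidate for review; on its own, string similarity says nothing about intent.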

Contribution

This project represents the first systematic examination of naming-based vulnerabilities in AI model repositories, uncovering a widespread issue of typosquatting within the Hugging Face ecosystem. For models and datasets, the analysis looked for squatting names that target the top 100 most downloaded models and the top 100 trending datasets. At the organization level, our research covered all organizations on Hugging Face. We compiled a table listing the models, datasets, and organizations that exhibit potentially malicious behavior, which underlines the need for stronger governance and security measures within AI model hubs. All suspicious cases have been reported to Hugging Face for further investigation.

Research objects

Models
Datasets
Organizations

Repository Structure

dataset/: Contains the data used in this study: information on models, datasets, and organizations gathered from Hugging Face.
result/: Contains models, datasets, and organizations exhibiting potentially malicious behavior.
similarity_caculation/:
- similarity_caculation_model: Research on the top 100 most downloaded models
- similarity_caculation_dataset: Research on the top 100 trending datasets
- similarity_caculation_organization: Research on all organizations
similarity_analysis/:
- similarity_analysis_org: A quantitative analysis of organizations
README.md: Project overview, research context, and usage instructions
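The organization-level step compares a large pool of names against known ones. A minimal sketch of that kind of screening, using `difflib.get_close_matches` from the standard library, might look like the following. The function name `screen_names`, the `cutoff` of 0.8, and all organization names in the usage example are hypothetical and are not taken from this repository.

```python
from difflib import get_close_matches


def screen_names(candidates, targets, cutoff=0.8):
    """Map each target name to the candidate names that closely resemble it.

    `cutoff` is the minimum difflib similarity ratio; 0.8 is an
    illustrative assumption, not a value from the study.
    """
    target_set = {t.lower() for t in targets}
    # Names identical to a target are the legitimate entries, so skip them.
    pool = sorted({c.lower() for c in candidates} - target_set)
    hits = {}
    for target in targets:
        matches = get_close_matches(
            target.lower(), pool, n=len(pool) or 1, cutoff=cutoff
        )
        if matches:
            hits[target] = matches
    return hits


# Hypothetical organization names, for illustration only.
known_orgs = ["acme-ai", "example-lab"]
all_orgs = ["acme-ai", "acme-al", "examp1e-lab", "unrelated-org"]
flagged = screen_names(all_orgs, known_orgs)
```

Here `flagged` maps `"acme-ai"` to `["acme-al"]` and `"example-lab"` to `["examp1e-lab"]`. Names surfaced this way would still need manual or quantitative review before being treated as squatting.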

Data Collection

Data for this project was gathered from Hugging Face and covers models, datasets, and organizations.

Key Findings

Models: Our analysis reveals 1,574 squatting models targeting the top 100 most downloaded models, with 10.4% exhibiting suspicious behaviors and potential malicious intent through deceptive naming patterns and harmful content manipulation.
Datasets: We discovered 625 cases of typosquatting, of which 42.2% demonstrated clear signs of intentional impersonation through misleading metadata and content similarities.
Organizations: We identified 302 instances of squatting behavior, among which 4.8% exhibited explicit malicious intent through active impersonation and deceptive practices, while others showed patterns of preemptive name registration for potential future exploitation.

Contributing

Contributions aimed at enhancing detection methods, expanding datasets, or providing feedback are highly encouraged. Interested contributors are invited to submit a pull request or contact the repository maintainers directly.
