Meta CLIP

FAIR, Meta

arXiv · Hugging Face Collection · Open In Colab · Hugging Face Spaces

After years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:

  • large-scale non-English data curation pipelines are largely undeveloped;
  • the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.

With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.

Updates

Quick Start

The pre-trained MetaCLIP models can be loaded either with mini_clip (this repo) or via Hugging Face transformers.

mini_clip (this repo)
```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer


model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenize = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
Hugging Face

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel


# Meta CLIP 1
processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

# Meta CLIP 2
# model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
# processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```
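
Because the worldwide checkpoints are trained on multilingual data, the same zero-shot recipe also works with non-English prompts. Below is a minimal sketch (not from the official docs): the model id is the Meta CLIP 2 checkpoint referenced in the commented lines above, and the multilingual labels are illustrative choices of ours.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Meta CLIP 2 worldwide checkpoint (same id as in the commented lines above).
model_id = "facebook/metaclip-2-worldwide-huge-quickgelu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("docs/CLIP.png")
# Non-English prompts: "a diagram" (German), "a dog" (Chinese), "a cat" (Spanish).
texts = ["ein Diagramm", "一只狗", "un gato"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    text_probs = outputs.logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```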

Pre-trained Models

Meta CLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to replace the weights), in order to promote rigorous ablation studies and advance scientific understanding, as in the old "era of ImageNet".
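
To illustrate the drop-in nature for Meta CLIP 1, here is a hedged sketch using the upstream OpenCLIP package, which ships MetaCLIP weights under pretrained tags such as `metaclip_400m` and `metaclip_fullcc` (tag availability depends on your installed open_clip_torch version; verify with `open_clip.list_pretrained()`).

```python
import torch
import open_clip  # pip install open_clip_torch
from PIL import Image

# "metaclip_400m" is an assumed pretrained tag; check open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="metaclip_400m")
tokenizer = open_clip.get_tokenizer("ViT-B-32-quickgelu")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    print((100.0 * image_features @ text_features.T).softmax(dim=-1))
```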

Meta CLIP 2

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-H-14-quickgelu-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 62.0 |

Meta CLIP 2 Distilled

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-S-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 46.9 |
| ViT-S-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 47.4 |
| ViT-S-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 42.8 |
| ViT-M-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.3 |
| ViT-M-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.7 |
| ViT-M-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.7 |
| ViT-B-32-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.1 |
| ViT-B-32-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.0 |
| ViT-B-32-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.4 |
| ViT-B-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 50.9 |
| ViT-B-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 51.5 |
| ViT-L-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 56.5 |
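
Note that the mT5 variants pair with the multilingual SigLIP tokenizer rather than XLM-V, so the tokenizer passed to `get_tokenizer()` must match the row in the table. A minimal sketch with mini_clip (config names taken from the table above; untested illustration):

```python
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

# mT5 distilled variant: use the tokenizer listed for this row of the table.
model, _, preprocess = create_model_and_transforms(
    "ViT-B-32-mT5-worldwide@mT5WorldWideCLIP", pretrained="metaclip2_worldwide")
tokenize = get_tokenizer("google/siglip-so400m-patch16-256-i18n")
```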

Meta CLIP 1

| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | 128 x V100 | 76.2 |
| ViT-B-32-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 67.6 |
| ViT-B-16-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 72.1 |
| ViT-L-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 128 x V100 | 79.2 |
| ViT-H-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 80.5 |
| ViT-bigG-14-quickgelu (v1.1) | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 82.1 |
| ViT-H-14 (v1.2) | metaclip_v1_2_altogether | Online Curation | 35B | 224 | 256 x H100 | 82.0 |

Environment

This code is customized from OpenCLIP and will be maintained separately for research on MetaCLIP. The following command installs the requirements for OpenCLIP plus submitit=1.2.1, which is used by this repo:

```bash
conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda
```

Curation

See MetaCLIP 2 and MetaCLIP 1.

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Hu Xu ([email protected]).

Citation

Please cite the following papers if MetaCLIP helps your work:

```bibtex
@inproceedings{chuang2025metaclip2,
   title={Meta CLIP 2: A Worldwide Scaling Recipe},
   author={Yung-Sung Chuang and Yang Li and Dong Wang and Ching-Feng Yeh and Kehan Lyu and Ramya Raghavendra and James Glass and Lifei Huang and Jason Weston and Luke Zettlemoyer and Xinlei Chen and Zhuang Liu and Saining Xie and Wen-tau Yih and Shang-Wen Li and Hu Xu},
   journal={arXiv preprint arXiv:2507.22062},
   year={2025}
}

@inproceedings{xu2023metaclip,
   title={Demystifying CLIP Data},
   author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2309.16671},
   year={2023}
}

@inproceedings{xu2024altogether,
   title={Altogether: Image Captioning via Re-aligning Alt-text},
   author={Hu Xu and Po-Yao Huang and Xiaoqing Ellen Tan and Ching-Feng Yeh and Jacob Kahn and Christine Jou and Gargi Ghosh and Omer Levy and Luke Zettlemoyer and Wen-tau Yih and Shang-Wen Li and Saining Xie and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2410.17251},
   year={2024}
}

@inproceedings{ma2024mode,
  title={MoDE: CLIP Data Experts via Clustering},
  author={Jiawei Ma and Po-Yao Huang and Saining Xie and Shang-Wen Li and Luke Zettlemoyer and Shih-Fu Chang and Wen-Tau Yih and Hu Xu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```

Reference

The training code is developed based on OpenCLIP and modified to the vanilla CLIP training setup.

TODO

  • pip installation of the metaclip package;
  • refactor mini_clip with apps for MoDE and Altogether;
  • more updates for Meta CLIP 2: metadata, data loader, training code.

License

The majority of Meta CLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under its own license (https://github.com/mlfoundations/open_clip).

Acknowledgement

We gratefully acknowledge the OpenCLIP team for the initial CLIP codebase, and NielsRogge for the integration into Hugging Face.
