Meta CLIP

FAIR, Meta

arXiv · Hugging Face Collection · Open In Colab · Hugging Face Spaces

After years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:

  • large-scale non-English data curation pipelines are largely undeveloped;
  • the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.

With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.

Updates

Quick Start

The pre-trained MetaCLIP models can be loaded either with mini_clip (this repo) or via Hugging Face transformers.

mini_clip (this repo)
```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer


model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenize = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
Hugging Face

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel


# Meta CLIP 1
processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

# Meta CLIP 2
# model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
# processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```
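
Because the worldwide checkpoints are trained on multilingual data, the same zero-shot recipe also works with non-English prompts. Below is a minimal sketch (not from the official docs): the model id is the Meta CLIP 2 checkpoint referenced in the commented lines above, and the multilingual labels are illustrative choices of ours.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Meta CLIP 2 worldwide checkpoint (same id as in the commented lines above).
model_id = "facebook/metaclip-2-worldwide-huge-quickgelu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("docs/CLIP.png")
# Non-English prompts: "a diagram" (German), "a dog" (Chinese), "a cat" (Spanish).
texts = ["ein Diagramm", "一只狗", "un gato"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    text_probs = outputs.logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```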

Pre-trained Models

Meta CLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to replace the weights), in order to promote rigorous ablation studies and advance scientific understanding, as in the old "era of ImageNet".
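
To illustrate the drop-in nature for Meta CLIP 1, here is a hedged sketch using the upstream OpenCLIP package, which ships MetaCLIP weights under pretrained tags such as `metaclip_400m` and `metaclip_fullcc` (tag availability depends on your installed open_clip_torch version; verify with `open_clip.list_pretrained()`).

```python
import torch
import open_clip  # pip install open_clip_torch
from PIL import Image

# "metaclip_400m" is an assumed pretrained tag; check open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="metaclip_400m")
tokenizer = open_clip.get_tokenizer("ViT-B-32-quickgelu")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    print((100.0 * image_features @ text_features.T).softmax(dim=-1))
```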

Meta CLIP 2

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-H-14-quickgelu-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 62.0 |

Meta CLIP 2 Distilled

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-S-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 46.9 |
| ViT-S-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 47.4 |
| ViT-S-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 42.8 |
| ViT-M-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.3 |
| ViT-M-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.7 |
| ViT-M-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.7 |
| ViT-B-32-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.1 |
| ViT-B-32-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.0 |
| ViT-B-32-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.4 |
| ViT-B-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 50.9 |
| ViT-B-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 51.5 |
| ViT-L-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 56.5 |
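
Note that the mT5 variants pair with the multilingual SigLIP tokenizer rather than XLM-V, so the tokenizer passed to `get_tokenizer()` must match the row in the table. A minimal sketch with mini_clip (config names taken from the table above; untested illustration):

```python
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

# mT5 distilled variant: use the tokenizer listed for this row of the table.
model, _, preprocess = create_model_and_transforms(
    "ViT-B-32-mT5-worldwide@mT5WorldWideCLIP", pretrained="metaclip2_worldwide")
tokenize = get_tokenizer("google/siglip-so400m-patch16-256-i18n")
```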

Meta CLIP 1

| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | 128 x V100 | 76.2 |
| ViT-B-32-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 67.6 |
| ViT-B-16-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 72.1 |
| ViT-L-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 128 x V100 | 79.2 |
| ViT-H-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 80.5 |
| ViT-bigG-14-quickgelu (v1.1) | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 82.1 |
| ViT-H-14 (v1.2) | metaclip_v1_2_altogether | Online Curation | 35B | 224 | 256 x H100 | 82.0 |

Environment

This code is customized from OpenCLIP and will be maintained separately for research on MetaCLIP. The following command installs the requirements for OpenCLIP plus submitit=1.2.1, which is used by this repo:

```bash
conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda
```

Curation

See MetaCLIP 2 and MetaCLIP 1.

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Hu Xu ([email protected]).

Citation

Please cite the following papers if MetaCLIP helps your work:

```bibtex
@inproceedings{chuang2025metaclip2,
   title={Meta CLIP 2: A Worldwide Scaling Recipe},
   author={Yung-Sung Chuang and Yang Li and Dong Wang and Ching-Feng Yeh and Kehan Lyu and Ramya Raghavendra and James Glass and Lifei Huang and Jason Weston and Luke Zettlemoyer and Xinlei Chen and Zhuang Liu and Saining Xie and Wen-tau Yih and Shang-Wen Li and Hu Xu},
   journal={arXiv preprint arXiv:2507.22062},
   year={2025}
}

@inproceedings{xu2023metaclip,
   title={Demystifying CLIP Data},
   author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2309.16671},
   year={2023}
}

@inproceedings{xu2024altogether,
   title={Altogether: Image Captioning via Re-aligning Alt-text},
   author={Hu Xu and Po-Yao Huang and Xiaoqing Ellen Tan and Ching-Feng Yeh and Jacob Kahn and Christine Jou and Gargi Ghosh and Omer Levy and Luke Zettlemoyer and Wen-tau Yih and Shang-Wen Li and Saining Xie and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2410.17251},
   year={2024}
}

@inproceedings{ma2024mode,
  title={MoDE: CLIP Data Experts via Clustering},
  author={Jiawei Ma and Po-Yao Huang and Saining Xie and Shang-Wen Li and Luke Zettlemoyer and Shih-Fu Chang and Wen-Tau Yih and Hu Xu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```

Reference

The training code is developed based on OpenCLIP and modified to the vanilla CLIP training setup.

TODO

  • pip installation of the metaclip package;
  • refactor mini_clip with apps for MoDE and Altogether;
  • more updates for Meta CLIP 2: metadata, data loader, training code.

License

The majority of Meta CLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under its own license (https://github.com/mlfoundations/open_clip).

Acknowledgement

We gratefully acknowledge the OpenCLIP team for the initial CLIP codebase, and NielsRogge for the integration into Hugging Face.
