Model Summary
NLLB-CLIP-SigLIP combines a text encoder from the NLLB model with an image encoder from the SigLIP model. This extends the model's capabilities to the 201 languages of Flores-200. NLLB-CLIP sets a new state of the art on the Crossmodal-3600 dataset, performing especially well on low-resource languages. You can find more details about the model in the paper.
This version performs much better than the standard version. You can see the results here and here.
NB: There is an even better version of this model available!
How to use
This model is integrated into OpenCLIP, so you can use it like any other OpenCLIP model:
!pip install -U open_clip_torch
from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch
model, transform = create_model_from_pretrained("nllb-clip-base-siglip", "v1", device="cuda")
tokenizer = get_tokenizer("nllb-clip-base-siglip")
class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]
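# Tokenize each label with the tokenizer set to that label's language code.
text_inputs = []
for i in range(len(class_options)):
    tokenizer.set_language(class_langs[i])
    text_inputs.append(tokenizer(class_options[i]))
# Stack the per-label token tensors into one batch and move it to the GPU.
text_inputs = torch.stack(text_inputs).squeeze(1).to("cuda")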
image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)
image_inputs = transform(image).unsqueeze(0).to("cuda")
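# Run the forward pass without tracking gradients.
with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)
# Softmax over the text options gives one probability per class label.
print(logits_per_image.softmax(dim=-1))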
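The printed tensor holds one probability per entry in class_options, in the same order. To read it more easily, the scores can be paired with their labels and sorted. The following is a minimal sketch using only the standard library. The helper name rank_labels and the logit values are made up for illustration and are not part of OpenCLIP; with the real model you would pass logits_per_image[0].tolist() instead.

```python
import math


def rank_labels(logits, labels):
    """Pair each label with its softmax probability, highest first.

    Hypothetical helper for illustration; not part of OpenCLIP.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration only; real values come from
# logits_per_image[0].tolist() in the example above.
example_logits = [2.0, 5.0, -1.0]
class_options = ["бабочка", "butterfly", "kat"]

ranking = rank_labels(example_logits, class_options)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

Because the softmax is taken across all text options, the probabilities sum to 1 for each image, so labels with the same meaning in different languages share probability mass.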
Acknowledgements
I thank ML Collective for providing Google Cloud compute resources to train the OpenCLIP-compatible version of NLLB-CLIP.