UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation
In Python, JavaScript, and Swift


The uform3-image-text-multilingual-base UForm model is a tiny vision and multilingual language encoder that covers 21 languages and maps text and images into a shared vector space. It produces embeddings of up to 256 dimensions and consists of:

  • Text encoder: 12-layer BERT for up to 50 input tokens.
  • Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

Unlike most CLIP-like multimodal models, this model shares 4 layers between the text and visual encoders, which allows for more data- and parameter-efficient training. Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code. If you need a larger, more accurate, or multilingual model, check our HuggingFace Hub. For more details on running the model, check out the UForm GitHub repository.

Evaluation

For all evaluations, the multimodal part was used unless otherwise stated.

Monolingual

Dataset            Recall@1   Recall@5   Recall@10
Zero-Shot Flickr   0.558      0.813      0.874
MS-COCO ¹          0.401      0.680      0.781

¹ Note that the MS-COCO train split was present in the training data.
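
As a reminder of what the Recall@K columns measure, here is a minimal Python sketch. It assumes a retrieval setup in which every query has exactly one correct candidate, stored at the same index as the query. The function name recall_at_k and the toy similarity matrix are illustrative only; they are not part of UForm or of the evaluation code behind the table.

```python
def recall_at_k(similarities, k):
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    hits = 0
    for query_index, row in enumerate(similarities):
        # Candidate indices ordered from most to least similar to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarities)


# Toy 3x3 matrix: row i holds the similarity of query i to each candidate.
toy_similarities = [
    [0.9, 0.1, 0.3],
    [0.8, 0.7, 0.2],
    [0.1, 0.6, 0.5],
]

print(recall_at_k(toy_similarities, 1))
print(recall_at_k(toy_similarities, 2))
```

In the toy matrix only the first query ranks its own candidate first, while all three place it within the top two, so the score rises as K grows, just as it does across the columns of the table.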

Multilingual

Recall@10 on the XTD-10 dataset:

English   German   Spanish   French   Italian   Russian   Japanese   Korean   Turkish   Chinese   Polish
96.1      93.5     95.7      94.1     94.4      90.4      90.2       91.3     95.2      93.8      95.8

Recall@1, Recall@5, and Recall@10 on the COCO-SM dataset:

Target Language        OpenCLIP @ 1   UForm @ 1   OpenCLIP @ 5   UForm @ 5   OpenCLIP @ 10   UForm @ 10   Speakers
Arabic                 22.7           31.7        44.9           57.8        55.8            69.2         274 M
Armenian               5.6            22.0        14.3           44.7        20.2            56.0         4 M
Chinese                27.3           32.2        51.3           59.0        62.1            70.5         1'118 M
English                37.8           37.7        63.5           65.0        73.5            75.9         1'452 M
French                 31.3           35.4        56.5           62.6        67.4            73.3         274 M
German                 31.7           35.1        56.9           62.2        67.4            73.3         134 M
Hebrew                 23.7           26.7        46.3           51.8        57.0            63.5         9 M
Hindi                  20.7           31.3        42.5           57.9        53.7            69.6         602 M
Indonesian             26.9           30.7        51.4           57.0        62.7            68.6         199 M
Italian                31.3           34.9        56.7           62.1        67.1            73.1         67 M
Japanese               27.4           32.6        51.5           59.2        62.6            70.6         125 M
Korean                 24.4           31.5        48.1           57.8        59.2            69.2         81 M
Persian                24.0           28.8        47.0           54.6        57.8            66.2         77 M
Polish                 29.2           33.6        53.9           60.1        64.7            71.3         41 M
Portuguese             31.6           32.7        57.1           59.6        67.9            71.0         257 M
Russian                29.9           33.9        54.8           60.9        65.8            72.0         258 M
Spanish                32.6           35.6        58.0           62.8        68.8            73.7         548 M
Thai                   21.5           28.7        43.0           54.6        53.7            66.0         61 M
Turkish                25.5           33.0        49.1           59.6        60.3            70.8         88 M
Ukrainian              26.0           30.6        49.9           56.7        60.9            68.1         41 M
Vietnamese             25.4           28.3        49.2           53.9        60.3            65.5         85 M
Mean                   26.5±6.4       31.8±3.5    49.8±9.8       58.1±4.5    60.4±10.6       69.4±4.3     -
Google Translate       27.4±6.3       31.5±3.5    51.1±9.5       57.8±4.4    61.7±10.3       69.1±4.3     -
Microsoft Translator   27.2±6.4       31.4±3.6    50.8±9.8       57.7±4.7    61.4±10.6       68.9±4.6     -
Meta NLLB              24.9±6.7       32.4±3.5    47.5±10.3      58.9±4.5    58.2±11.2       70.2±4.3     -
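
The card does not say how the "Mean" row is computed. The sketch below checks one plausible reading for the UForm @ 1 column: the arithmetic mean over the 21 per-language scores, followed by the sample standard deviation. Using the sample (n - 1) estimator rather than the population one is an assumption.

```python
from statistics import mean, stdev

# UForm Recall@1 per language, copied from the table above (Arabic ... Vietnamese).
uform_recall_at_1 = [
    31.7, 22.0, 32.2, 37.7, 35.4, 35.1, 26.7,
    31.3, 30.7, 34.9, 32.6, 31.5, 28.8, 33.6,
    32.7, 33.9, 35.6, 28.7, 33.0, 30.6, 28.3,
]

average = mean(uform_recall_at_1)
# stdev() is the sample standard deviation (n - 1 in the denominator);
# choosing it over the population variant is an assumption.
spread = stdev(uform_recall_at_1)

print(f"{average:.1f}±{spread:.1f}")
```

Under this reading the script prints 31.8±3.5, which agrees with the UForm @ 1 entry of the "Mean" row.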

For a deeper comparison of output ranking, the following table reports the Normalized Discounted Cumulative Gain over the first 20 results (NDCG@20):

Language                      OpenCLIP NDCG   UForm NDCG
Arabic                        0.639           0.868
Armenian                      0.204           0.691
Chinese                       0.731           0.880
French                        0.823           0.932
German                        0.806           0.927
Hebrew                        0.657           0.791
Hindi                         0.616           0.879
Indonesian                    0.733           0.870
Italian                       0.811           0.930
Japanese                      0.737           0.885
Korean                        0.686           0.869
Persian                       0.667           0.831
Polish                        0.764           0.897
Portuguese                    0.832           0.897
Russian                       0.777           0.906
Spanish                       0.849           0.939
Thai                          0.606           0.822
Turkish                       0.701           0.898
Ukrainian                     0.704           0.851
Vietnamese                    0.697           0.818
Mean (all)                    0.716 ± 0.149   0.875 ± 0.064
Mean (Google Translate)       0.732 ± 0.145   0.869 ± 0.063
Mean (Microsoft Translator)   0.730 ± 0.149   0.869 ± 0.066
Mean (NLLB)                   0.686 ± 0.158   0.888 ± 0.064
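
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k in its common linear-gain form, where each result's relevance is discounted by the base-2 logarithm of its rank. How relevance was graded in this benchmark is not stated, so the helper names and the toy relevance lists are purely illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # enumerate() is 0-based, so the result at 1-based rank r is divided by log2(r + 1).
    return sum(rel / math.log2(index + 2) for index, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each returned result, in the order the system ranked them.
perfect_order = [3, 2, 1]
swapped_order = [1, 3]

print(ndcg_at_k(perfect_order, 20))
print(ndcg_at_k(swapped_order, 20))
```

A ranking that already lists results from most to least relevant scores 1.0, and any misordering pulls the score below 1.0.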

Installation

pip install "uform[torch,onnx]"

Usage

To load the model:

from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-multilingual-base'
# Request both the text and the image encoder
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
# get_model returns processors and models, each indexed by modality
processors, models = get_model(model_name, modalities=modalities)

# Pick out the encoder and its preprocessor for each modality
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

To encode the content:

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
# Download the image and decode it with Pillow
image = Image.open(BytesIO(requests.get(image_url).content))

# Preprocess the raw inputs into model-ready form
image_data = processor_image(image)
text_data = processor_text(text)
# return_features=True yields a (features, embedding) pair for each input
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
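
Because both encoders map into a shared vector space, a natural next step is to compare the two embeddings with cosine similarity. The sketch below is not UForm API. It assumes each embedding has already been converted to a flat sequence of floats (how to do that depends on the backend), and it uses short stand-in vectors in place of real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors; real embeddings from this model have up to 256 dimensions.
image_vector = [1.0, 0.0]
text_vector = [1.0, 1.0]

print(cosine_similarity(image_vector, text_vector))
```

Values closer to 1.0 indicate that the text and the image point in more similar directions in the shared space.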