Model Details
Note: Use of this model is governed by the Meta license. To download the model weights and tokenizer, please visit the website and accept the Llama 2 Community License Agreement before requesting access here.
Model type:
LLaVA vision-language model trained on instruction-following data generated by an open-source (OSS) LLM.
Model state:
FireLLaVA 13B was trained in December 2023.
Paper or resources for more information:
How to use the model
The model is served on Fireworks.ai, and you can try it out here: https://app.fireworks.ai/models/fireworks/firellava-13b. API endpoints are also available, with instructions linked here: https://readme.fireworks.ai/docs/querying-vision-language-models
To run the model locally with the Hugging Face transformers library instead, follow the instructions below. First, make sure you have transformers >= 4.35.3 installed. The model supports multi-image and multi-prompt generation, meaning that you can pass multiple images in a single prompt. Follow the correct prompt template (USER: xxx\nASSISTANT:) and place the <image> token wherever you want the model to look at an image. Note, however, that performance may degrade when the input contains multiple images, since the model was not trained on multi-image inputs.
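As a minimal sketch of the prompt template described above, the helper below assembles a single-turn prompt. The function name `build_prompt` and its `num_images` parameter are hypothetical and not part of transformers. The spacing (a blank line before ASSISTANT:) mirrors the example prompts in this card, and putting every <image> token on its own line before the question is an assumption about how to lay out multi-image prompts.

```python
def build_prompt(question: str, num_images: int = 1) -> str:
    """Format a single-turn prompt in the USER/ASSISTANT template.

    Hypothetical helper: the layout follows the example prompts in this
    card; stacking several <image> tokens before the question is an
    assumption, not a documented requirement.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # One <image> placeholder per image, each on its own line.
    image_tokens = "\n".join(["<image>"] * num_images)
    return f"USER: {image_tokens}\n{question}\n\nASSISTANT:"


prompt = build_prompt("What is this?")
print(repr(prompt))  # 'USER: <image>\nWhat is this?\n\nASSISTANT:'
```

The resulting string can be passed as the prompt in either of the two usage styles below.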
Using pipeline
from transformers import pipeline
from PIL import Image
import requests
model_id = "fireworks-ai/FireLLaVA-13b"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
# Example output:
# [{'generated_text': 'USER: \nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT: Volkswagen'}]
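The pipeline output shown above echoes the prompt before the model's reply. If only the reply is needed, one possible approach is to split on the last ASSISTANT: marker. The helper name `extract_answer` is hypothetical, and the sample string is copied from the example output above rather than produced by running the model here.

```python
def extract_answer(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last occurrence of the marker.

    Hypothetical post-processing helper; if the marker is missing,
    the whole (stripped) text is returned unchanged.
    """
    # rpartition splits on the final marker, so earlier turns are ignored.
    _, sep, answer = generated_text.rpartition(marker)
    return answer.strip() if sep else generated_text.strip()


# Sample copied from the pipeline output shown in this card.
outputs = [{"generated_text": "USER: \nWhat is the make of the car? "
            "Answer with one word or phrase.\n\nASSISTANT: Volkswagen"}]
print(extract_answer(outputs[0]["generated_text"]))  # Volkswagen
```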
Using pure transformers
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "fireworks-ai/FireLLaVA-13b"
prompt = "USER: <image>\nWhat is this?\n\nASSISTANT:"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
# Load the model in half precision and move it to GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
raw_image = Image.open(requests.get(url, stream=True).raw)
# Keyword arguments avoid relying on the processor's positional argument order
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
# Example output:
# "This is an early Volkswagen Beetle car, also known as a VW bug, parked on a brick street and next to a building with doors ..."