Lets users query images with text. Users can select specific objects to ask about, which streamlines question-asking and removes the need to describe an object's position in the image with spatial language.
- Requires a CUDA-compatible GPU with a minimum of 8 GB VRAM
- Python >= 3.8
- Add your GPT-4 Vision API key to a `.env` file under the name `OPENAI_API_KEY`, so the app can load it at runtime:
  ```python
  from dotenv import load_dotenv

  load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
  ```
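For reference, what `load_dotenv()` does can be sketched with only the standard library. This helper is illustrative (not part of the repository); it shows the expected `.env` format of one `KEY=VALUE` pair per line:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv: copy KEY=VALUE lines into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

# A .env file containing a line like the following makes the key available:
# OPENAI_API_KEY=sk-...
```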
- Clone the repository:
  ```shell
  git clone https://github.com/shetumohanto/visual-question-answering.git
  cd visual-question-answering
  ```
- Download the checkpoint for the Segment Anything `vit_h` model:
  ```shell
  wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
  ```
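The `vit_h` checkpoint is large (roughly 2.4 GB), and an interrupted download is a common source of load errors. A small, hypothetical sanity check (not part of the repository) before launching the app might look like:

```python
from pathlib import Path

def checkpoint_ready(path: str, min_bytes: int = 2_000_000_000) -> bool:
    """Return True if the checkpoint file exists and looks fully downloaded.

    min_bytes is a rough lower bound: the vit_h weights are ~2.4 GB, so a
    much smaller file usually means the wget above was cut short.
    """
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes
```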
- Install the required dependencies:
  ```shell
  pip install -r requirements.txt
  ```
- Run Intelligent EYE:
  ```shell
  streamlit run app.py
  ```
This project uses the following technologies at its core:
- Segment Anything: promptable image segmentation model from Meta AI
- SoM: Set-of-Mark Visual Prompting for GPT-4V
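With Set-of-Mark prompting, the segmented objects are overlaid with numbered marks, and the question can then refer to those numbers instead of spatial descriptions. As a hedged sketch, one way such a request could be assembled for the OpenAI chat completions vision format (the helper itself is illustrative, not this repository's code, and the model name is an assumption):

```python
import base64

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Package a marked image plus a text question into a chat-completions payload."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    # image is sent inline as a base64 data URL
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }
        ],
    }

# Example: ask about the object the SoM overlay labeled "2".
payload = build_vision_request(b"\x89PNG...", "What is the object marked 2?")
```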