Lets users query images with text. Users can select specific objects to ask about, which streamlines question-asking and removes the need to describe an object's position in the image with spatial language.
- Requires a CUDA-compatible GPU with a minimum of 8 GB VRAM
- Python >= 3.8
- Add your GPT-4 Vision API key to a `.env` file under the name `OPENAI_API_KEY`, so the app can load it at runtime:
  ```python
  from dotenv import load_dotenv

  load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
  ```
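For reference, what `load_dotenv()` does can be sketched with only the standard library. This helper is illustrative (not part of the repository); it shows the expected `.env` format of one `KEY=VALUE` pair per line:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv: copy KEY=VALUE lines into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

# A .env file containing a line like the following makes the key available:
# OPENAI_API_KEY=sk-...
```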
- Clone the repository:
  ```shell
  git clone https://github.com/shetumohanto/visual-question-answering.git
  cd visual-question-answering
  ```
- Download the checkpoint for the Segment Anything `vit_h` model:
  ```shell
  wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
  ```
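The `vit_h` checkpoint is large (roughly 2.4 GB), and an interrupted download is a common source of load errors. A small, hypothetical sanity check (not part of the repository) before launching the app might look like:

```python
from pathlib import Path

def checkpoint_ready(path: str, min_bytes: int = 2_000_000_000) -> bool:
    """Return True if the checkpoint file exists and looks fully downloaded.

    min_bytes is a rough lower bound: the vit_h weights are ~2.4 GB, so a
    much smaller file usually means the wget above was cut short.
    """
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes
```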
- Install the required dependencies:
  ```shell
  pip install -r requirements.txt
  ```
- Run Intelligent EYE:
  ```shell
  streamlit run app.py
  ```
This project uses the following technologies at its core:
- Segment Anything: promptable image segmentation model from Meta AI
- SoM: Set-of-Mark Visual Prompting for GPT-4V
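With Set-of-Mark prompting, the segmented objects are overlaid with numbered marks, and the question can then refer to those numbers instead of spatial descriptions. As a hedged sketch, one way such a request could be assembled for the OpenAI chat completions vision format (the helper itself is illustrative, not this repository's code, and the model name is an assumption):

```python
import base64

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Package a marked image plus a text question into a chat-completions payload."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    # image is sent inline as a base64 data URL
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }
        ],
    }

# Example: ask about the object the SoM overlay labeled "2".
payload = build_vision_request(b"\x89PNG...", "What is the object marked 2?")
```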