
Visual question answering with grounding and user selection priority

Introduction

This project enables users to query images with natural-language text. A user can select a specific object in the image to ground their question, which streamlines asking questions and removes the need to describe an object's position with spatial words.
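As a rough sketch of the idea (not the repository's exact code), the snippet below assumes the selected object has already been highlighted on the image, for example with a segmentation-mask overlay, and sends that annotated image together with the question to GPT-4 Vision through the OpenAI Python client. The helper name `ask_about_selection`, the model identifier, and the file paths are illustrative assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def ask_about_selection(annotated_image_path: str, question: str) -> str:
    """Hypothetical helper: send an image with the user's selection
    highlighted, plus a text question, to GPT-4 Vision."""
    with open(annotated_image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```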

Architecture

🔗 System requirements

  • Requires a CUDA-compatible GPU with at least 8 GB of VRAM
  • Python >= 3.8

🚀 Quick Start

  • Clone the repository
git clone https://github.com/shetumohanto/visual-question-answering.git
cd visual-question-answering
  • Add your GPT-4 Vision API key to a .env file as OPENAI_API_KEY; the app loads it with python-dotenv
import os
from dotenv import load_dotenv
load_dotenv()  # reads OPENAI_API_KEY from the .env file
api_key = os.getenv("OPENAI_API_KEY")
  • Download the checkpoint for the Segment Anything vit_h model (a loading sketch follows this list)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
  • Install required dependencies
pip install -r requirements.txt
  • Run Intelligent EYE
streamlit run app.py
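The checkpoint downloaded above is consumed by Meta's segment-anything package. As a minimal sketch (not necessarily how app.py wires it up), this is how a clicked pixel can be turned into an object mask; the image path and click coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the vit_h checkpoint downloaded in the quick start
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
predictor = SamPredictor(sam)

# Segment the object under a clicked pixel
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),  # (x, y) of the user's click
    point_labels=np.array([1]),           # 1 marks a foreground point
    multimask_output=False,
)
```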

🔗 Related Works

This project builds on the following technologies at its core: the Segment Anything Model (SAM) for object selection, GPT-4 Vision for answering questions about images, and Streamlit for the user interface.

🔗 Sample output

[Example output image]
