AgenticVQA is a Visual Question Answering (VQA) system that leverages LLMs and multimodal models to answer questions about images, with support for audio transcription and multi-agent workflows.
## Prerequisites

- Python 3.12+
- pip
- (Optional) ffmpeg, for audio conversion

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/KaramSahoo/AgenticVQA
   cd AgenticVQA
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. If you use audio features, install ffmpeg:

   ```bash
   # Windows
   choco install ffmpeg

   # macOS
   brew install ffmpeg

   # Linux
   sudo apt-get install ffmpeg
   ```

4. Configure API keys for OpenAI, Anthropic, LangSmith, etc. as environment variables or in your code.
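The exact variables depend on which providers you enable; the names below are the standard ones each SDK reads by default (adjust to your setup):

```shell
# Standard variable names for each SDK; adjust to the providers you actually use.
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Optional: LangSmith tracing
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="..."
```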
## Project Structure

```
AgenticVQA/
├── answer_log.csv            # Log of answers generated by the system
├── app.py                    # Main application entry point
├── agents/                   # Core agent logic for VQA and evaluation
│   ├── florence_agent.py     # Florence VQA agent implementation
│   ├── query_evaluator.py    # Evaluates queries and answers
│   └── write_answer.py       # VLM agent that analyzes the image and audio and answers the user query
├── blueprints/               # Flask Blueprints for managing routing logic
│   └── generate.py           # API endpoints for performing VQA and using OD/OCR tools
├── config/                   # Configuration files for different environments
│   ├── __init__.py           # Config package init
│   ├── development_config.py # Development settings
│   └── production_config.py  # Production settings
├── helper/                   # Helper utilities (audio, etc.)
│   └── audio.py              # Audio file conversion and transcription
├── prompts/                  # Prompt templates for LLMs
│   ├── system_message.py     # System prompt templates
│   └── user_prompts.py       # User prompt templates
├── tools/                    # Vision and OCR tools
│   ├── ocr.py                # OCR utility functions
│   └── od.py                 # Object detection utility functions
├── utils/                    # Utility functions and schemas
│   ├── logger.py             # Logging functionality
│   └── schemas.py            # Structured-output schemas
├── workflows/                # Multi-agent VQA workflows built with LangGraph
│   └── vqa_workflow.py       # Main VQA workflow logic
└── requirements.txt          # Python dependencies
```
### `app.py`

Main entry point for running the AgenticVQA application. Handles initialization and routing.

### `agents/`

Core agent logic for answering VQA queries and evaluating responses.

- `florence_agent.py`: Implements the Florence VQA agent.
- `query_evaluator.py`: Evaluates the quality and correctness of answers.
- `write_answer.py`: VLM agent that analyzes the image and audio and generates an answer for the user query.
### `blueprints/`

Flask Blueprints for managing routing logic and API endpoints.

- `generate.py`: API endpoints for performing VQA and invoking the OD/OCR tools.
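As a rough sketch of how such an endpoint might be wired (route path, parameter names, and the commented-out workflow call are assumptions, not the actual `generate.py` API):

```python
from flask import Blueprint, jsonify, request

# Hypothetical minimal blueprint; the real generate.py delegates to the
# agents, tools, and workflow modules described above.
generate_bp = Blueprint("generate", __name__)

@generate_bp.route("/vqa", methods=["POST"])
def vqa():
    payload = request.get_json(force=True)
    image_path = payload.get("image")   # path or URL of the image to analyze
    question = payload.get("question")  # natural-language query
    if not image_path or not question:
        return jsonify({"error": "image and question are required"}), 400
    # answer = run_vqa_workflow(image_path, question)  # would call workflows/vqa_workflow.py
    return jsonify({"image": image_path, "question": question, "answer": "..."})
```

Registering the blueprint on the Flask app (`app.register_blueprint(generate_bp)`) exposes the route; the stubbed response stands in for the workflow call.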
### `config/`

Configuration files for different environments.

- `development_config.py`: Settings for development.
- `production_config.py`: Settings for production.
### `helper/`

Helper utilities for audio and other tasks.

- `audio.py`: Functions for audio file conversion and speech-to-text transcription.
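Conversion helpers of this kind typically shell out to ffmpeg; a minimal sketch (function names and arguments are illustrative, not the actual `helper/audio.py` API):

```python
import shutil
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command converting `src` to mono WAV at `sample_rate` Hz,
    the format most speech-to-text models expect."""
    return [
        "ffmpeg", "-y",          # overwrite output without prompting
        "-i", src,               # input file (mp3, m4a, ...)
        "-ac", "1",              # downmix to mono
        "-ar", str(sample_rate), # resample
        str(Path(dst).with_suffix(".wav")),
    ]

def convert_to_wav(src: str, dst: str) -> None:
    # Fails fast if the optional ffmpeg prerequisite is missing.
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```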
### `prompts/`

Prompt templates for LLMs.

- `system_message.py`: System-level prompt templates.
- `user_prompts.py`: User-level prompt templates.
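Such template files often hold plain format strings; a hedged illustration (the template wording below is invented, not the repository's actual prompts):

```python
# Illustrative templates only; the real wording lives in prompts/system_message.py
# and prompts/user_prompts.py.
SYSTEM_MESSAGE = (
    "You are a visual question answering assistant. "
    "Answer strictly from the image and any transcribed audio."
)

USER_PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Audio transcript (may be empty): {transcript}\n"
    "Answer concisely."
)

def build_user_prompt(question: str, transcript: str = "") -> str:
    # Fill the user template with the query and any audio transcript.
    return USER_PROMPT_TEMPLATE.format(question=question, transcript=transcript)
```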
### `tools/`

Vision and OCR tools.

- `ocr.py`: Optical character recognition (OCR) utilities.
- `od.py`: Object detection (OD) utilities.
### `utils/`

General utility functions and data schemas.

- `logger.py`: Logging setup and helpers.
- `schemas.py`: Structured-output and data validation schemas.
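A structured-output schema of this sort might look like the sketch below (field names are assumptions, and the project may well use Pydantic rather than dataclasses):

```python
import logging
from dataclasses import dataclass, field

# Basic named logger, analogous to what utils/logger.py might expose.
logger = logging.getLogger("agenticvqa")

@dataclass
class VQAAnswer:
    """Illustrative structured-output schema; field names are assumptions."""
    question: str
    answer: str
    confidence: float = 0.0
    tools_used: list[str] = field(default_factory=list)  # e.g. ["ocr", "od"]

    def __post_init__(self):
        # Simple validation, standing in for a schema library's checks.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```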
### `workflows/`

LangGraph workflows for orchestrating multi-agent VQA tasks.

- `vqa_workflow.py`: Main workflow for the VQA pipeline.
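The actual workflow is built with LangGraph; its answer-then-evaluate loop can be approximated in plain Python (node names, state shape, and retry logic are assumptions based on the module names above, with the model calls stubbed out):

```python
def answer_node(state: dict) -> dict:
    # Stand-in for agents/write_answer.py: the VLM call would happen here.
    state["answer"] = f"stub answer to: {state['question']}"
    return state

def evaluate_node(state: dict) -> dict:
    # Stand-in for agents/query_evaluator.py: would score the answer's quality.
    state["approved"] = bool(state.get("answer"))
    return state

def run_vqa_workflow(question: str, max_retries: int = 2) -> dict:
    """Loop answer -> evaluate until the evaluator approves or retries run out."""
    state: dict = {"question": question}
    for _ in range(max_retries + 1):
        state = evaluate_node(answer_node(state))
        if state["approved"]:
            break
    return state
```

In the real LangGraph version these functions would be graph nodes with a conditional edge from the evaluator back to the answering agent.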
### `requirements.txt`

Lists all Python dependencies required for the project.
## Usage

- Prepare your images and audio files in the workspace.
- Run `app.py` or use the provided notebooks in `evaluations/` to start VQA tasks.
- Use the helper scripts for audio transcription and database management as needed.
- Customize prompts and workflows for your specific use case.
## Contributing

Feel free to open issues or submit pull requests for improvements or new features.