This project implements a Visual Question Answering (VQA) system in PyTorch, using multimodal transformers for real-world image understanding. The system extracts features from images and questions, fuses them, and predicts answers. Two approaches, classification and generation, are explored on the DAQUAR dataset.
- Abstract
  - Overview of the project goals, methods, and key findings.
- Introduction
  - Background, objectives, and scope of the project.
- Methodology
  - Feature extraction techniques, multimodal fusion, and implementation tools.
- Datasets
  - Description of the DAQUAR dataset used in this project.
- Assessment Methodology
  - Metrics and evaluation techniques.
- Literature Review
  - Thematic and comparative analyses of existing approaches.
- Critical Analysis
  - Gaps, limitations, and implications of the study.
- Conclusion
  - Summary of findings and future directions.
- References
  - Cited sources and resources.
- Image Feature Extraction: A Vision Transformer (ViT) splits each image into patches and encodes them as spatial token embeddings.
- Text Feature Extraction: BERT encodes the natural-language questions.
- Multimodal Fusion: Late fusion, bilinear pooling, and attention mechanisms integrate the visual and textual features (a minimal fusion sketch follows this list).
- Generation Model: BERT, ViT, and GPT2 are combined for sequence-generation answering.
- Dropout layers to mitigate overfitting.
- Gradient clipping to stabilize backpropagation.
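The classification path can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's exact code: the model checkpoints, the answer-vocabulary size, and the 0.3 dropout rate are assumptions. Pooled ViT and BERT embeddings are concatenated (late fusion) and passed through a dropout-regularised linear head; gradient clipping is shown in a comment.

```python
# Minimal late-fusion VQA classifier sketch (illustrative, not the project's exact code).
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel

class LateFusionVQA(nn.Module):
    def __init__(self, num_answers: int, dropout: float = 0.3):
        super().__init__()
        # Assumed checkpoints; the project may use different ones.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.vision.config.hidden_size + self.text.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),              # dropout to mitigate overfitting
            nn.Linear(hidden, num_answers),   # predict over the answer vocabulary
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vision(pixel_values=pixel_values).pooler_output      # (B, 768)
        txt = self.text(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output    # (B, 768)
        fused = torch.cat([img, txt], dim=-1)                           # late fusion by concatenation
        return self.classifier(fused)

# During training, gradients can be clipped to stabilise backpropagation:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```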
- PyTorch
- Hugging Face Transformers
- Scikit-learn
- NLTK
DAQUAR (DAtaset for QUestion Answering on Real-world images):
- Size: 12,500 question-answer pairs.
- Focus: Indoor scenes and basic object recognition.
- Applications: Ideal for single-word/phrase-answer modeling.
- Accuracy: Measures correctness of predictions.
- Macro F1 Score: Evaluates model balance across classes.
- Wu-Palmer Similarity (WUPS): Captures semantic similarity between predicted answers and ground-truth answers.
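A WUPS-style score can be approximated with NLTK's WordNet Wu-Palmer similarity. The sketch below is an illustrative approximation, not the project's exact evaluation code; it assumes single-word answers and uses the common 0.9 threshold with 0.1 down-weighting.

```python
# WUPS-style scoring sketch using NLTK's WordNet (requires nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def wup(word_a: str, word_b: str) -> float:
    """Best Wu-Palmer similarity over all synset pairs of the two words."""
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    scores = [a.wup_similarity(b) or 0.0 for a in syns_a for b in syns_b]
    return max(scores, default=0.0)

def wups(predictions, ground_truths, threshold: float = 0.9) -> float:
    """Average WUPS score; similarities below the threshold are down-weighted by 0.1."""
    scores = []
    for pred, gt in zip(predictions, ground_truths):
        s = wup(pred, gt)
        scores.append(s if s >= threshold else s * 0.1)
    return sum(scores) / max(len(scores), 1)

# Example: wups(["table"], ["desk"]) gives partial credit for a semantically close answer.
```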
- Input Dimensions: Effect of image patch and token embedding sizes.
- Pre-processing: Analysis of normalization, resizing, and tokenization methods.
- Fusion Mechanisms: Comparing concatenation and bilinear pooling (see the sketch after this list).
- Attention Mechanisms: Evaluating different attention models.
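To make the fusion comparison concrete, the sketch below contrasts simple concatenation with PyTorch's nn.Bilinear, which lets every image-feature dimension interact with every question-feature dimension. The 768-dimensional inputs and 512-dimensional fused output are illustrative assumptions.

```python
# Concatenation vs. bilinear pooling fusion sketch (dimensions are assumptions).
import torch
import torch.nn as nn

img_feat = torch.randn(8, 768)   # batch of pooled ViT features
txt_feat = torch.randn(8, 768)   # batch of pooled BERT features

# 1) Concatenation fusion: a linear layer over the stacked features.
concat_fusion = nn.Linear(768 + 768, 512)
z_concat = concat_fusion(torch.cat([img_feat, txt_feat], dim=-1))   # (8, 512)

# 2) Bilinear pooling: pairwise interactions between image and text dimensions.
bilinear_fusion = nn.Bilinear(768, 768, 512)
z_bilinear = bilinear_fusion(img_feat, txt_feat)                     # (8, 512)
```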
- Classification Model: BERT + ViT achieved a WUPS score of 0.26.
- Generation Model: BERT + ViT + GPT2 performed slightly better, with a WUPS score of 0.27 (a simplified architecture sketch follows this list).
- Challenges: Limited dataset size and high computational requirements.
- Future Directions:
- Transfer learning for diverse datasets.
- Integration of external knowledge graphs.
- Optimization for computational efficiency.
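For reference, the generation path can be sketched as a GPT-2 decoder that cross-attends to a fused ViT + BERT encoder sequence. This is a simplified illustration of the BERT + ViT + GPT2 combination, not the project's exact implementation; the model checkpoints and fusion-by-concatenation choice are assumptions.

```python
# Simplified generation-model sketch: GPT-2 cross-attends to concatenated ViT/BERT features.
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel, GPT2LMHeadModel

class VQAGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumed checkpoints; all three produce 768-dimensional hidden states.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        # GPT-2 with cross-attention so it can condition on the fused features.
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2", add_cross_attention=True)

    def forward(self, pixel_values, question_ids, question_mask, answer_ids):
        img_seq = self.vision(pixel_values=pixel_values).last_hidden_state   # (B, P, 768)
        txt_seq = self.text(input_ids=question_ids,
                            attention_mask=question_mask).last_hidden_state  # (B, T, 768)
        fused = torch.cat([img_seq, txt_seq], dim=1)       # concatenate along the sequence axis
        out = self.decoder(input_ids=answer_ids,
                           encoder_hidden_states=fused,
                           labels=answer_ids)              # teacher-forced answer tokens
        return out.loss, out.logits
```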
- Python 3.8+
- PyTorch
- Hugging Face Transformers
- Scikit-learn
- Clone the repository:
git clone https://github.com/RobuRishabh/Multimodal-Visual-Question-Answering-VQA-with-Generative-AI-utilizing-LLM-and-Vision-Language-Model.git
- Install dependencies:
pip install -r requirements.txt
- Download the DAQUAR dataset and place it in the data/ folder.
- Training the Classification Model:
jupyter notebook VQA_Classification.ipynb
- Training the Generation Model:
jupyter notebook VQA_Generation.ipynb
Rishabh Singh
- Course: CS 6120 (Natural Language Processing)
- Instructor: Prof. Uzair Ahmad