Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
The evaluation dataset for the technical paper "A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering" is available on Hugging Face as Knowledge-intensive Dataset.
We released the two-million-sample Wikipedia knowledge dataset as Wikipedia-Knowledge-2M. The dataset includes a JSON file and a compressed archive containing all the image files. The image attributes in the JSON file correspond to the image files in the compressed archive.
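A quick consistency check between the JSON file and the extracted images can catch an incomplete download before pretraining. The sketch below assumes the JSON file is a list of records, each with an `image` field holding a path relative to the extracted image directory. The field name and the helper `find_missing_images` are illustrative assumptions, not taken from the dataset itself.

```python
import json
from pathlib import Path


def find_missing_images(json_path, image_root, image_key="image"):
    """Return image paths listed in the JSON file that are absent on disk.

    Assumes the JSON file holds a list of dicts and that `image_key`
    (a hypothetical field name) stores a path relative to `image_root`.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    root = Path(image_root)
    missing = []
    for record in records:
        rel_path = record.get(image_key)
        # Skip records without an image entry rather than failing.
        if rel_path is None:
            continue
        if not (root / rel_path).is_file():
            missing.append(rel_path)
    return missing
```

If every referenced image is present, the function returns an empty list; otherwise the returned paths show which files to re-extract or re-download.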
We have also provided the JSON file for the 504K KnowledgeQA dataset in LLaVA-KnowledgeQA-504K. The dataset mainly consists of the training sets from OK-VQA, A-OKVQA, and TextVQA. The images in this dataset come from COCO Caption and TextVQA, which you will need to download yourself.
- PyTorch 2.0.1
conda create -n CVLM python=3.8
conda activate CVLM
pip install -r requirement.txt
Before pretraining the visual knowledge aligner (VKA), place the downloaded Wikipedia-Knowledge-2M dataset in the LLaVA/playground/knowledge_data directory. Then use the following scripts for pretraining.
cd LLaVA
export PYTHONPATH=path_to_current_dir
bash scripts/decoder_model/pretrain_knowledge.sh
Replace pretrain_opt_adapter with the save path of your pretrained VKA.
bash scripts/knowledge/pretrain.sh
Use the code to extract the trainable parameters from the saved checkpoint files and store them; they serve as inputs to the next stage of training.
Set the attribute pretrain_knowledge_params_path to the path where the parameters extracted in the previous stage are stored.
bash scripts/knowledge_qa/llava_vka_qa.sh
After training completes, you can use the code to extract both the trainable non-LoRA parameters and the LoRA parameters from the checkpoints.
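The extraction step can be pictured as filtering a checkpoint's state dict by parameter name. The sketch below is not the repository's extraction script. It assumes LoRA weights carry `lora_` in their names (the naming used by the PEFT library) and that the other trainable parameters can be recognized by name fragments, for which `knowledge` is a hypothetical placeholder. It works on any mapping of names to tensors, such as a state dict loaded from a checkpoint.

```python
def split_trainable_params(state_dict, trainable_keywords=("knowledge",)):
    """Split a state dict into LoRA and trainable non-LoRA parameters.

    `trainable_keywords` is a hypothetical tuple of name fragments marking
    the non-LoRA modules trained in this stage; adjust it to your model.
    """
    lora_params = {}
    non_lora_params = {}
    for name, tensor in state_dict.items():
        if "lora_" in name:
            # PEFT names the LoRA matrices lora_A and lora_B.
            lora_params[name] = tensor
        elif any(keyword in name for keyword in trainable_keywords):
            non_lora_params[name] = tensor
        # Everything else is treated as frozen and left out.
    return lora_params, non_lora_params
```

Saving the two dictionaries separately keeps the files small, since the frozen base-model weights are not duplicated in each stage.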
Finally, we fine-tune the Fine-grained Knowledge Adapter (FKA) in two stages.
bash scripts/knowledge_qa/llava_fka_qa.sh
bash scripts/knowledge_qa/llava_fka_qa_stage2.sh
Note that in each stage of training, the parameters from the previous stage are loaded via the attribute pretrain_knowledge_params_path, and they must first be extracted by the code.
This stage of training also requires loading the parameters trained in the Pretraining Visual Knowledge Aligner stage.
Set the attribute pretrain_opt_adapter to your save path.
cd Qwen
bash finetune/pretrain_ds.sh
bash finetune/finetune_lora_ds.sh
The sam_images on GitHub are incomplete; you need to re-download them from Hugging Face.
We released the best LLaVA-based model as CVLM-LLaVA, the best Qwen-VL-based model as CVLM-Qwen, and the pretrained OPT as CVLM-Opt.
After downloading checkpoints, organize the weights as follows.
├── LLaVA
│   └── checkpoints
│       └── CVLM-LLaVA
└── Qwen
    └── checkpoints
        └── CVLM-Qwen
            ├── qwen-pretrain
            └── qwen-vka
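Before running evaluation, it can help to confirm that the weights landed where the scripts expect them. The following is a minimal sketch, assuming that qwen-pretrain and qwen-vka sit inside CVLM-Qwen (as the Qwen evaluation commands suggest) and that the check is run from the directory containing LLaVA and Qwen. The helper name `missing_checkpoint_dirs` is illustrative.

```python
from pathlib import Path

# Relative directories taken from the layout above.
EXPECTED_DIRS = [
    "LLaVA/checkpoints/CVLM-LLaVA",
    "Qwen/checkpoints/CVLM-Qwen/qwen-pretrain",
    "Qwen/checkpoints/CVLM-Qwen/qwen-vka",
]


def missing_checkpoint_dirs(repo_root="."):
    """Return the expected checkpoint directories that do not exist yet."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty result means all three directories are in place; any returned path points to weights that still need to be downloaded or moved.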
The evaluation scripts for LLaVA are in scripts/knowledge_qa/eval.
We mainly evaluated six benchmark datasets: OK-VQA, VQAv2, A-OKVQA, TextVQA, InfoSeek, and SEED-Bench.
Before evaluation, unzip the images generated by SAM.
cd LLaVA/playground/knowledge_qa/sam
tar -xzvf images_all.tar.gz
Note that the saved result files will be in the answers_upload folder within the corresponding directory.
Evaluation on OK-VQA.
bash scripts/knowledge_qa/eval/okvqa.sh
cd /data/cxy/Knowledge_LLaVA/upload/playground/knowledge_qa/eval/okvqa
python okvqa_eval.py --pred_file your_save_path
Evaluation on VQAv2.
bash scripts/knowledge_qa/eval/vqav2.sh
cd /data/cxy/Knowledge_LLaVA/upload/playground/knowledge_qa/eval/vqav2
python vqa_eval.py --pred_file your_save_path
Evaluation on open-ended A-OKVQA. The following script also performs the evaluation.
bash scripts/knowledge_qa/eval/aokvqa_oe.sh
Evaluation on multiple-choice A-OKVQA.
bash scripts/knowledge_qa/eval/aokvqa.sh
Evaluation on TextVQA.
bash scripts/knowledge_qa/eval/textvqa.sh
Evaluation on InfoSeek.
bash scripts/knowledge_qa/eval/infoseek.sh
Evaluation on SEED-Bench.
bash scripts/knowledge_qa/eval/seedbench.sh
The Qwen model is evaluated using the same datasets as the LLaVA model.
Before you evaluate the Qwen-VL model, you need to download the Qwen-VL model from Qwen-VL and use the two Python files under path to replace the original files.
Evaluation on OK-VQA.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset okvqa --few-shot 0
Evaluation on VQAv2.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset vqav2 --few-shot 0
Evaluation on open-ended A-OKVQA.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset aokvqa --few-shot 0
Evaluation on multiple-choice A-OKVQA.
python eval_mm/evaluate_multiple_choice_generated.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset aokvqa --few-shot 0
Evaluation on TextVQA.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset textvqa --few-shot 0
Evaluation on InfoSeek.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset infoseek --few-shot 0
Evaluation on SEED-Bench.
python eval_mm/evaluate_multiple_choice_generated.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset seedbench --few-shot 0
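Since the Qwen commands above differ only in the script and the dataset name, they can be generated from a single table. The sketch below rebuilds the same seven commands as argument lists. The names `EVAL_RUNS` and `build_eval_commands` are illustrative and not part of the repository; the scripts, flags, and paths are copied from the commands above.

```python
CHECKPOINT = "checkpoints/CVLM-Qwen/qwen-pretrain"
ADAPTER = "checkpoints/CVLM-Qwen/qwen-vka"

# (script, dataset) pairs, in the order the commands appear above.
EVAL_RUNS = [
    ("eval_mm/evaluate_vqa.py", "okvqa"),
    ("eval_mm/evaluate_vqa.py", "vqav2"),
    ("eval_mm/evaluate_vqa.py", "aokvqa"),
    ("eval_mm/evaluate_multiple_choice_generated.py", "aokvqa"),
    ("eval_mm/evaluate_vqa.py", "textvqa"),
    ("eval_mm/evaluate_vqa.py", "infoseek"),
    ("eval_mm/evaluate_multiple_choice_generated.py", "seedbench"),
]


def build_eval_commands(checkpoint=CHECKPOINT, adapter=ADAPTER, few_shot=0):
    """Return one argument list per evaluation run."""
    commands = []
    for script, dataset in EVAL_RUNS:
        commands.append([
            "python", script,
            "--checkpoint", checkpoint,
            "--adapter", adapter,
            "--dataset", dataset,
            "--few-shot", str(few_shot),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the Qwen directory, so that a failing run stops the loop instead of being silently skipped.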
If you find our paper and code useful in your research, please consider starring the repository and citing our work:
@inproceedings{li-etal-2024-cognitive,
title = "Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment",
author = "Li, Yunxin and
Chen, Xinyu and
Hu, Baotian and
Shi, Haoyuan and
Zhang, Min",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.411/",
doi = "10.18653/v1/2024.acl-long.411",
pages = "7615--7626"
}
@article{li2023comprehensive,
title={A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering},
author={Li, Yunxin and Wang, Longyue and Hu, Baotian and Chen, Xinyu and Zhong, Wanqi and Lyu, Chenyang and Zhang, Min},
journal={arXiv preprint arXiv:2311.07536},
year={2023}
}