NAACL 2025
Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes
git clone https://github.com/moranyanuka/knowada.git
cd knowada
To set up our environment, please run:
conda env create -f environment.yml
conda activate knowada
Add your Gemini API key:
export API_KEY='<your-key>'
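If you want to confirm the key is visible before running anything, a minimal check from Python (assuming the scripts read the same API_KEY environment variable set above) looks like:
import os

# The pipeline scripts are expected to read the Gemini key from API_KEY.
api_key = os.environ.get("API_KEY")
if api_key is None:
    raise RuntimeError("API_KEY is not set; run `export API_KEY='<your-key>'` first.")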
First, create a CSV file with the following columns:
original_description: Contains the ground-truth image description from the evaluation dataset
generated_description: Contains the generated description of the VLM to evaluate
See an example of such a file here.
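For illustration, a minimal way to build such a file with pandas (the column names match the ones above; the file name model_generations.csv is just an example):
import pandas as pd

# Each row pairs a ground-truth description with the corresponding VLM output.
df = pd.DataFrame({
    "original_description": ["A red bicycle leans against a brick wall next to a green door."],
    "generated_description": ["A bicycle is parked next to a wall."],
})
df.to_csv("model_generations.csv", index=False)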
Then, run the following command:
python eval/generate_propositions.py \
--df_path <path-to-model-generation> \
--output_dir <path-to-evaluation-output>
The script will write the propositions of the ground-truth descriptions, the propositions of the generated descriptions, and the final metrics to output_dir.
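A quick way to inspect the results is to load the CSVs written to the output directory; this sketch makes no assumptions about the exact file names, only that the outputs are CSV files:
import glob
import pandas as pd

# List and preview whatever CSVs the evaluation script wrote
# (exact file names depend on the script).
for path in glob.glob("<path-to-evaluation-output>/*.csv"):
    df = pd.read_csv(path)
    print(path, df.shape)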
To rewrite the DOCCI captions according to the knowledge gaps of PaliGemma, run the following script:
python run.py \
--generate_questions True \
--generate_answers True \
--generate_judgments True \
--generate_rewritten_descriptions True \
--output_folder <path-to-output-directory>
This will generate the following files:
questions.csv: Contains the generated questions based on the image descriptions
answers.csv: The VLM's sampled answers to each question
judgments.csv: The judgments determining whether a given answer is correct on a scale of 1-3 (1 is completely incorrect, 3 is completely correct)
difficult_questions_list.csv: Contains, for each description, all the questions that are considered unknown for a given threshold
rewritten_captions.csv: The final rewritten captions based on the unknown questions
You can adjust some of the parameters in each stage of the pipeline using the config files (e.g., the train/test split, the difficulty_threshold for determining if a question is unknown, etc.).
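For intuition, deciding which questions are "unknown" boils down to comparing the judged correctness of the VLM's answers against the difficulty threshold. The sketch below is only illustrative: the column names (question_id, judgment) and the mean aggregation are assumptions, not the repository's exact implementation.
import pandas as pd

DIFFICULTY_THRESHOLD = 2  # illustrative value; the real one is set in the config files

# Hypothetical columns: one 1-3 judgment per sampled answer of each question.
judgments = pd.read_csv("<path-to-output-directory>/judgments.csv")

# Average the judgments over a question's sampled answers and treat questions
# scoring below the threshold as "unknown" to the VLM.
mean_scores = judgments.groupby("question_id")["judgment"].mean()
unknown = mean_scores[mean_scores < DIFFICULTY_THRESHOLD]
print(f"{len(unknown)} questions considered unknown")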
If you find this useful for your research, please cite the following:
@article{yanuka2024bridging,
title={Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions},
author={Yanuka, Moran and Kish, Assaf Ben and Bitton, Yonatan and Szpektor, Idan and Giryes, Raja},
journal={arXiv preprint arXiv:2411.09018},
year={2024}
}