Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

NAACL 2025

Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes

Setup

Clone Project

git clone https://github.com/moranyanuka/knowada.git
cd knowada

Create the Environment

To set up our environment, please run:

conda env create -f environment.yml
conda activate knowada

Add your Gemini API key:

export API_KEY='<your-key>'
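The pipeline scripts can then pick the key up from the environment. A minimal sketch of that lookup (assuming the variable name API_KEY exported above):

```python
import os

# Read the Gemini API key exported in the previous step; fall back to an
# empty string so we can print a helpful hint instead of crashing.
api_key = os.environ.get("API_KEY", "")
if not api_key:
    print("API_KEY is not set; run: export API_KEY='<your-key>'")
```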

DNLI Dense Caption Evaluation

To run the DNLI evaluation of dense captions generated by your VLM:

First, create a CSV file with the following columns:

  • original_description: Contains the ground-truth image description from the evaluation dataset
  • generated_description: Contains the corresponding description generated by the VLM under evaluation

See an example of such a file here.
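Such a file could be assembled with pandas, for example. The file name model_generations.csv and the row contents below are purely illustrative:

```python
import pandas as pd

# Each row pairs a ground-truth description from the evaluation dataset
# with the VLM's generated description for the same image.
df = pd.DataFrame(
    {
        "original_description": [
            "A brown dog runs across a grassy field toward a red ball.",
        ],
        "generated_description": [
            "A dog is playing outdoors on the grass.",
        ],
    }
)
df.to_csv("model_generations.csv", index=False)  # pass this path as --df_path
```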

Then, run the following command:

python eval/generate_propositions.py \
       --df_path <path-to-model-generation> \
       --output_dir <path-to-evaluation-output>

The script will write the propositions of the ground-truth descriptions, the propositions of the generated descriptions, and the final metrics to output_dir.

KnowAda

To rewrite the DOCCI captions according to the knowledge gaps of PaliGemma, run the following script:

python run.py \
       --generate_questions True \
       --generate_answers True \
       --generate_judgments True \
       --generate_rewritten_descriptions True \
       --output_folder <path-to-output-directory>

This will generate the following files:

  • questions.csv: Contains the generated questions based on the image descriptions
  • answers.csv: The VLM's sampled answers to each question
  • judgments.csv: Judgments of whether each answer is correct, on a scale of 1-3 (1 = completely incorrect, 3 = completely correct)
  • difficult_questions_list.csv: Contains, for each description, all questions considered unknown at the given threshold
  • rewritten_captions.csv: The final rewritten captions based on the unknown questions

You can adjust the parameters of each stage of the pipeline using the config files (e.g., the train/test split, the difficulty_threshold that determines whether a question is unknown, etc.).
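To illustrate the thresholding step, a question can be marked unknown when the mean judgment over the VLM's sampled answers falls below difficulty_threshold. The column names, table contents, and threshold value in this sketch are hypothetical, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical judgments table: one row per (description, question, sampled answer),
# with `judgment` on the 1-3 scale described above.
judgments = pd.DataFrame(
    {
        "description_id": [0, 0, 0, 0],
        "question": ["What color is the ball?"] * 2 + ["Is the dog running?"] * 2,
        "judgment": [1, 2, 3, 3],  # 1 = completely incorrect, 3 = completely correct
    }
)

DIFFICULTY_THRESHOLD = 2.0  # illustrative value; in practice set via the config files

# Average the judgments per question; questions the model tends to answer
# incorrectly (mean below the threshold) are treated as "unknown".
mean_scores = judgments.groupby(["description_id", "question"])["judgment"].mean()
difficult = mean_scores[mean_scores < DIFFICULTY_THRESHOLD].reset_index()
print(difficult["question"].tolist())  # → ['What color is the ball?']
```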

Citation

If you find this useful for your research, please cite the following:

@article{yanuka2024bridging,
  title={Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions},
  author={Yanuka, Moran and Kish, Assaf Ben and Bitton, Yonatan and Szpektor, Idan and Giryes, Raja},
  journal={arXiv preprint arXiv:2411.09018},
  year={2024}
}
