Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

NAACL 2025

Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes

Setup

Clone Project

git clone https://github.com/moranyanuka/knowada.git
cd knowada

Create the Environment

To set up our environment, please run:

conda env create -f environment.yml
conda activate knowada

Add your Gemini API key:

export API_KEY='<your-key>'
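The pipeline scripts can then pick the key up from the environment. A minimal sketch of that lookup (assuming the variable name API_KEY exported above):

```python
import os

# Read the Gemini API key exported in the previous step; fall back to an
# empty string so we can print a helpful hint instead of crashing.
api_key = os.environ.get("API_KEY", "")
if not api_key:
    print("API_KEY is not set; run: export API_KEY='<your-key>'")
```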

DNLI Dense Caption Evaluation

To run the DNLI evaluation of dense captions generated by your VLM:

First, create a CSV file with the following columns:

  • original_description: Contains the ground-truth image description from the evaluation dataset
  • generated_description: Contains the corresponding description generated by the VLM under evaluation

See an example of such a file here.
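Such a file could be assembled with pandas, for example. The file name model_generations.csv and the row contents below are purely illustrative:

```python
import pandas as pd

# Each row pairs a ground-truth description from the evaluation dataset
# with the VLM's generated description for the same image.
df = pd.DataFrame(
    {
        "original_description": [
            "A brown dog runs across a grassy field toward a red ball.",
        ],
        "generated_description": [
            "A dog is playing outdoors on the grass.",
        ],
    }
)
df.to_csv("model_generations.csv", index=False)  # pass this path as --df_path
```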

Then, run the following command:

python eval/generate_propositions.py \
       --df_path <path-to-model-generation> \
       --output_dir <path-to-evaluation-output>

The script will write the propositions of the ground-truth descriptions, the propositions of the generated descriptions, and the final metrics to output_dir.

KnowAda

To rewrite the DOCCI captions according to the knowledge gaps of PaliGemma, run the following script:

python run.py \
       --generate_questions True \
       --generate_answers True \
       --generate_judgments True \
       --generate_rewritten_descriptions True \
       --output_folder <path-to-output-directory>

This will generate the following files:

  • questions.csv: Contains the generated questions based on the image descriptions
  • answers.csv: The VLM's sampled answers to each question
  • judgments.csv: Judgments of whether each answer is correct, on a scale of 1-3 (1 = completely incorrect, 3 = completely correct)
  • difficult_questions_list.csv: Contains, for each description, all questions considered unknown at the given threshold
  • rewritten_captions.csv: The final rewritten captions based on the unknown questions

You can adjust the parameters of each stage of the pipeline using the config files (e.g., the train/test split, the difficulty_threshold that determines whether a question is unknown, etc.).
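To illustrate the thresholding step, a question can be marked unknown when the mean judgment over the VLM's sampled answers falls below difficulty_threshold. The column names, table contents, and threshold value in this sketch are hypothetical, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical judgments table: one row per (description, question, sampled answer),
# with `judgment` on the 1-3 scale described above.
judgments = pd.DataFrame(
    {
        "description_id": [0, 0, 0, 0],
        "question": ["What color is the ball?"] * 2 + ["Is the dog running?"] * 2,
        "judgment": [1, 2, 3, 3],  # 1 = completely incorrect, 3 = completely correct
    }
)

DIFFICULTY_THRESHOLD = 2.0  # illustrative value; in practice set via the config files

# Average the judgments per question; questions the model tends to answer
# incorrectly (mean below the threshold) are treated as "unknown".
mean_scores = judgments.groupby(["description_id", "question"])["judgment"].mean()
difficult = mean_scores[mean_scores < DIFFICULTY_THRESHOLD].reset_index()
print(difficult["question"].tolist())  # → ['What color is the ball?']
```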

Citation

If you find this useful for your research, please cite the following:

@article{yanuka2024bridging,
  title={Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions},
  author={Yanuka, Moran and Kish, Assaf Ben and Bitton, Yonatan and Szpektor, Idan and Giryes, Raja},
  journal={arXiv preprint arXiv:2411.09018},
  year={2024}
}
