- Given an image I, the goal is to generate a textual description using a pre-trained Vision-Language Model (VLM) while leveraging real-world knowledge from a Large Language Model (LLM).
- The primary focus is on addressing challenges related to modality bias and object hallucination.
- Training: Using a text-only corpus, nouns are extracted from each sentence with a grammar parser to construct the entity-aware hard prompt, while the soft prompt encodes the overall context of the sentence via the CLIP text encoder.
- Inference: The CLIP image encoding is passed through the projector to produce the soft prompt, and a CLIP-based entity classifier predicts the entities used to construct the entity-aware hard prompt. Because the hard prompt is training-agnostic, it transfers strongly across domains (see the sketch after this list).
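The sketch below illustrates the training-time prompt construction described above. It is a minimal illustration, not the repository's exact code: the prompt template, the noun list, and the use of the openai `clip` package are all assumptions.

```python
import clip
import torch

# A hypothetical caption from the text-only training corpus, with its nouns
# already extracted by a parser (entity extraction is sketched further down).
caption = "A dog chases a ball across a grassy field."
nouns = ["dog", "ball", "field"]

# Entity-aware hard prompt: a plain-text template filled with the nouns
# (illustrative template, not necessarily the one used by the scripts).
hard_prompt = "There are " + ", ".join(nouns) + " in the image."

# Soft prompt: the CLIP text embedding of the whole sentence; a trainable
# projector later maps it into the language model's prefix space.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    sentence_feature = model.encode_text(clip.tokenize([caption]).to(device))

print(hard_prompt)             # "There are dog, ball, field in the image."
print(sentence_feature.shape)  # torch.Size([1, 512]) for ViT-B/32
```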
pip install .
git clone https://github.com/tylin/coco-caption
cd Code/utils/
python get_entities.py
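`get_entities.py` presumably extracts the noun entities from the training captions to build the entity vocabulary. The snippet below is a hedged sketch of that idea using NLTK; the captions, output file name, and vocabulary size are illustrative and the repository's parser may differ.

```python
import json
from collections import Counter

import nltk

# Hypothetical input: a list of training captions (the real script reads the
# COCO / Flickr30k annotation files).
captions = [
    "A man riding a horse on the beach.",
    "Two dogs play with a ball in the park.",
]

# Resource names differ across NLTK versions, so try all of them.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    try:
        nltk.download(resource, quiet=True)
    except Exception:
        pass

counter = Counter()
for caption in captions:
    tags = nltk.pos_tag(nltk.word_tokenize(caption.lower()))
    counter.update(word for word, tag in tags if tag.startswith("NN"))

# Keep the most frequent nouns as the entity vocabulary used later by the
# CLIP-based entity classifier.
entities = [word for word, _ in counter.most_common(1000)]
with open("entities.json", "w") as f:
    json.dump(entities, f)
```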
cd Code/Feature_Extraction/
python CLIP_texts_features_extraction.py
python CLIP_images_features_extraction.py
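These two scripts precompute CLIP features for the captions and images so they do not have to be re-encoded during training and evaluation. A minimal sketch of that kind of extraction, assuming the openai `clip` package and an illustrative image path and output file:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text side: encode the training captions once and cache the features.
captions = ["A man riding a horse on the beach.",
            "Two dogs play with a ball in the park."]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(captions).to(device))

# Image side: encode the evaluation images used for inference-time soft prompts.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)

# Cache to disk so later stages never re-run the CLIP encoders.
torch.save({"text": text_features.cpu(), "image": image_features.cpu()},
           "clip_features.pt")
```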
cd Code/utils/
python prompt_ensemble.py
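`prompt_ensemble.py` presumably builds the entity-classifier weights by averaging the CLIP text embeddings of each entity over several prompt templates, in the style of CLIP's zero-shot classifier. A minimal sketch under that assumption (the templates and output file name are illustrative):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative entity vocabulary and templates; the real script reads the
# entities produced by get_entities.py and may use different templates.
entities = ["dog", "horse", "beach"]
templates = ["a photo of a {}.", "a picture of a {}.", "an image of a {}."]

classifier_weights = []
with torch.no_grad():
    for entity in entities:
        prompts = clip.tokenize([t.format(entity) for t in templates]).to(device)
        embeddings = model.encode_text(prompts)
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        # Average over templates, then re-normalize: one weight vector per entity.
        mean = embeddings.mean(dim=0)
        classifier_weights.append(mean / mean.norm())

weights = torch.stack(classifier_weights)   # (num_entities, 512) for ViT-B/32
torch.save(weights.cpu(), "entity_classifier_weights.pt")
```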
cd scripts/
bash train_coco.sh 0
bash train_flickr30k.sh 0
where 0 is the GPU ID to train on.
To run the model on a single image, you can use the provided notebook.
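For orientation, the sketch below shows one plausible shape of single-image inference: the CLIP image feature feeds the projector as the soft prompt, and a cosine-similarity entity classifier, filtered by top-k and a score threshold (cf. the `--top_k` / `--threshold` flags in the evaluation scripts below), builds the entity-aware hard prompt. The projector, caption decoder, prompt template, and file names are placeholders; see the notebook for the actual pipeline.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Template-ensembled entity embeddings (see the prompt-ensemble sketch above).
entities = ["dog", "horse", "beach"]
entity_weights = torch.load("entity_classifier_weights.pt").float().to(device)

# Encode the query image; this feature also goes through the projector to
# produce the soft prompt (projector and decoder omitted here).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_feature = model.encode_image(image).float()
    image_feature = image_feature / image_feature.norm(dim=-1, keepdim=True)

# CLIP-based entity classification: cosine similarity, keep top-k entities
# whose score exceeds the threshold.
similarities = (image_feature @ entity_weights.T).squeeze(0)
top_scores, top_idx = similarities.topk(k=3)
detected = [entities[i] for s, i in zip(top_scores.tolist(), top_idx.tolist())
            if s > 0.2]

# Entity-aware hard prompt handed to the caption decoder together with the
# projected soft prompt.
hard_prompt = "There are " + ", ".join(detected) + " in the image."
print(hard_prompt)
```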
- Cross-Domain
bash eval_nocaps.sh coco_train_1 0 '--top_k 3 --threshold 0.2' 14
bash eval_flickr30k.sh coco_train_1 0 '--top_k 3 --threshold 0.2' 14
bash eval_coco.sh flicker30K_1 0 '--top_k 3 --threshold 0.2 --using_greedy_search' 29
- In-Domain
bash eval_coco.sh coco_train_1 0 '' 14
bash eval_flickr30k.sh flicker30K_1 0 '' 29
- Project PPT Link: https://docs.google.com/presentation/d/1MY__ajJE0VolzEsR9XVgSn5Dtjhx1VWo4irGBco-WsY/edit#slide=id.g28910898772_1_38
- If checkpoints are required to run inference, I can share them privately (email me at [email protected]).
- The 'logs' folder contains the experiment logs generated during the runs.
- The 'checkpoints' folder contains the checkpoints and generated captions from each individual experiment.