- Given an image I, the goal is to generate a textual description using a pre-trained Vision-Language Model (VLM) while leveraging real-world knowledge from a Large Language Model (LLM).
- The primary focus is on addressing challenges related to modality bias and object hallucination.
- Training: Using a text-only corpus, nouns are extracted from each sentence with a grammar parser to construct the entity-aware hard prompt, while the soft prompt encodes the overall context of the sentence via the CLIP text encoder.
- Inference: The CLIP image encoding is passed through the projector to produce the soft prompt, and a CLIP-based entity classifier predicts the entities used to construct the entity-aware hard prompt. Because the hard prompt is training-agnostic, it transfers strongly across domains (see the sketch after this list).
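The sketch below illustrates the training-time prompt construction described above. It is a minimal illustration, not the repository's exact code: the prompt template, the noun list, and the use of the openai `clip` package are all assumptions.

```python
import clip
import torch

# A hypothetical caption from the text-only training corpus, with its nouns
# already extracted by a parser (entity extraction is sketched further down).
caption = "A dog chases a ball across a grassy field."
nouns = ["dog", "ball", "field"]

# Entity-aware hard prompt: a plain-text template filled with the nouns
# (illustrative template, not necessarily the one used by the scripts).
hard_prompt = "There are " + ", ".join(nouns) + " in the image."

# Soft prompt: the CLIP text embedding of the whole sentence; a trainable
# projector later maps it into the language model's prefix space.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    sentence_feature = model.encode_text(clip.tokenize([caption]).to(device))

print(hard_prompt)             # "There are dog, ball, field in the image."
print(sentence_feature.shape)  # torch.Size([1, 512]) for ViT-B/32
```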
pip install .
git clone https://github.com/tylin/coco-caption
cd Code/utils/
python get_entities.py
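`get_entities.py` presumably extracts the noun entities from the training captions to build the entity vocabulary. The snippet below is a hedged sketch of that idea using NLTK; the captions, output file name, and vocabulary size are illustrative and the repository's parser may differ.

```python
import json
from collections import Counter

import nltk

# Hypothetical input: a list of training captions (the real script reads the
# COCO / Flickr30k annotation files).
captions = [
    "A man riding a horse on the beach.",
    "Two dogs play with a ball in the park.",
]

# Resource names differ across NLTK versions, so try all of them.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    try:
        nltk.download(resource, quiet=True)
    except Exception:
        pass

counter = Counter()
for caption in captions:
    tags = nltk.pos_tag(nltk.word_tokenize(caption.lower()))
    counter.update(word for word, tag in tags if tag.startswith("NN"))

# Keep the most frequent nouns as the entity vocabulary used later by the
# CLIP-based entity classifier.
entities = [word for word, _ in counter.most_common(1000)]
with open("entities.json", "w") as f:
    json.dump(entities, f)
```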
cd Code/Feature_Extraction/
python CLIP_texts_features_extraction.py
python CLIP_images_features_extraction.py
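These two scripts precompute CLIP features for the captions and images so they do not have to be re-encoded during training and evaluation. A minimal sketch of that kind of extraction, assuming the openai `clip` package and an illustrative image path and output file:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text side: encode the training captions once and cache the features.
captions = ["A man riding a horse on the beach.",
            "Two dogs play with a ball in the park."]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(captions).to(device))

# Image side: encode the evaluation images used for inference-time soft prompts.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)

# Cache to disk so later stages never re-run the CLIP encoders.
torch.save({"text": text_features.cpu(), "image": image_features.cpu()},
           "clip_features.pt")
```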
cd Code/utils/
python prompt_ensemble.py
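`prompt_ensemble.py` presumably builds the entity-classifier weights by averaging the CLIP text embeddings of each entity over several prompt templates, in the style of CLIP's zero-shot classifier. A minimal sketch under that assumption (the templates and output file name are illustrative):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative entity vocabulary and templates; the real script reads the
# entities produced by get_entities.py and may use different templates.
entities = ["dog", "horse", "beach"]
templates = ["a photo of a {}.", "a picture of a {}.", "an image of a {}."]

classifier_weights = []
with torch.no_grad():
    for entity in entities:
        prompts = clip.tokenize([t.format(entity) for t in templates]).to(device)
        embeddings = model.encode_text(prompts)
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        # Average over templates, then re-normalize: one weight vector per entity.
        mean = embeddings.mean(dim=0)
        classifier_weights.append(mean / mean.norm())

weights = torch.stack(classifier_weights)   # (num_entities, 512) for ViT-B/32
torch.save(weights.cpu(), "entity_classifier_weights.pt")
```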
cd scripts/
bash train_coco.sh 0
bash train_flickr30k.sh 0
where 0 is the GPU ID to train on.
To run the model on a single image, you can use the provided notebook.
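For orientation, the sketch below shows one plausible shape of single-image inference: the CLIP image feature feeds the projector as the soft prompt, and a cosine-similarity entity classifier, filtered by top-k and a score threshold (cf. the `--top_k` / `--threshold` flags in the evaluation scripts below), builds the entity-aware hard prompt. The projector, caption decoder, prompt template, and file names are placeholders; see the notebook for the actual pipeline.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Template-ensembled entity embeddings (see the prompt-ensemble sketch above).
entities = ["dog", "horse", "beach"]
entity_weights = torch.load("entity_classifier_weights.pt").float().to(device)

# Encode the query image; this feature also goes through the projector to
# produce the soft prompt (projector and decoder omitted here).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_feature = model.encode_image(image).float()
    image_feature = image_feature / image_feature.norm(dim=-1, keepdim=True)

# CLIP-based entity classification: cosine similarity, keep top-k entities
# whose score exceeds the threshold.
similarities = (image_feature @ entity_weights.T).squeeze(0)
top_scores, top_idx = similarities.topk(k=3)
detected = [entities[i] for s, i in zip(top_scores.tolist(), top_idx.tolist())
            if s > 0.2]

# Entity-aware hard prompt handed to the caption decoder together with the
# projected soft prompt.
hard_prompt = "There are " + ", ".join(detected) + " in the image."
print(hard_prompt)
```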
- Cross-Domain
bash eval_nocaps.sh coco_train_1 0 '--top_k 3 --threshold 0.2' 14
bash eval_flickr30k.sh coco_train_1 0 '--top_k 3 --threshold 0.2' 14
bash eval_coco.sh flicker30K_1 0 '--top_k 3 --threshold 0.2 --using_greedy_search' 29
- In-Domain
bash eval_coco.sh coco_train_1 0 '' 14
bash eval_flickr30k.sh flicker30K_1 0 '' 29
- Project PPT Link: https://docs.google.com/presentation/d/1MY__ajJE0VolzEsR9XVgSn5Dtjhx1VWo4irGBco-WsY/edit#slide=id.g28910898772_1_38
- If checkpoints are required to run inference, I can share them privately (email me at [email protected]).
- The 'logs' folder contains the experiment logs generated during the runs.
- The 'checkpoints' folder contains the checkpoints and generated captions from each individual experiment.