Problem Description
When running inference on car_image.jpg, the generated description ("a person looking at their reflection in the mirror") is unrelated to the actual image content. The model appears to ignore the input image and generate an arbitrary description.
Reproduction Steps
- Command executed:
CUDA_VISIBLE_DEVICES=0 python -m code.infer \
--prompt "Describe this scene in detail." \
--image ./code/assets/images/car_image.jpg \
--top_p 0.05 --temperature 0.7
- Actual image used:
car_image.jpg
Expected Behavior
The model should generate an accurate description of the car image.
Actual Output
======== PandaGPT Output ========
against a wall and is positioned at an angle, with the top of the mirror visible above the wall and the bottom of the mirror visible below the wall. The person is looking at their reflection in the mirror and appears to be adjusting their hair. The room behind the person is dimly lit, with a door visible on the right side of the frame. The person's reflection in the mirror is slightly blurry, indicating that the person is moving or adjusting their position. The overall tone of the scene is introspective and personal, with the focus on the person's reflection in the mirror rather than the surrounding environment.
=================================
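One way to test the suspicion that the image is being ignored is to run the same prompt against two clearly different images and compare the text. The sketch below is only an illustration: `image_is_ignored` and the `generate` wrapper it takes are hypothetical names, not part of PandaGPT, and the check is only meaningful if the wrapper is deterministic (for example, it reseeds the RNG before every call).

```python
from typing import Callable


def image_is_ignored(
    generate: Callable[[str, str], str],
    prompt: str,
    image_a: str,
    image_b: str,
) -> bool:
    """Return True if two different images yield identical text for one prompt.

    `generate` is a hypothetical wrapper taking (prompt, image_path). It must be
    deterministic; otherwise sampling noise alone makes the outputs differ and
    the comparison says nothing about whether the image was used.
    """
    return generate(prompt, image_a) == generate(prompt, image_b)


if __name__ == "__main__":
    # Stand-in generators, for illustration only.
    def blind(prompt: str, image_path: str) -> str:
        return "a person looking at a mirror"

    def sighted(prompt: str, image_path: str) -> str:
        return f"description of {image_path}"

    print(image_is_ignored(blind, "Describe this scene.", "car.jpg", "cat.jpg"))
    print(image_is_ignored(sighted, "Describe this scene.", "car.jpg", "cat.jpg"))
```

With the real model, the wrapper could call `torch.manual_seed(0)` and then `model.generate({...})` with the given path in `image_paths`. Identical output for a car and, say, a cat photo would support the "image is ignored" reading; differing output would point toward a different cause.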
Environment
- Python: 3.10
- Torchvision: 0.14.1
infer.py
#!/usr/bin/env python3
# infer.py
"""
Standalone inference script:
CUDA_VISIBLE_DEVICES=0 python infer.py \
--prompt "Describe this scene" \
--image demo.jpg --top_p 0.05 --temperature 0.7
"""
import argparse, os, torch
from code.model.openllama import OpenLLAMAPEFTModel
# ---------- 1. Load model (similar to web_demo.py) ----------
def load_model() -> OpenLLAMAPEFTModel:
    args = {
        "model": "openllama_peft",
        "imagebind_ckpt_path": "./pretrained_ckpt/imagebind_ckpt",
        "vicuna_ckpt_path": "./pretrained_ckpt/vicuna_ckpt/vicuna-7b-v1.5",
        "delta_ckpt_path": "./pretrained_ckpt/pandagpt_ckpt/7b/pandagpt_7b_max_len_1024/pytorch_model.pt",
        "stage": 2,
        "max_tgt_len": 128,
        "lora_r": 32,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
    }
    print("[INFO] loading PandaGPT …")
    model = OpenLLAMAPEFTModel(**args)
    delta = torch.load(
        args["delta_ckpt_path"],
        map_location=torch.device("cuda")
    )
    model.load_state_dict(delta, strict=False)
    model = model.eval().half().cuda()  # Use fp16 & GPU
    print("[INFO] model ready!")
    return model
# ---------- 2. Command line arguments ----------
def get_args():
    p = argparse.ArgumentParser()
    p.add_argument("--prompt", required=True, help="Text prompt (multi-turn conversations should be pre-assembled)")
    p.add_argument("--image", nargs="*", default=[], help="Path to image file(s)")
    p.add_argument("--audio", nargs="*", default=[], help="Path to audio file(s)")
    p.add_argument("--video", nargs="*", default=[], help="Path to video file(s)")
    p.add_argument("--thermal", nargs="*", default=[], help="Path to thermal image file(s)")
    p.add_argument("--top_p", type=float, default=0.01)
    p.add_argument("--temperature", type=float, default=1.0)
    p.add_argument("--max_len", type=int, default=256)
    return p.parse_args()
# ---------- 3. Main inference workflow ----------
def main():
    args = get_args()
    model = load_model()
    # Basic path validation to prevent typos
    for path in args.image + args.audio + args.video + args.thermal:
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
    # Generate response using multimodal inputs
    output = model.generate({
        "prompt": args.prompt,
        "image_paths": args.image,
        "audio_paths": args.audio,
        "video_paths": args.video,
        "thermal_paths": args.thermal,
        "top_p": args.top_p,
        "temperature": args.temperature,
        "max_tgt_len": args.max_len,
        "modality_embeds": [],  # Empty for first call; implement caching for subsequent interactions
    })
    print("\n======== PandaGPT Output ========")
    print(output)
    print("=================================")


if __name__ == "__main__":
    main()
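Because the script calls `load_state_dict(delta, strict=False)`, any mismatch between the parameter names in the checkpoint and those in the model is silently skipped. In PyTorch that call returns an object with `missing_keys` and `unexpected_keys`, so one diagnostic is to print a summary of both. The helper below is a minimal sketch (`summarize_key_mismatch` is a hypothetical name, not part of PandaGPT) that groups key names by their leading module so a wholesale miss stands out.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_key_mismatch(keys: Iterable[str], depth: int = 1) -> Dict[str, int]:
    """Count parameter names by their first `depth` dotted components."""
    counts = Counter(".".join(key.split(".")[:depth]) for key in keys)
    return dict(counts)


# Hypothetical usage inside load_model(), in place of the bare call:
#   result = model.load_state_dict(delta, strict=False)
#   print("unexpected:", summarize_key_mismatch(result.unexpected_keys))
#   print("missing:   ", summarize_key_mismatch(result.missing_keys))

if __name__ == "__main__":
    # Made-up key names, for illustration only.
    example = ["proj.weight", "proj.bias", "encoder.layer0.weight"]
    print(summarize_key_mismatch(example))
```

With a delta checkpoint, a large number of missing keys is expected, since the frozen base weights are not stored in the file. Unexpected keys are the more telling signal: they are weights present in the checkpoint that found no matching parameter in the model and were therefore never loaded.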