Image description completely mismatches actual image content 🤔 #30

@strayberry

Problem Description

When running inference on car_image.jpg, the generated description ("a person looking at their reflection in the mirror") is unrelated to the actual image content. The model appears to ignore the input image and generate an arbitrary description.
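One way to test the suspicion that the image is being ignored is to run the same prompt against several different images and compare the outputs. The sketch below assumes a hypothetical wrapper `generate_fn(prompt, image_path)` around the model call; that wrapper and the stand-in `fake_generate` are illustrative names, not part of PandaGPT. With sampling enabled, outputs can differ by chance, so the comparison is only meaningful with a fixed seed or near-greedy settings.

```python
from typing import Callable, Dict, Sequence


def outputs_by_image(
    generate_fn: Callable[[str, str], str],
    prompt: str,
    image_paths: Sequence[str],
) -> Dict[str, str]:
    """Run the same prompt once per image and collect the outputs by path."""
    return {path: generate_fn(prompt, path) for path in image_paths}


def image_is_ignored(outputs: Dict[str, str]) -> bool:
    """True when at least two images were tried and all outputs are identical."""
    texts = list(outputs.values())
    return len(texts) >= 2 and len(set(texts)) == 1


# Hypothetical stand-in that mimics a model ignoring its image input.
def fake_generate(prompt: str, image_path: str) -> str:
    return "a person looking at their reflection in the mirror"


if __name__ == "__main__":
    results = outputs_by_image(fake_generate, "Describe this scene.", ["a.jpg", "b.jpg"])
    print("image ignored:", image_is_ignored(results))
```

Identical outputs across clearly different images would support the hypothesis; different outputs would point instead at a poor but image-dependent description.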

Reproduction Steps

  1. Command executed:
CUDA_VISIBLE_DEVICES=0 python -m code.infer \
  --prompt "Describe this scene in detail." \
  --image ./code/assets/images/car_image.jpg \
  --top_p 0.05 --temperature 0.7
  2. Actual image used: car_image.jpg
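To rule out a mix-up on the input side, the file passed to `--image` can be fingerprinted before inference. This is a minimal sketch using only the standard library; the `fingerprint` helper is a hypothetical name, and the JPEG check only looks at the leading magic bytes, so it does not prove the file decodes correctly.

```python
import hashlib
from pathlib import Path

# JPEG files begin with the bytes FF D8 FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def fingerprint(path: str) -> dict:
    """Return the resolved path, size, SHA-256 and a JPEG magic-byte check."""
    data = Path(path).read_bytes()
    return {
        "path": str(Path(path).resolve()),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "looks_like_jpeg": data.startswith(JPEG_MAGIC),
    }
```

Comparing the reported SHA-256 with the hash of the image opened in a viewer confirms that both refer to the same file.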

Expected Behavior

The model should generate an accurate description of the car image.

Actual Output

======== PandaGPT Output ========
against a wall and is positioned at an angle, with the top of the mirror visible above the wall and the bottom of the mirror visible below the wall. The person is looking at their reflection in the mirror and appears to be adjusting their hair. The room behind the person is dimly lit, with a door visible on the right side of the frame. The person's reflection in the mirror is slightly blurry, indicating that the person is moving or adjusting their position. The overall tone of the scene is introspective and personal, with the focus on the person's reflection in the mirror rather than the surrounding environment.
=================================

Environment

  • Python: 3.10
  • Torchvision: 0.14.1
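The environment list above covers only two entries. A small script such as the following could collect the remaining versions in one go. The package names passed in (`torch`, `transformers`, `peft`) are assumptions about what this setup depends on, not something stated in the report.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust to the actual requirements file.
    for name, version in collect_versions(
        ["torch", "torchvision", "transformers", "peft"]
    ).items():
        print(f"{name}: {version}")
```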

infer.py

#!/usr/bin/env python3
# infer.py
"""
Standalone inference script. Run as a module from the repository root:
CUDA_VISIBLE_DEVICES=0 python -m code.infer \
  --prompt "Describe this scene" \
  --image demo.jpg --top_p 0.05 --temperature 0.7
"""

import argparse
import os

import torch
from code.model.openllama import OpenLLAMAPEFTModel

# ---------- 1. Load model (similar to web_demo.py) ----------
def load_model() -> OpenLLAMAPEFTModel:
    args = {
        "model": "openllama_peft",
        "imagebind_ckpt_path": "./pretrained_ckpt/imagebind_ckpt",
        "vicuna_ckpt_path": "./pretrained_ckpt/vicuna_ckpt/vicuna-7b-v1.5",
        "delta_ckpt_path": "./pretrained_ckpt/pandagpt_ckpt/7b/pandagpt_7b_max_len_1024/pytorch_model.pt",
        "stage": 2,
        "max_tgt_len": 128,
        "lora_r": 32,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
    }
    print("[INFO] loading PandaGPT …")
    model = OpenLLAMAPEFTModel(**args)
    delta = torch.load(
        args["delta_ckpt_path"],
        map_location=torch.device("cuda")
    )
    model.load_state_dict(delta, strict=False)
    model = model.eval().half().cuda()          # Use fp16 & GPU
    print("[INFO] model ready!")
    return model

# ---------- 2. Command line arguments ----------
def get_args():
    p = argparse.ArgumentParser()
    p.add_argument("--prompt",       required=True, help="Text prompt (multi-turn conversations should be pre-assembled)")
    p.add_argument("--image",        nargs="*", default=[], help="Path to image file(s)")
    p.add_argument("--audio",        nargs="*", default=[], help="Path to audio file(s)")
    p.add_argument("--video",        nargs="*", default=[], help="Path to video file(s)")
    p.add_argument("--thermal",      nargs="*", default=[], help="Path to thermal image file(s)")
    p.add_argument("--top_p",        type=float, default=0.01)
    p.add_argument("--temperature",  type=float, default=1.0)
    p.add_argument("--max_len",      type=int,   default=256)
    return p.parse_args()

# ---------- 3. Main inference workflow ----------
def main():
    args = get_args()
    model = load_model()

    # Basic path validation to prevent typos
    for path in args.image + args.audio + args.video + args.thermal:
        if not os.path.isfile(path):
            raise FileNotFoundError(path)

    # Generate response using multimodal inputs
    output = model.generate({
        "prompt":          args.prompt,
        "image_paths":     args.image,
        "audio_paths":     args.audio,
        "video_paths":     args.video,
        "thermal_paths":   args.thermal,
        "top_p":           args.top_p,
        "temperature":     args.temperature,
        "max_tgt_len":     args.max_len,
        "modality_embeds": [],            # Empty for first call; implement caching for subsequent interactions
    })

    print("\n======== PandaGPT Output ========")
    print(output)
    print("=================================")

if __name__ == "__main__":
    main()
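Because the script calls `load_state_dict(delta, strict=False)`, keys that do not match are skipped without an error, so a checkpoint that does not fit the model can load silently. The sketch below compares the two key sets directly; `diff_state_dict_keys` is a hypothetical helper, and the key names in the demo are made up for illustration.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    model_keys: Iterable[str],
    ckpt_keys: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key lists for a model/checkpoint pair."""
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    missing = sorted(model_set - ckpt_set)     # in the model, absent from the checkpoint
    unexpected = sorted(ckpt_set - model_set)  # in the checkpoint, unknown to the model
    return missing, unexpected


if __name__ == "__main__":
    # Made-up key names, for illustration only.
    missing, unexpected = diff_state_dict_keys(
        ["proj.weight", "proj.bias", "lora_A.weight"],
        ["proj.weight", "proj.bias", "lora_B.weight"],
    )
    print("missing:", missing)
    print("unexpected:", unexpected)
```

In the script this could be called as `diff_state_dict_keys(model.state_dict().keys(), delta.keys())`. PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys` fields that carries the same information. Since the file is described as a delta checkpoint, it presumably holds only a subset of the weights, so some missing keys are to be expected; unexpected keys are the more telling sign of a mismatch.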
