Image description completely mismatches actual image content 🤔 #30

### Problem Description
When running inference on `car_image.jpg`, the generated description ("a person looking at their reflection in the mirror") is unrelated to the actual image content. The model appears to ignore the input image and generate an arbitrary description.

### Reproduction Steps
1. Command executed:
```bash
CUDA_VISIBLE_DEVICES=0 python -m code.infer \
  --prompt "Describe this scene in detail." \
  --image ./code/assets/images/car_image.jpg \
  --top_p 0.05 --temperature 0.7
```

2. Actual image used: `car_image.jpg` 

### Expected Behavior
The model should generate an accurate description of the car image.

### Actual Output
```
======== PandaGPT Output ========
against a wall and is positioned at an angle, with the top of the mirror visible above the wall and the bottom of the mirror visible below the wall. The person is looking at their reflection in the mirror and appears to be adjusting their hair. The room behind the person is dimly lit, with a door visible on the right side of the frame. The person's reflection in the mirror is slightly blurry, indicating that the person is moving or adjusting their position. The overall tone of the scene is introspective and personal, with the focus on the person's reflection in the mirror rather than the surrounding environment.
=================================
```
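
To test the suspicion that the image is being ignored, the same prompt can be run against two clearly different images and the outputs compared. A minimal sketch, assuming a `generate` callable that wraps `model.generate` from `infer.py` and takes a prompt and an image path. The helper `outputs_match` and the stand-in `fake_generate` are hypothetical names used only for illustration:

```python
from typing import Callable


def outputs_match(generate: Callable[[str, str], str], prompt: str,
                  image_a: str, image_b: str) -> bool:
    """Return True if two different images yield the same text for one prompt."""
    out_a = generate(prompt, image_a)
    out_b = generate(prompt, image_b)
    return out_a.strip() == out_b.strip()


if __name__ == "__main__":
    # Stand-in generator that never looks at the image, for demonstration only.
    def fake_generate(prompt: str, image_path: str) -> str:
        return "a person looking at their reflection in the mirror"

    print(outputs_match(fake_generate, "Describe this scene in detail.",
                        "car_image.jpg", "other_image.jpg"))
```

Because decoding is sampled (`--temperature 0.7`), identical outputs for different images would be a strong hint that the image is not reaching the language model, while differing outputs alone would not prove that it is.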

### Environment
- Python: 3.10
- Torchvision: 0.14.1
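
The list above only covers Python and Torchvision. A small sketch for collecting the remaining versions, using only the standard library; the package names passed in are a guess at what is relevant here, and `collect_versions` is a hypothetical helper:

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Package list is an assumption; adjust to the actual environment.
    wanted = ["torch", "torchvision", "transformers", "peft"]
    for name, version in collect_versions(wanted).items():
        print(f"{name}: {version}")
```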

### infer.py
```python
#!/usr/bin/env python3
# infer.py
"""
Standalone inference script:
CUDA_VISIBLE_DEVICES=0 python -m code.infer \
  --prompt "Describe this scene" \
  --image demo.jpg --top_p 0.05 --temperature 0.7
"""

import argparse
import os

import torch
from code.model.openllama import OpenLLAMAPEFTModel

# ---------- 1. Load model (similar to web_demo.py) ----------
def load_model() -> OpenLLAMAPEFTModel:
    args = {
        "model": "openllama_peft",
        "imagebind_ckpt_path": "./pretrained_ckpt/imagebind_ckpt",
        "vicuna_ckpt_path": "./pretrained_ckpt/vicuna_ckpt/vicuna-7b-v1.5",
        "delta_ckpt_path": "./pretrained_ckpt/pandagpt_ckpt/7b/pandagpt_7b_max_len_1024/pytorch_model.pt",
        "stage": 2,
        "max_tgt_len": 128,
        "lora_r": 32,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
    }
    print("[INFO] loading PandaGPT …")
    model = OpenLLAMAPEFTModel(**args)
    delta = torch.load(
        args["delta_ckpt_path"],
        map_location=torch.device("cuda")
    )
    model.load_state_dict(delta, strict=False)
    model = model.eval().half().cuda()          # Use fp16 & GPU
    print("[INFO] model ready!")
    return model

# ---------- 2. Command line arguments ----------
def get_args():
    p = argparse.ArgumentParser()
    p.add_argument("--prompt",       required=True, help="Text prompt (multi-turn conversations should be pre-assembled)")
    p.add_argument("--image",        nargs="*", default=[], help="Path to image file(s)")
    p.add_argument("--audio",        nargs="*", default=[], help="Path to audio file(s)")
    p.add_argument("--video",        nargs="*", default=[], help="Path to video file(s)")
    p.add_argument("--thermal",      nargs="*", default=[], help="Path to thermal image file(s)")
    p.add_argument("--top_p",        type=float, default=0.01)
    p.add_argument("--temperature",  type=float, default=1.0)
    p.add_argument("--max_len",      type=int,   default=256)
    return p.parse_args()

# ---------- 3. Main inference workflow ----------
def main():
    args = get_args()
    model = load_model()

    # Basic path validation to prevent typos
    for path in args.image + args.audio + args.video + args.thermal:
        if not os.path.isfile(path):
            raise FileNotFoundError(path)

    # Generate response using multimodal inputs
    output = model.generate({
        "prompt":          args.prompt,
        "image_paths":     args.image,
        "audio_paths":     args.audio,
        "video_paths":     args.video,
        "thermal_paths":   args.thermal,
        "top_p":           args.top_p,
        "temperature":     args.temperature,
        "max_tgt_len":     args.max_len,
        "modality_embeds": [],            # Empty for first call; implement caching for subsequent interactions
    })

    print("\n======== PandaGPT Output ========")
    print(output)
    print("=================================")

if __name__ == "__main__":
    main()
``` 
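
One thing not yet verified in the script above: `load_state_dict(..., strict=False)` does not raise on key mismatches, so it is possible that few or none of the delta weights were actually applied. A minimal sketch of a check, assuming the checkpoint is a flat dict mapping parameter names to tensors. The helper name `compare_keys` is hypothetical and not part of PandaGPT, and the key names in the demo are toy values:

```python
def compare_keys(model_keys, ckpt_keys):
    """Report how the checkpoint's parameter names line up with the model's.

    Hypothetical helper for illustration; it compares names only, not shapes.
    """
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    return {
        # Names present in both: these are the weights that get loaded.
        "loaded": sorted(ckpt_keys & model_keys),
        # In the checkpoint but not the model: skipped under strict=False.
        "unexpected": sorted(ckpt_keys - model_keys),
        # In the model but not the checkpoint: keep their current values.
        "missing": sorted(model_keys - ckpt_keys),
    }


if __name__ == "__main__":
    # Toy names standing in for model.state_dict().keys() and delta.keys().
    report = compare_keys(
        model_keys=["proj.weight", "proj.bias", "backbone.weight"],
        ckpt_keys=["proj.weight", "proj.bias", "extra.weight"],
    )
    for name, keys in report.items():
        print(f"{name}: {keys}")
```

In `load_model()` this could be called as `compare_keys(model.state_dict().keys(), delta.keys())`. The object returned by `load_state_dict` also exposes `missing_keys` and `unexpected_keys`, which carry the same information. A large `unexpected` list would suggest the checkpoint does not match the model it is loaded into, which might explain output that is unrelated to the image.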
