
LLM-Enhanced Prompt Engineering for Compositional Text-to-Image Generation

This project investigates whether LLM-based prompt enhancement can improve compositional text-to-image generation on the T2I-CompBench benchmark.

Overview

Text-to-image models like Stable Diffusion often struggle with compositional prompts that require:

  • Numeracy: Generating the correct number of objects ("four ships, two tents")
  • Spatial Relations: Positioning objects correctly ("a cat behind a dog")

Our approach uses Claude Opus 4.5 to enhance simple prompts with explicit spatial relations and scale-appropriate attributes, then evaluates whether this improves image generation quality.

Key Findings

| Task       | Baseline Win Rate | Enhanced Win Rate | Tie Rate |
|------------|-------------------|-------------------|----------|
| Numeracy   | 20%               | 51%               | 29%      |
| 3D Spatial | 26%               | 51%               | 23%      |

Enhanced prompts outperform baseline prompts in ~51% of cases across both tasks.

Project Structure

├── T2I-CompBench_dataset/       # Original benchmark prompts
│   ├── numeracy.txt
│   ├── 3d_spatial.txt
│   └── ...
│
├── prompt_enhanced.py           # LLM-based prompt enhancement
├── stable_diffusion_pipeline.py # Batch image generation with SD3
├── extract_scene_graphs.py      # Extract scene graphs from prompts
├── iterative_scene_prompt.py    # Generate iterative prompts from scene graphs
│
├── numeracy_val/                # Numeracy evaluation data
│   ├── sampled_prompts.txt      # 100 baseline prompts
│   ├── enhanced_prompts.txt     # 100 enhanced prompts
│   ├── generated_images_baseline/
│   └── generated_images_enhanced/
│
├── 3d_spatial_val/              # 3D Spatial evaluation data
│   ├── sampled_prompts.txt
│   ├── enhanced_prompts.txt
│   ├── generated_images_baseline/
│   └── generated_images_enhanced/
│
├── evaluate_unbiased.py         # VLM-as-judge evaluation
├── evaluation_results_unbiased.json
└── evaluation_results_unbiased_3d.json

Pipeline

1. Prompt Enhancement

python prompt_enhanced.py --input_file numeracy_val/sampled_prompts.txt \
                          --output_file numeracy_val/enhanced_prompts.json

Example transformation:

| Original | Enhanced |
|----------|----------|
| "four ships, two tents and two fish" | "Four toy ships on a table, with two small tents placed behind the ships and two toy fish in front of the ships." |

The enhancement adds:

  • Clear spatial relations (behind, in front of, to the left)
  • Scale adjustments (toy, miniature) for coherent scenes
  • Simple surface placement (on a table, on the ground)
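The enhancement step can be sketched as a single Claude call per prompt. This is a minimal illustration assuming the Anthropic Python SDK; the function names, model string, and instruction wording here are assumptions, not the exact code in prompt_enhanced.py.

```python
# Illustrative sketch of one enhancement call (helper names and the
# instruction text are assumptions; see prompt_enhanced.py for the real code).
import os

ENHANCE_INSTRUCTIONS = (
    "Rewrite the prompt for a text-to-image model. Keep every object and its "
    "count, add explicit spatial relations (behind, in front of, to the left), "
    "scale cues (toy, miniature) where object sizes clash, and a simple "
    "surface (on a table, on the ground). Return only the rewritten prompt."
)

def build_enhancement_request(prompt: str) -> dict:
    """Build the messages payload sent to Claude for one baseline prompt."""
    return {
        "model": "claude-opus-4-5",  # assumed model identifier
        "max_tokens": 200,
        "system": ENHANCE_INSTRUCTIONS,
        "messages": [{"role": "user", "content": prompt}],
    }

def enhance_prompt(prompt: str) -> str:
    """Call Claude once and return the enhanced prompt (needs ANTHROPIC_API_KEY)."""
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    resp = client.messages.create(**build_enhancement_request(prompt))
    return resp.content[0].text.strip()
```

Keeping the instruction in the system turn and the raw prompt as the only user content makes batch processing straightforward: one request per line of sampled_prompts.txt.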

2. Image Generation

Generate images using Stable Diffusion for both baseline and enhanced prompts. Images should be named:

img_0_prompt_text.png
img_1_prompt_text.png
...
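The naming scheme above is what the evaluator uses to pair images with prompts. A minimal sketch of the filename construction (the exact slugging rule is an assumption; adjust it to match stable_diffusion_pipeline.py):

```python
# Assumed slugging rule for img_<i>_<prompt_text>.png; verify against
# stable_diffusion_pipeline.py before relying on it.
import re

def image_filename(index: int, prompt: str, max_len: int = 60) -> str:
    """Turn a prompt into the img_<i>_<prompt_text>.png name the evaluator expects."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")[:max_len]
    return f"img_{index}_{slug}.png"
```

Whatever rule you use, it must be applied identically to the baseline and enhanced runs so that index i in both directories refers to the same underlying prompt.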

3. Evaluation (VLM as Judge)

export ANTHROPIC_API_KEY="your-api-key"
python evaluate_unbiased.py --limit 10  # Test with 10 pairs first
python evaluate_unbiased.py             # Full evaluation

Evaluation methodology:

  • Uses Claude Opus 4.5 as an unbiased judge
  • Both images evaluated against the same baseline prompt (fair comparison)
  • Image order is randomized to prevent position bias
  • Neutral labels ("Image A", "Image B") instead of "Baseline/Enhanced"
  • Supports Tie when both images perform equally
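The core of the debiasing logic can be sketched in a few lines. The helper names below are hypothetical (evaluate_unbiased.py's internals may differ); the point is that label assignment is randomized before the judge sees anything, and verdicts are mapped back afterwards.

```python
# Sketch of one unbiased comparison (helper names are hypothetical):
# shuffle the pair, attach neutral A/B labels, judge against the
# baseline prompt only, then map the verdict back.
import random

def make_judge_inputs(prompt, baseline_img, enhanced_img, rng=random):
    """Randomize order and attach neutral labels so the judge sees no bias cues."""
    pair = [("baseline", baseline_img), ("enhanced", enhanced_img)]
    rng.shuffle(pair)
    labels = {"Image A": pair[0], "Image B": pair[1]}
    question = (
        f'Which image better matches the prompt: "{prompt}"? '
        "Answer A, B, or Tie."
    )
    return labels, question

def resolve_verdict(labels, answer):
    """Map the judge's A/B/Tie answer back to baseline/enhanced/tie."""
    if answer == "Tie":
        return "tie"
    return labels[f"Image {answer}"][0]
```

Recording which side ended up as "A" for each pair is what makes the position-bias analysis in the results JSON possible.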

Scripts

prompt_enhanced.py

Enhances simple prompts with spatial relations using Claude.

python prompt_enhanced.py --test                    # Test mode
python prompt_enhanced.py --limit 10                # Process first 10
python prompt_enhanced.py --input_file prompts.txt  # Custom input

stable_diffusion_pipeline.py

Generates images using Stable Diffusion 3 Medium with memory optimizations for consumer hardware (CPU offload, fp16).

# Run the pipeline (configured in __main__)
python stable_diffusion_pipeline.py

extract_scene_graphs.py

Extracts structured scene graphs from natural language prompts.

python extract_scene_graphs.py --input_file T2I-CompBench_dataset/numeracy.txt \
                               --output_file scene_graphs_output/numeracy.json

iterative_scene_prompt.py

Generates progressive prompts by traversing scene graphs step-by-step.

python iterative_scene_prompt.py --test  # Quick test
python iterative_scene_prompt.py --scene_graph_file scene_graphs.json
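The traversal idea can be illustrated with a toy example. The scene graph format shown here is an assumption for illustration only; the real JSON schema produced by extract_scene_graphs.py may differ.

```python
# Toy illustration of progressive prompt generation from a scene graph
# (the graph schema here is assumed, not the repo's actual format).
def iterative_prompts(graph):
    """Emit one cumulative prompt per (subject, relation, object) edge."""
    prompts, clauses = [], []
    for subj, rel, obj in graph["relations"]:
        clauses.append(f"{subj} {rel} {obj}")
        prompts.append(", ".join(clauses))
    return prompts

graph = {"relations": [("a cat", "behind", "a dog"),
                       ("a ball", "to the left of", "the dog")]}
```

Each successive prompt adds one relation, so the scene is built up step by step rather than described all at once.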

evaluate_unbiased.py

Compares baseline vs enhanced images using VLM as judge.

python evaluate_unbiased.py \
  --baseline_prompts numeracy_val/sampled_prompts.txt \
  --enhanced_prompts numeracy_val/enhanced_prompts.txt \
  --baseline_images numeracy_val/generated_images_baseline/generated_images_baseline \
  --enhanced_images numeracy_val/generated_images_enhanced/generated_images_enhanced \
  --output_file evaluation_results.json

Installation

pip install -r requirements.txt

Requirements:

  • Python 3.8+
  • anthropic >= 0.39.0
  • tqdm >= 4.66.0

Environment:

export ANTHROPIC_API_KEY="your-api-key-here"

Evaluation Metrics

The evaluation produces:

{
  "summary": {
    "total_pairs": 100,
    "baseline_wins": 20,
    "enhanced_wins": 51,
    "ties": 29,
    "baseline_win_rate": 0.20,
    "enhanced_win_rate": 0.51,
    "tie_rate": 0.29,
    "position_bias_analysis": {
      "times_A_chosen": 34,
      "times_B_chosen": 37,
      "A_rate": 0.479
    }
  }
}

Position Bias Analysis: An A_rate close to 0.5 indicates the order randomization is working; the 0.479 observed here is well within chance for 71 non-tie verdicts, so position bias is negligible.
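The A_rate in the summary above is just the fraction of non-tie verdicts in which "Image A" was chosen (ties are excluded, so the counts need not sum to total_pairs):

```python
# How the A_rate figure is derived from the position-bias counts.
def a_rate(times_a: int, times_b: int) -> float:
    """Fraction of non-tie verdicts in which 'Image A' was chosen."""
    return times_a / (times_a + times_b)
```

For the numbers above, a_rate(34, 37) gives 34/71 ≈ 0.479.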

Unbiased Evaluation Design

  1. Same Prompt for Both: Both images are evaluated against the baseline prompt only
  2. Randomized Order: Which image is "A" vs "B" is randomized
  3. Neutral Labels: No "baseline" or "enhanced" labels shown to the judge
  4. Tie Support: Judge can declare a tie when both images are equally good/bad

Limitations

  • Evaluation uses a VLM (Claude) as judge, which may have its own biases
  • Image generation uses a single random seed per prompt
  • Sample size is 100 pairs per task

Citation

If you use this work, please cite:

@misc{prompt_enhancement_t2i,
  title={LLM-Enhanced Prompt Engineering for Compositional Text-to-Image Generation},
  author={Your Name},
  year={2024},
  howpublished={Course Project, CS 771}
}

Acknowledgments
