This project investigates whether LLM-based prompt enhancement can improve compositional text-to-image generation on the T2I-CompBench benchmark.
Text-to-image models like Stable Diffusion often struggle with compositional prompts that require:
- Numeracy: Generating the correct number of objects ("four ships, two tents")
- Spatial Relations: Positioning objects correctly ("a cat behind a dog")
Our approach uses Claude Opus 4.5 to enhance simple prompts with explicit spatial relations and scale-appropriate attributes, then evaluates whether this improves image generation quality.
| Task | Baseline Win Rate | Enhanced Win Rate | Tie Rate |
|---|---|---|---|
| Numeracy | 20% | 51% | 29% |
| 3D Spatial | 26% | 51% | 23% |
Enhanced prompts outperform baseline prompts in ~51% of cases across both tasks.
```
├── T2I-CompBench_dataset/            # Original benchmark prompts
│   ├── numeracy.txt
│   ├── 3d_spatial.txt
│   └── ...
│
├── prompt_enhanced.py                # LLM-based prompt enhancement
├── stable_diffusion_pipeline.py      # Batch image generation with SD3
├── extract_scene_graphs.py           # Extract scene graphs from prompts
├── iterative_scene_prompt.py         # Generate iterative prompts from scene graphs
│
├── numeracy_val/                     # Numeracy evaluation data
│   ├── sampled_prompts.txt           # 100 baseline prompts
│   ├── enhanced_prompts.txt          # 100 enhanced prompts
│   ├── generated_images_baseline/
│   └── generated_images_enhanced/
│
├── 3d_spatial_val/                   # 3D Spatial evaluation data
│   ├── sampled_prompts.txt
│   ├── enhanced_prompts.txt
│   ├── generated_images_baseline/
│   └── generated_images_enhanced/
│
├── evaluate_unbiased.py              # VLM-as-judge evaluation
├── evaluation_results_unbiased.json
└── evaluation_results_unbiased_3d.json
```
```bash
python prompt_enhanced.py --input_file numeracy_val/sampled_prompts.txt \
    --output_file numeracy_val/enhanced_prompts.json
```

Example transformation:
| Original | Enhanced |
|---|---|
| "four ships, two tents and two fish" | "Four toy ships on a table, with two small tents placed behind the ships and two toy fish in front of the ships." |
The enhancement adds:
- Clear spatial relations (behind, in front of, to the left)
- Scale adjustments (toy, miniature) for coherent scenes
- Simple surface placement (on a table, on the ground)
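As an illustration, the instruction sent to the LLM might be assembled roughly as follows. This is a minimal sketch; the template wording is an assumption, not the actual prompt used by `prompt_enhanced.py`:

```python
def build_enhancement_instruction(prompt: str) -> str:
    """Assemble a hypothetical instruction asking an LLM to enhance a T2I prompt."""
    return (
        "Rewrite the following text-to-image prompt so that it:\n"
        "1. States an explicit spatial relation (behind, in front of, to the left of) "
        "between objects.\n"
        "2. Adds scale cues (toy, miniature) so all objects fit one coherent scene.\n"
        "3. Places objects on a simple surface (on a table, on the ground).\n"
        "Keep every object and count from the original.\n\n"
        f"Original prompt: {prompt}"
    )

instruction = build_enhancement_instruction("four ships, two tents and two fish")
print(instruction.splitlines()[-1])
```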
Generate images using Stable Diffusion for both baseline and enhanced prompts. Images should be named:
```
img_0_prompt_text.png
img_1_prompt_text.png
...
```
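A small helper can produce this naming convention; the sanitization rules here (lowercasing, collapsing punctuation to underscores, truncation) are an assumption and should be adapted to whatever the generation script actually does:

```python
import re

def image_filename(index: int, prompt: str, max_len: int = 60) -> str:
    """Build an img_{index}_{prompt_text}.png filename from a prompt.

    Non-alphanumeric runs are collapsed to underscores and the prompt
    text is truncated so paths stay filesystem-friendly.
    """
    text = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return f"img_{index}_{text[:max_len]}.png"

print(image_filename(0, "four ships, two tents and two fish"))
# img_0_four_ships_two_tents_and_two_fish.png
```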
```bash
export ANTHROPIC_API_KEY="your-api-key"
python evaluate_unbiased.py --limit 10   # Test with 10 pairs first
python evaluate_unbiased.py              # Full evaluation
```

Evaluation methodology:
- Uses Claude Opus 4.5 as an unbiased judge
- Both images evaluated against the same baseline prompt (fair comparison)
- Image order is randomized to prevent position bias
- Neutral labels ("Image A", "Image B") instead of "Baseline/Enhanced"
- Supports Tie when both images perform equally
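The randomization and un-mapping steps above can be sketched as follows. The judge call is stubbed out; `judge` here is a hypothetical callable returning `"A"`, `"B"`, or `"Tie"`, not the actual interface in `evaluate_unbiased.py`:

```python
import random

def judge_pair(baseline_img, enhanced_img, prompt, judge, rng=random):
    """Present two images in random order under neutral labels and map
    the judge's verdict back to baseline/enhanced/tie."""
    swapped = rng.random() < 0.5  # randomize which image is shown as "A"
    image_a, image_b = (enhanced_img, baseline_img) if swapped else (baseline_img, enhanced_img)
    verdict = judge(prompt, image_a, image_b)  # "A", "B", or "Tie"
    if verdict == "Tie":
        return "tie", swapped
    # "A" is the enhanced image exactly when the order was swapped
    is_enhanced = (verdict == "A") == swapped
    return ("enhanced" if is_enhanced else "baseline"), swapped

# Stub judge that always prefers image A, to show the un-mapping:
rng = random.Random(0)
result, swapped = judge_pair("base.png", "enh.png", "four ships", lambda p, a, b: "A", rng=rng)
```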
**`prompt_enhanced.py`** — Enhances simple prompts with spatial relations using Claude.

```bash
python prompt_enhanced.py --test                    # Test mode
python prompt_enhanced.py --limit 10                # Process first 10
python prompt_enhanced.py --input_file prompts.txt  # Custom input
```

**`stable_diffusion_pipeline.py`** — Generates images using Stable Diffusion 3 Medium with memory optimizations for consumer hardware (CPU offload, fp16).
```bash
# Run the pipeline (configured in __main__)
python stable_diffusion_pipeline.py
```

**`extract_scene_graphs.py`** — Extracts structured scene graphs from natural language prompts.
```bash
python extract_scene_graphs.py --input_file T2I-CompBench_dataset/numeracy.txt \
    --output_file scene_graphs_output/numeracy.json
```

**`iterative_scene_prompt.py`** — Generates progressive prompts by traversing scene graphs step-by-step.
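As an illustration, a scene graph and a step-wise traversal might look like this. The schema is an assumption for the sketch, not the exact output format of `extract_scene_graphs.py`:

```python
# Hypothetical scene-graph schema: objects with counts plus pairwise relations.
scene_graph = {
    "objects": [
        {"name": "ships", "count": 4, "attributes": ["toy"]},
        {"name": "tents", "count": 2, "attributes": ["small"]},
        {"name": "fish", "count": 2, "attributes": ["toy"]},
    ],
    "relations": [
        ("tents", "behind", "ships"),
        ("fish", "in front of", "ships"),
    ],
}

def iterative_prompts(graph):
    """Yield progressively richer prompts: all objects first, then one relation at a time."""
    prompt = ", ".join(
        f"{o['count']} {' '.join(o['attributes'])} {o['name']}" for o in graph["objects"]
    )
    yield prompt
    for subj, rel, obj in graph["relations"]:
        prompt += f", with the {subj} {rel} the {obj}"
        yield prompt

for p in iterative_prompts(scene_graph):
    print(p)
```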
```bash
python iterative_scene_prompt.py --test                               # Quick test
python iterative_scene_prompt.py --scene_graph_file scene_graphs.json
```

**`evaluate_unbiased.py`** — Compares baseline vs. enhanced images using a VLM as judge.
```bash
python evaluate_unbiased.py \
    --baseline_prompts numeracy_val/sampled_prompts.txt \
    --enhanced_prompts numeracy_val/enhanced_prompts.txt \
    --baseline_images numeracy_val/generated_images_baseline/generated_images_baseline \
    --enhanced_images numeracy_val/generated_images_enhanced/generated_images_enhanced \
    --output_file evaluation_results.json
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Requirements:
- Python 3.8+
- anthropic >= 0.39.0
- tqdm >= 4.66.0

Environment:

```bash
export ANTHROPIC_API_KEY="your-api-key-here"
```

The evaluation produces:
```json
{
  "summary": {
    "total_pairs": 100,
    "baseline_wins": 20,
    "enhanced_wins": 51,
    "ties": 29,
    "baseline_win_rate": 0.20,
    "enhanced_win_rate": 0.51,
    "tie_rate": 0.29,
    "position_bias_analysis": {
      "times_A_chosen": 34,
      "times_B_chosen": 37,
      "A_rate": 0.479
    }
  }
}
```

Position Bias Analysis: an `A_rate` close to 0.5 indicates the judge showed little position bias.
- Same Prompt for Both: Both images are evaluated against the baseline prompt only
- Randomized Order: Which image is "A" vs "B" is randomized
- Neutral Labels: No "baseline" or "enhanced" labels shown to the judge
- Tie Support: Judge can declare a tie when both images are equally good/bad
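The summary statistics above can be recomputed from per-pair judgments; here is a sketch, where the per-pair record fields (`winner`, `label_chosen`) are assumptions rather than the exact format written by `evaluate_unbiased.py`:

```python
def summarize(records):
    """Aggregate per-pair judgments into win rates and a position-bias check.

    Each record is assumed to look like:
      {"winner": "baseline" | "enhanced" | "tie",
       "label_chosen": "A" | "B" | None}   # None for ties
    """
    n = len(records)
    wins = {"baseline": 0, "enhanced": 0, "tie": 0}
    a_chosen = b_chosen = 0
    for r in records:
        wins[r["winner"]] += 1
        if r["label_chosen"] == "A":
            a_chosen += 1
        elif r["label_chosen"] == "B":
            b_chosen += 1
    decided = a_chosen + b_chosen
    return {
        "total_pairs": n,
        "baseline_win_rate": wins["baseline"] / n,
        "enhanced_win_rate": wins["enhanced"] / n,
        "tie_rate": wins["tie"] / n,
        # A_rate near 0.5 suggests the judge favors neither position
        "A_rate": a_chosen / decided if decided else None,
    }

records = (
    [{"winner": "enhanced", "label_chosen": "A"}] * 3
    + [{"winner": "baseline", "label_chosen": "B"}]
    + [{"winner": "tie", "label_chosen": None}]
)
print(summarize(records))
```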
- Evaluation uses a VLM (Claude) as judge, which may have its own biases
- Image generation uses a single random seed per prompt
- Sample size is 100 pairs per task
If you use this work, please cite:
```bibtex
@misc{prompt_enhancement_t2i,
  title={LLM-Enhanced Prompt Engineering for Compositional Text-to-Image Generation},
  author={Your Name},
  year={2024},
  howpublished={Course Project, CS 771}
}
```

- T2I-CompBench for the benchmark dataset
- Anthropic Claude for the LLM API
- Stable Diffusion for image generation