Local project scaffold to replicate and extend the multilayer diffusion model described in Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition. The goal is to build a trainable and evaluable pipeline that learns to decompose raster images into semantically disentangled RGBA layers, leveraging the high-quality multilayer dataset already available on this machine.
- Rendered RGBA samples: `/home/ubuntu/jjseol/layer_data/inpainting_250k_subset_rendered`
- Layout JSON with component metadata/descriptions: `/home/ubuntu/jjseol/layer_data/inpainting_250k_subset`
- Visualization notebook reference: `/home/ubuntu/jjseol/data/test_lica_api copy.ipynb` (shows how to load backgrounds, components, and masks)
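A minimal sketch of loading one sample from these directories. The per-sample JSON schema (`components`, `file`, `description`) and the filename convention are assumptions to verify against the visualization notebook.

```python
import json
from pathlib import Path

from PIL import Image

RENDERED = Path("/home/ubuntu/jjseol/layer_data/inpainting_250k_subset_rendered")
LAYOUT = Path("/home/ubuntu/jjseol/layer_data/inpainting_250k_subset")

# Pick an arbitrary layout file; the JSON keys below are assumed, not confirmed.
layout_path = next(LAYOUT.glob("**/*.json"))
with open(layout_path) as f:
    layout = json.load(f)

# Hypothetical schema: a list of components, each with a rendered filename and a text description.
for comp in layout.get("components", []):
    rgba = Image.open(RENDERED / comp["file"]).convert("RGBA")  # keep the alpha channel
    print(comp.get("description", "<no description>"), rgba.size)
```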
- `src/`: model, data pipelines, training and evaluation scripts
- `configs/`: experiment configs (model hyperparams, data paths, training stages)
- `notebooks/`: exploratory notebooks for data inspection and qualitative evaluation
- `scripts/`: CLI utilities (data prep, evaluation, visualization, training entry)
- `docs/`: notes on methodology, ablations, and metrics
```bash
cd /home/ubuntu/qwen-image-layered
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt  # adjust torch build if needed (cuda/rocm/cpu)
```
- Uses the local layered dataset:
  - rendered RGBA: `/home/ubuntu/jjseol/layer_data/inpainting_250k_subset_rendered`
  - layout JSON: `/home/ubuntu/jjseol/layer_data/inpainting_250k_subset`
- Dataset class: `src/data/multilayer_dataset.py`
- Quick check: `python scripts/dataset_sanity_check.py --max-samples 2 --batch-size 1`
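A minimal sketch of what the sanity check exercises, assuming the dataset class follows the standard PyTorch `Dataset` protocol; the class name and constructor arguments below are hypothetical and should be checked against `src/data/multilayer_dataset.py`.

```python
from torch.utils.data import DataLoader

from src.data.multilayer_dataset import MultilayerDataset  # hypothetical class name

dataset = MultilayerDataset(
    rendered_root="/home/ubuntu/jjseol/layer_data/inpainting_250k_subset_rendered",  # assumed arg
    layout_root="/home/ubuntu/jjseol/layer_data/inpainting_250k_subset",             # assumed arg
)
loader = DataLoader(dataset, batch_size=1, num_workers=0)
batch = next(iter(loader))
print(len(dataset), type(batch))
```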
- Script: `src/data_generation/prepare_rgba_buckets.py` generates RGBA components and their composites, resized via the bucket strategy (rounded to multiples of 32, capped max side/pixels).
- Example (train 1k / val 50):

```bash
conda activate training
PYTHONPATH=. python src/data_generation/prepare_rgba_buckets.py \
    --output-root data/rgba_layers \
    --train-count 1000 \
    --val-count 50
```
- Output layout: `data/rgba_layers/{split}/{bucket}/sample_compXXX.png` and `.../{bucket}/sample_composite.png`, manifest at `metadata/manifest.json`.
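For reference, a minimal sketch of the bucket-style resizing described above: cap the longest side and total pixel count, then snap both sides to multiples of 32. The cap values and rounding details here are assumptions, not the script's actual defaults.

```python
def bucket_resize(width: int, height: int, max_side: int = 1024, max_pixels: int = 1024 * 1024) -> tuple[int, int]:
    """Scale (width, height) to fit the caps, then snap both sides to multiples of 32."""
    scale = min(1.0, max_side / max(width, height))              # cap the longest side
    scale = min(scale, (max_pixels / (width * height)) ** 0.5)   # cap the total pixel count
    w = max(32, int(round(width * scale / 32)) * 32)             # snap to the 32-pixel grid
    h = max(32, int(round(height * scale / 32)) * 32)
    return w, h


print(bucket_resize(1920, 1080))  # (1024, 576) under the assumed caps
```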
### Dataloader for generated data
- `src/data_generation/rgba_component_dataset.py` provides `RgbaComponentDataset` + `create_component_dataloader`.
```python
from src.data_generation import create_component_dataloader

train_loader = create_component_dataloader(
    root_dir="data/rgba_layers",
    split="train",
    batch_size=16,
    num_workers=4,
)
batch = next(iter(train_loader))
component = batch["component"]  # (B, 4, H, W)
composite = batch["composite"]  # (B, 4, H, W)
```
- Entry: `python scripts/train.py --config configs/default.yaml`
- Stages (to be implemented):
  - `rgba_vae`: shared latent VAE for RGB/RGBA
  - `decompose`: VLD-MMDiT variable-layer decomposition
  - `refine`: task-specific editing refinement
- Default config now consumes the bucketed dataset (`data/rgba_layers`), so each batch contains both a single component layer and the corresponding composite image.
- After every epoch, the RGBA-VAE stage runs validation on the bucketed `val` split:
  - Reconstructs composites, composites them over a white background, and reports mean PSNR (see the sketch after this list).
  - Saves a grid to `outputs/val/val_recon_epoch_{epoch}.png` for quick inspection.
- Training now uses `Accelerator` (bf16 mixed precision by default). Launch with the command below, adjusting the GPU list and process count for your setup:

```bash
CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=. /home/ubuntu/miniconda3/envs/training/bin/python -m accelerate.commands.launch \
    --num_processes=2 scripts/train.py --config configs/default.yaml
```
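A minimal sketch of the validation metric described above: alpha-composite an RGBA reconstruction over a white background and compute PSNR against the ground-truth composite treated the same way. The tensor layout and value range (RGBA in [0, 1]) are assumptions; this is not the code in `scripts/train.py`.

```python
import torch


def composite_over_white(rgba: torch.Tensor) -> torch.Tensor:
    """rgba: (B, 4, H, W) in [0, 1] -> RGB over a white background, (B, 3, H, W)."""
    rgb, alpha = rgba[:, :3], rgba[:, 3:4]
    return rgb * alpha + (1.0 - alpha)  # white background = 1.0


def psnr(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean PSNR over the batch, assuming inputs in [0, 1]."""
    mse = ((pred - target) ** 2).flatten(1).mean(dim=1)
    return (-10.0 * torch.log10(mse + eps)).mean()


# Example with random tensors standing in for reconstruction / ground truth.
recon, gt = torch.rand(2, 4, 64, 64), torch.rand(2, 4, 64, 64)
print(float(psnr(composite_over_white(recon), composite_over_white(gt))))
```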
- Fill in RGBA-VAE architecture and reconstruction losses.
- Add VLD-MMDiT decomposition head with order-aware DTW/layer-merging loss.
- Implement evaluation: RGB L1 (alpha-weighted), Alpha soft IoU, PSNR/SSIM/rFID/LPIPS for reconstruction, plus qualitative panels.
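A minimal sketch of two of the planned metrics, assuming predictions and targets are RGBA tensors in [0, 1]: alpha-weighted RGB L1 (colour error weighted by ground-truth opacity) and soft IoU on the alpha channels. These are illustrative definitions, not a settled evaluation protocol.

```python
import torch


def alpha_weighted_rgb_l1(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """L1 on RGB, weighted by ground-truth alpha so fully transparent pixels do not count."""
    weight = target[:, 3:4]
    diff = (pred[:, :3] - target[:, :3]).abs() * weight
    return diff.sum() / (weight.sum() * 3 + eps)


def alpha_soft_iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft IoU between predicted and ground-truth alpha mattes."""
    p, t = pred[:, 3], target[:, 3]
    intersection = torch.minimum(p, t).sum(dim=(1, 2))
    union = torch.maximum(p, t).sum(dim=(1, 2))
    return ((intersection + eps) / (union + eps)).mean()


pred, target = torch.rand(2, 4, 64, 64), torch.rand(2, 4, 64, 64)
print(float(alpha_weighted_rgb_l1(pred, target)), float(alpha_soft_iou(pred, target)))
```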
- Inspired by AlphaVAE and Qwen-Image-Layered §3.1: first encoder conv and last decoder conv are widened to four channels.
- Use the provided conversion script to adapt the official Qwen-Image VAE (any Hugging Face repo or local directory containing the original RGB VAE works):

```bash
PYTHONPATH=. python scripts/convert_qwen_vae_to_rgba.py \
    --source Qwen/Qwen-Image-1.0 \
    --subfolder vae \
    --output-dir checkpoints/rgba_vae_init
```

- Set `model.rgb_checkpoint` to the converted weights; the alpha-channel path starts from zeros (bias via `alpha_bias_init`).
- RGB inputs automatically receive α=1 during training so both RGB/RGBA samples share the latent space.
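A minimal sketch of the channel-widening idea behind the conversion (not the script itself): copy the pretrained 3-channel weights into a 4-channel conv and zero-initialise the new alpha slice, so the converted model initially reproduces the RGB behaviour.

```python
import torch
import torch.nn as nn


def widen_first_conv(rgb_conv: nn.Conv2d) -> nn.Conv2d:
    """Return a conv taking RGBA input whose RGB path matches the pretrained RGB conv."""
    rgba_conv = nn.Conv2d(
        4, rgb_conv.out_channels,
        kernel_size=rgb_conv.kernel_size, stride=rgb_conv.stride, padding=rgb_conv.padding,
    )
    with torch.no_grad():
        rgba_conv.weight.zero_()
        rgba_conv.weight[:, :3] = rgb_conv.weight  # reuse pretrained RGB weights
        # Alpha slice stays zero; the repo's alpha_bias_init knob is not replicated here.
        rgba_conv.bias.copy_(rgb_conv.bias)
    return rgba_conv


pretrained = nn.Conv2d(3, 128, kernel_size=3, padding=1)
widened = widen_first_conv(pretrained)
x = torch.rand(1, 4, 32, 32)
# With the alpha weights at zero, the output depends only on the RGB channels.
print(torch.allclose(widened(x), pretrained(x[:, :3]), atol=1e-6))
```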
- RGBA-VAE: shared latent space for RGB/RGBA, evaluate on AIM-500-style reconstruction.
- Variable Layers Decomposition (VLD-MMDiT): support variable-length layer outputs, including order-aware training.
- Multi-stage training: curriculum from text-to-image init → decomposition fine-tuning → task-specific refinement.
- Evaluation: Crello-style protocol (order-aware DTW, layer merging), plus in-house metrics on the local dataset.
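A minimal sketch of order-aware layer matching with dynamic time warping, in the spirit of the Crello-style protocol: align a predicted layer sequence to the ground truth while preserving order, using a per-pair cost such as mean L1. The cost function and the absence of layer merging are simplifications.

```python
import torch


def layer_cost(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Per-pair cost between two RGBA layers (4, H, W); mean L1 as a stand-in."""
    return float((pred - gt).abs().mean())


def dtw_align(pred_layers: list[torch.Tensor], gt_layers: list[torch.Tensor]) -> float:
    """Order-preserving DTW alignment cost between two variable-length layer sequences."""
    n, m = len(pred_layers), len(gt_layers)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = layer_cost(pred_layers[i - 1], gt_layers[j - 1])
            dp[i][j] = c + min(dp[i - 1][j - 1],  # match
                               dp[i - 1][j],      # extra predicted layer
                               dp[i][j - 1])      # missing predicted layer
    return dp[n][m]


pred = [torch.rand(4, 32, 32) for _ in range(3)]
gt = [torch.rand(4, 32, 32) for _ in range(4)]
print(dtw_align(pred, gt))
```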
- Create and activate a venv: `python -m venv .venv && source .venv/bin/activate`
- Install deps (to be pinned in `requirements.txt` soon): `pip install torch diffusers transformers accelerate scipy numpy pillow matplotlib pyyaml tqdm`
- Start exploring data: copy the existing visualization patterns into `notebooks/` and point them to the dataset paths above.