Currently, everything assumes patch size p = 1.
Previously, the literature suggested two things:
- p = 1 is significantly better than larger patch sizes
- pixel-space diffusion is impossible
I've personally found that neither of these is necessarily true; both were likely a consequence of poor learned positional encodings. With RoPE applied properly, you can train with larger patch sizes and minimal issues. To this end, I want us to do 16x16 latents with c32 at p = 2.
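For concreteness, here is a minimal patchify sketch of what p = 2 means for our shapes (assuming PyTorch + einops; the function name and tensor layout are illustrative, not our actual code): each 2x2 patch of the 16x16, 32-channel latent folds into one token, giving an 8x8 grid of 64 tokens of dim 32 * 2 * 2 = 128.

```python
import torch
from einops import rearrange

def patchify(latents: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Fold each p x p latent patch into the channel dim, one token per patch."""
    # (B, C, H, W) -> (B, (H/p) * (W/p), C * p * p)
    return rearrange(latents, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=p, p2=p)

# 16x16 latents with c32 at p = 2 -> 8x8 = 64 tokens of dim 128
latents = torch.randn(1, 32, 16, 16)
tokens = patchify(latents)
print(tokens.shape)  # torch.Size([1, 64, 128])
```

Each token then keeps an explicit (row, col) index on the 8x8 grid, so 2D RoPE can be applied per axis instead of relying on learned positional embeddings.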