An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • Paper • 2406.09415 • Published Jun 13, 2024 • 50
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion • Paper • 2406.04338 • Published Jun 6, 2024 • 34
Byte Latent Transformer: Patches Scale Better Than Tokens • Paper • 2412.09871 • Published 30 days ago • 85
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining • Paper • 2501.00958 • Published 10 days ago • 91
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper • 2501.03895 • Published 4 days ago • 40
The GAN is dead; long live the GAN! A Modern GAN Baseline • Paper • 2501.05441 • Published 2 days ago • 45