This collection brings together the highest-signal research papers in modern AI, from the invention of the Transformer to the frontier work of 2024–2025, into a single, curated map of the field. Its purpose is to give researchers, engineers, and newcomers a clear understanding of the ideas that power today’s leading models: large-scale training principles, modern optimization methods like DPO and GRPO, efficient architectures such as FlashAttention and Mamba, next-generation generative models like Flow Matching, and the rise of agentic, system-2 reasoning.
Rather than navigating thousands of scattered sources, readers can use this list to quickly grasp how foundational breakthroughs connect to state-of-the-art systems like GPT-4+, DeepSeek-R1, Llama 3, Sora, Kimi K2, and Flux. It serves as a compact, evolving reference that traces the arc of modern AI from first principles to emergent reasoning and long-horizon agentic behavior—providing the essential conceptual building blocks in one accessible place.
| Name | Comprehensive Description | Link |
|---|---|---|
| Attention Is All You Need (2017) | The paper that introduced the Transformer, replacing recurrence and convolution with self-attention. It became the foundation of all modern LLMs, diffusion transformers, and multimodal models. Practically every architecture today is a scaling, refinement, or reinterpretation of this work. | Arxiv |
| Adam Optimizer (2014) | The default optimizer for deep learning. Introduced adaptive moment estimation, enabling stable training of large neural networks and serving as the base for AdamW and LAMB used in today’s foundation models. A minimal sketch of the update rule appears after the table. | Arxiv |
| The Bitter Lesson (2019) | Sutton's philosophical cornerstone: general methods that scale with computation ultimately outperform hand-engineered approaches. Forms the conceptual foundation for modern scaling laws, RL-based reasoning, and agentic training regimes. | Essay |
| Scaling Laws for Neural Language Models (GPT Scaling Laws, 2020) | Kaplan et al. empirically demonstrated predictable power-law relationships between compute, data, model size, and performance. Provided the blueprint for all subsequent model scaling efforts (GPT-3, Chinchilla, Llama 3, DeepSeek family). A toy sketch of the power-law form appears after the table. | Arxiv |
| An Image is Worth 16x16 Words: Vision Transformer (ViT, 2020) | Dosovitskiy et al. demonstrated that pure Transformers can achieve state-of-the-art image classification by treating image patches as tokens. ViT proved that self-attention scales effectively to vision tasks, establishing Transformers as the dominant architecture across modalities and inspiring DiT, CLIP, and modern multimodal models. | Arxiv |
| Welcome to the Era of Experience (2025) | Silver & Sutton's manifesto declaring a shift from data-driven to experience-driven AI. Argues that future systems will rely on agents learning from their own interactions, not human-curated corpora. Influences all work on agentic LLMs and system-2 reasoning. | |
| Kimi K2: Open Agentic Intelligence (2025) | Technical report for Moonshot’s trillion-parameter MoE model built explicitly for agentic behavior. Introduces architecture and training strategies tailored for slow thinking, tool use, and extended reasoning. A major milestone in agent LLM design. | Arxiv |
| DeepSeekMath (GRPO Source, 2024) | Introduced GRPO (Group Relative Policy Optimization)—a stable, compute-efficient RL algorithm for training reasoning models without a value function. GRPO became the backbone of DeepSeek R1 and other RL-trained reasoning LLMs. A sketch of the group-relative advantage appears after the table. | Arxiv |
| DeepSeek-R1 Technical Report (2025) | Landmark model demonstrating RL-first reasoning at scale. The R1-Zero variant showed that long-chain deliberate reasoning can emerge from RL alone, without supervised CoT, and R1 added only a small cold-start stage on top. Sparked the shift toward reinforcement-dominant training pipelines. | Arxiv |
| Tree of Thoughts (ToT, 2023) | Introduces search-over-thoughts: a structured reasoning approach that explores branching intermediate thoughts via tree search with self-evaluation, lookahead, and backtracking. Served as the conceptual precursor to many modern agentic planners and reasoning frameworks (MCTS-CoT, R1 search loops). | Arxiv |
| FLUX.1 Kontext & Flow Matching (2022–2024) | “Flow Matching for Generative Modeling” established a generative modeling paradigm that generalizes and simplifies classic diffusion training (a sketch of the objective appears after the table). Kontext extends this to in-context image generation, enabling editing, slot-filling, and consistent style/structure transformations. Forms the basis of FLUX.1 and SD3. | FM: Arxiv, Kontext: GitHub |
| Scalable Diffusion Models with Transformers (DiT, 2022) | Replaces U-Nets with pure Transformers for diffusion, enabling massive scaling in image/video generation (Sora, SD3, Flux, Lumina). Established the Diffusion Transformer as the dominant image backbone. | Arxiv |
| Patch n' Pack: NaViT (2023) | Introduced Native Resolution Vision Transformer (NaViT), enabling efficient processing of images at any aspect ratio and resolution through patch packing. NaViT eliminates the need for fixed-size inputs, dramatically improving efficiency and enabling better handling of diverse image formats in production vision systems. | Arxiv |
| Direct Preference Optimization (DPO, 2023) | Made RLHF-style preference alignment dramatically simpler and more stable by removing the need for a separate reward model: the policy is optimized directly on preference pairs against a frozen reference model (a sketch of the loss appears after the table). Became the standard for preference training and alignment across open-source and commercial models. | Arxiv |
| Mamba 1 (2023): Selective State Spaces | A state-space model with linear-time sequence processing that rivals transformer performance. Enables extremely long context windows with lower compute. Sparked renewed interest in RNN-like architectures. | Arxiv |
| Mamba-2 / Transformers are SSMs (2024) | Unified the Transformer and SSM views through Structured State Space Duality. Introduced GPU-efficient kernels and improvements that make Mamba architectures more scalable and competitive with LLM-scale training. | Arxiv |
| GQA: Grouped-Query Attention (2023) | Efficient attention design used in Llama 2/3, DeepSeek, and many optimized inference stacks. Balances the speed and small KV cache of MQA with the quality of MHA by letting groups of query heads share key/value heads (a shape-level sketch appears after the table). | Arxiv |
| FlashAttention 1 & 2 (2022–2023) | A high-performance attention kernel that minimizes memory reads/writes. FlashAttention enabled training very large models efficiently, making long-context attention feasible and becoming the standard kernel for LLMs. | FA1: Arxiv, FA2: Arxiv |
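A few of the entries above hinge on a specific update rule or objective. The short sketches below are minimal, illustrative reconstructions in plain numpy written for this list, not reference implementations from the papers; any names or constants not taken from a paper are assumptions. First, the Adam update: exponential moving averages of the gradient and its square, with bias correction (defaults as in the paper).

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a single parameter tensor (illustrative sketch, t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

AdamW, the variant used in most modern foundation-model training, differs mainly in applying weight decay directly to the parameters instead of folding it into the gradient.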
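The GPT scaling-laws row describes a power-law fit; in its simplest (parameter-count) form it is one line of code. The constants below are roughly the values reported by Kaplan et al. and are shown only to illustrate the shape of the curve, not as a usable fit.

```python
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Kaplan-style power law in non-embedding parameter count: L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

# Doubling parameters multiplies predicted loss by 2 ** -0.076 (roughly 5% lower).
for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  predicted loss ~ {predicted_loss(n):.3f}")
```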
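GRPO's central simplification is the baseline: instead of a learned value function, each sampled completion is scored against the mean and standard deviation of its own sampling group. A minimal sketch of that advantage computation follows; the full objective also uses a PPO-style clipped ratio and a KL penalty, omitted here.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages for one prompt: no critic needed,
    each completion is judged against its own sampling group."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 sampled completions for one prompt, scored 1/0 by a verifier
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # correct answers get positive advantage
```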
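The flow-matching objective behind FLUX.1 and SD3 fits in a few lines under a straight-line (rectified-flow style) path: interpolate noise toward data and regress the network's predicted velocity onto the constant target `x1 - x0`. The `model` callable below is a stand-in assumption, not an API from the paper.

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Conditional flow matching with straight-line paths;
    model(x_t, t) is any velocity-prediction network (assumed here)."""
    x0 = rng.standard_normal(x1.shape)           # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))       # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1                   # point on the straight path
    target = x1 - x0                             # velocity of that path
    return np.mean((model(xt, t) - target) ** 2)

def zero_velocity_model(xt, t):
    """Dummy stand-in: always predicts zero velocity."""
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
print(flow_matching_loss(zero_velocity_model, rng.standard_normal((8, 4)), rng))
```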
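DPO removes the reward model by optimizing the policy directly on preference pairs; the loss needs only per-sequence log-probabilities from the policy and a frozen reference model. A minimal sketch, with `beta` in the small range the paper uses:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * ((logpi - logpi_ref)_chosen - (logpi - logpi_ref)_rejected)).
    Inputs are summed per-sequence log-probs; no reward model is involved."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return np.mean(np.logaddexp(0.0, -logits))   # -log sigmoid(logits), numerically stable

# toy batch: the policy favors the chosen responses a bit more than the reference does
print(dpo_loss(np.array([-12.0, -8.0]), np.array([-15.0, -9.0]),
               np.array([-13.0, -8.5]), np.array([-14.0, -8.8])))
```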
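Finally, grouped-query attention: query heads are split into groups that share a single key/value head, so the KV cache scales with the number of KV heads rather than query heads (one KV head recovers MQA; equal counts recover standard MHA). A shape-level sketch, not an optimized kernel:

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), with n_q_heads
    a multiple of n_kv_heads. Each KV head serves one group of query heads."""
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)              # share each KV head across its query group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over key positions
    return w @ v                                 # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
out = grouped_query_attention(rng.standard_normal((8, 16, 64)),   # 8 query heads
                              rng.standard_normal((2, 16, 64)),   # 2 shared KV heads
                              rng.standard_normal((2, 16, 64)))
print(out.shape)   # (8, 16, 64)
```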
Join the Swarms Discord server, a cozy community of 14,000+ researchers and practitioners focused on frontier research. Our community explores:
- Multi-modality models — Vision-language models, audio-visual systems, and unified architectures
- Continual learning — Methods for models that adapt and learn continuously without catastrophic forgetting
- Multi-agent collaboration — Swarm intelligence, agent coordination, and distributed AI systems
Apache License Version 2.0, January 2004