This repository contains my paper reading notes on deep learning and machine learning, inspired by Denny Britz and Daniel Takeshi. A minimalist webpage generated with GitHub Pages can be found here.
My name is Patrick Langechuan Liu. After about a decade of education and research in physics, I found my passion in deep learning and autonomous driving.
If you are new to deep learning in computer vision and don't know where to start, I suggest you spend your first month or so diving deep into this list of papers. I did so (see my notes) and it served me well.
Here is a list of trustworthy sources of papers in case I run out of papers to read.
I regularly update my blog column, the Thinking Car.
- A Crash Course of Planning for Perception Engineers in Autonomous Driving
 - BEV Perception in Mass Production Autonomous Driving
 - Challenges of Mass Production Autonomous Driving in China
 - Vision-centric Semantic Occupancy Prediction for Autonomous Driving (related paper notes)
 - Drivable Space in Autonomous Driving — The Industry
 - Drivable Space in Autonomous Driving — The Academia
 - Drivable Space in Autonomous Driving — The Concept
 - Monocular BEV Perception with Transformers in Autonomous Driving (related paper notes)
 - Illustrated Differences between MLP and Transformers for Tensor Reshaping in Deep Learning
 - Monocular 3D Lane Line Detection in Autonomous Driving (related paper notes)
 - Deep-Learning based Object detection in Crowded Scenes (related paper notes)
 - Monocular Bird’s-Eye-View Semantic Segmentation for Autonomous Driving (related paper notes)
 - Deep Learning in Mapping for Autonomous Driving
 - Monocular Dynamic Object SLAM in Autonomous Driving
 - Monocular 3D Object Detection in Autonomous Driving — A Review
 - Self-supervised Keypoint Learning — A Review
 - Single Stage Instance Segmentation — A Review
 - Self-paced Multitask Learning — A Review
 - Convolutional Neural Networks with Heterogeneous Metadata
 - Lifting 2D object detection to 3D in autonomous driving
 - Multimodal Regression
 
- CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [Notes] WACV 2025 [~80 h real-world driving videos with paired language and trajectory annotations; largest VLA-for-driving dataset]
 - SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment CVPR 2025 [1st Place @ CARLA Challenge 2024; SOTA on CARLA LB 2.0 and Bench2Drive; cited by AutoVLA]
 - AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning arXiv 2025-06 [Trajectory tokenization; dual-thinking (fast vs slow CoT) modes; GRPO fine-tuning; evaluated on nuPlan, nuScenes, Waymo, CARLA; cites SimLingo]
 - DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception arXiv 2025-07 [Hybrid Thinking (text vs tool based CoT) + Active Perception; 3-stage RL training; built on DriveVLM lineage]
 - VERDI: VLM-Embedded Reasoning for Autonomous Driving arXiv 2025-05 [Distills VLM reasoning into a modular AD stack; aligns perception, prediction, planning; improves nuScenes performance with no VLM inference cost]
 - Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Post-Training Enable Robust End-to-End Autonomous Driving arXiv 2025-06 [3B-param VLM trained on 83 h CoVLA + 11 h Waymo long-tail; RL fine-tuned (GRPO); 1st Place in Waymo Vision-Based E2E Driving Challenge (RFS=7.99)]
 - ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-Loop Autonomous Driving arXiv 2025-05 [Chain-of-Thought planning; significant closed-loop improvements on Bench2Drive]
 - DiffVLA: Vision-Language-Guided Diffusion Planning for Autonomous Driving arXiv 2025-05 [VLM-guided diffusion trajectory planning; top performance in Autonomous Grand Challenge 2025]
 - VLAD: A VLM-Augmented Autonomous Driving Framework ITSC 2025 [VLM generates high-level commands for E2E controller; enhances interpretability and planning safety]
 - DriveAction (benchmark): DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models arXiv 2025-06 [Action-rooted evaluation with QA pairs across driving scenarios; cited by VLA4AD survey; gaining traction]
 - DINOv3 [high-res DINO]
 - LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
 - DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation [dexterous hand data collection, Shuran Song, Jim Fan]
 - DEXOP: A Device for Robotic Transfer of Dexterous Human Manipulation [Best Paper Award @ RSS 2025]
 - HAD dataset: Grounding Human-to-Vehicle Advice for Self-driving Vehicles CVPR 2019 [John Canny, Honda Research Institute, VLA OG]
 
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [LeCun]
 - V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video ICLR 2025
 - I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture ICCV 2023
 - PlaNet: Learning Latent Dynamics for Planning from Pixels
 - DreamerV1: Dream to Control: Learning Behaviors by Latent Imagination
 - DreamerV2: Mastering Atari with Discrete World Models ICLR 2021
 - DreamerV3: Mastering Diverse Domains through World Models Nature 2025
 - DayDreamer: World Models for Physical Robot Learning CoRL 2022
 - Dynalang: Learning to Model the World with Language ICML 2024
 - Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving [Notes] [Marco Pavone, Nvidia]
 - SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation [Notes] ICRA 2025 [Horizon]
 - HE-Drive: Human-Like End-to-End Driving with Vision Language Models IROS 2025 [Horizon]
 - GPT-Driver: Learning to Drive with GPT [NeurIPS 2023, Hang Zhao]
 - Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving [Notes] ICRA 2024 [Wayve]
 - PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving CVPR 2024 [Marco Pavone, NVidia]
 - PDM-Closed: Parting with Misconceptions about Learning-based Vehicle Motion Planning [Notes] CoRL 2023
 - Ego-MLP: Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? CVPR 2024
 - AD-MLP: Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes [Baidu]
 - GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving [Wayve]
 - Cameras as Relative Positional Encoding
 
- Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments CVPR 2025
 - Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [Physical Intelligence]
 - Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback [Li Auto, RLHF]
 - TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference [Li Auto]
 - Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning
 - STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
 
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [Cruise]
 - GPD-1: Generative Pre-training for Driving [PhiGent]
 - Transformers Inference Optimization Toolset
 - Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces [Fei-Fei Li]
 - Probing the 3D Awareness of Visual Foundation Models CVPR 2024
 - iVideoGPT: Interactive VideoGPTs are Scalable World Models NeurIPS 2024
 - CarLLaVA: Vision language models for camera-only closed-loop driving [Wayve]
 - Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [DeepRoute]
 - LAW: Enhancing End-to-End Autonomous Driving with Latent World Model
 - TCP: Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline NeurIPS 2022 [E2E planning, Hongyang]
 - When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization
 - RoGs: Large Scale Road Surface Reconstruction with Meshgrid Gaussian
 - RoMe: Towards Large Scale Road Surface Reconstruction via Mesh Representation
 - SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic ECCV 2024
 - Lookahead: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding [specdec; see the draft-and-verify sketch after this list]
 - EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [specdec]
 - EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees [specdec]
 - Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
 - RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios ECCV 2024
 - MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
 - Open Sourcing π0 [PI, Industry]
 - Helix: A Vision-Language-Action Model for Generalist Humanoid Control [Figure, Industry]
 - AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One CVPR 2024
 - Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
 - MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
 - WORLDMEM: Long-term Consistent World Simulation with Memory [long term memory]
 - PADriver: Towards Personalized Autonomous Driving [megvii, personalized driving]
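
Since several speculative-decoding papers are listed above (Lookahead, EAGLE, EAGLE-2, Medusa), here is a minimal sketch of the draft-and-verify loop they share. It uses greedy acceptance purely for illustration; the actual methods verify draft probabilities with rejection sampling or tree-structured drafts, and `target_next` / `draft_next` are hypothetical stand-ins for real model calls.

```python
# A minimal sketch of speculative decoding with greedy acceptance (an assumption,
# not the exact scheme of Lookahead/EAGLE/Medusa): a cheap draft model proposes k
# tokens, the target model verifies them, and the longest agreeing prefix is kept.
from typing import Callable, List


def speculative_decode(target_next: Callable[[List[int]], int],
                       draft_next: Callable[[List[int]], int],
                       prompt: List[int],
                       k: int = 4,
                       max_new_tokens: int = 32) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: accept the longest prefix the target model agrees with.
        accepted = 0
        for i, t in enumerate(draft):
            if target_next(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3) Always emit one target-model token so progress is guaranteed
        #    (a correction on rejection, a bonus token if all drafts passed).
        tokens.append(target_next(tokens))
    return tokens[:len(prompt) + max_new_tokens]
```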
 
- On the Opportunities and Risks of Foundation Models [Notes]
 - π0: A Vision-Language-Action Flow Model for General Robot Control [Physical Intelligence, VLA]
 - EMMA: End-to-End Multimodal Model for Autonomous Driving [Waymo, VLA]
 - Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data CVPR 2024
 - Depth Anything V2 NeurIPS 2024
 - CarLLaVA: Vision language models for camera-only closed-loop driving
 - LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias [Scene tokenization]
 - NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking NeurIPS 2024
 - Driving Everywhere with Large Language Model Policy Adaptation CVPR 2024 [Marco Pavone]
 - Consistency Models [diffusion speedup, OpenAI, Yang Song]
 - VILA: On Pre-training for Visual Language Models CVPR 2024 [Song Han, Yao Lu]
 
- LINGO-1: Exploring Natural Language for Autonomous Driving [Notes] [Wayve, open-loop world model]
 - LINGO-2: Driving with Natural Language [Notes] [Wayve, closed-loop world model]
 - OpenVLA: An Open-Source Vision-Language-Action Model [open source RT-2]
 - Parting with Misconceptions about Learning-based Vehicle Motion Planning CoRL 2023 [Simple non-learning based baseline]
 - QuAD: Query-based Interpretable Neural Motion Planning for Autonomous Driving [Waabi]
 - MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving [Notes] ICRA 2015 [Behavior planning, UMich, May Mobility]
 - MPDM2: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction [Notes] RSS 2015 [Behavior planning]
 - MPDM3: Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment RSS 2017 [Behavior planning]
 - EUDM: Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching [Notes] ICRA 2020 [Wenchao Ding, Shaojie Shen, Behavior planning]
 - TPP: Tree-structured Policy Planning with Learned Behavior Models ICRA 2023 [Marco Pavone, Nvidia, Behavior planning]
 - MARC: Multipolicy and Risk-aware Contingency Planning for Autonomous Driving [Notes] RAL 2023 [Shaojie Shen, Behavior planning]
 - EPSILON: An Efficient Planning System for Automated Vehicles in Highly Interactive Environments TRO 2021 [Wenchao Ding, encyclopedia of pnc]
 - trajdata: A Unified Interface to Multiple Human Trajectory Datasets NeurIPS 2023 [Marco Pavone, Nvidia]
 - Optimal Vehicle Trajectory Planning for Static Obstacle Avoidance using Nonlinear Optimization [Xpeng]
 - Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles [Notes] IROS 2019 Oral [Uber ATG, behavioral planning, motion planning]
 - Enhancing End-to-End Autonomous Driving with Latent World Model
 - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [Jiwen Lu]
 - RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision ICRA 2024
 - EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [Sanja, Marco, NV]
 - FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation
 - Trajeglish: Traffic Modeling as Next-Token Prediction ICLR 2024
 - Autonomous Driving Strategies at Intersections: Scenarios, State-of-the-Art, and Future Outlooks ITSC 2021
 - Learning-Based Approach for Online Lane Change Intention Prediction IV 2013 [SVM, LC intention prediction]
 - Traffic Flow-Based Crowdsourced Mapping in Complex Urban Scenario RAL 2023 [Wenchao Ding, Huawei, crowdsourced map]
 - FlowMap: Path Generation for Automated Vehicles in Open Space Using Traffic Flow ICRA 2023
 - Hybrid A-star: Path Planning for Autonomous Vehicles in Unknown Semi-structured Environments IJRR 2010 [Dolgov, Thrun, Searching]
 - Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame ICRA 2010 [Werling, Thrun, Sampling] [MUST READ for planning folks; see the Frenet sampling sketch after this list]
 - Autonomous Driving on Curvy Roads Without Reliance on Frenet Frame: A Cartesian-Based Trajectory Planning Method TITS 2022
 - Baidu Apollo EM Motion Planner [Notes] [Optimization]
 - Spatio-Temporal Joint Planning Method for Intelligent Vehicles Based on Improved Hybrid A* (基于改进混合A*的智能汽车时空联合规划方法), 汽车工程 (Automotive Engineering) 2023 [Joint optimization, search]
 - Enable Faster and Smoother Spatio-temporal Trajectory Planning for Autonomous Vehicles in Constrained Dynamic Environment JAE 2020 [Joint optimization, search]
 - Focused Trajectory Planning for Autonomous On-Road Driving IV 2013 [Joint optimization, Iteration]
 - SSC: Safe Trajectory Generation for Complex Urban Environments Using Spatio-Temporal Semantic Corridor RAL 2019 [Joint optimization, SSC, Wenchao Ding, Motion planning]
 - AlphaGo: Mastering the game of Go with deep neural networks and tree search [Notes] Nature 2016 [DeepMind, MCTS]
 - AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play Science 2018 [DeepMind]
 - MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model Nature 2020 [DeepMind]
 - Grandmaster-Level Chess Without Search [DeepMind]
 - Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving [Mobileye, desires and trajectory optimization]
 - Comprehensive Reactive Safety: No Need For A Trajectory If You Have A Strategy IROS 2022 [Da Fang, Qcraft]
 - BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning AAAI 2024
 - LLM-MCTS: Large Language Models as Commonsense Knowledge for Large-Scale Task Planning NeurIPS 2023
 - HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction CVPR 2022 [Zikang Zhou, agent-centric, motion prediction]
 - QCNet: Query-Centric Trajectory Prediction [Notes] CVPR 2023 [Zikang Zhou, scene-centric, motion prediction]
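
As flagged in the Werling et al. entry above, here is a minimal sketch of the lateral part of Frenet-frame trajectory sampling: fit quintic polynomials for the lateral offset d(t) over a grid of terminal offsets and horizons, then rank candidates by a weighted jerk/time/deviation cost. The boundary-condition solve is the standard quintic fit; the sampling grid and cost weights below are illustrative assumptions, not the paper's values.

```python
# Minimal sketch of lateral trajectory sampling in the Frenet frame.
# Target offsets, horizons, and cost weights are illustrative assumptions.
import numpy as np


def quintic_coeffs(d0, d0_dot, d0_ddot, dT, dT_dot, dT_ddot, T):
    """Coefficients of d(t) = sum(c_i * t^i), i = 0..5, matching start/end
    position, velocity, and acceleration of the lateral offset over horizon T."""
    A = np.array([[T**3,   T**4,    T**5],
                  [3*T**2, 4*T**3,  5*T**4],
                  [6*T,    12*T**2, 20*T**3]])
    b = np.array([dT - (d0 + d0_dot*T + 0.5*d0_ddot*T**2),
                  dT_dot - (d0_dot + d0_ddot*T),
                  dT_ddot - d0_ddot])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([d0, d0_dot, 0.5*d0_ddot, c3, c4, c5])


def jerk_cost(c, T, n=50):
    """Integral of squared jerk (third derivative of d) over [0, T]."""
    t = np.linspace(0.0, T, n)
    jerk = 6*c[3] + 24*c[4]*t + 60*c[5]*t**2
    return np.trapz(jerk**2, t)


def sample_lateral_trajectories(d0, d0_dot, d0_ddot,
                                target_offsets=(-1.0, 0.0, 1.0),
                                horizons=(2.0, 3.0, 4.0),
                                k_jerk=1.0, k_time=0.1, k_offset=1.0):
    """Enumerate (terminal offset, horizon) pairs and rank them by a weighted
    cost of comfort (jerk), efficiency (time), and deviation from lane center."""
    candidates = []
    for dT in target_offsets:
        for T in horizons:
            c = quintic_coeffs(d0, d0_dot, d0_ddot, dT, 0.0, 0.0, T)
            cost = k_jerk*jerk_cost(c, T) + k_time*T + k_offset*dT**2
            candidates.append((cost, dT, T, c))
    return sorted(candidates, key=lambda x: x[0])


if __name__ == "__main__":
    cost, dT, T, _ = sample_lateral_trajectories(d0=0.5, d0_dot=0.0, d0_ddot=0.0)[0]
    print(f"best candidate: cost={cost:.3f}, offset={dT:.1f} m, horizon={T:.1f} s")
```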
 
- Genie: Generative Interactive Environments [Notes] [DeepMind, World Model]
 - DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [Notes] [Jiwen Lu, World Model]
 - WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens [Notes] [Jiwen Lu, World Model]
 - VideoPoet: A Large Language Model for Zero-Shot Video Generation [Like sora, but LLM, NOT world model]
 - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [Notes] CVPR 2023 [Sanja, Nvidia, VideoLDM, Video prediction]
 - Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos NeurIPS 2022 [Notes] [OpenAI]
 - MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge NeurIPS 2022 [NVidia, Outstanding paper award]
 - Humanoid Locomotion as Next Token Prediction [Notes] [Berkeley, EAI]
 - RPT: Robot Learning with Sensorimotor Pre-training [Notes] CoRL 2023 Oral [Berkeley, EAI]
 - MVP: Real-World Robot Learning with Masked Visual Pre-training [Notes] CoRL 2022 [Berkeley, EAI]
 - BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning [Notes] CoRL 2021 [Eric Jang, 1X]
 - GenAD: Generalized Predictive Model for Autonomous Driving [Notes] CVPR 2024
 - HG-DAgger: Interactive Imitation Learning with Human Experts [DAgger]
 - DriveGAN: Towards a Controllable High-Quality Neural Simulation [Notes] CVPR 2021 oral [Nvidia, Sanja]
 - VideoGPT: Video Generation using VQ-VAE and Transformers [Notes] [Pieter Abbeel]
 - LLM, Vision Tokenizer and Vision Intelligence, by Lu Jiang [Notes] [Interview Lu Jiang]
 - AV2.0: Reimagining an autonomous vehicle [Notes] [Wayve, Alex Kendall]
 - Simulation for E2E AD [Wayve, Tech Sharing, E2E]
 - E2E lateral planning [Comma.ai, E2E planning]
 - Learning and Leveraging World Models in Visual Representation Learning [LeCun, JEPA series]
 - LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models [Large Vision Models, Jitendra Malik]
 - LWM: World Model on Million-Length Video And Language With RingAttention [Pieter Abbeel]
 - OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [Jiwen Lu, World Model]
 - GenAD: Generative End-to-End Autonomous Driving
 - Transfuser: Multi-Modal Fusion Transformer for End-to-End Autonomous Driving CVPR 2021 [E2E planning, Geiger]
 - Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving [Wayve, LLM + AD]
 - LingoQA: Video Question Answering for Autonomous Driving [Wayve, LLM + AD]
 - Panacea: Panoramic and Controllable Video Generation for Autonomous Driving CVPR 2024 [Megvii]
 - PlanT: Explainable Planning Transformers via Object-Level Representations CoRL 2022
 - Scene as Occupancy ICCV 2023
 - The Shift from Models to Compound AI Systems
 - Roach: End-to-End Urban Driving by Imitating a Reinforcement Learning Coach ICCV 2021
 - LBC: Learning by Cheating CoRL 2019
 - Learning to drive from a world on rails ICCV 2021 oral [Philipp Krähenbühl]
 - Learning from All Vehicles CVPR 2022 [Philipp Krähenbühl]
 - VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning [Horizon]
 - VQ-VAE: Neural Discrete Representation Learning NeurIPS 2017 [Image Tokenizer; see the quantization sketch after this list]
 - VQ-GAN: Taming Transformers for High-Resolution Image Synthesis CVPR 2021 [Image Tokenizer]
 - ViT-VQGAN: Vector-quantized Image Modeling with Improved VQGAN ICLR 2022 [Image Tokenizer]
 - MaskGIT: Masked Generative Image Transformer CVPR 2022 [LLM, non-autoregressive]
 - MAGVIT: Masked Generative Video Transformer CVPR 2023 highlight [Video Tokenizer]
 - MAGVIT-v2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation ICLR 2024 [Video Tokenizer]
 - Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models [Reverse Engineering of Sora]
 - GLaM: Efficient Scaling of Language Models with Mixture-of-Experts ICML 2022 [MoE, LLM]
 - Lifelong Language Pretraining with Distribution-Specialized Experts ICML 2023 [MoE, LLM]
 - DriveLM: Drive on Language [Hongyang Li]
 - MotionLM: Multi-Agent Motion Forecasting as Language Modeling ICCV 2023 [Waymo, LLM + AD]
 - Cube-LLM: Language-Image Models with 3D Understanding [align 2D/3D with language]
 - EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision ICLR 2024
 - A Language Agent for Autonomous Driving
 - Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning CVPR 2018
 - DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
 - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving CVPR 2024 [Zheng Zhu]
 - Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond [Zheng Zhu]
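
Because discrete image/video tokenizers (VQ-VAE, VQ-GAN, ViT-VQGAN, MaskGIT, MAGVIT) recur throughout this list, here is a minimal sketch of the vector-quantization step they build on: nearest-codebook lookup with a straight-through gradient plus the usual codebook and commitment losses. Tensor shapes, the codebook size, and the commitment weight 0.25 are common defaults used here as assumptions.

```python
# Minimal sketch of VQ-VAE-style vector quantization with a straight-through
# estimator. Shapes and the commitment weight are illustrative assumptions.
import torch
import torch.nn.functional as F


def vector_quantize(z, codebook):
    """z: (B, N, D) encoder features; codebook: (K, D) learned embeddings.
    Returns quantized features, discrete token ids, and the VQ training loss."""
    B, N, D = z.shape
    flat = z.reshape(-1, D)                                   # (B*N, D)
    # Squared L2 distance to every codebook entry, then nearest-neighbor lookup.
    dists = (flat.pow(2).sum(1, keepdim=True)
             - 2.0 * flat @ codebook.t()
             + codebook.pow(2).sum(1))                        # (B*N, K)
    ids = dists.argmin(dim=1)                                 # discrete tokens
    z_q = codebook[ids].reshape(B, N, D)
    # Codebook loss pulls embeddings toward encoder outputs; the commitment loss
    # (0.25 is a common default weight) keeps the encoder close to its codes.
    vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
    # Straight-through estimator: forward pass uses z_q, gradients flow to z.
    z_q = z + (z_q - z).detach()
    return z_q, ids.reshape(B, N), vq_loss


if __name__ == "__main__":
    codebook = torch.randn(512, 64)               # K=512 codes of dimension 64
    z = torch.randn(2, 16, 64, requires_grad=True)
    z_q, ids, loss = vector_quantize(z, codebook)
    print(z_q.shape, ids.shape, loss.item())
```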
 
- End-to-end Autonomous Driving: Challenges and Frontiers [Notes] [Hongyang Li, Shanghai AI labs]
 - DriveVLM: The convergence of Autonomous Driving and Large Vision-Language Models [Notes] [Hang Zhao]
 - DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [Notes] [HKU]
 - GAIA-1: A Generative World Model for Autonomous Driving [Notes] [Wayve, vision foundation model]
 - ADriver-I: A General World Model for Autonomous Driving [Notes] [Megvii, Xiangyu]
 - Drive-WM: Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [Notes]
 - X [Notes] [E2E planning]
 
- ChatGPT for Robotics: Design Principles and Model Abilities [Notes] [Microsoft, LLM for robotics]
 - RoboVQA: Multimodal Long-Horizon Reasoning for Robotics [Notes] [Google DeepMind, LLM for robotics]
 - ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application [Microsoft Robotics]
 - GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration [Notes] [LLM for robotics, Microsoft Robotics]
 - LLM-Brain: LLM as A Robotic Brain: Unifying Egocentric Memory and Control [Notes]
 - Voyager: An Open-Ended Embodied Agent with Large Language Models [Notes] [Reasoning Critique, Linxi Jim Fan]
 
- RetNet: Retentive Network: A Successor to Transformer for Large Language Models [Notes] [MSRA]
 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [Notes] ICML 2020 [Linear attention]
 - AFT: An Attention Free Transformer [Notes] [Apple]
 
- RT-1: Robotics Transformer for Real-World Control at Scale [Notes] [DeepMind]
 - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [Notes] [DeepMind, end-to-end visuomotor]
 - RWKV: Reinventing RNNs for the Transformer Era [Notes]
 
- MILE: Model-Based Imitation Learning for Urban Driving [Notes] NeurIPS 2022 [Alex Kendall]
 - PaLM-E: An embodied multimodal language model [Notes] [Google Robotics]
 - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [Notes] [Fei-Fei Li]
 - CaP: Code as Policies: Language Model Programs for Embodied Control [Notes] [Project]
 - ProgPrompt: Generating Situated Robot Task Plans using Large Language Models ICRA 2023
 - TidyBot: Personalized Robot Assistance with Large Language Models [Notes] [Project]
 - SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [Notes] [Project]
 
- End-to-end review by Shanghai AI Labs
 - Pix2seq v2: A Unified Sequence Interface for Vision Tasks [Notes] NeurIPS 2022 [Geoffrey Hinton]
 - 🦩 Flamingo: a Visual Language Model for Few-Shot Learning [Notes] NeurIPS 2022 [DeepMind]
 - 😼 Gato: A Generalist Agent [Notes] TMLR 2022 [DeepMind]
 - BC-SAC: Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios [Notes] NeurIPS 2022 [Waymo]
 - MGAIL-AD: Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving [Notes] IROS 2022 [Waymo]
 
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [Notes] [Occupancy Network, Wei Yi, Jiwen Lu]
 - Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving [Notes] [Occupancy Network, Zhao Hang]
 - Occupancy Networks: Learning 3D Reconstruction in Function Space CVPR 2019 [Notes] [Andreas Geiger]
 - OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction [Occupancy Network, PhiGent]
 - Pix2seq: A Language Modeling Framework for Object Detection [Notes] ICLR 2022 [Geoffrey Hinton]
 - VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [Notes] [Jifeng Dai]
 - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Notes]
 
- UniAD: Planning-oriented Autonomous Driving [Notes] CVPR 2023 best paper [BEV, e2e, Hongyang Li]
 
- GPT-4 Technical Report [Notes] [OpenAI, GPT]
 - OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception [Notes] [Occupancy Network, Jiwen Lu]
 - VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion [Note] CVPR 2023 highlight [Occupancy Network, Nvidia]
 - MonoScene: Monocular 3D Semantic Scene Completion CVPR 2022 [Notes] [Occupancy Network, single cam]
 - CoReNet: Coherent 3D scene reconstruction from a single RGB image [Notes] ECCV 2020 oral
 
- Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning [Notes] [Epoch.ai industry report]
 - Codex: Evaluating Large Language Models Trained on Code [Notes] [GPT, OpenAI]
 - InstructGPT: Training language models to follow instructions with human feedback [Notes] [GPT, OpenAI]
 - TPVFormer: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction [Notes] CVPR 2023 [Occupancy Network, Jiwen Lu]
 
- PPGeo: Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [Notes] ICLR 2023
 - nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles [Notes]
 
- Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe [Notes] [PJLab]
 
- ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries [Notes] [BEV, perception + prediction, Hang Zhao]
 - MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction [Notes] [Horizon, BEVNet]
 - StopNet: Scalable Trajectory and Occupancy Prediction for Urban Autonomous Driving ICRA 2022
 - MOTR: End-to-End Multiple-Object Tracking with Transformer ECCV 2022 [Megvii, MOT]
 - Anchor DETR: Query Design for Transformer-Based Object Detection [Notes] AAAI 2022 [Megvii]
 
- HOME: Heatmap Output for future Motion Estimation [Notes] ITSC 2021 [behavior prediction, Huawei Paris]
 
- PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark [Notes] [BEVNet, lane line]
 - VectorMapNet: End-to-end Vectorized HD Map Learning [Notes] [BEVNet, LLD, Hang Zhao]
 - PETR: Position Embedding Transformation for Multi-View 3D Object Detection [Notes] ECCV 2022 [BEVNet]
 - PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images [Notes] [BEVNet, MegVii]
 - M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [Notes] [BEVNet, nvidia]
 - BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection [Notes] [BEVNet, NuScenes SOTA, Megvii]
 - CVT: Cross-view Transformers for real-time Map-view Semantic Segmentation [Notes] CVPR 2022 oral [UTAustin, Philipp]
 - Wayformer: Motion Forecasting via Simple & Efficient Attention Networks [Notes] [Behavior prediction, Waymo]
 
- BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection [Notes] [BEVNet]
 - BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [Notes] [Jiwen Lu, BEVNet, perception + prediction]
 - BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [Notes] [BEVNet, Song Han]
 
- BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers [Notes] ECCV 2022 [BEVNet, Hongyang Li, Jifeng Dai]
 
- TNT: Target-driveN Trajectory Prediction [Notes] CoRL 2020 [prediction, Waymo, Hang Zhao]
 - DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets [Notes] ICCV 2021 [prediction, Waymo, 1st place winner WOMD]
 
- Manydepth: The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth [Notes] CVPR 2021 [monodepth, Niantic]
 - DEKR: Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression [Notes] CVPR 2021
 
- BN-FFN-BN: Leveraging Batch Normalization for Vision Transformers [Notes] ICCVW 2021 [BN, transformers]
 - PowerNorm: Rethinking Batch Normalization in Transformers [Notes] ICML 2020 [BN, transformers]
 - MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction [Notes] ICRA 2022 [Waymo, behavior prediction]
 - BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View [Notes]
 - Translating Images into Maps [Notes] ICRA 2022 [BEVNet, transformers]
 
- DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [Notes] CoRL 2021 [BEVNet, transformers]
 - Robust-CVD: Robust Consistent Video Depth Estimation CVPR 2021 oral [website]
 - MAE: Masked Autoencoders Are Scalable Vision Learners [Notes] [Kaiming He, unsupervised learning]
 - SimMIM: A Simple Framework for Masked Image Modeling [Notes] [MSRA, unsupervised learning, MAE]
 - iBOT: Image BERT Pre-Training with Online Tokenizer
 
- STSU: Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [Notes] ICCV 2021 [BEV feat stitching, Luc Van Gool]
 - PanopticBEV: Bird's-Eye-View Panoptic Segmentation Using Monocular Frontal View Images [Notes] RAL 2022 [BEVNet, vertical/horizontal features]
 - NEAT: Neural Attention Fields for End-to-End Autonomous Driving [Notes] ICCV 2021 [supplementary] [BEVNet]
 
- DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection? [Notes] ICCV 2021 [mono3D, Toyota]
 - EfficientDet: Scalable and Efficient Object Detection [Notes] CVPR 2020 [BiFPN, Tesla AI day]
 - PnPNet: End-to-End Perception and Prediction with Tracking in the Loop [Notes] CVPR 2020 [Uber ATG]
 - MP3: A Unified Model to Map, Perceive, Predict and Plan [Notes] CVPR 2021 [Uber, planning]
 - BEV-Net: Assessing Social Distancing Compliance by Joint People Localization and Geometric Reasoning [Notes] ICCV 2021 [BEVNet, surveillance]
 - LiDAR R-CNN: An Efficient and Universal 3D Object Detector [Notes] CVPR 2021 [TuSimple, Naiyan Wang]
 - Corner Cases for Visual Perception in Automated Driving: Some Guidance on Detection Approaches [Notes] [corner cases]
 - Systematization of Corner Cases for Visual Perception in Automated Driving [Notes] IV 2020 [corner cases]
 - An Application-Driven Conceptualization of Corner Cases for Perception in Highly Automated Driving [Notes] IV 2021 [corner cases]
 - PYVA: Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation [Notes] CVPR 2021 [Supplementary] [BEVNet]
 - YOLOF: You Only Look One-level Feature [Notes] CVPR 2021 [megvii]
 - Perceiving Humans: from Monocular 3D Localization to Social Distancing [Notes] TITS 2021 [monoloco++]
 - PifPaf: Composite Fields for Human Pose Estimation CVPR 2019
 - Bird's-Eye-View Panoptic Segmentation Using Monocular Frontal View Images [BEVNet]
 - TransformerFusion: Monocular RGB Scene Reconstruction using Transformers
 - Multi-Modal Fusion Transformer for End-to-End Autonomous Driving CVPR 2021
 - Conditional DETR for Fast Training Convergence
 - Probabilistic and Geometric Depth: Detecting Objects in Perspective CoRL 2021
 
- EgoNet: Exploring Intermediate Representation for Monocular Vehicle Pose Estimation [Notes] CVPR 2021 [mono3D]
 - MonoEF: Monocular 3D Object Detection: An Extrinsic Parameter Free Approach [Notes] CVPR 2021 [mono3D]
 - GAC: Ground-aware Monocular 3D Object Detection for Autonomous Driving [Notes] RAL 2021 [mono3D]
 - FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [Notes] NeurIPS 2020 [mono3D, senseTime]
 - GUPNet: Geometry Uncertainty Projection Network for Monocular 3D Object Detection [Notes] ICCV 2021 [mono3D, Wanli Ouyang]
 - DARTS: Differentiable Architecture Search [Notes] ICLR 2019 [VGG author]
 - FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search [Notes] CVPR 2019 [DARTS]
 - FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions CVPR 2020
 - FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining CVPR 2021
 - Perceiver: General Perception with Iterative Attention [Notes] ICML 2021 [transformers, multimodal]
 - Perceiver IO: A General Architecture for Structured Inputs & Outputs [Notes]
 - PillarMotion: Self-Supervised Pillar Motion Learning for Autonomous Driving [Notes] CVPR 2021 [Qcraft, Alan Yuille]
 - SimTrack: Exploring Simple 3D Multi-Object Tracking for Autonomous Driving [Notes] ICCV 2021 [QCraft, Alan Yuille]
 
- HDMapNet: An Online HD Map Construction and Evaluation Framework [Notes] CVPR 2021 workshop [youtube video only, Li Auto]
 
- FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras [Notes] ICCV 2021 [BEVNet, perception + prediction]
 - Baidu's CNN seg [Notes]
 
- Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation [Notes] CVPR 2021 [megvii]
 - CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark CVPR 2019
 - The Overlooked Elephant of Object Detection: Open Set WACV 2021
 - Class-Agnostic Object Detection WACV 2021
 - OWOD: Towards Open World Object Detection [Notes] CVPR 2021 oral
 - FsDet: Frustratingly Simple Few-Shot Object Detection ICML 2020
 - MonoFlex: Objects are Different: Flexible Monocular 3D Object Detection [Notes] CVPR 2021 [mono3D, Jiwen Lu, cropped]
 - monoDLE: Delving into Localization Errors for Monocular 3D Object Detection [Notes] CVPR 2021 [mono3D]
 - Exploring 2D Data Augmentation for 3D Monocular Object Detection
 - OCM3D: Object-Centric Monocular 3D Object Detection [mono3D]
 - FSM: Full Surround Monodepth from Multiple Cameras [Notes] ICRA 2021 [monodepth, Xnet]
 
- CaDDN: Categorical Depth Distribution Network for Monocular 3D Object Detection [Notes] CVPR 2021 oral [mono3D, BEVNet]
 - DSNT: Numerical Coordinate Regression with Convolutional Neural Networks [Notes] [differentiable spatial to numerical transform; see the soft-argmax sketch after this list]
 - Soft-Argmax: Human pose regression by combining indirect part detection and contextual information
 - INSTA-YOLO: Real-Time Instance Segmentation [Notes] ICML workshop 2020 [single stage instance segmentation]
 - CenterNet2: Probabilistic two-stage detection [Notes] [CenterNet, two-stage]
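
Following up on the DSNT entry above, here is a minimal sketch of the soft-argmax idea: normalize a heatmap with a softmax and take the expected (x, y) coordinate, which makes coordinate regression differentiable end to end. Tensor shapes and the temperature `beta` are illustrative assumptions.

```python
# Minimal sketch of a 2D soft-argmax (DSNT-style differentiable coordinate
# regression). Shapes and the temperature are illustrative assumptions.
import torch


def soft_argmax_2d(heatmaps, beta: float = 1.0):
    """heatmaps: (B, K, H, W) raw scores for K keypoints.
    Returns (B, K, 2) coordinates normalized to [0, 1], ordered as (x, y)."""
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(beta * heatmaps.reshape(B, K, -1), dim=-1).reshape(B, K, H, W)
    ys = torch.linspace(0.0, 1.0, H, device=heatmaps.device)
    xs = torch.linspace(0.0, 1.0, W, device=heatmaps.device)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)   # marginalize over x, then E[y]
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)   # marginalize over y, then E[x]
    return torch.stack([exp_x, exp_y], dim=-1)


if __name__ == "__main__":
    hm = torch.randn(2, 17, 64, 48)              # e.g. 17 human-pose keypoints
    print(soft_argmax_2d(hm).shape)              # torch.Size([2, 17, 2])
```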
 
- Confluence: A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection [Notes] [NMS]
 - BoxInst: High-Performance Instance Segmentation with Box Annotations [Notes] CVPR 2021 [Chunhua Shen, Tian Zhi]
 - 3DSSD: Point-based 3D Single Stage Object Detector [Notes] CVPR 2020
 - RepVGG: Making VGG-style ConvNets Great Again [Notes] [Megvii, Xiangyu Zhang, ACNet]
 - ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks [Notes] ICCV 2019
 - BEV-Feat-Stitching: Understanding Bird's-Eye View Semantic HD-Maps Using an Onboard Monocular Camera [Notes] [BEVNet, mono3D, Luc Van Gool]
 - PSS: Object Detection Made Simpler by Eliminating Heuristic NMS [Notes] [Transformer, DETR]
 
- DeFCN: End-to-End Object Detection with Fully Convolutional Network [Notes] [Transformer, DETR]
 - OneNet: End-to-End One-Stage Object Detection by Classification Cost [Notes] [Transformer, DETR]
 - Traffic Light Mapping, Localization, and State Detection for Autonomous Vehicles [Notes] ICRA 2011 [traffic light, Sebastian Thrun]
 - Towards lifelong feature-based mapping in semi-static environments [Notes] ICRA 2016
 - How to Keep HD Maps for Automated Driving Up To Date [Notes] ICRA 2020 [BMW]
 - Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection [Notes] CVPR 2021 [focal loss]
 - Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning [Notes] CVPR 2018 workshop
 - Centroid Voting: Object-Aware Centroid Voting for Monocular 3D Object Detection [Notes] IROS 2020 [mono3D, geometry + appearance = distance]
 - Monocular 3D Object Detection in Cylindrical Images from Fisheye Cameras [Notes] [GM Israel, mono3D]
 - DeepPS: Vision-Based Parking-Slot Detection: A DCNN-Based Approach and a Large-Scale Benchmark Dataset TIP 2018 [Parking slot detection, PS2.0 dataset]
 - PSDet: Efficient and Universal Parking Slot Detection [Notes] IV 2020 [Zongmu, Parking slot detection]
 - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [Notes] ASPLOS 2020 [pruning]
 - Scaled-YOLOv4: Scaling Cross Stage Partial Network [Notes] [yolo]
 - Yolov5 by Ultralytics [Notes] [yolo, spatial2channel]
 - PP-YOLO: An Effective and Efficient Implementation of Object Detector [Notes] [yolo, paddle-paddle, baidu]
 - PointPainting: Sequential Fusion for 3D Object Detection [Notes] [nuScenes]
 - MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps [Notes] CVPR 2020 [Unseen moving objects, BEV]
 - Locating Objects Without Bounding Boxes [Notes] CVPR 2019 [weighted Hausdorff distance, NMS-free]
 
- TSP: Rethinking Transformer-based Set Prediction for Object Detection [Notes] ICCV 2021 [DETR, transformers, Kris Kitani]
 - Sparse R-CNN: End-to-End Object Detection with Learnable Proposals [Notes] CVPR 2021 [DETR, Transformer]
 - Unsupervised Monocular Depth Learning in Dynamic Scenes [Notes] CoRL 2020 [LearnK improved ver, Google]
 - MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time [Notes] ICML 2020 [Mono3D, pairwise relationship]
 - Argoverse: 3D Tracking and Forecasting with Rich Maps [Notes] CVPR 2019 [HD maps, dataset, CV lidar]
 - The H3D Dataset for Full-Surround 3D Multi-Object Detection and Tracking in Crowded Urban Scenes [Notes] ICRA 2019
 - Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection CVPRW 2020 [dataset, Daimler, mono3D]
 - NYC3DCars: A Dataset of 3D Vehicles in Geographic Context ICCV 2013
 - Towards Fully Autonomous Driving: Systems and Algorithms IV 2011
 - Center3D: Center-based Monocular 3D Object Detection with Joint Depth Understanding [Notes] [mono3D, LID+DepJoint]
 - ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection AAAI 2020 oral [mono3D]
 - CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection [Notes] WACV 2021 [early fusion, camera, radar]
 - 3D-LaneNet+: Anchor Free Lane Detection using a Semi-Local Representation [Notes] NeurIPS 2020 workshop [GM Israel, 3D LLD]
 - LSTR: End-to-end Lane Shape Prediction with Transformers [Notes] WACV 2021 [LLD, transformers]
 - PIXOR: Real-time 3D Object Detection from Point Clouds [Notes] CVPR 2018 (birds eye view)
 - HDNET/PIXOR++: Exploiting HD Maps for 3D Object Detection [Notes] CoRL 2018
 - CPNDet: Corner Proposal Network for Anchor-free, Two-stage Object Detection ECCV 2020 [anchor free, two stage]
 - MVF: End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds [Notes] CoRL 2019 [Waymo, VoxelNet 1st author]
 - Pillar-based Object Detection for Autonomous Driving [Notes] ECCV 2020
 - Training-Time-Friendly Network for Real-Time Object Detection AAAI 2020 [anchor-free, fast training]
 - Autonomous Driving with Deep Learning: A Survey of State-of-Art Technologies [Review of autonomous stack, Yu Huang]
 - Dense Monocular Depth Estimation in Complex Dynamic Scenes CVPR 2016
 - Probabilistic Future Prediction for Video Scene Understanding
 - AB3D: A Baseline for 3D Multi-Object Tracking IROS 2020 [3D MOT]
 - Spatial-Temporal Relation Networks for Multi-Object Tracking ICCV 2019 [MOT, feature location over time]
 - Beyond Pixels: Leveraging Geometry and Shape Cues for Online Multi-Object Tracking ICRA 2018 [MOT, IIT, 3D shape]
 - ST-3D: Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking CVPR 2020 [Peiliang Li, author of VINS and S3DOT]
 - Augment Your Batch: Improving Generalization Through Instance Repetition CVPR 2020
 - RetinaTrack: Online Single Stage Joint Detection and Tracking CVPR 2020 [MOT]
 - Object as Hotspots: An Anchor-Free 3D Object Detection Approach via Firing of Hotspots
 - Gradient Centralization: A New Optimization Technique for Deep Neural Networks ECCV 2020 oral
 - Depth Completion via Deep Basis Fitting WACV 2020
 - BTS: From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation [monodepth, supervised]
 - The Edge of Depth: Explicit Constraints between Segmentation and Depth CVPR 2020 [monodepth, Xiaoming Liu]
 - On the Continuity of Rotation Representations in Neural Networks CVPR 2019 [rotational representation]
 - VDO-SLAM: A Visual Dynamic Object-aware SLAM System IJRR 2020
 - Dynamic SLAM: The Need For Speed
 - Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction ECCV 2020
 - Traffic Light Mapping and Detection [Notes] ICRA 2011 [traffic light, Google, Chris Urmson]
 - Traffic light recognition exploiting map and localization at every stage [Notes] Expert Systems 2017 [traffic light, 鲜于明镐,徐在圭,郑浩奇]
 - Traffic Light Recognition Using Deep Learning and Prior Maps for Autonomous Cars [Notes] IJCNN 2019 [traffic light, Espirito Santo Brazil]
 
- TSM: Temporal Shift Module for Efficient Video Understanding [Notes] ICCV 2019 [Song Han, video, object detection]
 - WOD: Scalability in Perception for Autonomous Driving: Waymo Open Dataset [Notes] CVPR 2020
 - Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection [Notes] NeurIPS 2020 [classification as regression]
 - A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection NeurIPS 2020 spotlight
 - Rethinking the Value of Labels for Improving Class-Imbalanced Learning NeurIPS 2020
 - RepLoss: Repulsion Loss: Detecting Pedestrians in a Crowd [Notes] CVPR 2018 [crowd detection, Megvii]
 - Adaptive NMS: Refining Pedestrian Detection in a Crowd [Notes] CVPR 2019 oral [crowd detection, NMS]
 - AggLoss: Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd [Notes] ECCV 2018 [crowd detection]
 - CrowdDet: Detection in Crowded Scenes: One Proposal, Multiple Predictions [Notes] CVPR 2020 oral [crowd detection, Megvii, Earth mover's distance]
 - R2-NMS: NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing [Notes] CVPR 2020
 - Double Anchor R-CNN for Human Detection in a Crowd [Notes] [head-body bundle]
 - Review: AP vs MR
 - SKU110K: Precise Detection in Densely Packed Scenes [Notes] CVPR 2019 [crowd detection, no occlusion]
 - GossipNet: Learning non-maximum suppression CVPR 2017
 - TLL: Small-scale Pedestrian Detection Based on Somatic Topology Localization and Temporal Feature Aggregation ECCV 2018
 - Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels GCPR 2020 [mono3D, Daniel Cremers, TUM]
 - CubifAE-3D: Monocular Camera Space Cubification on Autonomous Vehicles for Auto-Encoder based 3D Object Detection [Notes] [mono3D, depth AE pretraining]
 - Deformable DETR: Deformable Transformers for End-to-End Object Detection [Notes] ICLR 2021 [Jifeng Dai, DETR]
 - ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Notes] ICLR 2021
 - BYOL: Bootstrap your own latent: A new approach to self-supervised Learning [self-supervised]
 
- SDFLabel: Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors [Notes] CVPR 2020 oral [TRI, differentiable rendering]
 - DensePose: Dense Human Pose Estimation In The Wild [Notes] CVPR 2018 oral [FAIR]
 - NOCS: Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation CVPR 2019
 - monoDR: Monocular Differentiable Rendering for Self-Supervised 3D Object Detection [Notes] ECCV 2020 [TRI, mono3D]
 - Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [Notes] ECCV 2020 [BEV-Net, Utoronto, Sanja Fidler]
 - Implicit Latent Variable Model for Scene-Consistent Motion Forecasting ECCV 2020 [Uber ATG, Raquel Urtasun]
 - FISHING Net: Future Inference of Semantic Heatmaps In Grids [Notes] CVPRW 2020 [BEV-Net, Mapping, Zoox]
 - VPN: Cross-view Semantic Segmentation for Sensing Surroundings [Notes] RAL 2020 [Bolei Zhou, BEV-Net]
 - VED: Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks [Notes] ICRA 2019 [BEV-Net]
 - Cam2BEV: A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View [Notes] ITSC 2020 [BEV-Net]
 - Learning to Look around Objects for Top-View Representations of Outdoor Scenes [Notes] ECCV 2018 [BEV-Net, UCSD, Manmohan Chandraker]
 - A Parametric Top-View Representation of Complex Road Scenes CVPR 2019 [BEV-Net, UCSD, Manmohan Chandraker]
 - FTM: Understanding Road Layout from Videos as a Whole CVPR 2020 [BEV-Net, UCSD, Manmohan Chandraker]
 - KM3D-Net: Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [Notes] RAL 2021 [RTM3D, Peixuan Li]
 - InstanceMotSeg: Real-time Instance Motion Segmentation for Autonomous Driving [Notes] IROS 2020 [motion segmentation]
 - MPV-Nets: Monocular Plan View Networks for Autonomous Driving [Notes] IROS 2019 [BEV-Net]
 - Class-Balanced Loss Based on Effective Number of Samples [Notes] CVPR 2019 [Focal loss authors]
 - Geometric Pretraining for Monocular Depth Estimation [Notes] ICRA 2020
 - Robust Traffic Light and Arrow Detection Using Digital Map with Spatial Prior Information for Automated Driving [Notes] Sensors 2020 [traffic light, Kanazawa]
 
- Feature-metric Loss for Self-supervised Learning of Depth and Egomotion [Notes] ECCV 2020 [feature-metric, local minima, monodepth]
 - Depth-VO-Feat: Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction CVPR 2018 [feature-metric, monodepth]
 - MonoResMatch: Learning monocular depth estimation infusing traditional stereo knowledge [Notes] CVPR 2019 [monodepth, local minima, cheap stereo GT]
 - SGDepth: Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance [Notes] ECCV 2020 [Moving objects]
 - Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding ECCV 2018 [dynamic objects, rigid and dynamic motion]
 - Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding TPAMI 2018
 - CC: Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation [Notes] CVPR 2019
 - ObjMotionNet: Self-supervised Object Motion and Depth Estimation from Video [Notes] CVPRW 2020 [object motion prediction, velocity prediction]
 - Instance-wise Depth and Motion Learning from Monocular Videos
 - Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation
 - Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues
 - DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency ECCV 2018
 - LineNet: a Zoomable CNN for Crowdsourced High Definition Maps Modeling in Urban Environments [mapping]
 - Road-SLAM: Road Marking based SLAM with Lane-level Accuracy [Notes] [HD mapping]
 - AVP-SLAM: Semantic Visual Mapping and Localization for Autonomous Vehicles in the Parking Lot [Notes] IROS 2020 [Huawei, HD mapping, Tong Qin, VINS author, autonomous valet parking]
 - AVP-SLAM-Late-Fusion: Mapping and Localization using Semantic Road Marking with Centimeter-level Accuracy in Indoor Parking Lots [Notes] ITSC 2019
 - Lane markings-based relocalization on highway ITSC 2019
 - DeepRoadMapper: Extracting Road Topology from Aerial Images [Notes] ICCV 2017 [Uber ATG, NOT HD maps]
 - RoadTracer: Automatic Extraction of Road Networks from Aerial Images CVPR 2018 [NOT HD maps]
 - PolyMapper: Topological Map Extraction From Overhead Images [Notes] ICCV 2019 [mapping, polygon, NOT HD maps]
 - HRAN: Hierarchical Recurrent Attention Networks for Structured Online Maps [Notes] CVPR 2018 [HD mapping, highway, polyline loss, Chamfer distance]
 - Deep Structured Crosswalk: End-to-End Deep Structured Models for Drawing Crosswalks [Notes] ECCV 2018
 - DeepBoundaryExtractor: Convolutional Recurrent Network for Road Boundary Extraction [Notes] CVPR 2019 [HD mapping, boundary, polyline loss]
 - DAGMapper: Learning to Map by Discovering Lane Topology [Notes] ICCV 2019 [HD mapping, highway, forks and merges, polyline loss]
 - Sparse-HD-Maps: Exploiting Sparse Semantic HD Maps for Self-Driving Vehicle Localization [Notes] IROS 2019 oral [Uber ATG, metadata, mapping, localization]
 - Aerial LaneNet: Lane Marking Semantic Segmentation in Aerial Imagery using Wavelet-Enhanced Cost-sensitive Symmetric Fully Convolutional Neural Networks IEEE TGRS 2018
 - Monocular Localization with Vector HD Map (MLVHM): A Low-Cost Method for Commercial IVs Sensors 2020 [Tsinghua, 3D HD maps]
 - PatchNet: Rethinking Pseudo-LiDAR Representation [Notes] ECCV 2020 [SenseTime, Wanli Ouyang]
 - D4LCN: Learning Depth-Guided Convolutions for Monocular 3D Object Detection [Notes] CVPR 2020 [mono3D]
 - MfS: Learning Stereo from Single Images [Notes] ECCV 2020 [mono for stereo, learn stereo matching with mono]
 - BorderDet: Border Feature for Dense Object Detection ECCV 2020 oral [Megvii]
 - Scale-Aware Trident Networks for Object Detection ICCV 2019 [different heads for different scales]
 - Learning Depth from Monocular Videos using Direct Methods
 - Vid2Depth: Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints CVPR 2018 [Google]
 - NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
 - Supervising the new with the old: learning SFM from SFM [Notes] ECCV 2018
 - Neural RGB->D Sensing: Depth and Uncertainty from a Video Camera CVPR 2019 [multi-frame monodepth]
 - Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [multi-frame monodepth, RNN]
 - Recurrent Neural Network for (Un-)supervised Learning of Monocular Video Visual Odometry and Depth [multi-frame monodepth, RNN]
 - Exploiting temporal consistency for real-time video depth estimation ICCV 2019 [multi-frame monodepth, RNN, indoor]
 - SfM-Net: Learning of Structure and Motion from Video [dynamic object, SfM]
 - MB-Net: MergeBoxes for Real-Time 3D Vehicles Detection [Notes] IV 2018 [mono3D: Daimler]
 - BS3D: Beyond Bounding Boxes: Using Bounding Shapes for Real-Time 3D Vehicle Detection from Monocular RGB Images [Notes] IV 2019 [mono3D, Daimler]
 - 3D-GCK: Single-Shot 3D Detection of Vehicles from Monocular RGB Images via Geometrically Constrained Keypoints in Real-Time [Notes] IV 2020 [mono3D, Daimler]
 - UR3D: Distance-Normalized Unified Representation for Monocular 3D Object Detection [Notes] ECCV 2020 [mono3D]
 - DA-3Det: Monocular 3D Object Detection via Feature Domain Adaptation [Notes] ECCV 2020 [mono3D]
 - RAR-Net: Reinforced Axial Refinement Network for Monocular 3D Object Detection [Notes] ECCV 2020 [mono3D]
 
- CenterTrack: Tracking Objects as Points [Notes] ECCV 2020 spotlight [camera based 3D MOD, MOT SOTA, CenterNet, video based object detection, Philipp Krähenbühl]
 - CenterPoint: Center-based 3D Object Detection and Tracking [Notes] CVPR 2021 [lidar based 3D MOD, CenterNet]
 - Tracktor: Tracking without bells and whistles [Notes] ICCV 2019 [Tracktor/Tracktor++, Laura Leal-Taixe@TUM]
 - FairMOT: A Simple Baseline for Multi-Object Tracking [Notes]
 - DeepMOT: A Differentiable Framework for Training Multiple Object Trackers [Notes] CVPR 2020 [trainable Hungarian, Laura Leal-Taixe@TUM]
 - MPNTracker: Learning a Neural Solver for Multiple Object Tracking CVPR 2020 oral [trainable Hungarian, Laura Leal-Taixe@TUM]
 - nuScenes: A multimodal dataset for autonomous driving [Notes] CVPR 2020 [dataset, point cloud, radar]
 - CBGS: Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection [Notes] CVPRW 2019 [Megvii, lidar, WAD challenge winner]
 - AFDet: Anchor Free One Stage 3D Object Detection (and competition solution) [Notes] CVPRW 2020 [Horizon Robotics, lidar, Waymo challenge winner]
 - Review of MOT and SOT [Notes]
 - CrowdHuman: A Benchmark for Detecting Human in a Crowd [Notes] [megvii, pedestrian, dataset]
 - WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild [Notes] TMM 2019 [dataset, pedestrian]
 - Tsinghua-Daimler Cyclists: A New Benchmark for Vison-Based Cyclist Detection [Notes] IV 2016 [dataset, cyclist Detection]
 - Specialized Cyclist Detection Dataset: Challenging Real-World Computer Vision Dataset for Cyclist Detection Using a Monocular RGB Camera [Notes] IV 2019 [Extension to KITTI]
 - PointTrack: Segment as Points for Efficient Online Multi-Object Tracking and Segmentation [Notes] ECCV 2020 oral [MOTS]
 - PointTrack++ for Effective Online Multi-Object Tracking and Segmentation [Notes] CVPR 2020 workshop [CVPR2020 MOTS Challenge Winner. PointTrack++ ranks first on KITTI MOTS]
 - SpatialEmbedding: Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth [Notes] ICCV 2019 [one-stage, instance segmentation]
 - BA-Net: Dense Bundle Adjustment Networks [Notes] ICLR 2019 [Bundle adjustment, multi-frame monodepth, feature-metric]
 - DeepSFM: Structure From Motion Via Deep Bundle Adjustment ECCV 2020 oral [multi-frame monodepth, indoor scene]
 - CVD: Consistent Video Depth Estimation [Notes] SIGGRAPH 2020 [multi-frame monodepth, online finetune]
 - DeepV2D: Video to Depth with Differentiable Structure from Motion [Notes] ICLR 2020 [multi-frame monodepth, Jia Deng]
 - GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose [Notes] CVPR 2018 [residual optical flow, monodepth, rigid and dynamic motion]
 - GLNet: Self-supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera [Notes] ICCV 2019 [online finetune, rigid and dynamic motion]
 - Depth Hints: Self-Supervised Monocular Depth Hints [Notes] ICCV 2019 [monodepth, local minima, cheap stereo GT]
 - MonoUncertainty: On the uncertainty of self-supervised monocular depth estimation [Notes] CVPR 2020 [depth uncertainty]
 - Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment [Notes] [Bundle adjustment, xmotors.ai, multi-frame monodepth]
 - Kinematic 3D Object Detection in Monocular Video [Notes] ECCV 2020 [multi-frame mono3D, Xiaoming Liu]
 - VelocityNet: Camera-based vehicle velocity estimation from monocular video [Notes] CVPR 2017 workshop [monocular velocity estimation, CVPR 2017 challenge winner]
 - Vehicle Centric VelocityNet: End-to-end Learning for Inter-Vehicle Distance and Relative Velocity Estimation in ADAS with a Monocular Camera [Notes] [monocular velocity estimation, monocular distance, SOTA]
 
- LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain [Notes] IROS 2018 [lidar, mapping]
 - PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction [Notes] ICCV 2019
 - JAAD: Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior ICCV 2017
 - Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs BMVC 2019
 - Is the Pedestrian going to Cross? Answering by 2D Pose Estimation IV 2018
 - Intention Recognition of Pedestrians and Cyclists by 2D Pose Estimation ITSC 2019 [skeleton, pedestrian, cyclist intention]
 - Attentive Single-Tasking of Multiple Tasks CVPR 2019
 - DETR: End-to-End Object Detection with Transformers [Notes] ECCV 2020 oral [FAIR]
 - Transformer: Attention Is All You Need [Notes] NIPS 2017 [see the attention sketch after this list]
 - SpeedNet: Learning the Speediness in Videos [Notes] CVPR 2020 oral
 - MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships [Notes] CVPR 2020 [Mono3D, pairwise relationship]
 - SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [Notes] CVPRW 2020 [Mono3D, Zongmu]
 - Vehicle Re-ID for Surround-view Camera System [Notes] CVPRW 2020 [tireline, vehicle ReID, Zongmu]
 - End-to-End Lane Marker Detection via Row-wise Classification [Notes] [Qualcomm Korea, LLD as cls]
 - Reliable multilane detection and classification by utilizing CNN as a regression network ECCV 2018 [LLD as reg]
 - SUPER: A Novel Lane Detection System [Notes]
 - Learning Lightweight Lane Detection CNNs by Self Attention Distillation ICCV 2019
 - StixelNet: A Deep Convolutional Network for Obstacle Detection and Road Segmentation BMVC 2015
 - StixelNetV2: Real-time category-based and general obstacle detection for autonomous driving [Notes] ICCV 2017 [DS]
 - Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network [Notes] CVPR 2016 [channel-to-pixel]
 - Car Pose in Context: Accurate Pose Estimation with Ground Plane Constraints [mono3D]
 - Self-Mono-SF: Self-Supervised Monocular Scene Flow Estimation [Notes] CVPR 2020 oral [scene-flow, Stereo input]
 - MEBOW: Monocular Estimation of Body Orientation In the Wild [Notes] CVPR 2020
 - VG-NMS: Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes [Notes] NeurIPS 2019 workshop [Crowded scene, NMS, Daimler]
 - WYSIWYG: What You See is What You Get: Exploiting Visibility for 3D Object Detection [Notes] CVPR 2020 oral [occupancy grid]
 - Real-Time Panoptic Segmentation From Dense Detections [Notes] CVPR 2020 oral [bbox + semantic segmentation = panoptic segmentation, Toyota]
 - Human-Centric Efficiency Improvements in Image Annotation for Autonomous Driving [Notes] CVPRW 2020 [efficient annotation]
 - SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving [Notes] CVPR 2020 oral [Waymo, auto data generation, surfel]
 - LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World [Notes] CVPR 2020 oral [Uber ATG, auto data generation, surfel]
 - SuMa++: Efficient LiDAR-based Semantic SLAM IROS 2019 [semantic segmentation, lidar, SLAM]
 - PON/PyrOccNet: Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks [Notes] CVPR 2020 oral [BEV-Net, OFT]
 - MonoLayout: Amodal scene layout from a single image [Notes] WACV 2020 [BEV-Net]
 - BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud [Notes] CVPR 2020 workshop [BEV-Net, Mapping]
 - A Geometric Approach to Obtain a Bird's Eye View from an Image ICCVW 2019 [mapping, geometry, Andrew Zisserman]
 - FrozenDepth: Learning the Depths of Moving People by Watching Frozen People [Notes] CVPR 2019 oral
 - ORB-SLAM: a Versatile and Accurate Monocular SLAM System TRO 2015
 - ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras TRO 2016
 - CubeSLAM: Monocular 3D Object SLAM [Notes] TRO 2019 [dynamic SLAM, orb slam + mono3D]
 - ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings [Notes] CVPR 2020 [general dynamic SLAM]
 - S3DOT: Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving [Notes] ECCV 2018 [Peiliang Li]
 - Multi-object Monocular SLAM for Dynamic Environments [Notes] IV 2020 [monolayout authors]
 - PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume [Notes] CVPR 2018 oral [Optical flow]
 - LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation CVPR 2018 [Optical flow]
 - FlowNet: Learning Optical Flow With Convolutional Networks ICCV 2015 [Optical flow]
 - FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks CVPR 2017 [Optical flow]
 - ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network CVPR 2019 [semantic segmentation, lightweight]
 - Mono-SF: Multi-View Geometry Meets Single-View Depth for Monocular Scene Flow Estimation of Dynamic Traffic Scenes ICCV 2019 [depth uncertainty]
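The "channel-to-pixel" tag on the sub-pixel convolution entry above refers to rearranging a (C·r², H, W) feature map into a (C, H·r, W·r) one (PyTorch calls this PixelShuffle). A minimal NumPy sketch of that rearrangement, for illustration only:

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r) ("channel-to-pixel")."""
    c, h, w = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    out_c = c // (r * r)
    # split channels into (out_c, r, r), then interleave the r-axes with the spatial axes
    x = x.reshape(out_c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)          # (out_c, h, r, w, r)
    return x.reshape(out_c, h * r, w * r)

# toy usage: a 4-channel 2x2 map becomes a 1-channel 4x4 map (upscale factor 2)
y = pixel_shuffle(np.arange(16, dtype=float).reshape(4, 2, 2), r=2)
print(y.shape)  # (1, 4, 4)
```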
 
- Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems [Notes] [Honda] ICRA 2019
 - PackNet: 3D Packing for Self-Supervised Monocular Depth Estimation [Notes] CVPR 2020 oral [Scale aware depth]
 - PackNet-SG: Semantically-Guided Representation Learning for Self-Supervised Monocular Depth [Notes] ICLR 2020 [TRI, infinite-depth problem]
 - TrianFlow: Towards Better Generalization: Joint Depth-Pose Learning without PoseNet [Notes] CVPR 2020 [Scale aware]
 - Understanding the Limitations of CNN-based Absolute Camera Pose Regression [Notes] CVPR 2019 [Drawbacks of PoseNet, MapNet, Laura Leal-Taixe@TUM]
 - To Learn or Not to Learn: Visual Localization from Essential Matrices [Notes] ICRA 2020 [SIFT + 5 pt solver >> others for VO, Laura Leal-Taixe@TUM]
 - DF-VO: Visual Odometry Revisited: What Should Be Learnt? [Notes] ICRA 2020 [Depth and Flow for accurate VO]
 - D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry [Notes] CVPR 2020 oral [Daniel Cremers, TUM, depth uncertainty]
 - Network Slimming: Learning Efficient Convolutional Networks through Network Slimming [Notes] ICCV 2017
 - BatchNorm Pruning: Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers [Notes] ICLR 2018
 - Direct Sparse Odometry PAMI 2018
 - Train in Germany, Test in The USA: Making 3D Object Detectors Generalize [Notes] CVPR 2020
 - PseudoLidarV3: End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection [Notes] CVPR 2020
 - ATSS: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection [Notes] CVPR 2020 oral
 - Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression AAAI 2020
 - Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation [Journal version]
 - YOLOv4: Optimal Speed and Accuracy of Object Detection [Notes]
 - CBN: Cross-Iteration Batch Normalization [Notes]
 - Stitcher: Feedback-driven Data Provider for Object Detection [Notes]
 - SKNet: Selective Kernel Networks [Notes] CVPR 2019
 - CBAM: Convolutional Block Attention Module [Notes] ECCV 2018
 - ResNeSt: Split-Attention Networks [Notes]
 
- ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst [Notes] RSS 2019 [Waymo]
 - IntentNet: Learning to Predict Intention from Raw Sensor Data [Notes] CoRL 2018 [Uber ATG, perception and prediction, Lidar+Map]
 - RoR: Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions [Notes] CVPR 2019 [Zoox]
 - MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction [Notes] CoRL 2019 [Waymo, authors from RoR and ChauffeurNet]
 - NMP: End-to-end Interpretable Neural Motion Planner [Notes] CVPR 2019 oral [Uber ATG]
 - Multimodal Trajectory Predictions for Autonomous Driving using Deep Convolutional Networks [Notes] ICRA 2019 [Henggang Cui, Multimodal, Uber ATG Pittsburgh]
 - Uncertainty-aware Short-term Motion Prediction of Traffic Actors for Autonomous Driving WACV 2020 [Uber ATG Pittsburgh]
 - TensorMask: A Foundation for Dense Object Segmentation [Notes] ICCV 2019 [single-stage instance seg]
 - BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation [Notes] CVPR 2020 oral
 - Mask Encoding for Single Shot Instance Segmentation [Notes] CVPR 2020 oral [single-stage instance seg, Chunhua Shen]
 - PolarMask: Single Shot Instance Segmentation with Polar Representation [Notes] CVPR 2020 oral [single-stage instance seg]
 - SOLO: Segmenting Objects by Locations [Notes] ECCV 2020 [single-stage instance seg, Chunhua Shen]
 - SOLOv2: Dynamic, Faster and Stronger [Notes] [single-stage instance seg, Chunhua Shen]
 - CondInst: Conditional Convolutions for Instance Segmentation [Notes] ECCV 2020 oral [single-stage instance seg, Chunhua Shen]
 - CenterMask: Single Shot Instance Segmentation With Point Representation [Notes] CVPR 2020
 
- VPGNet: Vanishing Point Guided Network for Lane and Road Marking Detection and Recognition [Notes] ICCV 2017
 - Which Tasks Should Be Learned Together in Multi-task Learning? [Notes] [Stanford, MTL] ICML 2020
 - MGDA: Multi-Task Learning as Multi-Objective Optimization NeurIPS 2018
 - Taskonomy: Disentangling Task Transfer Learning [Notes] CVPR 2018
 - Rethinking ImageNet Pre-training [Notes] ICCV 2019 [Kaiming He]
 - UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor [Notes] [superpoint]
 - KP2D: Neural Outlier Rejection for Self-Supervised Keypoint Learning [Notes] ICLR 2020 (pointNet)
 - KP3D: Self-Supervised 3D Keypoint Learning for Ego-motion Estimation [Notes] CoRL 2020 [Toyota, superpoint]
 - NG-RANSAC: Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses [Notes] ICCV 2019 [pointNet]
 - Learning to Find Good Correspondences [Notes] CVPR 2018 Oral (pointNet)
 - RefinedMPL: Refined Monocular PseudoLiDAR for 3D Object Detection in Autonomous Driving [Notes] [Huawei, Mono3D]
 - DSP: Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation [Notes] AAAI 2020 (SenseTime, Mono3D)
 - Robust Lane Detection from Continuous Driving Scenes Using Deep Neural Networks (LLD, LSTM)
 - LaneNet: Towards End-to-End Lane Detection: an Instance Segmentation Approach [Notes] IV 2018 (LaneNet)
 - 3D-LaneNet: End-to-End 3D Multiple Lane Detection [Notes] ICCV 2019
 - Semi-Local 3D Lane Detection and Uncertainty Estimation [Notes] [GM Israel, 3D LLD]
 - Gen-LaneNet: A Generalized and Scalable Approach for 3D Lane Detection [Notes] ECCV 2020 [Apollo, 3D LLD]
 - Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty CVPR 2018 [Egocentric prediction]
 - It’s Not All About Size: On the Role of Data Properties in Pedestrian Detection ECCV 2018 [pedestrian]
 
- Associative Embedding: End-to-End Learning for Joint Detection and Grouping [Notes] NIPS 2017
 - Pixels to Graphs by Associative Embedding [Notes] NIPS 2017
 - Social LSTM: Human Trajectory Prediction in Crowded Spaces [Notes] CVPR 2017
 - Online Video Object Detection using Association LSTM [Notes] [single stage, recurrent]
 - SuperPoint: Self-Supervised Interest Point Detection and Description [Notes] CVPR 2018 (channel-to-pixel, deep SLAM, Magic Leap)
 - PointRend: Image Segmentation as Rendering [Notes] CVPR 2020 Oral [Kaiming He, FAIR]
 - Multigrid: A Multigrid Method for Efficiently Training Video Models [Notes] CVPR 2020 Oral [Kaiming He, FAIR]
 - GhostNet: More Features from Cheap Operations [Notes] CVPR 2020
 - FixRes: Fixing the train-test resolution discrepancy [Notes] NIPS 2019 [FAIR]
 - MoVi-3D: Towards Generalization Across Depth for Monocular 3D Object Detection [Notes] ECCV 2020 [Virtual Cam, viewport, Mapillary/Facebook, Mono3D]
 - Amodal Completion and Size Constancy in Natural Scenes [Notes] ICCV 2015 (Amodal completion)
 - MoCo: Momentum Contrast for Unsupervised Visual Representation Learning [Notes] CVPR 2020 Oral [FAIR, Kaiming He]
 
- Double Descent: Reconciling modern machine learning practice and the bias-variance trade-off [Notes] PNAS 2019
 - Deep Double Descent: Where Bigger Models and More Data Hurt [Notes]
 - Visualizing the Loss Landscape of Neural Nets NIPS 2018
 - The ApolloScape Open Dataset for Autonomous Driving and its Application CVPR 2018 (dataset)
 - ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving [Notes] CVPR 2019
 - Part-level Car Parsing and Reconstruction from a Single Street View [Notes] [Baidu]
 - 6D-VNet: End-to-end 6DoF Vehicle Pose Estimation from Monocular RGB Images [Notes] CVPR 2019
 - RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving [Notes] ECCV 2020 spotlight
 - DORN: Deep Ordinal Regression Network for Monocular Depth Estimation [Notes] CVPR 2018 [monodepth, supervised]
 - D&T: Detect to Track and Track to Detect [Notes] ICCV 2017 (from Feichtenhofer)
 - CRF-Net: A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection [Notes] SDF 2019 (radar detection)
 - RVNet: Deep Sensor Fusion of Monocular Camera and Radar for Image-based Obstacle Detection in Challenging Environments [Notes] PSIVT 2019
 - RRPN: Radar Region Proposal Network for Object Detection in Autonomous Vehicles [Notes] ICIP 2019
 - ROLO: Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking [Notes] ISCAS 2016
 - Recurrent SSD: Recurrent Multi-frame Single Shot Detector for Video Object Detection [Notes] BMVC 2018 (Mitsubishi)
 - Recurrent RetinaNet: A Video Object Detection Model Based on Focal Loss [Notes] ICONIP 2018 (single stage, recurrent)
 - Actions as Moving Points [Notes] [not suitable for online]
 - The PREVENTION dataset: a novel benchmark for PREdiction of VEhicles iNTentIONs [Notes] ITSC 2019 [dataset, cut-in]
 - Semi-Automatic High-Accuracy Labelling Tool for Multi-Modal Long-Range Sensor Dataset [Notes] IV 2018
 - Astyx dataset: Automotive Radar Dataset for Deep Learning Based 3D Object Detection [Notes] EuRAD 2019 (Astyx)
 - Astyx camera radar: Deep Learning Based 3D Object Detection for Automotive Radar and Camera [Notes] EuRAD 2019 (Astyx)
 
- How Do Neural Networks See Depth in Single Images? [Notes] ICCV 2019
 - Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera ICRA 2019 (depth completion)
 - DC: Depth Coefficients for Depth Completion [Notes] CVPR 2019 [Xiaoming Liu, Multimodal]
 - Parse Geometry from a Line: Monocular Depth Estimation with Partial Laser Observation [Notes] ICRA 2017
 - VO-Monodepth: Enhancing self-supervised monocular depth estimation with traditional visual odometry [Notes] 3DV 2019 (sparse to dense)
 - Probabilistic Object Detection: Definition and Evaluation [Notes]
 - The Fishyscapes Benchmark: Measuring Blind Spots in Semantic Segmentation [Notes] ICCV 2019
 - On Calibration of Modern Neural Networks [Notes] ICML 2017 (Weinberger, see sketch below)
 - Extreme clicking for efficient object annotation [Notes] ICCV 2017
 - Radar and Camera Early Fusion for Vehicle Detection in Advanced Driver Assistance Systems [Notes] NeurIPS 2019 (radar)
 - Deep Active Learning for Efficient Training of a LiDAR 3D Object Detector [Notes] IV 2019
 - C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion [Notes] ICCV 2019
 - YOLACT: Real-time Instance Segmentation [Notes] ICCV 2019 [single-stage instance seg]
 - YOLACT++: Better Real-time Instance Segmentation [single-stage instance seg]
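The calibration paper above ("On Calibration of Modern Neural Networks") measures miscalibration with Expected Calibration Error (ECE): bin predictions by confidence and average the gap between accuracy and confidence per bin. A NumPy sketch of ECE under the usual equal-width binning; the bin count and toy numbers are arbitrary, not from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins: int = 10):
    """ECE: sample-weighted |accuracy - confidence| over equal-width confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = np.mean(predictions[mask] == labels[mask])    # accuracy inside the bin
        conf = np.mean(confidences[mask])                   # mean confidence inside the bin
        ece += (mask.sum() / len(confidences)) * abs(acc - conf)
    return float(ece)

conf = np.array([0.95, 0.80, 0.70, 0.60])
pred = np.array([1, 0, 1, 1])
true = np.array([1, 0, 0, 1])
print(expected_calibration_error(conf, pred, true))
```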
 
- Review of Image and Feature Descriptors
 - Vehicle Detection With Automotive Radar Using Deep Learning on Range-Azimuth-Doppler Tensors [Notes] ICCV 2019
 - GPP: Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road [Notes] IV 2020 [UCSD, Trivedi, mono 3DOD]
 - MVRA: Multi-View Reprojection Architecture for Orientation Estimation [Notes] ICCV 2019
 - YOLOv3: An Incremental Improvement
 - Gaussian YOLOv3: An Accurate and Fast Object Detector Using Localization Uncertainty for Autonomous Driving [Notes] ICCV 2019 (Detection with Uncertainty)
 - Bayesian YOLOv3: Uncertainty Estimation in One-Stage Object Detection [Notes] [DriveU]
 - Towards Safe Autonomous Driving: Capture Uncertainty in the Deep Neural Network For Lidar 3D Vehicle Detection [Notes] ITSC 2018 (DriveU)
 - Leveraging Heteroscedastic Aleatoric Uncertainties for Robust Real-Time LiDAR 3D Object Detection [Notes] IV 2019 (DriveU)
 - Can We Trust You? On Calibration of a Probabilistic Object Detector for Autonomous Driving [Notes] IROS 2019 (DriveU)
 - LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving [Notes] CVPR 2019 (uncertainty)
 - LaserNet KL: Learning an Uncertainty-Aware Object Detector for Autonomous Driving [Notes] [LaserNet with KL divergence]
 - IoUNet: Acquisition of Localization Confidence for Accurate Object Detection [Notes] ECCV 2018
 - gIoU: Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression [Notes] CVPR 2019 (see sketch below)
 - The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks CVPR 2018 [IoU as loss]
 - KL Loss: Bounding Box Regression with Uncertainty for Accurate Object Detection [Notes] CVPR 2019
 - CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth [Notes] CVPR 2019
 - BayesOD: A Bayesian Approach for Uncertainty Estimation in Deep Object Detectors [Notes]
 - TW-SMNet: Deep Multitask Learning of Tele-Wide Stereo Matching [Notes] ICIP 2019
 - Accurate Uncertainties for Deep Learning Using Calibrated Regression [Notes] ICML 2018
 - Calibrating Uncertainties in Object Localization Task [Notes] NIPS 2018
 - SMWA: On the Over-Smoothing Problem of CNN Based Disparity Estimation [Notes] ICCV 2019 [Multimodal, depth estimation]
 - Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image [Notes] ICRA 2018 (depth completion)
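The gIoU entry above (and the DIoU/CIoU papers elsewhere in this list) extends IoU so it still provides a gradient for non-overlapping boxes, by penalizing the empty part of the smallest enclosing box. A plain-Python sketch for two axis-aligned (x1, y1, x2, y2) boxes, illustrative rather than any paper's reference code:

```python
def giou(box_a, box_b):
    """Return (IoU, GIoU) for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (may be empty)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest enclosing box
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # GIoU subtracts the fraction of the enclosing box not covered by the union
    return iou, iou - (enclose - union) / enclose

iou, g = giou((0, 0, 2, 2), (1, 1, 3, 3))
print(iou, g, 1.0 - g)   # 1 - GIoU is the loss used for box regression
```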
 
- Review of monocular object detection
 - Review of 2D/3D constraints in Mono 3DOD
 - MonoGRNet 2: Monocular 3D Object Detection via Geometric Reasoning on Keypoints [Notes] [estimates depth from keypoints]
 - Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image [Notes] CVPR 2017
 - SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss [Notes] [regresses distance from images, CenterNet-like]
 - GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving [Notes] CVPR 2019
 - M3D-RPN: Monocular 3D Region Proposal Network for Object Detection [Notes] ICCV 2019 oral [3D anchors, cyclists, Xiaoming Liu]
 - TLNet: Triangulation Learning Network: from Monocular to Stereo 3D Object Detection [Notes] CVPR 2019
 - A Survey on 3D Object Detection Methods for Autonomous Driving Applications [Notes] TITS 2019 [Review]
 - BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image [Notes] IV 2019
 - ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection [Notes] AAAI 2020 oral [successor to pseudo-lidar, mono 3DOD SOTA]
 - Obj-dist: Learning Object-specific Distance from a Monocular Image [Notes] ICCV 2019 (xmotors.ai + NYU) [monocular distance]
 - DisNet: A novel method for distance estimation from monocular camera [Notes] IROS 2018 [monocular distance]
 - BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles [Notes] IROS 2019
 - Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints [Notes] ICIP 2019
 - 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare [Notes] CVPR 2018
 - Deep Optics for Monocular Depth Estimation and 3D Object Detection [Notes] ICCV 2019
 - MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation [Notes] ICCV 2019
 - Joint Monocular 3D Vehicle Detection and Tracking [Notes] ICCV 2019 (Berkeley DeepDrive)
 - CasGeo: 3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results [Notes]
 
- Slimmable Neural Networks [Notes] ICLR 2019
 - Universally Slimmable Networks and Improved Training Techniques [Notes] ICCV 2019
 - AutoSlim: Towards One-Shot Architecture Search for Channel Numbers
 - Once for All: Train One Network and Specialize it for Efficient Deployment
 - DOTA: A Large-scale Dataset for Object Detection in Aerial Images [Notes] CVPR 2018 (rotated bbox)
 - RoiTransformer: Learning RoI Transformer for Oriented Object Detection in Aerial Images [Notes] CVPR 2019 (rotated bbox)
 - RRPN: Arbitrary-Oriented Scene Text Detection via Rotation Proposals TMM 2018
 - R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection (rotated bbox)
 - TI white paper: Webinar: mmWave Radar for Automotive and Industrial applications [Notes] [TI, radar]
 - Federated Learning: Strategies for Improving Communication Efficiency [Notes] NIPS 2016
 - sort: Simple Online and Realtime Tracking [Notes] ICIP 2016
 - deep-sort: Simple Online and Realtime Tracking with a Deep Association Metric [Notes]
 - MT-CNN: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks [Notes] SPL 2016 (real time, facial landmark)
 - RetinaFace: Single-stage Dense Face Localisation in the Wild [Notes] CVPR 2020 [joint object and landmark detection]
 - SC-SfM-Learner: Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video [Notes] NIPS 2019
 - SiamMask: Fast Online Object Tracking and Segmentation: A Unifying Approach CVPR 2019 (tracking, segmentation, label propagation)
 - Review of Kálmán Filter (from Tim Babb, Pixar Animation) [Notes] (see sketch below)
 - R-FCN: Object Detection via Region-based Fully Convolutional Networks [Notes] NIPS 2016
 - Guided backprop: Striving for Simplicity: The All Convolutional Net [Notes] ICLR 2015
 - Occlusion-Net: 2D/3D Occluded Keypoint Localization Using Graph Networks [Notes] CVPR 2019
 - Boxy Vehicle Detection in Large Images [Notes] ICCV 2019
 - FQNet: Deep Fitting Degree Scoring Network for Monocular 3D Object Detection [Notes] CVPR 2019 [Mono 3DOD, Jiwen Lu]
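The Kálmán filter review above (and trackers like sort/deep-sort earlier in this block) comes down to a linear predict/update cycle on a Gaussian state. A minimal 1D constant-velocity filter in NumPy; the state layout and noise values are made up for the example:

```python
import numpy as np

# state x = [position, velocity], measurement z = position only, dt = 1
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # state transition
H = np.array([[1.0, 0.0]])        # measurement model
Q = np.eye(2) * 1e-2              # process noise (assumed)
R = np.array([[1.0]])             # measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P

x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:             # noisy position measurements
    x, P = predict(x, P)
    x, P = update(x, P, np.array([z]))
print(x)                                   # estimated [position, velocity]
```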
 
- Mono3D: Monocular 3D Object Detection for Autonomous Driving [Notes] CVPR 2016
 - MonoDIS: Disentangling Monocular 3D Object Detection [Notes] ICCV 2019
 - Pseudo lidar-e2e: Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud [Notes] ICCV 2019 (pseudo-lidar with 2d and 3d consistency loss, better than PL and worse than PL++, SOTA for pure mono3D)
 - MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization [Notes] AAAI 2019 (SOTA of Mono3DOD, MLF < MonoGRNet < Pseudo-lidar)
 - MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images [Notes] CVPR 2018 (precursor to pseudo-lidar)
 - ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape [Notes] CVPR 2019
 - AM3D: Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving [Notes] ICCV 2019 [similar to pseudo-lidar, color-enhanced]
 - Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors [Notes] (from Stefano Soatto) AAAI 2019
 - Deep Metadata Fusion for Traffic Light to Lane Assignment [Notes] IEEE RA-L 2019 (traffic lights association)
 - Automatic Traffic Light to Ego Vehicle Lane Association at Complex Intersections ITSC 2019 (traffic lights association)
 - Distant Vehicle Detection Using Radar and Vision [Notes] ICRA 2019 [radar, vision, radar tracklets fusion]
 - Distance Estimation of Monocular Based on Vehicle Pose Information [Notes]
 - Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics [Notes] CVPR 2018 (Alex Kendall, see sketch below)
 - GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks [Notes] ICML 2018 (multitask)
 - DTP: Dynamic Task Prioritization for Multitask Learning [Notes] ECCV 2018 [multitask, Stanford]
 - Will this car change the lane? - Turn signal recognition in the frequency domain [Notes] IV 2014
 - Complex-YOLO: Real-time 3D Object Detection on Point Clouds [Notes] (BEV detection only)
 - Complexer-YOLO: Real-Time 3D Object Detection and Tracking on Semantic Point Clouds CVPR 2019 (sensor fusion and tracking)
 - An intriguing failing of convolutional neural networks and the CoordConv solution [Notes] NIPS 2018
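The Kendall et al. multi-task entry above weighs task losses with learned homoscedastic uncertainty, so noisier tasks are down-weighted automatically. A hedged PyTorch sketch using the commonly quoted simplification sum_i exp(-s_i)·L_i + s_i (the paper's exact form carries extra 1/2 factors and differs slightly for classification); the toy losses are placeholders:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Weigh task losses by learned log-variances s_i: sum exp(-s_i) * L_i + s_i."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_i = log sigma_i^2

    def forward(self, task_losses):
        losses = torch.stack(task_losses)
        # exp(-s_i) down-weights noisy tasks; + s_i keeps s_i from growing unboundedly
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()

weighting = UncertaintyWeighting(num_tasks=2)
total = weighting([torch.tensor(1.3), torch.tensor(0.4)])   # two fake task losses
total.backward()                      # gradients reach log_vars (and real task heads)
print(total.item(), weighting.log_vars.grad)
```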
 
- Deep Parametric Continuous Convolutional Neural Networks [Notes] CVPR 2018 (@Uber, sensor fusion)
 - ContFuse: Deep Continuous Fusion for Multi-Sensor 3D Object Detection [Notes] ECCV 2018 [Uber ATG, sensor fusion, BEV]
 - Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net [Notes] CVPR 2018 oral [lidar only, perception and prediction]
 - LearnK: Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras [Notes] ICCV 2019 [monocular depth estimation, intrinsic estimation, SOTA]
 - monodepth: Unsupervised Monocular Depth Estimation with Left-Right Consistency [Notes] CVPR 2017 oral (monocular depth estimation, stereo for training)
 - Struct2depth: Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos [Notes] AAAI 2019 [monocular depth estimation, estimating movement of dynamic object, infinite depth problem, online finetune]
 - Unsupervised Learning of Geometry with Edge-aware Depth-Normal Consistency [Notes] AAAI 2018 (monocular depth estimation, static assumption, surface normal)
 - LEGO: Learning Edge with Geometry all at Once by Watching Videos [Notes] CVPR 2018 spotlight (monocular depth estimation, static assumption, surface normal)
 - Object Detection and 3D Estimation via an FMCW Radar Using a Fully Convolutional Network [Notes] (radar, RD map, OD, Arxiv 201902)
 - A study on Radar Target Detection Based on Deep Neural Networks [Notes] (radar, RD map, OD)
 - 2D Car Detection in Radar Data with PointNets [Notes] (from Ulm Univ, radar, point cloud, OD, Arxiv 201904)
 - Learning Confidence for Out-of-Distribution Detection in Neural Networks [Notes] (budget to cheat)
 - A Deep Learning Approach to Traffic Lights: Detection, Tracking, and Classification [Notes] ICRA 2017 (Bosch, traffic lights)
 - How hard can it be? Estimating the difficulty of visual search in an image [Notes] CVPR 2016
 - Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges [Notes] (review from Bosch)
 - Review of monocular 3d object detection (blog from Zhihu)
 - Deep3dBox: 3D Bounding Box Estimation Using Deep Learning and Geometry [Notes] CVPR 2017 [Zoox, see sketch below]
 - MonoPSR: Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction [Notes] CVPR 2019
 - OFT: Orthographic Feature Transform for Monocular 3D Object Detection [Notes] BMVC 2019 [Convert camera to BEV, Alex Kendall]
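Deep3dBox above (and the multi-bin entries elsewhere in this list) predicts orientation by classifying an angle bin and regressing a residual to the bin center, instead of regressing the raw angle. A NumPy encode/decode sketch of that idea; the bin count here is an arbitrary choice, not the paper's configuration:

```python
import numpy as np

NUM_BINS = 4
BIN_SIZE = 2 * np.pi / NUM_BINS
BIN_CENTERS = np.arange(NUM_BINS) * BIN_SIZE     # 0, pi/2, pi, 3*pi/2

def encode(angle: float):
    """Map an angle in [0, 2*pi) to (nearest bin index, residual to that bin center)."""
    diff = (angle - BIN_CENTERS + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    idx = int(np.argmin(np.abs(diff)))
    return idx, diff[idx]

def decode(idx: int, residual: float) -> float:
    """Invert encode(): bin center plus regressed residual."""
    return float((BIN_CENTERS[idx] + residual) % (2 * np.pi))

theta = 2.0
idx, res = encode(theta)
print(idx, res, decode(idx, res))   # recovers ~2.0
```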
 
- MixMatch: A Holistic Approach to Semi-Supervised Learning [Notes]
 - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Notes] ICML 2019
 - What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? [Notes] NIPS 2017
 - Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding [Notes] BMVC 2017
 - TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents [Notes] AAAI 2019 oral
 - Deep Depth Completion of a Single RGB-D Image [Notes] CVPR 2018 (indoor)
 - DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene from Sparse LiDAR Data and Single Color Image [Notes] CVPR 2019 (outdoor)
 - SfMLearner: Unsupervised Learning of Depth and Ego-Motion from Video [Notes] CVPR 2017
 - Monodepth2: Digging Into Self-Supervised Monocular Depth Estimation [Notes] ICCV 2019 [Niantic]
 - DeepSignals: Predicting Intent of Drivers Through Visual Signals [Notes] ICRA 2019 (@Uber, turn signal detection)
 - FCOS: Fully Convolutional One-Stage Object Detection [Notes] ICCV 2019 [Chunhua Shen]
 - Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving [Notes] ICLR 2020
 - MMF: Multi-Task Multi-Sensor Fusion for 3D Object Detection [Notes] CVPR 2019 (@Uber, sensor fusion)
 
- CenterNet: Objects as points (from ExtremeNet authors) [Notes]
 - CenterNet: Object Detection with Keypoint Triplets [Notes]
 - Object Detection based on Region Decomposition and Assembly [Notes] AAAI 2019
 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Notes] ICLR 2019
 - M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network [Notes] AAAI 2019
 - Deep Radar Detector [Notes] RadarCon 2019
 - Semantic Segmentation on Radar Point Clouds [Notes] (from Daimler AG) FUSION 2018
 - Pruning Filters for Efficient ConvNets [Notes] ICLR 2017
 - Layer-compensated Pruning for Resource-constrained Convolutional Neural Networks [Notes] NIPS 2018 talk
 - LeGR: Filter Pruning via Learned Global Ranking [Notes] CVPR 2020 oral
 - NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection [Notes] CVPR 2019
 - AutoAugment: Learning Augmentation Policies from Data [Notes] CVPR 2019
 - Path Aggregation Network for Instance Segmentation [Notes] CVPR 2018
 - Channel Pruning for Accelerating Very Deep Neural Networks ICCV 2017 (Face++, Yihui He) [Notes]
 - AMC: AutoML for Model Compression and Acceleration on Mobile Devices ECCV 2018 (Song Han, Yihui He)
 - MobileNetV3: Searching for MobileNetV3 [Notes] ICCV 2019
 - MnasNet: Platform-Aware Neural Architecture Search for Mobile [Notes] CVPR 2019
 - Rethinking the Value of Network Pruning ICLR 2019
 
- MobileNetV2: Inverted Residuals and Linear Bottlenecks (MobileNets v2) [Notes] CVPR 2018
 - A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms [Notes] ITSC 2013
 - MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving [Notes]
 - Optimizing the Trade-off between Single-Stage and Two-Stage Object Detectors using Image Difficulty Prediction (Very nice illustration of 1 and 2 stage object detection)
 - Light-Head R-CNN: In Defense of Two-Stage Object Detector [Notes] (from Megvii)
 - CSP: High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection [Notes] CVPR 2019 [center and scale prediction, anchor-free, near SOTA pedestrian]
 - Review of Anchor-free methods (Zhihu blog): "Object Detection: The Anchor-Free Era", "Anchor-free deep learning methods for object detection"; My Slides on CSP
 - DenseBox: Unifying Landmark Localization with End to End Object Detection
 - CornerNet: Detecting Objects as Paired Keypoints [Notes] ECCV 2018
 - ExtremeNet: Bottom-up Object Detection by Grouping Extreme and Center Points [Notes] CVPR 2019
 - FSAF: Feature Selective Anchor-Free Module for Single-Shot Object Detection [Notes] CVPR 2019
 - FoveaBox: Beyond Anchor-based Object Detector (anchor-free) [Notes]
 
- Bag of Freebies for Training Object Detection Neural Networks [Notes]
 - mixup: Beyond Empirical Risk Minimization [Notes] ICLR 2018 (see sketch below)
 - Multi-view Convolutional Neural Networks for 3D Shape Recognition (MVCNN) [Notes] ICCV 2015
 - 3D ShapeNets: A Deep Representation for Volumetric Shapes [Notes] CVPR 2015
 - Volumetric and Multi-View CNNs for Object Classification on 3D Data [Notes] CVPR 2016
 - Group Normalization [Notes] ECCV 2018
 - Spatial Transformer Networks [Notes] NIPS 2015
 - Frustum PointNets for 3D Object Detection from RGB-D Data (F-PointNet) [Notes] CVPR 2018
 - Dynamic Graph CNN for Learning on Point Clouds [Notes]
 - PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud (SOTA for 3D object detection) [Notes] CVPR 2019
 - MV3D: Multi-View 3D Object Detection Network for Autonomous Driving [Notes] CVPR 2017 (Baidu, sensor fusion, BV proposal)
 - AVOD: Joint 3D Proposal Generation and Object Detection from View Aggregation [Notes] IROS 2018 (sensor fusion, multiview proposal)
 - MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Notes]
 - Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving [Notes] CVPR 2019
 - VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection CVPR 2018 (Apple, first end-to-end point cloud encoding to grid)
 - SECOND: Sparsely Embedded Convolutional Detection Sensors 2018 (builds on VoxelNet)
 - PointPillars: Fast Encoders for Object Detection from Point Clouds [Notes] CVPR 2019 (builds on SECOND)
 - Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite [Notes] CVPR 2012
 - Vision meets Robotics: The KITTI Dataset [Notes] IJRR 2013
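The mixup entry above (also one of the Bag of Freebies tricks) trains on convex combinations of sample pairs and their labels. A minimal NumPy sketch; the alpha value is just an example:

```python
import numpy as np

def mixup(images: np.ndarray, labels: np.ndarray, alpha: float = 0.2):
    """Blend each sample (and its one-hot label) with a random partner from the batch."""
    lam = np.random.beta(alpha, alpha)              # mixing coefficient in [0, 1]
    perm = np.random.permutation(len(images))       # random partner for every sample
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]   # soft labels
    return mixed_x, mixed_y

x = np.random.rand(8, 3, 32, 32)                    # toy batch of 8 "images"
y = np.eye(3)[np.random.randint(0, 3, size=8)]      # one-hot labels, 3 classes
mx, my = mixup(x, y)
print(mx.shape, my.sum(axis=1))                     # each soft label still sums to 1
```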
 
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D) [Notes] Video CVPR 2017
 - Initialization Strategies of Spatio-Temporal Convolutional Neural Networks [Notes] Video
 - Detect-and-Track: Efficient Pose Estimation in Videos [Notes] ICCV 2017 Video
 - Deep Learning Based Rib Centerline Extraction and Labeling [Notes] MI MICCAI 2018
 - SlowFast Networks for Video Recognition [Notes] ICCV 2019 Oral
 - Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) [Notes] CVPR 2017
 - Beyond the pixel plane: sensing and learning in 3D (blog, Chinese version)
 - VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition (VoxNet) [Notes]
 - PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation CVPR 2017 [Notes]
 - PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space NIPS 2017 [Notes]
 - Review of Geometric Deep Learning: "Frontiers of Geometric Deep Learning" (from Zhihu) (Up to CVPR 2018)
 
- DQN: Human-level control through deep reinforcement learning (Nature DQN paper) [Notes] DRL
 - Retina U-Net: Embarrassingly Simple Exploitation of Segmentation Supervision for Medical Object Detection [Notes] MI
 - Panoptic Segmentation [Notes] PanSeg
 - Panoptic Feature Pyramid Networks [Notes] PanSeg
 - Attention-guided Unified Network for Panoptic Segmentation [Notes] PanSeg
 - Bag of Tricks for Image Classification with Convolutional Neural Networks [Notes] CLS
 - Deep Reinforcement Learning for Vessel Centerline Tracing in Multi-modality 3D Volumes [Notes] DRL MI
 - Deep Reinforcement Learning for Flappy Bird [Notes] DRL
 - Long-Term Feature Banks for Detailed Video Understanding [Notes] Video
 - Non-local Neural Networks [Notes] Video CVPR 2018
 
- Mask R-CNN
 - Cascade R-CNN: Delving into High Quality Object Detection
 - Focal Loss for Dense Object Detection (RetinaNet) [Notes] (see sketch below)
 - Squeeze-and-Excitation Networks (SENet)
 - Progressive Growing of GANs for Improved Quality, Stability, and Variation
 - Deformable Convolutional Networks ICCV 2017 [build on R-FCN]
 - Learning Region Features for Object Detection
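RetinaNet's focal loss above down-weights well-classified examples by a factor (1 − p_t)^γ so dense detectors are not swamped by easy negatives. A NumPy sketch for binary targets; α = 0.25 and γ = 2 are the commonly quoted defaults, used here only as an example:

```python
import numpy as np

def focal_loss(p: np.ndarray, y: np.ndarray, alpha: float = 0.25, gamma: float = 2.0):
    """Binary focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)             # numerical safety
    p_t = np.where(y == 1, p, 1.0 - p)           # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

print(focal_loss(np.array([0.9, 0.6, 0.1]), np.array([1, 1, 0])))
```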
 
- Learning notes on Deep Learning
 - List of Papers on Machine Learning
 - Notes of Literature Review on CNN in CV (notes for all the papers in the recommended list here)
 - Notes of Literature Review (Others)
 - Notes on how to set up DL/ML environment
 - Useful setup notes
 
Here is the list of papers waiting to be read.
- SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving
 - Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
 - ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness ICML 2019
 - Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet (BagNet) blog ICML 2019
 - A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay
 - Gradient Reversal: Unsupervised Domain Adaptation by Backpropagation ICML 2015
 
- Rethinking Pre-training and Self-training NeurIPS 2020 [Quoc Le]
 
- Mask Scoring R-CNN CVPR 2019
 - Training Region-based Object Detectors with Online Hard Example Mining
 - Gliding vertex on the horizontal bounding box for multi-oriented object detection
 - ONCE: Incremental Few-Shot Object Detection CVPR 2020
 - Domain Adaptive Faster R-CNN for Object Detection in the Wild CVPR 2018
 - Foggy Cityscapes: Semantic Foggy Scene Understanding with Synthetic Data IJCV 2018
 - Foggy Cityscapes ECCV: Model Adaptation with Synthetic and Real Data for Semantic Dense Foggy Scene Understanding ECCV 2018
 - Dropout Sampling for Robust Object Detection in Open-Set Conditions ICRA 2018 (Niko Sünderhauf)
 - Hybrid Task Cascade for Instance Segmentation CVPR 2019 (cascaded mask RCNN)
 - Evaluating Merging Strategies for Sampling-based Uncertainty Techniques in Object Detection ICRA 2019 (Niko Sünderhauf)
 - A Unified Panoptic Segmentation Network CVPR 2019 PanSeg
 - Model Vulnerability to Distributional Shifts over Image Transformation Sets (CVPR workshop) tl:dr
 - Automatic adaptation of object detectors to new domains using self-training CVPR 2019 (find corner case and boost)
 - Missing Labels in Object Detection CVPR 2019
 - DenseBox: Unifying Landmark Localization with End to End Object Detection
 - Circular Object Detection in Polar Coordinates for 2D LIDAR Data CCPR 2016
 - LFFD: A Light and Fast Face Detector for Edge Devices [Lightweight, face detection, car detection]
 - UnitBox: An Advanced Object Detection Network ACM MM 2016 [Ln IoU loss, Thomas Huang]
 
- Learning Spatiotemporal Features with 3D Convolutional Networks (C3D) Video ICCV 2015
 - AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
 - Spatiotemporal Residual Networks for Video Action Recognition (decouple spatiotemporal) NIPS 2016
 - Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks (P3D, decouple spatiotemporal) ICCV 2017
 - A Closer Look at Spatiotemporal Convolutions for Action Recognition (decouple spatiotemporal) CVPR 2018
 - Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification (decouple spatiotemporal) ECCV 2018
 - Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? CVPR 2018
 - AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation ICCV 2019
 - One-Shot Video Object Segmentation CVPR 2017
 - Looking Fast and Slow: Memory-Guided Mobile Video Object Detection CVPR 2018
 - Towards High Performance Video Object Detection [Notes] CVPR 2018
 - Towards High Performance Video Object Detection for Mobiles [Notes]
 - Temporally Distributed Networks for Fast Video Semantic Segmentation CVPR 2020 [efficient video segmentation]
 - Memory Enhanced Global-Local Aggregation for Video Object Detection CVPR 2020 [efficient video object detection]
 - Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation IJCAI 2018 oral [video skeleton]
 - RST-MODNet: Real-time Spatio-temporal Moving Object Detection for Autonomous Driving NeurIPS 2019 workshop
 - Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 oral
 - Temporal Segment Networks: Towards Good Practices for Deep Action Recognition ECCV 2016
 - TRN: Temporal Relational Reasoning in Videos ECCV 2018
 - X3D: Expanding Architectures for Efficient Video Recognition CVPR 2020 oral [FAIR]
 - Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians CVPR 2020 oral [pedestrian, video]
 - 3D human pose estimation in video with temporal convolutions and semi-supervised training CVPR 2019 [mono3D pose estimation from video]
 - OmegaNet: Distilled Semantics for Comprehensive Scene Understanding from Videos CVPR 2020
 - Object Detection in Videos with Tubelet Proposal Networks CVPR 2017 [video object detection]
 - T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos [video object detection]
 - Flow-Guided Feature Aggregation for Video Object Detection ICCV 2017 [Jifeng Dai]
 
- Efficient Deep Learning Inference based on Model Compression (Model Compression)
 - Neural Network Distiller [Intel]
 
- Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks
 - CBAM: Convolutional Block Attention Module
 
- Playing Atari with Deep Reinforcement Learning NIPS 2013
 - Multi-Scale Deep Reinforcement Learning for Real-Time 3D-Landmark Detection in CT Scan
 - An Artificial Agent for Robust Image Registration
 
- 3D-CNN: 3D Convolutional Neural Networks for Landing Zone Detection from LiDAR
 - Generative and Discriminative Voxel Modeling with Convolutional Neural Networks
 - Orientation-boosted Voxel Nets for 3D Object Recognition (ORION) BMVC 2017
 - GIFT: A Real-time and Scalable 3D Shape Search Engine CVPR 2016
 - 3D Shape Segmentation with Projective Convolutional Networks (ShapePFCN) CVPR 2017
 - Learning Local Shape Descriptors from Part Correspondences With Multi-view Convolutional Networks
 - Open3D: A Modern Library for 3D Data Processing
 - Multimodal Deep Learning for Robust RGB-D Object Recognition IROS 2015
 - FlowNet3D: Learning Scene Flow in 3D Point Clouds CVPR 2019
 - Mining Point Cloud Local Structures by Kernel Correlation and Graph Pooling CVPR 2018 (Neighbors Do Help: Deeply Exploiting Local Structures of Point Clouds)
 - PU-Net: Point Cloud Upsampling Network CVPR 2018
 - Recurrent Slice Networks for 3D Segmentation of Point Clouds CVPR 2018
 - SPLATNet: Sparse Lattice Networks for Point Cloud Processing CVPR 2018
 - Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering NIPS 2016
 - Semi-Supervised Classification with Graph Convolutional Networks ICLR 2017
 - Geometric Matrix Completion with Recurrent Multi-Graph Neural Networks NIPS 2017
 - Graph Attention Networks ICLR 2018
 - 3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection (3D SSD)
 - Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models ICCV 2017
 - Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis CVPR 2017
 - IPOD: Intensive Point-based Object Detector for Point Cloud
 - Amodal Detection of 3D Objects: Inferring 3D Bounding Boxes from 2D Ones in RGB-Depth Images CVPR 2017
 - 2D-Driven 3D Object Detection in RGB-D Images
 - Associate-3Ddet: Perceptual-to-Conceptual Association for 3D Point Cloud Object Detection [classify occluded object]
 
- PSMNet: Pyramid Stereo Matching Network CVPR 2018
 - Stereo R-CNN based 3D Object Detection for Autonomous Driving CVPR 2019
 - Deep Rigid Instance Scene Flow CVPR 2019
 - Upgrading Optical Flow to 3D Scene Flow through Optical Expansion CVPR 2020
 - Learning Multi-Object Tracking and Segmentation from Automatic Annotations CVPR 2020 [automatic MOTS annotation]
 
- Traffic-Sign Detection and Classification in the Wild CVPR 2016 [Tsinghua, Tencent, traffic signs]
 - A Hierarchical Deep Architecture and Mini-Batch Selection Method For Joint Traffic Sign and Light Detection IEEE CRV 2018 [UToronto]
 - Detecting Traffic Lights by Single Shot Detection ITSC 2018
 - DeepTLR: A single Deep Convolutional Network for Detection and Classification of Traffic Lights IV 2016
 - Evaluating State-of-the-art Object Detector on Challenging Traffic Light Data CVPR 2017 workshop
 - Traffic light recognition in varying illumination using deep learning and saliency map ITSC 2014 [traffic light]
 - Traffic light recognition using high-definition map features RAS 2019
 - Vision for Looking at Traffic Lights: Issues, Survey, and Perspectives TITS 2015 (traffic light survey, UCSD LISA)
 
- The DriveU Traffic Light Dataset: Introduction and Comparison with Existing Datasets ICRA 2018
 - The Oxford Radar RobotCar Dataset: A Radar Extension to the Oxford RobotCar Dataset
 - Review of Graph Spectrum Theory (WIP)
 - 3D Deep Learning Tutorial at CVPR 2017 [Notes] - (WIP)
 - A Survey on Neural Architecture Search
 - Network pruning tutorial (blog)
 - GNN tutorial at CVPR 2019
 - Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset [Waymo, prediction dataset]
 - PANDA: A Gigapixel-level Human-centric Video Dataset CVPR 2020
 - WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving ICCV 2019 [Valeo]
 
- Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation 3DV 2018
 - Depth Map Prediction from a Single Image using a Multi-Scale Deep Network NIPS 2014 (Eigen et al)
 - Learning Depth from Monocular Videos using Direct Methods CVPR 2018 (monocular depth estimation)
 - Virtual-Normal: Enforcing geometric constraints of virtual normal for depth prediction [Notes] ICCV 2019 (better generation of PL)
 - Spatial Correspondence with Generative Adversarial Network: Learning Depth from Monocular Videos ICCV 2019
 - Unsupervised Collaborative Learning of Keyframe Detection and Visual Odometry Towards Monocular Deep SLAM ICCV 2019
 - Visualization of Convolutional Neural Networks for Monocular Depth Estimation ICCV 2019
 
- Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation ICCV 2019 workshop [indoor]
 - Multi-Loss Rebalancing Algorithm for Monocular Depth Estimation ECCV 2020 [indoor depth]
 - Disambiguating Monocular Depth Estimation with a Single Transient ECCV 2020 [additional laser sensor, indoor depth]
 - Guiding Monocular Depth Estimation Using Depth-Attention Volume ECCV 2020 [indoor depth]
 - Improving Monocular Depth Estimation by Leveraging Structural Awareness and Complementary Datasets ECCV 2020 [indoor depth]
 - CLIFFNet for Monocular Depth Estimation with Hierarchical Embedding Loss ECCV 2020 [indoor depth]
 
- PointSIFT: A SIFT-like Network Module for 3D Point Cloud Semantic Segmentation (pointnet alternative, backbone)
 - Vehicle Detection from 3D Lidar Using Fully Convolutional Network (VeloFCN) RSS 2016
 - KPConv: Flexible and Deformable Convolution for Point Clouds (from the authors of PointNet)
 - PointCNN: Convolution On X-Transformed Points NIPS 2018
 - L3-Net: Towards Learning based LiDAR Localization for Autonomous Driving CVPR 2019
 - RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement (sensor fusion, 3D mono proposal, refined in point cloud)
 - DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map CVPR 2018
 - Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection IROS 2019
 - PointRNN: Point Recurrent Neural Network for Moving Point Cloud Processing
 - Gated2Depth: Real-time Dense Lidar from Gated Images ICCV 2019 oral
 - A Multi-Sensor Fusion System for Moving Object Detection and Tracking in Urban Driving Environments ICRA 2014
 - PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation CVPR 2018 [sensor fusion, Zoox]
 - Deep Hough Voting for 3D Object Detection in Point Clouds ICCV 2019 [Charles Qi]
 - StixelNet: A Deep Convolutional Network for Obstacle Detection and Road Segmentation
 - Depth Sensing Beyond LiDAR Range CVPR 2020 [wide baseline stereo with trifocal]
 - Probabilistic Semantic Mapping for Urban Autonomous Driving Applications IROS 2020 [lidar mapping]
 - RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds CVPR 2020 oral [lidar segmentation]
 - PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation CVPR 2020 [lidar segmentation]
 - OctSqueeze: Octree-Structured Entropy Model for LiDAR Compression CVPR 2020 oral [lidar compression]
 - MuSCLE: Multi Sweep Compression of LiDAR using Deep Entropy Models NeurIPS 2020 oral [lidar compression]
 
- Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty CVPR 2018 [on-board bbox prediction]
 - Unsupervised Traffic Accident Detection in First-Person Videos IROS 2019 (Honda)
 - NEMO: Future Object Localization Using Noisy Ego Priors (Honda)
 - Robust Aleatoric Modeling for Future Vehicle Localization (perspective)
 - Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments WACV 2020 (perspective bbox, pedestrian)
 - Using panoramic videos for multi-person localization and tracking in a 3D panoramic coordinate
 
- End-to-end Lane Detection through Differentiable Least-Squares Fitting ICCV 2019
 - Line-CNN: End-to-End Traffic Line Detection With Line Proposal Unit TITS 2019 [object-like proposals]
 - Detecting Lane and Road Markings at A Distance with Perspective Transformer Layers [3D LLD]
 - Ultra Fast Structure-aware Deep Lane Detection ECCV 2020 [lane detection]
 - A Novel Approach for Detecting Road Based on Two-Stream Fusion Fully Convolutional Network (convert camera to BEV)
 - FastDraw: Addressing the Long Tail of Lane Detection by Adapting a Sequential Prediction Network
 
- RetinaTrack: Online Single Stage Joint Detection and Tracking CVPR 2020
 - Computer Vision for Autonomous Vehicles: Problems, Datasets and State of the Art (latest update in Dec 2019)
 - Simultaneous Identification and Tracking of Multiple People Using Video and IMUs CVPR 2019
 - Detect-and-Track: Efficient Pose Estimation in Videos
 - TrackNet: Simultaneous Object Detection and Tracking and Its Application in Traffic Video Analysis
 - Video Action Transformer Network CVPR 2019 oral
 - Online Real-time Multiple Spatiotemporal Action Localisation and Prediction ICCV 2017
 - Multi-object tracking: a summary of recent papers and open-source code
 - GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning CVPR 2020 oral [3DMOT, CMU, Kris Kitani]
 - Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking ECCV 2020 spotlight [MOT, Tencent]
 - Towards Real-Time Multi-Object Tracking ECCV 2020 [MOT]
 - Probabilistic 3D Multi-Object Tracking for Autonomous Driving [TRI]
 
- Probabilistic Face Embeddings ICCV 2019
 - Data Uncertainty Learning in Face Recognition CVPR 2020
 - Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos CVPR 2020 oral [VGG, self-supervised, interpretable, discriminator]
 
- Revisiting Small Batch Training for Deep Neural Networks
 - ICML 2019 workshop: Adaptive and Multitask Learning: Algorithms & Systems
 - Adaptive Scheduling for Multi-Task Learning NIPS 2018 (NMT)
 - Polar Transformer Networks ICLR 2018
 - Measuring Calibration in Deep Learning CVPR 2019
 - Sampling-free Epistemic Uncertainty Estimation Using Approximated Variance Propagation ICCV 2019 (epistemic uncertainty)
 - Making Convolutional Networks Shift-Invariant Again ICML 2019
 - Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty NeurIPS 2019
 - Understanding deep learning requires rethinking generalization ICLR 2017 [ICLR best paper]
 - A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks ICLR 2017 (NLL score as anomaly score)
 - Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination CVPR 2018 spotlight (Stella Yu)
 - Theoretical insights into the optimization landscape of over-parameterized shallow neural networks TIP 2018
 - The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning ICML 2018
 - Designing Network Design Spaces CVPR 2020
 - Moco2: Improved Baselines with Momentum Contrastive Learning
 - SGD on Neural Networks Learns Functions of Increasing Complexity NIPS 2019 (SGD learns a linear classifier first)
 - Pay attention to the activations: a modular attention mechanism for fine-grained image recognition
 - A Mixed Classification-Regression Framework for 3D Pose Estimation from 2D Images BMVC 2018 (multi-bin, what's new?)
 - In-Place Activated BatchNorm for Memory-Optimized Training of DNNs CVPR 2018 (optimized BatchNorm + ReLU)
 - FCNN: Fourier Convolutional Neural Networks (FFT as CNN)
 - Visualizing the Loss Landscape of Neural Nets NIPS 2018
 - Xception: Deep Learning with Depthwise Separable Convolutions (Xception)
 - Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics (uncertainty)
 - Learning to Drive from Simulation without Real World Labels ICRA 2019 (domain adaptation, sim2real)
 - Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks CVPR 2020 oral
 - Switchable Whitening for Deep Representation Learning ICCV 2019 [domain adaptation]
 - Visual Chirality CVPR 2020 oral [best paper nominee]
 - Generalized ODIN: Detecting Out-of-Distribution Image Without Learning From Out-of-Distribution Data CVPR 2020
 - Self-training with Noisy Student improves ImageNet classification CVPR 2020 [distillation]
 - Keep it Simple: Image Statistics Matching for Domain Adaptation CVPRW 2020 [Domain adaptation for 2D mod bbox]
 - Epipolar Transformers CVPR 2020 [Yihui He]
 - Scalable Uncertainty for Computer Vision With Functional Variational Inference CVPR 2020 [epistemic uncertainty with one fwd pass]
 
- 3DOP: 3D Object Proposals for Accurate Object Class Detection NIPS 2015
 - DirectShape: Photometric Alignment of Shape Priors for Visual Vehicle Pose and Shape Estimation
 - Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360° Panoramic Imagery ECCV 2018 (Monocular 3D object detection and depth estimation)
 - Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation CVPR 2019 [unified conditional decoder]
 - DDP: Dense Depth Posterior from Single Image and Sparse Range CVPR 2019
 - Augmented Reality Meets Computer Vision : Efficient Data Generation for Urban Driving Scenes IJCV 2018 (data augmentation with AR, Toyota)
 - Exploring the Capabilities and Limits of 3D Monocular Object Detection -- A Study on Simulation and Real World Data IITS
 - Towards Scene Understanding with Detailed 3D Object Representations IJCV 2014 (keypoint, 3D bbox annotation)
 - Deep Cuboid Detection: Beyond 2D Bounding Boxes (Magic Leap)
 - Viewpoints and Keypoints (Malik)
 - Lifting Object Detection Datasets into 3D (PASCAL)
 - 3D Object Class Detection in the Wild (keypoint based)
 - Fast Single Shot Detection and Pose Estimation 3DV 2016 (SSD + pose, Wei Liu)
 - Virtual KITTI 2
 - Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing CVPR 2017
 - Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views ICCV 2015 Oral
 - Real-Time Seamless Single Shot 6D Object Pose Prediction CVPR 2018
 - Practical Deep Stereo (PDS): Toward applications-friendly deep stereo matching NIPS 2018 [disparity estimation]
 - Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera ICRA 2019
 - Learning Depth with Convolutional Spatial Propagation Network (Baidu, depth from SPN) ECCV 2018
 - Just Go with the Flow: Self-Supervised Scene Flow Estimation CVPR 2020 oral [Scene flow, Lidar]
 - Self-Supervised Deep Visual Odometry with Online Adaptation CVPR 2020 oral [DF-VO, TrianFlow, meta-learning]
 - Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume CVPR 2020
 - Online Depth Learning against Forgetting in Monocular Videos CVPR 2020 [monodepth, online learning]
 - SDC-Depth: Semantic Divide-and-Conquer Network for Monocular Depth Estimation CVPR 2020 [monodepth, semantic]
 - Inferring Distributions Over Depth from a Single Image TRO [Depth confidence, stitching them together]
 - Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths CVPR 2020
 - The Edge of Depth: Explicit Constraints between Segmentation and Depth CVPR 2020 [Xiaoming Liu, multimodal, depth bleeding]
 
- MV-RSS: Multi-View Radar Semantic Segmentation ICCV 2021
 - Classification of Objects in Polarimetric Radar Images Using CNNs at 77 GHz (Radar, polar)
 - CNNs for Interference Mitigation and Denoising in Automotive Radar Using Real-World Data NeurIPS 2019 (radar)
 - Road Scene Understanding by Occupancy Grid Learning from Sparse Radar Clusters using Semantic Segmentation ICCV 2019 (radar)
 - RadarNet: Exploiting Radar for Robust Perception of Dynamic Objects ECCV 2020 [Uber ATG]
 - Depth Estimation from Monocular Images and Sparse Radar Data IROS 2020 [Camera + Radar for monodepth, nuscenes]
 - RPR: Radar-Camera Sensor Fusion for Joint Object Detection and Distance Estimation in Autonomous Vehicles IROS 2020 [radar proposal refinement]
 - Warping of Radar Data into Camera Image for Cross-Modal Supervision in Automotive Applications
 
- PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization [Notes] ICCV 2015
 - PoseNet2: Modelling Uncertainty in Deep Learning for Camera Relocalization ICRA 2016
 - PoseNet3: Geometric Loss Functions for Camera Pose Regression with Deep Learning CVPR 2017
 - EssNet: Convolutional neural network architecture for geometric matching CVPR 2017
 - NC-EssNet: Neighbourhood Consensus Networks NeurIPS 2018
 - Reinforced Feature Points: Optimizing Feature Detection and Description for a High-Level Task CVPR 2020 oral [Eric Brachmann, ngransac]
 - Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints CVPR 2018
 - DynSLAM: Robust Dense Mapping for Large-Scale Dynamic Environments [dynamic SLAM, Andreas Geiger] ICRA 2018
 - GCNv2: Efficient Correspondence Prediction for Real-Time SLAM LRA 2019 [Superpoint + orb slam]
 - Real-time Scalable Dense Surfel Mapping ICRA 2019 [dense reconstruction, monodepth]
 - Dynamic SLAM: The Need For Speed
 - GSLAM: A General SLAM Framework and Benchmark ICCV 2019
 
- Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler Radar CVPR 2020 [Daimler]
 - Radar+RGB Attentive Fusion for Robust Object Detection in Autonomous Vehicles ICIP 2020
 - Spatial Attention Fusion for Obstacle Detection Using MmWave Radar and Vision Sensor sensors 2020 [radar, camera, early fusion]
 
- A Survey on Deep Learning for Localization and Mapping: Towards the Age of Spatial Machine Intelligence
 - Monocular Depth Estimation Based On Deep Learning: An Overview
 
- Uncertainty Guided Multi-Scale Residual Learning-using a Cycle Spinning CNN for Single Image De-Raining CVPR 2019
 - Learn to Combine Modalities in Multimodal Deep Learning (sensor fusion, general DL)
 - Safe Trajectory Generation For Complex Urban Environments Using Spatio-temporal Semantic Corridor LRA 2019 [Motion planning]
 - DAgger: Driving Policy Transfer via Modularity and Abstraction CoRL 2018 [DAgger, Imitation Learning]
 - Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching ICRA 2020 [Motion planning]
 - Calibration of Heterogeneous Sensor Systems
 - Intro: Sensor Fusion for ADAS, Data Fusion in Autonomous Driving (from Zhihu) (Up to CVPR 2018)
 - YUVMultiNet: Real-time YUV multi-task CNN for autonomous driving CVPR 2019 (Real Time, Low Power)
 - Deep Fusion of Heterogeneous Sensor Modalities for the Advancements of ADAS to Autonomous Vehicles
 - Temporal Coherence for Active Learning in Videos ICCVW 2019 [active learning, temporal coherence]
 - R-TOD: Real-Time Object Detector with Minimized End-to-End Delay for Autonomous Driving RTSS 2020 [perception system design]
 
- Learning Lane Graph Representations for Motion Forecasting ECCV 2020 [Uber ATG]
 - DSDNet: Deep Structured self-Driving Network ECCV 2020 [Uber ATG]
 
- Temporal Coherence for Active Learning in Videos ICCV 2019 workshop
 - Leveraging Pre-Trained 3D Object Detection Models For Fast Ground Truth Generation ITSC 2018 [UToronto, autolabeling]
 - Learning Multi-Object Tracking and Segmentation From Automatic Annotations CVPR 2020 [Autolabeling]
 - Canonical Surface Mapping via Geometric Cycle Consistency ICCV 2019
 - TIDE: A General Toolbox for Identifying Object Detection Errors ECCV 2020 [tools]
 
- Self-Supervised Camera Self-Calibration from Video [TRI, intrinsic calibration, fisheye/pinhole]
 
- A Convolutional Neural Network for Modelling Sentences ACL 2014
 - FastText: Bag of Tricks for Efficient Text Classification ACL 2017
 - Siamese recurrent architectures for learning sentence similarity AAAI 2016
 - Efficient Estimation of Word Representations in Vector Space ICLR 2013
 - Neural Machine Translation by Jointly Learning to Align and Translate ICLR 2015
 - Transformers: Attention Is All You Need NIPS 2017 (see the attention sketch after this list)
 
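The NLP block above closes with the original Transformer paper. Since nearly every later entry in this list builds on it, here is a minimal NumPy sketch of single-head scaled dot-product attention as described in "Attention Is All You Need"; the array shapes and the helper name are illustrative and not taken from any particular codebase.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (batch, n_q, d_v)

# Toy usage: batch of 2 sequences, 4 tokens each, model dimension 8.
Q = np.random.randn(2, 4, 8)
K = np.random.randn(2, 4, 8)
V = np.random.randn(2, 4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4, 8)
```

Multi-head attention simply runs several such heads on learned linear projections of the same inputs and concatenates the results.
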
- A Collection of Papers on Ad Recommendation Systems (in Chinese)
 - UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction [Notes] (dimension reduction, better than t-SNE; see the usage sketch after this list)
 
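The UMAP entry above is a tool as much as a paper, so a short usage sketch may be more useful than a summary. This assumes the third-party `umap-learn` package (import name `umap`) and uses scikit-learn's digits dataset purely for illustration.

```python
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # 1797 samples, 64-dim features

reducer = umap.UMAP(
    n_neighbors=15,   # neighborhood size used to build the k-NN graph
    min_dist=0.1,     # how tightly points may be packed in the embedding
    n_components=2,   # target dimensionality
    random_state=42,
)
embedding = reducer.fit_transform(X)  # shape (1797, 2)
print(embedding.shape)
```

In practice `n_neighbors` trades off local against global structure, while `min_dist` controls how densely the embedding clusters points; this is where UMAP tends to be more controllable (and faster) than t-SNE.
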
- Review Notes of Classical Key Points and Descriptors
 - CRF
 - Visual SLAM and Visual Odometry
 - ORB SLAM
 - Bundle Adjustment (see the reprojection-residual sketch after this list)
 - 3D vision
 - SLAM/VIO Study Notes (in Chinese)
 - Design Patterns
 
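Several of the review-note topics above (visual SLAM, ORB SLAM, bundle adjustment) ultimately minimize reprojection error over camera poses and 3D points. Below is a minimal sketch of the per-observation residual that a bundle adjustment solver stacks and squares; the pinhole model and all variable names are illustrative rather than tied to any specific library.

```python
import numpy as np

def reprojection_residual(point_3d, pose_rt, intrinsics, observation):
    """Residual for one (camera, point) observation in bundle adjustment.
    pose_rt = (R, t) maps world points into the camera frame, intrinsics is
    the 3x3 K matrix, observation is the measured pixel location."""
    R, t = pose_rt
    p_cam = R @ point_3d + t      # world -> camera frame
    p_img = intrinsics @ p_cam    # camera frame -> homogeneous pixel
    uv = p_img[:2] / p_img[2]     # perspective division
    return uv - observation       # 2-vector; BA minimizes its squared norm

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
res = reprojection_residual(np.array([0.1, -0.2, 4.0]),
                            (np.eye(3), np.zeros(3)),
                            K, np.array([330.0, 215.0]))
print(res)  # approx. [2.5, 0.0]
```

A full bundle adjustment sums the squared norm of this residual over all camera-point observations and optimizes poses, points, and optionally intrinsics with a nonlinear least-squares solver such as Levenberg-Marquardt.
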
- Capturing Omni-Range Context for Omnidirectional Segmentation CVPR 2021
 - UP-DETR: Unsupervised Pre-training for Object Detection with Transformers CVPR 2021 [transformers]
 - DCL: Dense Label Encoding for Boundary Discontinuity Free Rotation Detection CVPR 2021
 - 4D Panoptic LiDAR Segmentation CVPR 2021 [TUM]
 - CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the Wild CVPR 2021
 - Fast and Accurate Model Scaling CVPR 2021 [FAIR]
 - Cylinder3D: Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation CVPR 2021 [lidar semantic segmentation]
 - LiDAR R-CNN: An Efficient and Universal 3D Object Detector CVPR 2021 [TuSimple, Lidar]
 - PREDATOR: Registration of 3D Point Clouds with Low Overlap CVPR 2021 oral
 - DBB: Diverse Branch Block: Building a Convolution as an Inception-like Unit CVPR 2021 [RepVGG, ACNet, Xiaohan Ding, Megvii]
 - GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection CVPR 2021 [mono3D]
 - DDMP: Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection CVPR 2021 [mono3D]
 - M3DSSD: Monocular 3D Single Stage Object Detector CVPR 2021 [mono3D]
 - MonoRUn: Monocular 3D Object Detection by Reconstruction and Uncertainty Propagation CVPR 2021 [mono3D]
 - HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection CVPR 2021 [Lidar]
 - PLUME: Efficient 3D Object Detection from Stereo Images [Yan Wang, Uber ATG]
 - V2F-Net: Explicit Decomposition of Occluded Pedestrian Detection [crowded, pedestrian, megvii]
 - IP-basic: In Defense of Classical Image Processing: Fast Depth Completion on the CPU CRV 2018
 - Revisiting Feature Alignment for One-stage Object Detection [cls+reg]
 - Per-frame mAP Prediction for Continuous Performance Monitoring of Object Detection During Deployment WACV 2021 [SafetyNet]
 - TSD: Revisiting the Sibling Head in Object Detector CVPR 2020 [sensetime, cls+reg]
 - 1st Place Solutions for OpenImage2019 -- Object Detection and Instance Segmentation [sensetime, cls+reg, 1st place OpenImage2019]
 - Enabling spatio-temporal aggregation in Birds-Eye-View Vehicle Estimation ICRA 2021
 - End-to-end Lane Detection through Differentiable Least-Squares Fitting ICCV workshop 2019
 - Revisiting ResNets: Improved Training and Scaling Strategies
 - LD: Localization Distillation for Object Detection
 - PolyTransform: Deep Polygon Transformer for Instance Segmentation CVPR 2020 [single stage instance segmentation]
 - ROAD: The ROad event Awareness Dataset for Autonomous Driving
 - LidarMTL: A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding [lidar MTL]
 - High-Performance Large-Scale Image Recognition Without Normalization ICLR 2021
 - Ground-aware Monocular 3D Object Detection for Autonomous Driving RA-L [mono3D]
 - Demystifying Pseudo-LiDAR for Monocular 3D Object Detection [mono3d]
 - Pseudo-labeling for Scalable 3D Object Detection [Waymo]
 - LLA: Loss-aware Label Assignment for Dense Pedestrian Detection [Megvii]
 - VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation CVPR 2020 [Waymo]
 - CoverNet: Multimodal Behavior Prediction using Trajectory Sets CVPR 2020 [prediction, nuScenes]
 - SplitNet: Divide and Co-training
 - VoVNet: An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection CVPR 2019 workshop
 - Isometric Neural Networks: Non-discriminative data or weak model? On the relative importance of data and model resolution ICCV 2019 workshop [spatial2channel]
 - TResNet WACV 2021 [spatial2channel]
 - Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression AAAI 2020 [DIOU, NMS] (see the loss sketch after this list)
 - RegNet: Designing Network Design Spaces CVPR 2020 [FAIR]
 - On Network Design Spaces for Visual Recognition [FAIR]
 - Lane Endpoint Detection and Position Accuracy Evaluation for Sensor Fusion-Based Vehicle Localization on Highways Sensors 2018 [lane endpoints]
 - Map-Matching-Based Cascade Landmark Detection and Vehicle Localization IEEE Access 2019 [lane endpoints]
 - GCNet: End-to-End Learning of Geometry and Context for Deep Stereo Regression ICCV 2017 [disparity estimation, Alex Kendall, cost volume]
 - Traffic Control Gesture Recognition for Autonomous Vehicles IROS 2020 [Daimler]
 - Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild ECCV 2020
 - OrcVIO: Object residual constrained Visual-Inertial Odometry [dynamic SLAM, very mathematical]
 - InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling ECCV 2020
 - DA4AD: End-to-End Deep Attention-based Visual Localization for Autonomous Driving ECCV 2020
 - Towards Lightweight Lane Detection by Optimizing Spatial Embedding ECCV 2020 workshop [LLD]
 - Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection ECCV 2020 workshop [lidar]
 - DeepIM: Deep iterative matching for 6d pose estimation ECCV 2018 [pose estimation]
 - Monocular Depth Prediction through Continuous 3D Loss IROS 2020
 - Multi-Task Learning for Dense Prediction Tasks: A Survey [MTL, Luc Van Gool]
 - Dynamic Task Weighting Methods for Multi-task Networks in Autonomous Driving Systems ITSC 2020 oral [MTL]
 - NeurAll: Towards a Unified Model for Visual Perception in Automated Driving ITSC 2019 oral [MTL]
 - Deep Evidential Regression NeurIPS 2020 [one-pass aleatoric/epistemic uncertainty]
 - Estimating Drivable Collision-Free Space from Monocular Video WACV 2015 [Drivable space]
 - Visualization of Convolutional Neural Networks for Monocular Depth Estimation ICCV 2019 [monodepth]
 - Differentiable Rendering: A Survey [differentiable rendering, TRI]
 - SAFENet: Self-Supervised Monocular Depth Estimation with Semantic-Aware Feature Extraction [monodepth, semantics, Naver labs]
 - Toward Interactive Self-Annotation For Video Object Bounding Box: Recurrent Self-Learning And Hierarchical Annotation Based Framework WACV 2020
 - Towards Good Practice for CNN-Based Monocular Depth Estimation WACV 2020
 - Self-Supervised Scene De-occlusion CVPR 2020 oral
 - TP-LSD: Tri-Points Based Line Segment Detector
 - Data Distillation: Towards Omni-Supervised Learning CVPR 2018 [Kaiming He, FAIR]
 - MiDaS: Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer [monodepth, dynamic object, synthetic dataset]
 - Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation [monodepth]
 - Synthetic-to-Real Domain Adaptation for Lane Detection [GM Israel, LLD]
 - PolyLaneNet: Lane Estimation via Deep Polynomial Regression ICPR 2020 [polynomial, LLD]
 - Learning Universal Shape Dictionary for Realtime Instance Segmentation
 - End-to-End Video Instance Segmentation with Transformers [DETR, transformers]
 - Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks CVPR 2020 workshop
 - When and Why Test-Time Augmentation Works
 - Footprints and Free Space from a Single Color Image CVPR 2020 oral [Parking use, footprint]
 - Driving among Flatmobiles: Bird-Eye-View occupancy grids from a monocular camera for holistic trajectory planning [BEV, only predict footprint]
 - Rethinking Classification and Localization for Object Detection CVPR 2020
 - Monocular 3D Object Detection with Sequential Feature Association and Depth Hint Augmentation [mono3D]
 - Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
 - ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation
 - MVSNet: Depth Inference for Unstructured Multi-view Stereo ECCV 2018
 - Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference CVPR 2019 [Deep learning + MVS, Vidar, same author MVSNet]
 - Artificial Dummies for Urban Dataset Augmentation AAAI 2021
 - DETR for Pedestrian Detection [transformer, pedestrian detection]
 - Multi-Modality Cut and Paste for 3D Object Detection [SenseTime]
 - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [transformer, semantic segmentation]
 - TransPose: Towards Explainable Human Pose Estimation by Transformer [transformer, pose estimation]
 - Seesaw Loss for Long-Tailed Instance Segmentation
 - SWA Object Detection [Stochastic Weights Averaging (SWA)]
 - 3D Object Detection with Pointformer
 - Toward Transformer-Based Object Detection [DETR-like]
 - Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion [dense SfM]
 - Vision Global Localization with Semantic Segmentation and Interest Feature Points
 - Transformer Interpretability Beyond Attention Visualization [transformers]
 - Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU
 - DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution
 - Empirical Upper Bound in Object Detection and More
 - Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline [Fisheye, Senthil Yogamani]
 - SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular images [Jiwen Lu, monodepth]
 - Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion [TRI]
 - Linformer: Self-Attention with Linear Complexity
 - Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks ICML 2019
 - PCT: Point cloud transformer Computational Visual Media 2021
 - DDT: Unsupervised Object Discovery and Co-Localization by Deep Descriptor Transforming IJCAI 2017
 - Hierarchical Road Topology Learning for Urban Map-less Driving [Mercedes]
 - Probabilistic Future Prediction for Video Scene Understanding ECCV 2020 [Alex Kendall]
 - Detecting 32 Pedestrian Attributes for Autonomous Vehicles [VRU, MTL]
 - Cascaded deep monocular 3D human pose estimation with evolutionary training data CVPR 2020 oral
 - MonoGeo: Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [mono3D]
 - Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth [mono3D]
 - Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting [mono3D]
 - Lite-FPN for Keypoint-based Monocular 3D Object Detection [mono3D]
 - Lidar Point Cloud Guided Monocular 3D Object Detection
 - Vision Transformers for Dense Prediction [Vladlen Koltun, Intel]
 - Efficient Transformers: A Survey
 - Do Vision Transformers See Like Convolutional Neural Networks?
 - Progressive Coordinate Transforms for Monocular 3D Object Detection [mono3D]
 - AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection ICCV 2021 [mono3D]
 - BlazePose: On-device Real-time Body Pose tracking
 
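Among the detection entries in the block above, the Distance-IoU (DIoU) loss is compact enough to spell out: it adds a penalty on the normalized distance between box centers to the usual IoU term, so the gradient stays informative even when predicted and ground-truth boxes do not overlap. The sketch below computes the loss for a single box pair; the (x1, y1, x2, y2) box convention and the helper name are my own, not from the paper.

```python
def diou_loss(box_a, box_b, eps=1e-9):
    """L_DIoU = 1 - IoU + rho^2(centers) / c^2, where c is the diagonal
    length of the smallest box enclosing both inputs (xyxy format)."""
    # Intersection and IoU.
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter + eps)
    # Squared distance between box centers.
    dx = (box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2
    dy = (box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2
    center_dist2 = dx * dx + dy * dy
    # Squared diagonal of the smallest enclosing box.
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    diag2 = cw * cw + ch * ch + eps
    return 1.0 - iou + center_dist2 / diag2

print(diou_loss([0, 0, 2, 2], [1, 1, 3, 3]))  # partially overlapping boxes
```

For disjoint boxes the IoU term is zero but the center-distance term still supplies a gradient, which is the main advantage over a plain IoU loss and also the basis of DIoU-NMS.
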
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [Andy Zeng]
 - Large Language Models as General Pattern Machines [Embodied AI]
 - RetinaGAN: An Object-aware Approach to Sim-to-Real Transfer
 - PlaNet: Learning Latent Dynamics for Planning from Pixels ICML 2019
 - Dreamer: Dream to Control: Learning Behaviors by Latent Imagination ICLR 2020 oral
 - DreamerV2: Mastering Atari with Discrete World Models ICLR 2021 [World models]
 - DreamerV3: Mastering Diverse Domains through World Models
 - DayDreamer: World Models for Physical Robot Learning CoRL 2022
 - JEPA: A Path Towards Autonomous Machine Intelligence
 - I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture CVPR 2023
 - Runway Gen-1: Structure and Content-Guided Video Synthesis with Diffusion Models
 - IL Difficulty Model: Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula CoRL 2022 [Waymo]
 - Decision Transformer: Reinforcement Learning via Sequence Modeling NeurIPS 2021 [LLM for planning]
 - LID: Pre-Trained Language Models for Interactive Decision-Making NeurIPS 2022 [LLM for planning]
 - Planning with Large Language Models via Corrective Re-prompting NeurIPS 2022 Workshop
 - Object as Query: Equipping Any 2D Object Detector with 3D Detection Ability ICCV 2023 [TuSimple]
 - Speculative Sampling: Accelerating Large Language Model Decoding with Speculative Sampling [Accelerated LLM, DeepMind]
 - Inference with Reference: Lossless Acceleration of Large Language Models [Accelerated LLM, Microsoft]
 - EPSILON: An Efficient Planning System for Automated Vehicles in Highly Interactive Environments T-RO 2021
 - Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching ICRA 2020
 - StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection
 - SSCNet: Semantic Scene Completion from a Single Depth Image CVPR 2017
 - SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences ICCV 2019
 - PixPro: Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning [self-supervised]
 - Pixel-Wise Contrastive Distillation [self-supervised]
 - VICRegL: Self-Supervised Learning of Local Visual Features NeurIPS 2022
 - ImageBind: One Embedding Space To Bind Them All CVPR 2023
 - KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction ICRA 2022 [Planning]
 - Deep Interactive Motion Prediction and Planning: Playing Games with Motion Prediction Models L4DC 2022 [Planning]
 - GameFormer: Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction and Planning for Autonomous Driving [Planning]
 - LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving [Planning, Raquel]
 - DIPP: Differentiable Integrated Motion Prediction and Planning with Learnable Cost Function for Autonomous Driving [Planning]
 - Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios [Planning, Waymo]
 - Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving IROS 2022 [Planning, Waymo]
 - Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation ICRA 2022 [Planning, Waymo]
 - JFP: Joint Future Prediction with Interactive Multi-Agent Modeling for Autonomous Driving [Planning, Waymo]
 - MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation NeurIPS 2021
 - 3D Semantic Scene Completion: a Survey IJCV 2022
 - DETIC: Detecting Twenty-thousand Classes using Image-level Supervision ECCV 2022
 - Atlas: End-to-End 3D Scene Reconstruction from Posed Images ECCV 2020
 - TransformerFusion: Monocular RGB Scene Reconstruction using Transformers NeurIPS 2021
 - SimpleOccupancy: A Simple Attempt for 3D Occupancy Estimation in Autonomous Driving [Occupancy Network]
 - OccDepth: A Depth-Aware Method for 3D Semantic Scene Completion [Occupancy Network, stereo]
 - Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception NeurIPS 2022
 - Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline
 - ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-Informed Proposals CVPR 2023 [Qcraft, prediction]
 - MTR: Motion Transformer with Global Intention Localization and Local Movement Refinement NeurIPS 2022 Oral
 - P4P: Conflict-Aware Motion Prediction for Planning in Autonomous Driving
 - MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction
 - ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries
 - SAM: Segment Anything [FAIR]
 - GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
 - Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge [Encode Road requirement to prediction]
 - Transformer Feed-Forward Layers Are Key-Value Memories EMNLP 2021
 - BEV-LaneDet: a Simple and Effective 3D Lane Detection Baseline CVPR 2023 [BEVNet]
 - Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception [BEVNet, megvii]
 - VAD: Vectorized Scene Representation for Efficient Autonomous Driving [Horizon]
 - BEVPoolv2: A Cutting-edge Implementation of BEVDet Toward Deployment [BEVDet, PhiGent]
 - NVRadarNet: Real-Time Radar Obstacle and Free Space Detection for Autonomous Driving
 - GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping CVPR 2020 [Cewu Lu]
 - AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains [Cewu Lu]
 - Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
 - HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding
 - UVTR: Unifying Voxel-based Representation with Transformer for 3D Object Detection [BEVFusion, Megvii, BEVNet, camera + lidar]
 - Don't Use Large Mini-Batches, Use Local SGD ICLR 2020
 - Grokking: Generalization beyond Overfitting on small algorithmic datasets
 - Progress measures for grokking via mechanistic interpretability
 - Understanding deep learning requires rethinking generalization ICLR 2017
 - Unifying Grokking and Double Descent
 - Interactive Prediction and Planning for Autonomous Driving: from Algorithms to Fundamental Aspects [PhD thesis of Wei Zhan, 2019]
 - Lyft1001: One Thousand and One Hours: Self-driving Motion Prediction Dataset [Lyft Level 5, prediction dataset]
 - PCAccumulation: Dynamic 3D Scene Analysis by Point Cloud Accumulation ECCV 2022
 - UniSim: A Neural Closed-Loop Sensor Simulator CVPR 2023 [simulation, Raquel]
 - GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving CVPR 2023
 - Accelerating Reinforcement Learning for Autonomous Driving using Task-Agnostic and Ego-Centric Motion Skills [Driving Skill]
 - Efficient Reinforcement Learning for Autonomous Driving with Parameterized Skills and Priors RSS 2023 [Driving Skill]
 - Neural Map Prior for Autonomous Driving CVPR 2023
 - Track Anything: Segment Anything Meets Videos
 - Self-Supervised Camera Self-Calibration from Video ICRA 2022 [TRI, calibration]
 - Real-time Online Video Detection with Temporal Smoothing Transformers ECCV 2022 [ConvLSTM-style cross-attention]
 - NeRF-Supervised Deep Stereo CVPR 2023
 - GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images NeurIPS 2022
 - OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation CVPR 2023
 - Ego-Body Pose Estimation via Ego-Head Pose Estimation CVPR 2023
 - PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
 - Visual Instruction Tuning
 - VideoChat: Chat-Centric Video Understanding
 - CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers CoRL 2022
 - BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision [BEVNet, Jifeng Dai]
 - Traj++: Human Trajectory Forecasting in Crowds: A Deep Learning Perspective TITS 2021
 - Data Driven Prediction Architecture for Autonomous Driving and its Application on Apollo Platform IV 2020 [Baidu]
 - THOMAS: Trajectory Heatmap Output with learned Multi-Agent Sampling ICLR 2022
 - Learning Lane Graph Representations for Motion Forecasting ECCV 2020 oral
 - Identifying Driver Interactions via Conditional Behavior Prediction ICRA 2021 [Waymo]
 - Trajectron++: Dynamically-Feasible Trajectory Forecasting With Heterogeneous Data ECCV 2020
 - TPNet: Trajectory Proposal Network for Motion Prediction CVPR 2020
 - GOHOME: Graph-Oriented Heatmap Output for future Motion Estimation
 - PECNet: It Is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction ECCV 2020 oral
 - From Goals, Waypoints & Paths To Long Term Human Trajectory Forecasting ICCV 2021
 - PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings ICCV 2019
 - PiP: Planning-informed Trajectory Prediction for Autonomous Driving ECCV 2020
 - MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction CoRL 2019
 - LaPred: Lane-Aware Prediction of Multi-Modal Future Trajectories of Dynamic Agents CVPR 2021
 - PRIME: Learning to Predict Vehicle Trajectories with Model-based Planning CoRL 2021
 - A Flexible and Explainable Vehicle Motion Prediction and Inference Framework Combining Semi-Supervised AOG and ST-LSTM TITS 2020
 - Multi-Modal Trajectory Prediction of Surrounding Vehicles with Maneuver based LSTMs IV 2018 [Trivedi]
 - HYPER: Learned Hybrid Trajectory Prediction via Factored Inference and Adaptive Sampling ICRA 2022
 - Trajectory Prediction with Linguistic Representations ICRA 2022
 - What-If Motion Prediction for Autonomous Driving
 - End-to-end Contextual Perception and Prediction with Interaction Transformer IROS 2020 [Auxiliary collision loss, scene compliant pred]
 - SafeCritic: Collision-Aware Trajectory Prediction BMVC 2019 [IRL, scene compliant pred]
 - Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset ICCV 2021 [Waymo]
 - Interaction-Based Trajectory Prediction Over a Hybrid Traffic Graph IROS 2020
 - Joint Interaction and Trajectory Prediction for Autonomous Driving using Graph Neural Networks NeurIPS 2019 workshop
 - Fast Risk Assessment for Autonomous Vehicles Using Learned Models of Agent Futures RSS 2020
 - Monocular 3D Object Detection: An Extrinsic Parameter Free Approach CVPR 2021 [PJLab]
 - UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View [BEVFormer, BEVNet, Temporal]
 - WBF: Weighted Boxes Fusion: Ensembling boxes from different object detection models
 - NNI: Neural Network Intelligence, AutoML toolkit for automatic hyperparameter tuning [Microsoft]
 - BEVFormer++: Improving BEVFormer for 3D Camera-only Object Detection [Waymo open dataset challenge 1st place in mono3d]
 - LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection [Waymo open dataset challenge official metric]
 - High-Level Interpretation of Urban Road Maps Fusing Deep Learning-Based Pixelwise Scene Segmentation and Digital Navigation Maps Journal of Advanced Transportation 2018
 - A Hybrid Vision-Map Method for Urban Road Detection Journal of Advanced Transportation 2017
 - Terminology and Analysis of Map Deviations in Urban Domains: Towards Dependability for HD Maps in Automated Vehicles IV 2020
 - Time Will Tell: New Outlooks and a Baseline for Temporal Multi-View 3D Object Detection
 - Conditional DETR for Fast Training Convergence ICCV 2021
 - DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR ICLR 2022
 - DN-DETR: Accelerate DETR Training by Introducing Query DeNoising CVPR 2022
 - DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
 - Trajectory Forecasting from Detection with Uncertainty-Aware Motion Encoding [Ouyang Wanli]
 - Vision-based Uneven BEV Representation Learning with Polar Rasterization and Surface Estimation [BEVNet, polar]
 - MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries [BEVNet, tracking] CVPR 2022 workshop [Hang Zhao]
 - ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning ECCV 2022 [Hongyang Li]
 - GKT: Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer [BEVNet, Horizon]
 - SiamRPN: High Performance Visual Tracking with Siamese Region Proposal Network CVPR 2018
 - TPLR: Topology Preserving Local Road Network Estimation from Single Onboard Camera Image CVPR 2022 [STSU, Luc Van Gool]
 - LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation [Valeo, BEVNet, polar]
 - PolarDETR: Polar Parametrization for Vision-based Surround-View 3D Detection [BEVNet]
 - Exploring Geometric Consistency for Monocular 3D Object Detection CVPR 2022
 - ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection WACV 2022 [mono3D]
 - Learning to Predict 3D Lane Shape and Camera Pose from a Single Image via Geometry Constraints AAAI 2022
 - Detecting Lane and Road Markings at A Distance with Perspective Transformer Layers ICICN 2021 [BEVNet, lane line]
 - Unsupervised Labeled Lane Markers Using Maps ICCV 2019 workshop [Bosch, 2D lane line]
 - M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [Lidar detection, Waymo open dataset] WACV 2022
 - K-Lane: Lidar Lane Dataset and Benchmark for Urban Roads and Highways [lane line dataset]
 - Robust Monocular 3D Lane Detection With Dual Attention ICIP 2021
 - OcclusionFusion: Occlusion-aware Motion Estimation for Real-time Dynamic 3D Reconstruction CVPR 2022
 - MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer ICLR 2022 [lightweight Transformers]
 - XFormer: Lightweight Vision Transformer with Cross Feature Attention [Samsung]
 - CenterFormer: Center-based Transformer for 3D Object Detection ECCV 2022 oral [TuSimple]
 - LidarMultiNet: Towards a Unified Multi-task Network for LiDAR Perception [2022 Waymo Open Dataset, TuSimple]
 - MTRA: 1st Place Solution for 2022 Waymo Open Dataset Challenge - Motion Prediction [Waymo open dataset challenge 1st place in motion prediction]
 - BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs [BEVNet]
 - Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers CVPR 2022 [nVidia]
 - Efficiently Identifying Task Groupings for Multi-Task Learning NeurIPS 2021 spotlight [MTL]
 - Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [Google, Golden Backbone]
 - "The Pedestrian next to the Lamppost" Adaptive Object Graphs for Better Instantaneous Mapping CVPR 2022
 - GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [BEVNet, Baidu]
 - FUTR3D: A Unified Sensor Fusion Framework for 3D Detection [Hang Zhao]
 - MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers [monodepth]
 - Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving
 - cosFormer: Rethinking Softmax in Attention ICLR 2022
 - StretchBEV: Stretching Future Instance Prediction Spatially and Temporally [BEVNet, prediction]
 - Scene Representation in Bird’s-Eye View from Surrounding Cameras with Transformers [BEVNet, LLD] CVPR 2022 workshop
 - Multi-Frame Self-Supervised Depth with Transformers CVPR 2022
 - It's About Time: Analog Clock Reading in the Wild CVPR 2022 [Andrew Zisserman]
 - SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation CoRL 2022 [Jiwen Lu]
 - ONCE-3DLanes: Building Monocular 3D Lane Detection CVPR 2022
 - K-Lane: Lidar Lane Dataset and Benchmark for Urban Roads and Highways CVPR 2022 workshop [3D LLD]
 - Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving CVPR 2022 workshop
 - A Simple Baseline for BEV Perception Without LiDAR [TRI, BEVNet, vision+radar]
 - Reconstruct from Top View: A 3D Lane Detection Approach based on Geometry Structure Prior CVPR 2022 workshop
 - RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding CVPR 2022 [Waymo, Charles Qi]
 - Occupancy Flow Fields for Motion Forecasting in Autonomous Driving RAL 2022 [Waymo occupancy flow challenge]
 - Safe Local Motion Planning with Self-Supervised Freespace Forecasting CVPR 2021
 - The Core of the Data Closed Loop: Auto-labeling Solutions (in Chinese)
 - LETR: Line Segment Detection Using Transformers without Edges CVPR 2021 oral
 - HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps CVPR 2021 [HD mapping]
 - SketchRNN: A Neural Representation of Sketch Drawings [David Ha]
 - PolyGen: An Autoregressive Generative Model of 3D Meshes ICML 2020
 - SOLQ: Segmenting Objects by Learning Queries NeurIPS 2021 [Megvii, end-to-end, instance segmentation]
 - MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer 3DV 2022
 - MVSTER: Epipolar Transformer for Efficient Multi-View Stereo ECCV 2022
 - MOVEDepth: Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning [MVS + monodepth]
 - Scene Transformer: A unified architecture for predicting multiple agent trajectories [prediction, Waymo] ICLR 2022
 - SSIA: Monocular Depth Estimation with Self-supervised Instance Adaptation [VGG team, TTR, test time refinement, CVD]
 - CoMoDA: Continuous Monocular Depth Adaptation Using Past Experiences WACV 2021
 - MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera CVPR 2021 [Daniel Cremers]
 - Plenoxels: Radiance Fields without Neural Networks
 - Lidar with Velocity: Motion Distortion Correction of Point Clouds from Oscillating Scanning Lidars [Livox, ISEE]
 - NWD: A Normalized Gaussian Wasserstein Distance for Tiny Object Detection
 - Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation NeurIPS 2021 [Sanja Fidler]
 - Insta-DM: Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency AAAI 2021
 - Instance-wise Depth and Motion Learning from Monocular Videos NeurIPS 2020 workshop [website]
 - NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis ECCV 2020 oral
 - BARF: Bundle-Adjusting Neural Radiance Fields ICCV 2021 oral
 - NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo ICCV 2021 oral
 - YOLinO: Generic Single Shot Polyline Detection in Real Time ICCV 2021 workshop [lld]
 - MonoRCNN: Geometry-based Distance Decomposition for Monocular 3D Object Detection ICCV 2021
 - MonoCInIS: Camera Independent Monocular 3D Object Detection using Instance Segmentation ICCV 2021 workshop
 - PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection CVPR 2020 [Waymo challenge 2nd place]
 - Offboard 3D Object Detection from Point Cloud Sequences CVPR 2021 [Charles Qi]
 - FreeAnchor: Learning to Match Anchors for Visual Object Detection NeurIPS 2019
 - AutoAssign: Differentiable Label Assignment for Dense Object Detection
 - Probabilistic Anchor Assignment with IoU Prediction for Object Detection ECCV 2020
 - FOVEA: Foveated Image Magnification for Autonomous Navigation ICCV 2021 [Argo]
 - PifPaf: Composite Fields for Human Pose Estimation CVPR 2019
 - Monocular 3D Localization of Vehicles in Road Scenes ICCV 2021 workshop [mono3D, tracking]
 - Anchor DETR: Query Design for Transformer-Based Detector [megvii]
 - PGD: Probabilistic and Geometric Depth: Detecting Objects in Perspective CoRL 2021
 - Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression
 - What Makes for End-to-End Object Detection? ICML 2021
 - Instances as Queries ICCV 2021 [instance segmentation]
 - One Million Scenes for Autonomous Driving: ONCE Dataset [Huawei]
 - NVS-MonoDepth: Improving Monocular Depth Prediction with Novel View Synthesis 3DV 2021
 - Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?
 - Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [Small LLM prompting, Microsoft]
 - CoT: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models NeurIPS 2022
 - ToT: Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Notes] NeurIPS 2023 Oral
 - Cumulative Reasoning with Large Language Models
 - A Survey of Techniques for Maximizing LLM Performance [OpenAI]
 - Drive AGI
 - Harnessing the Power of Multi-Modal LLMs for Autonomy [Ghost Autonomy]
 - Language to Rewards for Robotic Skill Synthesis
 - ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
 - LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [UM]
 - LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [Sergey Levine]
 - A Survey of Embodied AI: From Simulators to Research Tasks IEEE TETCI 2021
 - Habitat Challenge 2021
 - Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
 - DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment [Jianyu Chen]
 - The Power of Scale for Parameter-Efficient Prompt Tuning EMNLP 2021
 - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents ICML 2022
 - ProgPrompt: Generating Situated Robot Task Plans using Large Language Models ICRA 2023
 - Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation CoRL 2022
 - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale NeurIPS 2022 [LLM Quant]
 - AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Song Han, LLM Quant]
 - RoFormer: Enhanced Transformer with Rotary Position Embedding (see the RoPE sketch at the end of this list)
 - CoDi: Any-to-Any Generation via Composable Diffusion NeurIPS 2023
 - What if a Vacuum Robot has an Arm? UR 2023
 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
 - GPT in 60 Lines of NumPy
 - Speeding up the GPT - KV cache
 - LLM Parameter Counting
 - Transformer Inference Arithmetic
 - ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation NeurIPS 2021 [Junnan Li]
 - CLIP: Learning Transferable Visual Models From Natural Language Supervision ICLR 2021 [OpenAI]
 - BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ICML 2022 [Junnan Li]
 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [Junnan Li]
 - MOO: Open-World Object Manipulation using Pre-trained Vision-Language Models [Google Robotics, end-to-end visuomotor]
 - VC-1: Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
 - CLIPort: What and Where Pathways for Robotic Manipulation CoRL 2021 [Nvidia, end-to-end visuomotor]
 - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers ICLR 2023
 - SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ICML 2023 [Song Han, LLM Quant]
 - SAPIEN: A SimulAted Part-based Interactive ENvironment CVPR 2020
 - FiLM: Visual Reasoning with a General Conditioning Layer AAAI 2018
 - TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? NeurIPS 2021
 - QLoRA: Efficient Finetuning of Quantized LLMs
 - OVO: Open-Vocabulary Occupancy
 - Code Llama: Open Foundation Models for Code
 - Chinchilla: Training Compute-Optimal Large Language Models [DeepMind]
 - GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
 - RH20T: A Robotic Dataset for Learning Diverse Skills in One-Shot
 - VIMA: General Robot Manipulation with Multimodal Prompts
 - An Attention Free Transformer [Apple]
 - PDDL Planning with Pretrained Large Language Models [MIT, Leslie Kaelbling]
 - Task and Motion Planning with Large Language Models for Object Rearrangement IROS 2023
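Several entries toward the end of this list (RoFormer and the LLM inference notes) rely on rotary position embeddings. As a minimal illustration, the sketch below rotates each consecutive channel pair of a query or key matrix by an angle proportional to the token position; it follows the RoFormer formulation, but the array layout and function name are my own.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embedding (RoPE) to a (seq_len, dim) array.
    Consecutive channel pairs are rotated by position-dependent angles;
    dim must be even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = np.outer(positions, freqs)         # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = np.random.randn(6, 8)                       # 6 tokens, head dim 8
q_rot = rotary_embed(q, positions=np.arange(6))
print(q_rot.shape)  # (6, 8)
```

Because queries and keys are rotated by position-dependent angles before the dot product, the resulting attention score depends only on the relative offset between token positions, giving a relative-position bias without any extra learned parameters.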