MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published May 30, 2024 • 20
Spectrally Pruned Gaussian Fields with Neural Compensation Paper • 2405.00676 • Published May 1, 2024 • 8
Paint by Inpaint: Learning to Add Image Objects by Removing Them First Paper • 2404.18212 • Published Apr 28, 2024 • 27
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report Paper • 2405.00732 • Published Apr 29, 2024 • 119
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding Paper • 2405.08344 • Published May 14, 2024 • 12
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper • 2405.11473 • Published May 19, 2024 • 54
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots Paper • 2406.02523 • Published Jun 4, 2024 • 10
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis Paper • 2406.06216 • Published Jun 10, 2024 • 19
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning Paper • 2406.06469 • Published Jun 10, 2024 • 24
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 59
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17, 2024 • 22
Octo-planner: On-device Language Model for Planner-Action Agents Paper • 2406.18082 • Published Jun 26, 2024 • 48
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published Jun 13, 2024 • 50
OpenVLA: An Open-Source Vision-Language-Action Model Paper • 2406.09246 • Published Jun 13, 2024 • 36
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 13, 2024 • 19
MotionClone: Training-Free Motion Cloning for Controllable Video Generation Paper • 2406.05338 • Published Jun 8, 2024 • 39
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11, 2024 • 32
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion Paper • 2406.04338 • Published Jun 6, 2024 • 34
The Prompt Report: A Systematic Survey of Prompting Techniques Paper • 2406.06608 • Published Jun 6, 2024 • 58
An Image is Worth 32 Tokens for Reconstruction and Generation Paper • 2406.07550 • Published Jun 11, 2024 • 57
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models Paper • 2406.07472 • Published Jun 11, 2024 • 12
Mixture-of-Agents Enhances Large Language Model Capabilities Paper • 2406.04692 • Published Jun 7, 2024 • 55
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration Paper • 2406.01014 • Published Jun 3, 2024 • 31
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models Paper • 2406.02430 • Published Jun 4, 2024 • 31
Agentless: Demystifying LLM-based Software Engineering Agents Paper • 2407.01489 • Published Jul 1, 2024 • 42
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2, 2024 • 21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3, 2024 • 93
TokenPacker: Efficient Visual Projector for Multimodal LLM Paper • 2407.02392 • Published Jul 2, 2024 • 21
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation Paper • 2407.02869 • Published Jul 3, 2024 • 18
Learning to (Learn at Test Time): RNNs with Expressive Hidden States Paper • 2407.04620 • Published Jul 5, 2024 • 27
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence Paper • 2407.07061 • Published Jul 9, 2024 • 27
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models Paper • 2407.06938 • Published Jul 9, 2024 • 23
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS Paper • 2406.18009 • Published Jun 26, 2024 • 20
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning Paper • 2406.19741 • Published Jun 28, 2024 • 59
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person Paper • 2407.16224 • Published Jul 23, 2024 • 27
Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On Paper • 2406.10539 • Published Jun 15, 2024 • 1
Cross Anything: General Quadruped Robot Navigation through Complex Terrains Paper • 2407.16412 • Published Jul 23, 2024 • 6
A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data Paper • 2407.16680 • Published Jul 23, 2024 • 12
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation Paper • 2407.14931 • Published Jul 20, 2024 • 21
EVLM: An Efficient Vision-Language Model for Visual Understanding Paper • 2407.14177 • Published Jul 19, 2024 • 43
The Vision of Autonomic Computing: Can LLMs Make It a Reality? Paper • 2407.14402 • Published Jul 19, 2024 • 14
Internal Consistency and Self-Feedback in Large Language Models: A Survey Paper • 2407.14507 • Published Jul 19, 2024 • 46
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle Paper • 2407.13833 • Published Jul 18, 2024 • 12
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views Paper • 2408.10195 • Published Aug 19, 2024 • 12
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices Paper • 2408.10161 • Published Aug 19, 2024 • 13
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning Paper • 2408.07931 • Published Aug 15, 2024 • 20
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 124
gsplat: An Open-Source Library for Gaussian Splatting Paper • 2409.06765 • Published Sep 10, 2024 • 15
LLaMA-Omni: Seamless Speech Interaction with Large Language Models Paper • 2409.06666 • Published Sep 10, 2024 • 56
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers Paper • 2409.04196 • Published Sep 6, 2024 • 14
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming Paper • 2408.16725 • Published Aug 29, 2024 • 53
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners Paper • 2408.16768 • Published Aug 29, 2024 • 26
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches Paper • 2408.04567 • Published Aug 8, 2024 • 25
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey Paper • 2409.11564 • Published Sep 17, 2024 • 20
The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends Paper • 2409.14195 • Published Sep 21, 2024 • 12
Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction Paper • 2409.18121 • Published Sep 26, 2024 • 8
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image Paper • 2409.17280 • Published Sep 25, 2024 • 10
Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study Paper • 2409.17580 • Published Sep 26, 2024 • 8
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks Paper • 2410.01744 • Published Oct 2, 2024 • 26
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices Paper • 2410.00531 • Published Oct 1, 2024 • 30
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper • 2410.02073 • Published Oct 2, 2024 • 40
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published Oct 16, 2024 • 31
Revealing the Barriers of Language Agents in Planning Paper • 2410.12409 • Published Oct 16, 2024 • 25
What Matters in Transformers? Not All Attention is Needed Paper • 2406.15786 • Published Jun 22, 2024 • 30
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation Paper • 2410.09704 • Published Oct 13, 2024 • 12
Agent S: An Open Agentic Framework that Uses Computers Like a Human Paper • 2410.08164 • Published Oct 10, 2024 • 24
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale Paper • 2409.16299 • Published Sep 24, 2024 • 11
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes Paper • 2410.18084 • Published Oct 23, 2024 • 13
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding Paper • 2410.13924 • Published Oct 17, 2024 • 6
LLM-based Optimization of Compound AI Systems: A Survey Paper • 2410.16392 • Published Oct 21, 2024 • 14
Improve Vision Language Model Chain-of-thought Reasoning Paper • 2410.16198 • Published Oct 21, 2024 • 22
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities Paper • 2410.11190 • Published Oct 15, 2024 • 21
Unbounded: A Generative Infinite Game of Character Life Simulation Paper • 2410.18975 • Published Oct 24, 2024 • 35
WAFFLE: Multi-Modal Model for Automated Front-End Development Paper • 2410.18362 • Published Oct 24, 2024 • 11
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant Paper • 2410.18603 • Published Oct 24, 2024 • 32
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective Paper • 2410.23743 • Published Oct 31, 2024 • 59
AutoTrain: No-code training for state-of-the-art models Paper • 2410.15735 • Published Oct 21, 2024 • 59
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors Paper • 2410.16271 • Published Oct 21, 2024 • 81
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought Paper • 2501.04682 • Published Jan 8, 2025 • 63
LLM4SR: A Survey on Large Language Models for Scientific Research Paper • 2501.04306 • Published Jan 8, 2025 • 26
Agent Laboratory: Using LLM Agents as Research Assistants Paper • 2501.04227 • Published Jan 8, 2025 • 58
Cosmos World Foundation Model Platform for Physical AI Paper • 2501.03575 • Published Jan 7, 2025 • 52
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published Jan 3, 2025 • 32