Zesen Cheng

ClownRat

AI & ML interests

Multi-modal foundation models; segmentation, detection, and tracking

ClownRat's activity

upvoted 3 papers about 23 hours ago

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Paper • 2412.18925 • Published 17 days ago • 89

On the Compositional Generalization of Multimodal LLMs for Medical Imaging

Paper • 2412.20070 • Published 14 days ago • 43

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Paper • 2412.20750 • Published 12 days ago • 19

upvoted a paper 5 days ago

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Paper • 2412.18525 • Published 18 days ago • 65

upvoted 2 papers 6 days ago

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Paper • 2501.00599 • Published 11 days ago • 40

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published 10 days ago • 91

upvoted 11 papers about 1 month ago

Towards Universal Soccer Video Understanding

Paper • 2412.01820 • Published Dec 2, 2024 • 9

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 17

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Paper • 2412.04467 • Published Dec 5, 2024 • 105

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Paper • 2412.03565 • Published Dec 4, 2024 • 11

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Paper • 2412.03069 • Published Dec 4, 2024 • 30

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Paper • 2412.03248 • Published Dec 4, 2024 • 26

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Paper • 2412.02611 • Published Dec 3, 2024 • 23

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Paper • 2411.19943 • Published Nov 29, 2024 • 56

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26, 2024 • 78

Large Language Model-Brained GUI Agents: A Survey

Paper • 2411.18279 • Published Nov 27, 2024 • 29

3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes

Paper • 2411.14974 • Published Nov 22, 2024 • 17

upvoted 2 papers about 2 months ago

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Paper • 2411.08380 • Published Nov 13, 2024 • 25

Large Language Models Can Self-Improve in Long-context Reasoning

Paper • 2411.08147 • Published Nov 12, 2024 • 63

upvoted a paper 3 months ago

Enhancing Training Efficiency Using Packing with Flash Attention

Paper • 2407.09105 • Published Jul 12, 2024 • 14