Abstract
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device, memory-constrained use cases. Evaluation on a broad range of benchmarks shows that our most capable model, Gemini Ultra, advances the state of the art in 30 of the 32 benchmarks, notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases, and we discuss our approach toward deploying them responsibly to users.
Community
942 Authors!
Not me. You listed the wrong 'Yonghui Wu' in the author list.
Hi! Thanks for letting us know. The authorship has been removed from your account.
Absolute chad for not pretending to be a Google Author.
The 'Timothy Chung' in the author list is wrongly associated with my account as well
Hi! Thanks for the feedback. Authorship removed! :)
Just keep it.
Jokes aside thanks for your honesty.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise (2023)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023)
- GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models (2023)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion (2023)
- GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
Wrong 'Yi Luan' as the author, not me. This is the second time, should I change my name? LOL
Become such a high profile researcher that you defame the old Yi Luan and become the real omega Yi Luan.
Next time it happens just pretend to be the author and contact press with this as proof 🤣
Models citing this paper: 142
Datasets citing this paper: 0