huggingface-projects (Huggingface Projects)

What a beginning to this year in open ML 🤠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook — 22k hours worth of samples from instruction videos 🤯
> Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!

LLMs 💬
> Microsoft released Phi-4, sota open-source 14B language model 🔥
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Owen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview 📕
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs 📕
> Dataset: Qwen team is on the roll, they just released CodeElo, a dataset of code preferences 👩🏻‍💻

Embeddings 🔖
> @MoritzLaurer released zero-shot version of ModernBERT large 👏
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding

AdinaY

posted an update 1 day ago

Post

422

LLaVA-Mini🔥 A efficient multimodal model for image and video understanding released by Chinese Academy of Sciences
Model: ICTNLP/llava-mini-llama-3.1-8b
Paper: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (2501.03895)
✨ Matches LLaVA-v1.5 using just 1 vision token
✨ Delivers <40ms response time
✨ Reduces vision tokens while maintaining strong visual understanding

ThomasSimonini

updated a dataset 1 day ago

huggingface-projects/drlc-leaderboard-data

Viewer • Updated 1 day ago • 36.6k • 6.28k • 2

merve

posted an update 2 days ago

Post

1511

ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗 ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos ⏯️

> The models come in 1B, 4B and 8B and are based on InternVL2.5 for base architecture and Qwen2, Qwen2.5 and InternLM2 for language model part (depending on the checkpoint)

> The model is very interesting, it has different encoders for different modalities each (visual prompt, text prompt, image and video) then it concatenates these to feed into LLM 💬

the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ⤵️

> Their annotation pipeline is also interesting, they seems to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.

1 reply

·

AdinaY

posted an update 5 days ago

Post

2235

Excited to see Alibaba DAMO Academy release a multimodel dataset for vision language pretraining on the hub🔥

Paper: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (2501.00958)
Dataset: DAMO-NLP-SG/multimodal_textbook

✨ 6.5M images + 0.8B text from 22k hours of instructional videos
✨ Covers subjects like math, physics, and chemistry
✨ Apache 2.0

hysts

in huggingface-projects/llama-3.2-3B-Instruct 8 days ago

Fix

4

#6 opened 9 days ago by

hysts

Xenova

posted an update 10 days ago

Post

5528

First project of 2025: Vision Transformer Explorer

I built a web app to interactively explore the self-attention maps produced by ViTs. This explains what the model is focusing on when making predictions, and provides insights into its inner workings! 🤯

Try it out yourself! 👇
webml-community/attention-visualization

Source code: https://github.com/huggingface/transformers.js-examples/tree/main/attention-visualization

merve

posted an update 11 days ago

Post

4714

supercharge your LLM apps with smolagents 🔥

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog for you to get started https://huggingface.co/blog/smolagents

AdinaY

posted an update 16 days ago

Post

3578

The Chinese community is shipping 🚢

DeepSeek V3 (685 B MoE) has quietly released on the hub!
Base: deepseek-ai/DeepSeek-V3-Base
Instruct: deepseek-ai/DeepSeek-V3

Can’t wait to see what’s next!

1 reply

·

merve

posted an update 18 days ago

Post

4627

QwQ can see 🔥
Qwen team released QvQ, a large vision LM with reasoning 😱

it outperforms proprietary VLMs on several benchmarks, comes with open weights and a demo!
Check them out ⬇️
Demo Qwen/QVQ-72B-preview
Model Qwen/QVQ-72B-Preview
Read more https://qwenlm.github.io/blog/qvq-72b-preview/
Congratulations @JustinLin610 and team!

2 replies

·

AdinaY

posted an update 18 days ago

Post

2995

QvQ-72B-Preview🎄 an open weight model for visual reasoning just released by Alibaba_Qwen team
Qwen/qvq-676448c820912236342b9888
✨ Combines visual understanding & language reasoning.
✨ Scores 70.3 on MMMU
✨ Outperforms Qwen2-VL-72B-Instruct in complex problem-solving

sayakpaul

posted an update 19 days ago

Post

3918

Commits speak louder than words 🤪

* 4 new video models
* Multiple image models, including SANA & Flux Control
* New quantizers -> GGUF & TorchAO
* New training scripts

Enjoy this holiday-special Diffusers release 🤗
Notes: https://github.com/huggingface/diffusers/releases/tag/v0.32.0

akhaliq

posted an update 23 days ago

Post

4595

Google drops Gemini 2.0 Flash Thinking

a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more

now available in anychat, try it out: akhaliq/anychat

Huggingface Projects

AI & ML interests

Recent Activity

huggingface-projects's activity

Llama 3.2 3B Instruct

Gemma 2 2B JPN IT

Gemma 2 9B IT

Gemma 2 2B IT

Llama 2 13b Chat

Llama 2 7B Chat

huggingface-projects/Deep-RL-Course-Certification

huggingface-projects/drlc-leaderboard-data

Fix

AI & ML interests

Recent Activity

Team members 24

huggingface-projects's activity

Llama 3.2 3B Instruct

Gemma 2 2B JPN IT

Gemma 2 9B IT

Gemma 2 2B IT

Llama 2 13b Chat

Llama 2 7B Chat

Fix