kotol

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

gv-hf's activity

merve posted an update 1 day ago
What a beginning to this year in open ML!
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released a multimodal textbook with 22k hours' worth of samples from instructional videos
> Dataset: the SciCap benchmark dataset for captioning figures in scientific documents is released along with its challenge!

LLMs
> Microsoft released Phi-4, a state-of-the-art open-source 14B language model
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of instruction and code pairs for React
> Dataset: the Qwen team is on a roll, they just released CodeElo, a dataset of code preferences

Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT Large
> KaLM is a new family of performant multilingual embedding models with an MIT license, built on Qwen2-0.5B

Image/Video Generation
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and text
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized version of OpenVid-1M samples

Others
> Prior Labs released TabPFNv2, the best tabular transformer for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
merve posted an update 2 days ago
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2, with MIT license. ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models can handle vision-language understanding and visual referral (referring segmentation) tasks for both images and videos

> The models come in 1B, 4B and 8B sizes and use InternVL2.5 as the base architecture, with Qwen2, Qwen2.5 or InternLM2 as the language model part (depending on the checkpoint)

> The model design is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates these to feed into the LLM

The output segmentation tokens are passed to SAM2 to match text (captions or semantic classes) to masks

> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and use different levels of description to provide consistency.
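To make that dataflow concrete, here's a minimal PyTorch sketch of the architecture as described above: one encoder per modality, concatenation into the LLM, and a segmentation-token embedding handed to a SAM2-style mask decoder. Every module, dimension and name here is a simplified placeholder of mine, not the actual Sa2VA code.

```python
import torch
import torch.nn as nn

class Sa2VASketch(nn.Module):
    """Toy stand-in for the dataflow described above (not the real model)."""

    def __init__(self, dim=256):
        super().__init__()
        # One lightweight encoder per modality (placeholders for the real encoders).
        self.image_encoder = nn.Linear(768, dim)
        self.video_encoder = nn.Linear(768, dim)
        self.text_encoder = nn.Embedding(32000, dim)
        self.visual_prompt_encoder = nn.Linear(4, dim)   # e.g. box/point prompts
        # Placeholder for the LLM backbone (InternVL/Qwen/InternLM in the real model).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Projects the segmentation-token state into the space a SAM2-like decoder expects.
        self.seg_proj = nn.Linear(dim, dim)

    def forward(self, image_feats, video_feats, text_ids, visual_prompts):
        # Concatenate all modalities into one token sequence for the LLM.
        tokens = torch.cat(
            [
                self.image_encoder(image_feats),
                self.video_encoder(video_feats),
                self.visual_prompt_encoder(visual_prompts),
                self.text_encoder(text_ids),
            ],
            dim=1,
        )
        hidden = self.llm(tokens)
        # Use the last position as a stand-in for the segmentation token; in Sa2VA
        # this embedding would condition SAM2 to produce the mask.
        return self.seg_proj(hidden[:, -1])

model = Sa2VASketch()
seg = model(
    image_feats=torch.randn(1, 196, 768),
    video_feats=torch.randn(1, 8 * 196, 768),
    text_ids=torch.randint(0, 32000, (1, 16)),
    visual_prompts=torch.randn(1, 2, 4),
)
print(seg.shape)  # torch.Size([1, 256])
```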
Xenova posted an update 10 days ago
First project of 2025: Vision Transformer Explorer

I built a web app to interactively explore the self-attention maps produced by ViTs. This explains what the model is focusing on when making predictions, and provides insights into its inner workings!

Try it out yourself!
webml-community/attention-visualization

Source code: https://github.com/huggingface/transformers.js-examples/tree/main/attention-visualization
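For a Python-side version of the same idea (my own sketch with the transformers library, not the app's source code), you can ask any ViT checkpoint for its attention weights and look at how the CLS token attends to image patches:

```python
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Any standard ViT classifier works the same way.
model_id = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    # output_attentions=True returns one tensor per layer:
    # shape (batch, num_heads, seq_len, seq_len), seq_len = 1 CLS token + 196 patches.
    outputs = model(**inputs, output_attentions=True)

last_layer = outputs.attentions[-1]                      # (1, 12, 197, 197)
cls_to_patches = last_layer[0, :, 0, 1:]                 # CLS -> patch attention, per head
attention_map = cls_to_patches.mean(0).reshape(14, 14)   # average heads -> 14x14 grid
print(model.config.id2label[outputs.logits.argmax(-1).item()])
print(attention_map.shape)  # torch.Size([14, 14])
```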
merve posted an update 11 days ago
supercharge your LLM apps with smolagents

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog for you to get started https://huggingface.co/blog/smolagents
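As a minimal sketch of what that looks like (adapted from the smolagents quickstart; class names such as HfApiModel reflect the initial release and may have changed since):

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# A CodeAgent writes and executes Python code to solve the task,
# calling the tools it was given (here, web search) along the way.
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel(),  # defaults to a hosted model via the HF Inference API
)

result = agent.run(
    "How many seconds would it take for a leopard at full speed to run through Pont des Arts?"
)
print(result)
```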
merve posted an update 18 days ago

Xenova posted an update 24 days ago
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
- Faster and more accurate than Whisper
- Privacy-focused (no data leaves your device)
- WebGPU accelerated (with WASM fallback)
- Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web
merve posted an update 24 days ago
Aya by Cohere For AI can now see!

The C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B. It works in 8 languages!

The authors extend the LLaVA dataset to 558k examples using Aya's translation capabilities!
Try it here: kkr5155/maya_demo

Dataset: maya-multimodal/pretrain

Model: maya-multimodal/maya
Kudos to @nahidalam and team
merve posted an update 25 days ago
Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B outperforms most 30B models

> the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 sizes with Apache 2.0 license, based on Qwen1.5 & Qwen2
> the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes video LMs work

Try the demo for the best setup here: https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions made on small models also hold when the model and dataset are scaled up, while scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal (see the sketch below)
> They also compare image encoders, trying a range of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful
> They also compare freezing different parts of the model; training all stages with some parts frozen gives the best results

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models
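To illustrate the frame-sampling comparison above, here is a toy sketch of mine (not the authors' code) contrasting uniform sampling with FPS sampling: uniform sampling picks a fixed number of frames regardless of clip length, whereas FPS sampling keeps the temporal rate constant so longer clips get more frames.

```python
import numpy as np

def uniform_sample(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick a fixed number of evenly spaced frame indices, regardless of clip length."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def fps_sample(num_frames: int, video_fps: float, target_fps: float) -> np.ndarray:
    """Keep a constant temporal rate: longer clips yield proportionally more frames."""
    step = video_fps / target_fps
    return np.arange(0, num_frames, step).round().astype(int)

# A 10 s clip vs a 60 s clip, both recorded at 30 fps.
for seconds in (10, 60):
    n = seconds * 30
    print(seconds, "s ->",
          "uniform:", len(uniform_sample(n, 32)),
          "| fps@2:", len(fps_sample(n, 30, 2)))
# uniform always returns 32 frames; FPS sampling returns 20 vs 120 frames.
```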
merve posted an update about 1 month ago
A complete RAG pipeline includes a reranker, which ranks the retrieved documents to find the best ones
The same goes for multimodal RAG: there are multimodal rerankers we can integrate into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as the retriever, lightonai/MonoQwen2-VL-v0.1 as the reranker and Qwen/Qwen2-VL-7B-Instruct as the VLM, in this notebook that runs on a GPU as small as an L4: https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
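As a rough sketch of the retrieval stage of such a pipeline (my own example using the colpali-engine package, not the cookbook code; the reranking and generation stages are only indicated in comments):

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

# 1) Retrieval: embed document page images and the query with ColQwen2,
#    then score them with multi-vector (late-interaction) similarity.
retriever = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0", torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

pages = [Image.new("RGB", (448, 448), "white") for _ in range(4)]  # placeholder page images
query = "What does the quarterly revenue chart show?"

with torch.no_grad():
    page_embeddings = retriever(**processor.process_images(pages).to(retriever.device))
    query_embeddings = retriever(**processor.process_queries([query]).to(retriever.device))

scores = processor.score_multi_vector(query_embeddings, page_embeddings)  # (1, num_pages)
top_pages = scores[0].argsort(descending=True)[:3].tolist()

# 2) Reranking: lightonai/MonoQwen2-VL-v0.1 would re-score these top pages against the
#    query and reorder them (see the cookbook notebook for the exact call).
# 3) Generation: Qwen/Qwen2-VL-7B-Instruct answers the query conditioned on the best
#    page image(s), e.g. via its chat template and generate().
print("top pages before reranking:", top_pages)
```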
Xenova posted an update about 1 month ago
Introducing TTS WebGPU: the first ever text-to-speech web app built with WebGPU acceleration! High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)
merve posted an update about 1 month ago
This week in open-source AI was insane! A small recap: merve/dec-6-releases-67545caebe9fc4776faac0a3

Multimodal
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes (3B, 10B and 28B) and with pre-trained and captioning variants
> OpenGVLab released InternVL2.5, seven new vision LMs in different sizes, with a state-of-the-art checkpoint under MIT license
> The Qwen team at Alibaba released the base models of Qwen2VL in 2B, 7B and 72B sizes

LLMs
> Meta released a new iteration of Llama 70B, Llama 3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU covering 42 languages, with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! Nearly 8TB of pretraining data in many languages!

Image/Video Generation
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux

Audio
> Indic-Parler-TTS is a new text-to-speech model made by the community
merve posted an update about 1 month ago
New InternVL drop with a state-of-the-art 78B vision language model with MIT license: https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B paired with Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (1.8B, 7B, 20B)
The 78B model pairs InternViT 6B with Qwen2.5-72B Instruct and can accomplish a variety of tasks. Try it here: OpenGVLab/InternVL
ariG23498 posted an update about 1 month ago

Update README.md
#3 opened about 1 month ago by ariG23498
