Hugging Face TB Research

Enterprise

community

AI & ML interests

Exploring smol models and high quality web and synthetic datasets, generated by LLMs (TB is for Textbook, as inspired by the "Textbooks are all your need" paper)

Recent Activity

loubnabnl new activity 4 days ago

HuggingFaceTB/SmolLM2-135M-Instruct:update model max length

loubnabnl new activity 4 days ago

HuggingFaceTB/SmolLM2-360M-Instruct:update model max length

loubnabnl new activity 4 days ago

HuggingFaceTB/SmolLM2-1.7B-Instruct:Update model max length

View all activity

HuggingFaceTB's activity

merve

posted an update 1 day ago

Post

1335

What a beginning to this year in open ML 🤠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook — 22k hours worth of samples from instruction videos 🤯
> Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!

LLMs 💬
> Microsoft released Phi-4, sota open-source 14B language model 🔥
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Owen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview 📕
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs 📕
> Dataset: Qwen team is on the roll, they just released CodeElo, a dataset of code preferences 👩🏻‍💻

Embeddings 🔖
> @MoritzLaurer released zero-shot version of ModernBERT large 👏
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding

davanstrien

posted an update 1 day ago

Post

1030

The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data ( HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is on way of improving the quality of the data you use for training an LLM. Very importantly this approach can also reduce the amount of data needed for pertaining.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder only model to label the full dataset. However, this may not work well for languages outside of english. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and huerisitcs for refining fineweb2 further for different languages.

This week the following languages where done:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c

1 reply

cfahlgren1

posted an update 1 day ago

Post

965

Wow, I just added Langfuse tracing to the Deepseek Artifacts app and it's really nice 🔥

It allows me to visualize and track more things along with the cfahlgren1/react-code-instructions dataset.

It was just added as a one click Docker Space template, so it's super easy to self host 💪

merve

posted an update 2 days ago

Post

1505

ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗 ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos ⏯️

> The models come in 1B, 4B and 8B and are based on InternVL2.5 for base architecture and Qwen2, Qwen2.5 and InternLM2 for language model part (depending on the checkpoint)

> The model is very interesting, it has different encoders for different modalities each (visual prompt, text prompt, image and video) then it concatenates these to feed into LLM 💬

the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ⤵️

> Their annotation pipeline is also interesting, they seems to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.

1 reply

ngxson

posted an update 3 days ago

Post

1997

I made this small tool that can be useful for debugging Ollama chat template: ngxson/ollama_template_test

CC @bartowski you may need this ;-)

2 replies

loubnabnl

in HuggingFaceTB/SmolLM2-135M-Instruct 4 days ago

update model max length

#7 opened 4 days ago by

andito

loubnabnl

in HuggingFaceTB/SmolLM2-360M-Instruct 4 days ago

update model max length

#8 opened 4 days ago by

andito

loubnabnl

in HuggingFaceTB/SmolLM2-1.7B-Instruct 4 days ago

Update model max length

#21 opened 4 days ago by

andito

anton-l

updated a dataset 4 days ago

HuggingFaceTB/finemath_contamination_report

Viewer • Updated 4 days ago • 5.33k • 24 • 1

andito

updated a model 4 days ago

HuggingFaceTB/SmolLM2-360M-Instruct

Text Generation • Updated 4 days ago • 1.07M • • 68

andito

in HuggingFaceTB/SmolLM2-360M-Instruct 4 days ago

update model max length

#8 opened 4 days ago by

andito

updated a model 4 days ago

HuggingFaceTB/SmolLM2-135M-Instruct

Text Generation • Updated 4 days ago • 37.7k • 85

andito

in HuggingFaceTB/SmolLM2-135M-Instruct 4 days ago

update model max length

#7 opened 4 days ago by

andito

updated a model 4 days ago

HuggingFaceTB/SmolLM2-1.7B-Instruct

Text Generation • Updated 4 days ago • 76.6k • 465

andito

in HuggingFaceTB/SmolLM2-1.7B-Instruct 4 days ago

Update model max length

#21 opened 4 days ago by

andito

loubnabnl

updated a collection 5 days ago

📐 FineMath

Collection

FineMath datasets and ablation models • 14 items • Updated 5 days ago • 17

loubnabnl

updated a model 5 days ago

HuggingFaceTB/FineMath-Llama-3B

Updated 5 days ago • 126 • 12

lewtun

posted an update 6 days ago

Post

3160

I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!

https://x.com/casper_hansen_/status/1875872309996855343

Together with the recent PRIME method [2] for scaling RL, reasoning for open models is looking pretty exciting for 2025!

[1] Training Large Language Models to Reason in a Continuous Latent Space (2412.06769)
[2] https://huggingface.co/blog/ganqu/prime

Xenova

in HuggingFaceTB/finemath 7 days ago

Arrgh Spam

#30 opened 7 days ago by

ZiggyS

cfahlgren1

posted an update 8 days ago

Post

1929

You'll notice the AI in the SQL Console is much better at working with chatml conversations:

Here's example of unnesting the cfahlgren1/react-code-instructions in less than 10 seconds by asking it. Check it out here: cfahlgren1/react-code-instructions

- "show me the average assistant response length"
- "extract user, system, and assistant messages into separate columns"

It's super easy to work with conversational datasets now with natural language 🗣️

AI & ML interests

Recent Activity

Team members 34

HuggingFaceTB's activity

update model max length

update model max length

Update model max length

update model max length

update model max length

Update model max length

Arrgh Spam