David Berenstein

davidberenstein1957

AI & ML interests

Everything data

Recent Activity

updated a dataset about 21 hours ago

davidberenstein1957/my-distiset-df9db7bc

liked a Space about 21 hours ago

argilla/synthetic-data-generator

updated a Space about 21 hours ago

argilla/synthetic-data-generator

Articles

Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK

Nov 21, 2024

• 35

How to build a custom text classifier without days of human labeling

Oct 17, 2024

• 55

How to optimize your data labelling project with custom interfaces

Oct 16, 2024

• 18

To what extent are we responsible for our content and how to create safer Spaces?

Aug 30, 2024

• 3

Data Is Better Together: A Look Back and Forward

Jun 20, 2024

• 19

Organizations

davidberenstein1957's activity

replied to davanstrien's post 1 day ago

Open collaboration is key for democratising AI.

reacted to davanstrien's post with 🤝❤️🚀 1 day ago

Post

1030

The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pretraining.
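As a rough illustration of what filtering by educational quality could look like, here is a minimal sketch in plain Python. The label scale, the `quality` field name, and the `filter_by_quality` helper are all assumptions made for this example; they are not the actual fineweb-c schema or any library API.

```python
# Hypothetical label scale, ordered from lowest to highest educational value.
# The real dataset may use different label names.
QUALITY_ORDER = ["none", "minimal", "basic", "good", "excellent"]


def filter_by_quality(records, min_label="basic", label_key="quality"):
    """Keep only records whose label is at or above `min_label`."""
    threshold = QUALITY_ORDER.index(min_label)
    return [
        record
        for record in records
        if QUALITY_ORDER.index(record[label_key]) >= threshold
    ]


# Toy documents with made-up labels, for illustration only.
docs = [
    {"text": "Buy cheap watches now!", "quality": "none"},
    {"text": "Photosynthesis converts light into chemical energy.", "quality": "good"},
    {"text": "A short note on fractions.", "quality": "basic"},
]

kept = filter_by_quality(docs, min_label="basic")
print(f"kept {len(kept)} of {len(docs)} documents")
```

Raising `min_label` trades dataset size for quality, which is the lever behind needing less data for pretraining.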

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.
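For the first point, one simple way to check whether an LLM labels a language well is to compare its labels with the community annotations on the same texts. The sketch below uses only the standard library and computes plain accuracy plus Cohen's kappa, which corrects for agreement expected by chance. The label values and example lists are made up for illustration and do not come from the dataset.

```python
from collections import Counter


def accuracy(human, model):
    """Fraction of items where the two label lists agree."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohen_kappa(human, model):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(human)
    observed = accuracy(human, model)
    human_counts = Counter(human)
    model_counts = Counter(model)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both lists use a single identical label, so agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for four texts: community annotations vs. LLM output.
human_labels = ["good", "none", "basic", "good"]
llm_labels = ["good", "none", "good", "good"]

print(f"accuracy: {accuracy(human_labels, llm_labels):.2f}")
print(f"kappa:    {cohen_kappa(human_labels, llm_labels):.2f}")
```

A kappa near zero means the LLM agrees with annotators no more than chance would, which suggests it should not be trusted to label that language without further work.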

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c

1 reply

posted an update 7 days ago

Post

1894

Fine-tune a SmolLM on domain-specific synthetic data from a LLM

Blog: https://huggingface.co/blog/davidberenstein1957/fine-tune-a-smollm-on-synthetic-data-of-llm

1 reply

posted an update 12 days ago

Post

1962

Fine-tuning ModernBERT for text classification using synthetic data generation

From prompt to model in 3 steps:
- 1 dataset description
- 20 minutes of generating
- 60 minutes of fine-tuning on my MacBook Pro

Tutorial: https://nbsanity.com/static/552eb50cbd91bedb4e5b73fddca2664a/fine-tune-modernbert-classifier.html

posted an update 23 days ago

Post

1353

🐇 Tumble down the AI rabbit hole without any technical knowledge!

Explore AI models on the Hub by a simple and quick search

Demo: davidberenstein1957/transformers-pipeline-playground

reacted to their post with 🔥 24 days ago

Post

4186

Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: A simple step-by-step process, making dataset creation a non-technical breeze, allowing anyone to create datasets and models in minutes and without any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator

4 replies

replied to their post 24 days ago

Feedback is welcome :)

replied to their post 24 days ago

thanks! Hope you can create some cool and useful datasets with it!

reacted to jwlben11's post with 🤗 25 days ago

Post

2143

What is the use of Hugging Face? How can I get up to speed on ML and AI, and how do I use this platform? It would be nice if there was a "get started here" section.

1 reply

reacted to their post with 🤯🧠❤️👀 26 days ago

posted an update 26 days ago

reacted to julien-c's post with 👀🚀😎🤝 about 1 month ago

Post

8204

After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team

28 replies

Articles

Fine-tune a SmolLM on domain-specific synthetic data from a LLM

Fine-tune ModernBERT for text classification using synthetic data

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Let’s make a generation of amazing image generation models
