Natalia Elvira

nataliaElv

AI & ML interests

Data curation, high-quality data, multilinguality, NLP & computational linguistics

Recent Activity

reacted to davanstrien's post with 🚀 1 day ago

The https://huggingface.co/datasets/data-is-better-together/fineweb-c dataset is growing! This week a few more languages reached 1,000 annotations for the educational quality of data from https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

Why should you care? The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1). Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pre-training.

Why not use an LLM? LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality of texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages

This week the following languages were done:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod
Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate
Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community
Contribute yourself here: https://huggingface.co/spaces/data-is-better-together/fineweb-c
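
As a rough illustration of the filtering idea in the post above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the record layout, the label names ("none", "minimal", "good"), and the helper functions are hypothetical and do not reflect the actual fineweb-c schema or label set. It resolves several community annotations per text by majority vote, then keeps only texts whose majority label counts as educational.

```python
from collections import Counter

# Hypothetical records: each text carries several community annotations.
# The label names are placeholders, not the real fineweb-c labels.
records = [
    {"text": "Photosynthesis converts light into chemical energy.",
     "labels": ["good", "good", "minimal"]},
    {"text": "Buy cheap watches now!!!",
     "labels": ["none", "none"]},
    {"text": "A short history of the printing press.",
     "labels": ["minimal", "good", "good"]},
]

# Assumed set of labels that count as "educational enough" to keep.
KEEP = {"good"}


def majority_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def filter_by_quality(rows, keep=KEEP):
    """Keep rows whose majority annotation is in the `keep` set."""
    return [row for row in rows if majority_label(row["labels"]) in keep]


kept = filter_by_quality(records)
for row in kept:
    print(row["text"])
```

The same majority labels could also serve as reference labels when checking how well an LLM annotator agrees with the community in a given language; that comparison is left out here to keep the sketch short.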

posted an update 2 days ago

Do you want to easily save annotations to a Dataset on the Hub? In the latest version of Argilla (v2.6.0), you can export your data directly from the UI to the Hub. Check out all the changes and update to the latest version: https://github.com/argilla-io/argilla/releases/tag/v2.6.0

posted an update 25 days ago

If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video! I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out! https://www.youtube.com/watch?v=_-ORB4WAVGU

Articles

FineWeb2-C: Help Build Better Language Models in Your Language

Argilla 2.4: Easily Build Fine-Tuning and Evaluation datasets on the Hub — No Code Required

How to build a custom text classifier without days of human labeling

How to optimize your data labelling project with custom interfaces

Organizations

nataliaElv's activity

upvoted 2 articles 3 months ago

Article

How to build a custom text classifier without days of human labeling

Oct 17, 2024

• 55

Article

How to optimize your data labelling project with custom interfaces

Oct 16, 2024

• 18

upvoted a collection 3 months ago

Datasets ATR line-level

This collection contains all our datasets for Automatic Text Recognition on line images. • 12 items • Updated Mar 14, 2024 • 3

upvoted an article 3 months ago

Article

Fine-tuning a token classification model for legal data using Argilla and AutoTrain

Sep 7, 2024

• 14

upvoted an article 4 months ago

Article

Introducing Community Tools on HuggingChat

Sep 16, 2024

• 34

upvoted an article 6 months ago

Article

How we leveraged distilabel to create an Argilla 2.0 Chatbot

Jul 16, 2024

• 32

upvoted an article 7 months ago

Article

🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets

Jun 4, 2024

• 73

upvoted an article 9 months ago

Article

⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together

Apr 29, 2024

• 29