Daniel Vila

dvilasuero

https://argilla.io

AI & ML interests

RLHF, RLAIF, DPO, data, data, data

Recent Activity

reacted to davanstrien's post with 🚀 1 day ago

The https://huggingface.co/datasets/data-is-better-together/fineweb-c dataset is growing! This week a few more languages have reached 1,000 annotations for the educational quality of data from https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

Why should you care? The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1). Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pretraining.

Why not use an LLM? LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:
- Evaluate whether an LLM can label the educational quality of texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.

This week the following languages were done:
Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod
Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate
Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more? https://huggingface.co/blog/davanstrien/fineweb2-community
Contribute yourself here: https://huggingface.co/spaces/data-is-better-together/fineweb-c
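
One use the post lists for the community annotations is evaluating whether an LLM labels educational quality well in a given language. Here is a minimal sketch of that check, assuming each text has one community label and one LLM label. The label names and example data are hypothetical placeholders, not the actual fineweb-c label set or schema.

```python
from collections import Counter

# Hypothetical labels for the same five texts (placeholders, not the
# real fineweb-c label set).
community = ["good", "none", "minimal", "good", "none"]
llm = ["good", "none", "good", "good", "minimal"]


def agreement(human_labels, model_labels):
    """Fraction of texts where the model label matches the human label."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one label")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)


def confusion(human_labels, model_labels):
    """Count (human, model) label pairs to show where the model disagrees."""
    return Counter(zip(human_labels, model_labels))


print(f"agreement: {agreement(community, llm):.2f}")
for (human, model), count in sorted(confusion(community, llm).items()):
    print(f"human={human:8s} model={model:8s} n={count}")
```

Raw agreement is only a rough signal. With imbalanced labels, a chance-corrected measure such as Cohen's kappa would likely be more informative.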

liked a model 2 days ago

stabilityai/stable-point-aware-3d

liked a dataset 2 days ago

eltorio/ROCOv2-radiology

Articles

FineWeb2-C: Help Build Better Language Models in Your Language

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Let’s make a generation of amazing image generation models

Argilla 2.4: Easily Build Fine-Tuning and Evaluation datasets on the Hub — No Code Required

How to build a custom text classifier without days of human labeling

How to optimize your data labelling project with custom interfaces

🔥 Argilla 2.0: the data-centric tool for AI makers 🤗

Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

How we leveraged distilabel to create an Argilla 2.0 Chatbot

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets

Data is better together

Organizations

dvilasuero's activity

New activity in ariG23498/open-image-preferences-v1-sdxl-lora 26 days ago

Add generated example

#3 opened 26 days ago by

commented a paper about 1 month ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 17

New activity in CohereForAI/Global-MMLU about 1 month ago

Include argilla tag

#2 opened about 1 month ago by

New activity in open-acc/README about 2 months ago

[24/ 11] What are you working on this week! 💪

#2 opened about 2 months ago by

New activity in microsoft/orca-agentinstruct-1M-v1 about 2 months ago

message column is a str

#3 opened about 2 months ago by

New activity in davidberenstein1957/vectorsearch-hub-datasets about 2 months ago

Typo search input

#1 opened about 2 months ago by

New activity in GAIR/o1-journey 2 months ago

More details about how this dataset was built

#3 opened 2 months ago by

New activity in JournalistsonHF/README 2 months ago

Create datasets for AI with No Code - Love to hear your use cases

#11 opened 2 months ago by

New activity in yutaozhu94/INTERS 2 months ago

Error in config preventing the dataset to be loaded using dataset-viewer

#3 opened 2 months ago by

New activity in glaiveai/reflection-v1 3 months ago

Duplicates

#3 opened 3 months ago by

New activity in dvilasuero/image-prefs 3 months ago

Using the proper Docker image

#1 opened 3 months ago by

New activity in argilla/synthetic-data-generator 4 months ago

Review before pushing the dataset.

#3 opened 4 months ago by

Enhancement to mitigate response included in user message

#1 opened 4 months ago by

New activity in thesven/Reflective-MAGLLAMA-v0.1 4 months ago

Just wanted to say kudos and thank you!

#2 opened 4 months ago by

New activity in argilla/synthetic-data-generator 4 months ago

Add disabled view if not logged in

#2 opened 4 months ago by

New activity in DIBT-Russian/prompt-translation-for-Russian 8 months ago

Upgrade Argilla server

#2 opened 8 months ago by

Update Argilla version

#1 opened 8 months ago by

New activity in dvilasuero/distillama3-prompts10k 8 months ago

Librarian Bot: Add language metadata for dataset

#2 opened 9 months ago by

New activity in dvilasuero/human-rights_config_space 9 months ago

Upload 2 files

#2 opened 9 months ago by

New activity in abhishek/autotrain-llama3-orpo-v2 9 months ago

Adds dataset metadata

#1 opened 9 months ago by