BigScience Workshop

non-profit

https://bigscience.huggingface.co

bigscienceW

bigscience-workshop


AI & ML interests

A one-year long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

Skylion007 authored a paper 1 day ago

The GAN is dead; long live the GAN! A Modern GAN Baseline

Jekaterina authored a paper 12 days ago

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Jekaterina authored a paper 12 days ago

DEPAC: a Corpus for Depression and Anxiety Detection from Speech


bigscience's activity

davanstrien

posted an update 1 day ago


The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pre-training.
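As a rough illustration of what filtering by educational quality can look like, here is a minimal sketch. The `edu_score` field name, the 3.0 threshold, and the sample records are all assumptions made up for this example; they are not the actual FineWeb schema or a recommended cutoff.

```python
from typing import Iterable, Iterator


def filter_by_quality(docs: Iterable[dict], min_score: float = 3.0) -> Iterator[dict]:
    """Yield only documents whose (hypothetical) edu_score meets the threshold."""
    for doc in docs:
        # Documents without a score are treated as lowest quality and dropped.
        if doc.get("edu_score", 0.0) >= min_score:
            yield doc


# Made-up records, for illustration only.
sample = [
    {"text": "An introduction to photosynthesis.", "edu_score": 4.2},
    {"text": "Buy cheap watches now!!!", "edu_score": 0.3},
    {"text": "Notes on basic fractions.", "edu_score": 3.0},
]

kept = list(filter_by_quality(sample))
print(f"kept {len(kept)} of {len(sample)} documents")
```

Raising `min_score` trades dataset size for quality, which is where the reduction in pre-training data comes from.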

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.
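For the first point in the list above, one common way to check whether an LLM labels a language well is to measure its agreement with the community annotations on the same texts. The sketch below computes raw agreement and Cohen's kappa in plain Python. The label names and the two label lists are hypothetical placeholders, not real fineweb-c annotations.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(human: Sequence[str], model: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(model) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    # Fraction of texts where both annotators chose the same label.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    human_counts, model_counts = Counter(human), Counter(model)
    expected = sum(human_counts[k] * model_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four texts, for illustration only.
human_labels = ["good", "good", "minimal", "minimal"]
llm_labels = ["good", "good", "minimal", "good"]

kappa = cohen_kappa(human_labels, llm_labels)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 suggests the LLM could stand in for human annotators in that language, while a value near 0 means its agreement is no better than chance.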

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c


albertvillanova

posted an update 4 days ago


Discover all the improvements in the new version of Lighteval: https://huggingface.co/docs/lighteval/

DeividasM

authored a paper 12 days ago

Bridging the Data Provenance Gap Across Text, Speech and Video

Paper • 2412.17847 • Published 24 days ago • 8

davanstrien

posted an update 15 days ago


🇸🇰 Hovorte po slovensky? Help build better AI for Slovak!

We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release!

Your contribution will help create better language models for 5+ million Slovak speakers.

Annotate here: data-is-better-together/fineweb-c.

Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community


christopher

in bigscience/bloom-1b1-intermediate 22 days ago

Adding `safetensors` variant of this model

#2 opened 22 days ago by

SFconvertbot

davanstrien

posted an update 22 days ago


Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

christopher

in bigscience/bloom-7b1-intermediate 22 days ago

Adding `safetensors` variant of this model

#4 opened 23 days ago by

SFconvertbot

soldni

authored 2 papers 22 days ago

RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models

Paper • 2409.02685 • Published Sep 4, 2024 • 1

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Paper • 2412.04403 • Published Dec 5, 2024 • 2

lhoestq

authored a paper 23 days ago

Croissant: A Metadata Format for ML-Ready Datasets

Paper • 2403.19546 • Published Mar 28, 2024 • 1

lhoestq

posted an update 30 days ago


Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)

Davlan

authored 9 papers about 1 month ago

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Paper • 2303.03915 • Published Mar 7, 2023 • 6

SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)

Paper • 2304.06845 • Published Apr 13, 2023

AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Paper • 2305.06897 • Published May 11, 2023 • 8

MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

Paper • 2305.13989 • Published May 23, 2023

AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages

Paper • 2311.09828 • Published Nov 16, 2023 • 1

The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation

Paper • 2103.08647 • Published Mar 15, 2021

MasakhaNER: Named Entity Recognition for African Languages

Paper • 2103.11811 • Published Mar 22, 2021

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Paper • 2201.08277 • Published Jan 20, 2022

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Paper • 2309.07445 • Published Sep 14, 2023


Team members 328
