Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? May 7, 2024 • 7
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 72
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 28
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
HistBERTurk-Models Collection Fine-tuned BERTurk models for historical Turkish. • 3 items • Updated 6 days ago • 2
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Paper • 2501.04828 • Published 3 days ago • 3
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Paper • 2501.05040 • Published 2 days ago • 7
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations Paper • 2501.03403 • Published 5 days ago • 3
Article Synthetic Data Generation with FastData and Hugging Face By asoria • 4 days ago • 12
Article Crowd-sourced Open Preference Dataset for Text-to-Image Generation By RapidataAI • 4 days ago • 17
CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions Paper • 2501.00097 • Published 12 days ago • 1
Article FineWeb2-C: Help Build Better Language Models in Your Language By davanstrien • 19 days ago • 12
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published 24 days ago • 121
Granite 3.1 Language Models Collection A series of language models with 128K context length trained by IBM licensed under Apache 2.0 license. • 8 items • Updated 24 days ago • 47
ModernBERT Collection Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 23 days ago • 122
Hf-native ColVision Models Collection Models that can be used with the native transformers 🤗 implementation instead of colpali-engine. • 2 items • Updated Dec 8, 2024 • 2
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages Paper • 2412.09587 • Published 30 days ago • 3
PaliGemma 2 Release Collection Vision-Language Models available in multiple 3B, 10B and 28B variants. • 23 items • Updated 29 days ago • 125
Open Image Preferences Collection Containing all artifacts for the Stable Diffusion 3.5L vs Flux Dev image preference community sprint. • 14 items • Updated 23 days ago • 6