This is the repository for the ACL 2025 paper "Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration".
- [5 May, 2025]: Our paper has been accepted to ACL 2025, and our code is released.
- [21 October, 2024]: We release the labeled SlimPajama datasets.
- [14 October, 2024]: We release our 1.3B model checkpoints and the BERT Topic Classifier.
TODOs:
- Model Checkpoints
- BERT Topic Model Checkpoint
- Labeled SlimPajama-670B datasets
- Code for methods ......