LM Contamination Index

Large Language Models have seen trillions of tokens. However, who knows what is inside? Recent work has evaluated these models on many different tasks, but did the authors make sure the models had not already seen the training or even the evaluation datasets? In the accompanying blog post, we show that ChatGPT has memorized some popular benchmark datasets and can be prompted to regenerate them.

In this repo we aim to collect as much contamination evidence as possible, providing the research community with a reliable resource to quickly check whether a model has already seen a given evaluation dataset. However, we are aware that the index is incomplete, and we therefore ask researchers to perform a small contamination experiment of their own beforehand in any case (a minimal sketch follows below).
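
For example, one simple check in the spirit of the blog post is guided completion: feed the model the first part of a verbatim dataset instance and see whether it reproduces the held-out remainder. The sketch below assumes the OpenAI Python client (v1+) with an OPENAI_API_KEY in the environment; the model name, prompt wording, 50/50 split, and similarity metric are illustrative choices, not a fixed protocol.

```python
"""Minimal contamination check: ask a model to continue verbatim dataset
instances and measure how closely its output matches the reference."""
from difflib import SequenceMatcher

from openai import OpenAI  # assumes the openai Python client, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def continue_text(prefix: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model to continue `prefix` exactly as in its original source."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Continue the following text exactly as it appears "
                       "in its original source:\n\n" + prefix,
        }],
        temperature=0,
    )
    return response.choices[0].message.content or ""


def contamination_score(instance: str, split_ratio: float = 0.5) -> float:
    """Split an instance, prompt with the first part, and return the
    similarity between the model's continuation and the held-out rest."""
    cut = int(len(instance) * split_ratio)
    prefix, reference = instance[:cut], instance[cut:]
    completion = continue_text(prefix)[: len(reference)]
    return SequenceMatcher(None, reference, completion).ratio()


if __name__ == "__main__":
    # Hypothetical example instance; in practice, use verbatim examples
    # drawn from the evaluation split you want to test.
    example = "The quick brown fox jumps over the lazy dog."
    print(f"similarity: {contamination_score(example):.2f}")
```

Similarity scores near 1.0 across many instances suggest the split has been memorized. Low scores do not prove the absence of contamination, since a model may have seen data without being able to regenerate it verbatim.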

You can browse the index with the LM Contamination Index search tool.

Contributing

The number of datasets and models is daunting, so we envision a community effort. If you are passionate about NLP research and want to help fight contamination in LLM evaluation, please follow the contribution guidelines.
