Official codebase for the paper Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction (published at EMNLP Findings 2024)
The OEDD (Operationalize Experience Despite Distraction) corpus is a collection of reasoning tests designed to evaluate the capacity of language model agent systems to make smart action-inferences despite plausible distractions.
.
├── assets # contains .svgs for app.py
├── src # main source code
│ ├── models.py # Pydantic data objects
│ └── utils.py # helper functions
├── templates # prompt templates
├── tests # OEDD tests (omitted from version control, downloadable from links below)
├── app.py # runs NiceGUI app on local machine to visualize corpus
├── figures.py # generates figures and statistical significance test results
└── run_tests.py # script to run tests with GPT-3.5-Turbo, GPT-4o, and Gemini 1.5 Pro
The following are download links to different versions of the test corpus. Please download this tests
directory and add it to the root of the repository before running anything.
Our results.csv
from our initial experiments using v1.0.0 of the corpus can be downloaded here.
We consider this a living corpus and encourage community scrutiny, feedback, and contributions.
Corpus updates and justifications will be documented here:
Date | Version | Comments |
---|---|---|
10/3/2024 | 1.0.0 | Initial release |
To suggest changes to the corpus, please contact the repository owner privately with your suggestions, additions, etc.
Please refrain from discussing the contents of the corpus or potentional additions to the corpus in public forums (including Github Issues) to avoid leaking content into LLM training sets and biasing future evaluations.
All test json files contain a canary string intended to help people easily identify and remove these files from any training data sets as well as post-hoc diagnosis of whether this data was used in model training.
We provide a custom NiceGUI application that allows users to more intuitively explore the content of the OEDD tests.
It can be run locally by executing the following command (after installing dependencies in requirements.txt
):
$ python app.py
This script requires that the corpus be downloaded and extracted to a tests
directory in the root of the repository.