OEDD

Official codebase for the paper *Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction* (published in Findings of EMNLP 2024).

OEDD Corpus of Reasoning Tests for LLM Agents

The OEDD (Operationalize Experience Despite Distraction) corpus is a collection of reasoning tests designed to evaluate the capacity of language model agent systems to make smart action-inferences despite plausible distractions.

Code Structure

```
.
├── assets                    # contains .svgs for app.py
├── src                       # main source code
│   ├── models.py             #   Pydantic data objects
│   └── utils.py              #   helper functions
├── templates                 # prompt templates
├── tests                     # OEDD tests (omitted from version control, downloadable from links below)
├── app.py                    # runs NiceGUI app on local machine to visualize corpus
├── figures.py                # generates figures and statistical significance test results
└── run_tests.py              # script to run tests with GPT-3.5-Turbo, GPT-4o, and Gemini 1.5 Pro
```
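As a rough illustration of the kind of data objects that `src/models.py` defines, a single test record might be modeled as follows. This is a standard-library sketch only; the field names are hypothetical assumptions, not the actual schema used in the repository.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the Pydantic data objects in src/models.py.
# All field names below are illustrative assumptions, not the real schema.
@dataclass
class OEDDTest:
    test_id: str
    experiential_context: list[str]  # disparate context the agent must operationalize
    distractor: str                  # the plausible distraction
    action_choices: list[str]        # candidate next actions
    correct_action: int              # index of the better action

example = OEDDTest(
    test_id="example-001",
    experiential_context=["observation A", "observation B"],
    distractor="a plausible but misleading cue",
    action_choices=["do X", "do Y"],
    correct_action=1,
)
```

In the actual codebase these objects are Pydantic models, which additionally validate types when test files are parsed from JSON.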

Downloads

The following are download links to different versions of the test corpus. Please download this tests directory and add it to the root of the repository before running anything.

The `results.csv` file from our initial experiments using v1.0.0 of the corpus can be downloaded here.

Version History

We consider this a living corpus and encourage community scrutiny, feedback, and contributions.

Corpus updates and justifications will be documented here:

| Date | Version | Comments |
| --- | --- | --- |
| 10/3/2024 | 1.0.0 | Initial release |

To suggest changes to the corpus, please contact the repository owner privately with your suggestions, additions, etc.

Please refrain from discussing the contents of the corpus, or potential additions to it, in public forums (including GitHub Issues) to avoid leaking content into LLM training sets and biasing future evaluations.

Canary String

All test `.json` files contain a canary string intended to help practitioners easily identify and remove these files from training data sets, as well as to diagnose post hoc whether this data was used in model training.
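A filtering pass over a candidate training corpus can be sketched as below. The canary value shown is a placeholder assumption; substitute the actual string found in the OEDD test files.

```python
from pathlib import Path

# Placeholder value -- replace with the actual canary string
# embedded in the OEDD test .json files.
CANARY = "OEDD-CANARY-EXAMPLE"

def find_canary_files(root: str) -> list[Path]:
    """Return paths of .json files under `root` that contain the canary string."""
    flagged = []
    for path in Path(root).rglob("*.json"):
        if CANARY in path.read_text(encoding="utf-8", errors="ignore"):
            flagged.append(path)
    return flagged
```

Files returned by `find_canary_files` should be excluded from any training data set before use.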

Corpus Visualization App

We provide a custom NiceGUI application that allows users to more intuitively explore the content of the OEDD tests.

It can be run locally by executing the following command (after installing the dependencies in `requirements.txt`):

```
$ python app.py
```

This script requires that the corpus be downloaded and extracted to a `tests` directory in the root of the repository.

*App demo GIF*
