This repository provides tools for evaluating and comparing sampling methods (e.g., uniform sampling, importance sampling, control variates) on datasets annotated by both LLMs and humans.
Each dataset must be a CSV file containing the following required columns:
| Column Name | Description |
|---|---|
| `data_entry` | The text or data entry being evaluated. |
| `gold_label` | The ground-truth label for the entry, provided by human annotators. |
| `gpt_label` | The label predicted by the LLM. |
| `confidence_normalized` | The normalized confidence score of the LLM prediction (range: 0 to 1). |
For example:

```csv
data_entry,gold_label,gpt_label,confidence_normalized
"The earth is warming.",1,1,0.95
"Climate change is a hoax.",0,0,0.80
"Renewable energy is the future.",1,1,0.90
```
- Place datasets in the `datasets` directory under their respective subfolders (e.g., `global_warming`, `helmet`).
- Name the dataset file after its subfolder (e.g., `global_warming.csv`, `helmet.csv`).
The file `load_dataset.py` handles loading datasets and computing initial statistics. To add a new dataset:

- Open `load_dataset.py`.
- Add a new condition:

  ```python
  elif dataset == "new_dataset_name":
      data = pd.read_csv("../llm-annotations/datasets/new_dataset_name/new_dataset_name.csv")
  ```

- Ensure the new dataset file is properly formatted as described above.
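For context, the dispatch in `load_dataset.py` follows this general shape (a sketch; the function name, signature, and fallback behavior shown here are assumptions):

```python
import pandas as pd

def load_dataset(dataset):
    # Map each dataset name to its CSV file; new datasets add an elif branch.
    if dataset == "global_warming":
        data = pd.read_csv("../llm-annotations/datasets/global_warming/global_warming.csv")
    elif dataset == "new_dataset_name":
        data = pd.read_csv("../llm-annotations/datasets/new_dataset_name/new_dataset_name.csv")
    else:
        raise ValueError(f"Unknown dataset: {dataset}")
    return data
```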
The `compute_statistics` function in `statistic.py` computes dataset-specific statistics. To add a new dataset:

- Open `statistic.py`.
- Add a new `if` block:

  ```python
  if dataset == "new_dataset_name":
      # Add logic to compute statistics for the dataset
      return some_statistic
  ```

- Ensure the logic aligns with the dataset's structure and evaluation needs.
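For instance, if the statistic of interest is the prevalence of the positive class, the block might look like this (a hypothetical example; `data` is assumed to be the loaded DataFrame, and your dataset may call for a different statistic):

```python
if dataset == "new_dataset_name":
    # Hypothetical statistic: fraction of entries with a positive gold label.
    return (data["gold_label"] == 1).mean()
```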
- Uniform Sampling – Replaces a portion of LLM predictions with human labels.
- Importance Sampling – Uses LLM confidence scores to guide sampling.
- Control Variate – Adjusts estimates using LLM confidence as a proxy variable (see the sketch after this list).
- LLM Only – Evaluates the LLM predictions without human labels.
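As a rough illustration of the control variate idea (a minimal sketch, not the repository's implementation): the LLM proxy is known on every entry, so its full-dataset mean can be used to correct an estimate built from a small human-labeled sample.

```python
import numpy as np

def control_variate_estimate(gold_sample, proxy_sample, proxy_mean_full):
    """Sketch of a control variate estimator for the mean gold label.

    gold_sample: human labels on the sampled entries
    proxy_sample: LLM proxy values (e.g., gpt_label or confidence) on the same entries
    proxy_mean_full: mean of the proxy over the entire dataset (cheap to compute)
    """
    cov = np.cov(gold_sample, proxy_sample)
    c = cov[0, 1] / cov[1, 1]  # variance-minimizing coefficient
    # Correct the naive sample mean by how far the proxy sample drifted
    # from its known full-dataset mean.
    return np.mean(gold_sample) - c * (np.mean(proxy_sample) - proxy_mean_full)
```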
Use the following commands in your terminal:

```bash
python methods/uniform_sampling.py --dataset <dataset_name> --step_size 100 --repeat 1000 --save_dir results/uniform_sampling/
python methods/importance_sampling.py --dataset <dataset_name> --max_human_budget 999999 --step_size 100 --repeat 1000 --save_dir results/importance_sampling/
python methods/control_variate.py --dataset <dataset_name> --step_size 100 --max_human_budget 999999 --repeat 1000 --save_dir results/control_variate/
python methods/llm_only.py <dataset_name>
```

Example for uniform sampling on the `global_warming` dataset:

```bash
python methods/uniform_sampling.py --dataset global_warming --step_size 100 --repeat 1000 --save_dir results/uniform_sampling/
```
Results are saved as CSV files in the `results` directory under method-specific subfolders:

- `uniform_sampling`
- `importance_sampling`
- `control_variate`
- `llm_human`

Each result CSV file typically contains:

- Dataset – the name of the dataset
- Human Samples – the number of human-labeled samples used
- Relative Error – the relative error of the estimate
- LLM Samples (for uniform sampling) – the number of LLM-labeled samples used
To visualize outcomes, use the notebooks in the `plots` directory. For example:

- `plot_10_22.ipynb` – Generates comparison plots for the sampling methods.

Ensure the result CSV files are in the correct locations before running the notebook.
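If you prefer a quick script over the notebook, a minimal plotting sketch might look like this (the column names follow the result format above; the per-dataset file name is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Compare methods on one dataset: relative error vs. human labeling budget.
methods = ["uniform_sampling", "importance_sampling", "control_variate"]
for method in methods:
    df = pd.read_csv(f"results/{method}/global_warming.csv")  # assumed file name
    df.groupby("Human Samples")["Relative Error"].mean().plot(label=method)

plt.xlabel("Human Samples")
plt.ylabel("Mean Relative Error")
plt.legend()
plt.show()
```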
- Confirm that `max_human_budget` does not exceed the dataset size.
- Cap `sample_sizes` to the dataset size in the sampling scripts (see the sketch after this list).
- Check that the confidence scores are calibrated and meaningful.
- Use balanced or stratified sampling if the dataset is skewed.
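For example, clamping the requested sample sizes (hypothetical variable names):

```python
# Clamp each requested sample size so it never exceeds the dataset size.
sample_sizes = [min(size, len(data)) for size in sample_sizes]
```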
- Create a new script in `methods` (e.g., `new_sampling.py`).
- Implement the sampling logic, following the structure of the existing scripts.
- Save results as CSV files in the `results` directory, under an appropriate subfolder (see the skeleton below).
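A skeleton for such a script might look like this (a sketch; the argument names mirror the existing commands, but the structure shown here is an assumption):

```python
import argparse
import os
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="New sampling method")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--step_size", type=int, default=100)
    parser.add_argument("--repeat", type=int, default=1000)
    parser.add_argument("--save_dir", default="results/new_sampling/")
    args = parser.parse_args()

    # ... load the dataset and run the sampling logic here ...
    rows = []  # e.g., dicts with "Dataset", "Human Samples", "Relative Error"

    # Write results in the same layout as the existing methods.
    os.makedirs(args.save_dir, exist_ok=True)
    pd.DataFrame(rows).to_csv(os.path.join(args.save_dir, f"{args.dataset}.csv"), index=False)

if __name__ == "__main__":
    main()
```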
```
llm-annotations/
├── datasets/
│   ├── global_warming/
│   ├── helmet/
│   ├── implicit_hate/
│   └── ...
├── methods/
│   ├── uniform_sampling.py
│   ├── importance_sampling.py
│   ├── control_variate.py
│   ├── llm_only.py
│   └── ...
├── results/
│   ├── uniform_sampling/
│   ├── importance_sampling/
│   ├── control_variate/
│   └── ...
├── plots/
│   ├── plot_10_22.ipynb
│   └── ...
└── README.md
```