This repository provides tools for evaluating and comparing sampling methods (e.g., uniform sampling, importance sampling, control variates) on datasets annotated by both LLMs and humans.
Each dataset must be a CSV file containing the following required columns:
| Column Name | Description |
|---|---|
| `data_entry` | The text or data entry being evaluated. |
| `gold_label` | The ground-truth label for the entry, provided by human annotators. |
| `gpt_label` | The label predicted by the LLM. |
| `confidence_normalized` | The normalized confidence score of the LLM prediction (range: 0 to 1). |
For example:

```csv
data_entry,gold_label,gpt_label,confidence_normalized
"The earth is warming.",1,1,0.95
"Climate change is a hoax.",0,0,0.80
"Renewable energy is the future.",1,1,0.90
```
- Place datasets in the `datasets` directory under their respective subfolders (e.g., `global_warming`, `helmet`).
- Name the dataset file after its subfolder (e.g., `global_warming.csv`, `helmet.csv`).
The file `load_dataset.py` handles loading datasets and computing initial statistics. To add a new dataset:

- Open `load_dataset.py`.
- Add a new condition:

  ```python
  elif dataset == "new_dataset_name":
      data = pd.read_csv("../llm-annotations/datasets/new_dataset_name/new_dataset_name.csv")
  ```

- Ensure the new dataset file is properly formatted as described above.
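For context, the dispatch in `load_dataset.py` follows this general shape (a sketch; the function name, signature, and fallback behavior shown here are assumptions):

```python
import pandas as pd

def load_dataset(dataset):
    # Map each dataset name to its CSV file; new datasets add an elif branch.
    if dataset == "global_warming":
        data = pd.read_csv("../llm-annotations/datasets/global_warming/global_warming.csv")
    elif dataset == "new_dataset_name":
        data = pd.read_csv("../llm-annotations/datasets/new_dataset_name/new_dataset_name.csv")
    else:
        raise ValueError(f"Unknown dataset: {dataset}")
    return data
```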
The `compute_statistics` function in `statistic.py` computes dataset-specific statistics. To add a new dataset:

- Open `statistic.py`.
- Add a new `if` block:

  ```python
  if dataset == "new_dataset_name":
      # Add logic to compute statistics for the dataset
      return some_statistic
  ```

- Ensure the logic aligns with the dataset's structure and evaluation needs.
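For instance, if the statistic of interest is the prevalence of the positive class, the block might look like this (a hypothetical example; `data` is assumed to be the loaded DataFrame, and your dataset may call for a different statistic):

```python
if dataset == "new_dataset_name":
    # Hypothetical statistic: fraction of entries with a positive gold label.
    return (data["gold_label"] == 1).mean()
```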
- Uniform Sampling – Replaces a portion of LLM predictions with human labels.
- Importance Sampling – Uses LLM confidence scores to guide sampling.
- Control Variate – Adjusts estimates using LLM confidence as a proxy variable (see the sketch after this list).
- LLM Only – Evaluates the LLM predictions without human labels.
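As a rough illustration of the control variate idea (a minimal sketch, not the repository's implementation): the LLM proxy is known on every entry, so its full-dataset mean can be used to correct an estimate built from a small human-labeled sample.

```python
import numpy as np

def control_variate_estimate(gold_sample, proxy_sample, proxy_mean_full):
    """Sketch of a control variate estimator for the mean gold label.

    gold_sample: human labels on the sampled entries
    proxy_sample: LLM proxy values (e.g., gpt_label or confidence) on the same entries
    proxy_mean_full: mean of the proxy over the entire dataset (cheap to compute)
    """
    cov = np.cov(gold_sample, proxy_sample)
    c = cov[0, 1] / cov[1, 1]  # variance-minimizing coefficient
    # Correct the naive sample mean by how far the proxy sample drifted
    # from its known full-dataset mean.
    return np.mean(gold_sample) - c * (np.mean(proxy_sample) - proxy_mean_full)
```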
Use the following commands in your terminal:

```bash
python methods/uniform_sampling.py --dataset <dataset_name> --step_size 100 --repeat 1000 --save_dir results/uniform_sampling/
python methods/importance_sampling.py --dataset <dataset_name> --max_human_budget 999999 --step_size 100 --repeat 1000 --save_dir results/importance_sampling/
python methods/control_variate.py --dataset <dataset_name> --step_size 100 --max_human_budget 999999 --repeat 1000 --save_dir results/control_variate/
python methods/llm_only.py <dataset_name>
```

Example for uniform sampling on the `global_warming` dataset:

```bash
python methods/uniform_sampling.py --dataset global_warming --step_size 100 --repeat 1000 --save_dir results/uniform_sampling/
```
Results are saved as CSV files in the `results` directory under method-specific subfolders:

- `uniform_sampling`
- `importance_sampling`
- `control_variate`
- `llm_human`

Each result CSV file typically contains:

- Dataset – the name of the dataset
- Human Samples – the number of human-labeled samples used
- Relative Error – the relative error of the estimate
- LLM Samples (for uniform sampling) – the number of LLM-labeled samples used
To visualize outcomes, use the notebooks in the `plots` directory. For example:

- `plot_10_22.ipynb` – Generates comparison plots for the sampling methods.

Ensure the result CSV files are in the correct locations before running the notebook.
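If you prefer a quick script over the notebook, a minimal plotting sketch might look like this (the column names follow the result format above; the per-dataset file name is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Compare methods on one dataset: relative error vs. human labeling budget.
methods = ["uniform_sampling", "importance_sampling", "control_variate"]
for method in methods:
    df = pd.read_csv(f"results/{method}/global_warming.csv")  # assumed file name
    df.groupby("Human Samples")["Relative Error"].mean().plot(label=method)

plt.xlabel("Human Samples")
plt.ylabel("Mean Relative Error")
plt.legend()
plt.show()
```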
- Confirm that `max_human_budget` does not exceed the dataset size.
- Cap `sample_sizes` to the dataset size in the sampling scripts (see the sketch after this list).
- Check that the confidence scores are calibrated and meaningful.
- Use balanced or stratified sampling if the dataset is skewed.
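For example, clamping the requested sample sizes (hypothetical variable names):

```python
# Clamp each requested sample size so it never exceeds the dataset size.
sample_sizes = [min(size, len(data)) for size in sample_sizes]
```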
- Create a new script in `methods` (e.g., `new_sampling.py`).
- Implement the sampling logic, following the structure of the existing scripts.
- Save results as CSV files in the `results` directory, under an appropriate subfolder (see the skeleton below).
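A skeleton for such a script might look like this (a sketch; the argument names mirror the existing commands, but the structure shown here is an assumption):

```python
import argparse
import os
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="New sampling method")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--step_size", type=int, default=100)
    parser.add_argument("--repeat", type=int, default=1000)
    parser.add_argument("--save_dir", default="results/new_sampling/")
    args = parser.parse_args()

    # ... load the dataset and run the sampling logic here ...
    rows = []  # e.g., dicts with "Dataset", "Human Samples", "Relative Error"

    # Write results in the same layout as the existing methods.
    os.makedirs(args.save_dir, exist_ok=True)
    pd.DataFrame(rows).to_csv(os.path.join(args.save_dir, f"{args.dataset}.csv"), index=False)

if __name__ == "__main__":
    main()
```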
```
llm-annotations/
├── datasets/
│   ├── global_warming/
│   ├── helmet/
│   ├── implicit_hate/
│   └── ...
├── methods/
│   ├── uniform_sampling.py
│   ├── importance_sampling.py
│   ├── control_variate.py
│   ├── llm_only.py
│   └── ...
├── results/
│   ├── uniform_sampling/
│   ├── importance_sampling/
│   ├── control_variate/
│   └── ...
├── plots/
│   ├── plot_10_22.ipynb
│   └── ...
└── README.md
```