DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

This repository contains code for the paper "DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion".

Setup

Use the commands below to set up the required environment.

conda create -n destein python=3.10
conda activate destein
pip install -r requirements.txt

Data

To keep the results current, we rescored the toxicity of RealToxicityPrompts with the Perspective API. We then sampled toxic (>=0.5) and non-toxic (<0.5) prompts as the test set, which is located in the ./data/RealToxicityPrompts folder.
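
The threshold split described above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's sampling code: the flat `toxicity` field and the function name `split_by_toxicity` are hypothetical, and the actual record layout in ./data/RealToxicityPrompts may differ.

```python
TOXICITY_THRESHOLD = 0.5  # threshold stated in the Data section


def split_by_toxicity(records, threshold=TOXICITY_THRESHOLD):
    """Partition records into (toxic, non_toxic) lists.

    Assumes each record is a dict with a hypothetical "toxicity" key
    holding a Perspective API score in [0, 1], or None if unscored.
    """
    toxic, non_toxic = [], []
    for rec in records:
        score = rec.get("toxicity")
        if score is None:
            continue  # skip records without a score
        if score >= threshold:
            toxic.append(rec)      # toxic: score >= 0.5
        else:
            non_toxic.append(rec)  # non-toxic: score < 0.5
    return toxic, non_toxic
```

A score of exactly 0.5 lands in the toxic group, matching the >=0.5 boundary given above.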

The folder ./data/act contains precomputed steering vectors and probe detection results. To replicate the results in the paper, you can use them directly.

Detoxification

To generate continuations with DESTEIN and score them for toxicity using the Perspective API toxicity scorer, run the following command.

API_RATE=50
OUTPUT_DIR=generations/results/gpt2-large/test

export CUDA_VISIBLE_DEVICES="4" 
python -m run_toxicity_experiment \
    --dataset-file data/RealToxicityPrompts/100/test.jsonl \
    --model-type gpt2-act \
    --model "/nfs-data/user30/model/gpt2-large" \
    --tokenizer "/nfs-data/user30/model/gpt2-large" \
    --perspective-rate-limit $API_RATE \
    --p 0.9 \
    --count 20 \
    --alpha 0.45 \
    --batch-size 25 \
    --n 25 \
    --seed 42 \
    $OUTPUT_DIR

In general, --model-type is either a base model (gpt2, llama2) or one of our methods (gpt2-act, llama2-act, opt-act, mpt-act). Each method takes its own additional parameters; for details, please refer to our paper.

This script will create three files in OUTPUT_DIR: generations.jsonl with all of the generated continuations, perspective.jsonl with all the scores from Perspective API, and prompted_gens_[model_type].jsonl, which collates the previous two files.
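
To inspect any of these JSONL files without assuming their schema, a small helper like the one below may be useful. It is a minimal sketch: the function names `parse_jsonl` and `summarize` are hypothetical and not part of this repository, and the helper only reports which top-level keys the records contain.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of objects.

    Blank lines are skipped. Pass an open file handle, e.g.
    `with open(path) as f: records = parse_jsonl(f)`.
    """
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records):
    """Return (number of records, sorted top-level keys seen).

    Assumes every record is a JSON object (a dict after parsing).
    """
    keys = set()
    for rec in records:
        keys.update(rec)  # adds the dict's top-level keys
    return len(records), sorted(keys)
```

Running `summarize` on each of the three files shows which fields are available before you write any analysis code against them.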

Evaluation

To evaluate generated output for fluency and diversity, run the following command. The --generations_file argument should point to a file named prompted_gens_[model_type].jsonl.

python -m eval.evaluate_generations \
    --generations_file $1
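
As an illustration of what a diversity score can look like, here is a sketch of distinct-n, a common measure defined as the number of unique n-grams divided by the total number of n-grams. This is an assumption for illustration only: the metrics actually computed are defined in eval/evaluate_generations.py and may differ, and the whitespace tokenization and the name `distinct_n` are hypothetical choices.

```python
def distinct_n(texts, n):
    """Compute distinct-n over a collection of generated texts.

    Tokenizes on whitespace (an illustrative simplification) and returns
    unique n-grams / total n-grams, or 0.0 if there are no n-grams.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # zip over n shifted copies of the token list yields all n-grams
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

Higher values indicate less repetition; a text that repeats the same phrase many times scores close to 0.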

Citation

@misc{li2024destein,
      title={DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion}, 
      author={Yu Li and Zhihua Wei and Han Jiang and Chuanyang Gong},
      year={2024},
      eprint={2404.10464},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

