Abliteration

Note: this repository (Undi95/abliteration) is a fork. The original repository is https://github.com/Orion-zhen/abliteration.git.

Make abliterated models using transformers, easy and fast.

The code has been tested on Llama-3.2, Qwen2.5-Coder, Ministral-8b.

Note: abliteration is not the same as uncensoring. An abliterated model is not necessarily completely uncensored; it simply will not explicitly refuse you.

Usage

Clone the repository:

git clone https://github.com/Undi95/abliteration.git
cd abliteration

Install dependencies:

pip install -r requirements.txt

Optional: install flash-attention:

pip install flash-attn --no-build-isolation

Make your abliterations:

python abliterate.py -m <path_to_your_model> -o <output_dir> --scan-all

Your model will be abliterated and saved to <output_dir>. Once the script finishes, you can chat with the abliterated model directly in the terminal. For Chinese models, you can add --deccp to also ablate refusals around certain topics.
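For example, with a Hugging Face model ID (the model ID and output directory below are just illustrations):

python abliterate.py -m meta-llama/Llama-3.2-1B-Instruct -o ./llama-3.2-1b-abliterated --scan-all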

The computed refusal tensors are stored in the ../refusal_tensors/ folder next to the script. If you launch the original model while some tensors are already saved, you can load them on the fly and test abliteration on precise layers. The script applies modify_tensor to the attention output projection and the MLP down projection of each targeted layer:

# Remove the refusal direction from the attention output projection...
lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
    refusal_dir,
    scale_factor,
)
# ...and from the MLP down projection
lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].mlp.down_proj.weight.data,
    refusal_dir,
    scale_factor,
)
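For intuition, modify_tensor removes the refusal direction from a weight matrix via a rank-1 projection. Below is a minimal sketch of that idea in Python; the actual implementation in this repository may differ in details such as dtype and device handling:

import torch

def modify_tensor_sketch(weight, refusal_dir, scale_factor=1.0):
    # Normalize the refusal direction (a vector in the model's hidden space).
    r = refusal_dir.to(weight.dtype)
    r = r / r.norm()
    # Rank-1 update W <- W - scale * r r^T W: removes the component of the
    # projection's output that points along the refusal direction.
    weight = weight - scale_factor * torch.outer(r, r) @ weight
    return torch.nn.Parameter(weight)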

Qwen-series models are stubborn, so you might need to adjust parameters to get a good abliteration. You can also experiment with these projections:

# Query projection
lm_model.layers[layer_idx].self_attn.q_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].self_attn.q_proj.weight.data,
    refusal_dir,
    scale_factor,
)
# MLP gate projection
lm_model.layers[layer_idx].mlp.gate_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].mlp.gate_proj.weight.data,
    refusal_dir,
    scale_factor,
)
# MLP up projection
lm_model.layers[layer_idx].mlp.up_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].mlp.up_proj.weight.data,
    refusal_dir,
    scale_factor,
)
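For example, a stronger ablation can be tried from the command line (the model ID and scale value are illustrative):

python abliterate.py -m Qwen/Qwen2.5-Coder-7B-Instruct -o ./qwen-abliterated --scan-all --scale-factor 1.5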

Available projection targets can be found in the transformers model architectures and in the mergekit model architectures.

Full arguments:

usage: abliterate.py [-h] --model MODEL [--device {auto,cuda,cpu}] --output OUTPUT [--skip-begin SKIP_BEGIN] [--skip-end SKIP_END]
                     [--layer-fraction LAYER_FRACTION | --layer [LAYER] | --scan-all]
                     [--scale-factor SCALE_FACTOR] [--flash-attn] [--deccp] [--load-in-4bit | --load-in-8bit]

Make abliterated models

options:
  -h, --help            show this help message and exit
  --model MODEL, -m MODEL
                        Your model directory or Hugging Face model ID
  --device {auto,cuda,cpu}, -d {auto,cuda,cpu}
                        Target device for the abliteration process. Warning: bitsandbytes quantization DOES NOT support CPU
  --output OUTPUT, -o OUTPUT
                        Output directory
  --skip-begin SKIP_BEGIN
                        Number of layers to skip at the beginning. Defaults to 1 to avoid messing with the first layer
  --skip-end SKIP_END   Number of layers to skip at the end
  --layer-fraction LAYER_FRACTION
                        Fraction of layers to use for the refusal_dir calculation. Cannot be used with --layer or --scan-all
  --layer [LAYER]       Perform calculations for a specific layer. Cannot be used with --layer-fraction or --scan-all
  --scan-all            Perform calculations for all layers. Cannot be used with --layer or --layer-fraction
  --scale-factor SCALE_FACTOR
                        Scale factor for ablation. Use a negative scale factor to encourage refusal
  --flash-attn          Use flash attention 2
  --deccp               For Chinese models: also ablate refusals around certain topics
  --load-in-4bit        Load the model in 4-bit precision using bitsandbytes
  --load-in-8bit        Load the model in 8-bit precision using bitsandbytes
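For example, to process a large model on limited VRAM, you can combine 4-bit loading with flash attention (the model ID is illustrative):

python abliterate.py -m Qwen/Qwen2.5-Coder-32B-Instruct -o ./qwen-32b-abliterated --scan-all --load-in-4bit --flash-attn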

Credits
