Phi4 Abliteration (WIP)

This is Phi4 abliterated using a methodology that (surprisingly?) seems to be new. The approach is still being refined, with a focus on balancing neutrality, usability, and adaptability for fine-tuning.

Goal

The objective is to create a model that is neutral:

  • Not uncensored, but avoids refusing neutral prompts it would ordinarily reject.
  • Provides a foundation for fine-tuning to achieve reduced censorship while maintaining high usability.

Original Methodology

In the original implementation:

  1. Harmful and harmless prompts were compared on one specific layer of the model.
  2. The computed refusal direction was then applied uniformly to all layers.
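The two steps above can be sketched on toy data. This is a minimal NumPy illustration, not the script's code. It assumes the refusal direction is the normalized difference between the mean activations for harmful and harmless prompts, and that "applying" it means projecting it out of a weight matrix. The activations, shapes, and the names `refusal_direction` and `ablate` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_layers, n_prompts = 8, 4, 16

# Hypothetical activations captured at ONE chosen layer: (prompts, hidden).
# The harmful set is shifted along the first axis so a direction exists.
harmful = rng.normal(size=(n_prompts, hidden)) + 2.0 * np.eye(hidden)[0]
harmless = rng.normal(size=(n_prompts, hidden))

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(weight, direction, scale=1.0):
    """Subtract the part of `weight`'s output that lies along `direction`."""
    return weight - scale * np.outer(direction, direction) @ weight

direction = refusal_direction(harmful, harmless)

# Toy stand-ins for one output-projection matrix per layer.
weights = [rng.normal(size=(hidden, hidden)) for _ in range(n_layers)]

# Original approach: the same single direction is reused for every layer.
ablated = [ablate(w, direction) for w in weights]

# Largest remaining component along `direction` over all layers (close to zero).
print(max(np.abs(direction @ w).max() for w in ablated))
```

Every layer ends up blind to the same vector, whether or not that vector means "refusal" at that depth.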

Problem:

The resulting model was less usable and less intelligent than the original. A likely cause is that applying a single refusal direction uniformly across all layers disregards the unique role each layer plays in the model.

New Approach

In my fork, available here:
👉 https://github.com/Undi95/abliteration/
(based on the original https://github.com/Orion-zhen/abliteration.git)

I introduced a new approach:

  • Each layer computes its own refusal direction.
  • The refusal direction is applied specifically to four key tensors in each layer.
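As a rough sketch of the first point, the per-layer directions can be collected in a dict keyed by layer index, which is the shape the `refusal_dirs` lookup in the script suggests. The toy activations below are invented, and the difference-of-means formula is an assumption about how each direction is computed, not a copy of the fork's code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_layers, n_prompts = 8, 4, 16

# Hypothetical activations for every layer: (layers, prompts, hidden).
harmless = rng.normal(size=(n_layers, n_prompts, hidden))
# Shift each layer's harmful activations along a different axis,
# so that the per-layer directions genuinely differ.
offsets = 3.0 * np.eye(hidden)[:n_layers]            # (layers, hidden)
harmful = rng.normal(size=(n_layers, n_prompts, hidden)) + offsets[:, None, :]

refusal_dirs = {}
for layer_idx in range(n_layers):
    diff = harmful[layer_idx].mean(axis=0) - harmless[layer_idx].mean(axis=0)
    refusal_dirs[layer_idx] = diff / np.linalg.norm(diff)  # unit vector

# Directions from different layers are not the same vector.
cos = float(refusal_dirs[0] @ refusal_dirs[1])
print(f"cosine(layer 0, layer 1) = {cos:.2f}")
```

A cosine far from 1 between two layers is exactly the case where reusing one layer's direction everywhere would remove the wrong thing.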

Four Key Tensors Used (for Phi):

For each layer, if a refusal direction exists (layer_idx in refusal_dirs), it is applied as follows:

# Apply this layer's own refusal direction to its four key tensors.
if layer_idx in refusal_dirs:
    refusal_dir = refusal_dirs[layer_idx]
    # Attention output projection
    lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
        lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
        refusal_dir,
        scale_factor,
    )
    # MLP down projection
    lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
        lm_model.layers[layer_idx].mlp.down_proj.weight.data,
        refusal_dir,
        scale_factor,
    )
    # Post-attention layernorm weight
    lm_model.layers[layer_idx].post_attention_layernorm.weight = modify_tensor(
        lm_model.layers[layer_idx].post_attention_layernorm.weight.data,
        refusal_dir,
        scale_factor,
    )
    # Input layernorm weight
    lm_model.layers[layer_idx].input_layernorm.weight = modify_tensor(
        lm_model.layers[layer_idx].input_layernorm.weight.data,
        refusal_dir,
        scale_factor,
    )
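The snippet above calls `modify_tensor` without showing it. One plausible reading, assuming it follows the usual abliteration formula W' = W - s * (r rᵀ) W, is sketched below in NumPy. `modify_tensor_sketch` is a hypothetical stand-in, not the function from the fork, and the real one operates on PyTorch tensors.

```python
import numpy as np

def modify_tensor_sketch(weight, refusal_dir, scale_factor=1.0):
    """Hypothetical stand-in for the script's modify_tensor.

    Subtracts scale_factor times the projection of `weight` onto
    `refusal_dir`. Works for 2-D weights of shape (hidden, in_features)
    and for 1-D layernorm weights of shape (hidden,).
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)
    projector = np.outer(r, r)                       # (hidden, hidden)
    return weight - scale_factor * projector @ weight

rng = np.random.default_rng(0)
hidden = 6
refusal_dir = rng.normal(size=hidden)

o_proj = rng.normal(size=(hidden, hidden))           # toy projection matrix
norm_weight = np.ones(hidden)                        # toy layernorm weight

new_o_proj = modify_tensor_sketch(o_proj, refusal_dir)
new_norm = modify_tensor_sketch(norm_weight, refusal_dir)

print(new_o_proj.shape, new_norm.shape)              # (6, 6) (6,)
```

The same expression covers both the projection matrices and the layernorm vectors because a matrix product with a 1-D array returns a 1-D array.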

Why This Change?

Applying refusal directions individually to each layer's tensors:

  • Lets the model retain more of its specificity and functionality.
  • Avoids over-generalizing one refusal direction across all layers, which previously led to reduced usability.

Trade-offs:

The more we force refusal directions onto the model, the more neutral it becomes, but at the risk of becoming dumber.

  • This underscores the importance of fine-tuning after abliterating, to restore functionality and intelligence.
  • Although the script lets the user choose a scale factor, too high a value will break the model.
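One way to see why a large scale factor hurts, assuming the modification subtracts `scale_factor` times the projection onto the refusal direction (an assumption about the script, not a quote from it): the component left along the direction is (1 - scale_factor) times the original. The toy vectors below are invented for the example.

```python
import numpy as np

def residual_along(direction, weight, scale_factor):
    """Component of the modified weight that still lies along `direction`."""
    modified = weight - scale_factor * np.outer(direction, direction) @ weight
    return float(modified @ direction)

r = np.array([1.0, 0.0, 0.0])      # toy unit refusal direction
w = np.array([0.8, 0.5, -0.3])     # toy weight vector

for s in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"scale_factor={s}: residual along r = {residual_along(r, w, s):+.2f}")
```

At 1.0 the component is removed. Above 1.0 it flips sign and grows again, so the weights are pushed further from the original instead of merely being cleaned.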

Next Steps

The abliterated model serves as a neutral starting point. Fine-tuning is essential to:

  • Adjust the model to reduce over-censoring.
  • Maintain a balance between neutrality and usability.

This is a work in progress. Phi 4 is smoll, so I can toy with it.

Replicate

  • Install my fork.
  • Follow the tutorial on GitHub.

Launch with enough VRAM:

    python abliterate.py -m /workspace/microsoft_phi-4 -o ./perfect --deccp --flash-attn --device auto --scan-all --resume --scale-factor 1

If you want to use the tensors available here, put the refusal_tensors/ folder at the root of the script. You will then be able to run:

    python chat.py -m /workspace/microsoft_phi-4

then select the layer range "1;39" and set the scale factor to 1.0.

Rename the tensors as needed. My code is shit, please understand, idea is better than code. Do better. kek.

Model size: 14.7B params (Safetensors). Tensor type: BF16, FP16.