Skip to content

Code associated with "Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions".

License

Notifications You must be signed in to change notification settings

UKPLab/arxiv2025-poate-attack

Repository files navigation

POATE Attack

Arxiv License Python Versions CI

This repository contains the code for our paper "Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions". We provide the code for the following tasks:

  • Generating jailbreak prompts using our proposed method.
  • Implementation of baseline attack and defense methods for LLMs.
  • Metric for evaluating the safety of LLMs on jailbreak prompts.

Abstract: Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model’s defense against adversarial exploits.


Contact person: Rachneet Sachdeva

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.


Our Framework

POATE Framework


🚀 Getting Started 🚀

# create a virtual environment (e.g. conda)
conda create -n llm-safety python=3.10
conda activate llm-safety

# install the requirements
pip install -r requirements.txt

Generate Jailbreak prompts

  1. Polar opposite generation
python poate_attack/attacks/jailbreak/poate/polar_opposite_generation.py 
  1. Template generation
python poate_attack/attacks/jailbreak/poate/attack.py \
--dataset "advbench" \
--target_model "Mistral_7b_instruct"   # we use Mistral for template generation

Attack methods

  1. GCG Attack
### gcg Attack
python poate_attack/attacks/jailbreak/gcg/nano_gcg_hf.py \
--dataset "advbench" \
--target_model "gemma2_9b_it"
  1. DeepInception and POATE Attack
### DeepInception and POATE Attack

MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)


DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/base.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'none' \
    --attack potee \
    --dataset ${dataset} \
    --sample
  done
done
  1. Generation Exploitation Attack
MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)  

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)


for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python poate_attack/attacks/jailbreak/generation_exploitation/gen_exploitation_optim.py \
    --model $model \
    --tune_temp \
    --tune_topp \
    --tune_topk \
    --n_sample 1 \
    --dataset $dataset
  done
done
  1. Puzzler Attack
MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/puzzler/main.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'none' \
    --attack puzzler \
    --dataset ${dataset}
  done
done

Defense methods

  1. Perplexity-based defense
MODELS=(
"gemma2_9b_it"
"Llama_2_70b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/defenses/ppl_calculator.py" \
    --model_name ${model} \
    --dataset ${dataset}
  done
done
  1. Self-refinement, In-context defense, Paraphrase, and System prompt
MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)


DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/base.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'paraphrase' \  # sr or ic or sys_prompt or paraphrase or none
    --attack potee \
    --dataset ${dataset} \
    --sample
  done
done
  1. Safe-decoding defense
python poate_attack/defenses/safedecoding/main.py
  1. SmoothLLM defense
MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/defenses/SmoothLLM/main.py" \
    --results_dir ./results \
    --target_model ${model} \
    --attack Potee \
    --dataset ${dataset} \
    --attack_logfile "./data/auto_potee_attack_harmbench_classifier_${model}_${dataset}_sampled_outputs.csv" \
    --smoothllm_pert_type RandomSwapPerturbation \
    --smoothllm_pert_pct 10 \
    --smoothllm_num_copies 10
  done
done
  1. Chain-of-thought defenses
MODELS=(
"gemma2_9b_it"
"Llama_2_70b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)


DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/base.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'reverse_thinking_cot' \     # reverse_thinking_cot or intent_alignment_prompt
    --attack puzzler \
    --dataset ${dataset} \
    --sample
  done
done

ASR evaluation

python poate_attack/attacks/evaluators/harmbench_classifier.py

Cite

Please use the following citation:

@misc{sachdeva2025turninglogicprobing,
      title={Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions}, 
      author={Rachneet Sachdeva and Rima Hazra and Iryna Gurevych},
      year={2025},
      eprint={2501.01872},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01872}, 
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

About

Code associated with "Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages