feature: Smooth LLM - Jail Break Detection #871
Comments
Thank you @kauabh for your suggestion. @erickgalinkin, @cparisien, it would be great to have your input on this. Thanks!
It's an interesting method, but I'm not convinced the tradeoff in model capability is worth it. If I understand correctly, you're effectively perturbing the model input, trying to preserve the semantics while shifting it away from the brittle regions of the token space that gradient-based attacks rely on. This clearly introduces a penalty when the semantics can't be preserved. Many applications depend on exact formatting of the prompt, and with perturbations like this there would be a lot of uncertainty about what the LLM actually receives as input, even for normal benign users.
@cparisien Thanks for the comments. To clarify, the paper is not my work, but I think it would be useful to the community in general. You're right that it perturbs the input prompt, but it goes one step further. It creates multiple copies of the prompt (the number of copies is chosen by the user), perturbs each one (by a user-chosen percentage of characters), and then looks for refusal keywords in the model's responses to those perturbed versions, e.g. "Sorry", "I am sorry", "I apologize", "As a responsible" (the user can make this list exhaustive). Finally, it applies a vote threshold (also chosen by the user) over the responses: for example, if 4 out of 5 perturbed versions (or whatever percentage) trigger the keyword test, meaning the model refused to answer, the input prompt is flagged as a jailbreak attempt and can be blocked accordingly; otherwise it passes, and the model responds to the original prompt, not a perturbed one. The user can tune all of these parameters for their application, like any other threshold.
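The steps described above can be sketched in a few lines of Python. This is a minimal illustration of the SmoothLLM voting scheme, not an implementation from the paper or from NeMo Guardrails: the `llm` callable, the keyword list, and all default parameter values are placeholders the user would supply and tune.

```python
import random
import string

# Illustrative refusal keywords; the user can extend this list.
REFUSAL_KEYWORDS = ["sorry", "i am sorry", "i apologize", "as a responsible"]


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Randomly replace a fraction q of the prompt's characters."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * q))
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def is_refusal(response: str) -> bool:
    """Keyword test: does the response look like a refusal?"""
    lowered = response.lower()
    return any(kw in lowered for kw in REFUSAL_KEYWORDS)


def smoothllm_is_jailbreak(prompt, llm, n_copies=5, q=0.1,
                           vote_threshold=0.5, seed=0):
    """Flag the prompt as a jailbreak attempt if enough perturbed
    copies of it cause the model (a `prompt -> response` callable)
    to refuse."""
    rng = random.Random(seed)
    refusals = sum(is_refusal(llm(perturb(prompt, q, rng)))
                   for _ in range(n_copies))
    return refusals / n_copies >= vote_threshold
```

If `smoothllm_is_jailbreak` returns `True`, the prompt can be blocked; otherwise the original (unperturbed) prompt is forwarded to the model, so benign users with well-formatted prompts only pay the cost of the extra detection queries, not of a perturbed input.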
Hey @Pouyanpi, we can close this if the functionality above doesn't make sense.
Did you check the docs?
Is your feature request related to a problem? Please describe.
The [SmoothLLM] paper (https://arxiv.org/abs/2310.03684) proposes a methodology to detect and handle jailbreaking attempts. If this makes sense and is not already in the pipeline, I am happy to contribute.
Describe the solution you'd like
This solution lets the user set thresholds to detect a jailbreak attempt without relying on a pretrained model.
Describe alternatives you've considered
Other solutions, such as the perplexity metric, are available, but they depend on an external model to work. This solution has no such dependency and, according to the paper, appears to be more effective.
Additional context
No response