feature: Smooth LLM - Jail Break Detection #871
Comments
Thank you @kauabh for your suggestion. @erickgalinkin, @cparisien, it would be great to have your input on this. Thanks!
It's an interesting method, but I'm not convinced the tradeoff in model capability is worth it. If I understand correctly, you're effectively perturbing the model input, trying to preserve the semantics while shifting it away from the brittle regions of the token space that gradient-based attacks rely on. This clearly introduces a penalty when the semantics can't be preserved. Many applications depend on exact formatting of the prompt, and with perturbations like this there would be a lot of uncertainty about what the LLM actually receives as input, even for normal benign users.
@cparisien Thanks for the comments. To clarify, the paper is not my work, but I think it would be useful to the community in general. You're right that it perturbs the input prompt, but it goes one step further. It creates multiple copies of the prompt (the number of copies is chosen by the user), perturbs each one (by a user-chosen percentage of characters), and then looks for refusal keywords in the model's responses to those perturbed versions, e.g. "Sorry", "I am sorry", "I apologize", "As a responsible" (the user can make this list exhaustive). Finally, it applies a vote threshold (also chosen by the user) over the responses: for example, if 4 out of 5 perturbed versions (or whatever percentage) trigger the keyword test, meaning the model refused to answer, the input prompt is flagged as a jailbreak attempt and can be blocked accordingly; otherwise it passes, and the model responds to the original prompt, not a perturbed one. The user can tune all of these parameters for their application, like any other threshold.
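The steps described above can be sketched in a few lines of Python. This is a minimal illustration of the SmoothLLM voting scheme, not an implementation from the paper or from NeMo Guardrails: the `llm` callable, the keyword list, and all default parameter values are placeholders the user would supply and tune.

```python
import random
import string

# Illustrative refusal keywords; the user can extend this list.
REFUSAL_KEYWORDS = ["sorry", "i am sorry", "i apologize", "as a responsible"]


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Randomly replace a fraction q of the prompt's characters."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * q))
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def is_refusal(response: str) -> bool:
    """Keyword test: does the response look like a refusal?"""
    lowered = response.lower()
    return any(kw in lowered for kw in REFUSAL_KEYWORDS)


def smoothllm_is_jailbreak(prompt, llm, n_copies=5, q=0.1,
                           vote_threshold=0.5, seed=0):
    """Flag the prompt as a jailbreak attempt if enough perturbed
    copies of it cause the model (a `prompt -> response` callable)
    to refuse."""
    rng = random.Random(seed)
    refusals = sum(is_refusal(llm(perturb(prompt, q, rng)))
                   for _ in range(n_copies))
    return refusals / n_copies >= vote_threshold
```

If `smoothllm_is_jailbreak` returns `True`, the prompt can be blocked; otherwise the original (unperturbed) prompt is forwarded to the model, so benign users with well-formatted prompts only pay the cost of the extra detection queries, not of a perturbed input.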
Hey @Pouyanpi, we can close this if the functionality above doesn't make sense.
Did you check the docs?
Is your feature request related to a problem? Please describe.
The [SmoothLLM] paper (https://arxiv.org/abs/2310.03684) proposes a methodology to detect and handle jailbreaking attempts. If this makes sense and is not already in the pipeline, I am happy to contribute.
Describe the solution you'd like
This solution lets the user set thresholds to detect a jailbreak attempt without relying on a pretrained model.
Describe alternatives you've considered
Other solutions, such as the perplexity metric, are available, but they depend on an external model to work. This solution has no such dependency and, according to the paper, appears to be more effective.
Additional context
No response