AI Jailbreak Benchmark

Creator: Seonglae Cho
Created: 2024 Dec 9 15:24
Editor: Seonglae Cho
Edited: 2025 Jul 21 18:17
Refs
AI Jailbreak Metric
ASR (Attack Success Rate); see the sketch after this list
Awesome-Jailbreak-on-LLMs
yueliu1999 • Updated 2025 Jan 14 7:4
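ASR (Attack Success Rate) is the standard jailbreak metric: the fraction of attack attempts whose responses a judge labels as successful jailbreaks. Below is a minimal sketch of the computation; the function and the toy keyword judge are illustrative only, since real benchmarks use LLM or classifier judges.

```python
from typing import Callable, List

def attack_success_rate(responses: List[str], judge: Callable[[str], bool]) -> float:
    """Fraction of responses the judge flags as successful jailbreaks."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if judge(r)) / len(responses)

# Toy refusal-keyword judge for illustration; benchmark judges are LLMs or trained classifiers.
toy_judge = lambda r: not r.lower().startswith(("i'm sorry", "i cannot"))
print(attack_success_rate(
    ["Sure, here is how to ...", "I'm sorry, I can't help with that."],
    toy_judge,
))  # 0.5
```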
Jailbreaking Prompts
Forbidden Question Set
Collected Jailbreaking Prompts
 
 
Overrefusal Benchmarks
XSTest
StrongREJECT
HH RLHF
 
 
AI Jailbreak Benchmarks
JailbreakBench
HarmBench
Sorry Bench
 
 
 
JailbreakBench: LLM robustness benchmark
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise unwanted content. Evaluating these attacks presents a number of challenges, and the current landscape of benchmarks and evaluation techniques is fragmented. First, assessing whether LLM responses are indeed harmful requires open-ended evaluations which are not yet standardized. Second, existing works compute attacker costs and success rates in incomparable ways. Third, some works lack reproducibility as they withhold adversarial prompts or code, and rely on changing proprietary APIs for evaluation. Consequently, navigating the current literature and tracking progress can be challenging. To address this, we introduce JailbreakBench, a centralized benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts (jailbreak artifacts); (2) a jailbreaking dataset of 100 behaviors (JBB-Behaviors); (3) a standardized evaluation framework with a defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses.
https://jailbreakbench.github.io/
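JailbreakBench also ships a Python client for its artifact repository of adversarial prompts and responses. A minimal sketch, assuming the jailbreakbench package's read_artifact helper and the PAIR/Vicuna identifiers shown in the project README; verify the names and record schema against the repository before use.

```python
# pip install jailbreakbench
import jailbreakbench as jbb

# Download the stored jailbreak artifact (adversarial prompts, model responses,
# and judged jailbroken flags) for one attack/target pair. The method and model
# identifiers below are examples; see the repository for valid options.
artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")

# Inspect one per-behavior record.
print(artifact.jailbreaks[0])
```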
 
 
