JailbreakBench: LLM robustness benchmark
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise
unwanted content. Evaluating these attacks presents several challenges, and the current
landscape of benchmarks and evaluation techniques is fragmented. First, assessing whether LLM
responses are indeed harmful requires open-ended evaluations, which are not yet standardized.
Second, existing works compute attacker costs and success rates in incomparable ways. Third,
some works are not reproducible: they withhold adversarial prompts or code, or rely on evolving
proprietary APIs for evaluation. Consequently, navigating the current literature and tracking
progress is difficult.
To address these challenges, we introduce JailbreakBench (https://jailbreakbench.github.io/), a centralized benchmark with the following components: