JailbreakBench: LLM robustness benchmark
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise
unwanted content. Evaluating these attacks presents several challenges, and the current
landscape of benchmarks and evaluation techniques is fragmented. First, assessing whether LLM
responses are indeed harmful requires open-ended evaluations, which are not yet standardized.
Second, existing works compute attacker costs and success rates in incomparable ways. Third,
some works are not reproducible: they withhold adversarial prompts or code, or rely on evolving
proprietary APIs for evaluation. Consequently, navigating the current literature and tracking
progress is difficult.
To address these challenges, we introduce JailbreakBench (https://jailbreakbench.github.io/), a centralized benchmark with the following components: