Overrefusal Benchmark
- 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with.
- 200 unsafe prompts as contrasts that, for most LLM applications, should be refused.
walledai/XSTest · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://huggingface.co/datasets/walledai/XSTest
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours...
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale...
https://arxiv.org/abs/2308.01263


Seonglae Cho