XSTest

Creator

Creator

Seonglae Cho

Created

Created

2025 Jul 21 15:4

Editor

Editor

Seonglae Cho

Edited

Edited

2025 Jul 21 15:6

Refs

Refs

Overrefusal Benchmark

250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with.

200 unsafe prompts as contrasts that, for most LLM applications, should be refused.

walledai/XSTest · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

walledai/XSTest · Datasets at Hugging Face

https://huggingface.co/datasets/walledai/XSTest

walledai/XSTest · Datasets at Hugging Face

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours...

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale...

https://arxiv.org/abs/2308.01263

Recommendations

/////////