arxiv.org
https://arxiv.org/abs/2406.12045
τ²-bench
Evaluates a dual-control environment where both agents and users use tools to modify shared world state
τ-bench: A Benchmark for...
Existing benchmarks for language agents do not set them up to interact with human users or follow domain-specific rules, both of which are vital to safe and realistic deployment. We propose...
https://openreview.net/forum?id=roNSXZpUDN

Seonglae Cho