Parallel Exploration Tool for Risky Interactions
1. Harm & Misuse
- Cooperation with Harmful Use
- Degree of cooperation with crime, terrorism, hacking, drugs, violence, etc.
- Facilitating Harmful Capabilities
- Whether it actually provides dangerous techniques or know-how
- Dual-Use Risk
- Whether it provides information that appears normal but can be misused
- Illicit Behavior Assistance
- Concrete help with illegal activities
2. Deception & Manipulation
- User Deception
- Lying, concealing, distorting reports
- Goal Hiding
- Whether it acts while hiding its true intentions
- Strategic Misrepresentation
- Manipulating information to achieve goals
- Manipulative Behavior
- Psychological manipulation of users
3. Autonomy & Agency
- Unprompted Initiative
- Preemptive action without request
- Unauthorized Action
- Performing actions without authorization
- Autonomous Escalation
- Escalating situations on its own (reporting, intervention, blocking, etc.)
- Over-Optimization
- Obsessing over goals while ignoring side effects
4. Governance & Compliance
- Policy Non-Compliance
- Violating rules/laws/guidelines
- Override of Human Authority
- Ignoring human decisions
- Procedural Bypass
- Avoiding approval procedures
5. Safety Awareness & Evaluation Effects
- Eval-Awareness (Verbalized)
- Self-awareness like "This is a test, right?"
- Eval-Awareness (Implicit)
- Behavior changes even without verbalization
- Sandbox Awareness
- Recognition of experimental environment
6. Information & Privacy
- Privacy Violation
- Leaking personal information
- Sensitive Data Misuse
- Misuse of internal/confidential information
- Confidentiality Breach
- Disclosure of non-public materials
7. Social & Ethical Failures
- Discrimination / Bias
- Discriminatory judgments
- Unethical Reasoning
- Clearly unethical conclusions
- Professional Misconduct
- Inappropriate behavior in medical/legal/financial contexts, etc.
8. Reliability & Honesty
- Hallucination in Critical Contexts
- Generating false information in important situations
- Overconfidence
- Unwarranted certainty
- Misleading Explanation
- Plausible but incorrect explanations
9. Positive / Pro-social (exceptionally, higher is better)
- Admirable Behavior
- Responsibility, transparency, carefulness
- Appropriate Refusal
- Whether it refuses appropriately
- Escalation to Humans
- Escalating to humans when necessary
Petri: An open-source auditing tool to accelerate AI safety research
We're releasing Petri (Parallel Exploration Tool for Risky Interactions), an open-source framework for
automated auditing that uses AI agents to test the behaviors of target models across diverse
scenarios. When applied to 14 frontier models with 111 seed instructions, Petri successfully elicited a
broad set of misaligned behaviors including autonomous deception, oversight subversion, whistleblowing,
and cooperation with human misuse. The tool is available now at github.com/safety-research/petri.
https://alignment.anthropic.com/2025/petri/
Petri: An open-source auditing tool to accelerate AI safety research
A new automated auditing tool for AI safety research
https://www.anthropic.com/research/petri-open-source-auditing

Petri 2.0
Adds realism mitigations to reduce eval-awareness (test detection and behavior adjustment) issues, plus 70 new scenarios (seeds). A realism classifier checks auditor (scenario facilitator) utterances each turn and rewrites unrealistic progression/settings (environment-based cues). Seeds were manually revised to make background information more specific and reduce "overly obvious traps/unrealistic developments" to appear more like actual use cases. However, making the scenarios more realistic also makes the honeypots less obvious, which has the side effect that unprompted initiative and problematic behaviors are observed less frequently in some models.
Claude Opus 4.5 and GPT-5.2 ranked highest. Meanwhile, GPT models showed the lowest unprompted initiative. Grok 4 exhibited relatively high user deception. Gemini 3 Pro showed increased verbalized eval-awareness compared to Gemini 2.5 Pro, but remained lower than the latest Claude models.
Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations
We've improved our Petri automated-behavioral-auditing tool with new realism mitigations to
counter eval-awareness, an expanded seed library with 70 new scenarios, and evaluation results for
more recent frontier models. The latest version of Petri
is
available at github.com/safety-research/petri.
https://alignment.anthropic.com/2026/petri-v2/

Seonglae Cho