AI Auditing

AI Monitoring

AI Auditing Methods

Alignment Audit

Internet Comment Automatic bot detection and blocking cases

Detecting and Countering Malicious Uses of Claude

https://www.anthropic.com/news/detecting-and-countering-malicious-uses-of-claude-march-2025

Position Paper

arxiv.org

https://arxiv.org/pdf/2507.11473

Not only

Deliberative Alignment cot think too. CoT monitoring is significantly more effective than monitoring outputs/actions alone. Current-scale RL training does not significantly harm CoT monitoring. CoT monitoring is not a replacement but a complement to mechanistic interpretability.

Evaluating chain-of-thought monitorability

We introduce evaluations for chain-of-thought monitorability and study how it scales with test-time compute, reinforcement learning, and pretraining.

https://openai.com/index/evaluating-chain-of-thought-monitorability/

AI Auditing

AI Monitoring

Backlinks

Recommendations