Texonom
Texonom
/
Engineering
Engineering
/Data Engineering/Artificial Intelligence/AI Risk/AI Alignment/AI Safety/
AI Auditing
Search

AI Auditing

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Apr 27 18:1
Editor
Editor
Seonglae ChoSeonglae Cho
Edited
Edited
2026 Jan 9 16:0
Refs
Refs
AI Hacking
SAE Probing
Emergent Misalignment
AI Observability
AI scheming
AI sandbagging

AI Monitoring

AI Auditing Methods
Activation Auditing
CoT Auditing
Alignment Audit
AI Governance
AI Confession
 
 
 
 
Internet Comment
Automatic bot detection and blocking cases
Detecting and Countering Malicious Uses of Claude
Detecting and Countering Malicious Uses of Claude
https://www.anthropic.com/news/detecting-and-countering-malicious-uses-of-claude-march-2025
Detecting and Countering Malicious Uses of Claude
Position Paper
arxiv.org
https://arxiv.org/pdf/2507.11473
Not only
Deliberative Alignment
cot think too. CoT monitoring is significantly more effective than monitoring outputs/actions alone. Current-scale RL training does not significantly harm CoT monitoring. CoT monitoring is not a replacement but a complement to mechanistic interpretability.
Evaluating chain-of-thought monitorability
We introduce evaluations for chain-of-thought monitorability and study how it scales with test-time compute, reinforcement learning, and pretraining.
Evaluating chain-of-thought monitorability
https://openai.com/index/evaluating-chain-of-thought-monitorability/
Evaluating chain-of-thought monitorability
 
 

Backlinks

AI AlignmentDefense JailbreakingAI schemingAI ServerAI AlignmentLatent Prompting

Recommendations

Texonom
Texonom
/
Engineering
Engineering
/Data Engineering/Artificial Intelligence/AI Risk/AI Alignment/AI Safety/
AI Auditing
Copyright Seonglae Cho