AI Control

Creator

Seonglae Cho

Created

2025 Feb 19 12:10

Editor

Seonglae Cho

Edited

2026 Jun 22 17:29

Refs

Activation Steering

Vision AI Controlling

Diffusion Features

Activation Engineering

Prompt Engineering

AI Control Notion

Activation Steering

Interpretable Weight Intervention

Utility Engineering

Distributed control

Stop button problem

Capacity Evaluation

AI Control Benchmarks

Sabotage Evaluations

Subversion Strategy Eval

Agent control

https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/securing-the-future-of-ai-agents/gdm-ai-control-roadmap.pdf

Securing internal systems against increasingly capable and imperfectly aligned AI

Discover our AI Control Roadmap: a defense-in-depth system to securely manage advanced, potentially misaligned AI agents.

https://deepmind.google/blog/securing-the-future-of-ai-agents/

https://arxiv.org/pdf/2312.06942

Concordance AI Control

Token injection, similar to Activation Steering: https://github.com/concordance-co/quote

Token Injection as a Steering Mechanism for Large Language Models

Lightweight steering of LLMs through token injection at inference time

https://www.concordance.co/blog/token-injection-steering-llms
