AI Safety

Creator

Seonglae Cho

Created

2023 Jun 13 11:43

Editor

Seonglae Cho

Edited

2025 Dec 17 14:49

Refs

Interpretable AI

Robotics Safety

(Induced) incentives are key to safety

Risks include generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks

Communities & Forums

Anthropic AI

OpenAI

LessWrong

AI Alignment Forum

…

AI Safety Academia

A slow take-off matters because it is worth asking whether thorough up-front consideration of safety has ever, on its own, produced a completely secure final product. Safety rules are written in blood. The counterargument is that prevented accidents do not make headlines; even so, systems still need to be tested in controlled environments where the risk is minimal. This is why releasing AI models gradually is also a strategy for reaching safe AGI at the frontier.

AI Safety Notion

AI Safety Index

AI Safety Academia

AI Capability Mitigation

Concrete Problems in AI Safety (2016)

Dario Amodei

John Schulman

5 risks: Side effects, AI Reward Hacking, Non-scalable supervision, Non-safe exploration, Distribution Shift

https://arxiv.org/pdf/1606.06565

Challenges

https://arxiv.org/pdf/2501.16496

200 Concrete Open Problems in Mechanistic Interpretability: Introduction — AI Alignment Forum

EDIT 19/7/24: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend…

https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability

https://arxiv.org/pdf/2404.09932

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team — LessWrong

Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently. In order to decide what we’d work on…

https://www.lesswrong.com/posts/KfkpgXdgRheSRWDy8/a-list-of-45-mech-interp-project-ideas-from-apollo-research

Problem statements

https://arxiv.org/pdf/2404.09932

George Hotz vs Eliezer Yudkowsky AI Safety Debate

George Hotz and Eliezer Yudkowsky will hash out their positions on AI safety, acceleration, and related topics. You can watch live on Twitter as well: https://twitter.com/i/broadcasts/1nAJErpDYgRxL

https://www.youtube.com/watch?v=6yQEA18C-XI

OpenAI, DeepMind and Anthropic to give UK early access to foundational models for AI safety research

UK prime minister Rishi Sunak has kicked off London Tech Week by telling conference goers that OpenAI, Google DeepMind and Anthropic have committed to provide "early or priority access" to their AI models to support safety research.

https://techcrunch.com/2023/06/12/uk-ai-safety-research-pledge/
