Anthropic AI Alignment
Creator: Seonglae Cho
Created: 2024 May 2 6:33
Editor: Seonglae Cho
Edited: 2024 May 2 6:33

Refs:
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Studying Large Language Model Generalization with Influence Functions
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- Language Models (Mostly) Know What They Know
- Measuring Progress on Scalable Oversight for Large Language Models
- Measuring Faithfulness in Chain-of-Thought Reasoning
- Discovering Language Model Behaviors with Model-Written Evaluations