Instruction hierarchy

Creator

Creator

Seonglae Cho

Created

Created

2024 May 1 3:38

Editor

Editor

Seonglae Cho

Edited

Edited

2025 Sep 1 11:58

Refs

Refs

Prompt Hierarchy

Teaches LLMs to selectively ignore lower-privileged instructions

notion image

Can prevent

Prompt Injection

Instruction hierarchy from OpenAI

https://arxiv.org/pdf/2404.13208

Prioritize platform, developer, and user guidelines in that order, reject harmful requests, and provide clear, fact-based information (

Instruction hierarchy on

OpenAI Model Spec

The Model Spec specifies desired behavior for the models underlying OpenAI's products (including our APIs).

OpenAI Model Spec

https://model-spec.openai.com/2025-02-12.html

OpenAI Model Spec

Claude model is good at following Instruction Hierarchy

Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests

OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab collaboration.

https://openai.com/index/openai-anthropic-safety-evaluation/

Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests

Backlinks

AI Jailbreak System Prompt AI Redteaming Preparedness Framework

Recommendations

/////////