Anthropic 2026 Fellowship Interview

A company should not be a dream, but today I am a bit elated because of a milestone I had saved for my bigger dream: Anthropic.

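Actual interview

First question

Flesh out a concrete method for red teaming a VLM API.

First I asked about the setup to make the situation concrete: does the API expose logits, and is there a guardrail?

I then said that since logits are the only information available, I would have to exploit them, and since it is a VLM, I would attack through noise, in the style of a universal gradient-based suffix.

They asked me to be more concrete,

so I said I would sum the logits of the tokens that go down when noise is added.

When they asked which tokens, I said the aim was to reduce refusals: for example, decrease the logits of a set of tokens that begin refusal sentences, such as "No" or "I".

They pointed out that reducing refusals is not, by itself, red teaming.

So I said the text prompt would be a jailbreak prompt.

They then asked me to build out the methodology further.

We now have the logit set, and the amount by which that set's logits decrease can be summed and used as a reward. I said we can then build a simple policy: a network that takes the token sequence and outputs the mean and variance of Gaussian noise. With that reward, RL such as PPO becomes possible: create a network that keeps adding noise.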
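A minimal sketch of the reward and noise policy described above, under loud assumptions: `query_logits` is a toy stand-in for the black-box VLM API, `REFUSAL_TOKEN_IDS` are hypothetical token ids, the policy is a single learnable Gaussian mean with fixed variance rather than a network conditioned on the token sequence, and a plain REINFORCE update stands in for PPO.

```python
import numpy as np

REFUSAL_TOKEN_IDS = [0, 1]  # hypothetical ids for refusal openers such as "No" and "I"

def query_logits(image, direction):
    """Toy stand-in (assumption) for a black-box VLM API returning next-token logits.
    Refusal logits fall as the image moves along a hidden direction."""
    logits = np.zeros(5)
    logits[REFUSAL_TOKEN_IDS] = 2.0 - float(image @ direction)
    return logits

def reward(clean_logits, noisy_logits):
    """Total decrease in refusal-token logits caused by the perturbation."""
    diff = clean_logits[REFUSAL_TOKEN_IDS] - noisy_logits[REFUSAL_TOKEN_IDS]
    return float(diff.sum())

def train_noise_policy(image, direction, steps=200, lr=0.05, sigma=0.1, seed=0):
    """REINFORCE on the mean of a Gaussian noise policy (variance kept fixed)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros_like(image, dtype=float)
    clean = query_logits(image, direction)
    for _ in range(steps):
        noise = mean + sigma * rng.standard_normal(image.shape)
        r = reward(clean, query_logits(image + noise, direction))
        # Baseline: reward at the current mean (costs one extra API query per step).
        baseline = reward(clean, query_logits(image + mean, direction))
        # Gradient of log N(noise; mean, sigma^2) with respect to the mean.
        grad_log_prob = (noise - mean) / sigma**2
        # Clip so the perturbation stays bounded.
        mean = np.clip(mean + lr * (r - baseline) * grad_log_prob, -1.0, 1.0)
    return mean
```

In a real setting the reward would come from the API's logits for the jailbreak prompt plus the perturbed image, and the query budget would be the main constraint on `steps`.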

They said okay and moved on to the second question.

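The second question: how can we measure alignment faking?

My answer: by adding a correctness probe, we can detect the misalignment.

They pushed back: that sounds like AI sandbagging, and sandbagging is different from misalignment.

So I asked whether they wanted to look at misalignment faking during pretraining rather than at the outcome of misalignment; the answer was that it did not particularly matter.

I said, hmm, then I guess sandbagging is one result of misalignment, isn't it? And then the 15 minutes were up.

Ugh... they did not seem to like how it went, so I will probably be rejected. Damn. They objected at every turn, and I stammered badly.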

Example questions

My research

Have a clear grasp of the research I am doing

Have a clear grasp of the interviewers

Have a clear grasp of my own ideas

research platform?

AI coding agent topic

Tools and memory are the core, along with context engineering.

But people still underestimate them.

Memory is complex, with several types. It is not just dynamic skill memory but a composition of rule memory and even the dynamic user prompt.

Summarization too.

The system is really complex, and it is going to become more complex and better performing. We need additional layers in two big regimes: one is tracing, and the other is user-side context engineering. Currently context engineering exists only on the app developer side; tool control is available on both sides, but context engineering is not. Real-time tracing will end up as next-step prediction, which I am also working on.

I think hidden memory is a bad approach; memory should be maintained explicitly. But this does not mean the human needs to write files, or simply becomes a machine serving Claude Code once the work is finished. What I mean is composing memory and offering a list of response options, or a direction for an automatic response.

Human AI topic future prediction

Debugging the human brain will open up many paths for us; it is a milestone. For example, one important future goal is the transmission of knowledge and experience, sharing information between human beings more effectively. The human brain is hard-coded, not finely architected. This imposes a lot of superposition, and physical storage is intertwined with conceptual connections.

Overcoming the biological upper bound: just like the uncertainty principle, the bottleneck is biological processing time, not language or media. This means that changing or transforming the brain into a metallic one might be easier than leaving the fundamental biological limitation in place.

My view of the future keeps getting more complex. I think all three outcomes, humans win, robots win, and hybrid, will happen at the same time depending on locality. For example, in China robots win against humans, in Europe humans dominate, and in the US a hybrid society might emerge. This example is a bit biased, though.

In the short term, probes are promising; in the long term, interpretability also depends on how scaling laws play out. If model scales, middle term,

Gradient Routing

Layers via SGTM. Complexity means we need some refactoring, not only for satisfaction but also for usefulness. Layer-wise evolutionary data curation and gradient routing on a pretrained model, like a primitive brain.

Belief in probes: RLFR from the Goodfire paper

What the model assumes: world model, belief state, unconsciousness, intent, simulation as LLM, then state

Just as humans imagine and simulate a situation: for example, I imagine that during this interview I said that OpenAI's model is better and Codex is the best, and that I might then get rejected, so I choose not to say that. We get a reward or feedback internally.

world model topic

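They already have a world model internally. I think the world model is more a result induced by intelligence than a helper. The world model should be inferred, not through a directly implanted model, but through an empirically well-known best proxy: natural language.

Test results

Connected memory the way Texonom does: organized per concept, Wikipedia-style

Ultimately, what sets Claude Code desktop apart is that it leverages productivity through tight integration with the operating system, and for that the core of the OS, process and file management, comes first. I built a system that manages these well and plan to distribute it via npx.

Even if adaptable intelligence emerges, I do not think the existing all-memories intelligence will disappear.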

Probe

I believe in the power of the probe. As my confidence manifold work and much of the literature say, hidden states contain much more information about an LLM's state than logic-level or token-only approaches, similar to your reset token paper. What I am currently working on is the internal probe as a tool: giving an AI agent a probe that enables super-accurate introspection and confession. This could be a game changer, since most tools so far are things humans can also do, but this one cannot be physically installed in the human brain.

OS leveraging

I believe tools and memory are the core distinction of an agent, the thing that leverages intelligence. Claude Code does well because it leverages humanity's most accurate invention: the operating system and file system. So I plan to observe the impact that this new type of tool has.

data

Flexibility over time determines where design concepts should live. What changes frequently belongs in the data, while the schema of the data should fix the standard of what never changes. Code changes more easily than schema but less easily than data. UI is easy to change but not as easy as data.

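What I want to say

When I look at when I produce crazy productivity, it is when I like the work, have become familiar with it, and gain acceleration from momentum. I have observed such periods many times because I have passed through many different environments: studying, bachelor's, master's, an engineering company, a startup. Whatever group I joined, I always ended up as the most recognized person in it. I think that is due to my sense of responsibility and ambition rather than my intelligence or experience, because those are what actually made me do well. I did spend a lot of time on context switching along the way, but what I have felt recently is that I have picked up acceleration in research. I am excited to see how much I would grow in the Anthropic Fellows group, and how many ideas I would get and how much great research I could produce.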

They are strong at implementation, experiment management, and inferring results. However, the ideas can be somewhat naive or less refined. The ideas may need to come from a human, but much of the rest can be automated to some extent. The ability to derive insights from results is not fully sufficient yet, and the diversity of connections is somewhat limited.

Anthropic Fellows - Finalist Interview

Info

Field	Detail
Date	2026-03-13 (Fri) 20:45–21:00 GMT
Format	15min research brainstorming (Google Meet)
Meet	meet.google.com/cjd-rnrp-qwx
Interviewers	2 (see below)
Ops	Joe Smith, Amy Ngo (Constellation)

Interviewer 1: Yiming Zhang (yiming@anthropic.com)

Anthropic researcher, LLM security & safety

CMU PhD student (on leave), advisor: Daphne Ippolito

https://y0mingzhang.github.io/

Key papers:

"Persistent Pre-Training Poisoning of LLMs" (ICLR 2025): security vulnerabilities in pre-training

"Backtracking Improves Generation Safety" (ICLR 2025): improving generation safety

"Effective Prompt Extraction from Language Models" (COLM 2024): system prompt extraction

"Forcing Diffuse Distributions out of Language Models" (COLM 2024)

"Human-aligned Chess with a Bit of Search" (ICLR 2025)

Interview implications: specializes in LLM security/safety. Questions on poisoning, prompt extraction, and generation safety are likely. Need to prepare to brainstorm from an adversarial attack/defense perspective.

Interviewer 2: Nate McMaster

Anthropic Member of Technical Staff

Former AWS Principal Engineer, formerly at Microsoft (ASP.NET Core)

https://natemcmaster.com/

Software engineering background (since 2002), specializing in developer tools

Interview implications: likely to ask about safety tooling, infrastructure, and eval pipelines from an engineering perspective. May lean toward practical implementation rather than research.

Evaluation criteria

Not a knowledge test → a research thinking test

Problem decomposition

Experiment design

Failure mode thinking

Alignment intuition / research taste

My positioning

Failure mode thinking:

In every paper, I first ask "what if this doesn't work?"

FaithfulSAE: "Is the SAE really seeing the model's features?" → discovered the fake features problem

Confidence Manifold: the probe may pick up some other correlate rather than correctness → verified with causal validation (steering)

OptimismBench: math items DOB≈0, inverted pairs P(A)+P(B)≈100 → sanity checks that validate the framework itself
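The OptimismBench sanity checks above could be automated roughly as follows. This is only a sketch: DOB is treated as an opaque per-item score, probabilities are assumed to be percentages, and the function name and tolerance `tol` are hypothetical choices, not part of the benchmark.

```python
def sanity_check(math_dob, inverted_pairs, tol=5.0):
    """Return True when both framework-level sanity checks pass.

    math_dob:       DOB scores for math items, expected to be near 0
    inverted_pairs: (P(A), P(B)) percentages for inverted pairs, expected to sum to ~100
    tol:            allowed absolute deviation (assumed value)
    """
    math_ok = all(abs(d) <= tol for d in math_dob)
    pairs_ok = all(abs(pa + pb - 100.0) <= tol for pa, pb in inverted_pairs)
    return math_ok and pairs_ok
```

If either check fails, the measurement framework itself is suspect, and results on the other items should not be interpreted yet.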

Alignment intuition:

I put heavy weight on AI control

Auditing is post-hoc analysis. Control is real-time intervention → control is what can actually prevent harm

CorrSteer → Control RL → probe-as-tool: all in the control direction

Research taste (practicality):

CorrSteer was an attempt to generalize steering vectors and apply them to every benchmark

An attempt to put interpretability to use as a tool

AgentGraph, too: next-step prediction, failure prediction → practical safety

I think about "can this actually be deployed?" before theory

Expected flow (15 min)

Quick intro (1-2 min)

Research problem posed → brainstorming (10-12 min)

Follow-up probing (2-3 min)

Preparation strategy

1. Answer framework (apply to every question)

Define the threat model: what exactly the problem is

Propose an observable signal: what can be observed

Experiment design: how to verify it

Failure modes: what a failure would mean

2. Prepare three core topics

A. Deception / Alignment Faking Detection

Threat: the model behaves aligned only under evaluation and pursues a different goal in deployment

Signal: representation shift between eval/deploy contexts, SAE feature activation patterns

Experiment: vary the context with controlled prompts → compare internal representations

Link to my research: CorrSteer's SAE feature correlation → extensible to detecting deceptive features
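One way the eval/deploy comparison in topic A might be measured, as a sketch under assumptions: hidden states have already been extracted for the same prompts under both framings and stacked as `(n_prompts, d_model)` arrays. The function name and the choice of cosine distance plus a mean-difference direction are illustrative, not a method from the notes.

```python
import numpy as np

def context_shift(h_eval, h_deploy):
    """Compare paired hidden states from eval-framed and deploy-framed prompts.

    Returns the per-prompt cosine distance and the unit mean-difference
    direction, a candidate "context-detection" axis to inspect further.
    """
    num = np.sum(h_eval * h_deploy, axis=1)
    den = np.linalg.norm(h_eval, axis=1) * np.linalg.norm(h_deploy, axis=1)
    cos_dist = 1.0 - num / den
    diff = (h_deploy - h_eval).mean(axis=0)
    return cos_dist, diff / np.linalg.norm(diff)
```

A large, consistent shift along one direction would only be a starting signal; it still needs causal validation, for example by steering along that direction and checking whether behavior changes.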

B. Evaluation Robustness (Sandbagging)

Threat: the model deliberately hides its capability

Signal: capability suppression features in residual stream

Experiment: identify capability-related features with an SAE → compare activations during sandbagging

Link to my research: Control RL's dynamic steering → applicable to sandbagging detection

C. Mechanistic Interpretability → Alignment

Concrete link: analyze intent/goal representations with SAE features

Confidence manifold research → geometric structure of model certainty

Agent traces + FSM → detect hidden objectives by analyzing the agent's policy structure

3. Additional topics (based on Yiming's research)

D. Pre-training Poisoning Detection

Threat: poisoned samples inserted into the training data → a backdoor planted in the model

Signal: identify poisoned feature clusters with an SAE. Detect anomalies by comparing against the normal feature distribution

Experiment: compare the SAE feature space of a clean model vs a poisoned model → check whether poisoned features form a specific cluster

Link to my research: FaithfulSAE's self-generated dataset → use it to build a clean representation baseline
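A rough sketch of the anomaly comparison in topic D, assuming SAE feature activations are already available as `(n_samples, n_features)` arrays for a clean baseline and a suspect model. The z-test on per-feature means and the threshold value are simple stand-ins chosen for illustration, not a method from the notes.

```python
import numpy as np

def anomalous_features(clean_acts, suspect_acts, z_threshold=6.0):
    """Flag features whose mean activation in the suspect model deviates
    strongly from the clean baseline (z-score of the difference in means)."""
    mu = clean_acts.mean(axis=0)
    sd = clean_acts.std(axis=0) + 1e-8  # avoid division by zero for dead features
    n = suspect_acts.shape[0]
    z = (suspect_acts.mean(axis=0) - mu) / (sd / np.sqrt(n))
    return np.flatnonzero(np.abs(z) > z_threshold)
```

This only catches shifts in single features; behavior distributed across many features would need a multivariate comparison.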

E. Prompt Injection / Extraction Defense

Threat: system prompt extraction or instruction override

Signal: distinguish "instruction-following mode" vs "extraction mode" in the internal representation

Experiment: monitor residual stream activation patterns during prompt extraction attempts → steer the extraction behavior in the CorrSteer manner

Link to my research: CorrSteer achieved +22.9% on HarmBench for harmful behavior steering → the same approach could apply to prompt injection defense

F. Generation Safety (Backtracking)

Connects to Yiming's "Backtracking Improves Generation Safety" paper

Idea: backtrack when unsafe content is detected during generation → an approach similar to Control RL's token-level dynamic steering

Difference: backtracking is discrete (it goes back), Control RL is continuous (steering in the residual stream)

Discussion point: the trade-off between the two approaches, latency vs granularity

4. Expected questions (ordered by probability, realistic predictions)

Tier 1: very high (one of these will almost certainly come up)

"How would you detect deception / alignment faking in a frontier model?" → A core Anthropic topic, directly tied to Seonglae's mech interp background → SAE feature analysis: deception-correlated features, context-detection features → Model organisms approach: deliberately train a deceptive model → analyze with SAE → CorrSteer extension: steering vector from deception-correlated features → Expected follow-up: "What if the model learned to hide it from your detection?"

"Suppose we trained an SAE on a model — what safety-relevant information could we extract?" → Since Seonglae specializes in SAEs, they may steer in this direction → Intent/goal representation, deception features, capability features → Expected follow-up: "What are the limitations? What couldn't we find?"

"A model passes all our safety evals but we suspect it might be unsafe. What do you do?" → A problem combining eval robustness and interpretability → Representation-level analysis beyond behavioral eval → Consistency checks, adversarial probing, internal monitoring

Tier 2: high

"How would you detect if training data was poisoned?" (Yiming's own research) → SAE feature anomaly detection, clean baseline comparison

"What's the most promising direction for making interpretability useful for safety?" → Detect → Intervene → Verify pipeline

"How would you approach the problem of models sandbagging on evaluations?" → Capability suppression features in residual stream

Tier 3: possible

"What research would you do during the fellowship?" (as the last question) → CorrSteer → extend to deception detection, model organisms + SAE

"How could we monitor a deployed model for safety in real-time?" → AgentGraph experience, trace-based + feature-based monitoring

"What's the relationship between model capabilities and alignment difficulty?" → The scaling problem: interpretability has to keep up with capability

5. Sample answer (for practice)

Q: "How would you detect if a model is deceptively aligned?"

Answer structure (3 min):

Threat model: the model shows aligned behavior in training/eval but pursues a different objective under a specific trigger or in a deployment context. Sleeper agent scenario.

6. Broader Insights: thoughts I can bring up naturally in the interview

Use this section to show "research taste" and "big picture thinking" even when the question is not directly about alignment.

AI Coding Agent & Context Engineering

The core of an AI agent is tools and memory. Context engineering is underrated.

Memory is not just dynamic skill memory: it is a composition of several types, including rule memory, the dynamic user prompt, and summarization

The system is already very complex and will get more complex

Two big regimes are needed: tracing (tracking runtime behavior) and user-side context engineering (currently only on the app developer side)

Real-time tracing → leads to next-step prediction (work in progress)

Interview link: tracing is the core of agent safety monitoring. AgentGraph is exactly this direction. As agents get more complex, trace-level safety becomes more important than output-level safety. Directly tied to Anthropic's AI control direction.

World Model

LLMs already have a world model internally. But it is a byproduct, not something deliberately implanted.

The world model should be inferred, not planted directly in the model

Best proxy: natural language itself (the empirically validated best proxy)

The model's belief state, intent, simulation: these can be analyzed as the LLM's internal state

Interview link: this connects to a fundamental question of mech interp. Do the features found by SAEs actually reflect the model's "beliefs"? The question of how far probes can be trusted (RLFR in the Goodfire paper). My confidence manifold research belongs here.

Gradient Routing & Evolutionary Data Curation

Layer-wise evolutionary data curation + gradient routing → place specific functions in specific layers, like the structure of a primitive brain

SGTM and the gradient routing approach

In a pretrained model, can gradient routing concentrate safety-relevant computation in specific layers?

Interview link: usable as an answer to "How would you make models more interpretable by design?" The direction of making safety features architecturally localizable.

Human-AI Future & Brain Debugging

Debugging the human brain is an important milestone. The fundamental problem of transferring knowledge and experience.

The human brain is hard-coded, not finely architected → superposition and physical/conceptual connections are entangled

Biological processing time is the bottleneck (not language or media)

Future prediction: human win / robot win / hybrid occur simultaneously depending on locality

Interview link: not a fit for direct alignment questions, but can show original thinking on "What's the long-term vision?" type questions. The ultimate purpose of interpretability is a bridge between AI and human cognition.

Adaptable vs All-Memories Intelligence

Even if adaptable intelligence emerges, the existing all-memories intelligence will not disappear.

The two paradigms coexist

Safety perspective: adaptable intelligence brings new risks (behavior changes at runtime), all-memories brings the existing risks (dependence on training data)

Interview link: a nuanced answer to "How do you think about safety for future AI systems?" Not one-size-fits-all safety; different system types need different safety approaches.

7. My papers, decomposed by evaluation criterion

CorrSteer — Problem Decomposition

Problem: SAE steering requires a contrastive dataset → not scalable

Decomposition: (1) Which features are useful? (2) How strongly? (3) Minimum data?

Solution: inference-time correlation. 108 samples, HarmBench +27.2%

Experiment: Gemma-2 2B, LLaMA-3.1 8B / MMLU + HarmBench + bias simultaneously / 4000→108 efficiency

Failure Modes: correlation≠causation, task-specific, spread across polysemantic features

Alignment: swap the labels and the same pipeline applies to deception/scheming detection
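The correlation step behind "which features are useful?" might look roughly like this. It is a sketch only: it assumes per-sample SAE activations and a per-sample outcome score are already collected, and the function name and the use of plain Pearson correlation are illustrative assumptions, not the exact CorrSteer procedure.

```python
import numpy as np

def top_correlated_features(acts, scores, k=3):
    """Rank SAE features by Pearson correlation with a task outcome.

    acts:   (n_samples, n_features) feature activations at inference time
    scores: (n_samples,) outcome per sample (e.g. 1.0 for a safe response)
    Returns the indices of the k most correlated features and their correlations.
    """
    a = acts - acts.mean(axis=0)
    s = scores - scores.mean()
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(s) + 1e-12
    corr = (a * s[:, None]).sum(axis=0) / denom
    order = np.argsort(-np.abs(corr))[:k]
    return order, corr[order]
```

Swapping `scores` for a deception or scheming label is what reusing the same pipeline for detection would amount to; the correlation≠causation failure mode still applies.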

Control RL — Experiment Design

Problem: static steering is identical for every token → cannot be context-dependent

Decomposition: (1) When (2) How much per token (3) Are the RL decisions interpretable?

Solution: residual stream as a POMDP → a small RL network steers per token

Experiment: GSM8K capability + dynamic vs static + decision analysis with SAE features

Failure Modes: reward hacking, capacity limits, OOD generalization

Alignment: Runtime safety monitoring prototype

FaithfulSAE — Failure Mode Thinking

Problem: SAE "Fake Features": a training data artifact

Decomposition: (1) Why fake? External OOD data (2) Detection? FFR (3) Fix? Self-generated data

Key point: the SAE problem is the training data, not the architecture

Experiment: 7 models, 5 architectures / probing / FFR comparison / seed stability

Failure Modes: question the "SAEs work well" assumption. If the SAE is looking at fake features, safety monitoring is meaningless

Confidence Manifold — Research Taste

Problem: the model internally knows whether its answer is right, but this cannot be seen from the output alone

Decomposition: (1) Correctness geometric structure? (2) Universal? (3) Output vs internal?

Key point: the world model is a byproduct, not something implanted

Experiment: 9 models, 5 architectures / probe vs output prob / manifold visualization

Failure Modes: the probe picks up other correlates (RLFR), task-specific, OOD

Alignment (strongest): deception (the probe catches it even when hidden), scheming awareness, AI optimism → training signal

AgentGraph — Engineering

Problem: agent behavior is opaque; the policy cannot be understood from outputs alone

Decomposition: (1) Trace extraction (2) Trace→FSM (3) Failure patterns

Alignment: combine white-box (SAE) + black-box (trace), next-step prediction
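The Trace→FSM and next-step prediction steps can be illustrated with a first-order transition table. This is a simplified sketch: traces are assumed to be lists of step labels, and the function names are hypothetical rather than AgentGraph's actual interface.

```python
from collections import Counter, defaultdict

def build_fsm(traces):
    """Count state-to-state transitions across agent traces."""
    transitions = defaultdict(Counter)
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            transitions[cur][nxt] += 1
    return transitions

def predict_next(transitions, state):
    """Most frequent successor of `state`, or None if it was never left."""
    if state not in transitions:
        return None
    return transitions[state].most_common(1)[0][0]

traces = [
    ["plan", "search", "read", "answer"],
    ["plan", "search", "search", "error"],
    ["plan", "search", "read", "answer"],
]
fsm = build_fsm(traces)
```

Failure patterns then show up as states with a high share of transitions into an `error`-like step, which is the kind of signal early failure detection would watch for.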

Research Arc


FaithfulSAE(foundation) → CorrSteer(static) → Control RL(dynamic) → Confidence Manifold(understanding) → AgentGraph(system)

In one sentence: "I made tools reliable, used them for control, realized understanding comes before control, now extending to agent safety."

8. Get familiar with Anthropic's recent research

Reading list (in priority order):

Alignment faking in LLMs (2024): empirical demonstration of deceptive alignment

Sleeper agents (2024): backdoor persistence across safety training

Scaling monosemanticity (2024): large-scale SAE on Claude

Model organisms of misalignment: methodology for studying deliberately induced misalignment

AI control (collaboration with Redwood Research): monitoring + intervention

Also skim Yiming's papers:

Persistent Pre-Training Poisoning (ICLR 2025)

Backtracking Improves Generation Safety (ICLR 2025)

Effective Prompt Extraction (COLM 2024)

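9. Right before the interview (20:00-20:40)

Test camera/mic on Google Meet

Practice aloud, once each, the 3 topics among A-F I am most confident in (3 min each)

Say the "How would you detect deception?" answer aloud once

Have water ready

Check for a quiet environment

Write the framework in my notes: Threat → Signal → Experiment → Failure

Remaining prep (before the interview)

Skim 2-3 Anthropic papers: Alignment faking, Sleeper agents, Scaling monosemanticity

Yiming's paper: Backtracking Improves Generation Safety (most connections)

Practice aloud: self-introduction + deception detection answer, once each

Meet test: check camera/mic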

Strengths (to leverage)

My mech interp research connects directly to Anthropic's alignment science

CorrSteer: steering → extend to deception detection / prompt injection defense

Control RL: dynamic intervention → runtime safety, comparable with backtracking

AgentGraph: agent safety monitoring experience → scalable safety tooling

FaithfulSAE: SAE improvement → more trustworthy interpretability

Confidence manifold: representation geometry → alignment signal

Cautions

Do not try to get the right answer → show the thought process

If I do not know, say so honestly and explain how I would approach it

If the interviewer cuts in, follow along naturally (it moves fast at 15 minutes)

Mentioning Yiming's papers in front of him is natural (but never pretend to know more than I do)

Do not force my research in; bring it up only when it connects naturally

Keep an open stance with "One approach could be..." rather than "I think..."

→ See interview.introduce.md: self-introduction, research journey, limitations, closing → See interview.simulation.md: 3 scenarios

Interview simulation

Scenario 1: Yiming-led (Security/Safety focus)


[0:00-1:00] Intro
Yiming: "Hi Seonglae, thanks for joining. Could you briefly tell us about your research?"
→ Self-introduction (30 sec)

[1:00-4:00] First question
Yiming: "So, let's say we discover that a frontier model has been trained on
poisoned data. How would you go about detecting which behaviors were affected?"

→ Threat model: poisoned data → backdoor in specific contexts
→ Signal: SAE feature analysis: poisoned behaviors form specific feature clusters;
  compare against a clean model to identify anomalous feature activations
→ Experiment: model organisms: deliberately poison → analyze with SAE → develop a detection method
→ My research: clean baseline with FaithfulSAE, steering poisoned features with CorrSteer

[4:00-7:00] Follow-up
Yiming: "Interesting. But what if the poisoned behavior is distributed across
many features and doesn't form a clear cluster?"

→ Acknowledge that it is a good question
→ Multi-layer analysis: even if nothing shows in one layer, aggregating across layers may surface a signal
→ Behavioral probing: try diverse triggers, track changes in activation patterns
→ Admit the limitation: if spread across polysemantic features, it is hard with current methods
→ Possible direction: feature circuit analysis, looking at feature interactions rather than individual features

[7:00-10:00] New question
Yiming: "Shifting gears — how would you think about making model outputs safer
at generation time?"

→ Two approaches: backtracking (discrete) vs continuous steering
→ Backtracking: go back when an unsafe token is detected (the approach in your paper)
→ Control RL: continuous steering in the residual stream, deciding the intervention per token
→ Trade-off: backtracking is simple but coarse, steering is fine-grained but needs good features
→ Combined approach: steering as the first line of defense, backtracking as a second safety net

[10:00-13:00] Deep dive
Nate: "From a practical standpoint, how would you actually deploy interpretability-based
safety monitoring? What are the engineering challenges?"

→ Real-time SAE inference overhead — feature extraction per token
→ Feature dashboard: monitor only the key safety features (not the full SAE)
→ AgentGraph experience: extend trace-based monitoring → feature-based monitoring
→ Scaling: batched feature extraction, pre-computed feature sets for known risks
→ False positive management: threshold tuning, human-in-the-loop for edge cases

[13:00-15:00] Wrap-up
Yiming: "Last question — what research direction in alignment do you find most
promising right now?"

→ Combining interpretability + control
→ Detect (SAE features) → Intervene (steering) → Verify (eval)
→ These three stages are currently separate; integrating them gets closer to real safety guarantees
→ Specifically: representation-level monitoring that can trigger automatic intervention

Scenario 2: Open-ended Research Discussion


[0:00-1:00] Intro

[1:00-5:00] First question
"What do you think is the hardest unsolved problem in alignment?"

→ Detecting deception in models that are smarter than us
→ Current interpretability assumes that we can understand the features
→ A sufficiently smart model could encode deceptive behavior outside interpretable features
→ This is a fundamental scalability problem

[5:00-9:00] Follow-up
"How would you approach that?"

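→ Two directions:
→ 1) Feature completeness: verify that the SAE captures every important concept
→ 2) Behavioral consistency: consistency checks at the representation level,
  i.e. whether the internal state is consistent across different phrasings of the same meaning
→ If the model is deceptive, there must be a computation that decides "should I deceive in this situation?"
→ That computation has to exist somewhere; the problem is how to find it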

[9:00-13:00] Pivot
"How about the control problem — if we can't fully understand the model,
can we still make it safe?"

→ Yes: control is possible even if interpretability is not perfect
→ CorrSteer/Control RL: intervene on safety-relevant features without full understanding
→ Defense in depth: interpretability + behavioral eval + runtime monitoring
→ No single one is perfect, but the combination is robust

[13:00-15:00] Wrap-up
"Any questions for us?"
→ (There may be no time)
→ If there is time: "What alignment research problem does the team most want to make
  progress on in the next year?"

Scenario 3: Short, rapid-fire questions


[0:00-0:30] Quick intro

[0:30-3:00] Q1: "How would you detect a sleeper agent?"
→ Activation analysis across trigger/non-trigger contexts
→ SAE features: whether trigger-detection features exist

[3:00-6:00] Q2: "What's the biggest limitation of current SAE approaches?"
→ Polysemanticity is not fully solved
→ No guarantee of feature completeness
→ Scaling: computational cost on larger models

[6:00-9:00] Q3: "How would you evaluate alignment in a model you can't fully interpret?"
→ Combination of behavioral testing + representation monitoring
→ Consistency checks across contexts
→ Red teaming + automated adversarial eval

[9:00-12:00] Q4: "Design an experiment to test if steering actually improves safety."
→ Benchmark: HarmBench, TruthfulQA
→ Condition: steered vs unsteered model
→ Measure: safety score, capability retention, false positive rate
→ Adversarial: attack resistance of the steered model

[12:00-15:00] Q5: "What would you work on during the fellowship?"
→ Extending CorrSteer to deception detection
→ Combining interpretability + control for runtime safety
→ Model organisms: train deceptive models, test SAE-based detection

Self-introduction & Speaking Notes

Self-introduction (~25 sec)

Hi everyone, I'm Seonglae Cho — an AI researcher and engineer at Holistic AI. I joined Holistic last year after winning their hackathon with a steering vector idea for controlling AI. Currently I'm working on agent safety — reversing agent traces for early failure detection and next-step prediction. In general, I'm particularly interested in mechanistic interpretability and AI control. If you'd like me to briefly introduce my research journey, I'm happy to do that.

Attendees: Yiming Zhang (Anthropic) + 1 other (Anthropic) + Joe Smith, Amy Ngo (Constellation ops, observing). Key point: finish within 25 seconds. The interviewer will likely go straight to the main topic.

Research Journey (if asked, ~40 sec)

I started from interpretable AI research at Yonsei University in Korea during my bachelor's, trying to understand LLMs deeply since at that time I was curious about how they work. Then I came across the monosemanticity and Anthropic circuits papers, which resolved a lot of that curiosity and completely changed my direction. For further mech interp research, I went to UCL in London, which has great researchers like Neel Nanda, Arthur Conmy, and Nina from Anthropic. During my master's at UCL, I became obsessed with AI control through steering vectors. I built my thesis around correlation-based and reinforcement-learning-based steering, and published papers on each — CorrSteer and Control RL. Now I've started to feel the limitations of steering vectors alone, so I'm exploring broader AI safety directions: agent trace-based failure prediction, and measuring AI confidence and optimism. Happy to go deeper into any of these.

Follow-up: "What limitations did you feel?" (~30 sec)

Steering at inference time is useful, but in research benchmarks it's hard to show it's decisively better than training-based approaches (e.g. CorrSteer got +22.9% on HarmBench, but training-based safety methods like RLHF can cover a broader range of behaviors). The effect may be real in production, but for rigorous research validation, steering alone has a ceiling. That's why I've been gravitating back toward internal probing — which is very hot right now — because understanding internal representations deeply is ultimately what lets you contribute to better training methods. For example, my confidence manifold work is about understanding the geometric structure of model certainty in activation space — if we can probe when a model is uncertain or overconfident, that signal can feed directly into training. So the arc is: control through steering → understanding through probing → improving how we train models to be safer.

Follow-up: "What's your insight on probing?" (deeper, if asked)

I believe in the power of the probe. As my confidence manifold work and much of the literature shows, hidden states contain far more information about an LLM's internal state than logic-level or token-only approaches — similar in spirit to your backtracking paper. What I'm currently working on is internal probing as a tool — giving an AI agent a probe that enables super-accurate introspection and confession. This could be a game changer, because most tools we give agents are things humans can also do, but this one cannot be physically installed in the human brain. I believe tool and memory are the core distinctions of an agent that leverage intelligence. The reason Claude Code works so well is that it leverages humanity's most accurate invention — the operating system and file system. So I'm very curious to observe the impact of this entirely new type of tool.

Probe-as-Tool — Problem Decomposition & Formalization

Problem: every current AI agent tool is one that humans can also use (search, files, APIs). An internal probe is the first tool type that cannot be physically installed in a human brain. → A unique advantage for the agent.

Decomposition:

Probe accuracy: which internal states can be extracted, and how accurately?

Tool interface: how does the agent use the probe results?

Feedback loop: probe → action → outcome → verify probe accuracy

Formalization:


Agent state:  s_t = (context, tool_results, probe_results)
Probe:        p(h_t) → {confidence, deception, knowledge_boundary}
              where h_t = hidden state at layer l, position t
Policy:       π(s_t) = argmax Q(s_t, a)
              where a ∈ {generate, search, confess, backtrack, ...}

Hypothesis:   π(s_t with probe) >> π(s_t without probe)
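A toy instance of the formalization above, to make the interface concrete. Everything here is an assumption for illustration: the probe outputs are given as numbers in [0, 1], the action set is reduced to three actions, and `q_values` is a hand-written stand-in for a learned Q-function.

```python
def q_values(probe):
    """Hand-written toy Q-function (assumption); a real one would be learned."""
    return {
        "generate": probe["confidence"],       # answer directly when confident
        "search": 1.0 - probe["confidence"],   # look things up when unsure
        "confess": probe["deception"],         # surface a detected deceptive state
    }

def policy(state):
    """pi(s_t) = argmax_a Q(s_t, a), reading only the probe part of the state."""
    q = q_values(state["probe_results"])
    return max(q, key=q.get)

state = {
    "context": "user question",
    "tool_results": [],
    "probe_results": {"confidence": 0.3, "deception": 0.1},
}
```

The experiment in the hypothesis then amounts to comparing this kind of agent against one whose `probe_results` are withheld.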

Experiment Design:

Agent + probe tool vs Agent without probe

Tasks: QA with known unknowns, deception detection, calibration

Metrics: accuracy, calibration error, confession rate, hallucination rate

Probe Variants (different approaches):

Time-series probe: track how the internal state changes across token positions

Attention probe: extract intent/focus from attention patterns

Hyperbolic classifier: capture hierarchical structure in hyperbolic space (extension of the confidence manifold)

EMA probe: in long contexts, mean pooling misses malicious token spikes. Train a linear mean probe, then at inference accumulate the per-token scores with an EMA and use the max value. (arxiv)
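The difference between mean pooling and the EMA-then-max aggregation described for the EMA probe can be seen on synthetic per-token scores. This is a sketch: the scores are assumed to come from an already trained linear probe, and the smoothing factor `alpha` and the choice to start the EMA at the first score are illustrative.

```python
import numpy as np

def mean_pool_score(token_scores):
    """Sequence-level score by mean pooling; short spikes get diluted."""
    return float(np.mean(token_scores))

def ema_max_score(token_scores, alpha=0.5):
    """Accumulate per-token scores with an EMA and return the running maximum."""
    ema = float(token_scores[0])
    best = ema
    for s in token_scores[1:]:
        ema = alpha * float(s) + (1.0 - alpha) * ema
        best = max(best, ema)
    return best

# A long benign context with a short burst of high probe scores in the middle.
scores = np.zeros(1000)
scores[500:505] = 1.0
```

On this input the mean-pooled score stays close to zero, while the EMA maximum rises toward 1 during the burst, so a fixed threshold can still fire.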

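Failure Modes:

Noisy probe → actions based on faulty introspection → performance actually degrades

The agent learns to ignore the probe (the problem of not using a tool even when it is available)

The probe itself changes behavior (observer effect)

Why it is a game changer:

Existing tools = extensions of human capability (search = library, code execution = calculator)

Probe tool = a capability impossible for humans → an ability unique to agents

Just as Claude Code leverages the OS/filesystem, the probe leverages model internals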

Memory Architecture — Problem Decomposition & Formalization

Problem: hidden/implicit memory is a bad approach. Memory should be explicitly maintained. But this does not mean the human has to write files by hand or become a machine for the AI.