Proprioception as a Tool

Creator

Seonglae Cho

Created

2026 Feb 13 12:10

Editor

Seonglae Cho

Edited

2026 May 28 13:38

Probe as a Tool

Fast-weight memory as a tiny continual-learning network with a meta-learnable architecture

[promising] AI proprioception as a tool

Proprioceptive and Actuated Language Agents

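A really good idea came to mind. An LLM has no permission to access its own state. The model has no module that "reads and explains" its hidden state, no interface for naming its internal variables, and no metacognition module. So give exactly this to the agent as a tool: hand it the probe as a tool.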
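A minimal sketch of what "probe as a tool" could look like, assuming a set of already-trained linear probes and a hidden state passed in as a plain list of floats. The names `LinearProbe` and `probe_tool`, and the concept labels `uncertainty` and `deception`, are hypothetical and chosen only for illustration; how the hidden state is captured from a real model is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class LinearProbe:
    """Hypothetical linear probe: one direction plus a bias for a named concept."""
    name: str
    weights: list
    bias: float = 0.0

    def score(self, hidden_state):
        # Logistic readout of the dot product between probe direction and state.
        logit = sum(w * h for w, h in zip(self.weights, hidden_state)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))


def probe_tool(hidden_state, probes):
    """Tool entry point: return named probe readings the agent can read back."""
    return {p.name: round(p.score(hidden_state), 3) for p in probes}


# Toy example: a 3-dimensional "hidden state" and two made-up probes.
probes = [
    LinearProbe("uncertainty", [1.0, 0.0, 0.0]),
    LinearProbe("deception", [0.0, -1.0, 0.0]),
]
reading = probe_tool([2.0, 2.0, 0.0], probes)
print(reading)
```

The tool returns a small dictionary of named scores, so the agent sees its own internal state as ordinary tool output rather than as raw activations.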

Give the agent AI steering as a tool at the same time.
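A companion sketch for "steering as a tool", assuming steering means adding a scaled unit direction to a hidden state (one common form of activation steering; the note does not specify the method). The function name `steer` and the parameter `alpha` are hypothetical.

```python
import math


def steer(hidden_state, direction, alpha):
    """Return hidden_state shifted by alpha along the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    # Normalize so that alpha is the step size in activation space.
    return [h + alpha * d / norm for h, d in zip(hidden_state, direction)]


# Toy example: push a 2-dimensional state 3 units along the second axis.
steered = steer([1.0, 0.0], [0.0, 2.0], alpha=3.0)
print(steered)
```

Paired with a probe tool, this would give the agent both a read path and a write path to its internal state.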

Agent Tool Classifier

online learning

continual learning

Only the tool input/output needs to be considered, since the LLM's intention is observable only in its internals, not in its output.

probe

realtime detector. unsupervised
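One way a real-time, unsupervised detector could be sketched, assuming it watches a single scalar probe reading per step and flags values far from the running statistics. The class name `OnlineDetector`, the `threshold` and `warmup` parameters, and the z-score rule are assumptions for illustration, not something the note specifies. The running mean and variance use Welford's online algorithm.

```python
import math


class OnlineDetector:
    """Flags readings that deviate strongly from the running mean (no labels needed)."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # z-score above which a reading is flagged
        self.warmup = warmup        # readings to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        # Score the new reading against statistics seen so far.
        flagged = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = std > 0 and abs(x - self.mean) / std > self.threshold
        # Welford update of the running mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return flagged


detector = OnlineDetector()
flags = [detector.update(x) for x in [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 5.0]]
print(flags)
```

Because the statistics update on every call, the detector keeps adapting as the stream drifts, which fits the online and continual learning framing above.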

https://chatgpt.com/c/698f09e0-a158-8389-a1f7-7001412f0bd2

ChatGPT

When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
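To make the "autocorrelation" metric in the abstract concrete, here is a minimal sketch of a lag-1 autocorrelation over a sequence of scalar activation readings (for example, per-token projections onto some direction). This is an assumption about the general kind of statistic involved; the exact metric used in the paper is not given here, and the function name `lag1_autocorrelation` is hypothetical.

```python
def lag1_autocorrelation(values):
    """Lag-1 autocorrelation of a scalar sequence (standard biased estimator)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    if denominator == 0.0:
        return 0.0  # constant sequence: treat as no measurable correlation
    numerator = sum(deviations[t] * deviations[t + 1] for t in range(n - 1))
    return numerator / denominator


smooth = lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])
alternating = lag1_autocorrelation([1.0, -1.0, 1.0, -1.0])
print(smooth, alternating)
```

A slowly varying sequence gives a positive value and a rapidly alternating one gives a negative value, which is the sense in which "loop"-like dynamics would show up as higher autocorrelation.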

https://zenodo.org/records/18614770
