Vision Language Action Model
A VLA should have an action decoder or low-level vector-based control output. If the model only generates a high-level action sequence, it is a VLM, not a VLA.
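A minimal sketch of that distinction in generic PyTorch (class names and dimensions are illustrative, not from any specific model): the piece that turns a VLM into a VLA is the action head that maps fused vision-language features to a low-level control vector.

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Maps the backbone's fused vision-language embedding to a continuous low-level action."""
    def __init__(self, hidden_dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim),  # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, fused_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(fused_embedding)

# A VLM stops at token logits; attaching a head like this (or decoding discretized
# action tokens back to vectors) is what lets the output drive a controller directly.
backbone_output = torch.randn(1, 768)      # stand-in for a VLM's last hidden state
action = ActionDecoder()(backbone_output)  # shape (1, 7): a control vector, not text
```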
Vision Language Action Models
Vision Language Action Notion
Can robots learn from human videos simply by scaling up, without explicit human-robot alignment (such as masking, generative transformation, or humanoid mapping)? By treating human videos as "just another embodiment" and co-training them with robot data during fine-tuning, VLA models trained at scale exhibit an emergent ability to transfer from egocentric human video to robot tasks without any explicit alignment or transformation. Robot performance improves significantly on new scenes, new objects, and new task semantics demonstrated only in the human data, with an average improvement of nearly 2x. Transfer occurs in both high-level subtasks and low-level actions, and performance is best when both are trained together.
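A minimal co-training sketch under stated assumptions (the Episode fields, sampling ratio, and function names are illustrative, not the paper's pipeline): human egocentric episodes are sampled into the same fine-tuning stream as robot episodes, with no masking, generative transformation, or humanoid mapping applied to the human data.

```python
import random
from dataclasses import dataclass

@dataclass
class Episode:
    frames: list          # egocentric RGB observations
    instruction: str      # language task description / subtask label
    actions: list         # low-level robot actions, or proxy action labels for human video
    embodiment: str       # "robot" or "human"

def cotrain_batches(robot_data, human_data, batch_size=32, human_ratio=0.5):
    """Yield mixed batches for a single fine-tuning objective; the ratio is a tunable assumption."""
    while True:
        batch = [
            random.choice(human_data if random.random() < human_ratio else robot_data)
            for _ in range(batch_size)
        ]
        yield batch  # both embodiments flow through the same VLA loss, with no extra alignment step
```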
Steering
A VLA's Transformer FFN neurons still encode semantic concepts like slow, fast, and up. By selectively activating these neurons (activation steering) → robot behavior can be adjusted in real time without fine-tuning, rewards, or environment interaction. In both simulation (OpenVLA, LIBERO) and on real robots (UR5, Pi 0) → behavioral characteristics like speed and movement height change zero-shot. Semantic neuron intervention is more effective than prompt modification or random intervention. VLAs maintain interpretable semantic structure internally that can be directly manipulated to control robot behavior transparently and immediately.
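A hedged sketch of this kind of activation steering using a PyTorch forward hook (the toy FFN block, neuron indices, and gain are hypothetical; the real models' layer paths and concept neurons differ): previously identified concept neurons are scaled up at inference time, with no fine-tuning, rewards, or environment interaction.

```python
import torch
import torch.nn as nn

def add_steering_hook(activation_layer: nn.Module, neuron_ids, gain: float = 3.0):
    """Amplify the chosen FFN neurons on every forward pass (inference-time only)."""
    def hook(module, inputs, output):
        steered = output.clone()
        steered[..., neuron_ids] *= gain   # boost concept neurons, e.g. ones tied to "slow"
        return steered
    return activation_layer.register_forward_hook(hook)

# Toy stand-in for one Transformer FFN block: up-projection -> GELU -> down-projection.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
slow_neurons = [17, 402, 1311]             # hypothetical indices for the "slow" concept
handle = add_steering_hook(ffn[1], slow_neurons, gain=3.0)

with torch.no_grad():
    hidden = torch.randn(1, 512)           # stand-in for a token's hidden state
    steered_out = ffn(hidden)              # downstream action decoding now biased toward slower motion
handle.remove()                            # detach the hook to restore default behavior
```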

Seonglae Cho