SmolVLA

Creator: Seonglae Cho
Created: 2025 Nov 25 14:09
Edited: 2025 Nov 26 12:14
Refs
Trained on only 487 community robot datasets centered on
SO100
(approximately 10 million frames).
Action Expert (an action decoder: a ≈100M-parameter Transformer) +
SmolVLM
; the action expert uses
Flow Matching
→ it directly generates continuous control values rather than acting as an
Autoregressive Model
, and it predicts a batched action sequence in one pass → very fast.

Rectified Flow Matching Loss

The action expert optimizes an MSE loss that regresses the vector field: the velocity from a noised action toward the clean action along a straight interpolation path.
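A minimal numpy sketch of a rectified flow matching loss. Function names, shapes, and the toy "model" are illustrative assumptions, not SmolVLA's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(predict_field, actions):
    """MSE between a predicted vector field and the rectified-flow target.

    actions: (batch, chunk, action_dim) clean action chunks.
    predict_field(x_t, t): model returning a velocity of the same shape.
    """
    x1 = actions
    x0 = rng.standard_normal(x1.shape)         # Gaussian noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1, 1))  # one interpolation time per sample
    x_t = (1.0 - t) * x0 + t * x1              # straight-line interpolant
    target = x1 - x0                           # constant target velocity
    pred = predict_field(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Toy "model" that ignores its input and predicts zero velocity.
actions = rng.standard_normal((4, 50, 6))      # batch of 50-step, 6-DoF chunks
loss = rectified_flow_loss(lambda x_t, t: np.zeros_like(x_t), actions)
```

With the zero-velocity toy model the loss reduces to the mean squared norm of `x1 - x0`, which is strictly positive.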

Action chunk prediction

The action expert is trained with flow matching to output chunks of n = 50 actions with a single forward pass.
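At inference, sampling a chunk means integrating the learned vector field from Gaussian noise to actions; each integration step updates all 50 actions at once. A hedged sketch with a toy vector field (Euler integration and the step count are illustrative choices, not SmolVLA specifics):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action_chunk(predict_field, chunk_len=50, action_dim=6, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (pure noise) to t=1 (actions)."""
    x = rng.standard_normal((chunk_len, action_dim))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * predict_field(x, t)              # one Euler step on the chunk
    return x                                          # (chunk_len, action_dim)

# Toy field that contracts everything toward zero, just to exercise the loop.
chunk = sample_action_chunk(lambda x, t: -x)
```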

Action controller

SmolVLA's Action Expert does not fix the action dimension: actions are projected action_dim → Linear → model_dim (d). Action outputs are not fed back into the VLM context; only sensor states are re-input at each step. During training, the Action Expert's input is noised action tokens; during inference, it is pure Gaussian noise.
Since the robot must act on what it sees now and on the language command it received,
Cross-Attention
is used in
Vision AI Controlling
. Only the first half (8 of 16) of the VLM's layers feed the Action Expert: the later VLM layers are more specialized for language generation → less suitable for control, and using half the layers also halves the compute (
LayerSkip
). The middle layers carry stronger perception grounding and control-relevant signals.
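A toy numpy sketch of this wiring: action tokens cross-attend to features from only the first 8 of 16 "VLM layers". Single-head attention and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: action tokens attend to VLM features."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# 16 layers of "VLM features"; the action expert consumes only the first 8.
vlm_layers = [rng.standard_normal((24, 32)) for _ in range(16)]
used = vlm_layers[:8]                       # skip the language-generation half
out = rng.standard_normal((50, 32))         # 50 action tokens
for feats in used:                          # one cross-attention per kept layer
    out = cross_attention(out, feats)
```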

Synchronous inference mode

Action-chunk driven: executes the entire 50-step chunk, then observes the next frame and computes a new chunk.
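A minimal sketch of this chunk-driven loop with a toy environment and a dummy policy (all names and the counter-as-observation are illustrative assumptions):

```python
class ToyEnv:
    """Toy environment whose observation is just a step counter."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t

def synchronous_loop(policy, env, horizon=200):
    """Execute the full chunk, then observe and replan (chunk driven)."""
    obs = env.reset()
    steps = 0
    while steps < horizon:
        chunk = policy(obs)            # plan 50 actions from the current frame
        for action in chunk:
            obs = env.step(action)     # no new observation is used mid-chunk
            steps += 1
            if steps >= horizon:
                break
    return obs

plans = []
def policy(obs):
    plans.append(obs)                  # record each replanning point
    return [0.0] * 50                  # dummy 50-action chunk

final_obs = synchronous_loop(policy, ToyEnv())
```

Over a 200-step horizon this replans only 4 times, once per executed chunk.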

Asynchronous inference mode

Frame driven: predicts a new 50-action plan at every frame. The 50 actions serve as a generous buffer; the latest frame is what matters. Concretely, at every timestep (= every control step) a new observation arrives as one action is executed, so one observation comes in per action executed.
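One way to sketch this frame-driven loop (simplified: it ignores the overlapped/threaded execution a real asynchronous stack would use; names are illustrative assumptions):

```python
class ToyEnv:
    """Toy environment whose observation is just a step counter."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t

def asynchronous_loop(policy, env, horizon=200):
    """Replan at every frame; execute one action per new observation."""
    obs = env.reset()
    for _ in range(horizon):
        chunk = policy(obs)        # fresh 50-action plan from the latest frame
        obs = env.step(chunk[0])   # execute only the first action; rest is buffer
    return obs

plans = []
def policy(obs):
    plans.append(obs)
    return [0.0] * 50

final_obs = asynchronous_loop(policy, ToyEnv())
```

Over a 200-step horizon this replans 200 times: exactly one observation per executed action, in contrast to the chunk-driven loop.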
 
 
 
 
 
paper
model
usage
 
