Trained on only 487 community robot datasets centered on the SO100 robot (approximately 10 million frames).
Action expert (a ~100M-parameter Transformer action decoder) attached to SmolVLM uses flow matching to directly generate continuous control values. It is not an autoregressive model: it predicts a whole batch (chunk) of actions at once, so inference is very fast.

Action chunk prediction
The action expert is trained with flow matching to output chunks of n = 50 actions with a single forward pass.
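A minimal sketch of how a flow-matching action expert turns noise into a 50-action chunk in one integration pass. The `velocity` function here is a placeholder for the ~100M Transformer, and the 6-dim action space and 10 Euler steps are assumptions, not values from the paper.

```python
import numpy as np

CHUNK, ACTION_DIM, STEPS = 50, 6, 10  # 6-DoF actions and 10 steps are assumptions

def velocity(x, t, target):
    # Hypothetical "trained" velocity field: for a linear probability path,
    # the optimal velocity points from the current state toward the target
    # actions. A real model would be a Transformer conditioned on VLM features.
    return target - x

def generate_chunk(rng, target):
    x = rng.standard_normal((CHUNK, ACTION_DIM))  # start from pure Gaussian noise
    dt = 1.0 / STEPS
    for i in range(STEPS):
        x = x + dt * velocity(x, i * dt, target)  # Euler integration of the flow
    return x

rng = np.random.default_rng(0)
target = np.zeros((CHUNK, ACTION_DIM))  # stand-in for ground-truth actions
chunk = generate_chunk(rng, target)
print(chunk.shape)  # (50, 6): one forward pass yields the full chunk
```

The key property is that the whole 50-step chunk comes out of one denoising trajectory, rather than one token at a time as in autoregressive decoding.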
Action controller
SmolVLA's action expert does not fix the action dimension: actions are projected action_dim → Linear → model_dim (d), so the same expert can serve different action spaces. Action outputs are not fed back into the VLM context; only sensor states are re-input. During training, the action expert's input is noisy action tokens; during inference, it is pure Gaussian noise.
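A sketch of this action-token interface, with assumed sizes (action_dim=6, d=256) and a linear noise schedule that are illustrative, not taken from the paper:

```python
import numpy as np

action_dim, d, chunk = 6, 256, 50  # assumed sizes, not from the paper
rng = np.random.default_rng(1)
W = rng.standard_normal((action_dim, d)) / np.sqrt(action_dim)  # the Linear layer

def embed(actions):
    return actions @ W  # project action_dim -> model_dim d

# Training input: noisy action tokens, here a linear mix of noise and
# clean actions at noise level t (one common flow-matching path).
clean = rng.standard_normal((chunk, action_dim))
t = 0.3
noisy_tokens = embed((1 - t) * rng.standard_normal((chunk, action_dim)) + t * clean)

# Inference input: pure Gaussian noise of the same shape.
noise_tokens = embed(rng.standard_normal((chunk, action_dim)))
print(noisy_tokens.shape, noise_tokens.shape)
```

Because only the Linear projection depends on action_dim, swapping robots means swapping that one layer, not retraining the expert's width.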
Since the robot must act based on what it sees now and the language command it received, cross-attention is used to condition the action expert on the VLM features. Only the first half (8 of 16 layers) of the VLM is passed to the action expert: the later VLM layers are optimized for language generation and less suited for control, and skipping them halves computation (LayerSkip). The middle layers carry stronger perceptual grounding and control signals.
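A toy illustration of action tokens cross-attending to VLM features from only the first 8 layers. Single head, no learned projections, and all shapes are assumptions; this only shows the data flow, not the real architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, n_act, n_vlm = 64, 50, 120  # assumed widths and token counts
rng = np.random.default_rng(2)
vlm_layers = [rng.standard_normal((n_vlm, d)) for _ in range(16)]
kept = vlm_layers[:8]  # LayerSkip: only the first half of VLM layers is used

q = rng.standard_normal((n_act, d))  # action tokens act as queries
for feats in kept:  # one cross-attention update per kept layer
    attn = softmax(q @ feats.T / np.sqrt(d))  # (n_act, n_vlm) attention weights
    q = q + attn @ feats                       # residual cross-attention update
print(q.shape)  # (50, 64)
```

Dropping `vlm_layers[8:]` is where the ~2x compute saving comes from: those features are never computed for control.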
Synchronous inference mode
Action-chunk driven: executes the entire 50-step chunk, then observes the next frame and computes a new chunk. The robot is idle while the new chunk is computed.
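A control-loop sketch of this mode, with `predict_chunk` as a hypothetical stand-in for the VLM + action expert:

```python
# Synchronous mode: observe once, predict a full 50-action chunk,
# execute all 50 actions, and only then take the next observation.
def predict_chunk(obs):  # stand-in for the VLM + action expert
    return [("act", obs, i) for i in range(50)]

executed, obs = [], 0
for _ in range(3):              # three observe/predict/execute cycles
    chunk = predict_chunk(obs)  # robot pauses during this computation
    executed.extend(chunk)      # execute the whole chunk open-loop
    obs += 1                    # a new frame is observed only here
print(len(executed))  # 150 actions from just 3 observations
```

The 50:1 action-to-observation ratio is what makes this mode fast but slow to react to changes in the scene.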

Asynchronous inference mode
Frame driven: predicts a new 50-action plan at every frame. The 50-action chunk is a generous buffer; the frames are what matter. Specifically, at every timestep (= every control step), a new observation is received as one action is executed, giving a structure of one observation per one action.
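The frame-driven loop above can be sketched as a receding-horizon loop; `predict_chunk` is again a hypothetical stand-in, and executing only the first action of each fresh plan is an illustrative simplification:

```python
# Asynchronous (frame-driven) sketch: every control step receives a
# fresh observation and replans a 50-action chunk, but only one action
# is consumed per step, keeping observations and actions 1:1.
def predict_chunk(obs):
    return [(obs, i) for i in range(50)]  # 50-action buffer from this frame

executed, observations = [], 0
for step in range(100):
    obs = step                  # a new observation arrives every timestep
    observations += 1
    chunk = predict_chunk(obs)  # replan from the latest frame
    executed.append(chunk[0])   # execute only the next action
print(observations, len(executed))  # 100 100
```

Compared with the synchronous loop, the unused tail of each chunk serves as a buffer in case a new plan is not ready in time.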
paper
model
usage

Seonglae Cho