Trained on only 487 community robot datasets centered on the SO100 robot (approximately 10 million frames).
Action expert (a ~100M-parameter Transformer action decoder) attached to SmolVLM uses flow matching to directly generate continuous control values. It is not an autoregressive model: it predicts a whole batch (chunk) of actions at once, so inference is very fast.

Action chunk prediction
The action expert is trained with flow matching to output chunks of n = 50 actions with a single forward pass.
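A minimal sketch of how a flow-matching action expert turns noise into a 50-action chunk in one integration pass. The `velocity` function here is a placeholder for the ~100M Transformer, and the 6-dim action space and 10 Euler steps are assumptions, not values from the paper.

```python
import numpy as np

CHUNK, ACTION_DIM, STEPS = 50, 6, 10  # 6-DoF actions and 10 steps are assumptions

def velocity(x, t, target):
    # Hypothetical "trained" velocity field: for a linear probability path,
    # the optimal velocity points from the current state toward the target
    # actions. A real model would be a Transformer conditioned on VLM features.
    return target - x

def generate_chunk(rng, target):
    x = rng.standard_normal((CHUNK, ACTION_DIM))  # start from pure Gaussian noise
    dt = 1.0 / STEPS
    for i in range(STEPS):
        x = x + dt * velocity(x, i * dt, target)  # Euler integration of the flow
    return x

rng = np.random.default_rng(0)
target = np.zeros((CHUNK, ACTION_DIM))  # stand-in for ground-truth actions
chunk = generate_chunk(rng, target)
print(chunk.shape)  # (50, 6): one forward pass yields the full chunk
```

The key property is that the whole 50-step chunk comes out of one denoising trajectory, rather than one token at a time as in autoregressive decoding.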
Action controller
SmolVLA's action expert does not fix the action dimension: actions are projected action_dim → Linear → model_dim (d), so the same expert can serve different action spaces. Action outputs are not fed back into the VLM context; only sensor states are re-input. During training, the action expert's input is noisy action tokens; during inference, it is pure Gaussian noise.
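A sketch of this action-token interface, with assumed sizes (action_dim=6, d=256) and a linear noise schedule that are illustrative, not taken from the paper:

```python
import numpy as np

action_dim, d, chunk = 6, 256, 50  # assumed sizes, not from the paper
rng = np.random.default_rng(1)
W = rng.standard_normal((action_dim, d)) / np.sqrt(action_dim)  # the Linear layer

def embed(actions):
    return actions @ W  # project action_dim -> model_dim d

# Training input: noisy action tokens, here a linear mix of noise and
# clean actions at noise level t (one common flow-matching path).
clean = rng.standard_normal((chunk, action_dim))
t = 0.3
noisy_tokens = embed((1 - t) * rng.standard_normal((chunk, action_dim)) + t * clean)

# Inference input: pure Gaussian noise of the same shape.
noise_tokens = embed(rng.standard_normal((chunk, action_dim)))
print(noisy_tokens.shape, noise_tokens.shape)
```

Because only the Linear projection depends on action_dim, swapping robots means swapping that one layer, not retraining the expert's width.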
Since the robot must act based on what it sees now and the language command it received, cross-attention is used to condition the action expert on the VLM features. Only the first half (8 of 16 layers) of the VLM is passed to the action expert: the later VLM layers are optimized for language generation and less suited for control, and skipping them halves computation (LayerSkip). The middle layers carry stronger perceptual grounding and control signals.
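A toy illustration of action tokens cross-attending to VLM features from only the first 8 layers. Single head, no learned projections, and all shapes are assumptions; this only shows the data flow, not the real architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, n_act, n_vlm = 64, 50, 120  # assumed widths and token counts
rng = np.random.default_rng(2)
vlm_layers = [rng.standard_normal((n_vlm, d)) for _ in range(16)]
kept = vlm_layers[:8]  # LayerSkip: only the first half of VLM layers is used

q = rng.standard_normal((n_act, d))  # action tokens act as queries
for feats in kept:  # one cross-attention update per kept layer
    attn = softmax(q @ feats.T / np.sqrt(d))  # (n_act, n_vlm) attention weights
    q = q + attn @ feats                       # residual cross-attention update
print(q.shape)  # (50, 64)
```

Dropping `vlm_layers[8:]` is where the ~2x compute saving comes from: those features are never computed for control.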
Synchronous inference mode
Action-chunk driven: executes the entire 50-step chunk, then observes the next frame and computes a new chunk. The robot is idle while the new chunk is computed.
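A control-loop sketch of this mode, with `predict_chunk` as a hypothetical stand-in for the VLM + action expert:

```python
# Synchronous mode: observe once, predict a full 50-action chunk,
# execute all 50 actions, and only then take the next observation.
def predict_chunk(obs):  # stand-in for the VLM + action expert
    return [("act", obs, i) for i in range(50)]

executed, obs = [], 0
for _ in range(3):              # three observe/predict/execute cycles
    chunk = predict_chunk(obs)  # robot pauses during this computation
    executed.extend(chunk)      # execute the whole chunk open-loop
    obs += 1                    # a new frame is observed only here
print(len(executed))  # 150 actions from just 3 observations
```

The 50:1 action-to-observation ratio is what makes this mode fast but slow to react to changes in the scene.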

Asynchronous inference mode
Frame driven: predicts a new 50-action plan at every frame. The 50-action chunk is a generous buffer; the frames are what matter. Specifically, at every timestep (= every control step), a new observation is received as one action is executed, giving a structure of one observation per one action.
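The frame-driven loop above can be sketched as a receding-horizon loop; `predict_chunk` is again a hypothetical stand-in, and executing only the first action of each fresh plan is an illustrative simplification:

```python
# Asynchronous (frame-driven) sketch: every control step receives a
# fresh observation and replans a 50-action chunk, but only one action
# is consumed per step, keeping observations and actions 1:1.
def predict_chunk(obs):
    return [(obs, i) for i in range(50)]  # 50-action buffer from this frame

executed, observations = [], 0
for step in range(100):
    obs = step                  # a new observation arrives every timestep
    observations += 1
    chunk = predict_chunk(obs)  # replan from the latest frame
    executed.append(chunk[0])   # execute only the next action
print(observations, len(executed))  # 100 100
```

Compared with the synchronous loop, the unused tail of each chunk serves as a buffer in case a new plan is not ready in time.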
paper
model
usage

Seonglae Cho