A hierarchical VLA that splits a VLM-based policy (spatial, image, and language understanding) into two layers: a high-level layer for reasoning and a low-level layer for action. This lets robots carry out complex instructions
- High-Level VLM
- Low-Level VLA (π0)
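
As a rough illustration of the two-layer split listed above, here is a minimal sketch of a control loop in which a high-level policy turns the user instruction into a short subtask command and a low-level policy turns that command into actions. Everything in it is an assumption made for illustration: `HierarchicalController`, `toy_high_level`, `toy_low_level`, the `replan_every` interval, and the 7-dimensional action are hypothetical placeholders, not the paper's models or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Observation = Dict[str, int]   # hypothetical stand-in for camera images and robot state
Action = List[float]           # hypothetical 7-dimensional action vector


@dataclass
class HierarchicalController:
    # (observation, user instruction) -> short language command for the low level
    high_level: Callable[[Observation, str], str]
    # (observation, command) -> chunk of low-level actions
    low_level: Callable[[Observation, str], List[Action]]
    # how many low-level chunks run per high-level call (arbitrary choice)
    replan_every: int = 2

    def run(self, instruction: str,
            get_obs: Callable[[int], Observation],
            n_chunks: int) -> List[tuple]:
        log = []
        command = ""
        for i in range(n_chunks):
            obs = get_obs(i)
            # The high level reasons less often than the low level acts.
            if i % self.replan_every == 0:
                command = self.high_level(obs, instruction)
            actions = self.low_level(obs, command)
            log.append((command, actions))
        return log


def toy_high_level(obs: Observation, instruction: str) -> str:
    # Placeholder for a VLM: walks through a fixed list of subtasks.
    steps = ["pick up bread", "place bread on plate"]
    return steps[min(obs["step"] // 2, len(steps) - 1)]


def toy_low_level(obs: Observation, command: str) -> List[Action]:
    # Placeholder for a VLA such as π0: returns one zero action.
    return [[0.0] * 7]


controller = HierarchicalController(toy_high_level, toy_low_level)
trace = controller.run("make a sandwich", lambda i: {"step": i}, n_chunks=4)
for command, actions in trace:
    print(command, len(actions))
```

The only point of the sketch is the interface: the two layers communicate through a plain language command, so either layer could be swapped without touching the other.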
twisting and error accumulation; in the low-level action stage, failure patterns include a proximity bias that leads to incorrect grasps, with the policy tending to grab nearby objects more often
https://arxiv.org/pdf/2502.19417

Seonglae Cho