A Hierarchical VLA divides a VLM (which understands location, images, and language) into two layers, a high-level layer for reasoning and a low-level layer for action, so that robots can carry out complex instructions
- High-Level VLM
- Low-level VLA (Pi 0)
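
The two-layer split above can be sketched as a minimal control loop. This is only an illustration under stated assumptions: `high_level_vlm`, `low_level_vla`, `run_hierarchy`, `Observation`, and `ACTION_DIM` are hypothetical names, the planner is a toy string splitter standing in for a real VLM, and the low-level policy returns a dummy action vector rather than calling Pi 0 or any real model.

```python
from dataclasses import dataclass
from typing import List, Tuple

ACTION_DIM = 7  # assumed action size (e.g. arm joints plus gripper); not from the source


@dataclass
class Observation:
    image: str  # placeholder for a camera frame


def high_level_vlm(instruction: str, obs: Observation) -> List[str]:
    """Hypothetical stand-in for the high-level (reasoning) layer.

    A real system would query a VLM with the image and instruction;
    here the instruction is simply split on " then " into sub-tasks.
    """
    return [part.strip() for part in instruction.split(" then ") if part.strip()]


def low_level_vla(subtask: str, obs: Observation) -> List[float]:
    """Hypothetical stand-in for the low-level (action) layer.

    A real VLA would map (image, sub-task text) to motor commands;
    this stub returns a zero action of fixed length.
    """
    return [0.0] * ACTION_DIM


def run_hierarchy(instruction: str, obs: Observation) -> List[Tuple[str, List[float]]]:
    """Decompose the instruction, then produce one action per sub-task."""
    steps = []
    for subtask in high_level_vlm(instruction, obs):
        action = low_level_vla(subtask, obs)
        steps.append((subtask, action))
    return steps


if __name__ == "__main__":
    obs = Observation(image="frame0")
    for subtask, action in run_hierarchy("pick up the cup then place it on the shelf", obs):
        print(subtask, action)
```

The point of the split is that the high-level layer only emits short language sub-tasks, so the low-level layer can stay a plain instruction-following policy.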
twisting and error accumulation; failure patterns in the low-level action stage include proximity bias, a tendency to grab nearby objects more frequently, which leads to incorrect grasps

Seonglae Cho