Humanoid Robot Platforms
In the long term, infrastructure and tools are likely to be redesigned around robots. Human imitation may only provide short-term cost savings.
Can robots learn from human videos simply by scaling up, without explicit human-robot alignment such as masking, generative transformation, or humanoid mapping? By treating human videos as "just another embodiment" and co-training them with robot data during fine-tuning, VLA robot models trained at scale exhibit an emergent ability to transfer knowledge from egocentric human video to robot tasks without any such alignment or transformation. Robot performance improves markedly on new scenes, new objects, and new task semantics demonstrated only in the human data, with an average improvement of nearly 2x. Transfer appears at both the level of high-level subtasks and low-level actions, and is strongest when the two are trained together.
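
Below is a minimal sketch of what such embodiment-agnostic co-training can look like, assuming a generic policy network and toy tensors in place of a real VLA backbone. Class names such as `TrajectoryDataset` and `TinyPolicy`, the embodiment-id convention, and the data shapes are hypothetical illustrations, not details of the work summarized above.

```python
# A minimal sketch of co-training human video and robot data as two
# embodiments of the same dataset schema (illustrative, not the actual method).
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, ConcatDataset

class TrajectoryDataset(Dataset):
    """(observation, action) pairs from one embodiment.

    Human egocentric clips and robot teleoperation logs share the same schema;
    the only distinction is an integer embodiment id.
    """
    def __init__(self, observations, actions, embodiment_id):
        self.observations = observations      # (N, obs_dim) visual/proprio features
        self.actions = actions                # (N, act_dim) action targets
        self.embodiment_id = embodiment_id    # 0 = robot, 1 = human (convention here)

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        return (self.observations[idx],
                self.actions[idx],
                torch.tensor(self.embodiment_id))

class TinyPolicy(nn.Module):
    """Stand-in for a VLA policy head: observation + embodiment embedding -> action."""
    def __init__(self, obs_dim, act_dim, n_embodiments=2):
        super().__init__()
        self.embodiment_emb = nn.Embedding(n_embodiments, 16)
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 16, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, embodiment_id):
        emb = self.embodiment_emb(embodiment_id)
        return self.net(torch.cat([obs, emb], dim=-1))

# Toy data standing in for robot demonstrations and egocentric human video.
obs_dim, act_dim = 64, 7
robot_data = TrajectoryDataset(torch.randn(512, obs_dim), torch.randn(512, act_dim), 0)
human_data = TrajectoryDataset(torch.randn(2048, obs_dim), torch.randn(2048, act_dim), 1)

# Co-training: the two sources are simply concatenated and sampled together;
# no masking, generative transformation, or explicit human-to-robot mapping.
loader = DataLoader(ConcatDataset([robot_data, human_data]), batch_size=64, shuffle=True)
policy = TinyPolicy(obs_dim, act_dim)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

for obs, act, emb_id in loader:
    loss = nn.functional.mse_loss(policy(obs, emb_id), act)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch is that, under this framing, "alignment" reduces to giving both data sources a shared interface and letting scale do the rest; the human-specific preprocessing pipelines the question lists are simply absent.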

Seonglae Cho
