
Robot manipulation policies typically generate actions from only what is currently visible to the camera, which causes a sharp performance drop on tasks that require remembering out-of-view objects or goals. mindmap accumulates past observations into a spatial memory via metric-semantic 3D reconstruction (a TSDF + VFM-feature voxel map) and conditions a 3D diffusion policy (a trajectory-denoising transformer) on it to generate 3D end-effector trajectories.
- RGB-D frames are encoded by a VFM (AM-RADIO) into per-pixel feature maps, which are back-projected to 3D points using the depth channel (see the sketch after this list).
- In parallel, a TSDF is accumulated with nvblox and the VFM features are projected onto its voxels; the features attached to the extracted mesh vertices serve as reconstruction tokens.
- Current-observation tokens and reconstruction tokens pass through separate encoders, are concatenated, and the trajectory is denoised by attending to them (see the denoiser sketch below).
- For the humanoid setting, the policy additionally outputs bimanual control plus head-yaw control (for exploration/scanning and remembering locations).
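
A minimal sketch of the reconstruction side, assuming pinhole intrinsics and a dict-based sparse voxel map; names like `backproject_features`, `fuse_features`, and the 2 cm voxel size are illustrative placeholders, not the paper's code (which uses AM-RADIO features and nvblox TSDF fusion):

```python
import numpy as np

VOXEL_SIZE = 0.02  # assumed voxel resolution in meters

def backproject_features(depth, feats, K, T_world_cam):
    """Lift per-pixel VFM features to world-frame 3D points.

    depth: (H, W) metric depth; feats: (H, W, C) feature map at image resolution
    K: (3, 3) camera intrinsics; T_world_cam: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (N, 4) homogeneous
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]            # (N, 3)
    return pts_world, feats[valid]                            # points and their features

def fuse_features(voxel_feats, pts_world, pt_feats, alpha=1.0):
    """Splat point features into a sparse voxel feature map.

    alpha=1.0 is plain overwrite; alpha<1 gives EMA-style temporal blending
    (the ablation below finds the two perform about the same).
    """
    keys = np.floor(pts_world / VOXEL_SIZE).astype(np.int64)
    for key, f in zip(map(tuple, keys), pt_feats):
        old = voxel_feats.get(key)
        voxel_feats[key] = f if old is None else (1 - alpha) * old + alpha * f
    return voxel_feats
```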
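
And a sketch of how the two token streams could condition the trajectory denoiser: separate linear encoders, concatenation, and cross-attention from the noisy trajectory to the combined tokens. The class name, dimensions, and layer counts are assumptions for illustration, not the released architecture:

```python
import torch
import torch.nn as nn

class TrajectoryDenoiser(nn.Module):
    def __init__(self, obs_dim=768, recon_dim=768, d_model=256, act_dim=8):
        super().__init__()
        self.obs_enc = nn.Linear(obs_dim, d_model)      # current-observation tokens
        self.recon_enc = nn.Linear(recon_dim, d_model)  # reconstruction (mesh-vertex) tokens
        self.traj_in = nn.Linear(act_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)     # diffusion timestep embedding
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, noisy_traj, t, obs_tokens, recon_tokens):
        # noisy_traj: (B, horizon, act_dim); obs/recon tokens: (B, N, dim); t: (B,)
        cond = torch.cat([self.obs_enc(obs_tokens),
                          self.recon_enc(recon_tokens)], dim=1)  # concatenated condition
        x = self.traj_in(noisy_traj) + self.time_emb(t)[:, None, :]
        x = self.decoder(tgt=x, memory=cond)                     # cross-attend to condition
        return self.head(x)                                      # predicted noise / clean trajectory
```

At inference, standard diffusion-policy sampling would start from Gaussian noise and apply this denoiser iteratively to produce the end-effector trajectory.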
mindmap achieves a 76% average success rate, a large improvement over 3D Diffuser Actor (20%); on humanoid tasks it outperforms GR00T N1 by +26 percentage points. The gap to a "privileged" setting (an external camera that removes the need for memory) is only 9 points. Limitations: a small, task-specific model trained on little data; keypose extraction is cumbersome; the reconstruction is non-differentiable; and storing per-voxel features carries memory overhead.
Ablations:
- Using the reconstruction alone degrades performance (current-view information still helps for pickup and similar actions).
- Replacing VFM features with raw RGB causes a large drop (semantic information is critical).
- Temporal blending of features (EMA) vs. simply overwriting them makes a negligible difference (cf. the `alpha` parameter in the fusion sketch above).
checkpoint

Seonglae Cho