Mindmap 3D

Robot manipulation policies typically generate actions based only on what is currently visible to the camera, which leads to significant performance degradation on tasks that require memory of out-of-view objects or goals. mindmap accumulates past observations into a spatial memory via metric-semantic 3D reconstruction (a TSDF + VFM-feature voxel map) and uses it as a condition for a 3D diffusion policy (a trajectory-denoising transformer) that generates 3D end-effector trajectories.
  • RGB-D frames are processed by a VFM (AM-RADIO) into feature maps, which are back-projected to 3D points using depth (see the back-projection sketch after this list).
  • Simultaneously, a TSDF is accumulated with nvblox; VFM features are projected onto the voxels, and the resulting features are attached to mesh vertices to serve as reconstruction tokens (see the feature-fusion sketch below).
  • Current-observation tokens and reconstruction tokens pass through separate encoders, are concatenated, and are denoised via attention to produce trajectories (see the denoiser sketch below).
  • For the humanoid, the policy includes bimanual control plus head-yaw control (for exploration/scanning and remembering locations).
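
The back-projection in the first bullet is standard pinhole geometry. A minimal sketch, assuming metric depth, known intrinsics (fx, fy, cx, cy), and a VFM feature map already resized to the depth resolution (names are illustrative, not from the paper's code):

```python
import numpy as np

def backproject_features(depth, feats, fx, fy, cx, cy):
    """Lift a per-pixel feature map to a 3D point cloud using depth.

    depth: (H, W) metric depth in meters
    feats: (H, W, C) VFM feature map resized to the depth resolution
    Returns (N, 3) camera-frame points and (N, C) features for valid pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    valid = z > 0  # drop pixels with missing depth
    x = (u - cx) * z / fx  # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)[valid]  # (N, 3)
    point_feats = feats[valid]                    # (N, C)
    return points, point_feats
```

The returned points are in the camera frame; they would be transformed by the camera-to-world pose before being fused into the map.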
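For the second bullet, nvblox handles the TSDF integration itself; what mindmap adds is a per-voxel semantic feature. A simplified sketch of that fusion, assuming a sparse dict-of-voxels representation and the default overwrite rule (the function and its signature are hypothetical):

```python
import numpy as np

def fuse_features(points_world, point_feats, voxel_size, feature_map):
    """Scatter per-point VFM features into a sparse voxel grid.

    points_world: (N, 3) points already transformed into the world frame
    point_feats:  (N, C) features from the back-projection step
    feature_map:  dict mapping integer voxel indices -> (C,) feature vector
    The TSDF itself is integrated separately (by nvblox); this sketch only
    maintains the semantic feature attached to each voxel.
    """
    idx = np.floor(points_world / voxel_size).astype(np.int64)
    for key, feat in zip(map(tuple, idx), point_feats):
        feature_map[key] = feat  # overwrite rule; EMA variant shown later
    return feature_map
```

Each vertex of the mesh extracted from the TSDF would then look up the feature of its containing voxel to form a reconstruction token.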
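The third bullet's conditioning scheme can be made concrete with a small PyTorch sketch: two token streams encoded separately, concatenated into one context, and attended to by the noisy trajectory. All dimensions and module choices are illustrative assumptions, not the paper's architecture details:

```python
import torch
import torch.nn as nn

class TrajectoryDenoiser(nn.Module):
    """Sketch of the conditioning path: separate encoders for
    current-observation and reconstruction tokens, concatenated into one
    context that the noisy trajectory attends to during denoising."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.obs_enc = nn.Linear(768, d_model)      # current-view token encoder
        self.recon_enc = nn.Linear(768, d_model)    # reconstruction token encoder
        self.traj_in = nn.Linear(3, d_model)        # noisy 3D waypoints
        self.t_embed = nn.Embedding(1000, d_model)  # diffusion timestep embedding
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 3)           # predicted noise per waypoint

    def forward(self, noisy_traj, obs_tokens, recon_tokens, t):
        # Encode each token stream separately, then concatenate as context.
        ctx = torch.cat([self.obs_enc(obs_tokens),
                         self.recon_enc(recon_tokens)], dim=1)
        q = self.traj_in(noisy_traj) + self.t_embed(t)[:, None, :]
        return self.head(self.decoder(q, ctx))  # attention over both token sets
```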
mindmap achieves a 76% average success rate, a significant improvement over 3D Diffuser Actor (20%); on humanoid tasks it outperforms GR00T N1 by +26 pp. The gap to the "privileged" setting (an external camera that eliminates the need for memory) is only 9 pp. Limitations: small model, small dataset, and task-specific training; keypose extraction is cumbersome; and the reconstruction is non-differentiable, while per-voxel feature storage adds memory overhead.

Ablations:

  • Using the reconstruction alone degrades performance (current-view information still helps, e.g., for pickup).
  • Replacing VFM features with raw RGB causes significant degradation (semantic information is critical).
  • Temporal blending (EMA) vs. overwriting features shows a negligible performance difference (a sketch of both update rules follows).
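
A sketch of the two per-voxel update rules compared in the last ablation (alpha is an assumed blending weight, not a value from the paper):

```python
def update_voxel_feature(old, new, mode="overwrite", alpha=0.2):
    """Two candidate rules for updating a voxel's feature on revisit.
    The ablation finds their performance nearly identical."""
    if mode == "overwrite":
        return new                           # keep only the latest observation
    return (1 - alpha) * old + alpha * new   # EMA temporal blending
```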
 
 
 
 