The paper argues that the existing "depth" concept in deep learning fails to explain the actual learning structure, and proposes a new learning paradigm, Nested Learning (NL), which interprets the entire model as a nested structure of multi-level optimization problems.
Every neural network and model optimizer can be viewed as an Associative Memory that compresses its context flow.
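Roughly formalized (the notation below is illustrative, not necessarily the paper's exact symbols), an associative memory is an operator that maps keys to values by solving an inner optimization over the context it has observed:

```latex
% Associative memory as an inner optimization problem (illustrative notation):
% M maps keys K (e.g. inputs) to values V (e.g. targets or error signals)
% by minimizing an internal objective over the observed context.
\mathcal{M}^{*} \;=\; \arg\min_{\mathcal{M}} \; \tilde{\mathcal{L}}\!\left(\mathcal{M}(K),\, V\right)
```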
- Gradient Descent → Level 1 Associative Memory (mapping data to error signals)
- Momentum Method / Adam Optimizer → Level 2 nested optimization (compressing and memorizing past gradients; see the sketch after this list)
- Attention Mechanism, Multi-Layer Perceptron → sub-optimization modules, each with its own context flow
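To make the Level 2 reading concrete, here is a minimal sketch (my own illustration, not the paper's code) of momentum viewed as an inner associative memory: the momentum buffer is the state of a small optimization problem whose job is to compress the stream of past gradients.

```python
import numpy as np

def momentum_as_memory(grads, beta=0.9):
    """Level 2 view: the momentum buffer m compresses the gradient stream.

    Each update is one gradient step on the inner objective
        L_inner(m) = 0.5 * ||m - g_t||^2
    with step size (1 - beta), which recovers the familiar EMA
        m <- beta * m + (1 - beta) * g_t.
    """
    m = np.zeros_like(grads[0])
    for g in grads:
        # Inner "memory" update: compress the new gradient into m.
        m = m - (1 - beta) * (m - g)   # identical to beta * m + (1 - beta) * g
    return m

# Toy usage: the memory summarizes (compresses) the whole gradient history.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=3) for _ in range(100)]
    print(momentum_as_memory(grads))
```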
Deep learning is therefore not simply a matter of stacking layers; it should be understood as a multi-level optimization system with multiple time scales and periodic updates. However, this argument is weak, because Transformers already operate with virtual layers that interact far more naturally with the other layers, making them already a form of unrestricted nested learning.
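The "multiple time scales" claim can be pictured as components that update at different periods (the level names and periods below are hypothetical, not taken from the paper):

```python
# Sketch of the multi-time-scale reading: each level keeps its own state and
# is updated with its own period, so "depth" becomes a hierarchy of how often
# each component changes rather than a stack of layers.
update_period = {
    "activations": 1,    # recomputed every step (fastest context flow)
    "momentum": 1,       # inner gradient memory, updated every step
    "weights": 1,        # outer gradient step, every step
    "lr_schedule": 100,  # hyper-level, adjusted every 100 steps (slowest)
}

for step in range(1, 201):
    for level, period in update_period.items():
        if step % period == 0:
            pass  # apply this level's update rule (omitted in this sketch)
```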