Tiny Recursive Model
TRM achieves significantly higher performance and generalization than HRM with just a single tiny 2-layer network (7M parameters) and simple recursion.
Token-level parallelism is still possible, but depth comes from applying a small number of layers recursively over three variables: x (input), y (prediction), and z (reasoning vector). x is the encoding of the input, z is initialized from a projection embedding, and y starts at zero. For n steps, z is updated as a function of x, y, and z; then y is updated from y and z. This outer loop is repeated N times. In other words, n acts like layer depth and N like recursive depth (see the sketch below).
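A minimal PyTorch-style sketch of this loop structure, under stated assumptions: the stand-in 2-layer MLP `net`, the dimensions, the zero initializations, and combining x, y, z by summation are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

d = 128  # hidden width (assumption)
n = 6    # inner steps: "layer depth"
N = 3    # outer repeats: "recursive depth"

# a single tiny 2-layer network, reused for every update
net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

def trm_forward(x):
    """x: (batch, d) encoded input; returns the prediction y."""
    z = torch.zeros_like(x)  # the note: z comes from a projection embedding (zeros here for brevity)
    y = torch.zeros_like(x)  # the note: y starts at zero
    for _ in range(N):           # outer recursion, repeated N times
        for _ in range(n):       # inner loop: refine the reasoning vector z
            z = net(x + y + z)   # z is updated from x, y, and z
        y = net(y + z)           # then y is updated from y and z
    return y

y_hat = trm_forward(torch.randn(4, d))
```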
Criticism
Hierarchical Reasoning Model's recursion doesn't truly reach a fixed point (the residual never becomes 0), so the Implicit Function Theorem cannot be applied with any mathematical guarantee, and the 1-step gradient approximation lacks a solid foundation. TRM therefore removes the IFT assumption entirely and backpropagates directly through all recursion steps. As a result, training is much more stable and generalization improves significantly.
In other words, HRM applies an implicit-differentiation formula whose fixed-point assumption does not actually hold. The gradient it computes is therefore fundamentally incorrect, and even if the model "sort of works" early in training, convergence and generalization become distorted. Since the fixed point doesn't actually exist, the honest option is simply to backprop through every step.
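To make the objection concrete, the IFT gradient that the 1-step approximation leans on can be sketched as below; the notation (f_θ for one recursion step, z_T for the last iterate) is mine, not the papers'.

```latex
% If the recursion truly reached a fixed point z^*, i.e. z^* = f_\theta(z^*, x),
% the Implicit Function Theorem would give
\frac{\partial z^*}{\partial \theta}
  = \Bigl(I - \frac{\partial f_\theta}{\partial z}\Big|_{z^*}\Bigr)^{-1}
    \frac{\partial f_\theta}{\partial \theta}\Big|_{z^*}
% The 1-step approximation replaces the inverse-Jacobian factor with I and
% evaluates at the last iterate z_T. But if the residual
% \|z_T - f_\theta(z_T, x)\| \neq 0, then z_T is not a fixed point and the
% identity above does not apply in the first place.
```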

Therefore, TRM backpropagates through the full chain of recursion steps, whereas HRM only backpropagates through the final step of reasoning.
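The contrast can be shown as a sketch: the detach-based scheme below is a simplification of HRM's actual two-module 1-step gradient, while the second function is the full-backprop behavior the note attributes to TRM; `net` is the same stand-in network as above.

```python
import torch

def hrm_style_forward(net, x, y, z, n):
    # simplified 1-step scheme: earlier recursion steps run without an
    # autograd graph, so gradients only flow through the final updates
    with torch.no_grad():
        for _ in range(n - 1):
            z = net(x + y + z)
    z = net(x + y + z)
    y = net(y + z)
    return y

def trm_style_forward(net, x, y, z, n):
    # full backprop: the autograd graph covers every recursion step
    for _ in range(n):
        z = net(x + y + z)
    y = net(y + z)
    return y
```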

Seonglae Cho