Continuous Thought Machine

Creator
Seonglae Cho
Created
2025 May 20 16:13
Edited
2025 Jun 22 19:1

CTM

The key idea is that attention operates on neuron activation patterns rather than on tokens.
Traditional artificial neural networks haven't evolved much from 1980s models and barely utilize neurons' 'firing timing' information. In contrast, biological brains use synaptic time differences (spike timing) as crucial mechanisms for learning and reasoning. CTM aims to implement human-like step-by-step, interpretable thinking processes by introducing this 'temporal information' into neural networks.
Each neuron receives its past firing history (patterns from multiple past timepoints) as input to determine its next output. The model uses the degree of synchronization between neurons as its core representation, spontaneously generating diverse dynamics with different frequencies and amplitudes. CTM performs interpretable reasoning through internal 'thinking' steps regardless of whether the data is static or sequential, making the thought process visible. For example, in maze solving it follows observation → planning → movement-command stages, visualizing a human-like strategy as internal attention patterns that move along the path; for images, it scans key features such as eyes, nose, and mouth step by step, improving classification accuracy. The broader argument is that 'brain mimicry' should not have stopped with the 2012 deep learning revolution.
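
To make the "synchronization as representation" idea concrete, here is a minimal PyTorch sketch; the tensor names (post_history, sync_repr) and shapes are illustrative assumptions, not the paper's reference code.

```python
import torch

# Sketch: synchronization as pairwise similarity of neuron activation traces
# accumulated over internal ticks. Shapes and names are illustrative assumptions.
batch, neurons, ticks = 2, 8, 16
post_history = torch.randn(batch, neurons, ticks)  # per-neuron post-activation trace

# Pairwise dot products across the time axis -> [batch, neurons, neurons]
sync = torch.einsum('bnt,bmt->bnm', post_history, post_history) / ticks

# CTM feeds (roughly) a subset of these pairwise synchronization values, not raw
# activations, into its attention query and output head.
i, j = torch.triu_indices(neurons, neurons)
sync_repr = sync[:, i, j]   # flattened upper triangle: [batch, neurons*(neurons+1)/2]
print(sync_repr.shape)      # torch.Size([2, 36])
```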

Tick Loop

Neuron 'firing timing' information refers to the exact moment when a neuron generates an action potential (spike). In CTM, this means using not only the current activation level but also the recent firing history ("when did this neuron last fire?") to compute the next output.
CTM accepts token-based input, but once the key/value features have been extracted, the core loop consists of Internal Tick, Synapse, Neuron-Level Model, and Synchronization stages. The reasoning itself is based on synchronized neuron activity (sync) rather than on tokens.
The initial pre/post-activation histories are learnable parameters of shape [Batch, Neurons, Memory Length], as in the minimal sketch below.
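
A minimal sketch of that learnable starting state, assuming PyTorch and illustrative sizes:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real model's dimensions differ.
neurons, memory_len, batch = 64, 25, 4

# Learnable templates for the pre/post-activation histories.
init_pre_history = nn.Parameter(torch.zeros(neurons, memory_len))
init_post_history = nn.Parameter(torch.zeros(neurons, memory_len))

# Broadcast to [Batch, Neurons, Memory Length] at the start of a forward pass.
pre_history = init_pre_history.unsqueeze(0).expand(batch, -1, -1).clone()
post_history = init_post_history.unsqueeze(0).expand(batch, -1, -1).clone()
print(pre_history.shape)  # torch.Size([4, 64, 25])
```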
  1. Action Synchronization (Hebbian theory)
    1. Project the pre/post-activation history into an attention query via a projection MLP, giving [Batch, Dimension Length]
  2. Cross-Attention
    1. CTM performs cross-attention at every tick, but the K/V tokens are extracted only once before entering the loop and then reused
    2. The output is [Batch, Dimension Length], since there is a single query
  3. Synapse MLP
    1. Process the attended information through an MLP (or a U-Net-style MLP) to distribute it to the neurons
  4. Pre-activation History Update
  5. Neuron-Level Models (NLM)
    1. Process each neuron's activation history as a time series with its own small MLP over the history (hidden) dimension
  6. Post-activation History Update
  7. Output Synchronization
    1. Synchronize neuron phase and amplitude to form the output representation
  8. Logit Output
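
Putting the steps above together, here is a condensed one-tick sketch. Module names, shapes, and the weight-shared neuron-level model are simplifying assumptions; the actual CTM gives each neuron its own NLM weights and subsamples neuron pairs for synchronization.

```python
import torch
import torch.nn as nn

B, D, M, T_kv, d_attn = 4, 64, 25, 49, 128   # batch, neurons, memory length, K/V tokens, attention dim
n_sync = D * (D + 1) // 2                    # number of neuron pairs (full upper triangle)

q_proj = nn.Linear(n_sync, d_attn)                                   # synchronization -> attention query
attn = nn.MultiheadAttention(d_attn, num_heads=4, batch_first=True)
synapse = nn.Sequential(nn.Linear(d_attn + D, 2 * D), nn.GELU(), nn.Linear(2 * D, D))
nlm = nn.Sequential(nn.Linear(M, 32), nn.GELU(), nn.Linear(32, 1))   # weight-shared simplification of per-neuron MLPs

kv = torch.randn(B, T_kv, d_attn)            # K/V features: extracted once, reused at every tick
pre_hist = torch.zeros(B, D, M)              # pre-activation history
post_hist = torch.zeros(B, D, M)             # post-activation history
i, j = torch.triu_indices(D, D)

for tick in range(5):
    # Steps 1-2: action synchronization -> single query, cross-attention over the fixed K/V
    sync = torch.einsum('bnm,bkm->bnk', post_hist, post_hist) / M
    query = q_proj(sync[:, i, j]).unsqueeze(1)          # [B, 1, d_attn]
    attended, _ = attn(query, kv, kv)                   # [B, 1, d_attn]

    # Steps 3-4: synapse MLP mixes attended input with current neuron state, update pre-history
    pre_act = synapse(torch.cat([attended.squeeze(1), post_hist[..., -1]], dim=-1))
    pre_hist = torch.cat([pre_hist[..., 1:], pre_act.unsqueeze(-1)], dim=-1)

    # Steps 5-6: neuron-level model maps each neuron's history (a time series) to its next value
    post_act = nlm(pre_hist).squeeze(-1)                # [B, D]
    post_hist = torch.cat([post_hist[..., 1:], post_act.unsqueeze(-1)], dim=-1)

# Steps 7-8: output synchronization would feed a logit head (head omitted here)
out_sync = torch.einsum('bnm,bkm->bnk', post_hist, post_hist)[:, i, j] / M
print(out_sync.shape)                                   # torch.Size([4, 2080])
```

This sketch only shows the data flow of one tick; in practice the synchronization pairs are subsampled and decayed, and training backpropagates through all ticks.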

Conclusion

While the information bandwidth between tick steps is limited (hidden dimension × 1), CTM differs from transformers in that test-time compute is inherent to the architecture rather than arising from next-token prediction. Although it has a sequential structure with dependencies like an RNN, it depends on previous ticks rather than on previous tokens, internalizing the reasoning computation into the model. This process replaces self-attention by treating each neuron as a time series and synchronizing them. Compared to transformers it has limitations in parallel training and scalability, sharing similar constraints with Selective State Space models, but if these aspects are improved it represents a more innovative approach. Although its performance is below transformers, it outperforms pre-transformer architectures like LSTM and, crucially, demonstrates intuitive interpretability of the reasoning process.
 
 
 
 

Luke Nicholas Darlow
From South Africa, previously in Edinburgh, now at Sakana AI Tokyo; works on time series.
 
 
