Multi Layer Perceptron

Creator: Seonglae Cho
Created: 2023 Mar 5 13:07
Edited: 2026 Mar 9 16:16

MLP

The rows of the weight matrix before the activation function can be thought of as directions in the embedding space, and that means activation of each neuron tells you how much a given vector aligns with some specific direction. The columns of the weight matrix after the activation function tell you what will be added to the result if that neuron is active.
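This view can be sketched numerically: with a toy MLP block (random weights, names like `W_in`/`W_out` are illustrative), each pre-activation is a dot product measuring alignment with a row-direction, and the output is a sum of columns weighted by the activations.

```python
import numpy as np

# Toy 2-layer MLP block: W_in before the activation, W_out after it.
rng = np.random.default_rng(0)
d_model, d_hidden = 4, 3
W_in = rng.normal(size=(d_hidden, d_model))   # rows: directions in embedding space
W_out = rng.normal(size=(d_model, d_hidden))  # columns: vectors written when a neuron fires

x = rng.normal(size=d_model)

# Each pre-activation is a dot product: how much x aligns with that row-direction.
acts = np.maximum(W_in @ x, 0)                # ReLU: only sufficiently aligned neurons fire

# The output is a sum of W_out columns, weighted by each neuron's activation.
out = W_out @ acts
assert np.allclose(out, sum(a * W_out[:, i] for i, a in enumerate(acts)))
```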
Can have a non-linear decision boundary: using 3 perceptrons, XOR becomes separable.
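A minimal sketch of the 3-perceptron construction (one standard choice: two hidden perceptrons compute OR and NAND, and the output perceptron ANDs them):

```python
import numpy as np

# XOR via 3 perceptrons with step activation.
step = lambda z: (z > 0).astype(int)

def perceptron(x, w, b):
    return step(np.dot(w, x) + b)

def xor(x1, x2):
    x = np.array([x1, x2])
    h_or   = perceptron(x, np.array([1, 1]), -0.5)    # fires for OR
    h_nand = perceptron(x, np.array([-1, -1]), 1.5)   # fires for NAND
    return perceptron(np.array([h_or, h_nand]), np.array([1, 1]), -1.5)  # AND of both

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # prints the XOR truth table: 0, 1, 1, 0
```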

Two layer

A two-layer network can separate a convex open or closed region; each line of the boundary corresponds to one perceptron.

Three layer

A three-layer network can separate arbitrary regions (complexity limited by the number of neurons).

Hidden layer

Every layer except the final (output) layer.

Converting MLP, GLU into Polynomials in Closed Form with SVD (arxiv.org)

MLP Interpretability

A paper explaining the internal mechanism of the Grokking phenomenon in small neural networks learning modular addition, through Fourier features, lottery-ticket structure, and a phase-alignment process. What the model actually learns: when a two-layer network solves modular addition, each neuron learns a single-frequency Fourier feature; in other words, it solves the task by recasting it as a periodic signal decomposition problem rather than arithmetic. Previous work only observed that "neurons learn frequencies"; this paper explains how those features combine into a complete algorithm and why generalization occurs suddenly. Modular addition was chosen as the toy model because it can be expressed exactly in a Fourier basis, making the internal mechanism precisely analyzable.
After the memorization phase, phase alignment brings the frequencies' phases into agreement, so the entire structure operates like a single algorithm; grokking then occurs as an explosion in generalization performance. In other words, Grokking is not about feature discovery but about the alignment and composition of already-discovered features.
Similar to the Lottery Ticket Hypothesis, a subnetwork implementing the correct algorithm already exists within the network; learning is the process of "activating" that structure.
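The Fourier mechanism can be illustrated with a minimal sketch (not from the paper; the modulus and frequency set are illustrative): each frequency contributes a cosine that peaks at the correct answer, and summing over frequencies produces constructive interference only at c = (a + b) mod p.

```python
import numpy as np

p = 113                      # modulus, prime as in the usual grokking setup
freqs = [3, 7, 19, 41, 52]   # illustrative "key frequencies"; the real set is learned
a, b = 40, 57

# Each frequency k contributes cos(2*pi*k*(a+b-c)/p) to the logit for answer c.
# All cosines peak simultaneously only at c = (a + b) mod p, so the summed
# logits interfere constructively exactly at the correct answer.
c = np.arange(p)
logits = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in freqs)

print(int(np.argmax(logits)), (a + b) % p)  # both equal 97
```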
arxiv.org
 
 
