The key-value (KV) representation at the middle layer is converted into a variational latent and injected back into the model, so that during training the first half of the layers acts as an encoder and the second half acts as a decoder. The latent sits in the K, V pathway, which determines "what to attend to." Because the latent is added to K and V, the latent decision can modify the attention pattern itself, which lets the model select different reasoning branches or decision paths depending on the latent. (free)
Z_t is added to the Key and Value through a projection. The VAE uses a discrete categorical latent: Z_t is a single one-hot vector of dimension C = 2^H = 65,536 (that is, H = 16), which encourages the latent to capture high-level decisions preferentially.
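A minimal sketch of the one-hot latent and its injection, in plain Python. It shows that projecting a one-hot Z_t amounts to selecting one row of the projection matrix, which is then added to a key or value vector. Several things here are assumptions not stated in the note: that the category index is formed by packing H binary decisions, that K and V use separate projection matrices (`W_k`, `W_v`), and all helper names and toy weights, which are hypothetical. The demo uses 3 bits instead of 16 to stay small.

```python
H = 16          # number of binary decisions; the text gives C = 2^H
C = 2 ** H      # 65,536 categories for the one-hot latent Z_t


def bits_to_index(bits):
    """Pack a list of 0/1 bits (most significant first) into a category index.

    Assumption: the category is built from H binary decisions.
    """
    index = 0
    for b in bits:
        index = (index << 1) | (1 if b else 0)
    return index


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def project(z, W):
    """Multiply a length-C vector z by a C x d matrix W (given as a list of rows)."""
    out = [0.0] * len(W[0])
    for c, z_c in enumerate(z):
        if z_c != 0.0:  # a one-hot z has exactly one non-zero entry
            for j, w in enumerate(W[c]):
                out[j] += z_c * w
    return out


def inject(vec, z_index, W):
    """Add the projected latent to a key or value vector via a row lookup in W."""
    return [x + w for x, w in zip(vec, W[z_index])]


# Toy demo: 3 bits (8 categories), vector width 4, hypothetical weights.
demo_bits = [1, 0, 1]
demo_index = bits_to_index(demo_bits)
demo_size = 2 ** len(demo_bits)
W_k = [[c + 0.5 * j for j in range(4)] for c in range(demo_size)]
W_v = [[-(c + 0.5 * j) for j in range(4)] for c in range(demo_size)]
k = [1.0, 1.0, 1.0, 1.0]
v = [2.0, 2.0, 2.0, 2.0]
k_with_latent = inject(k, demo_index, W_k)
v_with_latent = inject(v, demo_index, W_v)
```

Because Z_t is one-hot, the row lookup in `inject` gives the same result as the full matrix product in `project`, so a real implementation would likely never materialize the 65,536-dimensional vector.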

Seonglae Cho