It combines a Transformer architecture with existing ranking models.
It treats user actions such as likes, dislikes, and skips as sequential input data and converts each action into a vector. These action vectors are then combined with music track embeddings to create the input tokens for the Transformer.
The Transformer's output merges with the existing ranking model through a multi-layer neural network.
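
The flow described above can be sketched in a few lines of plain Python. This is only an illustration under several assumptions that are not in the original: the action vector and track embedding are combined by element-wise addition, a single self-attention step with identity projections stands in for the full Transformer, and the "merge" is modeled as concatenating the Transformer output with a scalar score from the existing ranker before a small two-layer network. All names (`build_tokens`, `merged_score`, `track_a`, the sizes) are hypothetical, and the weights are random rather than learned.

```python
import math
import random

DIM = 4       # embedding width (assumed for illustration)
HIDDEN = 8    # hidden width of the merge network (assumed)
ACTIONS = ("like", "dislike", "skip")

rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Hypothetical lookup tables; a real system would learn these.
action_emb = {a: rand_vec(DIM) for a in ACTIONS}
track_emb = {t: rand_vec(DIM) for t in ("track_a", "track_b", "track_c")}


def build_tokens(history):
    """One token per (track, action) event: action vector + track embedding."""
    return [
        [a + t for a, t in zip(action_emb[action], track_emb[track])]
        for track, action in history
    ]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections.

    A stand-in for the full Transformer, not a complete layer.
    """
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(
            [sum(w / total * v[i] for w, v in zip(weights, tokens)) for i in range(DIM)]
        )
    return out


def user_vector(history):
    """Use the output at the last position as the user representation."""
    return self_attention(build_tokens(history))[-1]


# Random weights for the two-layer merge network (input: DIM + 1 features).
w1 = [rand_vec(DIM + 1) for _ in range(HIDDEN)]
w2 = rand_vec(HIDDEN)


def merged_score(history, base_score):
    """Concatenate the Transformer output with the existing ranker's score,
    then apply a ReLU hidden layer and a linear output layer."""
    x = user_vector(history) + [base_score]
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


history = [("track_a", "like"), ("track_b", "skip"), ("track_c", "dislike")]
score = merged_score(history, base_score=0.7)
```

Here `base_score` plays the role of the existing ranking model's output. In practice the merge could just as well feed the Transformer output in as an extra feature of that model, and the source does not specify which.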

Seonglae Cho