Moshi Streaming Transformer Implementation: Temporal + Depth (RQ-Transformer)

Neural audio codecs (the Mimi / EnCodec family) typically represent one time frame (e.g., 80 ms per frame for Mimi) as multiple codebook tokens: frame t is represented by K tokens, which the Depth Transformer predicts sequentially within that single frame.
The codec splits each frame into multiple codebook tokens (K tokens per frame). Naively flattening these into a single autoregressive sequence makes it K times longer, so Moshi processes them with a 2-stage Transformer:
- Temporal Transformer: Steps along the time axis, one step per frame, to build a large long-range context
- Depth Transformer: Predicts the codebook (depth) tokens within a single time step, in bottom-to-top order
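The two-stage loop above can be sketched as follows. The `temporal_step` / `depth_step` functions are toy stand-ins (hypothetical, not Moshi's API): the point is the loop structure — one temporal step per frame, then K cheap depth steps inside it.

```python
K = 8          # codebooks per frame (depth); illustrative value
S = 5          # number of time steps (frames); illustrative value
VOCAB = 2048   # codebook size; illustrative value

def temporal_step(history):
    """Toy stand-in for the Temporal Transformer: maps the frame history
    so far to a context for the current time step."""
    return hash(tuple(tuple(f) for f in history)) % VOCAB

def depth_step(context, partial_frame):
    """Toy stand-in for the Depth Transformer: predicts the next codebook
    token from the temporal context plus tokens already emitted this step."""
    return (context + sum(partial_frame)) % VOCAB

frames = []
for t in range(S):          # O(S) temporal steps, not O(K*S)
    ctx = temporal_step(frames)
    frame = []
    for k in range(K):      # K small depth steps within one frame
        frame.append(depth_step(ctx, frame))
    frames.append(frame)

print(len(frames), len(frames[0]))  # -> 5 8
```

Each outer iteration corresponds to one codec frame; the inner loop never re-enters the large temporal model, which is what keeps per-frame latency low.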
Note that Temporal/Depth is not two models doing two unrelated jobs, but a 2-stage factorization of one token stream's joint distribution, chosen for computational efficiency: the Temporal Transformer is the "brain", while the Depth Transformer is the "vocal apparatus".
This is critical for real-time operation: the Temporal Transformer processes a number of steps proportional to the time length S (not K·S), while the Depth Transformer only handles K tokens within each step. This enables low-latency multi-stream audio token generation.
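A rough cost comparison makes the point concrete. Assuming attention cost grows quadratically with sequence length (unit cost per attention pair; the constants are illustrative, not measurements):

```python
# Toy attention-cost accounting: flattened K*S-token sequence vs. the
# factored scheme (temporal over S steps + K depth steps per frame).
K, S = 8, 1024               # codebooks per frame, number of frames

flat = (K * S) ** 2          # one AR sequence of K*S tokens
factored = S**2 + S * K**2   # S temporal steps + a K-token depth pass per step

print(flat // factored)      # -> 60
```

Under these toy numbers the factored scheme is about 60x cheaper, and the gap widens as S grows since the depth term stays linear in S.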
The "Inner Monologue" mechanism predicts text tokens ahead of audio tokens (with a time-aligned delay), which improves language quality and factuality while maintaining streaming ASR/TTS capabilities. This is implemented through inter-channel delay and serialization.
Moshi's Inner Monologue is simply a design in which the text channel is generated a few steps ahead of the audio, within a single causal AR framework. Full-duplex is not "cross-attention between streams": user and agent audio (plus agent text) are jointly and causally modeled on the same time axis, where user tokens continuously arrive as observations and agent tokens are continuously sampled, enabling "speaking while listening".
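The per-channel delay can be illustrated with a toy serialization. `TEXT_LEAD`, the channel ordering, and the padding token are assumptions for illustration, not the paper's exact offsets:

```python
TEXT_LEAD = 2   # text runs this many steps ahead of its audio (illustrative)
PAD = -1        # placeholder token for channels with nothing to emit yet

def step_channels(t, text, agent_audio, user_audio):
    """Assemble the channels visible at step t: agent audio is delayed so
    the text token for a frame appears TEXT_LEAD steps before its audio."""
    txt = text[t] if t < len(text) else PAD
    a = t - TEXT_LEAD
    agent = agent_audio[a] if 0 <= a < len(agent_audio) else PAD
    user = user_audio[t] if t < len(user_audio) else PAD  # observed, not sampled
    return (txt, agent, user)

text = [10, 11, 12, 13]          # toy token ids
agent_audio = [100, 101, 102, 103]
user_audio = [200, 201, 202, 203]

frames = [step_channels(t, text, agent_audio, user_audio) for t in range(4)]
print(frames)
# -> [(10, -1, 200), (11, -1, 201), (12, 100, 202), (13, 101, 203)]
```

Note that the audio matching `text[0]` only surfaces at step 2, while the user channel is consumed at every step — this is the "joint causal modeling on one time axis" described above.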
https://arxiv.org/pdf/2410.00037

Seonglae Cho