Moshi Streaming Transformer Implementation: Temporal + Depth (RQ-Transformer)

Neural audio codecs (the Mimi / EnCodec family) typically represent one time frame (e.g., 80 ms per frame for Mimi) as multiple codebook tokens: frame t is represented by K tokens, which the Depth Transformer predicts sequentially within that single frame.
The codec splits each frame into multiple codebook tokens (K tokens per frame). Naively flattening these into a single autoregressive sequence makes it K times longer, so Moshi processes them with a 2-stage Transformer:
- Temporal Transformer: Steps along the time axis, one step per frame, to build a large long-range context
- Depth Transformer: Predicts the codebook (depth) tokens within a single time step, in bottom-to-top order
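The two-stage loop above can be sketched as follows. The `temporal_step` / `depth_step` functions are toy stand-ins (hypothetical, not Moshi's API): the point is the loop structure — one temporal step per frame, then K cheap depth steps inside it.

```python
K = 8          # codebooks per frame (depth); illustrative value
S = 5          # number of time steps (frames); illustrative value
VOCAB = 2048   # codebook size; illustrative value

def temporal_step(history):
    """Toy stand-in for the Temporal Transformer: maps the frame history
    so far to a context for the current time step."""
    return hash(tuple(tuple(f) for f in history)) % VOCAB

def depth_step(context, partial_frame):
    """Toy stand-in for the Depth Transformer: predicts the next codebook
    token from the temporal context plus tokens already emitted this step."""
    return (context + sum(partial_frame)) % VOCAB

frames = []
for t in range(S):          # O(S) temporal steps, not O(K*S)
    ctx = temporal_step(frames)
    frame = []
    for k in range(K):      # K small depth steps within one frame
        frame.append(depth_step(ctx, frame))
    frames.append(frame)

print(len(frames), len(frames[0]))  # -> 5 8
```

Each outer iteration corresponds to one codec frame; the inner loop never re-enters the large temporal model, which is what keeps per-frame latency low.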
Note that Temporal/Depth is not two models doing two unrelated jobs, but a 2-stage factorization of one token stream's joint distribution, chosen for computational efficiency: the Temporal Transformer is the "brain", while the Depth Transformer is the "vocal apparatus".
This is critical for real-time operation: the Temporal Transformer processes a number of steps proportional to the time length S (not K·S), while the Depth Transformer only handles K tokens within each step. This enables low-latency multi-stream audio token generation.
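A rough cost comparison makes the point concrete. Assuming attention cost grows quadratically with sequence length (unit cost per attention pair; the constants are illustrative, not measurements):

```python
# Toy attention-cost accounting: flattened K*S-token sequence vs. the
# factored scheme (temporal over S steps + K depth steps per frame).
K, S = 8, 1024               # codebooks per frame, number of frames

flat = (K * S) ** 2          # one AR sequence of K*S tokens
factored = S**2 + S * K**2   # S temporal steps + a K-token depth pass per step

print(flat // factored)      # -> 60
```

Under these toy numbers the factored scheme is about 60x cheaper, and the gap widens as S grows since the depth term stays linear in S.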
The "Inner Monologue" mechanism predicts text tokens ahead of audio tokens (with a time-aligned delay), which improves language quality and factuality while maintaining streaming ASR/TTS capabilities. This is implemented through inter-channel delay and serialization.
Moshi's Inner Monologue is simply a design in which the text channel is generated a few steps ahead of the audio, within a single causal AR framework. Full-duplex is not "cross-attention between streams": user and agent audio (plus agent text) are jointly and causally modeled on the same time axis, where user tokens continuously arrive as observations and agent tokens are continuously sampled, enabling "speaking while listening".
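The per-channel delay can be illustrated with a toy serialization. `TEXT_LEAD`, the channel ordering, and the padding token are assumptions for illustration, not the paper's exact offsets:

```python
TEXT_LEAD = 2   # text runs this many steps ahead of its audio (illustrative)
PAD = -1        # placeholder token for channels with nothing to emit yet

def step_channels(t, text, agent_audio, user_audio):
    """Assemble the channels visible at step t: agent audio is delayed so
    the text token for a frame appears TEXT_LEAD steps before its audio."""
    txt = text[t] if t < len(text) else PAD
    a = t - TEXT_LEAD
    agent = agent_audio[a] if 0 <= a < len(agent_audio) else PAD
    user = user_audio[t] if t < len(user_audio) else PAD  # observed, not sampled
    return (txt, agent, user)

text = [10, 11, 12, 13]          # toy token ids
agent_audio = [100, 101, 102, 103]
user_audio = [200, 201, 202, 203]

frames = [step_channels(t, text, agent_audio, user_audio) for t in range(4)]
print(frames)
# -> [(10, -1, 200), (11, -1, 201), (12, 100, 202), (13, 101, 203)]
```

Note that the audio matching `text[0]` only surfaces at step 2, while the user channel is consumed at every step — this is the "joint causal modeling on one time axis" described above.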
https://arxiv.org/pdf/2410.00037

Seonglae Cho