Text prompt: role assignment (e.g., insurance consultant); voice sample: voice cloning. The two are combined into a hybrid prompt that lets a real-time conversational AI control both "role" and "voice" simultaneously. It also extends the existing Full-Duplex-Bench.
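A hypothetical sketch of what assembling such a hybrid prompt could look like; the field names and file path below are illustrative assumptions, not the actual PersonaPlex API:

```python
# Hypothetical sketch: a hybrid prompt pairs a role text prompt with a voice sample.
# These names are assumptions for illustration, not the real PersonaPlex interface.
from dataclasses import dataclass

@dataclass
class HybridPrompt:
    role_text: str          # controls the "role" (persona / task instructions)
    voice_sample_path: str  # controls the "voice" via cloning from a short sample

prompt = HybridPrompt(
    role_text="You are an insurance consultant. Answer customer questions politely.",
    voice_sample_path="samples/consultant_voice.wav",  # hypothetical path
)
print(prompt)
```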
Native full-duplex S2S is based on Moshi: a single autoregressive Transformer models multiple streams simultaneously (multi-stream) as one joint sequence. "Speaking while listening" works because of the separation between observed and sampled tokens: at each time step the joint sequence contains both user and agent tokens, but user tokens are observed (fed in as input) while agent tokens are sampled (generated by the model). This allows operation without any turn segmentation. The streams are:
- User audio stream
- Agent audio stream
- Agent text stream (predicts/uses time-aligned text in Inner Monologue style)
Logically this is multi-stream, but in implementation it is a single autoregressive sequence.
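A minimal sketch (plain Python, not Moshi's actual code) of this observed-vs-sampled decoding loop: at every step the user stream is fed in as an observation while the agent streams are sampled, with no turn boundaries anywhere. The dummy model and token feeds are placeholders.

```python
# Sketch of full-duplex decoding over one joint sequence: user tokens are observed,
# agent tokens are sampled. The model and token values here are stand-ins.
import random

class DummyJointModel:
    """Stand-in for the joint autoregressive Transformer over all streams."""
    def step(self, frame: dict) -> dict:
        # Would return distributions per agent stream; here just random token ids.
        return {"agent_text": random.randrange(100), "agent_audio": random.randrange(100)}

def full_duplex_loop(model, user_audio_tokens):
    history = []
    for user_tok in user_audio_tokens:                  # streaming, no turn segmentation
        observed = {"user_audio": user_tok}             # observed: comes from the user
        sampled = model.step({**observed, "history": history})  # sampled: agent text + audio
        history.append({**observed, **sampled})         # both land in the same joint sequence
    return history

if __name__ == "__main__":
    model = DummyJointModel()
    print(full_duplex_loop(model, user_audio_tokens=[3, 17, 42])[-1])
```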
Moshi Streaming Transformer Implementation: Temporal + Depth (RQ-Transformer)
The audio codec (Mimi) splits each frame into multiple codebook tokens (Q tokens per frame). Simply flattening these would make the sequence far too long, so Moshi processes them with a two-stage Transformer:
- Temporal Transformer: steps along the time axis, building the large context
- Depth Transformer: Predicts codebook (depth) tokens within a single time step in bottom-to-top order
Note that Temporal/Depth are not two independent models over separate data but a two-stage factorization of the same joint stream, done for computational efficiency. Temporal is the "brain", while Depth is the "vocal apparatus".
This is critical for real-time operation: the Temporal Transformer processes a number of steps proportional to the time length S (not K·S), while the Depth Transformer only handles K tokens within each step. This enables low-latency multi-stream audio token generation.
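A minimal PyTorch sketch of this factorization, with illustrative sizes and a GRU standing in for the small depth module purely to keep the sketch short: the temporal model runs once per time step (S steps total), and the depth module autoregressively emits the K codebook tokens inside each step.

```python
# Sketch of the Temporal + Depth (RQ-Transformer style) factorization.
# Sizes, the GRU depth module, and greedy sampling are illustrative assumptions.
import torch
import torch.nn as nn

V, K, D = 256, 8, 64          # vocab per codebook, codebooks per frame, model dim

class TemporalModel(nn.Module):          # "brain": context along the time axis
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V * K, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
    def forward(self, frames):                           # frames: (B, S, K) token ids
        x = self.embed(frames).sum(dim=2)                 # one summary embedding per frame
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.encoder(x, mask=mask)                 # (B, S, D) causal context

class DepthModel(nn.Module):             # "vocal apparatus": K tokens inside one step
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D)
        self.rnn = nn.GRU(D, D, batch_first=True)
        self.out = nn.Linear(D, V)
    def generate(self, ctx):                              # ctx: (B, D) context for this step
        tokens, h = [], ctx.unsqueeze(0)                  # context seeds the hidden state
        prev = torch.zeros(ctx.size(0), dtype=torch.long)
        for _ in range(K):                                # bottom-to-top codebooks
            y, h = self.rnn(self.embed(prev).unsqueeze(1), h)
            prev = self.out(y[:, -1]).argmax(-1)          # greedy pick for illustration
            tokens.append(prev)
        return torch.stack(tokens, dim=1)                 # (B, K)

if __name__ == "__main__":
    temporal, depth = TemporalModel(), DepthModel()
    past = torch.randint(0, V * K, (1, 5, K))             # 5 previous frames
    ctx = temporal(past)[:, -1]                           # big model only runs S steps
    print(depth.generate(ctx).shape)                      # K tokens for the next frame
```

The point of the split is visible in the loop structure: the expensive model sees S positions, while only the cheap per-step module iterates over the K codebooks.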
The "Inner Monologue" mechanism predicts text tokens ahead of audio tokens (with a time-aligned delay), which improves language quality and factuality while maintaining streaming ASR/TTS capabilities. This is implemented through inter-channel delay and serialization.
Moshi's Inner Monologue is simply a design in which the text channel is generated a few steps ahead of the audio channel within a causal AR framework. Full-duplex is not "cross-attention between streams" but joint causal modeling of user/agent audio (plus agent text) on the same time axis: user tokens keep arriving as observations while agent tokens keep being sampled, which is what enables "speaking while listening".
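A toy sketch of the per-channel delay and serialization idea: each stream contributes its token from t minus its delay, so the agent text channel leads the agent audio channel on the shared time axis. The delay values and PAD convention below are assumptions for illustration, not Moshi's actual configuration.

```python
# Toy sketch of Inner-Monologue-style serialization via per-channel delays:
# the text channel is a few steps ahead of the agent audio channel on one time axis.
PAD = -1  # filler for positions that are not yet valid (assumed convention)

def serialize(streams: dict, delays: dict, length: int):
    """Return one joint frame per time step; stream s contributes token t - delays[s]."""
    joint = []
    for t in range(length):
        frame = {}
        for name, toks in streams.items():
            src = t - delays[name]
            frame[name] = toks[src] if 0 <= src < len(toks) else PAD
        joint.append(frame)
    return joint

if __name__ == "__main__":
    streams = {
        "user_audio": [10, 11, 12, 13],
        "agent_text": [20, 21, 22, 23],   # generated first ...
        "agent_audio": [30, 31, 32, 33],  # ... audio follows the text it corresponds to
    }
    delays = {"user_audio": 0, "agent_text": 0, "agent_audio": 2}  # text leads audio by 2 steps
    for frame in serialize(streams, delays, length=6):
        print(frame)
```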
NVIDIA PersonaPlex: Natural Conversational AI With Any Role and Voice
We introduce PersonaPlex, a full-duplex conversational AI model that enables natural conversations with customizable voices and roles. PersonaPlex handles interruptions and backchannels while maintaining any chosen persona, outperforming existing systems on conversational dynamics and task adherence.
https://research.nvidia.com/labs/adlr/personaplex/
model
nvidia/personaplex-7b-v1 · Hugging Face
https://huggingface.co/nvidia/personaplex-7b-v1
pdf
research.nvidia.com
https://research.nvidia.com/labs/adlr/files/personaplex/personaplex_preprint.pdf

Seonglae Cho