Fine-tunes a pretrained decoder-only LLM (Llama3-8B) with "time sync" information so that it runs in step with the real-world clock. The paper models full-duplex conversation as natural interaction with "turn-free, overlapping, backchannel" features, and tolerates latency by introducing prediction and synchronization at the frame/chunk level. However, unlike Moshi, it is not based on a joint streaming audio-token autoregressive architecture.
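Chunk-level synchronization can be illustrated with a small sketch. This is not the paper's implementation: it only assumes that two time-aligned token streams (agent and user) are cut into fixed-size chunks and interleaved, with a marker token at the start of each chunk acting as the clock tick. The names `SYNC_TAGS`, `interleave_streams`, and `elapsed_ms`, and the 160 ms chunk duration in the usage note, are hypothetical choices for illustration.

```python
from typing import List, Sequence, Union

Token = Union[int, str]

# Hypothetical marker tokens, one per speaker (assumption, not the paper's vocabulary).
SYNC_TAGS = ("<S0>", "<S1>")


def chunk(tokens: Sequence[Token], size: int) -> List[List[Token]]:
    """Split a token stream into consecutive chunks of `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def interleave_streams(agent: Sequence[Token], user: Sequence[Token],
                       chunk_size: int) -> List[Token]:
    """Interleave two time-aligned streams chunk by chunk.

    Each chunk is prefixed with its speaker's marker, so the position of a
    marker in the output encodes how much wall-clock time has passed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(agent) != len(user):
        raise ValueError("streams must cover the same time span")
    sequence: List[Token] = []
    for agent_chunk, user_chunk in zip(chunk(agent, chunk_size),
                                       chunk(user, chunk_size)):
        sequence.append(SYNC_TAGS[0])
        sequence.extend(agent_chunk)
        sequence.append(SYNC_TAGS[1])
        sequence.extend(user_chunk)
    return sequence


def elapsed_ms(sequence: Sequence[Token], chunk_ms: int) -> int:
    """Recover elapsed time by counting the agent's chunk markers."""
    return sum(1 for tok in sequence if tok == SYNC_TAGS[0]) * chunk_ms
```

With an illustrative chunk duration of 160 ms, `elapsed_ms(interleave_streams([1, 2, 3, 4], [5, 6, 7, 8], 2), 160)` counts two agent chunks and returns 320. In a sketch like this, latency tolerance would come from the model generating its own estimate of the user's next chunk before the real one arrives, then replacing that estimate once the actual audio tokens are received.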
Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex...
Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction with responses requiring explicit prompting by the...
https://arxiv.org/abs/2409.15594


Seonglae Cho