Multimodal Stream

Created

2024 May 18 9:11

Creator

Seonglae Cho

Editor

Seonglae Cho

Edited

2024 May 22 4:26

Refs

Native Multimodality

Continuously encoding video frames

Combining the video and speech input into a timeline of events

Caching information for efficient recall
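As a rough illustration of the three points above, here is a minimal sketch that merges timestamped video and speech events into one timeline and keeps a bounded cache for recall. It is not how Project Astra is implemented; `Event`, `build_timeline`, and `RecentCache` are hypothetical names, and the string payloads stand in for encoded frames and speech segments.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float        # timestamp in seconds
    modality: str   # "video" or "speech"
    payload: str    # stand-in for an encoded frame or speech segment


def build_timeline(video, speech):
    """Merge two timestamp-sorted event streams into one ordered timeline."""
    return list(heapq.merge(video, speech, key=lambda e: e.t))


class RecentCache:
    """Bounded cache (assumption): keeps only the most recent events."""

    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)

    def add(self, event):
        self._events.append(event)

    def recall(self, since):
        """Return cached events with a timestamp at or after `since`."""
        return [e for e in self._events if e.t >= since]


video = [Event(0.0, "video", "frame0"),
         Event(0.5, "video", "frame1"),
         Event(1.0, "video", "frame2")]
speech = [Event(0.3, "speech", "where are"),
          Event(0.8, "speech", "my glasses")]

timeline = build_timeline(video, speech)
cache = RecentCache(max_events=4)
for event in timeline:
    cache.add(event)
```

With `max_events=4` the oldest frame is evicted once the fifth event arrives, so the memory cost stays fixed while recent context remains queryable through `cache.recall(...)`.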

How

So how would the input be tokenized and fed in? (Perhaps the vision and audio streams are combined, then split into chunks before being fed in.)

How could new tokens interrupt inference partway through? (A KV cache could probably handle this.)
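The KV cache guess can be sketched with a toy decode loop. This is only an assumption about how interruption might work, not a description of Gemini or Project Astra: `ToyDecoder` and `generate` are hypothetical, the cache is a plain list standing in for per-token key/value tensors, and a real model would also need to handle positions and attention masks.

```python
class ToyDecoder:
    """Stand-in for a model: stores one cache entry per processed token."""

    def __init__(self):
        self.kv_cache = []   # hypothetical key/value entry per token
        self.encoded = 0     # how many tokens were ever encoded

    def extend(self, tokens):
        # Encode only the new tokens; earlier cache entries are reused as-is.
        for tok in tokens:
            self.kv_cache.append(("kv", tok))
            self.encoded += 1

    def next_token(self):
        # Placeholder for sampling: the output depends only on cache length.
        return f"out{len(self.kv_cache)}"


def generate(decoder, prompt, incoming, max_steps):
    """Decode step by step, splicing newly arrived input in between steps."""
    decoder.extend(prompt)
    outputs = []
    for step in range(max_steps):
        if step in incoming:           # new input arrived mid-generation
            decoder.extend(incoming[step])
        tok = decoder.next_token()
        decoder.extend([tok])          # the emitted token joins the cache too
        outputs.append(tok)
    return outputs


decoder = ToyDecoder()
outputs = generate(decoder, ["a", "b"], {2: ["x"]}, max_steps=4)
```

The point of the sketch is that the interrupting token `"x"` is appended to the existing cache, so every token is encoded exactly once and nothing before the interruption is recomputed.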

Building on our Gemini models, Project Astra explores the future of AI assistants that can process multimodal information, understand the context you’re in, and respond naturally in conversation.

https://deepmind.google/technologies/gemini/project-astra/

Project Astra
