Native Multimodality
- Continuously encoding video frames
- Combining the video and speech input into a timeline of events
- Caching information for efficient recall
How
- 그래서 어떻게 토큰화 시켜서 넣어줄까 (Vision + Audio 더한다음 Split 해서 넣어주지 않을까)
- Inference 중간에 어떻게 토큰 interrupt할까 (KV Cache 로 하면 될듯)
Project Astra
Building on our Gemini models, Project Astra explores the future of AI assistants that can process multimodal information, understand the context you’re in, and respond naturally in conversation.
https://deepmind.google/technologies/gemini/project-astra/

Seonglae Cho