Multimodal Stream

Created
2024 May 18 9:11
Creator
Seonglae Cho
Edited
2024 May 22 4:26

Native Multimodality

  • Continuously encoding video frames
  • Combining the video and speech input into a timeline of events
  • Caching information for efficient recall
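As a rough illustration of the three points above, the sketch below merges two time-sorted streams into one timeline and keeps a small bounded cache of recent events for recall. Everything here is hypothetical: `Event`, `build_timeline`, and `RecallCache` are illustrative names, and the string payloads stand in for encoded frames and speech chunks. It is not a description of how any particular model does this.

```python
from collections import deque
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True)
class Event:
    t: float        # timestamp in seconds
    modality: str   # "video" or "speech"
    payload: str    # stand-in for an encoded frame or a speech chunk


def build_timeline(video, speech):
    """Merge two time-sorted event streams into one time-ordered timeline."""
    return list(merge(video, speech, key=lambda e: e.t))


class RecallCache:
    """Keeps only the most recent `capacity` events for cheap lookup."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)

    def add(self, event):
        self.events.append(event)

    def recall(self, since):
        """Return cached events with timestamp >= since."""
        return [e for e in self.events if e.t >= since]


# Hypothetical inputs: three video frames and two speech chunks.
video = [
    Event(0.0, "video", "frame0"),
    Event(0.5, "video", "frame1"),
    Event(1.0, "video", "frame2"),
]
speech = [Event(0.3, "speech", "what is"), Event(0.8, "speech", "this?")]

timeline = build_timeline(video, speech)
cache = RecallCache(capacity=3)
for event in timeline:
    cache.add(event)
```

The bounded `deque` is one simple design choice for the cache: old events fall out automatically, which trades full history for constant memory.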

How

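  • So how are the inputs tokenized and fed in? (Perhaps the vision and audio tokens are combined and then split into chunks before being fed to the model.)
  • How can token generation be interrupted in the middle of inference? (A KV cache could probably handle this.)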
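The two guesses above can be made concrete with a toy sketch. It assumes vision and audio tokens are interleaved in fixed-size chunks, and that an interruption is handled by truncating the cached reply and appending new input while reusing the cached prefix. `interleave` and `ToyKVCache` are hypothetical names, and the cache stores placeholder tuples rather than real key/value tensors. This only mirrors the speculation in the note, not a confirmed design.

```python
def interleave(vision, audio, chunk):
    """Alternate fixed-size chunks of vision and audio tokens."""
    out = []
    for i in range(0, max(len(vision), len(audio)), chunk):
        out.extend(vision[i:i + chunk])
        out.extend(audio[i:i + chunk])
    return out


class ToyKVCache:
    """Stand-in for a KV cache: one entry per processed token."""

    def __init__(self):
        self.entries = []
        self.computed = 0  # how many tokens were actually encoded

    def extend(self, tokens):
        for tok in tokens:
            self.entries.append(("kv", tok))  # placeholder for key/value tensors
            self.computed += 1

    def truncate(self, length):
        """Drop entries past `length`, e.g. a reply that was cut off."""
        del self.entries[length:]


cache = ToyKVCache()
prompt = interleave(["v0", "v1", "v2", "v3"], ["a0", "a1", "a2", "a3"], chunk=2)
cache.extend(prompt)
prefix_len = len(cache.entries)

cache.extend(["r0", "r1", "r2"])  # the model starts replying
cache.truncate(prefix_len + 1)    # the user interrupts after the first reply token
cache.extend(interleave(["v4"], ["a4"], chunk=1))  # new input reuses the cached prefix
```

In this sketch the interruption costs only the two new input tokens: the eight prompt entries stay cached and are never re-encoded.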
