Streaming Vision Speech Model

Creator

Seonglae Cho

Created

2025 Nov 20 18:34

Editor

Seonglae Cho

Edited

2026 Feb 13 12:37

Refs

Conversational Thinking AI

Parallel decoder for multimodal generation

Train on image-captioning data with an existing VLM; train only the voice decoder part.

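A person can watch YouTube and talk with their wife while washing the dishes; a robot cannot hold a conversation while performing an action. People assume humans can only do one thing at a time, but in fact the different tasks are not being switched between: they proceed in parallel, unconsciously, and the model architecture should support this too.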
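The parallel-decoder idea above can be illustrated with a toy sketch: one shared state is updated per time step, and every output head (here speech and action) emits a token at every step instead of taking turns. The class name `ParallelDecoder`, the head names, and the integer "state" are all hypothetical placeholders, not part of any existing model.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelDecoder:
    """Toy stand-in: one shared state, one output head per modality."""
    heads: tuple = ("speech", "action")  # assumed head names
    state: int = 0
    outputs: dict = field(default_factory=dict)

    def step(self, observation: int) -> dict:
        # Shared state update (a real model would run its backbone here).
        self.state += observation
        emitted = {}
        for name in self.heads:
            # Every head emits at every step; no head waits for another.
            token = f"{name}_{self.state}"
            self.outputs.setdefault(name, []).append(token)
            emitted[name] = token
        return emitted


def run_stream(observations):
    """Feed a stream of observations and collect each head's output stream."""
    decoder = ParallelDecoder()
    for obs in observations:
        decoder.step(obs)
    return decoder.outputs


streams = run_stream([1, 2, 3])
```

Each head ends up with an output stream of the same length as the input stream, which is the property a switching (one-task-at-a-time) decoder would not have.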

PersonaPlex

Ideas

cross attention

1. Browser-action VLA model tool

Research question

Shape grasping with an action trajectory: isolate shape recognition from conceptual object identification.

Experiments

Test whether the model recognizes that there is a knife when given a knife image with soap noise.

Given a knife image, the model treating it as soap is the key point.
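A minimal sketch of how this experiment could be scored, assuming "soap noise" means blending a soap image into the knife image at increasing strength. The images here are flat lists of grayscale values and `toy_classifier` is a hypothetical placeholder for the model under test; neither comes from the original notes.

```python
def blend(image, distractor, alpha):
    """Pixel-wise mix: alpha=0 keeps the image, alpha=1 is all distractor."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(image, distractor)]


def misidentification_rate(classify, image, distractor, alphas, target="knife"):
    """Fraction of blend levels where the classifier does not answer `target`."""
    wrong = sum(classify(blend(image, distractor, a)) != target for a in alphas)
    return wrong / len(alphas)


# Hypothetical placeholders: flat grayscale "images" and a threshold classifier.
knife = [0.9] * 16
soap = [0.1] * 16


def toy_classifier(pixels):
    # Stand-in for the model under test: bright means knife, dark means soap.
    return "knife" if sum(pixels) / len(pixels) > 0.5 else "soap"


alphas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
rate = misidentification_rate(toy_classifier, knife, soap, alphas)
```

With a real model in place of `toy_classifier`, the interesting output is the noise level at which a knife starts being reported as soap.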

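Why hasn't this come out yet? (important)

This is not because "people didn't think of it", but because of:

Training instability

Difficulty of evaluation

Failure modes that are too large

High industrial risk

These made it an area that was hard to turn into a paper.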

https://chatgpt.com/g/g-p-6762eaf0e6c08191bbe8f2e5939f2ed0-ai/c/693234ae-3d8c-8325-882e-adfdf4e8b9ee

https://chatgpt.com/c/693f30b4-7468-832f-81f3-23b54e95af6f
