Uses a cross-modal embedding method that can process images and text simultaneously
Language Is Not All You Need: Aligning Perception with Language Models
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large...
https://arxiv.org/abs/2302.14045


Seonglae Cho