이미지와 텍스트를 동시에 처리할 수 있는 Cross-modal Embedding 방법을 사용 arxiv.orghttps://arxiv.org/pdf/2302.14045.pdfLanguage Is Not All You Need: Aligning Perception with Language ModelsA big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large...https://arxiv.org/abs/2302.14045