Most available multimodal models process each modality separately, which limits their ability to integrate information across modalities and to generate documents containing arbitrary interleaved sequences of images and text.
Multimodal Representations
Seonglae Cho