Most multimodal models available process different modalities separately, which limits their ability to integrate across modalities and generate documents with arbitrary sequences of images and text.Multimodal RepresentationsMultimodal StreamCross-modal Attention