Multimodal AI

Creator: Seonglae Cho
Created: 2022 Jun 23 2:50
Edited: 2025 Oct 15 15:10
Modality determines the type of data contained in a data point. Unlike unimodal AI, multimodal AI integrates or exchanges information across modalities such as vision, text, speech, touch, and smell, collected from diverse sensors.
Multimodal AI Models

Multimodal AI Notion

CS224n

Stanford CS 224N | Natural Language Processing with Deep Learning
Note: In the 2023–24 academic year, CS224N will be taught in both Winter and Spring 2024. We hope to see you in class!

Multimodality and Large Multimodal Models (LMMs)
For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition).
What Is Multimodal AI?
Applications, Principles, and Core Research Challenges in Multimodal AI
[Miracle Letter] AI is becoming one?!
A "reference guide" for productive workers doing the miracle morning
thegenerality.com

UX

Multi-Modal AI is a UX Problem
Transformers and other AI breakthroughs have shown state-of-the-art performance across different modalities:
* Text-to-Text (OpenAI ChatGPT)
* Text-to-Image (Stable Diffusion)
* Image-to-Text (OpenAI CLIP)
* Speech-to-Text (OpenAI Whisper)
* Text-to-Speech (Meta's Massively Multilingual Speech)
* Image-to-Image (img2img or pix2pix)
* Text-to-Audio (Meta MusicGen)
* Text-to-Code (OpenAI Codex / GitHub Copilot)
* Code-to-Text (ChatGPT, etc.)

The next frontier in AI is combining these modalities.
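The modality pairs above can be pictured as a routing table from (input, output) modality to a suitable model. The sketch below is purely illustrative (the registry and `pick_model` helper are hypothetical, not a real library), using the example models listed in the excerpt.

```python
# Hypothetical registry mapping (input modality, output modality) -> example model.
# The model names come from the list above; the routing logic itself is a sketch.
MODALITY_MODELS = {
    ("text", "text"): "ChatGPT",
    ("text", "image"): "Stable Diffusion",
    ("image", "text"): "CLIP",
    ("speech", "text"): "Whisper",
    ("text", "speech"): "Massively Multilingual Speech",
    ("image", "image"): "pix2pix",
    ("text", "audio"): "MusicGen",
    ("text", "code"): "Codex",
}

def pick_model(src: str, dst: str) -> str:
    """Return an example model for a modality pair, or raise if unsupported."""
    try:
        return MODALITY_MODELS[(src, dst)]
    except KeyError:
        raise ValueError(f"no model registered for {src} -> {dst}")
```

A truly multimodal system would replace this table with a single model that accepts and emits several modalities at once, which is exactly the "combining" frontier the article refers to.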