In VLMs, data rather than model architecture is the bottleneck, and the multimodal field is now shifting from a model-centric to a data-centric approach.
The FineVision paper builds a large-scale open VLM training dataset by integrating and refining existing public multimodal data, and shows that training on this data yields better performance than training on existing open datasets.
Open Data Is All You Need
https://huggingface.co/spaces/HuggingFaceM4/FineVision

Seonglae Cho