image segmentation
build image nearest-feature graph (graph generation)
graphical positional embedding
vision transformer inference
Intuition: human vision is pixel-based, but rather than convolving it applies selective attention.
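
The steps listed above can be sketched end to end as follows. This is only a rough illustration under several assumptions that are not in the notes: segmentation is replaced by a fixed grid, segment features are mean colors, "nearest feature" is read as k-nearest neighbors in feature space, the graphical positional embedding is taken to be Laplacian eigenvectors, and "vision transformer inference" is reduced to one self-attention layer with random, untrained weights. All function names are hypothetical.

```python
import numpy as np

def segment_grid(image, cell):
    """Stand-in segmentation (assumption): label each pixel by its grid cell."""
    h, w = image.shape[:2]
    n_cols = (w + cell - 1) // cell
    rows = np.arange(h) // cell
    cols = np.arange(w) // cell
    return rows[:, None] * n_cols + cols[None, :]

def segment_features(image, labels):
    """Mean color per segment, shape (n_segments, channels)."""
    n = labels.max() + 1
    flat = labels.ravel()
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    counts = np.bincount(flat, minlength=n)
    sums = np.stack(
        [np.bincount(flat, weights=pixels[:, c], minlength=n)
         for c in range(pixels.shape[1])],
        axis=1,
    )
    return sums / counts[:, None]

def knn_graph(feats, k):
    """Symmetric adjacency linking each segment to its k nearest in feature space."""
    dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-matches
    nearest = np.argsort(dist, axis=1)[:, :k]
    n = len(feats)
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)

def laplacian_pe(adj, dim):
    """Graph positional embedding (assumption): low-frequency Laplacian eigenvectors."""
    lap = np.diag(adj.sum(axis=1)) - adj
    _, vecs = np.linalg.eigh(lap)  # eigenvalues returned in ascending order
    return vecs[:, 1:dim + 1]      # skip the first, constant on a connected graph

def self_attention(tokens, rng, d_model):
    """One attention layer with random weights; each segment attends to all others."""
    d_in = tokens.shape[1]
    wq, wk, wv = (rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
                  for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                # toy RGB image
labels = segment_grid(image, cell=8)           # 1. image segmentation
feats = segment_features(image, labels)
adj = knn_graph(feats, k=3)                    # 2. nearest-feature graph
pos = laplacian_pe(adj, dim=4)                 # 3. graphical positional embedding
tokens = np.concatenate([feats, pos], axis=1)
out, weights = self_attention(tokens, rng, d_model=8)  # 4. transformer-style inference
```

No convolution appears anywhere in this sketch: the attention weights decide which segments each segment draws on, which is the selective-attention intuition the notes describe. A real system would swap in a learned segmenter and a trained transformer.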