Vision Language Model augmented with additional behavior tokens
CoT Decoder as Policy
Every 128 steps, OmniJARVIS is forced to reason again and produce new behavior tokens with the latest observation.
arxiv.org: https://arxiv.org/pdf/2407.00114
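
The fixed-interval re-reasoning described above can be sketched as a simple control loop. This is only a minimal illustration, not the OmniJARVIS implementation: `run_episode`, `toy_reason`, and `toy_decode` are hypothetical names, observations and tokens are plain integers, and the only detail taken from the text is the 128-step interval.

```python
from typing import Callable, List, Tuple

REPLAN_INTERVAL = 128  # from the text: reasoning is repeated every 128 steps


def run_episode(
    reason: Callable[[int], List[int]],
    decode_action: Callable[[List[int], int], int],
    observations: List[int],
    interval: int = REPLAN_INTERVAL,
) -> Tuple[List[int], int]:
    """Roll out a policy that re-reasons at a fixed step interval.

    `reason` is a stand-in for the vision language model: it maps the
    latest observation to a fresh list of behavior tokens.
    `decode_action` is a stand-in for the decoder policy: it maps the
    current behavior tokens and the current observation to an action.
    Returns the actions taken and the number of reasoning calls made.
    """
    actions: List[int] = []
    behavior_tokens: List[int] = []
    reasoning_calls = 0
    for step, obs in enumerate(observations):
        # At step 0 and every `interval` steps after it, refresh the
        # behavior tokens using the latest observation.
        if step % interval == 0:
            behavior_tokens = reason(obs)
            reasoning_calls += 1
        # Between refreshes the decoder keeps acting on the same tokens.
        actions.append(decode_action(behavior_tokens, obs))
    return actions, reasoning_calls


def toy_reason(obs: int) -> List[int]:
    """Hypothetical reasoner: derive two small tokens from the observation."""
    return [obs % 7, (obs // 7) % 7]


def toy_decode(tokens: List[int], obs: int) -> int:
    """Hypothetical decoder: pick one of four discrete actions."""
    return (sum(tokens) + obs) % 4


if __name__ == "__main__":
    acts, calls = run_episode(toy_reason, toy_decode, list(range(300)))
    print(len(acts), calls)
```

With 300 observations the reasoner runs at steps 0, 128, and 256, so the decoder acts on each set of behavior tokens for up to 128 steps before they are replaced.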