CNN-based class activation maps (CAM) discriminate classes accurately but tend to cover only part of the object. ViT-based CAM captures the semantic parts of an object well but is weaker at class discrimination. The two are therefore combined as complementary dual branches: a CNN branch providing Class-Aware Knowledge (CAK) and a ViT branch providing Semantic-Aware Knowledge (SAK). The branches exchange knowledge through a contrastive loss, so that each compensates for the other's weakness.
https://arxiv.org/pdf/2403.08801
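
To make the idea of contrastive knowledge exchange concrete, here is a minimal sketch. It is a generic supervised-contrastive (InfoNCE-style) loss and not necessarily the exact formulation in the paper. It assumes that patch embeddings come from one branch and that per-patch pseudo-labels come from the other branch's map (for example, a thresholded CAM). The function name `cross_branch_contrastive_loss`, its parameters, and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cross_branch_contrastive_loss(feats, guide_labels, temperature=0.1):
    """Generic contrastive loss (assumed form, not the paper's exact loss).

    feats        : patch embeddings from one branch (list of vectors).
    guide_labels : per-patch pseudo-labels taken from the OTHER branch.
    Patches sharing a guide label are pulled together; the rest are pushed apart.
    """
    n = len(feats)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and guide_labels[j] == guide_labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        logits = {j: cosine(feats[i], feats[j]) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all other patches.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy example: four 2-D patch embeddings from one branch.
vit_patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Hypothetical foreground/background labels from the other branch's CAM.
cnn_guide = [1, 1, 0, 0]
print(cross_branch_contrastive_loss(vit_patches, cnn_guide))
```

In a dual-branch setup, the same function could be applied in both directions: CNN-derived labels guiding ViT embeddings (injecting class awareness) and ViT-derived labels guiding CNN embeddings (injecting semantic awareness). How positives and negatives are actually selected is a design choice this sketch leaves open.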

Seonglae Cho