Contrastive RL

Creator: Seonglae Cho
Created: 2025 Dec 16 13:44
Edited: 2025 Dec 16 14:2

Goal-conditioned RL

Actor = goal-conditioned policy π(a∣s,g): a network that takes the current state s and goal g as input and outputs an action a. Because each goal g defines a different task, this is a form of multi-task RL; the key point is that the actor is goal-conditioned.
Critic = contrastive similarity score f(s,a,g) = φ(s,a)⊤ψ(g): instead of a traditional Q-network, the critic computes the inner product between a state-action embedding φ(s,a) and a goal embedding ψ(g). This score reflects how likely (s,a) is to lead to the goal g.
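As a concrete illustration, here is a minimal PyTorch sketch of these two components: a goal-conditioned actor π(a∣s,g) and a contrastive critic that scores f(s,a,g) as the inner product φ(s,a)⊤ψ(g). The widths, the deterministic tanh head, and the class names are hypothetical simplifications for readability, not the exact architecture from any specific paper.

```python
# Minimal sketch (PyTorch, hypothetical dimensions and names).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    """Small feed-forward network used for both encoders."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.SiLU(),
        nn.Linear(hidden, hidden), nn.SiLU(),
        nn.Linear(hidden, out_dim),
    )

class GoalConditionedActor(nn.Module):
    """pi(a|s,g): maps (state, goal) to an action (deterministic head for brevity)."""
    def __init__(self, state_dim, goal_dim, action_dim):
        super().__init__()
        self.net = mlp(state_dim + goal_dim, action_dim)

    def forward(self, s, g):
        return torch.tanh(self.net(torch.cat([s, g], dim=-1)))

class ContrastiveCritic(nn.Module):
    """f(s,a,g) = phi(s,a)^T psi(g): inner product of two learned embeddings."""
    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64):
        super().__init__()
        self.phi = mlp(state_dim + action_dim, embed_dim)  # state-action encoder
        self.psi = mlp(goal_dim, embed_dim)                # goal encoder

    def forward(self, s, a, g):
        sa = self.phi(torch.cat([s, a], dim=-1))           # (B, embed_dim)
        gz = self.psi(g)                                   # (B, embed_dim)
        return (sa * gz).sum(-1)                           # (B,) similarity scores
```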
In self-supervised learning, labels are not provided by humans but generated automatically: within a trajectory, a future state s_{t+k} that actually occurs after (s_t, a_t) is used as the positive goal, while states from other trajectories are used as negatives (this is called goal relabeling).
The biggest problem in goal-conditioned RL is that rewards are typically sparse (1 only when the goal is reached, 0 otherwise). This means the critic and actor receive very few learning signals.
Contrastive learning transforms this into a "classification problem" to create dense learning signals:
  • The actual future state is set as the positive goal ("This is a reachable/reached goal")
  • Other states s⁻ in the same batch are set as negative goals (used as marginal distribution samples for InfoNCE/contrastive classification)
The critic learns via contrastive loss (cross-entropy on batch logits), which provides dense gradients at every step without needing sparse task rewards.
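A minimal sketch of this update, assuming the ContrastiveCritic above and hypothetical batch tensors: positives are relabeled future states from the same trajectory, the other goals in the batch serve as negatives, and the loss is plain cross-entropy over the batch logit matrix. Exact contrastive RL variants differ (e.g. binary NCE vs. symmetric InfoNCE), so treat this as one representative instance.

```python
# InfoNCE-style critic loss: cross-entropy on in-batch logits.
import torch
import torch.nn.functional as F

def contrastive_critic_loss(critic, s, a, g_future):
    """s, a: (B, ...) current states/actions; g_future: (B, ...) relabeled future
    states of the *same* trajectories. Row i's positive is g_future[i]; every
    g_future[j], j != i, acts as a negative goal."""
    sa = critic.phi(torch.cat([s, a], dim=-1))        # (B, D) state-action embeddings
    gz = critic.psi(g_future)                         # (B, D) goal embeddings
    logits = sa @ gz.T                                # (B, B): f(s_i, a_i, g_j)
    labels = torch.arange(s.shape[0], device=s.device)  # diagonal = positive pairs
    return F.cross_entropy(logits, labels)            # dense gradient at every step
```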
 
 
 

Contrastive RL

Model Layer Scaling

Residual Connection, Layer Normalization, Swish Function

The actor and critic (including both encoders) use residual connections + LayerNorm + Swish so that very deep MLPs train stably. Instead of the conventional shallow networks of 2-5 layers, scaling the depth from 8 to 64, 256, and up to 1024 layers dramatically improves performance. This is model-architecture scaling (not inference-time scaling), though layer-skip techniques could enable test-time compute tradeoffs.
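A minimal PyTorch sketch of such a building block, assuming a pre-LayerNorm residual MLP block with Swish (SiLU); the widths and depths are placeholder values rather than the exact configuration used in the experiments.

```python
# Residual + LayerNorm + Swish blocks, stacked to reach large depths.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.SiLU()                     # Swish activation

    def forward(self, x):
        h = self.fc2(self.act(self.fc1(self.norm(x))))
        return x + h                             # skip connection keeps gradients flowing

def deep_mlp(in_dim, out_dim, width=256, depth=64):
    """Stack residual blocks to scale depth (e.g. 8 -> 64 -> 256 -> 1024 layers)."""
    layers = [nn.Linear(in_dim, width)]
    layers += [ResidualBlock(width) for _ in range(depth)]
    layers += [nn.LayerNorm(width), nn.Linear(width, out_dim)]
    return nn.Sequential(*layers)
```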
Across various locomotion, maze, and manipulation environments, success rates and time spent near the goal improve by 2x to 50x or more, with particularly large gains on hard Humanoid tasks (some improvements reach tens to hundreds of times). Rather than improving smoothly with depth, performance jumps suddenly after crossing a "critical depth", at which point qualitatively different behaviors (skills) such as walking or climbing over walls emerge. In offline goal-conditioned settings, increasing depth showed minimal benefit; the gains appear mainly when depth is combined with online exploration and data collection.
 
 
 
