Goal-conditioned RL
Actor = goal-conditioned policy π(a∣s,g): a network that takes the current state s and a goal g as input and outputs an action a. Each goal g defines a different task, so this is a form of multi-task RL; the key point is that the actor is conditioned on the goal.
Critic = contrastive similarity score f(s,a,g) = φ(s,a)⊤ψ(g): instead of a traditional Q-network, the critic computes the inner product between a state-action embedding φ(s,a) and a goal embedding ψ(g). This score reflects how likely (s,a) is to lead to the goal g.
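A minimal PyTorch sketch of these two components; the dimension names (state_dim, goal_dim, embed_dim), widths, and layer counts are placeholders rather than any paper's exact architecture:

```python
# Sketch: goal-conditioned actor pi(a | s, g) and contrastive critic
# f(s, a, g) = phi(s, a)^T psi(g). Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Goal-conditioned policy: concatenates state and goal, outputs an action."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

class ContrastiveCritic(nn.Module):
    """f(s, a, g) = phi(s, a)^T psi(g): inner product of two learned embeddings."""
    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64, hidden=256):
        super().__init__()
        self.phi = nn.Sequential(  # state-action encoder phi(s, a)
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, embed_dim),
        )
        self.psi = nn.Sequential(  # goal encoder psi(g)
            nn.Linear(goal_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, state, action, goal):
        sa = self.phi(torch.cat([state, action], dim=-1))  # [B, embed_dim]
        g = self.psi(goal)                                  # [B, embed_dim]
        return (sa * g).sum(dim=-1)                         # [B] similarity scores
```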
In self-supervised learning, labels are not provided by humans but are automatically generated: in a trajectory, the future state s_{t+k} that actually occurs after (s_t, a_t) is used as a positive goal, while states from other trajectories are used as negatives (this is called goal relabeling).
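A rough sketch of this relabeling, assuming a trajectory is stored as plain arrays of states and actions; sampling the offset k geometrically to match a discount factor γ is one common choice, not something stated above:

```python
# Sketch of self-supervised goal relabeling for a single trajectory.
# Positive goal = a future state s_{t+k} from the same trajectory.
import numpy as np

def relabel_positive(states, actions, gamma=0.99, rng=np.random):
    T = len(states) - 1
    t = rng.randint(T)                    # anchor step (s_t, a_t)
    k = rng.geometric(p=1.0 - gamma)      # future offset k >= 1 (assumed geometric)
    future = min(t + k, T)                # clip to the trajectory end
    return states[t], actions[t], states[future]  # (s_t, a_t, positive goal)

# Negatives need no extra work: goals relabeled from *other* trajectories in the
# same minibatch serve as negative examples for (s_t, a_t).
```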
The biggest problem in goal-conditioned RL is that rewards are typically sparse (1 only when the goal is reached, 0 otherwise). This means the critic and actor receive very few learning signals.
Contrastive learning transforms this into a "classification problem" to create dense learning signals:
- The actual future state is set as the positive goal ("This is a reachable/reached goal")
- Other states s⁻ in the same batch are set as negative goals (used as marginal distribution samples for InfoNCE/contrastive classification)
The critic learns via contrastive loss (cross-entropy on batch logits), which provides dense gradients at every step without needing sparse task rewards.
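A sketch of that cross-entropy-on-batch-logits loss, reusing the ContrastiveCritic encoders from the snippet above; within a batch, each (s, a) treats its own relabeled goal as the positive and every other goal as a negative:

```python
# InfoNCE-style contrastive critic loss over a batch of relabeled transitions.
import torch
import torch.nn.functional as F

def contrastive_critic_loss(critic, states, actions, goals):
    """states/actions/goals: [B, ...] tensors; goals[i] is the relabeled
    positive goal for (states[i], actions[i])."""
    sa = critic.phi(torch.cat([states, actions], dim=-1))  # [B, D]
    g = critic.psi(goals)                                   # [B, D]
    logits = sa @ g.T            # [B, B] all pairwise scores f(s_i, a_i, g_j)
    labels = torch.arange(len(states), device=logits.device)
    return F.cross_entropy(logits, labels)  # diagonal entries are the positives
```

Because every row of the logit matrix produces a gradient, the critic gets a dense learning signal from every transition, independent of any task reward.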
Contrastive RL
Model Layer Scaling: Residual Connections, Layer Normalization, Swish Activation
The actor and critic (including both encoders) incorporate residual connections + LayerNorm + Swish so that very deep MLPs train stably. Instead of the conventional shallow 2-5 layer networks, scaling the depth through 8→64→256→1024 layers dramatically improves performance. This is model architecture scaling (not inference-time scaling), though layer-skip techniques could enable test-time compute tradeoffs.
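A sketch of such a residual block and a deep stack built from it; widths and block counts here are illustrative only:

```python
# Residual MLP block: LayerNorm -> Linear -> Swish (SiLU) -> Linear, plus skip.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.norm = nn.LayerNorm(width)
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.act = nn.SiLU()  # Swish with beta = 1

    def forward(self, x):
        h = self.fc2(self.act(self.fc1(self.norm(x))))
        return x + h  # skip connection keeps gradients usable at large depth

def deep_mlp(in_dim, width=256, n_blocks=64, out_dim=64):
    """Stack many residual blocks; n_blocks is where depth scaling happens."""
    return nn.Sequential(
        nn.Linear(in_dim, width),
        *[ResidualBlock(width) for _ in range(n_blocks)],
        nn.LayerNorm(width),
        nn.Linear(width, out_dim),
    )
```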
Across various locomotion, maze, and manipulation environments, success rates and time spent near the goal improve by 2×-50× or more, with particularly large gains on challenging Humanoid tasks (some improvements are tens to hundreds of times larger). Rather than improving smoothly with depth, performance suddenly jumps after crossing a "critical depth", at which point qualitatively different behaviors (skills) such as walking or climbing over walls emerge. In offline goal-conditioned settings, increasing depth showed minimal benefit; the effect appears mainly when combined with online exploration and data collection.

Seonglae Cho