NeurIPS 2025 Best Paper: Scaling Depth in Self-Supervised Reinforcement Learning
Why I wrote this
Recent progress in large-scale learning has highlighted a growing gap between how depth is leveraged in vision and language models versus reinforcement learning. I wrote this post to examine a concrete attempt to close that gap and to clarify what aspects of the approach appear to matter in practice.
NeurIPS 2025 recognized “1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities” with a Best Paper Award. This post provides a structured summary of the paper’s core ideas, empirical findings, and design choices, along with clarifications drawn from an accompanying author interview.
Sources (paper and interview)
- Official NeurIPS poster page (Best Paper designation)
- OpenReview page (paper PDF and reviews)
- Project website (figures and code links)
- Author interview (Latent Space, YouTube)
Motivation and background
As described by the authors in the interview, the project originated in a Princeton independent research class. The central motivation was an observed discrepancy between fields: while depth scaling is standard practice in vision and language models, reinforcement learning systems typically rely on comparatively shallow networks.
The authors note that skepticism was warranted, given the historical difficulty of training deep networks in RL. Their guiding hypothesis, however, was that the limitations of deep RL stem not only from optimization challenges, but also from the choice of learning objectives and architectural stability.
Scaling dimensions explored
The study investigates multiple scaling axes, including network depth, width, and batch size. Among these, depth scaling emerged as the most salient factor. The authors increased depth far beyond conventional RL baselines, reaching up to 1024 layers in certain components, and found that performance did not improve smoothly with depth.
- Naively increasing depth often degraded performance in early experiments.
- Substantial gains appeared abruptly at specific depth thresholds rather than accruing smoothly.
- At comparable parameter budgets, depth scaling yielded stronger improvements than width or batch size scaling.
Clarifying the contribution
A key clarification emphasized by the authors is that their results should not be interpreted as evidence that standard RL algorithms benefit directly from deeper networks. Simply enlarging the function approximators in PPO, SAC, or related methods does not reproduce the reported gains.
Instead, the observed scaling behavior depends on a combination of architectural stabilization techniques and a learning objective that closely resembles self-supervised representation learning. The approach occupies an intermediate space between reinforcement learning and self-supervised learning.
Role of the learning objective
The authors argue that conventional TD-based value regression introduces noise and bias that hinder scalability. Their method shifts emphasis toward a contrastive, InfoNCE-style objective, reframing learning as a discrimination problem rather than pure regression.
Under this formulation, representations of states from the same trajectory are encouraged to align, while representations from unrelated trajectories are separated. This induces training dynamics that more closely resemble those of large-scale self-supervised models in vision and language.
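The discrimination view above can be made concrete with a small NumPy sketch of an InfoNCE-style loss: each anchor state's positive is a future state from the same trajectory, and the other rows in the batch serve as negatives. The function name, shapes, and temperature here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def infonce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch).

    anchors:   (B, D) embeddings of states.
    positives: (B, D) embeddings of future states from the SAME trajectory;
               row i of `positives` is the positive for row i of `anchors`,
               and every other row acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Softmax cross-entropy with the diagonal as the correct class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
B, D = 8, 16
states = rng.normal(size=(B, D))
# Positives simulated as noisy copies of the anchors (stand-ins for
# same-trajectory future states).
futures = states + 0.1 * rng.normal(size=(B, D))
print(infonce_loss(states, futures))  # low loss: positives sit near anchors
```

Because the objective is a classification problem over the batch rather than a regression toward bootstrapped targets, its gradients behave more like those of contrastive self-supervised learning, which is the dynamic the authors attribute the scaling behavior to.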
Training stability and failed early attempts
An important empirical detail is that depth alone was insufficient. Early experiments with deeper networks frequently failed, and residual connections by themselves did not guarantee stable training.
Successful scaling required a precise combination of architectural choices, including normalization strategies and activation functions, together with the contrastive objective. Performance improvements often manifested abruptly once these components aligned.
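The interplay between residual connections and normalization can be sketched with a generic pre-norm residual block. This is a common stabilization recipe rather than necessarily the paper's exact architecture; the ReLU activation and 1/sqrt(fan-in) initialization below are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, w1, w2):
    """Pre-norm residual MLP block: x + W2 @ relu(W1 @ norm(x)).

    Normalizing BEFORE the transformation keeps the scale of each block's
    contribution bounded, so the residual stream grows slowly with depth
    instead of exploding or vanishing.
    """
    h = layer_norm(x)
    h = np.maximum(h @ w1, 0.0)  # ReLU; the paper's activation may differ
    return x + h @ w2

rng = np.random.default_rng(0)
D, H = 64, 256
x = rng.normal(size=(1, D))
# Stack 100 blocks: with pre-norm residuals the output stays finite and
# of moderate magnitude even at this depth.
for _ in range(100):
    w1 = rng.normal(size=(D, H)) / np.sqrt(D)
    w2 = rng.normal(size=(H, D)) / np.sqrt(H)
    x = residual_block(x, w1, w2)
print(np.linalg.norm(x))  # finite, moderate magnitude
```

Removing the `layer_norm` call in this sketch lets the activations compound layer over layer, which mirrors the authors' observation that residual connections alone did not guarantee stable training.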
Depth versus width scaling
The paper also highlights practical trade-offs between depth and width. In a standard MLP, widening a layer grows its parameter count roughly quadratically, whereas adding layers grows the total only linearly, so depth can scale parameters more gradually.
- Width scaling can be effective but is parameter-intensive.
- Depth scaling was often more parameter-efficient at similar performance levels.
- In several settings, deeper models exhibited improved sample efficiency.
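A quick parameter count for a plain MLP makes this trade-off concrete. The dimensions below are arbitrary illustrative choices, not the paper's networks.

```python
def mlp_params(width, depth, in_dim=64, out_dim=32):
    """Parameter count (weights + biases) of a plain MLP with `depth`
    hidden layers of `width` units. Illustrative sizes only."""
    dims = [in_dim] + [width] * depth + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

base = mlp_params(width=256, depth=4)
# Doubling width roughly quadruples the hidden-layer parameters...
wider = mlp_params(width=512, depth=4)
# ...while doubling depth only roughly doubles them.
deeper = mlp_params(width=256, depth=8)
print(base, wider, deeper)
```

At a fixed parameter budget, this asymmetry means depth buys many more layers of composition than width buys extra units, which is consistent with the paper's finding that depth was the more parameter-efficient axis.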
Interaction with batch size
The interview notes that batch size scaling, which is often ineffective in traditional RL, became beneficial once sufficiently deep models were trainable. The authors suggest that increased model capacity may be a prerequisite for exploiting larger batches.
Empirical results indicate that depth scaling can enable other scaling dimensions, rather than acting independently.
Importance of data scale
Data availability is identified as a critical factor. The experiments rely on GPU-accelerated environments capable of generating large volumes of experience in parallel, reducing the relative cost of data collection.
The authors report that significant gains typically emerge only at large data scales, with approximately 15 million transitions cited as a practical lower bound.
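A back-of-the-envelope calculation shows why GPU-parallel simulation makes such a threshold cheap. The throughput figures below are entirely hypothetical and are not taken from the paper; only the 15 million transition figure comes from the source.

```python
# Hypothetical throughput numbers, purely to illustrate the arithmetic.
transitions_needed = 15_000_000   # cited practical lower bound
parallel_envs = 1024              # hypothetical number of vectorized envs
steps_per_env_per_sec = 500       # hypothetical simulator speed

total_steps_per_sec = parallel_envs * steps_per_env_per_sec
seconds = transitions_needed / total_steps_per_sec
print(f"{seconds:.0f} seconds of simulation")  # under a minute at these rates
```

With a single serial environment at the same per-env speed, the same budget would instead take hours, which is why the authors highlight GPU-accelerated environments as a key enabler.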
Compute considerations
Despite the scale of the models, the authors emphasize that many experiments are feasible on a single high-memory GPU, specifically an 80GB H100. While deeper networks increase per-step computation, environment interaction often dominates total runtime in RL workloads.
- Per-step computation increases with depth, but data collection may remain the bottleneck.
- Many environments reach near-saturation performance at moderate depths.
- The limits of depth scaling under large distributed setups remain an open question.
Implications for robotics
Several evaluated environments resemble goal-conditioned robotic control tasks. The authors argue that scalable, self-supervised RL could reduce dependence on dense rewards or large demonstration datasets in robotics.
This suggests a potential alternative pathway: scaling interaction data and stable objectives rather than manual supervision.
Future research directions
- Training deep teacher models followed by distillation or pruning for deployment.
- Joint scaling of depth, width, and batch size with sufficient compute.
- Further theoretical analysis of why contrastive objectives scale more reliably than TD regression.
- Evaluation on real robotic platforms and more challenging sim-to-real tasks.
- Investigation of compositional generalization as a complementary mechanism.
Summary
Overall, the work demonstrates that reinforcement learning can enter a favorable scaling regime by adopting self-supervised learning principles. While the extreme depth figures are notable, the primary contribution lies in identifying a combination of objective, architecture, and data scale that enables consistent performance improvements.