Summary
We've achieved a 10x improvement in data efficiency with NanoGPT Slowrun within a few weeks. An ensemble of 1.8B-parameter models (18B total parameters) trained on 100M tokens matches what would normally require 1B tokens with a standard LM baseline. Data efficiency matters because compute grows much faster than data. Since current scaling laws require proportional increases in both, intelligence will eventually be bottlenecked by data, not compute. This data efficiency result lets us keep improving model performance by scaling compute rather than data.
[...]
Ensembling is probably the most understudied axis of scaling in pretraining. Instead of training one model, you train many models somewhat independently and aggregate their predictions at inference. This way, you can keep leveraging more compute under fixed data and keep improving generalization.
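As a minimal sketch of what "aggregate their predictions at inference" means here, the snippet below averages the next-token probability distributions of independently trained members. The averaging-in-probability-space choice and the toy logits are illustrative assumptions, not a description of the actual Slowrun aggregation code.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_predict(member_logits):
    """Average the predictive distributions of independently trained members.

    member_logits: list of arrays, each of shape (vocab_size,), one per model.
    Returns a single averaged probability distribution over the vocabulary.
    """
    probs = np.stack([softmax(l) for l in member_logits])
    return probs.mean(axis=0)

# Two hypothetical ensemble members' next-token logits over a 4-token vocab.
m1 = np.array([2.0, 0.5, -1.0, 0.0])
m2 = np.array([1.0, 1.5, -0.5, 0.2])
avg = ensemble_predict([m1, m2])
```

Each member is trained on the same fixed data (with different seeds or data orderings), so adding members spends more compute without requiring more tokens.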
[...]
We start by training our 30-layer transformer without looping; halfway through training, we begin looping layers 15-24 four times. Concretely, the forward pass first runs layers 0-24, then re-runs layers 15-24 four more times, and finally runs layers 25-29. This configuration was found empirically to work best: in particular, it is important not to loop the last few layers. More work remains to extend and formalize these heuristics.
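The looped forward pass described above can be sketched as follows. This is a schematic under the stated schedule (layers 0-24 once, layers 15-24 re-run four more times, then layers 25-29); the function name and the representation of layers as plain callables are assumptions for illustration, not the actual training code.

```python
def forward_with_looping(layers, x, loop_start=15, loop_end=24, n_loops=4):
    """Forward pass that re-runs a middle block of layers.

    layers: list of callables, one per transformer layer (30 here).
    First runs layers 0..loop_end once, then re-runs layers
    loop_start..loop_end n_loops more times, then runs the tail layers.
    """
    # First pass: layers 0 through loop_end inclusive.
    for layer in layers[:loop_end + 1]:
        x = layer(x)
    # Re-run the middle block n_loops additional times.
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end + 1]:
            x = layer(x)
    # Finish with the remaining tail layers (never looped).
    for layer in layers[loop_end + 1:]:
        x = layer(x)
    return x
```

With 30 layers this executes 70 layer applications per forward pass: layers 0-14 and 25-29 run once each, while layers 15-24 run five times each.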