Karpathy's depth- improvements transferred to depth-. That's a x scale difference. But research on proxy models shows that transfer breaks down when the size gap gets extreme. Models much smaller than the target tend to predict larger-model behavior poorly.
AutoResearch runs on small models by design: its per-run budget, measured in minutes, limits how big your model can be. An optimization found on a model with millions of parameters may not help one with billions. Training dynamics, gradient noise profiles, and loss surfaces all change with scale.
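One concrete way gradient noise shifts with scale can be measured directly. The sketch below estimates the gradient noise scale (the ratio of per-example gradient variance to squared mean gradient, in the style of McCandlish et al.'s two-batch-size estimator) on synthetic gradients; the `grad_sample` function and its noise model are illustrative assumptions, not part of AutoResearch.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_sample(batch_size, dim=64, noise=4.0):
    # Hypothetical per-batch gradient: a fixed true gradient plus
    # sampling noise whose std shrinks as 1/sqrt(batch_size).
    true_grad = np.full(dim, 0.1)
    return true_grad + rng.normal(0.0, noise / np.sqrt(batch_size), dim)

def noise_scale(b_small, b_big, trials=2000):
    # Unbiased estimates of |G|^2 and tr(Sigma) from mean squared
    # gradient norms measured at two batch sizes, then their ratio.
    g_small = np.mean([np.sum(grad_sample(b_small) ** 2) for _ in range(trials)])
    g_big = np.mean([np.sum(grad_sample(b_big) ** 2) for _ in range(trials)])
    g2 = (b_big * g_big - b_small * g_small) / (b_big - b_small)
    tr_sigma = (g_small - g_big) / (1.0 / b_small - 1.0 / b_big)
    return tr_sigma / g2
```

If a proxy model and the target model report very different noise scales, batch-size and learning-rate tunings found on the proxy are unlikely to carry over unchanged.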
Treat small-model findings as hypotheses for larger models, not guarantees.