Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent work on neural network parameterisations, such as μP, has enabled the transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice: search for the optimal global base hyperparameters at a small model size, then transfer them to a large one. We extend these works in two key ways. To handle scaling along the most important scaling axes, we propose the Complete(d) Parameterisation, which unifies scaling in width and depth, using an adaptation of…
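To make the base-then-transfer practice concrete, here is a minimal sketch of the μP transfer rule for Adam learning rates; mup_lr_for_param is a hypothetical helper (not the paper's implementation), and the depth, batch, and duration scaling of the Complete(d) Parameterisation are not covered.

```python
# Minimal sketch of muP-style hyperparameter transfer for Adam (assumption:
# a standard Transformer whose hidden weight matrices have fan-in that grows
# with width). Tune base_lr at a small proxy width, then rescale per-parameter
# learning rates when widening the model.

def mup_lr_for_param(base_lr: float, base_width: int, width: int,
                     kind: str) -> float:
    """Return the transferred Adam learning rate for one parameter tensor.

    kind: 'hidden' for weight matrices whose fan-in scales with width;
          'vector' for biases and LayerNorm gains, whose size per coordinate
          does not scale with width and which keep the base learning rate.
    """
    if kind == "hidden":
        # Under muP with Adam, hidden-matrix learning rates shrink in
        # proportion to width: lr = base_lr * base_width / width.
        return base_lr * base_width / width
    return base_lr  # vector-like parameters transfer unchanged


# Usage: a base_lr of 1e-2 tuned at width 256 transfers to width 4096 as
# 1e-2 * 256 / 4096 = 6.25e-4 for hidden matrices, unchanged for vectors.
if __name__ == "__main__":
    print(mup_lr_for_param(1e-2, base_width=256, width=4096, kind="hidden"))
    print(mup_lr_for_param(1e-2, base_width=256, width=4096, kind="vector"))
```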