Poster
On the Parameterization of Second-Order Optimization Effective towards the Infinite Width
Satoki Ishikawa · Ryo Karakida
Halle B
Second-order optimization has been developed to accelerate the training of deep neural networks, and it is being applied to increasingly large-scale models. In this study, with a view to training at even larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even as the network width increases significantly. Inspired by the maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters, including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer hyperparameters across models with different widths.
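To make concrete where the learning rate and damping terms mentioned above enter, here is a minimal sketch of a single damped K-FAC step for one fully connected layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `kfac_step` and its arguments are hypothetical, the Kronecker factors are estimated from one mini-batch, and the width-dependent scalings of the learning rate, damping, and initialization that the study identifies are deliberately not reproduced here.

```python
import numpy as np


def kfac_step(W, h, delta, lr, damping_a, damping_b):
    """One damped K-FAC update for a linear layer (illustrative sketch).

    W      : (out, in) weight matrix
    h      : (batch, in) layer inputs (activations)
    delta  : (batch, out) gradients of the loss w.r.t. the layer outputs
    lr, damping_a, damping_b : hypothetical hyperparameter names; how they
        should scale with width is the subject of the paper and is not
        encoded in this sketch.
    """
    n = h.shape[0]
    grad = delta.T @ h / n   # (out, in) mini-batch gradient w.r.t. W
    A = h.T @ h / n          # (in, in) input-side Kronecker factor
    B = delta.T @ delta / n  # (out, out) output-side Kronecker factor

    # Damping: add a scaled identity to each factor before inverting.
    A_d = A + damping_a * np.eye(A.shape[0])
    B_d = B + damping_b * np.eye(B.shape[0])

    # Preconditioned gradient B_d^{-1} @ grad @ A_d^{-1}, using linear
    # solves instead of explicit inverses. A_d is symmetric, so
    # left @ A_d^{-1} equals (A_d^{-1} @ left.T).T.
    left = np.linalg.solve(B_d, grad)
    precond = np.linalg.solve(A_d, left.T).T

    return W - lr * precond
```

In this sketch the two damping values control how close the step is to plain gradient descent: as they grow, the preconditioner approaches a scaled identity. Shampoo uses a different preconditioner built from gradient statistics and is not shown.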