Many algorithms in optimization for ML are motivated by the high-level idea of having “adaptive” step-sizes for each parameter. Such methods include Adam, RMSProp, and other methods in the same family tree, hyper-gradient methods that update hyperparameters such as the step-size of gradient descent by gradient descent, and one could also put approximate second-order methods in that category. But we don’t have a good formal definition of what we mean by “adaptive step-sizes”. The one algorithm that has such a definition is AdaGrad, in online learning. But that definition has to account for adversarial, non-smooth functions. This is a difficult setting: to be “adaptive” under that definition, an algorithm has to be very conservative and use decreasing step-sizes, which can lead to bad performance on simple problems compared to baselines like gradient descent with a line-search.
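As a concrete illustration of that conservatism, the standard diagonal AdaGrad update (written here without the usual small stabilization constant) uses per-coordinate step-sizes of the form

$$x_{t+1,i} = x_{t,i} - \frac{\eta}{\sqrt{\sum_{s \le t} g_{s,i}^2}} \, g_{t,i},$$

where $g_{s,i}$ is the $i$-th coordinate of the gradient at step $s$. The denominator can only grow, so the effective per-coordinate step-size can only shrink over time, even when the problem turns out to be easy.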

Our work introduces a different definition of adaptivity for non-adversarial problems, specifically for deterministic, smooth, strongly convex functions. On such problems, the performance of gradient descent is typically characterized by the condition number of the Hessian. Fixed per-coordinate step-sizes, i.e., a diagonal preconditioner, can improve performance by improving the condition number, and the “best” diagonal preconditioner is the one that minimizes the condition number. We define an “adaptive” algorithm as one that is competitive with the best diagonal preconditioner for the given problem, and develop multidimensional backtracking, an algorithm that satisfies this definition of adaptivity. The algorithm is a generalization of the standard backtracking line-search for smooth optimization to multiple dimensions, and combines intuition from the hypergradient literature with the formalism of a cutting-plane method to search for a good preconditioner while running preconditioned gradient descent.
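To make the ingredients concrete, here is a minimal sketch (in Python/NumPy) of preconditioned gradient descent with a backtracking-style sufficient-decrease test on a diagonal preconditioner. This is not multidimensional backtracking itself: for simplicity, the sketch shrinks the whole preconditioner by a constant factor whenever a step fails the test, whereas the actual method uses a cutting-plane scheme, guided by hypergradient-style information, to narrow down the set of candidate preconditioners. The quadratic test problem and the shrinking rule below are illustrative assumptions.

```python
# Minimal sketch, not the paper's algorithm: preconditioned gradient descent on a
# quadratic with an Armijo-style sufficient-decrease test. When a step with the
# current diagonal preconditioner fails the test, we crudely shrink the whole
# preconditioner, the way a scalar backtracking line-search shrinks its step-size.
import numpy as np

def f_and_grad(x, H, b):
    """Quadratic objective f(x) = 0.5 x^T H x - b^T x and its gradient."""
    return 0.5 * x @ H @ x - b @ x, H @ x - b

def preconditioned_gd(H, b, x0, p0, max_iters=1000, tol=1e-8):
    x, p = x0.copy(), p0.copy()        # p: diagonal preconditioner, stored as a vector
    for _ in range(max_iters):
        fx, g = f_and_grad(x, H, b)
        if np.linalg.norm(g) < tol:
            break
        x_new = x - p * g              # candidate step x+ = x - P g (P diagonal)
        f_new, _ = f_and_grad(x_new, H, b)
        if f_new <= fx - 0.5 * g @ (p * g):
            x = x_new                  # sufficient decrease holds: accept the step
        else:
            p = 0.5 * p                # otherwise shrink the preconditioner and retry
    return x

rng = np.random.default_rng(0)
d = 5
H = np.diag(rng.uniform(1.0, 100.0, size=d))   # ill-conditioned diagonal Hessian
b = rng.normal(size=d)
x_hat = preconditioned_gd(H, b, x0=np.zeros(d), p0=np.ones(d))
print("distance to optimum:", np.linalg.norm(x_hat - np.linalg.solve(H, b)))
```

The key piece is the sufficient-decrease test: it is the same test a scalar backtracking line-search uses, applied to a preconditioned step, and it is what lets the method check whether a candidate preconditioner is too aggressive. The difference in multidimensional backtracking is what happens on failure, where a failed step is used to cut away a whole region of candidate preconditioners rather than to uniformly shrink a single scalar.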