Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent
Alexander Terenin (Cornell University)
The ability to deploy Gaussian-process-based decision-making systems such as Bayesian optimization at scale has traditionally been limited by the computational cost of solving large linear systems. The de facto standard for solving these linear systems at scale is the conjugate gradient algorithm. In this work, we instead study stochastic gradient descent, which is known to converge near-arbitrarily-slowly on the quadratic objectives that correspond to Gaussian process models' linear systems. In spite of this, we show that it produces solutions with low test error, and that it quantifies uncertainty in a manner that mirrors the true posterior. We develop a spectral characterization of the error caused by finite-time non-convergence, which we prove is small both near the data and sufficiently far from the data. Stochastic gradient descent therefore differs from the true posterior only in the region between, demonstrating a form of implicit bias caused by benign non-convergence. We conclude by showing, empirically, that stochastic gradient descent achieves state-of-the-art performance on sufficiently large-scale regression tasks, and that it produces uncertainty estimates which match those of significantly more expensive baselines on large-scale Bayesian optimization.
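To make the setup concrete, the sketch below (illustrative, not code from the talk) runs minibatch stochastic gradient descent on the quadratic objective L(v) = ½ vᵀ(K + σ²I)v − yᵀv, whose unique minimizer v* = (K + σ²I)⁻¹y gives the representer weights of the GP posterior mean. The kernel, data, and hyperparameters are assumptions made for the example, and the sketch omits refinements (e.g. variance reduction or iterate averaging) that a practical implementation would likely add.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 * lengthscale^2)).
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
n, noise = 1000, 0.1
x = rng.uniform(-3.0, 3.0, n)
y = np.sin(2.0 * x) + noise * rng.standard_normal(n)

K = rbf_kernel(x, x)

# SGD on the quadratic L(v) = 0.5 * v'(K + noise^2 I)v - y'v, whose unique
# minimizer v* = (K + noise^2 I)^{-1} y yields the posterior mean
# m(x*) = k(x*, x) v*. Each step uses an unbiased sparse gradient estimate
# built from a random minibatch of coordinates.
v = np.zeros(n)
lr, batch, steps = 1e-4, 100, 5000  # assumed hyperparameters, not tuned
for _ in range(steps):
    idx = rng.choice(n, size=batch, replace=False)
    grad = np.zeros(n)
    # Full gradient is (K + noise^2 I)v - y; restricting it to the sampled
    # coordinates and rescaling by n / batch keeps the estimator unbiased.
    grad[idx] = (K[idx] @ v + noise**2 * v[idx] - y[idx]) * (n / batch)
    v -= lr * grad

# Exact representer weights for comparison: the SGD iterate can remain far
# from v* in weight space while still predicting accurately near the data.
v_exact = np.linalg.solve(K + noise**2 * np.eye(n), y)

x_test = np.linspace(-3.0, 3.0, 200)
mean_sgd = rbf_kernel(x_test, x) @ v
mean_exact = rbf_kernel(x_test, x) @ v_exact
print("max |mean_sgd - mean_exact|:", np.abs(mean_sgd - mean_exact).max())
```

The comparison at the end illustrates the abstract's point: even when the optimizer is far from convergence in weight space, the resulting predictions can closely track the exact posterior mean where there is data.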