Expected smoothness is the key to understanding mini-batching for stochastic gradient methods

Robert Gower (Telecom Paris)

Stochastic gradient methods are efficient when compared to full batch methods (gradient descent/ Newton) because for each iteration they need only a mini-batch of data. But how big should this mini-batch be? There is currently little theory that suggests what the size of the mini-batch should be, and so in practice a rule of thumb is used. Here I will show how to choose the mini-batch size in a way that results in the fastest execution (in other words that optimizes the total complexity). The key to choosing this optimal mini-batch size is in quantifying the expected smoothness constant, a definition I will introduce in my talk and that depends on the data and the sampling used. In particular, by quantifying the expected smoothness constant for sampling without replacement, I will show how we can choose the optimal mini-batch size and larger stepsizes for both the standard SGD (Stochastic gradient descent) methods and stochastic variance reduced methods such as SAGA and SVRG.