Gradient Descent Converges Linearly for Logistic Regression on Separable Data
We show that running gradient descent on the logistic regression objective guarantees loss f(x) ≤ 1.1 · f(x∗) + ε, where the error ε decays exponentially with the number of iterations. This is in contrast to the common intuition that the absence of strong convexity precludes linear convergence of first-order methods, and it highlights the importance of variable learning rates for gradient descent. For separable data, our analysis proves that the error between the predictor returned by gradient descent and the hard SVM predictor decays as poly(1/t), exponentially faster than the previously known bound of O(log log t / log t). Our key observation is a property of the logistic loss that we call multiplicative smoothness and that is (surprisingly) little-explored: As the loss decreases, the objective becomes (locally) smoother, and therefore the learning rate can increase. Our results also extend to sparse logistic regression, where they lead to an exponential improvement of the sparsity-error tradeoff.
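The following is a minimal sketch of the idea behind multiplicative smoothness: gradient descent on the logistic objective where the step size grows as the loss shrinks. The concrete schedule eta_t ∝ 1/(R² · f(w_t)), the constant c, and the data-radius proxy R² are illustrative assumptions made here, not the exact scheme specified in the abstract.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss f(w) = mean_i log(1 + exp(-y_i <x_i, w>))."""
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -sigmoid(-m)
    coeffs = -1.0 / (1.0 + np.exp(margins))
    return (X * (coeffs * y)[:, None]).mean(axis=0)

def gd_increasing_step(X, y, iters=1000, c=0.25):
    """Gradient descent whose learning rate increases as the loss decreases.

    Illustrative rule (an assumption for this sketch): eta_t = c / (R^2 * f(w_t)),
    motivated by the local smoothness of the logistic objective shrinking with
    the loss itself ("multiplicative smoothness").
    """
    n, d = X.shape
    R2 = np.max(np.sum(X ** 2, axis=1))   # crude data-radius proxy for smoothness
    w = np.zeros(d)
    for _ in range(iters):
        f = max(logistic_loss(w, X, y), 1e-12)  # floor avoids division by zero
        eta = c / (R2 * f)                      # step size grows as f shrinks
        w = w - eta * logistic_grad(w, X, y)
    return w
```

On separable data the loss tends to zero, so this schedule lets the step size grow without bound, which is the mechanism the abstract credits for the fast convergence to the hard SVM direction.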