Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

Work by Hamed Karimi, Julie Nutini, and Mark Schmidt, ECML PKDD 2016

Keywords: nonconvex optimization, Polyak-Łojasiewicz inequality, gradient domination, global convergence

Summary

The authors revisit the Polyak-Łojasiewicz (PL) inequality, first analyzed by Polyak in 1963, deriving convergence results for several descent methods on modern machine learning problems and establishing equivalences and sufficiency-necessity relationships with several other nonconvex function classes. A highlight of the paper is a beautifully simple proof of linear convergence of gradient descent on PL functions, which we walk through in our discussion.

Read the paper on arXiv here.

Convergence of gradient descent under the Polyak-Łojasiewicz inequality

This proof is extremely short and simple, requiring only a few assumptions and a basic mathematical background. It is strange that it is not as universally known as the corresponding theory for convex optimization.

Relating to the theory of Lyapunov stability for nonlinear systems, the PL inequality essentially means that the function value itself is a valid Lyapunov function for exponential stability of the global minimum under the nonlinear gradient descent dynamics.
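
To sketch this connection: along the continuous-time gradient flow $\dot{x} = -\nabla f(x)$, the candidate Lyapunov function $V(x) = f(x) - f^*$ satisfies, by the PL inequality (stated below),

$$ \dot{V} = \nabla f(x)^\top \dot{x} = -\|\nabla f(x)\|^2 \le -2\mu \left( f(x) - f^* \right) = -2\mu V, $$

so $V(x(t)) \le e^{-2\mu t} V(x(0))$ and the function value decays to the minimum exponentially fast.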

Lieven Vandenberghe’s lecture notes, which analyze the case of strongly convex functions, are an excellent supplement.

The setting we consider is that of minimizing a differentiable function f(x).

Lipschitz continuity

A function is Lipschitz continuous with constant L if

$$ |f(x) - f(y)| \le L \|x - y\| \quad \text{for all } x, y $$

Likewise the gradient (first derivative) of a function is Lipschitz continuous with constant L if

$$ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y $$
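
As a concrete example (ours, not the paper's): for a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with symmetric positive semidefinite $A$, the gradient $\nabla f(x) = Ax - b$ is Lipschitz continuous with constant $L = \lambda_{\max}(A)$. A minimal numerical check, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive semidefinite A; for f(x) = 0.5 x^T A x - b^T x,
# the gradient A x - b is Lipschitz with constant L = lambda_max(A).
M = rng.standard_normal((5, 5))
A = M.T @ M
b = rng.standard_normal(5)
L = np.linalg.eigvalsh(A).max()

grad = lambda x: A @ x - b

# Verify ||grad(x) - grad(y)|| <= L ||x - y|| on random pairs.
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
print("Lipschitz gradient check passed with L =", round(L, 3))
```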

For the next steps we follow slide 1.13 of Lieven Vandenberghe’s lecture notes.

Recall the Cauchy-Schwarz inequality

$$ u^\top v \le \|u\| \, \|v\| $$

Applying the Cauchy-Schwarz inequality to the Lipschitz gradient condition gives

$$ (\nabla f(x) - \nabla f(y))^\top (x - y) \le \|\nabla f(x) - \nabla f(y)\| \, \|x - y\| \le L \|x - y\|^2 $$

Define the function g(t) as

$$ g(t) = f(x + t(y - x)) $$

If the domain of f is convex, then g(t) is well-defined for all t in [0, 1].

Using the definition of g(t), the chain rule of multivariate calculus, and the previous inequality we have

$$ g'(t) = \nabla f(x + t(y - x))^\top (y - x), \qquad g'(t) - g'(0) = \left( \nabla f(x + t(y - x)) - \nabla f(x) \right)^\top (y - x) \le t L \|y - x\|^2 $$

Using the definition of g(t) we can rewrite f(y) in terms of an integral by using the fundamental theorem of calculus

$$ f(y) = g(1) = g(0) + \int_0^1 g'(t) \, dt = f(x) + \int_0^1 g'(t) \, dt $$

Bounding the integrand using the previous inequality, g′(t) ≤ g′(0) + tL‖y − x‖², and integrating from 0 to 1 we obtain

$$ \int_0^1 g'(t) \, dt \le \int_0^1 \left( g'(0) + t L \|y - x\|^2 \right) dt = \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2 $$

Substituting back into the expression for f(y) we obtain the quadratic upper bound

$$ f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2 $$
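
As a quick sanity check of this bound on a non-quadratic function (our example, not the paper's), take $f(x) = \log(1 + e^x)$, whose second derivative $\sigma(x)(1 - \sigma(x))$ is at most $1/4$, so its gradient is Lipschitz with $L = 1/4$:

```python
import numpy as np

# f(x) = log(1 + exp(x)) has f''(x) = s(x)(1 - s(x)) <= 1/4, so its
# derivative is Lipschitz with L = 1/4.
f = lambda x: np.logaddexp(0.0, x)       # numerically stable log(1 + e^x)
df = lambda x: 1.0 / (1.0 + np.exp(-x))  # sigmoid
L = 0.25

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = 10 * rng.standard_normal(2)
    # Quadratic upper bound: f(y) <= f(x) + f'(x)(y - x) + (L/2)(y - x)^2
    upper = f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2
    assert f(y) <= upper + 1e-9
print("quadratic upper bound holds on all sampled pairs")
```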

The Polyak-Łojasiewicz inequality

A function is said to satisfy the Polyak-Łojasiewicz (PL) inequality if the following condition holds for some μ > 0:

$$ \frac{1}{2} \|\nabla f(x)\|^2 \ge \mu \left( f(x) - f^* \right) \quad \text{for all } x $$

where f* is the minimum function value.

This means that the squared norm of the gradient grows at least linearly in the suboptimality f(x) − f*: the gradient cannot be small while the function value is far from the optimal value.

Additionally, this implies that every stationary point of f(x) is a global minimum: if ∇f(x) = 0, the inequality forces f(x) = f*. Unlike strong convexity, however, the PL inequality requires neither convexity nor a unique minimizer.
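
For instance, the paper gives $f(x) = x^2 + 3\sin^2(x)$ as a nonconvex function satisfying the PL inequality with $\mu = 1/32$. A minimal numerical check of that claim, assuming NumPy:

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x): nonconvex, but PL (the paper cites mu = 1/32);
# its minimum value is f* = 0, attained at x = 0.
f = lambda x: x**2 + 3 * np.sin(x) ** 2
df = lambda x: 2 * x + 3 * np.sin(2 * x)
mu, f_star = 1 / 32, 0.0

# Check (1/2) f'(x)^2 >= mu (f(x) - f*) on a dense grid.
xs = np.linspace(-10, 10, 100001)
assert np.all(0.5 * df(xs) ** 2 >= mu * (f(xs) - f_star) - 1e-12)
print("PL inequality holds on the sampled grid")
```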

Gradient descent

The gradient descent update simply takes a step in the direction of the negative gradient:

$$ x_{k+1} = x_k - \alpha \nabla f(x_k) $$
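
A minimal sketch of the update loop (function and variable names are ours), assuming NumPy:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters):
    """Plain gradient descent: x_{k+1} = x_k - alpha * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)  # step along the negative gradient
        history.append(x.copy())
    return x, history
```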

We are now ready to prove convergence of gradient descent under the PL inequality, i.e., Theorem 1 of Karimi et al.

Rearranging the gradient descent update gives the difference

$$ x_{k+1} - x_k = -\alpha \nabla f(x_k) $$

Using the gradient descent update rule in the quadratic upper bound condition (from Lipschitz continuity of the gradient) we obtain

$$ f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^\top (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2 = f(x_k) - \alpha \left( 1 - \frac{L \alpha}{2} \right) \|\nabla f(x_k)\|^2 $$

If the stepsize α is chosen so that the coefficient on the right-hand side is negative, i.e., 0 < α < 2/L, then applying the Polyak-Łojasiewicz inequality gives

$$ f(x_{k+1}) - f(x_k) \le -\alpha \left( 1 - \frac{L \alpha}{2} \right) \|\nabla f(x_k)\|^2 \le -2 \mu \alpha \left( 1 - \frac{L \alpha}{2} \right) \left( f(x_k) - f^* \right) $$

The range of permissible stepsizes is (0, 2/L), with the best rate achieved at a stepsize of 1/L. Under this choice, we obtain

$$ f(x_{k+1}) - f(x_k) \le -\frac{1}{2L} \|\nabla f(x_k)\|^2 \le -\frac{\mu}{L} \left( f(x_k) - f^* \right) $$

Adding f(x_k) – f* to both sides gives

$$ f(x_{k+1}) - f^* \le \left( 1 - \frac{\mu}{L} \right) \left( f(x_k) - f^* \right) $$

Dividing by f(x_k) – f* gives the linear (geometric) convergence rate

$$ \frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} \le 1 - \frac{\mu}{L}, \qquad \text{and hence} \qquad f(x_k) - f^* \le \left( 1 - \frac{\mu}{L} \right)^k \left( f(x_0) - f^* \right) $$

This shows that the difference between the current function value and the minimum decreases at least as fast as a geometric series with a rate determined by the ratio of the PL and Lipschitz constants.
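
To see the rate empirically, here is a short sketch (assuming NumPy) on the paper's nonconvex PL example $f(x) = x^2 + 3\sin^2(x)$: since $f''(x) = 2 + 6\cos(2x) \le 8$ the gradient is Lipschitz with $L = 8$, and the paper cites $\mu = 1/32$, so the theory predicts a contraction factor of $1 - \mu/L = 1 - 1/256$ per iteration:

```python
import numpy as np

# Nonconvex PL example from the paper: f* = 0 at x = 0,
# gradient Lipschitz with L = 8, PL constant mu = 1/32 (as cited).
f = lambda x: x**2 + 3 * np.sin(x) ** 2
df = lambda x: 2 * x + 3 * np.sin(2 * x)
L, mu, f_star = 8.0, 1 / 32, 0.0

x = 3.0
gap0 = f(x) - f_star
for k in range(1, 501):
    x = x - (1.0 / L) * df(x)  # gradient descent with stepsize 1/L
    bound = (1 - mu / L) ** k * gap0
    # Theorem 1 guarantee: the suboptimality shrinks geometrically.
    assert f(x) - f_star <= bound + 1e-12
print(f"after 500 steps: f - f* = {f(x) - f_star:.3e} (bound {bound:.3e})")
```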

Why would control systems researchers care about this?

The Polyak-Łojasiewicz inequality is key to the convergence analysis of policy gradient methods for LQR and for LQR with multiplicative noise.