Proximal-Proximal-Gradient Method

Ernest K. Ryu and Wotao Yin

Submitted

Overview

In this paper, we present the proximal-proximal-gradient method (PPG), a novel optimization method that is simple to implement and simple to parallelize. PPG generalizes the proximal-gradient method and ADMM and is applicable to minimization problems written as a sum of many differentiable and many non-differentiable convex functions. The non-differentiable functions can be non-separable; that is, they can depend on the same set of variables.

We furthermore present a related stochastic variant, which we call Stochastic PPG (S-PPG). S-PPG can be interpreted as a generalization of Finito and MISO to sums of many coupled non-differentiable convex functions.

We present applications that can benefit from PPG and S-PPG and prove convergence for both methods. A key strength of PPG and S-PPG, compared to existing methods, is their ability to directly handle a large sum of non-differentiable, non-separable functions with a constant stepsize that is independent of the number of functions. Such a non-diminishing stepsize makes both methods fast.

Specifically, consider the optimization problem

 \mathrm{minimize}~ r(x) + \frac{1}{n}\sum_{i=1}^n\big(f_i(x)+g_i(x)\big),

where x is the d-dimensional optimization variable and r, f_1,\ldots, f_n, g_1,\ldots, g_n are proper closed convex functions. They can take extended values. Assume f_1,\ldots, f_n are differentiable. We call

 x^{k+1/2} = \mathbf{prox}_{\alpha r}\Big(\frac{1}{n}\sum_{i=1}^n z_i^k\Big)
 x_i^{k+1} = \mathbf{prox}_{\alpha g_i}\big(2x^{k+1/2} - z_i^k - \alpha \nabla f_i(x^{k+1/2})\big)
 z_i^{k+1} = z_i^k + x_i^{k+1} - x^{k+1/2}

the PPG method. In the standard version, the 2nd line (x_i^{k+1} update) and 3rd line (z_i^{k+1} update) are performed for all i=1,\ldots,n. In the stochastic version, S-PPG, each iteration randomly chooses an index i and performs the 2nd and 3rd lines only for that i. The 1st line is performed at every iteration of both versions.
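As an illustration, below is a minimal Python sketch of the PPG and S-PPG iterations described above. The names prox_r, prox_g, and grad_f are hypothetical user-supplied oracles (not from the paper): prox_r(v, alpha) evaluates \mathbf{prox}_{\alpha r}(v), prox_g[i](v, alpha) evaluates \mathbf{prox}_{\alpha g_i}(v), and grad_f[i](x) evaluates \nabla f_i(x).

    import numpy as np

    def ppg(prox_r, prox_g, grad_f, z0, alpha, iters, rng=None):
        # z0: (n, d) array holding the initial z_1, ..., z_n.
        # If rng is None, all indices are updated each iteration (PPG);
        # otherwise one random index is updated each iteration (S-PPG).
        z = np.array(z0, dtype=float)
        n = z.shape[0]
        z_bar = z.mean(axis=0)
        for _ in range(iters):
            x_half = prox_r(z_bar, alpha)                      # 1st line
            indices = range(n) if rng is None else [rng.integers(n)]
            for i in indices:
                v = 2 * x_half - z[i] - alpha * grad_f[i](x_half)
                x_i = prox_g[i](v, alpha)                      # 2nd line
                dz = x_i - x_half                              # 3rd line: z_i += x_i^{k+1} - x^{k+1/2}
                z[i] += dz
                z_bar += dz / n                                # keep z_bar equal to the average of the z_i
        return prox_r(z_bar, alpha)

For instance, passing rng=np.random.default_rng() runs S-PPG, while rng=None runs the deterministic PPG sweep over all i.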

PPG is well-suited for distributed computing. Consider the server-worker computing model with a server node and n worker nodes. Each worker i maintains its private copy of z_i, and the workers do not communicate with one another. The server node maintains \bar{z} equal to \frac{1}{n}\sum_{i=1}^n z_i. At the beginning of each iteration, the server broadcasts x^{k+1/2} = \mathbf{prox}_{\alpha r}(\bar{z}) to the workers. Then, each worker i computes the updates of x_i and z_i and sends only \Delta z_i (the difference between the old and new z_i) to the server. The server adds the received \Delta z_1,\ldots,\Delta z_n (scaled by 1/n) to \bar{z} to keep it up to date. This implements an iteration of PPG. As the server performs very simple tasks, it can handle the communication with many workers. Each worker computes the 2nd and 3rd lines for one i, and all of them can work simultaneously. Between the server and each worker, each iteration sends and receives only one d-vector. (This is both simpler and cheaper than applying SAGA.)
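A minimal sketch of one such server-worker round is below, reusing the hypothetical prox_r, prox_g, and grad_f oracles from the sketch above. It is a single-process simulation; in an actual deployment the worker calls would run in parallel on separate machines, and only the d-vectors x_half and delta would cross the network.

    def worker_step(i, x_half, z_i, alpha, prox_g, grad_f):
        # Runs on worker i: compute the 2nd and 3rd lines and return only the change in z_i.
        x_i = prox_g[i](2 * x_half - z_i - alpha * grad_f[i](x_half), alpha)
        return x_i - x_half               # the single d-vector sent back to the server

    def server_round(z, z_bar, alpha, prox_r, prox_g, grad_f):
        # Runs on the server: broadcast x_half, then fold the returned deltas into z_bar.
        x_half = prox_r(z_bar, alpha)     # 1st line; broadcast to all workers
        n = len(z)
        for i in range(n):                # in a real deployment these calls run concurrently
            delta = worker_step(i, x_half, z[i], alpha, prox_g, grad_f)
            z[i] += delta                 # in practice worker i keeps z_i locally
            z_bar += delta / n            # the server only tracks the running average
        return z_bar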

Assume a standard regularity condition and that the gradient of each f_i is L-Lipschitz. Then, both PPG and S-PPG provably converge with any constant stepsize 0 < \alpha < 3/(2L). We do not need strong convexity to establish convergence.

Citation

E. Ryu and W. Yin, Proximal-Proximal-Gradient Method. UCLA CAM Report 17-51, 2017.

