## Proximal-Proximal-Gradient Method

Ernest K. Ryu and Wotao Yin

Submitted

## Overview

In this paper, we present the proximal-proximal-gradient method (PPG), a novel optimization method that is simple to implement and simple to parallelize. PPG generalizes the proximal-gradient method and ADMM and is applicable to minimization problems written as a sum of many differentiable and many non-differentiable convex functions. The non-differentiable functions can be non-separable; that is, they can depend on the same set of variables. We furthermore present a related stochastic variation, which we call Stochastic PPG (S-PPG). S-PPG can be interpreted as a generalization of Finito and MISO to sums of many coupled non-differentiable convex functions. We present applications that can benefit from PPG and S-PPG and prove convergence for both methods. A key strength of PPG and S-PPG, compared to existing methods, is their ability to directly handle a large sum of non-differentiable non-separable functions with a constant stepsize that is independent of the number of functions. Such a non-diminishing stepsize allows them to be fast.

Specifically, consider the optimization problem

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad r(x) + \frac{1}{n}\sum_{i=1}^{n}\left(f_i(x) + g_i(x)\right),$$

where $x$ is the $d$-dimensional optimization variable and $r, f_1, \dots, f_n, g_1, \dots, g_n$ are proper closed convex functions. They can take extended values. Assume $f_1, \dots, f_n$ are differentiable. We call the method

$$x^{k+1/2} = \mathbf{prox}_{\alpha r}\left(\bar{z}^k\right)$$
$$x_i^{k+1} = \mathbf{prox}_{\alpha g_i}\left(2x^{k+1/2} - z_i^k - \alpha \nabla f_i(x^{k+1/2})\right)$$
$$z_i^{k+1} = z_i^k + x_i^{k+1} - x^{k+1/2},$$

where $\bar{z}^k = \frac{1}{n}\sum_{i=1}^{n} z_i^k$, the PPG method. In the standard version, the 2nd line (the $x_i$ update) and 3rd line (the $z_i$ update) are performed for all $i = 1, \dots, n$. In the stochastic version, S-PPG, each iteration randomly chooses an index $i$ and performs the 2nd and 3rd lines only for that $i$. The 1st line is performed at every iteration of both versions.

PPG is well-suited for distributed computing. Consider the server-worker computing model with a server node and $n$ worker nodes. Each worker $i$ maintains its private copy of $z_i$, and the workers do not communicate with one another. The server node maintains $\bar{z}$, equal to the average of the $z_i$. At the beginning of each iteration, the server broadcasts $\bar{z}^k$ to the workers.
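To make the full (deterministic) iteration concrete, here is a minimal serial sketch in Python. This is our illustration, not the authors' code; the function names (`ppg`, `soft_threshold`) and the toy scalar lasso-style problem are our own choices, and the per-index proxes and gradients are passed in as callables.

```python
def soft_threshold(v, t):
    # prox of t * |.| at v: shrink v toward zero by t
    return max(v - t, 0.0) + min(v + t, 0.0)

def ppg(prox_r, grad_fs, prox_gs, alpha, x0, iters):
    """Serial sketch of PPG on a scalar variable.

    prox_r(v, a)     -> prox of a*r at v
    grad_fs[i](x)    -> gradient of f_i at x
    prox_gs[i](v, a) -> prox of a*g_i at v
    """
    n = len(grad_fs)
    z = [x0] * n                       # each z_i starts at x0
    x_half = x0
    for _ in range(iters):
        z_bar = sum(z) / n             # average of the z_i
        x_half = prox_r(z_bar, alpha)  # 1st line: x^{k+1/2}
        for i in range(n):             # 2nd and 3rd lines, for every i
            x_i = prox_gs[i](2 * x_half - z[i]
                             - alpha * grad_fs[i](x_half), alpha)
            z[i] = z[i] + x_i - x_half
    return x_half

# Toy problem: minimize |x| + (1/3) * sum_i (1/2)(x - b_i)^2 with g_i = 0.
# The minimizer is soft_threshold(mean(b), 1) = 2.
b = [1.0, 2.0, 6.0]
grad_fs = [lambda x, bi=bi: x - bi for bi in b]   # grad of (1/2)(x - b_i)^2
prox_gs = [lambda v, a: v] * len(b)               # prox of g_i = 0 is the identity
x = ppg(soft_threshold, grad_fs, prox_gs, alpha=0.5, x0=0.0, iters=100)
```

On this toy instance each $f_i$ has a 1-Lipschitz gradient, so a constant step size such as `alpha=0.5` works and the iterates converge to the minimizer $x^\star = 2$.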
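The stochastic variant differs only in sampling a single index per iteration. The sketch below (again our own illustration, with a hypothetical name `s_ppg`) also maintains the average $\bar{z}$ incrementally, so each iteration touches only one $z_i$ and updates $\bar{z}$ with a single correction term.

```python
import random

def soft_threshold(v, t):
    # prox of t * |.| at v
    return max(v - t, 0.0) + min(v + t, 0.0)

def s_ppg(prox_r, grad_fs, prox_gs, alpha, x0, iters, seed=0):
    """Sketch of S-PPG: one randomly chosen index per iteration."""
    rng = random.Random(seed)
    n = len(grad_fs)
    z = [x0] * n
    z_bar = x0                          # running average of the z_i
    x_half = x0
    for _ in range(iters):
        x_half = prox_r(z_bar, alpha)   # 1st line, every iteration
        i = rng.randrange(n)            # pick a single index at random
        x_i = prox_gs[i](2 * x_half - z[i]
                         - alpha * grad_fs[i](x_half), alpha)  # 2nd line
        delta = x_i - x_half            # change in z_i (3rd line)
        z[i] += delta
        z_bar += delta / n              # keep the average current in place
    return x_half

# Same toy problem as a sanity check; the minimizer is 2.
b = [1.0, 2.0, 6.0]
x = s_ppg(soft_threshold,
          [lambda x, bi=bi: x - bi for bi in b],
          [lambda v, a: v] * len(b),
          alpha=0.5, x0=0.0, iters=2000)
```

The quantity `delta` is exactly the difference a worker would ship to the server in the distributed setting, which is why one $d$-vector per worker per iteration suffices.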
Then, each worker $i$ computes the updates of $x_i$ and $z_i$, and sends only $\Delta z_i = z_i^{k+1} - z_i^k$ (the difference between the old and new $z_i$) to the server. The server adds the received $\Delta z_i$, divided by $n$, to $\bar{z}$ to keep it up to date. This implements an iteration of PPG. As the server performs only very simple tasks, it can handle the communication with many workers. Each worker computes the 2nd and 3rd lines for one $i$, and all workers can run simultaneously. Between the server and each worker, each iteration sends and receives only one $d$-vector. (This is both simpler and cheaper than applying SAGA.)

Assume a standard regularity condition and that the gradient of each $f_i$ is $L$-Lipschitz. Then both PPG and S-PPG provably converge under a constant step size of order $1/L$. We do not need strong convexity to establish convergence.

## Citation