Parallel and Distributed Sparse Optimization

Published at the Asilomar’13 Conference

Background

Modern datasets usually have a large number of features or training samples, and they are usually stored in a distributed manner. Motivated by the need to solve sparse optimization problems with large datasets, we propose two approaches: (i) distributed implementations of prox-linear algorithms and (ii) GRock, a parallel greedy coordinate descent method.
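For reference, below is a minimal serial sketch of one prox-linear (ISTA-type) iteration for LASSO. The array layout, function names, and single-threaded loops are illustrative assumptions, not excerpts from the released solvers; the distributed versions split A and the corresponding work across machines.

    #include <stdlib.h>

    /* Soft-thresholding: the proximal operator of t*|.|_1 (illustrative helper). */
    static double shrink(double v, double t) {
        if (v >  t) return v - t;
        if (v < -t) return v + t;
        return 0.0;
    }

    /* One prox-linear (ISTA) step for min_x lambda*||x||_1 + 0.5*||Ax - b||^2:
       x <- shrink(x - step * A'(Ax - b), step * lambda), with step <= 1/||A||_2^2.
       A is m-by-n, dense, row-major; everything here runs serially. */
    void prox_linear_step(const double *A, const double *b, double *x,
                          int m, int n, double lambda, double step) {
        double *r = malloc(m * sizeof(double));   /* residual Ax - b      */
        double *g = calloc(n, sizeof(double));    /* gradient A'(Ax - b)  */
        for (int i = 0; i < m; i++) {
            double s = -b[i];
            for (int j = 0; j < n; j++) s += A[i*n + j] * x[j];
            r[i] = s;
        }
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                g[j] += A[i*n + j] * r[i];
        for (int j = 0; j < n; j++)
            x[j] = shrink(x[j] - step * g[j], step * lambda);
        free(r);
        free(g);
    }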

Codes and demos

Three parallel C solvers for LASSO

More codes and applications will be uploaded in the coming weeks

Solving LASSO with a 170GB matrix on Amazon EC2

Test dataset

  • A: dense matrix with 100K rows and 200K columns, 20 billion entries in total, 170GB in size

  • x^o: 200K entries, 4K nonzeros (2 percent), Gaussian values

Requested systems on Amazon EC2

  • 20 “high-memory quadruple extra-large instances”

  • each instance has 8 cores and 60GB memory

LASSO

We tried to recover x^o by solving

\min_x\ \lambda\|x\|_1 + \frac{1}{2}\|Ax - b\|_2^2

in which lambda and b were set so that x^o is the exact solution.
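One standard way to do this (a sketch of the idea, not necessarily the exact construction used for this test) is to choose lambda and b so that the LASSO optimality condition holds at x^o:

\quad 0 \in \lambda\,\partial\|x^o\|_1 + A^T(Ax^o - b),

that is, \left(A^T(b - Ax^o)\right)_i = \lambda\,\mathrm{sign}(x^o_i) for every i with x^o_i \neq 0, and \left|\left(A^T(b - Ax^o)\right)_i\right| \le \lambda otherwise.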

Tested algorithms

  • p D-ADMM: parallelized dual ADMM, solving the dual formulation of LASSO

  • p FISTA: parallelized FISTA, solving the dual formulation of LASSO

  • GRock: parallelized greedy coordinate-block descent (a serial sketch of its greedy selection rule is given after the notes below)

  • p D-ADMM's performance depends on a penalty parameter beta. We picked the best beta out of only a few trials (we could not afford more), so its performance can likely be improved further.

  • The step size for p FISTA is set based on the spectral radius of A^T A (equivalently, the squared spectral norm of A).
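As referenced above, here is a serial, single-coordinate sketch of the greedy selection rule behind GRock. The parallel method selects and updates several coordinates at once across blocks; the version below only illustrates the selection rule, and its array layout and names are illustrative assumptions rather than the released solver.

    #include <math.h>
    #include <stddef.h>

    /* Soft-thresholding (illustrative helper, as above). */
    static double shrink(double v, double t) {
        return (v > t) ? v - t : (v < -t) ? v + t : 0.0;
    }

    /* One greedy coordinate-descent step for LASSO: compute each coordinate's
       exact 1-D minimizer, then update only the coordinate with the largest
       proposed change. A is m-by-n, dense, column-major (column j starts at
       A + j*m); the caller maintains r = Ax - b and Lj[j] = ||A(:,j)||^2. */
    void greedy_cd_step(const double *A, double *r, double *x,
                        const double *Lj, int m, int n, double lambda) {
        int best = -1;
        double best_d = 0.0;
        for (int j = 0; j < n; j++) {
            const double *aj = A + (size_t)j * m;
            double g = 0.0;                       /* g = a_j' * (Ax - b) */
            for (int i = 0; i < m; i++) g += aj[i] * r[i];
            double xj_new = shrink(x[j] - g / Lj[j], lambda / Lj[j]);
            double d = xj_new - x[j];
            if (fabs(d) > fabs(best_d)) { best_d = d; best = j; }
        }
        if (best >= 0 && best_d != 0.0) {
            const double *aj = A + (size_t)best * m;
            x[best] += best_d;                    /* apply the winning update */
            for (int i = 0; i < m; i++) r[i] += aj[i] * best_d;  /* keep r = Ax - b */
        }
    }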

Stopping criteria include:

  • maximum number of iterations: 2500

  • a relative error of 1E-5
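Since x^o is known in this synthetic test, one natural reading of the relative-error criterion (our interpretation; the exact definition is not spelled out above) is

\quad \frac{\|x^k - x^o\|_2}{\|x^o\|_2} \le 10^{-5},

where x^k is the iterate at iteration k.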

Sparse logistic regression on a cluster

Test dataset

System

  • Rice cluster STIC

  • We tested our codes on 1, 8, 16, 64, and 128 cores

Sparse logistic regression model

We tried to recover a hyperplane h(x)=w^Tx+c for classification with a sparse normal vector w. The model is

\min_{w,c}\ \lambda\|w\|_1 + \frac{1}{m}\sum_{i=1}^m \log\left(1 + \exp\left(-b_i(w^T a_i + c)\right)\right)

for given feature data points a_i and labels b_i, i = 1, \ldots, m.
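For concreteness, a serial sketch of one proximal-gradient step for this model is below; p FISTA adds Nesterov extrapolation on top of such a step and parallelizes the computation across cores. The layout and names are illustrative assumptions.

    #include <math.h>
    #include <stdlib.h>

    static double shrink(double v, double t) {
        return (v > t) ? v - t : (v < -t) ? v + t : 0.0;
    }

    /* One proximal-gradient step for sparse logistic regression:
       min_{w,c} lambda*||w||_1 + (1/m) sum_i log(1 + exp(-b_i (w'a_i + c))).
       a is m-by-n row-major (row i = feature vector a_i), b[i] in {-1,+1}.
       Only w is soft-thresholded; the intercept c takes a plain gradient step. */
    void logreg_prox_step(const double *a, const double *b, double *w, double *c,
                          int m, int n, double lambda, double step) {
        double *gw = calloc(n, sizeof(double));   /* unnormalized gradient w.r.t. w */
        double gc = 0.0;                          /* unnormalized gradient w.r.t. c */
        for (int i = 0; i < m; i++) {
            double z = *c;
            for (int j = 0; j < n; j++) z += w[j] * a[i*n + j];
            double s = -b[i] / (1.0 + exp(b[i] * z));   /* d/dz log(1+exp(-b_i z)) */
            for (int j = 0; j < n; j++) gw[j] += s * a[i*n + j];
            gc += s;
        }
        for (int j = 0; j < n; j++)
            w[j] = shrink(w[j] - (step / m) * gw[j], step * lambda);
        *c -= (step / m) * gc;
        free(gw);
    }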

Tested algorithms

  • p FISTA: our parallel implementation of FISTA, a prox-linear algorithm with Nesterov's acceleration by Beck and Teboulle.

  • GRock: parallelized greedy coordinate-block descent

  cores   p FISTA (seconds)   GRock (seconds)
  1       8.74                1.74
  8       2.24                0.48
  16      0.60                0.13
  64      0.24                0.05
  128     0.24                0.07

Citation

Z. Peng, M. Yan, and W. Yin. Parallel and Distributed Sparse Optimization. Asilomar Conference on Signals, Systems, and Computers, 2013.

