In [2]:

```
# Preamble: add paths and import relevant modules
import os
import sys
sys.path.append('/'.join(os.getcwd().split('/')[:-1]))
from importlib import reload  # reload() is not a builtin in Python 3
import util
reload(util)
import numpy as np
import sklearn
from sklearn import datasets
from scipy.sparse.linalg import eigsh
from numpy.random import permutation
%matplotlib inline
import matplotlib.pyplot as plt
```

We begin with `spectral clustering`, which first computes a spectral embedding of the original data and then applies `k-means` in the transformed domain. In image processing, the graph Laplacian is mainly used as a non-local smoothing term $X^T L X$ to boost performance on images with many self-similarities (e.g., texture-rich images).
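As a minimal, self-contained sketch of this pipeline (using scikit-learn directly rather than this tutorial's `util` package), the whole procedure can be run in a few lines:

```
# Spectral clustering on two moons with scikit-learn: build a k-NN affinity
# graph, embed the data spectrally, then run k-means on the embedding.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

X, y = make_moons(n_samples=300, noise=0.05, shuffle=False, random_state=0)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(X)

# Cluster labels may be permuted relative to y, so check agreement up to a swap.
acc = max(np.mean(labels == y), np.mean(labels != y))
print(acc)
```

With this low noise level the k-NN graph separates the two moons cleanly, which a linear method such as plain k-means on the raw coordinates cannot do.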

We use the `Two Moons` dataset for demonstration purposes; other datasets will be explored in future tutorials. The code below generates the Two Moons dataset with 800 sample points at three different noise levels.

In [4]:

```
# Two Moons dataset with different noise levels
n_samples = 800
x = np.zeros([n_samples, 2, 3])
y = np.zeros([n_samples, 3])
plt.figure(1)
for i in range(3):
    x[:, :, i], y[:, i] = sklearn.datasets.make_moons(n_samples=n_samples, noise=0.02 + .03 * (i + 1), shuffle=False)
    n_mid = int(n_samples / 2)
    color = 'r' * n_mid + 'b' * n_mid
    plt.subplot(1, 3, i + 1)
    plt.scatter(x[:, 0, i], x[:, 1, i], s=3, color=color)
    plt.axis('off')
    plt.title('TwoMoons, sigma = {sig}'.format(sig=0.02 + .03 * (i + 1)))
plt.tight_layout()
plt.rcParams['figure.figsize'] = (9, 5)
plt.show()
```

Common affinity choices include `0-1 nearest neighbors`, `rbf`, and `cosine`. Given an affinity matrix $W$, we can build the symmetric graph Laplacian $L_s = I - D^{-1/2} W D^{-1/2}$ and compute its smallest eigenvectors $V$. We then apply `K-means` in the transformed domain.
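To make the construction concrete, here is a short sketch in plain NumPy/SciPy (independent of the `util` package) that builds $L_s$ from an rbf affinity and extracts its smallest eigenvectors:

```
# Build the symmetric normalized Laplacian L_s = I - D^{-1/2} W D^{-1/2}
# from an rbf affinity and compute its smallest eigenpairs with eigsh.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=200, noise=0.05, shuffle=False, random_state=0)
W = rbf_kernel(X, gamma=5.0)                # dense rbf affinity matrix
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))   # D^{-1/2} stored as a vector
L_s = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

vals, V = eigsh(L_s, k=4, which='SM')       # four smallest eigenpairs
print(vals.min())                            # the smallest eigenvalue of L_s is ~0
```

The columns of `V` beyond the trivial first eigenvector provide the spectral embedding coordinates that the plots below visualize.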

In [8]:

```
# rbf nearest neighbor graph
plt.figure(2)
plt.rcParams['figure.figsize'] = (12, 14)
for i in range(3):
    for j in range(4):
        raw_data = x[:, :, i]
        graph = util.BuildGraph(Eig_solver='arpack', affinity='rbf', neighbor_type='knearest',
                                n_neighbors=15 + 13 * j, Neig=4, gamma=0.02 + .03 * (i + 1))
        graph.build_Laplacian(raw_data)
        V = graph.laplacian_matrix_['V']
        x_t = V[:, 1]
        y_t = V[:, 2]
        color = 'r' * n_mid + 'b' * n_mid
        plt.subplot(3, 4, i * 4 + j + 1)
        plt.scatter(x_t, y_t, s=3, color=color)
        plt.axis('off')
        plt.title('NN, K = {K}, Sig = {sigma}'.format(K=15 + 13 * j, sigma=0.02 + .03 * (i + 1)))
plt.show()
```

We try the fully connected `rbf` kernel next, which is also frequently used.

In [11]:

```
# rbf full graph
plt.figure(3)
plt.rcParams['figure.figsize'] = (12, 14)
for i in range(3):
    for j in range(4):
        raw_data = x[:, :, i]
        graph = util.BuildGraph(Eig_solver='arpack', affinity='rbf', neighbor_type='full',
                                Neig=5, gamma=.4 + .8 * j)
        graph.build_Laplacian(raw_data)
        V = graph.laplacian_matrix_['V']
        x_t = V[:, 1]
        y_t = V[:, 2]
        color = 'r' * n_mid + 'b' * n_mid
        plt.subplot(3, 4, i * 4 + j + 1)
        plt.scatter(x_t, y_t, s=3, color=color)
        plt.axis('off')
        plt.title('gamma = {g}, Sig = {sigma}'.format(g=.4 + .8 * j, sigma=0.02 + .03 * (i + 1)))
plt.show()
```

As the plots show, the fully connected `rbf` kernel is not well suited to segmenting this problem.

We will now demonstrate how to use the graph clustering package and show some results. We first initialize the classifier by specifying the scheme type, e.g. `GL_fidelity`, and relevant parameters such as `eps`, `eta`, and `dt`. Detailed documentation for the parameters can be found in the code of the `LaplacianClustering` class (in the `graph_cluster` folder). You can always reset the parameters by calling `set_params()`.

In [44]:

```
#initialization
import graph_cluster as gc
clf = gc.LaplacianClustering(scheme_type = 'GL_fidelity', eps = 1, eta = 2)
```

Next, we load the data via the `load_data` function. Ground truth is optional, but can be used to compute the misclassification rate and to generate random fidelity points.
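For intuition, "random fidelity" simply means revealing the ground-truth labels of a small random subset of points to the semi-supervised solver. The helper below is our own illustration of the idea, not the package's implementation:

```
# Hypothetical sketch of random fidelity selection (illustration only, not
# the graph_cluster package's code): pick a random percentage of points
# whose ground-truth labels are treated as known.
import numpy as np

def random_fidelity(ground_truth, percent, seed=None):
    """Return indices of randomly chosen points whose labels are 'known'."""
    rng = np.random.default_rng(seed)
    n = len(ground_truth)
    k = max(1, int(percent * n))
    return rng.choice(n, size=k, replace=False)

y_demo = np.repeat([0, 1], 400)                # an 800-point two-class labeling
fid = random_fidelity(y_demo, percent=0.1, seed=0)
print(len(fid))                                 # 10% of 800 points
```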

In [45]:

```
# load raw_data, ground_truth
x,y = sklearn.datasets.make_moons(n_samples=800, noise=0.02+.03*(2),shuffle = False)
clf.load_data(raw_data = x, ground_truth = y)
# generate 10% fidelity
clf.generate_random_fidelity(percent = .1)
```

We then construct the graph with the `build_graph()` function. The arguments are the parameters for constructing the graph Laplacian.

In [46]:

```
clf.build_graph(Eig_solver='arpack', affinity='rbf', neighbor_type='knearest',
                n_neighbors=20, Neig=4, gamma=0.02 + .03 * 2)
```

Finally, we classify the data by calling `fit_predict()`. The prediction results are stored in `clf.labels_`.

In [47]:

```
# Testing supervised learning
plt.figure(4)
plt.rcParams['figure.figsize'] = (12, 5)

def gen_colors(u):
    color = ""
    for l in u:
        if l > 0:
            color += 'r'
        else:
            color += 'b'
    return color

# classification results
for i in range(3):
    clf.generate_random_fidelity(percent=(i + 1) * .03)
    clf.fit_predict()
    plt.subplot(1, 3, i + 1)
    plt.scatter(x[:, 0], x[:, 1], s=3, color=gen_colors(clf.labels_))
    plt.title('Fidelity = {perc}, Err = {r}'.format(perc=(i + 1) * .03, r=clf.compute_error_rate()))
    plt.axis('off')
plt.show()
```

Next, we test `MBO fidelity` on the same problem. We use `set_parameters` to change the scheme type to 'MBO_fidelity'.

In [48]:

```
clf.set_parameters(scheme_type='MBO_fidelity')
```

Only the `scheme_type` is changed; all remaining parameters are kept the same. This allows for fast experimentation.

In [49]:

```
# Testing supervised learning
reload(gc)
plt.figure(5)
plt.rcParams['figure.figsize'] = (12, 5)

def gen_colors(u):
    color = ""
    for l in u:
        if l > 0:
            color += 'r'
        else:
            color += 'b'
    return color

# classification results
for i in range(3):
    clf.generate_random_fidelity(percent=(i + 1) * .03)
    clf.fit_predict()
    plt.subplot(1, 3, i + 1)
    plt.scatter(x[:, 0], x[:, 1], s=3, color=gen_colors(clf.labels_))
    plt.title('Fidelity = {perc}, Err = {err}'.format(perc=(i + 1) * .03, err=clf.compute_error_rate()))
    plt.axis('off')
plt.show()
```

Note that `MBO` has an inner loop count constant that needs to be tuned; too low a loop count degrades the prediction accuracy.
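To see where the inner loop count enters, here is a schematic of a single MBO step in our own notation (a sketch of the general diffuse-then-threshold idea, not the package's code): the diffusion time `dt` is split into `inner_loop` implicit substeps in the graph eigenbasis before thresholding.

```
# Schematic MBO step: diffuse the label indicator u with the graph Laplacian
# (diagonalized by eigenpairs (eigvals, V)) for time dt, split into
# inner_loop implicit substeps, then threshold back to {0, 1}.
import numpy as np

def mbo_step(u, eigvals, V, dt, inner_loop):
    a = V.T @ u                                        # eigenbasis coefficients
    for _ in range(inner_loop):
        a = a / (1.0 + (dt / inner_loop) * eigvals)    # implicit diffusion substep
        # (a fidelity forcing term would be added here in MBO_fidelity)
    return (V @ a > 0.5).astype(float)                 # threshold

# Trivial check: with a zero Laplacian spectrum, diffusion leaves u unchanged.
u0 = np.array([0.0, 1.0, 1.0])
out = mbo_step(u0, np.zeros(3), np.eye(3), dt=0.5, inner_loop=5)
print(out)
```

With too few substeps, each implicit step over-smooths before the threshold is applied, which is one way to see why a small inner loop count can hurt accuracy.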

Finally, we test the `three_moons` dataset, a toy dataset similar to the two moons example.

In [52]:

```
import data
reload(data)
plt.figure(6)
plt.rcParams['figure.figsize'] = (7, 5)
X, Y = data.three_moons(n_samples=900, noise=.03)
palet = 'rbg'
color = [palet[c] for c in Y.astype(int)]
plt.scatter(X[:, 0], X[:, 1], s=3, color=color)
plt.title('Three Moons')
plt.axis('off')
plt.show()
```

Since this is a three-class problem, we set `n_class = 3`. Again, we configure the classifier.

In [58]:

```
# initialization
import graph_cluster as gc
clf.set_parameters(scheme_type='MBO_fidelity', eps=1, eta=2, n_class=3)
clf.generate_initial_value()
# load raw_data and ground truth
clf.load_data(raw_data=X, ground_truth=Y)
# set graph parameters
clf.build_graph(Eig_solver='arpack', affinity='rbf',
                neighbor_type='knearest', n_neighbors=20, gamma=1, Neig=20)
```

We classify by calling `fit_predict`. The graph is a nearest neighbor graph with $20$ neighbors.

In [62]:

```
# Testing supervised learning
reload(gc)
plt.figure(7)
plt.rcParams.update({'figure.figsize': (12, 5)})
palet = 'rbg'
# classification results
for i in range(3):
    clf.generate_random_fidelity(percent=(i + 1) * .02)
    clf.fit_predict()
    plt.subplot(1, 3, i + 1)
    color = [palet[c] for c in clf.labels_]
    plt.scatter(X[:, 0], X[:, 1], s=3, color=color)
    plt.title('Fid = {perc}, Err = {err}'.format(perc=(i + 1) * .02, err=clf.compute_error_rate()))
    plt.axis('off')
plt.show()
```

Interestingly, the error is **not** monotonic with respect to the number of eigenvectors used (e.g., try 100 eigenvectors in this example). This shows that discarding eigenvectors is not just a computational convenience, but may actually improve performance!