On Spectral Clustering: Analysis and an Algorithm

2020-10-06

Citation

Ng, A. Y., Jordan, M. I., & Weiss, Y. “On Spectral Clustering: Analysis and an algorithm.” NIPS (2001). [Paper Link]

tl;dr

Spectral clustering clusters samples based on their connectivity, as opposed to the Euclidean distance that \(k\)-means relies on.

To perform spectral clustering, you:

  1. Calculate the Affinity Matrix \(A\)
  2. Create the Laplacian Matrix of \(A\), \(L\)
  3. Make a matrix \(X\) whose columns are the \(k\) top eigenvectors of \(L\)
  4. Run \(k\)-means on a normalised \(X\), where the rows correspond to \(k\)-dimensional samples

Sample labels are determined by the labels assigned by \(k\)-means on the eigenvector matrix of the Laplacian (i.e. the normalised \(X\), known as \(Y\)).
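
If you just want to run it, here is a minimal end-to-end sketch using scikit-learn's built-in `SpectralClustering` estimator. The data and parameter values are illustrative choices of mine, not part of the paper; note that `gamma` corresponds to \(1/(2\sigma^2)\) in the paper's notation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 2))            # 100 points in 2-D

# affinity='rbf' builds A_ij = exp(-gamma * ||s_i - s_j||^2),
# i.e. gamma plays the role of 1 / (2 * sigma^2) in the paper's notation.
labels = SpectralClustering(n_clusters=3, affinity="rbf", gamma=1.0,
                            assign_labels="kmeans",
                            random_state=0).fit_predict(S)
print(labels.shape)                      # one cluster label per sample
```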

Background

This paper on spectral clustering by Ng, Jordan, and Weiss outlines a very practical algorithm. Spectral clustering relies on the spectrum of the affinity matrix of the points. Prior to this paper it wasn’t clear how to use spectral clustering to label points as belonging to one cluster or another, nor was it clear which eigenvectors to use.

At the time this paper was released, GMMs were one of the more powerful clustering methods with a decent (i.e. practical) fitting algorithm (EM).

GMMs assume each cluster is generated by a Gaussian distribution, and fitting them requires optimising a non-convex objective (i.e. no guarantee of converging to the global optimum).

What the authors achieve in this paper:

  • An algorithm for applying cluster labels after spectral analysis
  • Which eigenvectors from the spectrum of the affinity matrix to use

Spectral Clustering

Spectral clustering starts from a distance-based measure and turns it into a connectivity (linkage)-based one.

Step One: Calculate the Affinity Matrix \(A\)

Calculating the affinity matrix is as simple as applying a Gaussian kernel to the distance between each pair of points.

Formally,

$$ A_{ij} = \exp\left(-||s_i - s_j ||^2/2\sigma^2\right) $$

For points \(S = \{s_1,\ldots,s_n\}\) and \(i \neq j\), where \(\sigma\) is a scaling parameter controlling how quickly the affinity falls off with distance.

Also, \(A_{ii} = 0\) by definition.

In some ways, you can think of the affinity matrix as creating a graph from a cloud of points, where the weight of each edge reflects how close the two points are (near 1 for nearby points, near 0 for distant ones).
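
Here is a minimal NumPy sketch of this step (the `affinity_matrix` helper name and the choice of `sigma` are mine, purely for illustration):

```python
import numpy as np

def affinity_matrix(S, sigma=1.0):
    """Gaussian affinity: A_ij = exp(-||s_i - s_j||^2 / (2 * sigma^2)), with A_ii = 0."""
    diffs = S[:, None, :] - S[None, :, :]        # pairwise differences via broadcasting
    sq_dists = np.sum(diffs ** 2, axis=-1)       # squared Euclidean distances
    A = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                     # A_ii = 0 by definition
    return A

S = np.random.default_rng(0).normal(size=(5, 2))
A = affinity_matrix(S, sigma=1.0)
print(A.shape)                                   # (5, 5), symmetric
```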

Step Two: Calculate the Laplacian of \(A\), \(L\)

Calculate the Degree Matrix \(D\). The degree matrix is a diagonal matrix whose entries measure how strongly each node is connected to the rest of the graph. I say “node” because we’re moving forward with the graph analogy.

Formally, the Degree Matrix is calculated like so

$$ D_{ii} = \sum^n_{j=1} A_{ij} $$

The Laplacian matrix can now be calculated using \(D\). The Laplacian matrix has plenty of useful properties, the most relevant of which here is:

For a graph with multiple connected components, \(L\) is a block diagonal matrix, where each block is the respective Laplacian matrix for each component, possibly after reordering the vertices (i.e. \(L\) is permutation-similar to a block diagonal matrix).

~ Wikipedia

Formally, following the paper, we calculate the normalised symmetric Laplacian as follows (strictly speaking this is a normalised affinity matrix; it has the same eigenvectors as the conventional normalised Laplacian \(I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\), just with the order of the eigenvalues reversed):

$$ L = D^{-\frac{1}{2}}AD^{-\frac{1}{2}} $$
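
Continuing the NumPy sketch (the `laplacian` helper name is mine; it takes the affinity matrix \(A\) from the previous step):

```python
import numpy as np

def laplacian(A):
    """The paper's L = D^{-1/2} A D^{-1/2}, where D_ii = sum_j A_ij."""
    d = A.sum(axis=1)                     # degree of each node
    d_inv_sqrt = 1.0 / np.sqrt(d)         # the diagonal of D^{-1/2}
    # Equivalent to D^{-1/2} @ A @ D^{-1/2} without forming the diagonal matrices.
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```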

Step Three: Construct the Eigenvector Matrix \(X\)

Finally, we construct a matrix from the top \(k\) orthogonal eigenvectors of \(L\), i.e. those associated with the \(k\) largest eigenvalues. Orthogonal here means that \(x_i^T x_j = 0\) for \(i \neq j\) (and each eigenvector is scaled to unit length, so \(x_i^T x_i = 1\)). The eigenvectors form the columns of the matrix \(X\).

The way I understand this to work is that the eigenvectors take advantage of the “block diagonal” property: in the idealised case where the graph splits into \(k\) disconnected components, the top \(k\) eigenvectors essentially indicate which block (i.e. which component) each point belongs to.

Illustration of the block diagonal properties of the Laplacian and the resulting eigenvectors. Image courtesy of Charles H. Martin, Ph.D.

Formally, the eigenvector matrix \(X\) is,

$$ X = [x_1,\ldots,x_k] \in \mathbb{R}^{n \times k} $$

where \(x_1,\ldots,x_k\) are the top \(k\) orthogonal eigenvectors of \(L\).
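
In code, since \(L\) is symmetric we can use `numpy.linalg.eigh`, which returns eigenvalues in ascending order, so the top \(k\) eigenvectors are the last \(k\) columns (again just a sketch, with a hypothetical helper name):

```python
import numpy as np

def top_k_eigenvectors(L, k):
    """Return X whose columns are the k eigenvectors of L with the largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh: ascending eigenvalues, orthonormal eigenvectors
    return eigvecs[:, -k:]                # last k columns -> top k eigenvectors, shape (n, k)
```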

Step Four: Assigning Points to Clusters

First, normalise each row of \(X\) to unit length like so

$$ Y_{ij} = \dfrac{X_{ij}}{\left(\sum_j X^2_{ij}\right)^{1/2}} $$

You can then treat each row of \(Y\) as a point in \(\mathbb{R}^k\) and use \(k\)-means clustering to assign each of these points to one of \(k\) clusters.

Ok, here is the cool part where we assign clusters to points:

Assign the original point \(s_i\) to cluster \(z\) iff row \(i\) of \(Y\) was assigned to cluster \(z\) via \(k\)-means.

Put another way, we treat the rows of \(Y\) as \(k\)-dimensional vectors, cluster them with \(k\)-means, and then carry each row’s cluster label back to the corresponding original point.
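
Here is a sketch of this final step using scikit-learn's `KMeans` (the `assign_clusters` helper name is hypothetical; `X` is the eigenvector matrix from the previous step):

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_clusters(X, k):
    """Row-normalise X to get Y, then cluster the rows of Y with k-means."""
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)   # each row of Y has unit length
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
    return labels                                      # labels[i] is the cluster of point s_i
```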

Wait! Is spectral clustering just \(k\)-means clustering with a cooler name? It might seem like it, but it’s very different, as illustrated in the example below:

Comparison of \(k\)-means and spectral clustering on the same datasets. Adapted from the scikit-learn docs (Pedregosa et al., 2011).
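
To make the difference concrete, here is a small sketch on the classic two-concentric-rings example, similar in spirit to the scikit-learn comparison above. It reuses the hypothetical helpers sketched in the steps earlier, and the parameter values are arbitrary choices of mine.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings: each ring is one well-connected cluster,
# but the rings are not separable by distance to a centroid.
S, _ = make_circles(n_samples=300, factor=0.4, noise=0.04, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(S)

# The spectral pipeline sketched in the steps above.
A = affinity_matrix(S, sigma=0.2)
L = laplacian(A)
X = top_k_eigenvectors(L, k=2)
spectral_labels = assign_clusters(X, k=2)

# Typically, k-means cuts the rings in half by position,
# while the spectral labels follow the rings themselves.
```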

Conclusion

The idea here is that, while \(k\)-means clusters points according to their Euclidean distance, we instead create a matrix which represents the connectivity of the nodes (i.e. the matrix composed of the eigenvectors of the Laplacian matrix \(L\)), and run \(k\)-means on that; effectively clustering by the connectivity of points.

So, when might you not use spectral clustering? It is considerably slower than plain \(k\)-means: you have to build an \(n \times n\) affinity matrix and compute its eigenvectors, which in general costs \(O(n^3)\), so it scales poorly as the number of samples grows.

Tags: Papers, machine learning, clustering, unsupervised, spectral, Michael Irwin Jordan, Andrew Ng
