Semi-Supervised Classification with Graph Convolutional Networks

2020-10-06

Citation

Kipf, Thomas and M. Welling. “Semi-Supervised Classification with Graph Convolutional Networks.” ArXiv abs/1609.02907 (2017): n. pag. [Paper Link]

Background

The paper is motivated by the promise of solving, for graphs, the kinds of problems normally tackled with semi-supervised approaches (e.g. datasets for which only a subset of the labels are known).

Laplacian Regularisation Terms, and why they can be bummers

One way to train a semi-supervised network on graphs is by tacking a “Laplacian Regularisation” term onto the loss function like so:

$$ \mathcal{L} = \mathcal{L_0} + \lambda\mathcal{L}_\textrm{reg} $$

where \(\mathcal{L}\) is the total loss, \(\mathcal{L}_0\) is called the “supervised loss” (or “the loss you normally would use”), and \(\mathcal{L}_\textrm{reg}\) is the Laplacian Regularisation term, whose contribution to the loss you can scale with the hyperparameter \(\lambda\). I’ll describe it more below.
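To make the shape of this concrete, here’s a minimal numpy sketch of how such a combined loss could be assembled for node classification, assuming `probs` holds per-node class probabilities, `labeled_idx` indexes the few nodes whose labels we know, and the regularisation term has been computed separately (the names and setup here are mine, not the paper’s):

```python
import numpy as np

def semi_supervised_loss(probs, labels, labeled_idx, l_reg, lam=0.1):
    """Total loss L = L_0 + lambda * L_reg: cross-entropy over the
    labelled nodes only, plus a weighted graph regularisation term."""
    eps = 1e-12  # avoid log(0)
    # Supervised part: only nodes whose labels we actually know contribute.
    l_0 = -np.mean(np.log(probs[labeled_idx, labels[labeled_idx]] + eps))
    return l_0 + lam * l_reg
```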

The Laplacian regularisation term relies on the assumption that “connected nodes in the graph are likely to share the same label”. Depending on what meaning you’ve assigned to your labels and edges, this may or may not be an assumption that holds. You may have different types of edges, some of which carry information relevant to your label, others not. You might also have complicated relationships between edges that aren’t as simple as the “guilt by association” relationship assumed by the Laplacian.

So what is this loss regularisation term? It’s a term \(\mathcal{L}_{\textrm{reg}}\) which is added to the loss \(\mathcal{L}_0\):

$$ \mathcal{L}_{\textrm{reg}} = \sum_{i,j} A_{i,j}||f(X_i) - f(X_j)||^2 = f(X)^T\Delta f(X) $$

Where \(A\) is the affinity matrix, \(f\) is a neural network, \(X\) is a matrix of node feature vectors, and finally \(\Delta\) is the unnormalised graph Laplacian. The what, now?

The unnormalised graph Laplacian is equal to \(\Delta = D - A\), where \(D\) is the degree matrix. I’ve seen the symmetric normalised cousin of the Laplacian before in the Ng, Jordan & Weiss paper on Spectral Clustering. In those notes, I mention that the Laplacian is useful for summarising the number and strength of connections to nodes and identifying blocks of connected nodes. These properties of the Laplacian are where the “similar connection, similar label” assumption comes from.
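To make the two forms of \(\mathcal{L}_{\textrm{reg}}\) concrete, here’s a small numpy toy example (my own sketch, not code from the paper) that builds \(\Delta = D - A\) for a four-node path graph and checks that the pairwise sum over edges, counting each undirected edge once, matches the quadratic form \(f(X)^T\Delta f(X)\):

```python
import numpy as np

# Toy graph: 4 nodes in a path (edges 0-1, 1-2, 2-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # unnormalised graph Laplacian (Delta)

f = np.array([0.1, 0.4, 0.5, 0.9])  # some scalar network output per node

# Pairwise form, counting each undirected edge once (i < j).
pairwise = sum(A[i, j] * (f[i] - f[j]) ** 2
               for i in range(len(f)) for j in range(i + 1, len(f)))

# Quadratic form: f^T Delta f.
quadratic = f @ L @ f

print(pairwise, quadratic)  # both 0.26 for this toy graph
```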

Ditching \(\mathcal{L}_{\textrm{reg}}\) for Spectral Graph Convolutions

Ok, so now that we’ve understood that the Laplacian Regularisation tool isn’t a panacea, what do we do about it?

The authors propose that we pass the affinity matrix \(A\) to the neural network \(f\) along with the node features \(X\) (\(f(X,A)\)), instead of just using \(X\) in the forward pass and accounting for \(A\) in the loss function.

What our GCN should look like

The authors invite us to consider the following propagation rule to be applied at each layer \(l\):

$$ H^{(l+1)} = \sigma\left(\tilde{D}^{-0.5}\tilde{A}\tilde{D}^{-0.5}H^{(l)}W^{(l)}\right) $$

Where

  • \(\sigma\) is a non-linear activation function (the paper uses ReLU)

  • \(\tilde{A}\) is the affinity matrix \(A\) with an identity matrix of the same size added (\(\tilde{A} = A + I\)). The identity matrix allows us to represent “self-connections”, i.e.: edges from each node to itself.

  • \(\tilde{D}\) is the degree matrix of \(\tilde{A}\), i.e.: \(\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}\)

  • \(H^{(l)}\) is the matrix of node activations at layer \(l\), with \(H^{(0)} = X\)

  • \(W^{(l)}\) are the hidden weights at layer \(l\)

This is the type of spectral graph convolution network the authors want to implement. You may recognise in the term \(\tilde{D}^{-0.5}\tilde{A}\tilde{D}^{-0.5}\) the same symmetric normalisation that appears in the normalised Laplacian mentioned above.

By moving the Laplacian from the regularisation term to the forward-propagation of the layers, we can account for graph topology during training, but ditch the assumption that connected nodes share the same label.
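Here’s a minimal numpy sketch of that propagation rule (my own illustration of the formula above, not the authors’ implementation), plus a tiny two-layer forward pass to show how the layers stack:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gcn_layer(A, H, W, activation=relu):
    """One propagation step: sigma(D~^(-1/2) A~ D~^(-1/2) H W)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # add self-connections
    d_tilde = A_tilde.sum(axis=1)                 # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalisation
    return activation(A_hat @ H @ W)

# Tiny two-layer forward pass on a three-node path graph.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 5))                             # node features
W0, W1 = rng.normal(size=(5, 4)), rng.normal(size=(4, 2))
H1 = gcn_layer(A, X, W0)
logits = gcn_layer(A, H1, W1, activation=lambda z: z)   # softmax would follow
```

Note that each layer mixes a node’s features with those of its immediate neighbours, so stacking \(k\) layers lets information travel \(k\) hops across the graph.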

Speeding things up

Exact spectral convolutions are really slow: they involve multiplying by the eigenvector matrix of the graph Laplacian, which is \(O(N^2)\) (on top of computing the eigendecomposition in the first place), and that cost is paid again at every hidden layer the model has. To make this practical, the authors use a computationally-efficient approximation. It depends on “a truncated expansion in terms of Chebyshev polynomials”, as described in this 2011 paper by Hammond et al.
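For flavour, here’s a rough numpy sketch of what a truncated Chebyshev filter looks like, using the recurrence \(T_k(y) = 2yT_{k-1}(y) - T_{k-2}(y)\) on a rescaled Laplacian whose eigenvalues lie in \([-1, 1]\) (my own sketch of the idea, not code from the paper or from Hammond et al.):

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max=2.0):
    """Approximate a spectral filter with a truncated Chebyshev expansion:
    g_theta * x ~= sum_k theta[k] * T_k(L_scaled) @ x.
    The recurrence T_k(y) = 2*y*T_{k-1}(y) - T_{k-2}(y) means we only ever
    multiply by the Laplacian (no eigendecomposition needed). Assumes
    len(theta) >= 2 and that L's eigenvalues lie in [0, lam_max]."""
    n = L.shape[0]
    L_scaled = 2.0 * L / lam_max - np.eye(n)  # rescale eigenvalues into [-1, 1]
    t_prev, t_curr = x, L_scaled @ x          # T_0(L)x = x, T_1(L)x = L_scaled x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * L_scaled @ t_curr - t_prev
        out = out + theta[k] * t_curr
    return out
```

The point of the truncation is that each term only needs a (sparse) multiplication by the Laplacian, so the cost scales with the number of edges rather than with an eigendecomposition, and the resulting filter only looks \(K\) hops away from each node.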

