Relational Inductive Biases, Deep Learning, and Graph Networks

2021-01-25

Citation

Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gülçehre, Ç., Song, H., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K.R., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., & Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks. ArXiv, abs/1806.01261. [Paper Link]

Introduction

The paper begins by arguing that "combinatorial generalization" is an exceptionally important aspect of human cognition. More convincingly (personally speaking), the authors argue that an "integrative approach" that balances structure and flexibility is the best way to achieve it, using the balance of nature and nurture as an analogy. On a personal note, while I put a high value on intuition in these matters, such analogies with animal biology always make me nervous.

Graphs are particularly well suited to such a balanced approach because they are highly structured, and our more "traditional" approaches struggle with that structure. Such "traditional" approaches are characterised by high flexibility (to use the language of the paper), in that features are largely or entirely learned, with little to no structural guidance or "relational reasoning".

Relational Reasoning

"Relational reasoning", you say? Why, there's a whole box dedicated to it in the paper. Sufficed to say, relational reasoning involves logic over the structure of data. An entity, which may have attributes, has relations with other entities which may compare their attributes (\(x\) is heavier than \(y\)). You might query relationships with rules.

These definitions aren't critical for this paper, but they do position your thinking for what comes next.

Inductive Biases

There's one more concept to introduce before we can get to the Title Drop, and that's inductive biases. In a nutshell, deep learning models search a large space of solutions, and that space likely contains multiple solutions that fit the training data equally well; an inductive bias is whatever leads the model to prefer one of them over the others. It's overwhelmingly easier to forgo being a bean dad to your network and simply present it with the information/priors you already have.

This can take the form of regularization or of architectural decisions, and it usually comes at the cost of "flexibility".

Inductive biases can reflect assumptions about either the space the data comes from or the space of possible solutions. E.g.: L2 norms reflect assumptions about the solution space.
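
As a toy illustration of a solution-space assumption, here is a short NumPy sketch of ridge regression, where the L2 penalty biases the fit towards small-weight solutions; the data and the penalty strength \(\lambda = 1\) are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                      # toy design matrix
w_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=20)

# Plain least squares: any weights that fit the data are equally acceptable.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression: the L2 penalty expresses a preference (an inductive bias)
# for small-norm solutions, independent of the observed data.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # the ridge norm is smaller
```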

Relational Inductive Biases

Relational inductive biases (it's the title drop!) are informally defined in this paper as "inductive biases which impose constraints on relationships and interactions among entities in a learning process".

How does one determine what relational inductive biases a neural network component uses? The authors suggest asking three questions:

  1. Which entities and relationships are used as inputs?
  2. Across what dimension is the rule function reused? (e.g. temporal for RNN and spatial for CNN, etc…)
  3. What does the architecture treat as interacting vs. isolated? (e.g. What entity relations are strictly preserved, which entity relations are ignored.)

Relational Inductive Biases in Standard Deep Learning Components

If that's not clear on its own, we can consider how common neural network components use relational inductive biases in a handy table prepared by the authors.

Table 1 from Battaglia et al., Copyright 2018 Battaglia, P.W. et al.

Fully Connected Layers

A "dense" or "linear" layer has an "all-to-all" relationship with all the nodes, and such the relational inductive bias is very weak since we’re making as few assumptions as possible w.r.t the relationship between entities.

Convolutional Layers

Convolutional layers look at the pixels within a local neighborhood and apply the relational inductive bias that nearby entities are related to one another, a relationship that weakens with distance (locality), with the same kernel reused across positions (translation invariance).

Recurrent Layers

Recurrent networks have time as their relational inductive bias, accounting for the "locality in the sequence via their Markovian structure".

Graph Networks

There aren’t any standard network architectures for graph networks like there are for sequences or images.

This paper presents a "framework" for thinking about graph networks that generalises common architectures like message-passing neural networks and non-local neural networks.

Graph network (GN) block

GN blocks here simply provide a set of update functions \(\phi\) and a set of aggregation functions \(\rho\) that operate at the edge, node, and global levels.

Concretely:

$$ e'_k = \phi^e\left(e_k, v_{r_k}, v_{s_k}, u\right) $$
$$ \bar{e}'_i = \rho^{e\to v}\left(E'_i\right) $$
$$ v'_i = \phi^v\left(\bar{e}'_i, v_i, u\right) $$
$$ \bar{e}' = \rho^{e\to u}\left(E'\right) $$
$$ \bar{v}' = \rho^{v\to u}\left(V'\right) $$
$$ u' = \phi^u\left(\bar{e}', \bar{v}', u\right) $$
There are a lot of definitions here: \(e_k\) are edge attributes, \(v_i\) are vertex (node) attributes, and \(u\) is the global attribute. \(r_k\) and \(s_k\) are the receiving and sending vertices of edge \(k\), respectively. \(E'_i = \{(e'_k, r_k, s_k)\}_{r_k = i}\), \(V' = \{v'_i\}_{i=1:N^v}\), and \(E' = \bigcup_i E'_i\).

The \(\phi\) functions compute the per-edge, per-node, and global updates.

The \(\rho\) functions compute aggregated representations of information across edges and nodes. Good \(\rho\) functions are insensitive to permutations of their inputs; summation, mean, and maximum are typical choices.
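
To make the permutation point concrete, here is a small NumPy sketch of such an aggregation over the edges arriving at a single node; the function name rho and the toy attribute values are mine, not the paper's.

```python
import numpy as np

def rho(edge_attrs, reduce="sum"):
    """Aggregate a set of per-edge attribute vectors into a single vector.

    Any permutation-invariant reduction works; summation, mean, and maximum
    are the choices mentioned in the paper.
    """
    ops = {"sum": np.sum, "mean": np.mean, "max": np.max}
    return ops[reduce](edge_attrs, axis=0)

edges_into_node = np.array([[1.0, 2.0], [3.0, 0.0], [0.5, 0.5]])
shuffled = edges_into_node[[2, 0, 1]]             # same edges, different order

assert np.allclose(rho(edges_into_node), rho(shuffled))  # order is irrelevant
```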

Updating GN Blocks

The algorithm is as follows (a sketch in code follows the list):

  1. Update edges with \(\phi^e\)
  2. Aggregate edge updates that project to vertex \(v_i\) with \(\bar{e}'_i = \rho^{e\to v}\)
  3. Apply \(\phi^v\) to each node
  4. Aggregate all edge updates for the global update with \(\bar{e}' = \rho^{e\to u}\)
  5. Aggregate all node updates for the global update with \(\bar{v}' = \rho^{v\to u}\)
  6. Apply \(\phi^u\) for the global update
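
To make the six steps concrete, here is a minimal, self-contained NumPy sketch of one GN block pass. Everything below is my own illustration under simple assumptions (every \(\rho\) is a summation, every \(\phi\) is a linear map followed by \(\tanh\)); it is not the paper's reference implementation.

```python
import numpy as np

def gn_block(V, E, senders, receivers, u, params):
    """One full GN block pass (illustrative sketch, not the paper's code).

    V: (N_v, d_v) node attributes; E: (N_e, d_e) edge attributes;
    senders/receivers: (N_e,) integer arrays indexing rows of V;
    u: (d_u,) global attribute; params: dict of weight matrices for the
    (assumed linear + tanh) phi functions.
    """
    N_v, N_e = V.shape[0], E.shape[0]

    # 1. phi^e: update each edge from (e_k, v_{r_k}, v_{s_k}, u).
    edge_in = np.concatenate([E, V[receivers], V[senders],
                              np.tile(u, (N_e, 1))], axis=1)
    E_new = np.tanh(edge_in @ params["W_e"])

    # 2. rho^{e->v}: sum the updated edges arriving at each node (ē'_i).
    e_agg_i = np.zeros((N_v, E_new.shape[1]))
    np.add.at(e_agg_i, receivers, E_new)

    # 3. phi^v: update each node from (ē'_i, v_i, u).
    node_in = np.concatenate([e_agg_i, V, np.tile(u, (N_v, 1))], axis=1)
    V_new = np.tanh(node_in @ params["W_v"])

    # 4. rho^{e->u}: aggregate all edge updates for the global update (ē').
    e_agg = E_new.sum(axis=0)
    # 5. rho^{v->u}: aggregate all node updates for the global update (v̄').
    v_agg = V_new.sum(axis=0)

    # 6. phi^u: update the global attribute from (ē', v̄', u).
    u_new = np.tanh(np.concatenate([e_agg, v_agg, u]) @ params["W_u"])

    return V_new, E_new, u_new
```

If, say, the node, edge, and global attributes all have size \(d\) and the updated attributes keep that size, params would hold W_e of shape \((4d, d)\), W_v of shape \((3d, d)\), and W_u of shape \((3d, d)\).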

Relational inductive biases of the GN block

The authors argue that this GN block is Awfully Nifty (TM) because it imposes relational inductive biases in three ways:

  1. Graphs can be arbitrarily connected. Any configuration of nodes can be isolated or made to interact.
  2. Relationships between entities are sets, which are invariant to permutations (thanks to the way \(\rho\) works; see the check below)
  3. Per-edge and per-node functions are re-used across all edges and nodes
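
Point 2 is easy to check against the gn_block sketch above (assuming it is in scope): consistently permuting the edge list and its sender/receiver arrays leaves the node and global outputs unchanged, because each \(\rho\) is a permutation-invariant sum. The toy graph and weights below are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
V = rng.normal(size=(3, d))                       # 3 nodes
E = rng.normal(size=(5, d))                       # 5 edges
senders = np.array([0, 0, 1, 2, 2])
receivers = np.array([1, 2, 2, 0, 1])
u = rng.normal(size=d)
params = {"W_e": rng.normal(size=(4 * d, d)),
          "W_v": rng.normal(size=(3 * d, d)),
          "W_u": rng.normal(size=(3 * d, d))}

perm = rng.permutation(len(senders))              # reorder the edge set
V1, E1, u1 = gn_block(V, E, senders, receivers, u, params)
V2, E2, u2 = gn_block(V, E[perm], senders[perm], receivers[perm], u, params)

assert np.allclose(V1, V2) and np.allclose(u1, u2)  # same nodes and global
assert np.allclose(E1[perm], E2)                     # same edges, reordered
```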

Design principles for graph network architectures

Flexibility

There’s a great deal of flexibility when applying GNs.

First, GN blocks can output changes in edges, nodes, and/or graph globals.

Second, graph inputs can either come with all the edges they will ever have, or the interactions among entities can themselves be left to be inferred/learned.

Designing within-block architectures

It's up to the architect of a given GNN as to what gets computed at each \(\phi\). The edge, node, and global updates need not take all the inputs described above. Figure 4 in the paper describes this far better.

Figure 4 from Battaglia et al., Copyright 2018 Battaglia, P.W. et al.
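
As one toy illustration of this within-block flexibility, the node update from the earlier sketch could simply drop the global attribute, along the lines of the reduced blocks in Figure 4; this variant is my own example, not one the paper prescribes.

```python
import numpy as np

# phi^v variant that ignores the global attribute u (cf. the reduced blocks
# in Fig. 4): the node update only sees its aggregated incoming edges and
# its own previous attributes.
def phi_v_no_global(e_agg_i, V, W_v):
    return np.tanh(np.concatenate([e_agg_i, V], axis=1) @ W_v)
```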

Message-passing neural networks (MPNNs)

Now that the authors have established this nifty framework, we can go ahead and retrofit popular GNN architectures to it. This allows us, among other things, to better reason about the relational inductive biases imposed therein.

Message passing neural networks were described by Gilmer et al. in 2017¹, and a lot of papers have since built off of them.

MPNNs fit the GN block in Fig. 4c best. Some notes about MPNN/GN terminology that the authors outline (a small sketch follows the list):

  1. MPNN’s equivalent of \(\phi^e\) (\(M_t\)) does not take the global graph attribute \(u\) as an input.
  2. The aggregation function \(\rho^{e \to v}\) is elementwise summation.
  3. The update function \(U_t\) is analogous to \(\phi^v\).
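
In terms of the earlier sketch, those three points would amount to something like the following; the function and weight names (M_t, U_t, W_m, W_u_t) are my own shorthand for the mapping, not code from Gilmer et al.

```python
import numpy as np

# MPNN pieces in GN terms (sketch): M_t plays the role of phi^e but never
# sees the global attribute u; rho^{e->v} is an elementwise sum; U_t plays
# the role of phi^v.
def M_t(E, V_recv, V_send, W_m):                  # message function
    return np.tanh(np.concatenate([E, V_recv, V_send], axis=1) @ W_m)

def U_t(m_agg, V, W_u_t):                         # node-update function
    return np.tanh(np.concatenate([m_agg, V], axis=1) @ W_u_t)
```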

All in all, I personally find the GN notation more intuitive than the message-passing terminology / protocol analogy.

Composing GN blocks

Just like you might with convolutional blocks or RNNs when dealing with images or sequences, you can compose GN blocks into common general architectures like encoders/decoders, recurrent networks, or just chaining them.

There is pretty much nothing novel to add/note here, since GNs very much just behave like most NN blocks.
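
As a tiny example of composition (chaining, in this case), one could feed each block's output graph into the next block, reusing the gn_block sketch from earlier; whether to share parameters across blocks is a design choice, not something the framework fixes.

```python
def gn_chain(V, E, senders, receivers, u, params_list):
    """Chain GN blocks: each block's output graph is the next block's input.

    Passing the same params dict several times shares weights across blocks
    (a recurrent-style core); distinct dicts give an unshared stack.
    """
    for params in params_list:
        V, E, u = gn_block(V, E, senders, receivers, u, params)
    return V, E, u
```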

Implementing GNs

DeepMind have released a TensorFlow/Sonnet library for building these graph networks in the terms described here.

A number of demos are also supplied, e.g. finding the shortest path between two nodes in a graph and a mass-spring physics demo.


  1. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., & Dahl, G.E. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. ↩︎
