Citation
Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gülçehre, Ç., Song, H., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K.R., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., & Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks. ArXiv, abs/1806.01261. [Paper Link]
Introduction
The paper begins by arguing that "combinatorial generalization" is an exceptionally important aspect of human cognition. More convincingly (personally speaking), the authors argue that an "integrative approach" that balances the concepts of structure and flexibility is the best way to achieve it, using the balance of nature and nurture as an analogy. On a personal note, while I put a high value on intuition in these matters, such analogies with animal biology always make me nervous.
Graphs are particularly well suited for such a balanced approach because they are highly structured, and our more "traditional" approaches struggle with structured data. Such "traditional" approaches are characterised by high flexibility (to use the language of the paper), in that features are entirely learned, with little to no structural guidance or "relational reasoning".
Relational Reasoning
"Relational reasoning", you say? Why, there's a whole box dedicated to it in the paper. Suffice it to say, relational reasoning involves logic over the structure of data. An entity, which may have attributes, has relations with other entities, which may compare their attributes (\(x\) is heavier than \(y\)). You might query relationships with rules.
These definitions aren't critical for this paper, but they position your thinking for what comes next.
Inductive Biases
There's one more concept to introduce before we can get to the Title Drop, and that's inductive biases. In a nutshell, deep learning models search a large space for their answer, and that space likely has multiple solutions that work equally well. It's overwhelmingly easier to forgo being a bean dad to your network and present it with the information/priors you have.
This can take the form of regularization or it can take the form of architecture decisions. This usually comes at the cost of "flexibility".
Inductive biases can reflect assumptions about either the space to which the data belongs, or the space of the solutions. E.g.: L2 norms reflect assumptions about the solution space (solutions with small weights are preferred).
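To make the L2 example concrete, here's a minimal sketch (with made-up data and a hypothetical `l2_penalized_loss` helper) of an L2 penalty encoding a prior over the solution space:

```python
import numpy as np

def l2_penalized_loss(y_true, y_pred, weights, lam=0.1):
    """Squared error plus an L2 penalty on the weights.

    The L2 term encodes an inductive bias about the *solution* space:
    among solutions that fit the data equally well, we prefer the one
    whose weights have the smallest norm.
    """
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(weights ** 2)
    return mse + penalty

# Two hypothetical weight vectors that fit the data equally well:
y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 2.0])
small = np.array([0.5, 0.5])
large = np.array([5.0, 5.0])

# The penalty breaks the tie in favour of the smaller weights.
assert l2_penalized_loss(y_true, y_pred, small) < l2_penalized_loss(y_true, y_pred, large)
```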
Relational Inductive Biases
Relational inductive biases (it's the title drop!) are informally defined in this paper as "inductive biases which impose constraints on relationships and interactions among entities in a learning process".
How does one determine which relational inductive biases a neural network component uses? The authors provide three guiding questions:
 Which entities and relationships are used as inputs?
 Across what dimension is the rule function reused? (e.g. temporal for RNN and spatial for CNN, etc…)
 What does the architecture treat as interacting vs. isolated? (e.g. What entity relations are strictly preserved, which entity relations are ignored.)
Relational Inductive Biases in Standard Deep Learning Components
If that's not clear on its own, we can consider how common neural network components use relational inductive biases, summarised in a handy table prepared by the authors.
Fully Connected Layers
A "dense" or "linear" layer has an "all-to-all" relationship between its units, and as such its relational inductive bias is very weak: we're making as few assumptions as possible w.r.t. the relationships between entities.
Convolutional Layers
Convolutional layers look at the pixels within a neighborhood, applying the relational inductive bias of locality: nearby pixels are relevant to one another, and that relationship weakens with distance.
Recurrent Layers
Recurrent networks have time as their relational inductive bias, accounting for "locality in the sequence" via their Markovian structure.
Graph Networks
There aren’t any standard network architectures for graph networks like there are for sequences or images.
This paper presents a "framework" for thinking about graph networks that generalises common architectures like message-passing neural networks and non-local neural networks.
Graph network (GN) block
GN blocks simply provide a set of update functions \(\phi\) and a set of aggregation functions \(\rho\) that operate at the edge, node, and global levels.
Concretely:
$$ e'_k = \phi^e\left(e_k, v_{r_k}, v_{s_k}, u\right) $$
$$ \bar{e}'_i = \rho^{e\to v}\left(E'_i\right) $$
$$ v'_i = \phi^v\left(\bar{e}'_i, v_i, u\right) $$
$$ \bar{e}' = \rho^{e\to u}\left(E'\right) $$
$$ \bar{v}' = \rho^{v\to u}\left(V'\right) $$
$$ u' = \phi^u\left(\bar{e}', \bar{v}', u\right) $$
There are a lot of definitions here: \(e\) are edges, \(v\) are vertices, and \(u\) are global attributes. \(r_k\) and \(s_k\) are the receiving and sending vertices of edge \(k\), respectively. \(E'_i = \{(e'_k, r_k, s_k)\}_{r_k=i}\), \(V' = \{v'_i\}_{i=1:N^v}\), and \(E' = \bigcup_i E'_i\). The \(\phi\) functions compute per-edge, per-node, and global updates.
The \(\rho\) functions compute aggregated representations of information across edges and nodes. Good \(\rho\) functions are insensitive to permutations; summation, mean, and maximum all qualify.
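A quick numpy sanity check of why those aggregators make good \(\rho\) functions: reordering the incoming edges leaves the aggregate unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = rng.normal(size=(5, 3))   # five edge vectors of dimension 3
shuffled = edges[::-1]            # same edges, reversed order

# Permutation-invariant aggregators give identical results either way.
for agg in (np.sum, np.mean, np.max):
    assert np.allclose(agg(edges, axis=0), agg(shuffled, axis=0))

# A permutation-*sensitive* reduction (e.g. taking the first edge) does not.
assert not np.allclose(edges[0], shuffled[0])
```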
Updating GN Blocks
The algorithm is as follows:
 Update edges with \(\phi^e\)
 Aggregate edge updates that project to vertex \(v_i\) with \(\bar{e}'_i = \rho^{e\to v}\)
 Apply \(\phi^v\) to each node
 Aggregate all edge updates for the global update with \(\bar{e}' = \rho^{e\to u}\)
 Aggregate all node updates for the global update with \(\bar{v}' = \rho^{v\to u}\)
 Apply \(\phi^u\) for the global update
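Putting the six steps together, here's a minimal numpy sketch of one GN block pass. The \(\phi\) functions here are stand-in additive maps (the paper lets them be arbitrary functions, typically MLPs), and the \(\rho\) aggregators are sums; the graph and attribute values are made up for illustration.

```python
import numpy as np

def gn_block(V, E, senders, receivers, u, phi_e, phi_v, phi_u):
    """One pass of a Graph Network block with summation aggregators.

    V: (N_v, d_v) node attributes; E: (N_e, d_e) edge attributes;
    senders/receivers: (N_e,) int arrays; u: (d_u,) global attribute.
    The phi_* callables are the update functions (MLPs in practice).
    """
    # 1. Update each edge from its attribute, its endpoints, and the global.
    E_new = np.stack([phi_e(E[k], V[receivers[k]], V[senders[k]], u)
                      for k in range(len(E))])
    # 2.-3. Aggregate incoming edge updates per node, then update each node.
    V_new = np.stack([phi_v(E_new[receivers == i].sum(axis=0), V[i], u)
                      for i in range(len(V))])
    # 4.-6. Aggregate all edges and all nodes, then update the global.
    u_new = phi_u(E_new.sum(axis=0), V_new.sum(axis=0), u)
    return V_new, E_new, u_new

# Toy graph: 3 nodes, 2 directed edges (0->1, 1->2), 2-d attributes.
V = np.ones((3, 2))
E = np.ones((2, 2))
senders = np.array([0, 1])
receivers = np.array([1, 2])
u = np.zeros(2)

# Stand-in update functions: simple sums of their inputs.
phi_e = lambda e, vr, vs, u: e + vr + vs + u
phi_v = lambda e_agg, v, u: e_agg + v + u
phi_u = lambda e_agg, v_agg, u: e_agg + v_agg + u

V2, E2, u2 = gn_block(V, E, senders, receivers, u, phi_e, phi_v, phi_u)
```

Node 0 receives no edges, so its aggregate is zero and it keeps its attribute; nodes 1 and 2 each receive one updated edge.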
Relational inductive biases of the GN block
The authors argue that this GN block is Awfully Nifty (TM) because it imposes relational inductive biases in three ways:
 Graphs can be arbitrarily connected. Any configuration of nodes can be isolated or made to interact.
 Relationships between entities are sets, which are invariant to permutations (thanks to the way \(\rho\) works)
 Per-edge and per-node functions are reused across all edges and nodes
Design principles for graph network architectures
Flexibility
There’s a great deal of flexibility when applying GNs.
First, GN blocks can output changes in edges, nodes, and/or graph globals.
Second, input graphs can either come with all the edges they will ever have, or you can learn the interactions between entities.
Designing withinblock architectures
It's up to the architect of a given GNN what gets computed at each \(\phi\). The edge, node, and global updates need not take all the inputs described above. Figure 4 in the paper describes this far better.
Messagepassing neural networks (MPNNs)
Now that the authors have established this nifty framework, we can go ahead and retrofit popular GNN architectures to it. This allows us, among other things, to better reason about the relation inductive biases imposed therein.
Message-passing neural networks were described by Gilmer et al. in 2017^{1}, and a lot of papers have since built on that work.
MPNNs fit the GN block in Fig. 4c best. Some notes about MPNN/GN terminology that the authors outline:
 MPNN’s equivalent of \(\phi^e\) (\(M_t\)) does not take the global graph attribute \(u\) as an input.
 The aggregation function \(\rho^{e \to v}\) is element-wise summation.
 The update function \(U_t\) is analogous to \(\phi^v\) here.
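Those correspondences can be sketched in numpy by restricting the GN block: the message function (the analogue of \(\phi^e\)) ignores \(u\), messages are summed per receiving node (\(\rho^{e\to v}\)), and \(U_t\) plays the role of \(\phi^v\). The `M_t`/`U_t` functions below are made-up stand-ins, not Gilmer et al.'s actual networks.

```python
import numpy as np

def mpnn_step(H, senders, receivers, M_t, U_t):
    """One message-passing step, viewed as a restricted GN block.

    H: (N_v, d) node states. M_t(h_recv, h_send) builds one message per
    edge (note: no global attribute u, unlike the full phi^e). Messages
    are summed per receiving node (rho^{e->v}), and U_t(h, m) updates
    each node (the analogue of phi^v).
    """
    messages = np.stack([M_t(H[receivers[k]], H[senders[k]])
                         for k in range(len(senders))])
    aggregated = np.stack([messages[receivers == i].sum(axis=0)
                           for i in range(len(H))])
    return np.stack([U_t(H[i], aggregated[i]) for i in range(len(H))])

# Toy example: each message is the sender's state; the update adds it on.
H = np.array([[1.0], [2.0], [3.0]])
senders = np.array([0, 1])
receivers = np.array([1, 2])
M_t = lambda h_recv, h_send: h_send
U_t = lambda h, m: h + m

H2 = mpnn_step(H, senders, receivers, M_t, U_t)
```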
All in all, I personally find the GN notation more intuitive than the message-passing terminology / protocol analogy.
Composing GN blocks
Just like you might with convolutional blocks or RNNs when dealing with images or sequences, you can compose GN blocks into common general architectures like encoders/decoders, recurrent networks, or just chaining them.
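The simplest composition, chaining, is just feeding one block's output graph into the next. A hypothetical sketch (the "blocks" here merely scale node attributes, standing in for full GN updates that share the same graph signature):

```python
import numpy as np

def compose(blocks, graph):
    """Chain GN blocks: the output graph of one is the input to the next."""
    for block in blocks:
        graph = block(graph)
    return graph

# Stand-in "block" that just doubles the node attributes.
double_nodes = lambda g: {**g, "V": g["V"] * 2}

graph = {"V": np.ones((3, 2)), "E": np.zeros((2, 2)), "u": np.zeros(2)}
out = compose([double_nodes, double_nodes], graph)  # two chained blocks
```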
There is pretty much nothing novel to add/note here, since GNs very much just behave like most NN blocks.
Implementing GNs
DeepMind has released a library for building these graph networks, in the terms described here, for TensorFlow/Sonnet.
A number of demos are also supplied, e.g. finding the shortest path between two nodes in a graph, and a mass-spring physics simulation.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. ↩︎