Structure and Relationships: Graph Neural Networks and a Pytorch Implementation

Introduction: Interconnected graphical data is prevalent in our world, spanning everything from molecular structures to social networks and city designs. Graph Neural Networks (GNNs) have emerged as a potent method for modeling and learning the spatial and graphical structure of such data. They have been applied to areas like protein structures and other molecular applications, such as drug discovery, as well as modeling systems like social networks. Recently, the conventional GNN has been merged with concepts from other ML models, resulting in fascinating new innovations. One notable development is the combination of GNNs with sequential models – Spatiotemporal GNNs capable of capturing both temporal and spatial dependencies of data, potentially applicable to numerous industrial and research challenges.

Mathematical Description of GNNs: A graph G can be defined as G = (V, E), where V denotes the set of nodes and E represents the edges among them. A graph is typically represented by an adjacency matrix, A, which reflects the existence or absence of edges between nodes. If a graph contains n nodes, A possesses a dimension of (n × n). The adjacency matrix is illustrated in Figure 1.

## Figure 1. Adjacency matrix for three distinct graphs

Each node (and edges!) Although we omit this detail for now, every node will have a set of characteristics (e.g., if the node is a person, the attributes would be age, gender, height, occupation, etc.). If each node has f features, the characteristic matrix X assumes a shape of (n × f). In certain situations, each node may also carry a target label, either categorical or numerical values (depicted in Figure 2).

Single Node Calculations: To understand how nodes interactively learn the connection between themselves and neighbors, we require consideration of the features of their neighbors. This is where GNNs excel in comprehending the structural representation of data via a graph. We describe the transformations for a single node within a single layer as follows:

Figure 2. A depiction of a node (j) accompanied by its features xj and associated label yj along with neighboring nodes (i, 2, 3), each bearing their respective feature embeddings and related labels.

Neighborhood Feature Transformation: Various strategies exist for converting neighboring node features, such as passing them through an MLP network or applying linear transformation. Here, we employ the latter approach, denoted as:

Information Aggregation: After transforming neighboring node features, we aggregate them: Several methods can accomplish this, including summation, averaging, minimum/maximum pooling, and concatenation. In this instance, we opt for concatenation.

Updating Node Features: Once the aggregation occurs, we proceed to update the node’s feature space. Each of these actions is summarized in the following equations:

The entire process is repeated for all remaining nodes in the graph, guided by the Adjacency matrix.

Graph Level Computation: With n nodes and each node featuring f properties, we combine all the characteristics in a solitary matrix:

The neighbor feature transformation and aggregation processes can subsequently be expressed as:

To account for the node’s individual characteristics, we introduce a normalization phase:

However, a common practice involves omitting this normalization stage, leading to the GCN technique. Nonetheless, issues arise when the weight vector for neighbor feature transformation remains constant across all neighbors, as real systems differ significantly. Consequently, graph attention networks (GATs) are introduced to calculate the significance of a neighbor’s feature to the target node, enabling unique contributions from varying neighbors according to their relevance.

Attention Coefficients Determination: GATs determine attention coefficients using a learned, adjustable matrix:

Where W constitutes a shared, learnable feature linear transformation, while wa signifies a learnable weight vector for each neighboring node. The raw attention scores, denoting the importance of node i’s features to node j, are computed as:

## Normalizing the attention scores yields:

## Finally, the aggregation step is performed utilizing the attention coefficients:

This concludes the explanation of a single layer. Multiple layers can be stacked to enhance model intricacy and accommodate complex relationships, although regularization techniques should consistently be employed to counteract overfitting.

Final Note: Upon acquiring complete feature vectors for all nodes, a feature matrix H can be constructed. This matrix serves diverse objectives, such as node or graph classification.