GNNs to Detect Smart Contract Vulnerabilities

Paper Summary: “Smart Contract Vulnerability Detection Using Graph Neural Networks”

Dickson Wu
DataDrivenInvestor

--

Paper by: Yuan Zhuang, Zhenguang Liu, Peng Qian, Qi Liu, Xiang Wang, Qinming He

Abstract:

Smart contracts in crypto need to be secure; if they're not, a lot of money can be lost. Existing detection methods aren't perfect, so this paper uses Graph Neural Networks (GNNs) to detect smart contract vulnerabilities.

They use a contract graph to represent the syntactic and semantic structures of the smart contract. The graph is then normalized, and a degree-free Graph Convolutional Network (DR-GCN) + a temporal message propagation network (TMP) are used to detect the vulnerabilities.

Introduction:

Crypto's cool! A core component of crypto is smart contracts: contracts written in code. But if those smart contracts have bugs in them, a lot of money can be lost (over $10 billion lost so far).

The current solutions to this problem are testing methods, which aren't perfect: they rely heavily on expert-defined, hand-crafted rules/patterns. But these rules aren't perfect → they lead to high false-positive and false-negative rates (thus contracts are still vulnerable).

The rules are crafted by a few experts, and this is a fundamentally un-scalable solution: as the number of smart contract vulnerabilities grows, those experts can't sift through all of them and write good rules for each.

This paper’s approach

Let’s use machine learning. We take the smart contract and turn it into a contract graph. Nodes = functions / variables. Edges = temporal execution traces.

GNNs propagate information in a "flat" way, treating all nodes as equally important, so there's an elimination process to normalize the graph. We then use a DR-GCN to handle the normalized graphs + a TMP to take the roles and relationships of nodes into account.

They experimented on over 300k smart contracts and beat the SOTA for reentrancy, timestamp dependence and infinite loop vulnerability detection.

Problem Statement:

We're given the smart contract code; the model takes it and detects vulnerabilities at the function level → 0 for no vulnerability, 1 for a vulnerability of the given type.

Reentrancy attack: during the execution of a smart contract function, the same function is called again and again (re-entered) before it finishes updating its state; in most cases this lets the attacker transfer more money than they're supposed to be able to withdraw (a toy simulation follows the definitions below).

Infinite loop: a loop whose execution never stops.

Timestamp dependence vulnerability: when the smart contract depends on the timestamp of the block to do something → Thus the miners can fiddle with the timestamp to gain an advantage.
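To make reentrancy concrete, here's a toy Python simulation (my own sketch, not the paper's code; real attacks go through Solidity's call.value, but the buggy ordering, external call before state update, is the same):

    # Toy reentrancy simulation. All names here are made up for illustration.
    class Vault:
        def __init__(self, funds):
            self.funds = funds            # total money held by the contract
            self.balances = {}            # per-user credited balances

        def deposit(self, user, amount):
            self.balances[user] = self.balances.get(user, 0) + amount
            self.funds += amount

        def withdraw(self, user, receive_callback):
            amount = self.balances.get(user, 0)
            if amount > 0 and self.funds >= amount:
                self.funds -= amount
                receive_callback()        # external call happens FIRST...
                self.balances[user] = 0   # ...state is updated LAST (the bug)

    class Attacker:
        def __init__(self, vault):
            self.vault = vault
            self.stolen = 0

        def receive(self):                # runs while withdraw() is mid-flight
            self.stolen += self.vault.balances["attacker"]
            if self.vault.funds >= self.vault.balances["attacker"]:
                self.vault.withdraw("attacker", self.receive)   # re-enter!

    vault = Vault(funds=100)
    vault.deposit("attacker", 10)
    attacker = Attacker(vault)
    vault.withdraw("attacker", attacker.receive)
    print(attacker.stolen)                # 110: far more than the 10 deposited

Because the balance is zeroed only after the external call returns, every re-entry still sees a balance of 10, and the vault pays out until it's empty.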

Our Method:

There are 3 major parts:

  1. Create graph: need to extract control flow + data flow + fallback mechanisms
  2. Graph normalization: inspired by k-partite graphs
  3. Message propagation network: to perform the vulnerability detection

Graph Generation:

Previous work has been done to turn programs into symbolic graphs → this work builds off of that and does it for smart contracts. There are 3 types of nodes, plus edges:

  • Major node: invocations of functions that are important for detecting a specific vulnerability, so each vulnerability has its own major nodes. Reentrancy = the call.value function, timestamp dependence = block.timestamp, and infinite loop = any function
  • Secondary node: These are critical variables
  • Fallback node: This node simulates a fallback function
  • Edge: edges capture the relationships between nodes (they're directional), and there are 4 types: control flow, data flow, forward, and fallback edges. They're also temporally ordered
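As a rough sketch of what this graph might look like as a data structure (my own layout, purely illustrative, not the paper's):

    from dataclasses import dataclass, field

    # Hypothetical encoding of the contract graph described above.
    @dataclass
    class Node:
        name: str            # e.g. "call.value", "block.timestamp", a variable
        kind: str            # "major", "secondary", or "fallback"
        features: list = field(default_factory=list)

    @dataclass
    class Edge:
        src: str             # edges are directional: src -> dst
        dst: str
        kind: str            # "control_flow", "data_flow", "forward", "fallback"
        order: int           # temporal order e_1, e_2, ... of execution

    @dataclass
    class ContractGraph:
        nodes: dict          # name -> Node
        edges: list          # kept sorted by .order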

Graph Normalization:

We have to normalize the graphs (prune them) because GNNs propagate information in a flat way, treating all nodes equally + contract graphs all have distinct shapes.

We eliminate the secondary nodes first. Each passes its features to the closest major node (if several major nodes are equally close, we pass the features to all of them). Its edges also migrate to that major node. We do the same for the fallback node.

In the end, the major nodes have an aggregate of 3 feature sets:

  • Original features
  • In-features: features that originally came from a secondary node whose edge points towards the major node
  • Out-features: features that came from a secondary node whose edge points away from the major node

We can see this in Figures 1b & 1c of the paper.
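Here's a minimal sketch of that elimination step, reusing the ContractGraph layout from the sketch above (my reading of the paper: "closest" = shortest hop distance, and a direct edge pointing at the major node makes the features in-features):

    import networkx as nx   # used here for shortest-path distances

    def normalize(graph: ContractGraph) -> ContractGraph:
        """Fold secondary/fallback node features into the closest major node(s)."""
        g = nx.DiGraph()
        g.add_nodes_from(graph.nodes)
        g.add_edges_from((e.src, e.dst) for e in graph.edges)
        und = g.to_undirected(as_view=True)
        majors = [n for n in graph.nodes.values() if n.kind == "major"]
        for node in graph.nodes.values():
            if node.kind == "major":
                continue
            # hop distance from this minor node to every reachable major node
            dists = {m.name: nx.shortest_path_length(und, node.name, m.name)
                     for m in majors if nx.has_path(und, node.name, m.name)}
            if not dists:
                continue
            best = min(dists.values())
            for m in majors:
                if dists.get(m.name) == best:       # ties: pass to ALL of them
                    tag = "in" if g.has_edge(node.name, m.name) else "out"
                    m.features += [(tag, f) for f in node.features]
        # (edge migration to the major nodes is omitted for brevity)
        return graph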

Message Propagation Neural Network:

Both of these networks take in the contract graph, and output a label 0 or 1.

DR-GCN extends the GCN, a CNN-like network that's applied to graph-structured data. The layer-wise propagation uses:

  • Â = A + I, where A = the adjacency matrix (a matrix that represents the graph) and I = the identity matrix (which adds self-loops)
  • X_l = feature matrix of layer l (inputs)
  • W_l = trainable weight matrix of layer l (weights)
  • D̂ = the diagonal node degree matrix of Â, used to normalize Â
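The equation image is gone, but these are exactly the symbols of the standard GCN propagation rule (Kipf & Welling), so the layer presumably reads:

    X_{l+1} = \sigma\left( \hat{D}^{-1/2} \, \hat{A} \, \hat{D}^{-1/2} \, X_l \, W_l \right)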

But our graphs are already normalized, thus it becomes:
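I couldn't recover the paper's exact formula here, but given the name "degree-free" and the fact that the graph is pre-normalized, a plausible reading is that the degree term is dropped (treat this as my guess, not a transcription):

    X_{l+1} = \sigma\left( \hat{A} \, X_l \, W_l \right)

In numpy, one such layer is a one-liner:

    import numpy as np

    # Hedged sketch of one "degree-free" graph-conv layer (my guess above);
    # the choice of tanh as the nonlinearity sigma is arbitrary.
    def dr_gcn_layer(A_hat, X, W):
        return np.tanh(A_hat @ X @ W)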

TMP consists of a message propagation phase and a readout phase. Message propagation = just running the graph (information is passed along the edges); the readout phase takes the final state of the graph and computes a label from it.

Message propagation phase: at timestep 0, each node (V_i) has a hidden state (h⁰_i) which is initialized to the features of V_i.

We then pass messages along, one timestep at a time. At each timestep we propagate only 1 message → more specifically, at timestep k, information is passed along edge e_k.

First the message looks like this:
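The equation image didn't survive; from the definitions just below, it's presumably the concatenation (my reconstruction):

    x_k = h_{s_k} \oplus t_k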

  • x_k = the original message
  • h_sk = the hidden state of the starting node
  • t_k = the edge type

h_sk and t_k are concatenated together to get x_k. We take this message and process it again:

  • m_k = message
  • W_k = the weight
  • b_k = the bias
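Reconstructing again from the symbols (hedged), the processed message is a simple affine transform:

    m_k = W_k \, x_k + b_k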

The message (m_k) is passed along to the receiving node, which then aggregates the message with its hidden state:

  • h_ek = hidden state of the receiving node
  • U, Z = Weights
  • b_1 = bias
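A plausible reconstruction of the aggregation, matching the tanh description in the next paragraph:

    \hat{h}_{e_k} = \tanh\left( U \, h_{e_k} + Z \, m_k + b_1 \right)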

We just apply some weights to the message and the previous hidden state, add a bias, and pipe it through a tanh… and then softmax it:

  • R = Weight
  • b_2 = bias
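Per the description, the softmax step would then be (again my reconstruction, not a verified transcription):

    h_{e_k} \leftarrow \mathrm{softmax}\left( R \, \hat{h}_{e_k} + b_2 \right)

Put together, one propagation step looks roughly like this in numpy (shapes and parameter names are my assumptions):

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    # One TMP propagation step at timestep k, under the reconstructed
    # equations above. h maps node id -> hidden state vector.
    def tmp_step(h, edge, p):
        s, dst, t = edge                    # start node, end node, edge-type vector
        x_k = np.concatenate([h[s], t])     # x_k = h_sk (+) t_k
        m_k = p["W_k"] @ x_k + p["b_k"]     # m_k = W_k x_k + b_k
        h_hat = np.tanh(p["U"] @ h[dst] + p["Z"] @ m_k + p["b1"])
        h[dst] = softmax(p["R"] @ h_hat + p["b2"])
        return h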

Readout phase: This is the phase at which we take all the hidden states and then compute a label from them… One way to do it would be:

  • y^ = the label
  • |V| = number of major nodes
  • f = a neural network
  • h_i^T = final hidden state

We loop through all the major nodes, pass each of the final hidden states into a neural network, then sum up the results
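As an equation, that would be (my reconstruction; the 0/1 label is then read off the summed score):

    \hat{y} = \sum_{i=1}^{|V|} f\left( h_i^{T} \right)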

But the researchers found that capturing the difference between the final and initial hidden state was valuable, thus we get this:

First we concatenate the final and initial hidden states:

  • All W = Weights
  • All b = biases

Take the concatenated hidden state, multiply it by a weight, add a bias, tanh it. Then multiply the output by another weight, add a bias, and softmax it.
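In symbols (my reconstruction; the weight subscripts are invented for clarity), with \tilde{h}_i = h_i^{T} \oplus h_i^{0}, one such branch (call it o_i) computes:

    o_i = \mathrm{softmax}\left( W_2 \tanh\left( W_1 \tilde{h}_i + b_1 \right) + b_2 \right)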

We do this twice with another set of parameters. The final form is this:

We iterate through all the major nodes. For each we compute o_i and g_i, take their element-wise product, and pass it through a sigmoid. Then we sum up all the end states and read a label from the sum.
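Written out (reconstructed from that description, with σ the sigmoid and ⊙ the element-wise product), the 0/1 label is read off:

    \hat{y} = \sum_{i=1}^{|V|} \sigma\left( o_i \odot g_i \right)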

They trained both networks by feeding in normalized graphs + their labels.

Experiments:

Dataset:

They compared their method against 12 other methods:

  • 4 existing smart contract vulnerability detection tools: Oyente, Mythril, Smartcheck, Securify
  • 4 NN based methods: RNN, LSTM, GRU, GCN
  • 4 program loop detection methods: Jolt, PDA, SMT, Looper

They used 20% of the data as training and the rest for testing.

The datasets are both Ethereum smart contracts & VNT smart contracts. Ethereum was used for reentrancy and timestamp dependence, while VNT was used for the infinite loop.

Results:

For the reentrancy & timestamp dependence attacks, the new methods beat the existing solutions on all metrics. Existing methods are all too rigid, thus producing low accuracy and many false warnings.

When compared to existing NN methods, we can see that the graph representation of the data gives a significant advantage over sequential models that just treat the source code as a sequence of tokens. This is because sequential models ignore the structural information.

GCN and DR-GCN have lower accuracies than TMP because TMP also takes into account the temporal information, while the GCNs do not.

They also experimented without normalization, and saw that normalization improved performance.

Related work:

Smart Contract Vulnerability Detection: current methods rely on symbolic execution, and thus have high false-negative rates. Recent work explores dynamic execution, but it needs hand-crafted agent contracts, so it's not fully automated.

Graph Neural Networks: there are 2 categories of GNNs: spectral-based approaches (they're like CNNs for graph data) and spatial-based approaches (they're like RNNs & attention mechanisms for graph data).

Conclusion:

This paper proposes a fully automated, GNN-based approach to smart contract vulnerability detection. Experiments show that this method significantly beats other state-of-the-art methods (NN and non-NN alike).

If you want to find out more: Read the paper here!

Thanks for reading! I’m Dickson, currently working on Deus Ex Securitas, where we aspire to achieve superhuman level performance in smart contract exploit detection!

--

Hi I’m Dickson! I’m an 18-year-old innovator who’s excited to change the course of humanity for the better — using the power of ML!