Artificial Intelligence in Finance

Reinforcement Learning for Options Trading

Q Learning + Black-Scholes = Optimal Option Price

Roshan Adusumilli
Published in DataDrivenInvestor · Mar 13, 2020


Image Credit: Financial Times

Artificial intelligence is making its impact on many areas of finance, particularly trading. A diverse range of artificial intelligence subfields such as deep learning, reinforcement learning, and natural language processing are currently being utilized to predict stock movements. A reinforcement learning trading agent attempts to learn stock prices through trial and error. By combining Q learning, a type of reinforcement learning algorithm, with the Black-Scholes model, a traditional model for option pricing, we can create a Q Learning Black Scholes (QLBS) model to determine optimal option prices.

In this article, I’ll go over options, the Black-Scholes model, and Q-learning before showing the implementation of a Q Learning Black-Scholes model for a European put option.

Note: click here if you want to go straight to the implementation of the QLBS model (link doesn’t work in mobile app)

Options Explained

Options are a type of derivative security, meaning their value depends on the price of some other asset such as stock or commodities. For example, an option contract for a stock usually represents 100 shares of the underlying stock. The price of the option contract is known as the premium. Essentially, the contract allows the bearer to either buy or sell an amount of the underlying asset at a pre-determined price (referred to as the strike price) at or before the contract expires. Additionally, the bearer does not have an obligation to buy or sell, so they can also let the contract expire.

The two major types of options are call options and put options; the former allows the bearer to buy the asset at a stated price and the latter allows the bearer to sell the asset at a stated price. Buyers can use call options for speculation and put options for hedging purposes.

Image credit: Gatsby

To understand this better, we’ll look at a real-world example. Suppose that Apple shares are trading at $200 per share. You think the shares may rise above $210 over the next month, so you buy a $210 call trading at $0.67 per share ($67 per contract, since one contract covers 100 shares). If the price rises above $210 on or before the expiration date, you can exercise the option and buy the shares at $210; if the shares fall or never rise above $210, you only lose the $67 premium ($0.67 x 100 shares). Conversely, let’s say you already own Apple shares and think the price may fall, leading you to buy a $190 put trading at $0.63 per share ($63 per contract). If the price falls below $190 on or before the expiration date, you profit by selling the shares at $190; if the shares don’t fall, you only lose the $63 premium ($0.63 x 100 shares). The short snippet after this paragraph double-checks these payoff numbers. Now that we understand options, let’s move on to the Black-Scholes model.
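Here is a minimal sketch of the payoff arithmetic above (the $215 and $180 end prices are made up purely for illustration; only the strikes, premiums, and 100-share contract size come from the example):

def call_profit(S_T, strike, premium_per_share, shares=100):
    # profit at expiration = exercise value minus the premium paid, per contract
    return (max(S_T - strike, 0) - premium_per_share) * shares

def put_profit(S_T, strike, premium_per_share, shares=100):
    return (max(strike - S_T, 0) - premium_per_share) * shares

print(call_profit(215, 210, 0.67))  # stock rises to $215: profit of $433
print(call_profit(205, 210, 0.67))  # stock stays below $210: -$67, the premium is lost
print(put_profit(180, 190, 0.63))   # stock falls to $180: profit of $937
print(put_profit(195, 190, 0.63))   # stock stays above $190: -$63, the premium is lost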

Black-Scholes Model

The Black-Scholes equation, which is probably the most famous equation in finance, provided the first widely used model for option pricing. The current stock price, the option’s strike price, the risk-free interest rate, the time to expiration, and the volatility (a measure of fluctuations in price) are used as inputs to calculate the theoretical value of an option. Introduced in 1973 by economists Fischer Black and Myron Scholes, and extended the same year by Robert Merton, the model was so influential that it won Scholes and Merton the 1997 Nobel Prize in Economics (Black unfortunately died before he could receive the honor).

Image Credit: KhanAcademy

In the image above, C is the call option price, N(d1) is the cumulative standard normal distribution evaluated at d1, which corresponds to the call option’s delta (the ratio of the change in the option’s price to the change in the price of the underlying asset), and N(d2) is the cumulative standard normal distribution evaluated at d2, which corresponds to the probability that the call option will be exercised at expiration.
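Since the formula itself appears only as an image, here it is written out. The Black-Scholes price of a European call is

C = S_0·N(d_1) − K·e^(−rT)·N(d_2)
d_1 = [ln(S_0/K) + (r + σ²/2)·T] / (σ·√T),  d_2 = d_1 − σ·√T

where S_0 is the current stock price, K is the strike price, r is the risk-free rate, σ is the volatility, and T is the time to expiration.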

Black-Scholes for put option
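In the same notation, the Black-Scholes price of a European put is

P = K·e^(−rT)·N(−d_2) − S_0·N(−d_1)

which is exactly the expression the bs_put function computes later in this article.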

For the sake of brevity, I’ll focus on the assumptions the Black-Scholes equation makes as well as its limitations rather than the actual math behind it. If you want a deeper understanding of Black-Scholes, watch this.

Assumptions

  • The option is European (can only be exercised at expiration, not before)
  • No dividends are paid out during the life of the option.
  • Markets are efficient (market movements cannot be predicted).
  • No transaction costs in buying the option.
  • Known and constant risk-free interest rate and volatility of underlying asset.
  • Normally distributed returns on the underlying asset.

Limitations

  • Doesn’t apply to American options (which can be exercised before expiration)
  • Volatility fluctuates in real life
  • Transaction costs exist
  • The risk-free interest rate is not constant in real life

Great, now that we have an overview of Black-Scholes, we’ll go over Q learning before jumping into the implementation of the QLBS model.

Q-Learning

Image Credit: Reinforcement Learning: An Introduction

In reinforcement learning, the goal is to maximize rewards. The agent performs an action to transition from one state to the next, and the action taken in each state gives the agent a reward. For example, think of a scenario with a dog and its human master. In the home (environment), the dog (agent) runs around (action) when the master commands it to sit down (state) and receives no treats (reward). The next time the master commands it to sit down, the dog sits (because running earned no treats) and receives treats. Essentially, the dog learns through trial and error.

Q-learning is a reinforcement learning algorithm where the goal is to learn the optimal policy (the policy tells an agent what action to take under what circumstances). A Q-table with one row per state and one column per action has its values initialized to zero. Then, at each time step t, the agent chooses an action, observes a reward, and enters a new state, updating Q, the “quality” of the action taken in that state. Here’s the algorithm below.

Image Credit: Wikipedia
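Written out, the update performed at each step is the standard Q-learning rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[ r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning rate, γ is the discount factor, and r_{t+1} is the reward received for taking action a_t in state s_t.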

The learning rate, which is usually constant for all time steps, determines how much new information overrides old information, ranging from 0 (the agent learns nothing new) to 1 (the agent only considers the most recent information). The discount factor determines the importance of future rewards, ranging from 0 (only current rewards matter) to 1 (long-term rewards are prioritized).

The agent can interact with the environment in two ways. One is exploitation, where the agent uses the Q-table as a reference and chooses the action with the highest value. However, the Q-table starts out all zeros, so actions sometimes have to be chosen randomly. This is exploration, where the agent selects an action at random instead of choosing based on the maximum future reward. The epsilon value sets the fraction of the time you want your agent to explore instead of exploit.
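As a minimal sketch (with illustrative names, not code from this article), an epsilon-greedy choice between exploring and exploiting typically looks like this, where Q_table is a 2-D array of shape (num_states, num_actions):

import numpy as np

def epsilon_greedy_action(Q_table, state, epsilon):
    # explore: with probability epsilon, pick a random action
    if np.random.rand() < epsilon:
        return np.random.randint(Q_table.shape[1])
    # exploit: otherwise pick the action with the highest Q-value for this state
    return int(np.argmax(Q_table[state]))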

Finally, it’s time to move on to the intersection of Q-learning and Black-Scholes!

Q-Learning + Black-Scholes

When Q-learning and Black-Scholes are combined, our QLBS model uses trading data to autonomously learn both the optimal option price and optimal hedge. For our implementation of the model, we’ll be working with a European put option. Before implementing the QLBS model, we’ll also implement the classic Black-Scholes formula to compare the results of the two. I’ll be leaving out the code for graphs and show the graphs directly instead to avoid making this article unnecessarily long; you can still click here to view the full code (the project is an exercise from Coursera’s Reinforcement Learning in Finance course).

Imports

First, we make the necessary imports.

import numpy as np
import pandas as pd
from scipy.stats import norm
import random
import time
import matplotlib.pyplot as plt
import sys

Monte Carlo Simulation

After making imports, we’ll set the parameters for a Monte Carlo simulation of prices. A Monte Carlo simulation is used to model the probability of different outcomes in a process (such as stock price movement) which is unpredictable due to the presence of random variables.

S0 = 100      # initial stock price
mu = 0.05 # drift
sigma = 0.15 # volatility
r = 0.03 # risk-free rate
M = 1 # maturity
T = 24 # number of time steps
N_MC = 10000 # number of paths
delta_t = M / T # time interval
gamma = np.exp(- r * delta_t) # discount factor

Black-Scholes Simulation

Images from Coursera: Reinforcement Learning in Finance
np.random.seed(42)

# stock price
S = pd.DataFrame([], index=range(1, N_MC+1), columns=range(T+1))
S.loc[:,0] = S0

# standard normal random numbers
RN = pd.DataFrame(np.random.randn(N_MC, T), index=range(1, N_MC+1), columns=range(1, T+1))

for t in range(1, T+1):
    S.loc[:,t] = S.loc[:,t-1] * np.exp((mu - 1/2 * sigma**2) * delta_t + sigma * np.sqrt(delta_t) * RN.loc[:,t])

delta_S = S.loc[:,1:T].values - np.exp(r * delta_t) * S.loc[:,0:T-1]
delta_S_hat = delta_S.apply(lambda x: x - np.mean(x), axis=0)

# state variable
X = - (mu - 1/2 * sigma**2) * np.arange(T+1) * delta_t + np.log(S)   # delta_t here is due to their conventions

Here’s what some stock price and state variable paths look like.

Terminal Payoff

The terminal payoff is the dollar amount an investor receives at expiration from following the option strategy. We’ll define a function to compute the terminal payoff of a European put option.

def terminal_payoff(ST, K):
    # ST: final stock price
    # K: strike price
    payoff = max(K - ST, 0)
    return payoff

Spline Basis Functions

A spline function is a function that is constructed piece-wise from polynomial functions. The B-spline function is the maximally differentiable interpolative basis function, which we can use for the state variable X.

!pip install bspline
import bspline
import bspline.splinelab as splinelab

X_min = np.min(np.min(X))
X_max = np.max(np.max(X))
print('X.shape = ', X.shape)
print('X_min, X_max = ', X_min, X_max)

p = 4         # order of the spline
ncolloc = 12  # number of collocation points
tau = np.linspace(X_min, X_max, ncolloc)  # these are the sites to which we would like to interpolate

# k is a knot vector that adds endpoint repeats as appropriate for a spline of order p
# To get meaningful results, one should have ncolloc >= p+1
k = splinelab.aptknt(tau, p)

# Spline basis of order p on knots k
basis = bspline.Bspline(k, p)

f = plt.figure()
print('Number of points k = ', len(k))
basis.plot()

Data Matrices

Now we make data matrices with feature values; the features here are the values of basis functions at data points and the outputs are 3D arrays with dimensions num_tSteps x num_MC x num_basis.

num_t_steps = T + 1
num_basis = ncolloc

data_mat_t = np.zeros((num_t_steps, N_MC, num_basis))
print('num_basis = ', num_basis)
print('dim data_mat_t = ', data_mat_t.shape)

# fill it
for i in np.arange(num_t_steps):
    x = X.values[:, i]
    data_mat_t[i, :, :] = np.array([basis(el) for el in x])

Dynamic Programming solution for QLBS

Quickly explained, a Markov Decision Process (MDP) consists of a set of possible world states S, a set of possible actions A, a real-valued reward function R(s, a), and a description of each action’s effects in each state (the transition model). The Bellman optimality equation can provide an optimal policy for any MDP.
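For the Q-function, the Bellman optimality equation takes the form

Q*(s, a) = E[ R(s, a) + γ·max_{a′} Q*(s′, a′) ]

where s′ is the state reached after taking action a in state s and γ is the discount factor. The dynamic programming solution below solves this recursion backward in time, one time step at a time.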

Define the option strike and risk aversion parameter

risk_lambda = 0.001  # risk aversion parameter
K = 100              # option strike

Calculate coefficients of the optimal action

# functions to compute optimal hedges
def function_A_vec(t, delta_S_hat, data_mat, reg_param):
    X_mat = data_mat[t, :, :]
    num_basis_funcs = X_mat.shape[1]
    this_dS = delta_S_hat.loc[:, t]
    hat_dS2 = (this_dS ** 2).values.reshape(-1, 1)
    A_mat = np.dot(X_mat.T, X_mat * hat_dS2) + reg_param * np.eye(num_basis_funcs)
    return A_mat


def function_B_vec(t, Pi_hat, delta_S_hat=delta_S_hat, S=S, data_mat=data_mat_t, gamma=gamma, risk_lambda=risk_lambda):
    tmp = Pi_hat.loc[:, t+1] * delta_S_hat.loc[:, t]
    X_mat = data_mat[t, :, :]  # matrix of dimension N_MC x num_basis
    B_vec = np.dot(X_mat.T, tmp)
    return B_vec

Compute optimal hedge and portfolio value

# portfolio value
Pi = pd.DataFrame([], index=range(1, N_MC+1), columns=range(T+1))
Pi.iloc[:,-1] = S.iloc[:,-1].apply(lambda x: terminal_payoff(x, K))

Pi_hat = pd.DataFrame([], index=range(1, N_MC+1), columns=range(T+1))
Pi_hat.iloc[:,-1] = Pi.iloc[:,-1] - np.mean(Pi.iloc[:,-1])

# optimal hedge
a = pd.DataFrame([], index=range(1, N_MC+1), columns=range(T+1))
a.iloc[:,-1] = 0

reg_param = 1e-3  # free parameter

for t in range(T-1, -1, -1):
    A_mat = function_A_vec(t, delta_S_hat, data_mat_t, reg_param)
    B_vec = function_B_vec(t, Pi_hat, delta_S_hat, S, data_mat_t, gamma, risk_lambda)

    # coefficients for expansions of the optimal action
    phi = np.dot(np.linalg.inv(A_mat), B_vec)

    a.loc[:,t] = np.dot(data_mat_t[t,:,:], phi)
    Pi.loc[:,t] = gamma * (Pi.loc[:,t+1] - a.loc[:,t] * delta_S.loc[:,t])
    Pi_hat.loc[:,t] = Pi.loc[:,t] - np.mean(Pi.loc[:,t])

a = a.astype('float')
Pi = Pi.astype('float')
Pi_hat = Pi_hat.astype('float')

Compute rewards for all paths

# Compute rewards for all paths
# reward function
R = pd.DataFrame([], index=range(1, N_MC+1), columns=range(T+1))
R.iloc[:,-1] = - risk_lambda * np.var(Pi.iloc[:,-1])

for t in range(T):
    R.loc[1:,t] = gamma * a.loc[1:,t] * delta_S.loc[1:,t] - risk_lambda * np.var(Pi.loc[1:,t])

# plot 10 paths
# (idx_plot comes from the plotting code omitted in this article; the choice below is an assumption)
idx_plot = np.arange(0, N_MC, N_MC // 10)
plt.plot(R.T.iloc[:, idx_plot])
plt.xlabel('Time Steps')
plt.title('Reward Function')
plt.show()

Compute the optimal Q-function

def function_C_vec(t, data_mat, reg_param):
    X_mat = data_mat[t, :, :]
    num_basis_funcs = X_mat.shape[1]
    C_mat = np.dot(X_mat.T, X_mat) + reg_param * np.eye(num_basis_funcs)
    return C_mat

def function_D_vec(t, Q, R, data_mat, gamma=gamma):
    X_mat = data_mat[t, :, :]
    D_vec = np.dot(X_mat.T, R.loc[:,t] + gamma * Q.loc[:, t+1])
    return D_vec

Now we can call these functions to get the optimal Q-function.

# Q function
Q = pd.DataFrame([], index=range(1, N_MC+1), columns=range(T+1))
Q.iloc[:,-1] = - Pi.iloc[:,-1] - risk_lambda * np.var(Pi.iloc[:,-1])

reg_param = 1e-3

for t in range(T-1, -1, -1):
    C_mat = function_C_vec(t, data_mat_t, reg_param)
    D_vec = function_D_vec(t, Q, R, data_mat_t, gamma)
    omega = np.dot(np.linalg.inv(C_mat), D_vec)

    Q.loc[:,t] = np.dot(data_mat_t[t,:,:], omega)

Comparison

Let’s compare the QLBS price to the European put price given by the Black-Scholes formula.

# The Black-Scholes prices
def bs_put(t, S0=S0, K=K, r=r, sigma=sigma, T=M):
    d1 = (np.log(S0/K) + (r + 1/2 * sigma**2) * (T-t)) / sigma / np.sqrt(T-t)
    d2 = (np.log(S0/K) + (r - 1/2 * sigma**2) * (T-t)) / sigma / np.sqrt(T-t)
    price = K * np.exp(-r * (T-t)) * norm.cdf(-d2) - S0 * norm.cdf(-d1)
    return price

def bs_call(t, S0=S0, K=K, r=r, sigma=sigma, T=M):
    d1 = (np.log(S0/K) + (r + 1/2 * sigma**2) * (T-t)) / sigma / np.sqrt(T-t)
    d2 = (np.log(S0/K) + (r - 1/2 * sigma**2) * (T-t)) / sigma / np.sqrt(T-t)
    price = S0 * norm.cdf(d1) - K * np.exp(-r * (T-t)) * norm.cdf(d2)
    return price
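The comparison and plotting code is omitted here, but as a rough sketch of how the two time-0 prices can be compared (assuming, as in the QLBS construction, that the option’s ask price is the negative of the optimal Q-function at t = 0):

# compare the analytic Black-Scholes put price with the QLBS price at t = 0
bs_price = bs_put(0)
qlbs_price = - np.mean(Q.loc[:, 0])  # negative of the optimal Q-function at t = 0, averaged over paths
print('Black-Scholes put price: %.4f' % bs_price)
print('QLBS put price:          %.4f' % qlbs_price)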

Here, the QLBS put price came out roughly 0.4 higher than the Black-Scholes put price. That gap is expected rather than an error: the QLBS model charges a premium for the residual risk of hedging at discrete time steps (controlled by risk_lambda), while Black-Scholes assumes perfect continuous hedging, so in this instance the QLBS price is arguably the more realistic of the two.

Summary

For reference, here are the various graphs we have plotted throughout the project. It would be interesting to test QLBS vs BS on a European call option, and try to figure out how Q learning can be applied to American call/put options. Once again, I’m linking the full code for the project here if you want to take a look at it.

Whew, that’s it for today!

References

[1] Coursera, Reinforcement Learning in Finance

[2] Igor Halperin, The QLBS Q-Learner Goes NuQLear: Fitted Q Iteration, Inverse RL, and Option Portfolios (2018), arXiv

[3] James Chen, Options, Investopedia
