DOTA 2 vs OpenAI Five

Manish Pawar · Published in DataDrivenInvestor · 3 min read · Sep 23, 2018


This is a big deal. The entire AI community is excited about its results…

OpenAI called its system OpenAI Five because it consists of five neural networks that work together to defeat the opposing team.

It’s fascinating because an AI can learn how to behave and strategize in an unpredictable environment, and the skills it learns could transfer to real-life problems such as drug discovery and climate change. Similar moments came when DeepMind’s AlphaGo and IBM’s Deep Blue beat their respective world champions.

For non-gamers, DOTA 2 (Defense of the Ancients 2) is a real-time strategy game in which two teams of five fight on a single 3D map to destroy the other team’s base structure (the Ancient) while defending their own. DOTA 2 is one of the highest-earning esports titles (roughly $100 million in total prize money) thanks to its global competitive scene.

OpenAI Five uses five RNNs (LSTMs), the kind of networks used with sequence data like text, audio, speech, etc. You can find more about LSTMs here. An LSTM reads input data and predicts what comes next. Thus, each of OpenAI Five’s networks takes the current game state (extracted via Valve’s API) and predicts the next action: at every step it picks one action from the possible actions available to it and executes it.
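To make that concrete, here is a minimal sketch of an LSTM policy in PyTorch (my own illustration, not OpenAI’s actual architecture): a feature vector describing the game state goes in, and a distribution over possible actions comes out. The state and action sizes below are made-up numbers.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Toy LSTM policy: game-state features in, action distribution out."""
    def __init__(self, state_dim=512, hidden_dim=1024, num_actions=170):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)            # embed the observation
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, states, hidden=None):
        # states: (batch, time, state_dim) -- one feature vector per game tick
        x = torch.relu(self.encoder(states))
        x, hidden = self.lstm(x, hidden)                           # memory across time steps
        logits = self.action_head(x)                               # a score for each possible action
        return torch.distributions.Categorical(logits=logits), hidden

# One hero's policy picking an action for the latest game tick
policy = LSTMPolicy()
obs = torch.randn(1, 1, 512)                                       # fake observation from the game API
dist, hidden = policy(obs)
action = dist.sample()                                             # the move the bot will execute
```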

The whole key lies in reinforcement learning. Unlike supervised or unsupervised learning, we don’t have a fixed dataset here. Instead, we have time-delayed signals, called rewards, which play the role of labels.

OpenAI built the reward as an aggregated metric of kills, assists, etc., which can be positive or negative depending on the step taken by an agent. In OpenAI Five, each agent is an LSTM network whose decision-making is framed as a Markov decision process, which states that the outcome of an action taken in a state depends only on that state and not on any prior states. Find more about Markov decision processes here
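As a rough illustration of such an aggregated reward (the event names and weights below are invented for the example, not OpenAI’s actual reward shaping), each step’s reward can be a weighted sum of in-game events, positive for good outcomes and negative for bad ones:

```python
# Hypothetical reward shaping: weights and event names are illustrative only.
REWARD_WEIGHTS = {
    "kill": 1.0,
    "assist": 0.5,
    "death": -1.0,
    "last_hit": 0.1,
    "tower_damage": 0.2,
}

def step_reward(events: dict) -> float:
    """Aggregate the events observed this tick into a single scalar reward."""
    return sum(REWARD_WEIGHTS.get(name, 0.0) * count for name, count in events.items())

# Example: the agent got one kill and one assist, but also died once
print(step_reward({"kill": 1, "assist": 1, "death": 1}))  # 0.5
```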

In reinforcement learning, the goal of an agent is to maximise its reward by following a policy. Many methods exist for learning such a policy (one family is called policy gradients).
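A minimal sketch of the vanilla policy-gradient idea (REINFORCE), assuming a PyTorch policy like the one above: push up the log-probability of actions that were followed by high returns.

```python
import torch

def reinforce_loss(log_probs, returns):
    """Vanilla policy gradient: weight each action's log-prob by the return that followed it."""
    # log_probs: log pi(a_t | s_t) for the actions actually taken
    # returns:   discounted sum of future rewards from each time step
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize as a simple baseline
    return -(log_probs * returns).sum()

# Example with fake rollout data
log_probs = torch.randn(100, requires_grad=True)
returns = torch.randn(100)
loss = reinforce_loss(log_probs, returns)
loss.backward()   # gradients flow back into the policy parameters via log_probs
```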

But policy gradients have a drawback: when the step size is small, progress is slow, and when the step size is large, a single update can change the policy too much. Either way, our agent can end up stuck in the environment.

This can be solved by comparing two versions of the policy, the old (pre-update) one and the new (post-update) one, together with an advantage estimate, which measures how much better a choice is than what the old policy would have done. That is the idea behind TRPO (image below). OpenAI introduced PPO, which is less complex but built on the same foundation as TRPO. You can check PPO here
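The core of PPO is the clipped surrogate objective: take the probability ratio between the new and old policy, weight it by the advantage, and clip the ratio so one update cannot move the policy too far. A small sketch in PyTorch (the epsilon of 0.2 is the commonly used default, not necessarily OpenAI Five’s setting):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # pessimistic (lower) bound

# Fake data for one minibatch of timesteps
new_lp = torch.randn(64, requires_grad=True)
old_lp = torch.randn(64)
adv = torch.randn(64)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
```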

So, OpenAI trained the five LSTMs using PPO on 256 GPUs on Google Cloud. The team keeps training the models every day to improve the agents’ skills…

USEFUL LINKS:

https://dota2api.readthedocs.io/en/latest/

https://blog.openai.com/openai-five/

https://www.theverge.com/2018/8/28/17787610/openai-dota-2-bots-ai-lost-international-reinforcement-learning

Originally published at blog.lipishala.com on September 23, 2018.
