To understand this paper, I am first going to describe some basic concepts in Reinforcement Learning. You can also watch Udacity’s free course on Deep Learning with PyTorch. It helps if you have implemented the DQN algorithm before, but if you have not, do not worry: there is a quick refresher in Part 1.

This blog consists of four parts:

1. Deep Neural Network for Single-Agent: Reinforcement Review, DQN and Replay Memory

2. Overview of multi-agent RL

3. Deep Neural Network for multi-agent: Independent Q-Learning (IQL) and the challenges of combining it with multi-agent RL

4. FingerPrinting

Ok, let’s go! :)

What is Stabilising Replay for Multi-Agent and Why is it Important?

In 2015, DeepMind successfully combined Deep Neural Networks with Reinforcement Learning for a single agent. Combining Deep Neural Networks with RL enabled AI for the first time to surpass the performance of professional human players across many game scenarios (RL programs were already beating people at backgammon back in the 1980s).

This worked well for a single agent. However, when we have many agents, we cannot easily combine Deep Neural Networks with RL, mainly because each agent constantly changes the dynamics of the environment, which makes it really hard for the other agents to learn what to do. This paper proposes two solutions, importance sampling and fingerprinting, to enable agents to learn when the dynamics of the environment are changing, or in other words, when the environment is non-stationary.

In Reinforcement Learning, we have an agent interacting with the environment. We model the agent and its interaction with an environment as a Markov Decision Process (MDP). An MDP is a mathematical framework for the decision-making process: the agent starts in state “S” and at each time step takes an action “A”, gets a reward “R”, and lands in the next state “S`”, and this cycle repeats.
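The S, A, R, S` cycle above can be sketched in a few lines. This is a hypothetical toy environment of my own (two weather states, two actions), not anything from the paper:

```python
# A toy MDP sketch: (state, action) -> (next_state, reward).
# States and rewards here are made up purely for illustration.
transitions = {
    ("cold", "heat"): ("warm", 1.0),
    ("cold", "wait"): ("cold", 0.0),
    ("warm", "heat"): ("warm", 0.5),
    ("warm", "wait"): ("cold", 0.0),
}

def step(state, action):
    """Return (next_state, reward) for the given state-action pair."""
    return transitions[(state, action)]

state = "cold"
total_reward = 0.0
for t in range(3):                      # three time steps of the S, A, R, S` cycle
    action = "heat"                     # a fixed policy, just for illustration
    next_state, reward = step(state, action)
    total_reward += reward
    state = next_state
```

A real environment would of course be stochastic and far larger; the point is only the repeating state → action → reward → next-state loop.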


Before we go further, let’s understand some terminologies that are frequently being used in RL.

  • Action space is a set of all the possible actions. Actions can be either continuous or discrete. For example, {left, right, up, down} is the set of all the possible actions in a discrete environment.
  • State space is the set of all the possible states that the agent can explore.
  • Discount factor: how much the agent cares about rewards far in the future. You can consider money now to be worth more than money later. For example, if you are trading stocks, it is more beneficial to make a profit sooner rather than later, because money you have now can be invested to earn more returns. The discount factor plays the same role for rewards.
  • Transition probability is a probability distribution that indicates given state “s” and action “a”, what the likelihood is of you landing in the next state “s`”.
  • Policy is the agent’s action-selection rule: it decides how to map states to actions.
  • Observation vs State: States are well-defined and always exist; observability is about whether the agent can actually see them. For example, in a grid environment your home might be located at a particular state, but if it is cloudy, you won’t be able to observe your home. In RL, agents sometimes cannot observe all of the state, and we call this a partially observable setting.

Note: observation and state sometimes refer to the same thing. If your environment is fully observable, state and observation are equivalent.
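The discount factor from the list above can be made concrete: a reward received t steps in the future is worth gamma^t times its face value today. A quick sketch (the reward numbers are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by how far in the future it arrives."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The same reward of 10 is worth less the later it arrives:
now = discounted_return([10])          # received immediately
later = discounted_return([0, 0, 10])  # received two steps later: 10 * 0.9**2
```

With gamma = 0.9, the immediate reward is worth 10.0 while the delayed one is worth only 8.1, which is exactly the “money now is worth more than money later” intuition.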

At each time step, we record the agent’s observation, action, reward, transition probability, or other variables (depending on the task) into a tuple. For more information, look at OpenAI’s Spinning Up in RL.
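Such a tuple can be as simple as a named tuple. The exact fields vary by task; these are a common choice, not a prescribed format:

```python
from collections import namedtuple

# One experience tuple per time step; field names are a typical convention.
Transition = namedtuple("Transition", ["obs", "action", "reward", "next_obs", "done"])

# An illustrative transition (values made up):
t0 = Transition(obs="cold", action="heat", reward=1.0, next_obs="warm", done=False)
```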

  • Trajectory is the sequence of action-observation (or state-action) pairs until the episode ends.

For more information, take a look at the following lecture.

According to Joshua Achiam, Reinforcement Learning can generally be categorized into two broader categories: model-based and model-free. This blog post by Vitchyr Pong clearly explains the difference between the two.

Note: Model-based methods are a lot more sample-efficient. Model-based RL is an active area of research, but until now most of the progress in Reinforcement Learning has been in model-free methods.

Generally, model-free methods can be categorized into two broad categories: the policy-based approach and the value-based (Q-learning) approach.

In this post, I am going to explain how to use the Q-learning (model-free, value-based) approach both for single-agent (DQN) scenarios and multi-agent (IQL) scenarios.



DQN: Deep Q-Network is one of the main algorithms that successfully combined a Deep Neural Network with the Q-learning algorithm for a single agent.

Q-learning uses the Bellman equation. It is an algorithm that takes a state “s” and action “a” (or an observation) as input and aims to find an optimal policy that maximizes the expected reward (action-value) over the trajectory.

Why Deep? and what is Q* ?

Note: The action-value function is also referred to as the Q function, and action-values are sometimes called Q-values.

Q* : The “*” denotes the optimal Q function. We use a neural network as a function approximator to estimate these optimal action-values, also known as Q. Given an observation as input, the output of our neural network is a set of estimated action-values (one per action in our action space), and our goal is to pick the action with the maximum action-value.
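As a sketch of “observation in, one Q-value per action out, pick the argmax”: here the “network” is just a linear map with arbitrary illustration weights, standing in for a real trained deep network:

```python
# A tiny linear "Q-network" sketch: observation in, one Q-value per action out.
# The weights below are made-up illustration values, not learned parameters.
ACTIONS = ["left", "right", "up", "down"]
WEIGHTS = [
    [0.2, -0.1],   # left
    [0.5,  0.3],   # right
    [-0.4, 0.1],   # up
    [0.0,  0.2],   # down
]

def q_values(obs):
    """Estimated action-values for a 2-dimensional observation vector."""
    return [sum(w_i * o_i for w_i, o_i in zip(w, obs)) for w in WEIGHTS]

def greedy_action(obs):
    """Pick the action with the maximum estimated Q-value."""
    qs = q_values(obs)
    return ACTIONS[qs.index(max(qs))]
```

A real DQN replaces the linear map with a deep network, but the interface (observation in, a vector of Q-values out, argmax to act) is the same.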

DQN and Atari: The input of the neural network consists of a stack of 4 frames (84 x 84 x 4). You can consider the stack of frames as the observation in an Atari game. The output is a set of action-values for our agent; in Pong, for example, we want to know where to move the paddle.

Note: In the DQN paper, the network architecture they used consists of two convolutional layers followed by two fully connected layers.
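The shapes in that architecture can be checked with a quick calculation. Assuming the convolution sizes from the original DQN paper (16 filters of 8x8 with stride 4, then 32 filters of 4x4 with stride 2, no padding):

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

h = conv_out(84, kernel=8, stride=4)   # 84x84 frames -> 20x20 feature maps
h = conv_out(h, kernel=4, stride=2)    # 20x20 -> 9x9
flat = 32 * h * h                      # 32 feature maps flattened for the FC layers
```

This is a handy sanity check when wiring the convolutional output into the fully connected layers.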

I recommend reading Andrej Karpathy’s blog post, Deep RL: Pong from Pixels.

Replay Memory

One of the main problems with DQN is stability. Consecutive observations are highly correlated, so if we feed them into the network in order, the network overfits to recent experience instead of generalizing over all of its past experiences.

Imagine your agent lives in Hawaii, and we want the agent to learn to wear shorts when the weather is sunny and use an umbrella when it is rainy. Here the weather is the observation, and the action is wearing shorts or using an umbrella. If the agent has recently seen only rainy weather, it can eventually forget how to act in sunny weather. To avoid this problem, we record the agent’s action-observations as experience tuples in replay memory and train on random samples from these experiences; old experience stays around, so the agent keeps learning which action to take for every observation it has seen. That is why the replay buffer helps stabilize training. In Part 4, we will see how to make the replay buffer work despite non-stationarity in a multi-agent setting.
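A minimal replay buffer sketch, using the Hawaii example as its data (the class and its methods follow a common DQN pattern, not the paper’s exact implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of experience tuples; sampling uniformly at random
    breaks up the correlation between consecutive observations."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences drop off the end

    def push(self, obs, action, reward, next_obs, done):
        self.memory.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        """Uniformly sample a batch of past experiences for training."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=100)
for day in range(50):
    weather = "rainy" if day % 2 else "sunny"             # alternating observations
    action = "umbrella" if weather == "rainy" else "shorts"
    buffer.push(weather, action, 1.0, weather, False)

batch = buffer.sample(8)   # a mix of sunny and rainy days, not just recent ones
```

Because sampling is uniform over the whole buffer, a training batch contains both sunny and rainy experiences even if the agent has only seen rain lately.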

Multi-agent Reinforcement Learning considers multiple AI agents interacting with an environment. The multi-agent setting is still an under-explored area of research in reinforcement learning, but it has tremendous applications such as self-driving cars, drones, and games like StarCraft and Dota.

Multi-agent scenario

You can divide multi-agent settings into three main categories:

  1. Cooperative: All agents share the same reward and work together to obtain it; consider this a team reward.
  2. Competitive: A zero-sum game. Each agent (or group of agents) competes with the others for the reward, which cannot be shared amongst agents or groups of agents.
  3. Mixture of both: The third category is both competitive and cooperative. For example, a soccer game is cooperative amongst teammates and competitive between the two teams. In mixed scenarios it is sometimes hard to draw an exact line between when agents cooperate and when they compete; the stag hunt from game theory models this kind of situation.

Partially Observable vs Fully Observable


Imagine you are playing poker. If you can view all of your opponents' cards, you have enough information to decide what card to play next (what action to choose), so you do not need to memorize past actions.

However, if your opponents' cards are hidden, then in order to increase the probability of choosing the right card (the optimal action), you want to memorize what cards have been played (past action-observations).

The fully observable setting has the Markov property: given the state, you have all the information you need to choose the next action, so no memory is required. In a partially observable setting, however, you need to memorize past action-observations in order to make an optimal decision.

Joint Action Observation vs Independent Action Observation:

It is important to understand the difference between these two because it comes up a lot in a multi-agent setting.

So, what does this mean? Imagine a group of five students in a classroom. The professor asks the students to write a 50-page paper covering 5 topics within 2 hours.

One option is for all students (agents) to read about the same topics and write every sentence of the paper together, agreeing on each word they write. This is impractical given the time constraint, and if we increase the number of students, the problem becomes exponentially harder.

The second option is for each student to read and write about a particular topic (each student has their own observation-action). Each student writes roughly 10 pages, and at the end they combine their writings (action-observations) and get a reward. This is far more scalable than the first approach.

In the multi-agent setting, we refer to the first case as joint action-observation and refer to the second case as independent action-observation.
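The scalability gap between the two cases is easy to quantify: with n agents that each have |U| actions, the joint action space has |U|^n entries, while independent learners each only handle |U|. A quick illustration with made-up numbers:

```python
n_agents = 5
n_actions = 4      # e.g. {left, right, up, down} per agent

# Joint action-observation: one Q-value per combination of everyone's actions.
joint = n_actions ** n_agents        # grows exponentially with the number of agents

# Independent action-observation: each agent only scores its own 4 actions.
independent = n_agents * n_actions   # grows linearly with the number of agents
```

With just 5 agents the joint formulation already needs 1024 action combinations versus 20 for the independent one, which is the classroom intuition in numbers.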

Part 3: Independent Q-Learning (IQL) and the Challenges of Combining It with Multi-Agent RL

This paper is about Deep Independent Q-Learning in a cooperative and partially observable setting. By now, you should know what each of these means, but let’s recap:

Deep: Means we are using a deep neural network to approximate Q-values.

Independent: As we discussed above under independent observation, IQL (Independent Q-Learning) is a method where each agent has its own independent action-observation and therefore learns a different policy. All agents learn separately, so each has its own Q-values and its own network. Consider Dota: each agent is a different character with its own observations (part of the game is that you need to explore) and its own actions to take (such as fighting an enemy it encounters).

Cooperative: Even though each agent has its own observations, actions, and policy, the agents still share the same final reward. In Dota, for example, the outcome of the game is a win or a loss, so the five agents work together and win or lose as a team.

Partially Observable: Since an agent cannot access the whole state, it needs memory to record the history of its own actions and/or observations.

As you might recall, the inputs of the Q function were the state and action. Here, instead of the state, we use T to refer to the agent’s action-observation history and U to refer to the agent’s actions. The superscript “a” refers to agent a: T_a means the action-observation history of agent a, U_a its action, and Q_a its Q function.
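A minimal tabular sketch of this idea, entirely hypothetical and for illustration only: each agent keeps its own Q_a, keyed by its own action-observation history T_a rather than by a state:

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5   # discount factor and learning rate (illustration values)

# One Q-table per agent, keyed by (history, action).
# A history is a tuple of past observations, standing in for T_a.
Q = {a: defaultdict(float) for a in ["agent_0", "agent_1"]}

def update(agent, history, action, reward, next_history, next_actions):
    """Independent Q-learning update: each agent treats the others as part of
    the environment and updates only its own table."""
    q = Q[agent]
    best_next = max(q[(next_history, u)] for u in next_actions)
    target = reward + GAMMA * best_next
    q[(history, action)] += ALPHA * (target - q[(history, action)])

# agent_0 updates from its own experience; agent_1's table is untouched.
update("agent_0", ("obs1",), "left", 1.0, ("obs1", "obs2"), ["left", "right"])
```

A deep version replaces each table with a neural network, but the independence is the same: each agent learns only from its own histories and actions.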

Since we are using independent Q-learning, each agent has its own action-observation and, as in the classroom example, this makes the approach more scalable.

However, the environment is non-stationary. What does this mean?

Well, each agent has its own policy, and that policy is being learned over a sequence of actions. Every time we try to update the Q function, the other agents’ policies are changing, and since those agents are considered part of the environment, their changing behavior means the dynamics of the environment are constantly changing. This makes it really hard to approximate the Q function.

This is indeed one of the main problems with multi-agent reinforcement learning under independent observation: the non-stationarity problem.

Everything you have learned until now was the groundwork needed to fully understand this problem. So, how do we resolve it? That is what this paper is about.

Part 4: Fingerprinting

The idea behind the fingerprint is simple: if we condition on the other agents’ policies (the weights of their networks), then our environment is no longer non-stationary.

This would mean having a master brain that uploads all the other agents’ brains into its own. But this is just impossible: the input of the Q function would become extremely large and blow up.

Instead of uploading the brains of all the other agents (Θ_a, the weights of their neural networks), you decide to just upload their memories: we condition on the other agents’ experiences in the replay buffer.

And that is the idea behind the fingerprint method. Here is a recap of what we have learned so far: we integrate deep neural networks with independent Q-learning.

1. Each agent has its own observation and feeds this observation into a neural network in order to approximate the value function. You can consider the weights of the neural network as the POLICY that maps observations to actions.

2. Since the learning is independent, each agent has its own observations and therefore its own policy.

3. The environment is non-stationary (we consider the other agents part of the environment): each agent’s policy is constantly changing, and therefore the dynamics of your environment are changing.

4. One way to resolve the non-stationarity is to condition on the other agents’ weights (policy, brain, whatever you want to call it). But this is just impractical!

5. So, we condition only on the other agents’ experiences as recorded in the replay buffer.

6. For each agent, we create a new observation “O`” that combines its own observation with a fingerprint standing in for the other agents’ weights (Θ_a). In the paper, this fingerprint is low-dimensional, such as the training iteration number and the exploration rate.
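A sketch of that augmentation (the function name and values are my own, for illustration): the fingerprint is a small vector tracking where the other agents are in training, such as the training iteration number and the exploration rate epsilon, rather than their raw weights:

```python
def augment_observation(obs, train_iteration, epsilon):
    """Append a low-dimensional fingerprint of the other agents' policies
    (training iteration and exploration rate) to the agent's own observation."""
    return obs + [float(train_iteration), epsilon]

obs = [0.2, 0.7]   # the agent's own observation (made-up values)
obs_prime = augment_observation(obs, train_iteration=1000, epsilon=0.1)
```

Because `obs_prime` records when during training each experience was generated, the Q network can disambiguate old replay-buffer samples from recent ones, which is what makes replay usable despite non-stationarity.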

Thanks to Matthew McAteer and Tom Higgins for reviewing this blog. You can also follow me on Twitter for future posts.