

Reinforcement learning (RL) has already been shown to be a powerful tool for solving single-agent Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximises a possibly delayed reward signal in an initially unknown, stochastic, stationary environment. However, when multiple agents are present in the environment and influence each other, the convergence guarantees of RL no longer hold, since each agent now experiences a non-stationary environment.
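To make the single-agent setting concrete, the following is a minimal sketch of a tabular Q-learning update, the standard RL algorithm for MDPs. The states, actions, and parameter values are illustrative assumptions, not taken from the text.

```python
# Minimal tabular Q-learning sketch for a single-agent MDP.
# alpha (learning rate) and gamma (discount factor) are illustrative values.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update to the Q-table after
    taking action a in state s, receiving reward r, and landing in s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# Tiny usage example: two states, two actions, all Q-values start at zero.
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
```

Under a stationary environment and suitable exploration, repeated updates of this form converge to the optimal Q-values; with multiple learning agents, the transition and reward statistics behind `r` and `s_next` shift over time, which is exactly why the guarantee breaks down.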

A straightforward approach to deal with this issue is to provide the agents with sufficient information to make the environment they experience stationary. Generally, this means allowing them to observe the state information and selected actions of all the agents in the environment. This becomes intractable very quickly, since both the state space and the action space in which the agents now learn are typically exponential in the number of agents. As such, this approach is unsuitable for all but the smallest environments with only a few agents present.
Below are the sizes of the joint state and action spaces as a function of the number of agents, for the environment shown on the left:

             joint states   joint actions   total joint-state-joint-action space
1 agent             25             4                        100
2 agents           625            16                     10.000
3 agents        15.625            64                  1.000.000
4 agents       390.625           256                100.000.000
5 agents     9.765.625         1.024             10.000.000.000
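The table values follow directly from the size of the local spaces: with 25 local states (a 5x5 grid) and 4 actions per agent, the joint spaces are simply the local sizes raised to the number of agents. A short sketch reproducing the figures:

```python
# Joint-space sizes for n agents in a gridworld with 25 local states
# (a 5x5 grid) and 4 local actions. Both spaces grow exponentially in
# the number of agents, as in the table above.

LOCAL_STATES = 25   # positions on a 5x5 grid
LOCAL_ACTIONS = 4   # e.g. up, down, left, right

def joint_sizes(n_agents):
    """Return (joint states, joint actions, total state-action pairs)."""
    states = LOCAL_STATES ** n_agents
    actions = LOCAL_ACTIONS ** n_agents
    return states, actions, states * actions

for n in range(1, 6):
    states, actions, total = joint_sizes(n)
    print(f"{n} agents: {states} joint states, "
          f"{actions} joint actions, {total} pairs")
```

Each additional agent multiplies the total state-action space by 100, which is why learning in the full joint space is only feasible for a handful of agents.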

The main intuition behind our approach is that agents should not always observe each other, but only when it is necessary. In the gridworld example shown above, this is when agents are close to each other and there is a danger of collision. What we want to obtain is a set of system states, or better yet a more abstract representation of those situations, in which coordination is necessary and agents can use a more global view of the system. In all other situations, agents should learn in a compact state space containing only local state information.
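The idea above can be sketched as a state-observation function that returns the compact local state by default and expands it with other agents' positions only when a collision is possible. The Manhattan-distance test and the danger radius are illustrative assumptions; the text does not commit to a particular criterion for when coordination is necessary.

```python
# Sketch of sparse coordination: an agent's observed state is local by
# default and is augmented with nearby agents' positions only when they
# are within a (hypothetical) danger radius.

def manhattan(p, q):
    """Grid distance between two (row, col) positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def observed_state(own_pos, other_positions, danger_radius=1):
    """Return the state the agent learns over.

    Far from everyone else, this is just the agent's own position, so
    the effective state space stays small; near another agent, the
    state is expanded to include the positions that matter.
    """
    nearby = tuple(sorted(p for p in other_positions
                          if manhattan(own_pos, p) <= danger_radius))
    if nearby:
        return ("coordinate", own_pos, nearby)  # expanded, more global view
    return ("local", own_pos)                   # compact local view
```

With this scheme, the number of expanded states grows only with the number of genuinely dangerous situations, not with the full product of all agents' positions.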