Learning to Communicate with Deep Multi-Agent Reinforcement Learning. Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, Shimon Whiteson.
the learning process is centralized: agents' communication is unrestricted during learning
execution is decentralized: agents can only communicate over a discrete, limited-bandwidth channel
there are two approaches covered here: reinforced inter-agent learning (RIAL) and differentiable inter-agent learning (DIAL)
through these approaches, agents learn their own communication protocols.
(see elsewhere for deep-Q network [DQN] notes)
independent DQNs are used in multi-agent settings in which each agent behaves independently and learns its own Q-function (using a DQN). the reward an agent receives is shared across all agents. independent Q-learning can lead to convergence problems (as agents learn, they change the environment that every other agent experiences, so it is non-stationary from each agent's point of view), but in practice it often works well
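a minimal sketch of the independent-learner idea described above. to stay self-contained it swaps the DQNs for small Q-tables, and the one-step coordination game, the function name `independent_q_learning`, and all hyperparameters are assumptions made for illustration; none of them come from the paper.

```python
import random


def independent_q_learning(episodes=2000, alpha=0.1, epsilon=0.2, seed=0):
    """Two independent tabular Q-learners in a one-step coordination game.

    Hypothetical toy game (not from the paper): both agents receive a shared
    reward of 1 if they pick the same action, and 0 otherwise.
    """
    rng = random.Random(seed)
    n_actions = 2
    # One Q-table per agent; each agent treats the other as part of the environment.
    q = [[0.0] * n_actions for _ in range(2)]
    for _ in range(episodes):
        actions = []
        for agent in range(2):
            if rng.random() < epsilon:
                # Explore: pick a uniformly random action.
                actions.append(rng.randrange(n_actions))
            else:
                # Exploit: pick the action with the highest current estimate.
                actions.append(max(range(n_actions), key=lambda a: q[agent][a]))
        # Shared team reward: identical for both agents.
        reward = 1.0 if actions[0] == actions[1] else 0.0
        for agent, action in enumerate(actions):
            # One-step game, so the target is just the reward (no bootstrap term).
            q[agent][action] += alpha * (reward - q[agent][action])
    return q


if __name__ == "__main__":
    for agent, table in enumerate(independent_q_learning()):
        print(f"agent {agent}: {table}")
```

each agent's update uses only its own action and the shared reward, so the value it estimates for an action depends on what the other agent currently does. that dependence is the source of the non-stationarity mentioned above.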
DQNs and independent DQNs assume full observability (the agent receives the full state)
deep recurrent Q-networks (DRQN) are applicable to single-agent settings with partially observable environments (that is,
this paper explores partially observable environments in multi-agent settings. here, in addition to selecting an action
so here agents actually learn two
agents have no a priori communication protocol, so they must come up with one on their own. the question is: how do agents efficiently communicate what they know to other agents? agents need to be able to understand each other.
experience replay (typically used for DQNs) is disabled here because it is less effective in multi-agent situations (since agents change the environment so much that past experience memories may be invalidated)
RIAL can use independent parameters for each agent (i.e. each agent learns its own networks) or parameters shared across agents (they all learn the same networks). even in the latter case agents can still behave differently during execution, because they receive different observations and thus accumulate their own hidden state. learning is also much faster in the latter case since there are far fewer parameters.
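a small sketch of why shared parameters do not force identical behavior. the scalar "network", the weights in `SHARED`, and the helper names `step` and `greedy_actions` are all hypothetical; this is a stand-in for a recurrent Q-network, not the architecture from the paper.

```python
import math

# Hypothetical shared parameters: every agent uses this same dictionary.
SHARED = {"w_obs": 1.5, "w_hid": 0.5, "w_q": [1.0, -1.0]}


def step(params, hidden, obs):
    """One recurrent step: update the hidden state, then score two actions."""
    hidden = math.tanh(params["w_obs"] * obs + params["w_hid"] * hidden)
    q_values = [w * hidden for w in params["w_q"]]
    return hidden, q_values


def greedy_actions(params, observations):
    """Run one agent over its own observation sequence with the shared weights."""
    hidden, actions = 0.0, []
    for obs in observations:
        hidden, q_values = step(params, hidden, obs)
        actions.append(max(range(len(q_values)), key=lambda a: q_values[a]))
    return actions


if __name__ == "__main__":
    # Both agents see obs = 0.0 at the second step, but their hidden states differ.
    print(greedy_actions(SHARED, [1.0, 0.0]))   # agent A
    print(greedy_actions(SHARED, [-1.0, 0.0]))  # agent B
```

the two agents receive the same second observation, yet they choose different actions there because their hidden states carry different histories.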
another extension is to include an agent's index (i.e. id)
so in parameter sharing, the agents learn two
DIAL goes a step further than shared-parameter RIAL: gradients are pushed across agents. That is, during centralized learning, communication actions are replaced with direct connections between the output of one agent's network and the input of another's. In this way, agents can "communicate" real-valued messages to each other, and the receiving agent's error signal can flow back through the message into the sending agent's network.
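A minimal sketch of gradients crossing the agent boundary, assuming each agent is a single scalar weight and the loss is a squared error on the receiver's output. The names `dial_step` and `train_dial`, the target value, and the learning rate are illustrative assumptions, not the paper's setup.

```python
def dial_step(w_send, w_recv, obs, target, lr=0.05):
    """One gradient step on a two-agent chain: sender -> message -> receiver."""
    message = w_send * obs        # real-valued message, used during learning
    q_value = w_recv * message    # receiver's output given the message
    error = q_value - target
    loss = error ** 2
    # Chain rule: the receiver's error is pushed back through the message.
    grad_message = 2.0 * error * w_recv
    grad_w_recv = 2.0 * error * message
    grad_w_send = grad_message * obs
    return w_send - lr * grad_w_send, w_recv - lr * grad_w_recv, loss


def train_dial(steps=200, obs=1.0, target=1.0):
    """Train both agents jointly and record the loss at every step."""
    w_send, w_recv, losses = 0.5, 0.5, []
    for _ in range(steps):
        w_send, w_recv, loss = dial_step(w_send, w_recv, obs, target)
        losses.append(loss)
    return w_send, w_recv, losses


if __name__ == "__main__":
    w_send, w_recv, losses = train_dial()
    print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
```

The sender never sees the target directly: its weight improves only because `grad_message` carries the receiver's error back across the connection. With a discrete communication action in place of `message`, that gradient path would not exist.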
This aggregate network is called a C-Net. It outputs two types of values:
See also: Learning Multiagent Communication with Backpropagation (Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus)