Denis Steckelmacher, Ph.D. student at the Artificial Intelligence Lab, presented his research on Bootstrapped Dual Policy Iteration at our weekly research meeting last week:

Sample-Efficient Reinforcement Learning with Bootstrapped Dual Policy Iteration

In reinforcement learning, we not only want the agent to learn how to perform well in a given environment, but also to learn quickly, that is, using as few trials (and errors) as possible.

For instance, a robotic wheelchair that may collide with a wall or person while learning must learn as quickly as possible. Denis presents a new reinforcement learning algorithm that is extremely sample-efficient. Reinforcement learning algorithms can be divided into three families:

- Critic-only (learns how good every action is in every state, that is, a “score” for each action that is highest when the action is the best one)

- Actor-only (directly learns a function that maps a state to an action)

- Actor-critic (learns both an actor and a critic; a small sketch of these two roles follows this list).
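To make the distinction concrete, here is a minimal sketch of the two roles. The array shapes, names and tabular representation are illustrative assumptions, not Denis's implementation.

```python
import numpy as np

# Illustrative setup: a small environment with discrete states and actions.
N_STATES = 10
N_ACTIONS = 4

# A critic maps a state to one "score" (Q-value) per action; the agent
# then picks the action with the largest score.
Q = np.zeros((N_STATES, N_ACTIONS))

def critic_act(state):
    return int(np.argmax(Q[state]))

# An actor directly maps a state to a probability distribution over
# actions, from which an action is sampled.
pi = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)

def actor_act(state, rng=np.random.default_rng()):
    return int(rng.choice(N_ACTIONS, p=pi[state]))

# An actor-critic algorithm maintains both: the critic evaluates actions,
# and the actor is adjusted using the critic's scores.
```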

With discrete actions, that is, when the actions available to the agent can be numbered between 0 and N, critic-only algorithms outperform the other two families.
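Tabular Q-learning is the classic example of a critic-only algorithm for discrete actions; a minimal sketch is shown below. The environment interface (env.reset, env.step) and the hyperparameters are assumptions for illustration, not part of BDPI.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1,
               rng=np.random.default_rng(0)):
    # The critic: one score per (state, action) pair.
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state = env.reset()
        done = False

        while not done:
            # Epsilon-greedy exploration on the critic's scores.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            # Temporal-difference update of the critic.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state

    return Q
```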

Actor-only algorithms are extremely sample-inefficient, for reasons Denis will explain in the talk.

Actor-critic algorithms try to increase the sample-efficiency of the actor, but current ones largely fail to do so. BDPI is a new actor-critic algorithm that differs fundamentally from existing actor-critic algorithms.

Thanks to its use of “off-policy” critics, instead of the “on-policy” critics used by other actor-critic algorithms, the actor can learn very fast.
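A rough sketch of how an actor can be trained against off-policy critics is given below. The use of several tabular critics, the replay-buffer interface and the step sizes are illustrative assumptions, not the exact BDPI update rules from the paper.

```python
import numpy as np

N_STATES, N_ACTIONS, N_CRITICS = 10, 4, 8
rng = np.random.default_rng(0)

critics = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_CRITICS)]
actor = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)

def train_step(replay_buffer, alpha=0.2, gamma=0.99, actor_lr=0.05):
    # Each critic learns from experiences gathered by any past policy
    # ("off-policy"), so old experiences are never wasted.
    for Q in critics:
        for (s, a, r, s_next, done) in replay_buffer:
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])

    # The actor is moved a small step towards the greedy policy of one
    # randomly chosen critic, so no single critic's errors dominate.
    Q = critics[rng.integers(N_CRITICS)]
    for s in range(N_STATES):
        greedy = np.zeros(N_ACTIONS)
        greedy[np.argmax(Q[s])] = 1.0
        actor[s] = (1.0 - actor_lr) * actor[s] + actor_lr * greedy
```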

Moreover, our experiments show that removing the actor of BDPI, thus making it critic-only, lowers its sample-efficiency. BDPI is, therefore, the first actor-critic algorithm that outperforms critic-only algorithms, and whose actor provides a net benefit.