Denis Steckelmacher’s research on Bootstrapped Dual Policy Iteration

Denis Steckelmacher, Ph.D. student at the Artificial Intelligence Lab, presented his research on Bootstrapped Dual Policy Iteration at our weekly research meeting last week:

Sample-Efficient Reinforcement Learning with Bootstrapped Dual Policy Iteration

In reinforcement learning, we not only want the agent to learn how to perform well in a given environment, but also to learn quickly, that is, using as few trials (and errors) as possible.

For instance, a robotic wheelchair that may collide with a wall or a person while learning must learn as quickly as possible. Denis presents a new reinforcement learning algorithm that is extremely sample-efficient. Reinforcement learning algorithms can be divided into three families (a small illustrative sketch follows the list below):

- Critic-only: learns how good every action is in every state, that is, "scores" for actions that are highest when the action is best.
- Actor-only: directly learns a function that maps a state to an action.
- Actor-critic: learns both an actor and a critic.
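To make the three families concrete, here is a rough, purely illustrative sketch (not code from the talk or the paper) for a small discrete problem; the table sizes and helper functions are hypothetical:

```python
import numpy as np

n_states, n_actions = 10, 4   # hypothetical discrete problem sizes

# Critic-only: a table of "scores" Q[s, a]; the agent acts greedily on it.
Q = np.zeros((n_states, n_actions))

def critic_only_act(state):
    return int(np.argmax(Q[state]))          # pick the highest-scoring action

# Actor-only: a policy table pi[s, a] mapping each state to action probabilities.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def actor_only_act(state, rng=np.random.default_rng()):
    return int(rng.choice(n_actions, p=pi[state]))   # sample from the learned policy

# Actor-critic: keeps both; the critic's scores are used to improve the actor.
```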

With discrete actions, that is, when the actions available to the agent can be numbered between 0 and N, critic-only algorithms outperform the other two families.

Critic-only algorithms are extremely sample-efficient, for reasons Denis will explain in the talk.

Actor-critic algorithms try to increase the sample-efficiency of the actor, but currently fail to do so. BDPI is a new actor-critic algorithm that fundamentally differs from existing actor-critic algorithms.

Thanks to its use of "off-policy" critics, instead of the "on-policy" critics used by most other actor-critic algorithms, BDPI's actor can learn very quickly.
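The exact actor and critic updates are described in Denis's paper; purely as a hedged illustration of the idea that off-policy critics, trained from a replay buffer independently of the actor's behaviour, can pull the actor towards their greedy actions, a simplified sketch could look like the following. The single critic, the learning rates, and the batch format are assumptions for illustration, not the precise BDPI update:

```python
import numpy as np

n_states, n_actions = 10, 4
gamma, alpha, lam = 0.99, 0.2, 0.1   # discount, critic and actor learning rates (illustrative)

Q = np.zeros((n_states, n_actions))                    # one off-policy critic (BDPI uses several)
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # the actor

def update(batch):
    """batch: list of (state, action, reward, next_state) transitions from a replay buffer."""
    for s, a, r, s2 in batch:
        # Off-policy critic update (Q-learning style): independent of how the actions were chosen.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

        # Actor update: move the policy a small step towards the critic's greedy action.
        greedy = np.zeros(n_actions)
        greedy[np.argmax(Q[s])] = 1.0
        pi[s] = (1 - lam) * pi[s] + lam * greedy
```

Because the critic is learned off-policy, it can keep improving from old experience in the buffer, which is what lets the actor converge with far fewer environment interactions than on-policy updates would allow.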

Moreover, our experiments show that removing the actor from BDPI, thus making it critic-only, lowers its sample-efficiency. BDPI is, therefore, the first actor-critic algorithm that outperforms critic-only algorithms, and whose actor provides a net benefit.
