Knowledge Transfer in Deep Reinforcement Learning

Reinforcement learning is a branch of machine learning in which an agent learns by interacting with an environment. The reinforcement learning algorithm selects an action, executes it, and may receive a reward. It can then adapt the way it selects actions in order to obtain a higher reward, which may only be received later on.

Determining which action to take can be done using an artificial neural network. Such a network is inspired by the human brain and consists of interconnected elements called units. The network receives its input as a vector of numerical values, which are propagated through layers of units to one or more output units.

In the context of reinforcement learning, the input units can encode the current state of the environment, and the output units can represent the action to take. Learning then involves changing the strength of the connections between units, thereby influencing the output values.
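As a minimal sketch of this idea (an assumed toy architecture, not the exact network used in this project), a one-hidden-layer network can map a state vector to a probability for each action:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def policy_forward(state, W1, W2):
    """Propagate a state through hidden units to action probabilities."""
    hidden = np.tanh(W1 @ state)  # hidden-layer activations
    return softmax(W2 @ hidden)   # one probability per action

rng = np.random.default_rng(0)
state = rng.standard_normal(4)    # e.g. a 4-dimensional cart-pole state
W1 = rng.standard_normal((8, 4))  # connection strengths, adjusted by learning
W2 = rng.standard_normal((2, 8))
probs = policy_forward(state, W1, W2)
# probs sums to 1; the agent can sample an action from it
```

Learning amounts to adjusting `W1` and `W2` so that actions leading to higher rewards become more probable.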

For high-dimensional inputs, different kinds of artificial neural networks must be used. Techniques involving these networks are called deep learning methods. For example, an image can be used as input to the network; it typically consists of several thousand pixels, each with a certain color. Convolutional neural networks can detect patterns in such images using several layers of filters.
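To illustrate what a single convolutional filter does (a toy example of our own; real networks stack many filters and layers), a small kernel slides over the image and produces a feature map of pattern responses:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; each output is a local dot product."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((5, 5))
image[:, 2] = 1.0                   # a vertical edge in the image
edge_kernel = np.array([[1., -1.],  # responds to horizontal intensity change
                        [1., -1.]])
fmap = conv2d(image, edge_kernel)   # strong responses at the edge location
```

The filter weights are what the convolutional network learns, so that useful patterns (edges, shapes, objects) produce strong responses.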

Another deep learning network is the recurrent neural network, which is able to process a sequence of data such as a video or text.

By combining deep learning and reinforcement learning, it is possible to learn in an environment that has a high-dimensional input space. This is called deep reinforcement learning. For example, the agent can learn how to play Pong using only an image of the screen, just like a human.

An environment is roughly defined by the possible states it can be in, the actions that can be taken, and the state one ends up in when taking an action in a certain state. These can, however, be changed to make the environment easier or harder to learn. In an environment where a self-driving car must be controlled, for example, the number of obstacles or the weather conditions may vary.

Although these changes may require different capabilities of the agent, some knowledge may still be useful. It can thus be beneficial for the agent to transfer the knowledge already learned in the initial situation, called the source task, to the agent learning in the new situation, called the target task. This field is called transfer learning.

One use of this is in cases where it is too expensive or time-consuming to learn in the real world. Instead, one can first learn in a simulation and then transfer the knowledge to use and fine-tune it in the real world, saving time and money. It is necessary to know from which source tasks to transfer knowledge and which knowledge to transfer. For this, we need to know how the tasks are related and possibly how an agent can interpret and act using the new state space and action space. What we expect to see using transfer is a jumpstart in performance compared to agents not using knowledge transfer. This is shown in the following graph.


Proposed approach

In this project, we investigate the use of transfer learning in reinforcement learning using artificial neural networks. Our aim is to learn from multiple source tasks using shared knowledge in order to achieve better performance on a target task than when not training on source tasks. While a version of the algorithm was implemented where the source tasks are executed sequentially for each episode, the focus is on a version where source tasks are learned simultaneously.

Specifically, we combine the A3C algorithm (Mnih et al., 2016) with the transfer learning algorithm of Isele and Eaton (2016). This is done by executing A3C on tasks defined by different environment parameters instead of identical ones, and training them using both shared knowledge and knowledge that is specific to each task.
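A hedged sketch of how shared and task-specific knowledge can combine, loosely following the dictionary-style factorization of Isele and Eaton (2016); the variable names and dimensions here are illustrative, not the project's actual code. Each task's parameters are composed from a shared knowledge base and a sparse task-specific coefficient vector:

```python
import numpy as np

d, k, n_tasks = 6, 3, 4            # parameter dim, base size, number of tasks
rng = np.random.default_rng(1)
L = rng.standard_normal((d, k))    # knowledge base shared by all tasks
S = np.zeros((k, n_tasks))         # sparse task-specific representations
S[0, 0] = 1.0                      # task 0 uses only the first base component
S[:2, 1] = [0.5, 0.5]              # task 1 mixes the first two components

theta_task0 = L @ S[:, 0]          # parameters actually used by task 0
# Training updates both L (shared across tasks) and S[:, t] (per task).
# Transfer to a target task reuses L and only fits a new sparse vector.
```

The appeal of this design is that common structure across tasks accumulates in `L`, while each task only needs a small sparse vector of its own.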

The source tasks can be learned in two ways: either sequentially or in parallel. In the sequential approach, we collect trajectories and compute the gradients for each task one after another.

In the parallel approach, all source tasks are learned at the same time, each performing a certain number of updates, with trajectories collected continuously for each task. After the source tasks have been learned, the knowledge base that they jointly learned is transferred to the target task. This task then learns separately, performing updates for a number of episodes.
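The two schedules can be contrasted with a small sketch (our own illustration with stand-in tasks, not the project's implementation); the parallel version mirrors the A3C pattern of one worker thread per task:

```python
import threading

class ToyTask:
    """Stand-in for a source task; just counts how often it is updated."""
    def __init__(self, name):
        self.name = name
        self.updates = 0
    def collect_and_update(self):
        self.updates += 1  # collect a trajectory, then apply its gradient

def train_sequential(tasks, episodes):
    """Each episode visits every task in turn, one after another."""
    for _ in range(episodes):
        for task in tasks:
            task.collect_and_update()

def train_parallel(tasks, updates_per_task):
    """Each task runs in its own worker thread, A3C-style."""
    def worker(task):
        for _ in range(updates_per_task):
            task.collect_and_update()
    threads = [threading.Thread(target=worker, args=(t,)) for t in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

seq_tasks = [ToyTask(f"task{i}") for i in range(3)]
train_sequential(seq_tasks, episodes=4)

par_tasks = [ToyTask(f"task{i}") for i in range(3)]
train_parallel(par_tasks, updates_per_task=4)
```

In a real implementation the workers would update a shared model asynchronously; here each toy task only tracks its own update count.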


Experimental Design

Experiments were executed on variations of the cart-pole and acrobot environments.

Several experiments were executed with our transfer learning algorithm, using varying types of transferred knowledge, different artificial neural networks, and different numbers of source tasks.

For an experiment with our algorithm, a number of environments are first generated randomly. We learn with either 5 or 10 environments, and thus source tasks. The environment for each task can differ in a predefined number of task-specific parameters, each of whose values lies within a certain range. For a cart-pole task, for example, these are the mass of the cart, the mass of the pole, and the length of the pole.
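Generating the randomized source tasks could look as follows; the parameter ranges shown are illustrative guesses for cart-pole, not the exact values used in the experiments:

```python
import random

# Illustrative parameter ranges (assumed, not the experiments' actual values).
CARTPOLE_RANGES = {
    "cart_mass": (0.5, 2.0),
    "pole_mass": (0.05, 0.5),
    "pole_length": (0.25, 1.0),
}

def sample_environment(ranges, rng):
    """Draw each environment parameter uniformly from its allowed range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

rng = random.Random(42)
source_tasks = [sample_environment(CARTPOLE_RANGES, rng) for _ in range(5)]
# Each dict defines one source-task environment; 5 or 10 are generated.
```

Each sampled dictionary then parameterizes one source-task environment on which the agent trains before transfer.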


Parallel and sequential knowledge transfer

We first compare the performance of the algorithm that learns the source tasks in parallel with the one that learns them sequentially. We see that, on average, the parallel version converges faster than the sequential version.



We then explore the differences between learning directly on a target task and first learning on source tasks. We consider 5 and 10 source tasks and compare our algorithm to REINFORCE.

We can see that using 5 or 10 tasks results in a jumpstart, which REINFORCE lacks since it is not trained on source tasks. There is also a marginal difference between 10 and 5 tasks, with 10 being better.



Performing the same experiment in the acrobot environment, we observe the same results as before.

REINFORCE using a source and target task

We now compare REINFORCE, after it has trained on a single source task, with our proposed algorithm. We compare both in the cart-pole and acrobot environments, as before.

In both cases we can see that training on multiple source tasks, as proposed by our algorithm, benefits the performance of the agent.



Conclusion

We presented an algorithm suitable for learning in parallel or sequentially on a set of source tasks. These tasks share a knowledge base, but each also has its own sparse representation. The learned knowledge can then be transferred to the target task, with the goal of achieving better performance than an algorithm that does not use prior knowledge.

Our algorithm achieves better performance on the target task than applying the REINFORCE algorithm to it directly: it learns faster and obtains higher rewards. Performance is even better when we also transfer the sparse representation of a randomly chosen source task to the target task; the algorithm then only needs to tune this sparse representation to work on its own task.

To see whether multiple source tasks are really necessary, we compared our algorithm with the REINFORCE algorithm when the latter learns on a single source task and transfers all its knowledge to the target task. Although the asymptotic performance was similar, our algorithm learned better on the source tasks and had a higher jumpstart performance.

We can conclude that it is beneficial to learn on multiple source tasks in parallel and to transfer the knowledge learned on these tasks to the target task.


For more information read the thesis here.

Involved members:
Arno Moonens
Peter Vrancx (previously)
Kyriakos Efthymiadis