_{An intrinsically motivated agent}

_{How is it possible to discover what can be controlled from images ?}

This blog post is accompanied by a Colab notebook.

Despite recent breakthroughs in artificial intelligence, machine learning agents remain limited to tasks predefined by human engineers. The autonomous and simultaneous discovery and learning of many tasks in an open world remains very challenging for reinforcement learning algorithms. In this blog post we explore recent advances in developmental learning to tackle the problems of autonomous exploration and learning.

Consider a robot like the one depicted on the first picture. In this environment it can do many things: it can move its arms around, use its arms to play with the joysticks, move the ball in the arena using the joysticks. Imagine that we want to teach this robot how to move the ball to various locations. We could craft a reward function that rewards the agent for putting the ball at a given location, and launch our favorite deep RL algorithm. Without going into details, this popular approach has several drawbacks:

- the algorithm would require a lot of trials before sampling an action which might move the ball
- the robot would only learn how to move the ball, but not how to move its arms to many locations, and even less how to move other objects that are unrelated to the ball
- we would need to specifically craft a reward for this task (this may be hard in itself (Christiano et al.))

Now imagine that we want the agent to learn all these tasks, i.e. learn to control various objects, without any supervision or reward. One strategy inspired by infants' development that was shown to be efficient in this case consists in modeling the robot as a curiosity-driven agent that wants to explore the world, by autonomously generating and selecting goals that provide maximal learning progress (Forestier et al.). Concretely, the robot sets for itself goals that it then tries to achieve, in an episodic fashion. For example, one goal could be to put its arm at a specific place, to achieve a specific trajectory, or to try to move the ball to a certain location. Using this strategy, the robot will soon realize that some goals are easier to reach than others, focusing on them and progressively shifting to more and more complex goals and associated policies. At the same time, it will also avoid spending too much time on goals that are either trivial or impossible to learn (e.g. distractor objects that move independently of the actions of the robot).

This idealized situation is fine, but what if we want our robot to learn all these skills using only raw pixels from a camera? What would a goal look like in this case? The robot could sample goals uniformly in the pixel space. This is clearly a poor strategy, as it amounts to sampling noise, which is by definition not reproducible. The robot could also sample images from a database of observed situations and try to reproduce them. It could then compare the results of its actions with the goals. However, computing distances in the pixel space is a bad idea, as noise and changes in the scene (due to distractors, for example) could put large distances between perceptually equivalent scenes.

From our perspective, we know that the world is structured and made of independent entities with distinct properties. There are far fewer entities than pixels in an image, so it makes more sense to set goals for the entities rather than for the pixels that represent them. As humans, we are very good at detecting these entities in an image, and that is what allows us to be efficient even in an unseen environment.

Coming back to our robot, is it possible for it to discover and learn to represent the entities in the environment from raw images? Can the robot use them to set goals that it can try to achieve? Will this lead to an efficient exploration of the environment? Can it discriminate between entities that can be controlled and those that cannot?

Those are the questions that we explored in two papers (Péré et al., ICLR 2018 and Laversanne-Finot et al., CoRL 2018). In particular, we show that:

- It is possible to leverage tools from the representation learning literature in order to extract features that can serve as goals for intrinsically motivated goal exploration algorithms.
- Using a representation of the environment as a goal space can provide performances as good as engineered features for exploration algorithms.
- Using disentangled representation is beneficial for exploration algorithms in the presence of distractors: using a disentangled representation as a goal space allows the agent to explore its environment more widely in a shorter amount of time.
- Curiosity-driven exploration makes it possible to extract high-level controllable features of the environment when the representation is disentangled.
## Environments

The experiments that we describe have been performed on variants of the Arm-Ball environment. In this environment, a 7-joint robotic arm evolves in a scene containing a ball that can be grasped and moved around by the robotic arm. The agent perceives the scene as a 64 \times 64 pixel image. Simple as it may be, this environment is challenging since the action space is highly redundant: random motor commands will most of the time produce the same dynamic, the arm moving around and the ball staying in the same position. Here we consider two variants of this environment: one where there is only the ball, and one with an additional distractor: a ball that cannot be controlled and moves randomly across the scene. Examples of motor commands performed on these environments are presented in the figure above.

## Intrinsically Motivated Goal Exploration Processes (IMGEPs)

A good exploration strategy for the agent when there is no reward signal is to set goals for itself and to try to reach them. This strategy, known as Intrinsically Motivated Goal Exploration Processes (IMGEPs) (Forestier et al., Baranes et al.), is summarized in the figure above. For example, in this context, a goal could consist in trying to put the ball at a specific position (more generally, in the IMGEP framework, goals can be any target dynamical properties over entire trajectories). An important aspect of this approach is that the agent needs a goal space in which to sample those goals.

Up to now, the IMGEP approach has only been applied in experiments where we have access to hand-designed representations of the state of the system. Now, consider a problem where a robot has to move an object using the raw images that it gets from a camera. The images naturally live in a high-dimensional space. However, we know that the underlying state is low-dimensional (the number of degrees of freedom of the object).
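Stepping back, the generic IMGEP loop can be sketched as a short program. This is a toy, self-contained illustration: the one-dimensional `environment`, the nearest-neighbor `meta_policy` and all names are our own stand-ins, not the implementation used in the papers.

```python
import random

# A toy sketch of an IMGEP episodic loop. The "environment" maps a 2-D motor
# command to a 1-D outcome (redundant, like the arm/ball setup above).

def environment(params):
    # Outcomes are clipped to [0, 1]: many commands give the same outcome.
    return max(0.0, min(1.0, params[0] + params[1]))

def meta_policy(goal, memory):
    # Nearest-neighbor reuse plus a random perturbation, a common IMGEP choice.
    if not memory:
        return [random.uniform(-1, 1), random.uniform(-1, 1)]
    _, best_params, _ = min(memory, key=lambda m: abs(m[2] - goal))
    return [p + random.gauss(0, 0.05) for p in best_params]

def run_imgep(n_episodes=200, seed=0):
    random.seed(seed)
    memory = []  # (goal, parameters, outcome) triplets
    for _ in range(n_episodes):
        goal = random.uniform(0, 1)            # self-generated goal
        params = meta_policy(goal, memory)     # motor command expected to reach it
        outcome = environment(params)          # run one episode
        memory.append((goal, params, outcome))
    return memory

memory = run_imgep()
```

A real system replaces the toy environment with robot rollouts and the uniform goal sampler with a goal space, hand-designed or learned.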

In this case, a natural idea is to learn a low-dimensional state representation. Having a state representation is advantageous in many ways (Lesort et al.): it helps overcome the curse of dimensionality, it is easier to understand and interpret from a human point of view, and it might improve performance and learning speed in machine learning scenarios. Another advantage is that a policy learned on a representation is often more robust to changes in the environment. For example, consider a typical transfer learning scenario where the relevant parameters of the problem are kept fixed (e.g. shape and size of the object) but some irrelevant parameters may have changed (e.g. the color of the object that must be grasped by the robot): a policy learned on the pixel space is bound to fail when transferred, whereas the representation may still capture the relevant parameters.

In a first paper (Péré et al.), we proposed to learn a representation of the scene using various unsupervised learning algorithms, such as Variational Auto-Encoders. The general idea consists in letting the agent observe another agent acting on the environment (enabling it to observe a distribution of possible outcomes in that environment) and learn a compressed representation of these outcomes, called a latent space. The learned latent space can then be used as a goal space. In this case, instead of sampling as a goal the position of the ball at the end of the episode, the goal consists in reaching a certain point in the latent space (i.e. obtaining an observation at the end of the episode whose representation is as close as possible to the goal in the latent space). In this paper, it was shown that it is possible to use a wide range of representation algorithms to learn the goal space. Most of these algorithms perform almost as well as a true state representation. For instance, the figure above shows that without any form of supervision or reward signal, the agent is capable of learning how to place the ball in many distinct locations. By contrast, when the agent performs random motor commands (RPE), the diversity of outcomes is much smaller.

## Modular IMGEPs

The results published in the first paper were obtained in environments always containing a single object. However, many environments contain more than one object. These objects can be very different and can be controlled with varying degrees of difficulty (e.g. moving a small object that is hard to pick up vs. moving a big ball across the environment). It can also happen that one object must be mastered in order to use another (e.g. using a fork to eat something). There can even be objects that are uncontrollable (e.g. moving randomly). As a result, it seems natural to separate the exploration of different categories of objects. The intuitive idea is that an algorithm should start by controlling easy-to-learn objects before moving to more complex ones. It should also ignore objects that cannot be controlled (distractors). This is precisely what modular IMGEPs were designed for. The idea is that instead of sampling goals globally (i.e. target values for all dimensions characterizing the world, including all objects), the algorithm samples goals only as target values for particular dimensions of particular objects. For example, in the previously considered experiment, the agent could decide to set a goal for the position of the joystick or for the position of the ball. By monitoring how well it performs for each task (the progress), the agent would discover that the ball is much harder to control than the joystick, since it is necessary to master the joystick before moving the ball. By focusing on tasks (i.e. sampling goals for specific modules) for which it has a large learning progress, the agent will always set for itself goals of adequate difficulty. This approach leads to the formation of an automatic curriculum.

Ideally, in the case of goal spaces learned with a representation algorithm, if the representation is disentangled, then each latent variable corresponds to one factor of variation (Bengio). It is thus natural to see one latent variable, or a group of them, as an independent module in which the agent can set and explore goals. If the disentanglement properties of the representation are good, then it should in principle lead the agent to discover, through the representation, which objects can and cannot be controlled. On the contrary, using an entangled representation will introduce spurious correlations between the actions of the agent and the outcomes, which in turn will lead the agent to sample more frequently actions that in fact did not have any impact on the outcome.
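One simple way to implement progress-based module choice is sketched below. This is an assumption-laden toy: real systems use more elaborate interest measures, but the principle of sampling modules in proportion to their absolute learning progress is the same.

```python
import random

# Each module keeps a history of competences (how close outcomes were to its
# goals). Modules are sampled proportionally to their learning progress.

def learning_progress(competences, window=10):
    # Absolute difference between recent and older average competence.
    if len(competences) < 2 * window:
        return 1.0  # optimistic init: unexplored modules look interesting
    recent = sum(competences[-window:]) / window
    older = sum(competences[-2 * window:-window]) / window
    return abs(recent - older)

def sample_module(histories):
    # Roulette-wheel sampling proportional to each module's progress.
    progress = [learning_progress(h) for h in histories]
    total = sum(progress)
    if total == 0:
        return random.randrange(len(histories))
    r = random.uniform(0, total)
    acc = 0.0
    for i, p in enumerate(progress):
        acc += p
        if r <= acc:
            return i
    return len(progress) - 1
```

With such a rule, a module tracking a distractor shows zero progress and is quickly abandoned, while a module whose competence is still improving keeps being sampled.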

Following this idea, in a second paper (Laversanne-Finot et al.), we adopted the architecture in the above picture. The architecture is composed of a representation algorithm (in our case a VAE/\beta-VAE (Higgins et al.)) which learns a representation of the world. Using this representation, we define modules by grouping some of the latent variables together. For example, a module could be made of the first and second latent variables. A goal for this module would be to reach a position where the first and second latent variables have certain values. The idea behind this definition of modules is that if the modules are made of latent variables encoding independent degrees of freedom/objects, then the algorithm should be able, by monitoring the progress, to understand which latent variables can or cannot be controlled. In other words, it will discover independently controllable features of the world.
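Concretely, with an encoder mapping images to, say, a 10-dimensional latent vector, modules can be plain index groups over latent dimensions. The sketch below is illustrative only: the linear `encode` is a stand-in for a trained VAE encoder, and all names are our own.

```python
import numpy as np

# Modules as groups of latent dimensions. `encode` stands in for a trained
# VAE encoder mapping a 64x64 image to a 10-D latent vector.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64 * 64)) / 64.0  # placeholder "learned" weights

def encode(image):
    return W @ image.ravel()

# Five modules of two latent variables each: module 0 = latents (0, 1), etc.
modules = [(2 * i, 2 * i + 1) for i in range(5)]

def sample_module_goal(module):
    # A goal fixes target values only for this module's latent dimensions.
    return {dim: rng.normal() for dim in module}

def goal_distance(obs_image, goal):
    # Achievement is measured only on the module's dimensions, in latent space.
    z = encode(obs_image)
    return float(np.sqrt(sum((z[d] - v) ** 2 for d, v in goal.items())))

goal = sample_module_goal(modules[0])
d = goal_distance(rng.random((64, 64)), goal)
```

If the representation is disentangled, each module then corresponds to a coherent degree of freedom (arm, ball, distractor), which is what lets progress monitoring separate controllable from uncontrollable features.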

_{VAE, 5 modules}

_{βVAE, 5 modules}

_{βVAE, 10 modules}

This is illustrated in the figure above. For example, when the goal space is disentangled and the modules are defined by groups of two latent variables, we see that the interest of the agent is high only for the module encoding the ball position. On the other hand, when the representation is entangled, all the latent variables encode both the ball and distractor positions, and thus the interest is low for all latent variables. Similar results are obtained if we define modules made of only one latent variable: when the goal space is disentangled, the interest is high only for modules which encode the ball position, whereas when the representation is entangled, all the modules have similar interest. A high interest is thus a marker that a latent variable is an independently controllable feature of the environment.

The fact that the algorithm is capable of extracting the controllable features of the environment is reflected in its exploration performance. As seen in the figure below, modular goal exploration (MGE) algorithms with disentangled representations (\beta-VAE) explore much more than their entangled (VAE) counterparts, with performances similar to modular goal exploration with engineered features (EFR) (x and y positions of the ball and the distractor). We also see that in the presence of a distractor, the performance of the flat architecture (RGE) is negatively impacted.

## Future work

In this series of works we studied how handcrafted goal spaces can be replaced by embeddings learnt from raw observations of images in IMGEPs. We have shown that, while entangled representations are a good baseline as goal spaces for IMGEPs, when the representation possesses good disentanglement properties, they can be leveraged by a curiosity-driven modular goal exploration architecture and lead to highly efficient exploration. In particular, this enables exploration performances as good as when using engineered features. In addition, the monitoring of learning progress enables the agent to discover which latent features can be controlled by its actions, and focus its exploration by setting goals in their corresponding subspace. This allows the agent to learn which are the controllable features of the environment.

An interesting line of work, beyond using learning progress to discover controllable features during exploration, would be to re-use this knowledge to acquire more abstract representations and skills. For example, once we know which latent variables can be controlled, we can use an RL algorithm to learn to use them to acquire a specific skill in that environment.

Another interesting perspective would be to apply the ideas developed in these papers to real-world robotic experiments. We are currently working on such a project. The setup is very similar to the one presented throughout this blog post (see first picture): a robot can play with two joysticks. These two joysticks control the position of a robotic arm that can move a ball inside an arena. Currently, the positions of the ball and of the arm are extracted from the images using handcrafted features. Modular IMGEPs using those extracted features have been shown to be very efficient for exploration in this setup (Forestier et al.). The focus of our work is to replace this part with a learned embedding that serves as a goal space.

Of course, our approach is not the only possible one, and the ideas developed in these papers may be applicable in other domains. In fact, similar ideas have been experimented with in the context of deep reinforcement learning. For example, it was suggested (Nair et al.) to train the RL algorithm in the embedding space obtained after training a Variational Auto-Encoder (VAE) on images of the scene. Using this approach, it was shown that a robot can learn how to manipulate a simple object across a plane. However, this paper did not study how the algorithm would perform in the presence of a distractor (an object that cannot be controlled by the robot but can move across the scene). In this case, it is not clear that the RL algorithm would succeed, since the embedding for two similar positions of the ball can vary wildly due to the distractor. See also (Bengio et al.) for another approach to discovering independently controllable features.

## Code and notebook

## References

- Curiosity Driven Exploration of Learned Disentangled Goal Spaces, Laversanne-Finot, A., Péré, A., & Oudeyer, P. Y., CoRL, 2018.
- Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration, Alexandre Péré, Sébastien Forestier, Olivier Sigaud, Pierre-Yves Oudeyer, ICLR, 2018.
- Hindsight Experience Replay, Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba.
- Deep reinforcement learning from human preferences, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei.
- Early Visual Concept Learning with Unsupervised Deep Learning, Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, Alexander Lerchner.
- Visual Reinforcement Learning with Imagined Goals, Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine.
- Learning Deep Architectures for AI, Yoshua Bengio.
- State Representation Learning for Control: An Overview, Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, David Filliat.
- Independently Controllable Features, Emmanuel Bengio, Valentin Thomas, Joelle Pineau, Doina Precup, Yoshua Bengio.
- Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning, Sébastien Forestier, Yoan Mollard, Pierre-Yves Oudeyer.
- Active learning of inverse models with intrinsically motivated goal exploration in robots, Adrien Baranes, Pierre-Yves Oudeyer, Robotics and Autonomous Systems, 2013.
## Contact

Email: adrien.laversanne-finot@inria.fr

Twitter of Flowers lab: @flowersINRIA


Reproducibility in machine learning, and in deep reinforcement learning in particular, has become a serious issue in recent years. Reproducing an RL paper can turn out to be much more complicated than you might think; see this blog post about lessons learned from reproducing a deep RL paper. Indeed, codebases are not always released, and scientific papers often omit parts of the implementation tricks. Recently, Henderson et al. conducted a thorough investigation of various parameters causing this reproducibility crisis [Henderson et al., 2017]. They used trendy deep RL algorithms such as DDPG, ACKTR, TRPO and PPO with popular OpenAI Gym benchmarks such as Half-Cheetah, Hopper and Swimmer to study the effects of the codebase, the size of the networks, the activation function, the reward scaling and the random seeds. Among other results, they showed that different implementations of the same algorithm with the same set of hyperparameters led to drastically different results.

Perhaps the most surprising thing is this: running the same algorithm 10 times with the same hyper-parameters using 10 different random seeds and averaging performance over two splits of 5 seeds can lead to learning curves seemingly coming from different statistical distributions. Then, they present this table:

_{ Figure 1: Number of trials reported during evaluation in various works, from [Henderson et al., 2017].}

This table shows that all the deep RL papers reviewed by Henderson et al. use fewer than 5 seeds. Even worse, some papers actually report the average of the best performing runs! As demonstrated in Henderson et al., these methodologies can lead to claiming that two algorithms' performances are different when they are not. A solution to this problem is to use more random seeds, averaging more trials to obtain a more robust measure of your algorithm's performance. OK, but how many more? Should I use 10, or 100 as in [Mania et al., 2018]? The answer is, of course,

it depends. If you read this blog, you must be in the following situation: you want to compare the performance of two algorithms to determine which one performs best in a given environment. Unfortunately, two runs of the same algorithm often yield different measures of performance. This might be due to various factors, such as the seed of the random generators (called random seed or seed hereafter), the initial conditions of the agent, the stochasticity of the environment, etc.

Part of the statistical procedures described in this article are available on GitHub here. The article is available on arXiv here.

## Definition of the statistical problem

The performance of an algorithm can be modeled as a random variable X, and running this algorithm in an environment results in a realization x. Repeating the procedure N times, you obtain a statistical sample x=(x^1, .., x^N). A random variable is usually characterized by its mean \mu and its standard deviation, noted \sigma. Of course, you do not know the values of \mu and \sigma. The only thing you can do is compute their estimations \overline{x} and s:

\large \overline{x} \mathrel{\hat=} \frac{1}{N}\sum\limits_{i=1}^{N}{x^i}, \hspace{1cm} s \mathrel{\hat=}\sqrt{\frac{\sum_{i=1}^{N}(x^i-\overline{x})^2}{N-1}},

where \overline{x} is called the empirical mean, and s is called the empirical standard deviation. The larger the sample size N, the more confident you can be in the estimations.
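In NumPy, these two estimators are one line each (the sample values are illustrative; note `ddof=1` for the N-1 denominator of the empirical standard deviation):

```python
import numpy as np

# Empirical mean and standard deviation of a sample of N performance measures.
x = np.array([105.2, 98.7, 110.1, 102.4, 99.6])

x_bar = x.mean()     # empirical mean
s = x.std(ddof=1)    # empirical std with the N-1 (Bessel-corrected) denominator
```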

Here, two algorithms with respective performances X_1 and X_2 are compared. If X_1 and X_2 follow normal distributions, the random variable describing their difference (X_{\text{diff}} = X_1-X_2) also follows a normal distribution, with parameters {\sigma_{\text{diff}}=(\sigma_1^2+\sigma_2^2)^{1/2}} and \mu_{\text{diff}}=\mu_1-\mu_2. In this case, the estimator of the mean of X_{\text{diff}} is \overline{x}_{\text{diff}} = \overline{x}_1-\overline{x}_2 and the estimator of {\sigma_{\text{diff}}} is {s_{\text{diff}}=\sqrt{s_1^2+s_2^2}}. The effect size \epsilon can be defined as the difference between the mean performances of both algorithms: {\epsilon = \mu_1-\mu_2}.

Testing for a difference between the performances of two algorithms (\mu_1 and \mu_2) is mathematically equivalent to testing a difference between \mu_{\text{diff}} and 0. The second point of view is considered from now on. We draw a sample x_{\text{diff}} from X_{\text{diff}} by subtracting two samples x_1 and x_2 obtained from X_1 and X_2.
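Numerically, with synthetic samples standing in for the two algorithms' performance measures:

```python
import numpy as np

# Estimators for the difference X_diff = X_1 - X_2 from two samples of size N.
rng = np.random.default_rng(1)
x1 = rng.normal(loc=5.0, scale=2.0, size=20)   # synthetic sample from algo 1
x2 = rng.normal(loc=4.0, scale=1.0, size=20)   # synthetic sample from algo 2

x_bar_diff = x1.mean() - x2.mean()                        # estimate of mu_diff
s_diff = np.sqrt(x1.std(ddof=1)**2 + x2.std(ddof=1)**2)   # estimate of sigma_diff
```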

Example 1: To illustrate the concepts developed in this article, let us take two algorithms (Algo 1 and Algo 2) and compare them on the Half-Cheetah environment from the OpenAI Gym framework. The actual algorithms used are not so important here and will be revealed later. First, we run a preliminary study with N=5 random seeds for each and plot the results in Figure 2. This figure shows the average learning curves, with the 95\% confidence interval. Each point of a learning curve is the average cumulated reward over 10 evaluation episodes. The measure of performance of an algorithm is the average performance over the last 10 points (i.e. the last 100 evaluation episodes). From the figure, it seems that Algo 1 performs better than Algo 2. Moreover, the confidence intervals do not overlap much at the end. Of course, we need to run statistical tests before drawing any conclusion.

## Comparing performances with a difference test

In a difference test, statisticians define the null hypothesis H_0 and the alternate hypothesis H_a. H_0 assumes no difference, whereas H_a assumes one:

- H_0: \mu_{\text{diff}} = 0
- H_a: \mu_{\text{diff}} \neq 0
These hypotheses refer to the two-tail case. When you have an a priori belief about which algorithm performs best (let us say Algo 1), you can use the one-tail version:

- H_0: \mu_{\text{diff}} \leq 0
- H_a: \mu_{\text{diff}} > 0
At first, a statistical test always assumes the null hypothesis. Once a sample x_{\text{diff}} is collected from X_{\text{diff}}, you can estimate the probability p (called the p-value) of observing data as extreme, under the null hypothesis assumption. By extreme, one means far from the null hypothesis (\overline{x}_{\text{diff}} far from 0). The p-value answers the following question: how probable is it to observe this sample, or a more extreme one, given that there is no true difference in the performances of both algorithms? Mathematically, we can write it this way for the one-tail case:

p{\normalsize \text{-value}} = P(X_{\text{diff}}\geq \overline{x}_{\text{diff}} \hspace{2pt} |\hspace{2pt} H_0),

and this way for the two-tail case:

p{\normalsize \text{-value}}=\left\{ \begin{array}{ll} P(X_{\text{diff}}\geq \overline{x}_{\text{diff}} \hspace{2pt} |\hspace{2pt} H_0)\hspace{0.5cm} \text{if} \hspace{5pt} \overline{x}_{\text{diff}}>0\\ P(X_{\text{diff}}\leq \overline{x}_{\text{diff}} \hspace{2pt} |\hspace{2pt} H_0) \hspace{0.5cm} \text{if} \hspace{5pt} \overline{x}_{\text{diff}}\leq0. \end{array} \right.

When this probability becomes really low, it means that it is highly improbable that two algorithms with no performance difference produced the collected sample x_{\text{diff}}. A difference is called significant at significance level \alpha when the p-value is lower than \alpha in the one-tail case, and lower than \alpha/2 in the two-tail case (to account for the two-sided test). Usually \alpha is set to 0.05 or lower. In this case, the low probability of observing the collected sample under hypothesis H_0 results in its rejection. Note that a significance level \alpha=0.05 still results in 1 chance out of 20 of claiming a false positive, i.e. claiming that there is a true difference when there is none.

Another way to see this is to consider confidence intervals. Two kinds of confidence intervals can be computed:

- CI_1: The 100\cdot(1-\alpha)\hspace{3pt}\% confidence interval for the mean of the difference \mu_{\text{diff}} given a sample x_{\text{diff}} characterized by \overline{x}_{\text{diff}} and s_{\text{diff}}.
- CI_2: The 100\cdot(1-\alpha)\hspace{3pt}\% confidence interval for any realization of X_{\text{diff}} under H_0 (assuming \mu_{\text{diff}}=0).
Having CI_2 not include \overline{x}_{\text{diff}} is mathematically equivalent to a p-value below \alpha: in both cases, there is less than a 100\cdot\alpha\% chance of observing such a sample under H_0 (i.e. assuming \mu_{\text{diff}}=0). When CI_1 does not include 0, we are also 100\cdot(1-\alpha)\hspace{3pt}\% confident that \mu_{\text{diff}}\neq0, without assuming H_0. Proving one of these things leads to the conclusion that the difference is significant at level \alpha.

Two types of errors can be made in statistics:

- The type-I error rejects H_0 when it is true, also called a false positive. This corresponds to claiming the superiority of an algorithm over another when there is no true difference. Note that we call both the significance level and the probability of type-I error \alpha because they both refer to the same concept. Choosing a significance level of \alpha enforces a probability of type-I error \alpha, under the assumptions of the statistical test.
- The type-II error fails to reject H_0 when it is false, also called a false negative. This corresponds to missing the opportunity to publish an article when there was actually something to be found.

Important:

- In the two-tail case, the null hypothesis H_0 is \mu_{\text{diff}}=0. The alternative hypothesis H_a is \mu_{\text{diff}}\neq0.
- p{\normalsize \text{-value}} = P(X_{\text{diff}}\geq \overline{x}_{\text{diff}} \hspace{2pt} |\hspace{2pt} H_0).
- A difference is said to be statistically significant when a statistical test passed. One can reject the null hypothesis when 1) the p-value < \alpha; 2) CI_1 does not contain 0; 3) CI_2 does not contain \overline{x}_{\text{diff}}.
- Statistically significant does not refer to the absolute truth. Two types of error can occur: type-I error rejects H_0 when it is true; type-II error fails to reject H_0 when it is false.

## Select the appropriate statistical test

You must decide which statistical tests to use in order to assess whether the performance difference is significant or not. As recommended in [Henderson et al., 2017], the two-sample t-test and the bootstrap confidence interval test can be used for this purpose. Henderson et al. also advised the Kolmogorov-Smirnov test, which tests whether two samples come from the same distribution. This test should not be used to compare RL algorithms because it is unable to prove any order relation.

## T-test and Welch's t-test

We want to test the hypothesis that two populations have equal means (null hypothesis H_0). A 2-sample t-test can be used when the variances of both populations (both algorithms) are assumed equal. However, this assumption rarely holds when comparing two different algorithms (e.g. DDPG vs TRPO). In this case, an adaptation of the 2-sample t-test for unequal variances called Welch’s t-test should be used. T-tests make a few assumptions:

- The scale of data measurements must be continuous and ordinal (can be ranked). This is the case in RL.
- Data is obtained by collecting a representative sample from the population. This seems reasonable in RL.
- Measurements are independent from one another. This seems reasonable in RL.
- Data is normally-distributed, or at least bell-shaped. The normal law being a mathematical concept involving infinity, nothing is ever perfectly normally distributed. Moreover, measurements of algorithm performances might follow multi-modal distributions.
Under these assumptions, one can compute the t-statistic t and the degrees of freedom \nu for Welch's t-test, as estimated by the Welch–Satterthwaite equation:

t = \frac{\overline{x}_{\text{diff}}}{\sqrt{\frac{s^2_1+s^2_2}{N}}}, \hspace{1cm} \nu \approx \frac{(N-1)\cdot \Big(s^2_1+s^2_2\Big)^2}{s^4_1+s^4_2},

with \overline{x}_{\text{diff}} = \overline{x}_1-\overline{x}_2; s_1, s_2 the empirical standard deviations of the two samples; and N the sample size (the same for both algorithms). The t-statistic is assumed to follow a t-distribution, which is bell-shaped and whose width depends on the degrees of freedom. The higher this degree, the thinner the distribution.
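These two formulas can be checked numerically against SciPy's Welch implementation (`scipy.stats.ttest_ind` with `equal_var=False`), assuming equal sample sizes as in the text; the samples below are synthetic:

```python
import numpy as np
from scipy import stats

# Welch t-statistic and Welch-Satterthwaite degrees of freedom, computed
# from the formulas above, then compared with SciPy's implementation.
rng = np.random.default_rng(0)
N = 10
x1 = rng.normal(3.0, 1.0, N)   # synthetic performances of algo 1
x2 = rng.normal(2.0, 2.0, N)   # synthetic performances of algo 2

s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
t = (x1.mean() - x2.mean()) / np.sqrt((s1**2 + s2**2) / N)
nu = (N - 1) * (s1**2 + s2**2)**2 / (s1**4 + s2**4)

# One-tail p-value from the t-distribution's survival function.
p_one_tail = stats.t.sf(t, df=nu)

# SciPy's Welch test (two-tail p-value).
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)
```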

Figure 3 helps make sense of these concepts. It represents the distribution of the t-statistics corresponding to X_{\text{diff}}, under H_0 (left distribution) and under H_a (right distribution). H_0 assumes \mu_{\text{diff}}=0; the distribution is therefore centered on 0. H_a assumes a (positive) difference \mu_{\text{diff}}=\epsilon; the distribution is therefore shifted by the t-value corresponding to \epsilon, t_\epsilon. Note that we consider the one-tail case here, and test for a positive difference.

A t-distribution is defined by its probability density function T_{distrib}^{\nu}(\tau) (left curve in Figure 3), which is parameterized by \nu. The cumulative distribution function CDF_{H_0}(t) is the function evaluating the area under T_{distrib}^{\nu}(\tau) from \tau=-\infty to \tau=t. This allows us to write:

p\text{-value} = 1-CDF_{H_0}(t) = 1-\int_{-\infty}^{t} T_{distrib}^{\nu}(\tau) \cdot d\tau.

_{ Figure 3: Representation of H0 and Ha under the t-test assumptions. Areas under the distributions represented in red, dark blue and light blue correspond to the probability of type-I error alpha, type-II error beta and the statistical power 1-beta respectively. }

In Figure 3, t_\alpha represents the critical t-value satisfying the significance level \alpha in the one-tail case. When t=t_\alpha, the p-value =\alpha. When t>t_\alpha, the p-value is lower than \alpha and the test rejects H_0. On the other hand, when t is lower than t_\alpha, the p-value is greater than \alpha and the test fails to reject H_0. As can be seen in the figure, setting the threshold at t_\alpha might also cause an error of type II. The rate of this error (\beta) is represented by the dark blue area: under the hypothesis of a true difference \epsilon (under H_a, right distribution), we fail to reject H_0 when t is lower than t_\alpha. \beta can therefore be computed mathematically using the CDF:

\beta = CDF_{H_a}(t_\alpha) = \int_{-\infty}^{t_\alpha} T_{distrib}^{\nu}(\tau-t_{\epsilon}) \cdot d\tau.

Using the translation property of integrals, we can rewrite \beta as:

\beta = CDF_{H_0}(t_\alpha-t_{\epsilon}) = \int_{-\infty}^{t_\alpha-t_{\epsilon}} T_{distrib}^{\nu}(\tau) \cdot d\tau.

The procedure to run a Welch’s t-test given two samples (x_1, x_2) is:

- Compute the degrees of freedom \nu and the t-statistic t based on s_1, s_2, N and \overline{x}_{\text{diff}}.
- Look up the t_\alpha value for \nu degrees of freedom in a t-table, or evaluate the inverse of the CDF at 1-\alpha.
- Compare the t-statistic to t_\alpha. The difference is said to be statistically significant (H_0 rejected) at level \alpha when t\geq t_\alpha.
Note that t<t_\alpha does not mean there is no difference between the performances of both algorithms. It only means there is not enough evidence to prove its existence with 100 \cdot (1-\alpha)\% confidence (it might be a type-II error). Noise might hinder the ability of the test to detect the difference. In this case, increasing the sample size N could help uncover the difference.
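The procedure above can be sketched with scipy. This is a hedged illustration: the helper name `welch_test` and the explicit one-tailed decision rule are assumptions of this sketch, not the post's official code:

```python
import numpy as np
from scipy import stats

def welch_test(x1, x2, alpha=0.05):
    """One-tailed Welch's t-test following the three steps above.
    Returns (reject_H0, p_value)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    N = len(x1)
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    # Step 1: t-statistic and degrees of freedom
    t = (x1.mean() - x2.mean()) / np.sqrt((s1**2 + s2**2) / N)
    nu = (N - 1) * (s1**2 + s2**2)**2 / (s1**4 + s2**4)
    # Step 2: critical value t_alpha (inverse CDF of the t-distribution at 1 - alpha)
    t_alpha = stats.t.ppf(1 - alpha, df=nu)
    # Step 3: reject H0 at level alpha when t >= t_alpha
    p_value = 1 - stats.t.cdf(t, df=nu)
    return t >= t_alpha, p_value
```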

Selecting the significance level \alpha of the t-test enforces the probability of type-I error to \alpha. However, Figure 3 shows that decreasing this probability boils down to increasing t_\alpha, which in turn increases the probability of type-II error \beta. One can decrease \beta while keeping \alpha constant by increasing the sample size N. This way, the estimate \overline{x}_{\text{diff}} of \mu_{\text{diff}} gets more accurate, which translates into thinner distributions in the figure and results in a smaller \beta. The next section gives standard guidelines to select N so as to meet requirements for both \alpha and \beta.

## Bootstrapped confidence intervals

The bootstrapped confidence interval method makes no assumption about the distribution of performance differences. It estimates confidence intervals by re-sampling from the samples actually collected and computing the mean of each re-sampled set.

Given the true mean \mu and standard deviation \sigma of a normal distribution, a simple formula gives the 95\% confidence interval. Here, however, we consider an unknown distribution F (the distribution of performances for a given algorithm). As we saw above, the empirical mean \overline{x} is an unbiased estimate of its true mean, but how do we compute a confidence interval? One solution is to use the bootstrap principle.

Let us say we have a sample x_1, x_2, ..., x_N of measures (performance measures in our case), where N is the sample size. The empirical bootstrap sample is obtained by sampling with replacement from the original sample. This bootstrap sample is noted x^*_1, x^*_2, ..., x^*_N and contains the same number of measurements N. The bootstrap principle then says that, for any statistic u computed on the original sample and u^* computed on the bootstrap sample, variations in u are well approximated by variations in u^*. More explanations and justifications can be found in this document from MIT. You can therefore approximate variations of the empirical mean (say, its range) by variations across the bootstrap samples.

The computation would look like this:

- Generate B bootstrap samples of size N from the original sample x_1 of Algo1, and B bootstrap samples from the original sample x_2 of Algo2.
- Compute the empirical mean for each sample: \mu^1_1, \mu^2_1, ..., \mu^B_1 and \mu^1_2, \mu^2_2, ..., \mu^B_2
- Compute the differences \mu_{\text{diff}}^{1:B} = \mu_1^{1:B}-\mu_2^{1:B}
- Compute the bootstrapped confidence interval at the 100\cdot(1-\alpha)\% level. This is the range between the 100 \cdot\alpha/2 and 100\cdot(1-\alpha/2) percentiles of the vector \mu_{\text{diff}}^{1:B} (e.g. for \alpha=0.05, the range between the 2.5^{th} and the 97.5^{th} percentiles).
The number of bootstrap samples B should be chosen large (e.g. >1000). If the confidence interval does not contain 0, you can be confident at the 100 \cdot (1-\alpha)\% level that the difference is either positive (both bounds positive) or negative (both bounds negative): you just found a statistically significant difference between the performances of your two algorithms. You can find a nice implementation of this procedure here.
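The four steps above can be sketched in a few lines of numpy. This is a minimal illustration with assumed names (`bootstrap_diff_ci`, the fixed seed); the implementation linked above is more complete:

```python
import numpy as np

def bootstrap_diff_ci(x1, x2, alpha=0.05, B=5000, seed=0):
    """Bootstrapped confidence interval for the difference of mean
    performances, following the four steps above."""
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    # Steps 1-2: B bootstrap samples (with replacement) and their empirical means
    means1 = rng.choice(x1, size=(B, len(x1))).mean(axis=1)
    means2 = rng.choice(x2, size=(B, len(x2))).mean(axis=1)
    diff = means1 - means2  # step 3: differences of means
    # Step 4: range between the 100*alpha/2 and 100*(1 - alpha/2) percentiles
    low, high = np.percentile(diff, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high  # the difference is deemed significant when 0 is outside [low, high]
```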

Example 1 (continued)

Here, the type-I error requirement is set to \alpha=0.05. Running the Welch’s t-test and the bootstrap confidence interval test with two samples (x_1,x_2) of 5 seeds each leads to a p-value of 0.031 and a bootstrap confidence interval such that P\big(\mu_{\text{diff}} \in [259, 1564]\big) = 0.95. Since the p-value is below the significance level \alpha and the confidence interval CI_1 does not include 0, both tests pass: they find a significant difference between the performances of Algo1 and Algo2 with 95\% confidence. There should have been only a 5\% chance of concluding there is a significant difference when none exists.

In fact, we did encounter a type-I error. I know that for sure because Algo1 and Algo2 are the exact same algorithm: both are the canonical implementation of DDPG [Lillicrap et al., 2015] (the codebase can be found in this repository). This means that H_0 was the true hypothesis: there is no possible difference in the true means of the two algorithms. Our first conclusion was wrong; we committed a type-I error, rejecting H_0 when it was true. We selected the two tests so as to set the type-I error probability \alpha to 5\%. However, statistical tests often make assumptions, which can result in wrong estimations of the probability of type-I error. We will see in the last section that the false positive rate was strongly under-evaluated.

Important:

- T-tests assume that the t-values follow a t-distribution. Under some assumptions, they compute the p-value and the confidence interval CI_2 at level \alpha analytically.
- The Welch’s t-test does not assume both algorithms have equal variances but the t-test does.
- The bootstrapped confidence interval test does not make assumptions on the performance distribution and estimates empirically the confidence interval CI_1 at level \alpha.
- Selecting a test with a significance level \alpha enforces a type-I error \alpha when the assumptions of the test are verified.
## The theory: power analysis for the choice of the sample size

We saw that \alpha was enforced by the choice of the significance level in the test implementation. The second type of error \beta must now be estimated. \beta is the probability to fail to reject H_0 when H_a is true. When the effect size \epsilon and the probability of type-I error \alpha are kept constant, \beta is a function of the sample size N. Choosing N so as to meet requirements on \beta is called

statistical power analysis. It answers the question: what sample size N do I need to have a 1-\beta chance of detecting an effect of size \epsilon, using a test with significance level \alpha? The next paragraphs present guidelines to choose N in the context of a Welch’s t-test. As we saw above, \beta can be computed analytically as:

\beta = CDF_{H_0}(t_\alpha-t_{\epsilon}) = \int_{-\infty}^{t_\alpha-t_{\epsilon}} T_{distrib}^{\nu}(\tau) \cdot d\tau,

where CDF_{H_0} is the cumulative distribution function of a t-distribution centered on 0, t_\alpha is the critical value for significance level \alpha and t_\epsilon is the t-value corresponding to an effect size \epsilon. In the end, \beta depends on \alpha, \epsilon, the empirical standard deviations (s_1, s_2) computed on two samples (x_1,x_2), and the sample size N.
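This computation can be sketched with scipy's t-distribution (a hedged illustration; the helper name `type2_error` is an assumption of this sketch):

```python
import numpy as np
from scipy import stats

def type2_error(eps, s1, s2, N, alpha=0.05):
    """Probability beta of missing an effect of size eps with a one-tailed
    Welch's t-test at significance level alpha (equation above)."""
    nu = (N - 1) * (s1**2 + s2**2)**2 / (s1**4 + s2**4)
    t_alpha = stats.t.ppf(1 - alpha, df=nu)     # critical value
    t_eps = eps / np.sqrt((s1**2 + s2**2) / N)  # t-value of the effect size
    return stats.t.cdf(t_alpha - t_eps, df=nu)  # beta = CDF_H0(t_alpha - t_eps)
```

Larger N shifts t_\epsilon up and thins the distribution, so \beta decreases monotonically with the sample size.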

Example 2

To illustrate, we compare two DDPG variants: one with action perturbations (Algo1) [Lillicrap et al., 2015], the other with parameter perturbations (Algo2) [Plappert et al., 2017]. Both algorithms are evaluated in the Half-Cheetah environment from the OpenAI Gym framework.

## Step 1 - Running a pilot study

To compute \beta, we need estimates of the standard deviations of the two algorithms (s_1, s_2). In this step, the algorithms are run in the environment to gather two samples x_1 and x_2 of size n. From there, we can compute the empirical means (\overline{x}_1, \overline{x}_2) and standard deviations (s_1, s_2).

Example 2 (continued)

Here we run both algorithms with n=5. We find empirical means (\overline{x}_1, \overline{x}_2) = (3523, 4905) and empirical standard deviations (s_1, s_2) = (1341, 990) for Algo1 (blue) and Algo2 (red) respectively. From Figure 4, it seems there is a slight difference in the mean performances \overline{x}_{\text{diff}} =\overline{x}_2-\overline{x}_1 >0.

Running preliminary statistical tests at level \alpha=0.05 leads to a p-value of 0.1 for the Welch’s t-test, and a bootstrapped confidence interval of CI_1=[795, 2692] for the value of \overline{x}_{\text{diff}} = 1382. The Welch’s t-test does not reject H_0 (p-value >\alpha) but the bootstrap test does (0\not\in CI_1). One should compute \beta to estimate the chance that the Welch’s t-test missed an underlying performance difference (type-II error).

_{ Figure 4: DDPG with action perturbation versus DDPG with parameter perturbation tested in Half-Cheetah. Mean and 95% confidence interval computed over 5 seeds are reported. The figure shows a small difference in the empirical mean performances.}

## Step 2 - Choosing the sample size

Given a statistical test (Welch’s t-test), a significance level \alpha (e.g. \alpha=0.05) and empirical estimations of the standard deviations of Algo1 and Algo2 (s_1,s_2), one can compute \beta as a function of the sample size N and the effect size \epsilon one wants to be able to detect.

Example 2 (continued)

For N in [2,50] and \epsilon in [0.1,..,1]\times\overline{x}_1, we compute t_\alpha and \nu using the formulas given in the section on the Welch’s t-test, as well as t_{\epsilon} for each \epsilon. Finally, we compute the corresponding probability of type-II error \beta using the equation above. Figure 5 shows the evolution of \beta as a function of N for the different values of \epsilon. Considering the semi-dashed black line for \epsilon=\overline{x}_{\text{diff}}=1382, we find \beta=0.51 for N=5: there is a 51\% chance of making a type-II error when trying to detect an effect of size \epsilon=1382. To meet the requirement \beta=0.2, N should be increased to N=10 (\beta=0.19).

_{ Figure 5: Evolution of the probability of type-II error as a function of the sample size N for various effect sizes epsilon, when (s1, s2)= (1341, 990) and alpha=0.05. The requirement 0.2 is represented by the horizontal dashed black line. }

In our example, we find that N=10 is enough to detect an effect of size \epsilon=1382 with a Welch’s t-test at significance level \alpha, using the empirical estimations (s_1, s_2) = (1341, 990). However, let us keep in mind that these computations use various approximations (\nu, s_1, s_2) and make assumptions about the shape of the t-value distribution.

## Step 3 - Running the statistical tests

Both algorithms should be run so as to obtain a sample x_{\text{diff}} of size N, after which the statistical tests can be applied.

Example 2 (continued)

Here, we take N=10 and run both the Welch’s t-test and the bootstrap test. We now find empirical means (\overline{x}_1, \overline{x}_2) = (3690, 5323) and empirical standard deviations (s_1, s_2) = (1086, 1454) for Algo1 and Algo2 respectively. Both tests reject H_0, with a p-value of 0.0037 for the Welch’s t-test and a confidence interval \mu_{\text{diff}} \in [732,2612] for the bootstrap test. In Figure 7, the plots for N=5 and N=10 can be compared. With a larger number of seeds, the difference that was not found significant with N=5 is now clearly visible: the estimate \overline{x}_{\text{diff}} is more robust, more evidence is available to support the claim that Algo2 outperforms Algo1, and this translates into the tighter confidence intervals represented in the figure.

_{ Figure 7: Performance of DDPG with action perturbation (Algo1) and parameter perturbation (Algo2) with N=5 seeds (left) and N=10 seeds (right). The 95% confidence intervals on the right are smaller, because more evidence is available (N larger). The underlying difference appears when N grows. }

Important:

Given a sample size N, a minimum effect size \epsilon to detect, and a requirement \alpha on the type-I error, the probability of type-II error \beta can be computed. This computation relies on the assumptions of the t-test.

The sample size N should be chosen so as to meet the requirements on \beta.

## In practice: influence of deviations from assumptions

Under their respective assumptions, the t-test and the bootstrap test enforce the probability of type-I error to the selected significance level \alpha. These assumptions should be checked carefully if one wants to report error probabilities accurately. First, we compute an empirical evaluation of the type-I error based on experimental data, and show that: 1) the bootstrap test is sensitive to small sample sizes; 2) the t-test might slightly under-evaluate the type-I error for non-normal data. Second, we show that inaccuracies in the estimation of the empirical standard deviations s_1 and s_2 due to low sample sizes can lead to large errors in the computation of \beta, which in turn leads to under-estimating the sample size required for the experiment.

## Empirical estimation of the type-I error

Remember, type-I errors occur when the null hypothesis (H_0) is rejected in favor of the alternative hypothesis (H_a) although H_0 is in fact correct. Given the sample size N, the probability of type-I error can be estimated as follows:

- Run 2 \times N trials of a given algorithm. This ensures that H_0 is true, because all measurements come from the same distribution.
- Randomly split the measures into two samples of size N, and treat them as if they came from two different algorithms.
- Test for a difference between the two fictive algorithms and record the outcome.
- Repeat this procedure T times (e.g. T=1000).
- Compute the proportion of times H_0 was rejected. This is the empirical estimate of \alpha.
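This procedure can be sketched as follows. A hedged illustration: scipy's two-sided Welch test stands in for the tests used in the post, and `empirical_type1` is an assumed name:

```python
import numpy as np
from scipy import stats

def empirical_type1(perfs, N, T=1000, alpha=0.05, seed=0):
    """Empirical false positive rate of Welch's t-test: repeatedly split
    2N measures of a SINGLE algorithm into two fake 'algorithms' and count
    how often H0 is (wrongly) rejected."""
    rng = np.random.default_rng(seed)
    perfs = np.asarray(perfs, dtype=float)
    rejections = 0
    for _ in range(T):
        draw = rng.choice(perfs, size=2 * N, replace=False)  # 2N trials, H0 true
        x1, x2 = draw[:N], draw[N:]                          # two random splits of size N
        _, p = stats.ttest_ind(x1, x2, equal_var=False)      # Welch's t-test
        rejections += p < alpha
    return rejections / T  # empirical estimate of alpha
```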

Example 3

We use Algo1 from Example 2. From 42 available measures of performance, the above procedure is run for N in [2,21]. Figure 8 presents the results. For small values of N, empirical estimations of the false positive rate are much larger than the supposedly enforced value \alpha=0.05.

_{ Figure 8: Empirical estimations of the false positive rate on experimental data (Example 3) when N varies, using the Welch's t-test (blue) and the bootstrap confidence interval test (orange). }

In our experiment, the bootstrap confidence interval test should not be used with small sample sizes (<10), and even then, the probability of type-I error (\approx10\%) is under-evaluated by the test (5\%). The Welch’s t-test controls for this effect, because the test is much harder to pass when N is small (due to the increase of t_\alpha). However, the true (empirical) false positive rate might still be slightly under-evaluated; in this case, we might want to set the significance level to \alpha<0.05 to make sure the true false positive rate stays below 0.05. For the bootstrap test, the error is due to the inability of small samples to correctly represent the underlying distribution, which impairs the enforcement of the false positive rate at the significance level \alpha. For the Welch’s t-test, it might be due to the non-normality of our data (whose histogram seems to reveal a bimodal distribution). In Example 1, we used N=5 and encountered a type-I error: Figure 8 shows that the probability of this happening was around 10\% for the bootstrap test and above 5\% for the Welch’s t-test.

## Influence of the empirical standard deviations

The Welch’s t-test computes the t-statistic and the degrees of freedom \nu based on the sample size N and the empirical estimations of the standard deviations, s_1 and s_2. When N is low, the estimations s_1 and s_2 under-estimate the true standard deviations on average. Under-estimating (s_1,s_2) leads to smaller \nu and lower t_\alpha, which in turn leads to lower estimations of \beta. Finally, finding a lower \beta leads to the selection of a smaller sample size N to meet the \beta requirements. We found this to have a significant effect on the computation of N. Figure 9 shows \beta, the false negative rate, when trying to detect effects of size \epsilon between two normal distributions \mathcal{N}(3,1) and \mathcal{N}(3+\epsilon,1). The only difference between the two panels is that the left one uses the true values of \sigma_1, \sigma_2 to compute \beta, whereas the right one uses (inaccurate) empirical evaluations s_1, s_2. We can see that the estimation of the standard deviations influences the computation of \beta, and the subsequent choice of an appropriate sample size N to meet the requirements on \beta. See our paper for further details.

_{ Figure 9: Evolution of the probability of type-II error as a function of the sample size N and the effect size epsilon, when (s1, s2)= (1-error, 1-error) and alpha=0.05. Left: error=0, this is the ideal case. Right: error=0.40, a large error that can be made when evaluating s over n=5 samples. The compared distributions are normal, one is centered on 3, the other on 3+\epsilon. }
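The average under-estimation of the standard deviation for small samples can be checked with a quick Monte-Carlo simulation (numpy only; the 10,000 repetitions and the N(0, 1) setup are illustrative choices):

```python
import numpy as np

# The empirical standard deviation s (with ddof=1) of small samples drawn
# from a N(0, 1) distribution under-estimates the true sigma = 1 on average.
rng = np.random.default_rng(0)
n = 5  # small sample size
s_values = [rng.normal(0.0, 1.0, n).std(ddof=1) for _ in range(10_000)]
mean_s = float(np.mean(s_values))
# for n = 5, E[s] = c4 * sigma with c4 ~= 0.94: mean_s falls clearly below 1
print(mean_s)
```

Note that s² (with ddof=1) is an unbiased estimate of the variance, but taking the square root makes s a biased, on-average-too-small estimate of \sigma.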

Important:

- One should not blindly believe in statistical tests results. These tests are based on assumptions that are not always reasonable.
- \alpha must be empirically estimated, as the statistical tests might underestimate it, because of wrong assumptions about the underlying distributions or because of the small sample size.
- The bootstrap test evaluation of the type-I error is strongly dependent on the sample size. A bootstrap test should not be used with fewer than 20 samples.
- Inaccuracies in the estimation of the standard deviations of the algorithms (s_1,s_2), due to the small sample size n of the preliminary study, lead to under-estimating the sample size N required to meet requirements on type-II errors.
## Conclusion

In this post, I detailed the statistical problem of comparing the performance of two RL algorithms. I defined type-I and type-II errors and proposed ad-hoc statistical tests to test for performance difference. Finally, I detailed how to pick the right number of random seeds (your sample size) so as to reach the requirements in terms of type-I and II errors and illustrated the process with a practical example.

The most important part is what came after. We challenged the hypotheses made by the Welch’s t-test and the bootstrap test and found several problems. First, we showed a significant difference between the empirical estimations of the false positive rate in our experiment and the theoretical values supposedly enforced by both tests. As a result, the bootstrap test should not be used with fewer than N=20 samples, and a tighter significance level should be used to enforce a reasonable false positive rate (<0.05). Second, we showed that the estimation of the sample size N required to meet requirements on type-II error was strongly dependent on the accuracy of (s_1,s_2). To compensate for the under-estimation of N, N should be chosen systematically larger than what the power analysis prescribes.

## Final recommendations

- Use the Welch’s t-test over the bootstrap confidence interval test.
- Set the significance level of a test to lower values (\alpha<0.05) so as to make sure the probability of type-I error (empirical \alpha) keeps below 0.05.
- Correct for multiple comparisons in order to avoid the linear growth of false positives with the number of experiments.
- Use at least n=20 samples in the pilot study to compute robust estimates of the standard deviations of both algorithms.
- Use a larger sample size N than the one prescribed by the power analysis. This helps compensate for potential inaccuracies in the estimation of the standard deviations of the algorithms and reduces the probability of type-II errors.
Note that I am not a statistician. If you spot any approximation or mistake in the text above, please feel free to report corrections or clarifications.

## References

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2017). Deep Reinforcement Learning that Matters. link

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937. link

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. link

Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML). link

Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; Schölkopf, B.; and Levine, S. 2017. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. link

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. link

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015a. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML). link

Wu, Y.; Mansimov, E.; Liao, S.; Grosse, R.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. link

Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., … & Andrychowicz, M. (2017). Parameter space noise for exploration. link

## Code

The code is available on Github here.

## Paper

The paper can be found on ArXiv here.

## Contact

Email: cedric.colas@inria.fr


Deep reinforcement learning algorithms have attracted unprecedented attention due to remarkable successes in domains like Atari games and Go, and have been extended to control domains involving continuous actions. However, standard deep RL algorithms for continuous actions, like DDPG, suffer from inefficient exploration when facing sparse or deceptive reward problems.

One natural approach is to rely on imitation learning, i.e. leveraging observations of a human solving the problem. However, humans cannot always help. They can be unavailable, or simply unable to demonstrate a good behavior (e.g. how would one demonstrate locomotion to a six-legged robot?).

Another approach relies on various forms of curiosity-driven deep RL. This generally consists in adding an exploration bonus to the reward function, measuring quantities such as information gain, entropy, uncertainty or prediction errors (e.g. [Bellemare et al.]). Sometimes the reward function is even ignored and replaced by such an intrinsic reward [Pathak et al., 2017]. However, it is challenging to leverage these methods in environments with complex continuous action spaces, especially on real-world robots.

In our recent ICML 2018 paper, we propose to leverage evolutionary and developmental curiosity-driven exploration methods that were initially designed from a very different perspective. These are population-based approaches like Novelty Search, Quality-Diversity or Intrinsically Motivated Goal Exploration Processes. The primary purpose of these methods has been to enable autonomous machines to discover diverse repertoires of skills, i.e. to learn a population of policies that produce maximally diverse behavioral outcomes. Such discoveries have often been used to build good internal world models in a sample-efficient manner, e.g. through curiosity-driven goal exploration [Baranes and Oudeyer, 2013]. This led to a variety of applications where real-world robots were capable of learning complex skills very fast [Forestier et al., 2017] or adapting to damage [Cully et al., 2015].

Our new paper shows that the strengths of monolithic deep RL methods and population-based diversity search methods can be combined to solve RL problems with rare or deceptive rewards. The general idea is as follows. In the exploration phase, we use a population-based diversity search approach (in the experiments below, a simple form of Goal Exploration Process [Forestier et al., 2017]). During this phase, diverse goals are sampled, leading to the sampling of corresponding small neural network policies, whose behavioral trajectories are recorded in an archive. While the sampling process is not influenced at all by the extrinsic reward of the RL problem, these rewards are nevertheless observed and memorized. Then, in a second phase, all trajectories (and associated rewards) discovered during the first phase are used to initialize the replay buffer of a deep RL algorithm.

The general intuition is that population-based pure diversity search finds rare rewards faster than deep RL algorithms, and collects observation data that is very useful for getting deep RL algorithms out of deceptive local minima. However, as deep RL algorithms are very strong at exploiting reward gradient information when it is available, they can be used to learn policies that refine those found during the diversity search phase.

Our experiments use a simple goal exploration process for the first phase, and several variants of DDPG for the second phase.

We show that, and analyze why:

- DDPG fails on a simple low dimensional deceptive reward problem called Continuous Mountain Car,
- GEP-PG obtains state-of-the-art performance on the larger Half-Cheetah benchmark, reaching higher performance faster than DDPG alone,
- The diversity of outcomes discovered during the first phase correlates with the efficiency of DDPG in the second phase.
## The methodology

Our experiments follow the methodological guidelines presented in [Henderson et al, 2017]:

- we use standard baseline implementations and hyperparameters of DDPG,
- we run robust statistical analysis, averaging our results over 20 different seeds for each algorithm,
- we provide the code of the algorithm, and the code to make the figures, here.
## The environments

Continuous Mountain Car (CMC) is a simple 2D problem available in the OpenAI Gym environments. In this problem, an underpowered car must learn to swing in a valley in order to reach the top of a hill. Success yields a large positive reward (+100) while there is a small penalty for the car energy expenses (-0.1 x |a|²).

In Half-Cheetah (HC), a 2D bipedal robot must learn to run as fast as possible. It receives observations about its absolute position and its joint positions and velocities (17D), and can control the torques of all its joints (6D).

_{ Continuous Mountain Car }

_{ Half Cheetah }

## The GEP-PG approach

GEP-PG, for Goal Exploration Process - Policy Gradient, is a general framework which sequentially combines an algorithm from the Intrinsically Motivated Goal Exploration Process family (IMGEP) [Forestier et al, 2017] and Deep Deterministic Policy Gradient (DDPG) [Lillicrap, 2015].

## DDPG

DDPG is an actor-critic off-policy method which stores samples in a replay buffer to perform policy gradient descent (see original paper [Lillicrap, 2015] for detailed explanations of this algorithm). In this paper, we use two variants:

- DDPG with action perturbations, for which an Ornstein-Uhlenbeck noise process is added to the actions.
- DDPG with parameter perturbations, where an adaptive noise is added directly to the actor’s parameters, see [Plappert, 2018] for details.
## GEP

Here, we use a very simple form of goal exploration process. First, we consider neural network policies, typically smaller in size than the one learnt in the PG phase. Second, we define a “behavioral representation” or “outcome space” that describes properties of the agent trajectory over the course of an episode (also called “roll-out” of a policy). For CMC, the minimum and maximum position on the x-axis could be used as behavioral features to define the outcome space:

_{ Each trajectory is mapped to an outcome space. Here we use the minimum and maximum positions along the x-axis as behavioral features.}

Every time a roll-out is made with a policy, the policy parameters and the corresponding outcome vector are stored in an archive. In addition, one stores the full (state, action) trajectory and the extrinsic reward observations: these observations are used in the second phase, but they are not used in the data collection achieved by the goal exploration process.

The GEP algorithm then repeats the following steps (Figure below):

- sample a goal at random in the outcome space,
- find the nearest neighbor in outcome space archive and select the associated policy,
- add Gaussian noise to the policy parameters and play the perturbed policy in the environment to obtain a new outcome,
- save the new (policy, outcome) pair in the archive.
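The loop above can be sketched as follows. This is a minimal illustration: `env_rollout`, the uniform goal space in [-1, 1] and the random bootstrap phase are assumptions of this sketch, not the paper's exact implementation:

```python
import numpy as np

def gep(env_rollout, policy_dim, outcome_dim, n_episodes, n_bootstrap=10,
        sigma=0.1, seed=0):
    """Minimal goal exploration process. `env_rollout(theta)` is an assumed
    helper: it plays policy parameters theta for one episode and returns the
    outcome vector (e.g. the min/max x-positions of the trajectory in CMC)."""
    rng = np.random.default_rng(seed)
    policies, outcomes = [], []
    # bootstrap the archive with a few random policies
    for _ in range(n_bootstrap):
        theta = rng.uniform(-1.0, 1.0, policy_dim)
        policies.append(theta)
        outcomes.append(env_rollout(theta))
    for _ in range(n_episodes):
        goal = rng.uniform(-1.0, 1.0, outcome_dim)           # 1. sample a random goal
        dists = np.linalg.norm(np.asarray(outcomes) - goal, axis=1)
        theta = policies[int(np.argmin(dists))]              # 2. nearest-neighbor policy
        theta = theta + sigma * rng.normal(size=policy_dim)  # 3. Gaussian perturbation
        policies.append(theta)                               # 4. archive the new pair
        outcomes.append(env_rollout(theta))
    return policies, outcomes
```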
_{ GEP performs efficient exploration because the nearest-neighbor selection mechanism introduces a selection bias toward policies showing original outcomes. In the right-hand figure above, the point in light green has much less chance of being selected as the nearest neighbor of a randomly sampled goal than the dark green outcomes. The dark green outcomes, located at the frontier of behavioral clusters, show more novel behaviors. By selecting them, the algorithm tries to extend these clusters to cover the whole outcome space. }

Other implementations of goal exploration processes perform curiosity-driven goal sampling (e.g. maximizing competence progress) in step 1, or use regression-based forward and inverse models in step 2. However, the very simple form of goal exploration used here was previously shown to be very efficient at discovering a diversity of outcomes. This may seem surprising at first, as it does not include any explicit measure of behavioral novelty. Yet, when one samples a random goal, there is a high probability that it corresponds to a vector in outcome space outside the cloud of outcome vectors already produced. As a consequence, the nearest neighbor of that goal is likely to be an outcome on the edge of what has been discovered so far, and trying a stochastic variation of the corresponding policy tends to push this edge further, thus discovering novel behaviors. So this simple GEP implementation behaves similarly to the Novelty Search algorithm [Lehman & Stanley, 2011], yet without ever measuring novelty explicitly.

Besides, one can note that GEP maintains a population of solutions by storing each (policy, outcome) pair in memory. This prevents catastrophic forgetting and enables one-shot learning of novel outcomes: the policies associated with these behaviors can easily be retrieved from memory by using a nearest-neighbor search in the space of outcomes and taking the corresponding policy.

_{ Running forward }

_{Running backward }

_{ Falling }

## GEP-PG

After a few GEP episodes, the actions, states and rewards experienced are loaded into the replay buffer of DDPG. DDPG is then run with a randomly initialized actor and critic, but benefits from the content of this replay buffer.

## DDPG fails on Continuous Mountain Car

Perhaps the most surprising result of our study is that DDPG, which is considered a state-of-the-art method in deep reinforcement learning with continuous actions, does not perform well on the very low dimensional Continuous Mountain Car benchmark, where simple exploration methods can easily find the goal.

In CMC, until the car reaches the top for the first time, the gradient of performance points towards standing still in order to avoid negative penalties corresponding to energy expenses. Thus, the gradient is deceptive as it drives the agent to a local optimum where the policy fails.

Below we show the effect of this deceptive gradient: the form of exploration used in DDPG can escape the local optimum by chance, but the average time to reach the goal for the first time is worse than with purely random exploration. Using noise on the actions, DDPG finds the top within the first 50 episodes only 22% of the time; using policy parameter perturbations, 42% of the time. By contrast, because its exploration strategy is insensitive to the performance gradient, GEP quickly reaches the goal for the first time, regardless of the policy's complexity (either a simple linear policy or the same architecture as DDPG's).

When the replay buffer of DDPG is filled with GEP trajectories, good policies are found more consistently. Note that although GEP-PG reaches better performance than DDPG alone during learning (see the histogram of the best policies found across learning), it sometimes forgets these policies afterwards (see the learning curves), due to known instabilities of DDPG [Henderson et al., 2017].

_{ Good GEP-PG policy }

## GEP-PG obtains state-of-the-art results on Half-Cheetah

On the Half-Cheetah benchmark, GEP-PG runs 500 episodes of GEP, then switches to one of two DDPG variants, which we call action perturbation and parameter perturbation. Both GEP-PG variants (dark blue and red) significantly outperform their DDPG counterparts (light blue and red), as well as GEP alone (green). The variance of performance across runs is also smaller for GEP-PG than for DDPG alone.

## What makes a good replay buffer

A question that remains unresolved is: what makes a good replay buffer? To answer it, we looked for correlations between the final performance of GEP-PG on Half-Cheetah and a list of candidate factors. We ran GEP-PG with various replay buffer sizes (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000 episodes) and a fixed budget for the DDPG part (1500 episodes). We found that GEP-PG's performance is not a function of the replay buffer size: filling the buffer with only 100 GEP episodes is enough to bootstrap DDPG. However, the quality and diversity of the buffer are important factors. The performance of GEP-PG correlates significantly with the buffer's quality:

- the final performance of GEP (p<2\times10^{-6})
- the average performance of GEP during training (p<4\times10^{-8})
but also to the buffer’s diversity, as quantified by various measures:

- the standard deviation of the performances obtained by GEP during training, which quantifies the diversity of performances reached (p<3\times10^{-10})
- the standard deviation of the observation vectors averaged across dimensions, which quantifies the diversity of sensory inputs (p<3\times10^{-8})
- outcome diversity measured by the average distance to the k-nearest neighbors in outcome space (for various k), normalized by the average distance to the 1-nearest neighbor under a uniform distribution, which makes it insensitive to sample size (p<4\times10^{-10})
- the percentage of cells filled when the outcome space is discretized (for various numbers of cells); setting the number of cells equal to the number of points makes the measure insensitive to this number (p<4\times10^{-5})
- the entropy of the discretized outcome distribution, for various numbers of cells (p<6\times10^{-7})
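Two of the diversity measures above can be sketched in a few lines. These are illustrative implementations (without the normalizations used for the reported statistics), not the exact code used in the study:

```python
import numpy as np

def knn_diversity(outcomes, k=1):
    """Average distance to the k nearest neighbors in outcome space:
    larger values mean the achieved outcomes are more spread out."""
    outcomes = np.asarray(outcomes, dtype=float)
    # Pairwise Euclidean distances, with the diagonal masked out.
    dists = np.linalg.norm(outcomes[:, None, :] - outcomes[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    return np.sort(dists, axis=1)[:, :k].mean()

def cell_coverage(outcomes, bins=10):
    """Fraction of cells filled when each outcome dimension is
    discretized into `bins` intervals over the observed range."""
    outcomes = np.asarray(outcomes, dtype=float)
    lo, hi = outcomes.min(axis=0), outcomes.max(axis=0)
    cells = np.floor((outcomes - lo) / (hi - lo + 1e-12) * bins).astype(int)
    cells = np.clip(cells, 0, bins - 1)
    return len({tuple(c) for c in cells}) / bins ** outcomes.shape[1]
```

A buffer whose outcomes collapse onto a few points scores near zero on both measures, while a buffer covering the outcome space scores high, which is exactly the property that correlates with GEP-PG's final performance.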

_{GEP-PG versus GEP performance, color represents the buffer size }

_{GEP-PG performance versus diversity score}These correlations show that a good replay buffer should be both efficient and diverse (in terms of outcomes or observations). Implementing exploration strategies that explicitly target these two objectives would therefore likely further improve the performance of GEP-PG. GEP, as used here, only aims at maximizing the diversity of outcomes. Quality-Diversity algorithms (e.g. MAP-Elites, Behavioral Repertoire Evolution), on the other hand, optimize for both objectives, and could therefore be strong candidates to replace GEP.

## Future work

In our work, we have presented the general idea of decoupling exploration and exploitation in deep reinforcement learning algorithms, and proposed a specific implementation of this idea using GEP and DDPG. In future work, we will investigate other implementations using different exploration algorithms (Novelty Search, MAP-Elites, BR-Evo) or different gradient-based methods (ACKTR, SAC). More sophisticated implementations of IMGEP could also be used to improve the efficiency of exploration (e.g. curiosity-driven goal exploration or modular goal exploration).

Another aspect concerns the way GEP and DDPG are combined. Here, we studied the simplest possible combination: filling the replay buffer of DDPG with GEP trajectories. This transfer method causes a drop in performance at switch time, because DDPG starts from newly initialized actor and critic networks. In future work, we will try to avoid this drop by bootstrapping the actor network from GEP data at switch time: for instance, the DDPG actor could be trained in a supervised manner on (observation, action) samples generated by the best GEP policy. We could also use a multi-armed bandit to alternate between GEP and DDPG, actively selecting the learning strategy that maximizes learning efficiency.

## References

- Colas, C., Sigaud, O., & Oudeyer, P. Y. (2018). GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms, ICML 2018 link
- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. link
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489. link
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. ICLR 2016 link
- Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., & Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (pp. 1471-1479). link
- Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017, May). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML) (Vol. 2017). link
- Lehman, J., & Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2), 189-223. link
- Cully, A., & Demiris, Y. (2018). Quality and diversity optimization: A unifying modular framework. IEEE Transactions on Evolutionary Computation, 22(2), 245-259. link
- Cully, A., Clune, J., Tarapore, D., & Mouret, J. B. (2015). Robots that can adapt like animals. Nature, 521(7553), 503. link
- Forestier, S., Mollard, Y., & Oudeyer, P.-Y. (2017). Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning, 1–21. link
- Benureau, F. C., & Oudeyer, P. Y. (2016). Behavioral diversity generation in autonomous exploration through reuse of past experience. Frontiers in Robotics and AI, 3, 8.link
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2017). Deep reinforcement learning that matters. AAAI 2018. link
- Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., … & Andrychowicz, M. (2017). Parameter space noise for exploration. ICLR 2018 link
- Mouret, J. B., & Clune, J. (2015). Illuminating search spaces by mapping elites. link
- Cully, A., & Mouret, J. B. (2013, July). Behavioral repertoire learning in robotics. In Proceedings of the 15th annual conference on Genetic and evolutionary computation (pp. 175-182). ACM. link
- Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems (pp. 5285-5294). link
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. link
- Lopes, M., & Oudeyer, P. Y. (2012, November). The strategic student approach for life-long exploration and learning. In IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), (pp. 1-8). link
- Nguyen, M., & Oudeyer, P. Y. (2012). Active choice of teachers, learning strategies and goals for a socially guided intrinsic motivation learner. Paladyn Journal of Behavioral Robotics, 3(3), 136-146. link
## Contact

Email: cedric.colas@inria.fr

Twitter of Flowers lab: @flowersINRIA
