## Abstract

In many reinforcement learning tasks, the goal is to learn a policy to manipulate an agent, whose design is fixed, to maximize some notion of cumulative reward. The design of the agent's physical structure is rarely optimized for the task at hand. In this work, we explore the possibility of learning a version of the agent's design that is better suited for its task, jointly with the policy. We propose a minor modification to the Gym

## Introduction

Embodied cognition

While evolution shapes the overall structure of the body of a particular species, an organism can also change and adapt its body to its environment during its life. For instance, professional athletes spend their lives body training while also improving specific mental skills required to master a particular sport


We are interested in investigating embodied cognition within the reinforcement learning (RL) framework. Most baseline tasks

## Related Work

There is a broad literature in evolutionary computation, artificial life and robotics devoted to studying, and modelling embodied cognition*Strandbeests* creatures that can walk on their own consuming only wind energy.

*Strandbeest*

Literature in the area of passive dynamics studies robot designs that rely on the natural swings of motion of body components instead of deploying and controlling motors at each joint

## Method

In this section, we describe the method used for learning a version of the agent's design better suited for its task jointly with its policy. In addition to the weight parameters of our agent's policy network, we will also parameterize the agent's environment, which includes the specification of the agent's body structure. This extra parameter vector, which may govern the properties of items such as width, length, radius, mass, and orientation of an agent's body parts and their joints, will also be treated as a learnable parameter. Hence the weights $w$ we need to learn will be the parameters of the agent's policy network combined with the environment's parameterization vector. During a rollout, an agent initialized with $w$ will be deployed in an environment that is also parameterized with the same parameter vector $w$.
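One way to picture the combined parameter vector is as a flat array holding both the policy weights and the body parameters. The sketch below is illustrative only: `split_parameters`, `n_policy` and `n_env` are hypothetical names, and placing the policy weights first is an assumption rather than something the method prescribes.

```python
import numpy as np

def split_parameters(w, n_policy, n_env):
    """Split a flat solution vector into policy weights and environment parameters.

    Assumption: policy weights come first, followed by the environment
    (body design) parameters. The ordering is arbitrary.
    """
    w = np.asarray(w, dtype=float)
    if w.shape != (n_policy + n_env,):
        raise ValueError("expected a flat vector of length n_policy + n_env")
    return w[:n_policy], w[n_policy:]

# Hypothetical sizes: 10 policy weights and 8 body parameters.
w = np.arange(18, dtype=float)
policy_w, env_w = split_parameters(w, n_policy=10, n_env=8)
```

During a rollout, `policy_w` would configure the policy network and `env_w` would configure the body, so a single sampled vector determines both.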

The goal is to learn $w$ to maximize the expected cumulative reward, $E\left[R\left(w\right)\right]$, of an agent acting on a policy with parameters $w$ in an environment governed by the same $w$. In our approach, we search for $w$ using a population-based policy gradient method based on Section 6 of Williams' 1992 REINFORCE

Armed with the ability to change the design configuration of an agent's own body, we also wish to explore encouraging the agent to challenge itself by rewarding it for trying more difficult designs. For instance, carrying the same payload using smaller legs may result in a higher reward than using larger legs. Hence the reward given to the agent may also be augmented according to its parameterized environment vector.

## Experiments

### Learning better legs for better gait

*RoboschoolAnt-v1*


In this work, we experiment on continuous control environments from Roboschool. The *RoboschoolAnt-v1* environment features an agent called the *Ant*. Its body is supported by 4 legs, and each leg consists of 3 parts which are controlled by 2 motor joints. The bottom right diagram in the figure below describes the initial orientation of the agent.

*RoboschoolAnt-v1* environment

The above figure illustrates the learned agent design compared to the original design. With the exception of one leg part, it learns to develop longer, thinner legs while jointly learning to carry the body across the environment. While the original design is symmetric, the learned design breaks symmetry, and biases towards larger rear legs while jointly learning the navigation policy using an asymmetric body. The original agent achieved an average cumulative score of 3447 $\pm $ 251 over 100 trials, compared to 5789 $\pm $ 479 for an agent that learned a better body design.

*BipedalWalker-v2*

The Bipedal Walker series of environments is based on the Box2D*BipedalWalker-v2*

*BipedalWalker-v2* environment (left). Agent learns a body to allow it to bounce forward efficiently (right).

Keeping the head payload constant, and also keeping the density of materials and the configuration of the motor joints the same as the original environment, we only allow the lengths and widths for each of the 4 leg parts to be learnable, subject to the same range limit of $\pm $ 75% of the original design. In the original environment, the agent learns a policy that is reminiscent of a joyful skip across the terrain, achieving an average score of 347. In the learned version, the agent's policy is to hop across the terrain using its legs as a pair of springs, achieving a higher average score of 359.

### Joint learning of body design facilitates policy learning

*BipedalWalkerHardcore-v2*


Learning a better version of an agent's body not only helps achieve better performance, but also enables the agent to jointly learn policies more efficiently. We demonstrate this in the much more challenging *BipedalWalkerHardcore-v2*

*BipedalWalkerHardcore-v2*.

In this environment, our agent generally learns to develop longer, thinner legs, with the exception of the rear leg, where it developed a thicker lower limb that serves a useful stabilizing function for navigation. Its front legs, which are smaller and more manoeuvrable, also act as a sensor for dangerous obstacles ahead, complementing its LIDAR sensors. While learning to develop this newer structure, it jointly learns a policy to solve the task in 30% of the time it took the original, static version of the environment. The average score over 100 rollouts for the learnable version is 335 $\pm $ 37 compared to the baseline score of 313 $\pm $ 53.

### Optimize for both the task and desired design properties

Allowing an agent to learn a better version of its body enables it to achieve better performance. But what if we want to give back some of the additional performance gains and also optimize for desirable design properties that might not generally be beneficial for performance? For instance, we may want our agent to learn a design that utilizes the least amount of materials while still achieving satisfactory performance on the task. Here, we reward an agent for developing legs that are smaller in area, and augment its reward signal during training by scaling the rewards by a utility factor of $1+\mathrm{log}\left(\frac{\text{orig leg area}}{\text{new leg area}}\right)$. We see that augmenting the reward encourages development of smaller legs:
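As a rough illustration of the utility factor, the sketch below computes it for a given pair of leg areas. It assumes the logarithm is the natural log and that the factor multiplies a scalar reward; the function names are hypothetical, and the text does not specify whether the scaling is applied per step or to the cumulative reward.

```python
import math

def utility_factor(orig_leg_area, new_leg_area):
    """Return 1 + log(orig_leg_area / new_leg_area), using the natural log (assumption)."""
    if orig_leg_area <= 0 or new_leg_area <= 0:
        raise ValueError("leg areas must be positive")
    return 1.0 + math.log(orig_leg_area / new_leg_area)

def augmented_reward(reward, orig_leg_area, new_leg_area):
    """Scale a reward by the utility factor for the current leg area."""
    return reward * utility_factor(orig_leg_area, new_leg_area)

# An unchanged design leaves the reward as-is; smaller legs scale it up.
same_size = utility_factor(1.0, 1.0)
half_area = utility_factor(1.0, 0.5)
```

Under these assumptions the factor equals 1 when the leg area is unchanged and grows logarithmically as the legs shrink, so halving the area increases the factor by $\mathrm{log}\,2$.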

*BipedalWalker-v2* environment.

This reward augmentation resulted in a much smaller agent that is still able to support the same payload. In the easier *BipedalWalker* task, the agent's leg dimensions eventually shrink to near the lower bound of $\sim $ 25% of the original dimensions, with the exception of the heights of the top leg parts, which settled at $\sim $ 35% of the initial design, while still achieving an average (unaugmented) score of 323 $\pm $ 68. For this task, the leg area used is 8% of the original design.

*BipedalWalkerHardcore-v2*.

However, the agent is unable to solve the more difficult *BipedalWalkerHardcore* task using a similarly small body structure, due to the various obstacles presented. Instead, it learns to set the widths of each leg part close to the lower bound, and learns the shortest heights of each leg part required to navigate, achieving a score of 312 $\pm $ 69. Here, the leg area used is 27% of the original.

## Discussion and Future Work

We have shown that allowing a simple population-based policy gradient method to learn not only the policy, but also a small set of parameters describing the environment, such as the agent's body, offers many benefits. By allowing the agent's body to adapt to its task within some constraints, the agent can learn policies that are not only better for its task, but can also learn them more quickly.

The agent may discover design principles during this process of joint body and policy learning. In both the *RoboschoolAnt* and *BipedalWalker* experiments, the agent learned to break symmetry and develop relatively larger rear limbs to facilitate its navigation policy. While also optimizing for material usage for *BipedalWalker*'s limbs, the agent learns that it can still achieve the desired task even when the size of its legs is set to the minimum allowable size. Meanwhile, for the much more difficult *BipedalWalkerHardcore-v2* task, the agent learns the appropriate length of its limbs required for the task while still minimizing the material usage.

This approach may lead to useful applications in machine learning-assisted design, in the spirit of

In this work we have only explored using a simple population-based policy gradient method*during* a rollout to obtain a dense reward signal, but we find this impractical for realistic problems. Future work may look at separating the learning from dense rewards and sparse rewards into an inner loop and an outer loop, and also examine differences in performance and behaviours in structures learned using different learning approaches.

Separation of policy learning and body design into an inner loop and an outer loop will also enable the incorporation of evolution-based approaches to tackle the vast search space of morphology design, while utilizing efficient RL-based methods for policy learning. A limitation of the current approach is that our RL algorithm can learn to optimize only existing design properties of an agent's body, rather than learn truly novel morphology in the spirit of Karl Sims' *Evolving Virtual Creatures*

Nevertheless, our approach of optimizing the specifications of an existing design might be more practical for many applications. An evolutionary algorithm might come up with trivial designs and corresponding simple policies that outperform designs we actually want. For instance, a large ball that rolls forward will easily outperform the best bipedal walkers, but this might not be useful to a game designer who simply wants to optimize the dimensions of an existing robot character for a video game. Due to the vast search space of morphology, a search algorithm can easily come up with a trivial, but unrealistic or unusable design that exploits its simulation environment

Just as REINFORCE

*If you would like to discuss any issues or give feedback regarding this work, please visit the GitHub repository of this article.*

We would like to thank Luke Metz and Douglas Eck for their thoughtful feedback. This article was prepared using the Distill template.

## Open Source Code

The code to reproduce experiments in this article will be released at a later date.

## Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their caption.

## Configuration

All agents were implemented using 3-layer fully-connected networks with $\mathrm{tanh}$ activations. The agent in *RoboschoolAnt-v1* has 28 inputs and 8 outputs, all bounded between $-1$ and $+1$, with hidden layers of 64 and 32 units. The agents in *BipedalWalker-v2* and *BipedalWalkerHardcore-v2* have 24 inputs and 4 outputs, all bounded between $-1$ and $+1$, with 2 hidden layers of 40 units each.
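The layer sizes above can be sketched as a small NumPy forward pass. This is a minimal illustration, not the authors' implementation: the helper names, the use of bias terms, and the random initialization scale are all assumptions. Applying $\mathrm{tanh}$ on the final layer is one simple way to keep the outputs between $-1$ and $+1$.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes (biases are an assumption)."""
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def policy_forward(params, obs):
    """Forward pass with tanh on every layer, so outputs lie within [-1, 1]."""
    h = np.asarray(obs, dtype=float)
    for weights, bias in params:
        h = np.tanh(h @ weights + bias)
    return h

def count_params(params):
    """Total number of scalar weights and biases."""
    return sum(w.size + b.size for w, b in params)

rng = np.random.default_rng(0)
ant_params = init_params([28, 64, 32, 8], rng)      # RoboschoolAnt-v1 sizes
walker_params = init_params([24, 40, 40, 4], rng)   # BipedalWalker-v2 sizes

ant_action = policy_forward(ant_params, np.zeros(28))
walker_action = policy_forward(walker_params, np.zeros(24))
```

Flattening these weight arrays into a single vector gives the policy portion of a candidate solution $w$.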

## Training

Our population-based training experiments were conducted on 96-CPU core machines on Google Cloud Platform. Following the approach described in

## Population-based Policy Gradient Method

In this section we provide an overview of the population-based policy gradient method described in Section 6 of Williams' REINFORCE

$J\left(\theta \right)={E}_{\theta}\left[R\left(w\right)\right]=\int R\left(w\right)\,\pi \left(w,\theta \right)\,dw$

Using the *log-likelihood trick* allows us to write the gradient of $J\left(\theta \right)$ with respect to $\theta $:

${\nabla}_{\theta}J\left(\theta \right)={E}_{\theta}\left[\,R\left(w\right)\,{\nabla}_{\theta}\mathrm{log}\,\pi \left(w,\theta \right)\,\right]$.

With a population of size $N$, where we have solutions ${w}^{1}$, ${w}^{2}$, ..., ${w}^{N}$, we can estimate this as:

${\nabla}_{\theta}J\left(\theta \right)\approx \frac{1}{N}\sum_{i=1}^{N}\,R\left({w}^{i}\right)\,{\nabla}_{\theta}\mathrm{log}\,\pi \left({w}^{i},\theta \right)$.

With this approximated gradient ${\nabla}_{\theta}J\left(\theta \right)$, we can then optimize $\theta $ using gradient ascent:

$\theta \to \theta +\alpha {\nabla}_{\theta}J\left(\theta \right)$

and sample a new set of candidate solutions $w$ from the updated pdf, where $\alpha $ is the learning rate. We follow the approach in REINFORCE where $\pi $ is modelled as a factored multi-variate normal distribution. In this special case, $\theta $ will be the set of mean $\mu $ and standard deviation $\sigma $ parameters. Therefore, each element of a solution can be sampled from a univariate normal distribution ${w}_{j}\sim N({\mu}_{j},{\sigma}_{j})$. Williams derived closed-form formulas for the ${\nabla}_{\theta}\mathrm{log}N({w}^{i},\theta )$ term for each individual $\mu $ and $\sigma $ element of vector $\theta $ on each solution $i$ in the population:

${\nabla}_{{\mu}_{j}}\mathrm{log}N({w}^{i},\theta )=\frac{{w}_{j}^{i}-{\mu}_{j}}{{\sigma}_{j}^{2}},$ ${\nabla}_{{\sigma}_{j}}\mathrm{log}N({w}^{i},\theta )=\frac{{\left({w}_{j}^{i}-{\mu}_{j}\right)}^{2}-{\sigma}_{j}^{2}}{{\sigma}_{j}^{3}}$.

For clarity, we use subscript $j$ to count across parameter space in $w$, and this is not to be confused with superscript $i$, used to count across each sampled member of the population of size $N$. Combining the last two equations, we can update ${\mu}_{j}$ and ${\sigma}_{j}$ at each generation via a gradient update.
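The update described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the training code used for the experiments: `reinforce_step` and `reward_fn` are hypothetical names, the population size and learning rate are arbitrary, and the lower clamp on $\sigma $ is an added safeguard that the derivation does not include.

```python
import numpy as np

def reinforce_step(mu, sigma, reward_fn, rng, pop_size=64, lr=0.05):
    """One generation of the population-based REINFORCE update for theta = (mu, sigma)."""
    # Sample N candidate solutions w^i ~ N(mu, sigma), one row per population member.
    w = mu + sigma * rng.standard_normal((pop_size, mu.size))
    rewards = np.array([reward_fn(w_i) for w_i in w], dtype=float)

    # Closed-form gradients of log N(w^i, theta) for each mu_j and sigma_j.
    diff = w - mu
    grad_log_mu = diff / sigma**2
    grad_log_sigma = (diff**2 - sigma**2) / sigma**3

    # Monte Carlo estimate: (1/N) * sum_i R(w^i) * grad log pi(w^i, theta).
    grad_mu = (rewards[:, None] * grad_log_mu).mean(axis=0)
    grad_sigma = (rewards[:, None] * grad_log_sigma).mean(axis=0)

    # Gradient ascent step with learning rate alpha = lr.
    new_mu = mu + lr * grad_mu
    # Assumption: clamp sigma so it stays strictly positive.
    new_sigma = np.maximum(sigma + lr * grad_sigma, 1e-3)
    return new_mu, new_sigma, rewards

# Toy usage on a hypothetical quadratic reward.
rng = np.random.default_rng(0)
mu, sigma = np.zeros(3), np.ones(3)
for _ in range(5):
    mu, sigma, rewards = reinforce_step(
        mu, sigma, lambda w: -np.sum((w - 1.0) ** 2), rng)
```

Here each row of `w` corresponds to a superscript $i$ and each column to a subscript $j$. Using raw rewards as above gives a high-variance estimate; practical implementations commonly subtract a baseline or normalize the rewards, which is outside what this section describes.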

## Bloopers

For those of you who made it this far, we would like to share some “negative results”: things that we tried but that didn't work. In the experiments, we constrain the elements in the modified design to be within $\pm $ 75% of the original design's values. We accomplish this by defining a scaling factor for each learnable parameter as $1.0+0.75\,\mathrm{tanh}\left({w}_{k}\right)$, where ${w}_{k}$ is the ${k}^{\text{th}}$ element of the environment parameter vector, and multiplying the original design's value by this scaling factor. We find that this approach works well, as it usually preserves the intention and *essence* of the original design.
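The bounded scaling factor is simple enough to show directly. In this sketch the function name and the `max_change` argument are hypothetical; only the formula $1.0+0.75\,\mathrm{tanh}\left({w}_{k}\right)$ comes from the text.

```python
import math

def scaled_design_value(original_value, w_k, max_change=0.75):
    """Scale an original design value by 1.0 + max_change * tanh(w_k)."""
    return original_value * (1.0 + max_change * math.tanh(w_k))

# w_k = 0 reproduces the original design; large |w_k| saturates at the bounds.
unchanged = scaled_design_value(2.0, 0.0)
upper = scaled_design_value(1.0, 50.0)
lower = scaled_design_value(1.0, -50.0)
```

Because $\mathrm{tanh}$ saturates at $\pm 1$, the learned value stays between 25% and 175% of the original no matter how large ${w}_{k}$ becomes, which is what keeps the search close to the original design.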

We also tried to let the RL algorithm discover new designs without any constraints, and found that during the initial learning phase it would usually create longer rear legs so that it could tumble over further down the map to achieve higher rewards.

Using a lognormal scaling factor of $\mathrm{exp}\left({w}_{k}\right)$ made it easier for the RL algorithm to come up with an extremely tall bipedal walker agent that “solves” the task by simply falling over and landing at the exit: