Policy Transfer via Kinematic Domain Randomization and Adaptation
Abstract
Transferring reinforcement learning policies trained in physics simulation to real hardware remains a challenge, known as the “sim-to-real” gap. Domain randomization is a simple yet effective technique to address dynamics discrepancies across source and target domains, but its success generally depends on heuristics and trial-and-error. In this work we investigate the impact of randomized parameter selection on policy transferability across different types of domain discrepancies. Contrary to common practice in which kinematic parameters are carefully measured while dynamic parameters are randomized, we found that virtually randomizing kinematic parameters (e.g., link lengths) during training in simulation generally outperforms dynamic randomization. Based on this finding, we introduce a new domain adaptation algorithm that utilizes simulated kinematic parameter variation. Our algorithm, Multi-Policy Bayesian Optimization, trains an ensemble of universal policies conditioned on virtual kinematic parameters and efficiently adapts to the target environment using a limited number of target domain rollouts. We showcase our findings on a simulated quadruped robot in five different target environments covering different aspects of domain discrepancies.
I Introduction
The advent of Deep Reinforcement Learning (DRL) has provided a promising approach to designing robotic controllers for diverse robot motor skills [1, 24, 8, 5]. Nevertheless, DRL comes with the caveat of a very high demand for training data, which makes direct training in the real world cumbersome, if not infeasible. High-fidelity physics simulators offer a way to bypass this practical obstacle but introduce a further complication: policies trained in simulation often fail to transfer to the real world due to modeling discrepancies between simulated and real environments. This is known as the sim-to-real gap [7].
Domain randomization is an approach that directly addresses the sim-to-real gap. Ample evidence has shown that randomizing the parameters in the simulator during policy training leads to a policy that can perform well when tested in a different, target environment [16, 5, 9, 17]. However, the process of domain randomization depends heavily on heuristics and trial-and-error. Conventionally, practitioners randomize a selected set of parameters that are believed to have high impact or are difficult to measure precisely in the target environment, such as center of mass, latency, or friction coefficients.
What would be the effect if we intentionally randomized simulation parameters that are known, or can be easily measured? In this work, we first systematically investigate the impact of randomized parameter selection for a quadruped locomotion task on a variety of reality gaps. We analyze the parameters of a dynamic system in two categories: kinematic and dynamic parameters. We define as kinematic parameters those parameters that are required for computing forward kinematics on a robot, such as the link lengths, joint orientations, or joint degrees-of-freedom, while dynamic parameters include everything else. We engineer three types of domain discrepancies that mimic reality gaps to analyze the impact of different parameter categories with finer granularity: Kinematic Gap, Dynamic Gap, and Environment Gap. Using the same definition, a kinematic gap refers to deviations in one or more kinematic parameters between the source and target environments, and a dynamic gap to deviations in one or more dynamic parameters. We add environment gap as a third case to indicate differences that impact the agent’s policy but not the parameters of the agent itself. In the context of locomotion, the environment gap is typically related to the type of surface the robot is walking on.
Contrary to common practice in which kinematic parameters are carefully measured while dynamic parameters are randomized, we found that randomizing kinematic parameters during training in simulation produces the best-performing policy across all three types of gaps. On the other hand, randomizing dynamic parameters only produces a policy capable of overcoming the dynamic gap itself, and typically fails on other unmodelled gaps. We hypothesize that randomizing kinematic parameters results in wider exploration of the state space because it has a more global impact on the system dynamics: it effectively modifies the Jacobian matrix of the center of mass for each link in an articulated robot system. In contrast (and perhaps counterintuitively), randomizing dynamic parameters affects only specific aspects of the dynamic system. For example, inertial properties only take effect when the joint acceleration is high, and friction coefficients only matter when particular contact states occur.
Based on the above observation, a natural next step is to investigate whether domain adaptation methods based on universal policies (UP) [22] can also work well when conditioning on varying kinematic parameters. To train such a UP, we randomize the training environment with varying robot geometries and inform the UP of the kinematics parameters as part of its policy input. This creates diverse control behaviors parameterized by kinematics values. During deployment in the target environment, although the physical robot has fixed geometry, we can still treat the virtual kinematics input as a control knob and search for the virtual input value that maximizes target domain task performance with a small number of trials. UPs conditioned on kinematic parameters improve target domain performance as expected, but we further notice that they depend strongly on the random seeds used to initialize the policy optimization: some seeds are better for certain types of gaps than others. As such, we introduce a simple algorithm, which we call Multi-Policy Bayesian Optimization (MPBO), that utilizes an ensemble of UPs from different random seeds to further improve domain generalization. MPBO combines Bayesian optimization (BO) [6] and the Upper-Confidence-Bound Action Selection (UCB-AS) algorithm [15] from multi-armed bandit problems to determine the most effective UP as well as its optimal kinematic parameters using a limited number of rollouts.
We show experiments in simulation using a Laikago quadruped robot model and five different target environments. Our results suggest that randomizing kinematic parameters leads to a more generalizable policy across multiple types of sim-to-real gaps. Additionally, we show that MPBO can further improve domain generalization for all the environments we evaluated, with only a moderate number of rollouts in the target environment.
II Related Work
Despite the tremendous progress in Deep Reinforcement Learning (DRL) for training complex motor skills such as walking [24], flying [20], and parkour [8], applying DRL to real robotic control problems is still challenging due to the sim-to-real gap. Efforts in enabling simulation-trained policies to be applied to real robots have largely focused on two complementary fronts: 1) improving the simulation model to better match the real-world dynamics, and 2) improving the policy training process such that it can generalize to a large variety of situations including the real world. To improve the simulation model, researchers have proposed new algorithms for making the simulation models more expressive, identified key factors that cause the reality gap, and demonstrated successful deployment of simulation-trained policies on real robots [5, 16, 4]. For instance, Hwangbo et al. trained a neural network for the actuator model, which is combined with an analytical rigid-body simulator to generate training data for a quadruped robot [5], thereby achieving direct transfer to the real quadruped robot.
Another key direction in combating the reality gap is to improve the generalization capability of the trained policies. One of the most commonly used techniques in this category is domain randomization, where a robust policy is trained with randomized simulator parameters for the robot [9, 16, 1, 13, 11]. By forcing the control policy to learn actions that can work for different simulation parameters during training, one can obtain a controller that is robust and generalizable to parameters outside the training range. To train an effective robust policy, researchers have explored numerous ways of introducing randomization to the simulation model, such as adding adversarial perturbation [11], adding latency in the model [16], and varying the dynamics parameters of the robot [9]. The design of these randomization schemes is largely inspired by the factors that contribute to the reality gap. For example, Tan et al. observed that latency and actuator modeling are two major sources of the reality gap, and by including them in the randomization scheme they were able to achieve successful sim-to-real transfer for a quadruped robot [16]. On the other hand, kinematics-related parameters, such as the length of the robot’s limb or the placement of the actuator, are rarely explored in domain randomization, as they can usually be measured to high precision and recreated in the simulation almost exactly. In this work, we demonstrate that randomizing kinematics-related parameters can in fact produce surprisingly strong performance compared to randomizing dynamics-related parameters.
Another approach for improving the generalization capability of the trained control policies is to fine-tune them using data collected from the real robot [21, 23, 10, 14, 2]. For example, Yu et al. proposed to learn a dynamics-conditioned policy in simulation and directly optimize the input dynamics parameters to the policy on the real hardware [21], showing results on a real biped robot. Recently, Peng et al. adopted a similar strategy and demonstrated that a real quadruped robot can learn from animal data by leveraging simulation and transfer learning [10]. Similar to prior work in domain randomization, these methods mostly focus on dynamics-related variations during the training of the policies and do not consider kinematics-related variations. In our work, we demonstrate that introducing kinematics-based variations in this class of methods also leads to improved sim-to-real performance. We also introduce an ensemble model-based approach that further improves the reliability of the framework proposed by Yu et al. [21].
III Background and Selection of Domain Randomization Parameters
We begin by presenting a few preliminaries useful for the problem under consideration. The robot task is formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, f, r, p_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $f$ is the system dynamics, $r$ is the reward function, $p_0$ is the initial state distribution, and $\gamma$ is the discount factor. We use a policy gradient method, PPO [12], to solve for a policy $\pi$ such that the accumulated reward is maximized:
$$J(\pi) = \mathbb{E}_{s_0, a_0, \ldots, s_T} \sum_{t=0}^{T} \gamma^{t} r(s_t, a_t), \qquad (1)$$
where $s_0 \sim p_0$, $a_t \sim \pi(s_t)$, and $s_{t+1} = f(s_t, a_t; \mu)$. Here, $\mu$ is a set of parameters pertaining to the robot that we can vary during training in simulation. We divide these parameters into two categories, kinematic parameters ($\mu_k$) and dynamic parameters ($\mu_d$). Kinematic parameters include those required for forward kinematic computation, assuming the robot can be represented as an articulated rigid body system. Specifically, $\mu_k$ includes every coefficient in the kinematic transformation chain from the robot frame to the local frame of every link. For example, the coefficients required to compute the transformation of the “foot frame” include the link lengths of the upper leg and the lower leg, the orientation of the hip, knee and ankle joints, and the location of the hip joint relative to the torso. The definition of dynamic parameters is simply $\mu_d = \mu \setminus \mu_k$.
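As a small illustration of the objective above, the following sketch computes the discounted sum of per-step rewards for a single rollout. The function name and the toy reward sequence are illustrative assumptions, not part of the original work.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one rollout."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy rollout with three unit rewards (made-up values).
rollout_rewards = [1.0, 1.0, 1.0]
print(discounted_return(rollout_rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Policy gradient methods such as PPO estimate the expectation of this quantity over many sampled rollouts rather than evaluating it for a single one.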
We further define three types of sim-to-real gaps frequently encountered in robotic applications (concrete examples of sim-to-real gaps are shown in Section V):
Kinematic Gap: One or more kinematic parameters $\mu_k$ have different values between the source and target environments. For example, the length or joint orientation of the robot legs differs.
Dynamic Gap: One or more dynamic parameters $\mu_d$ have different values between the source and target environments. For example, the mass distribution of the robot or the actuator modeling differs.
Environment Gap: Other parameters outside of $\mu$ in the simulator are different between the source and target environments. For example, the surfaces the robot walks on are made of different materials.
We randomize a selected set of $\mu_k$ and $\mu_d$ and train two types of policies, $\pi_k$ (kinematic randomization) and $\pi_d$ (dynamic randomization), using PPO [12]. These two policies are then tested on the three types of sim-to-real gaps. Note that we intentionally model these gaps using parameters different from those being randomized. The detailed results are shown in Section V, but here we highlight a few key results. First, $\pi_k$ performs well when transferred over a kinematic gap and $\pi_d$ performs well when transferred over a dynamic gap. Second, $\pi_k$ significantly outperforms $\pi_d$ when transferred over the opposite gap, as well as over the environment gap.
We hypothesize that randomizing kinematic parameters results in wider exploration of the state space because it has a more global impact on the system dynamics. A small change in $\mu_k$ will affect the kinematic transformation in the Jacobian matrix, which in turn affects the mass matrix, Coriolis force, gravitational force, and every Cartesian position-dependent applied force, such as contacts. In contrast (and perhaps counterintuitively), randomizing dynamic parameters affects only specific aspects of the dynamic system. For example, inertial properties only take effect when the joint acceleration is high, and friction coefficients only matter when particular contact states occur.
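To make the hypothesis more concrete, it may help to recall the standard equations of motion of an articulated rigid-body system. The notation below is textbook notation and an assumption on our part, not taken from this paper: $q$ are the joint positions, $\tau$ the joint torques, $m_i$ and $I_i$ the mass and world-frame inertia of link $i$, $J_{v,i}$ and $J_{\omega,i}$ the linear and angular Jacobians of the center of mass of link $i$, and $J_c$, $f_c$ the contact Jacobian and contact force.

```latex
M(q)\,\ddot{q} + C(q,\dot{q})\,\dot{q} + g(q) = \tau + J_c(q)^{\top} f_c,
\qquad
M(q) = \sum_i \left( m_i\, J_{v,i}^{\top} J_{v,i} + J_{\omega,i}^{\top} I_i\, J_{\omega,i} \right)
```

Link lengths and joint orientations enter every Jacobian in this expression, so perturbing them changes $M$, the Coriolis term derived from it, the gravity term, and the mapping of contact forces at once. A dynamic parameter such as a single $m_i$ or $I_i$ rescales only the terms in which it appears.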
IV Domain Adaptation using Kinematic Parameters
Domain adaptation techniques refer to a class of transfer methods that train a family of policies (universal policies, UP) conditioned on varying explicit physical parameters (e.g., [22]) or implicit latent parameters (e.g., [21, 10]). During simulation-based training, the environment dynamics change with the conditioned parameters, so that we obtain different strategies suited to different dynamics, parameterized by the conditioning input. When deploying the universal policy to the target environment, the conditioning input (purely virtual, as the geometry of the physical robot is fixed) can be quickly searched to find a well-performing strategy in the target domain with only a few rollouts, using sample-efficient optimizers such as Bayesian Optimization [6]. The key insight of these methods is that the conditioning parameters that lead to good performance do not need to have physical correspondence (for latent space methods they have no physical meaning at all); the only relevant aspect is that the learned strategies are diverse enough that the likelihood of good adaptation is high.
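The conditioning mechanism can be sketched as follows: the conditioning parameters are appended to the observation, so that changing them at deployment selects a different behavior from the same network. The single linear layer standing in for a trained UP, the dimensions, and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def universal_policy_input(observation: np.ndarray, conditioning: np.ndarray) -> np.ndarray:
    """Concatenate the robot observation with the conditioning parameters."""
    return np.concatenate([observation, conditioning])


def toy_universal_policy(policy_input: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained UP: one linear layer followed by tanh."""
    return np.tanh(weights @ policy_input)


rng = np.random.default_rng(0)
obs_dim, cond_dim, act_dim = 6, 4, 3  # illustrative sizes only
weights = rng.normal(size=(act_dim, obs_dim + cond_dim))
observation = rng.normal(size=obs_dim)

# Same observation, two different virtual conditioning inputs.
nominal = np.ones(cond_dim)
virtual = np.array([0.7, 1.2, 1.0, 1.4])
action_nominal = toy_universal_policy(universal_policy_input(observation, nominal), weights)
action_virtual = toy_universal_policy(universal_policy_input(observation, virtual), weights)
print(action_nominal, action_virtual)
```

During adaptation only the conditioning vector is searched; the network weights stay fixed.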
As we shall see in the results of Section V, an adapted UP can potentially outperform a policy trained with kinematic domain randomization. However, we also observed that the transfer ability of a policy in the target environments varies across training seeds, despite all seeds producing policies with virtually identical training environment performance. This seems to imply that policy transfer performance could be further improved if we were able to perform UP-based policy adaptation with a policy ensemble rather than a single UP as in previous works. However, allocating limited target domain rollouts to multiple policies during adaptation poses a challenge. In what follows, we explore how to prioritize policy and parameter value sampling on a limited budget of rollouts.
IV-A Multi-Policy Bayesian Optimization
Given an ensemble of UPs trained with different seeds, we wish to bias target domain sampling towards the current most promising UP, but without over-exploitation. We propose Multi-Policy Bayesian Optimization (MPBO), a hybrid procedure based on a combination of Bayesian optimization (BO) [6] and the Upper-Confidence-Bound Action Selection (UCB-AS) algorithm [15] from the field of multi-armed bandit problems. Given a reward evaluation function that takes a conditioning parameter sample and a universal policy as input and returns the accumulated reward, BO progressively builds a Gaussian process (GP) regressor [19] of input vs. obtained reward for that policy based on the history of samples. This regressor is then used to create an acquisition function, whose role is to suggest new sample points [6]. In MPBO, we accordingly create an ensemble of GPs, one for each UP. The acquisition function returns a new sample point suggestion (i.e., a new parameter input value) for each policy, along with a metric called expected improvement ($\mathrm{EI}$), which indicates how useful the new sample is expected to be for the optimization. We only sample from the most promising policy/parameter value at the current iteration, rather than sampling from all of them. The process of selecting which policy to sample is similar to a multi-armed bandit problem: we wish to prioritize the sampling of policies that seem promising overall, but also explore policies that have received less attention. Specifically, we adopt the technique “Upper-Confidence-Bound Action Selection”, which suggests sampling the option with the highest following value:
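The per-policy BO step described above can be sketched with a small NumPy-only Gaussian process and an expected-improvement acquisition. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the zero-mean GP with a unit-variance RBF kernel, the length scale, the one-dimensional input, the candidate grid, and the standardized reward values are all made up.

```python
import math

import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, length_scale: float = 0.3) -> np.ndarray:
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # The RBF kernel has prior variance 1 at every point.
    var = 1.0 - np.einsum("ij,ji->i", k_qt, np.linalg.solve(k_tt, k_qt.T))
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over the best observed value (maximization)."""
    z = (mean - best_so_far) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best_so_far) * cdf + std * pdf


x_train = np.array([[0.6], [1.0], [1.4]])   # conditioning inputs sampled so far
y_train = np.array([0.2, -1.0, 0.8])        # standardized rollout returns (made up)
candidates = np.linspace(0.5, 1.5, 101)[:, None]
mean, std = gp_posterior(x_train, y_train, candidates)
ei = expected_improvement(mean, std, y_train.max())
suggestion = candidates[np.argmax(ei)]
print(suggestion, ei.max())
```

In MPBO one such regressor would be kept per UP, and each would return its own suggestion together with the corresponding expected improvement.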
$$\mathrm{UCB}_i = \bar{R}_i + c \sqrt{\frac{\ln t}{N_i}}, \qquad (2)$$
where $\bar{R}_i$ is the average reward (over all samples) for that option so far, $t$ is the algorithm iteration number (starting from 1), $N_i$ is the number of times this option has been selected before, $i$ is the option index, and $c$ is a constant that balances exploration vs. exploitation. In our context, $\bar{R}_i$ is the mean reward that policy $i$ has produced over all its sampled parameter values so far and $N_i$ is the number of times policy $i$ has been selected for sampling. Combining this metric with expected improvement ($\mathrm{EI}_i$), the final selection criterion is sampling the policy/parameter value with the highest product $\mathrm{UCB}_i \cdot \mathrm{EI}_i$. After the budget of rollouts has been exhausted, MPBO returns the best policy, best parameter input, and expected reward. The MPBO procedure is summarized in Alg. 1.
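The selection rule can be sketched as below: an upper-confidence-bound value in the form given by Sutton and Barto [15] is multiplied by each policy's expected improvement, and the policy with the largest product is sampled next. The function names, the requirement that every policy has been sampled at least once, the assumption of non-negative mean rewards, and the numbers in the example are illustrative assumptions.

```python
import math
from typing import Sequence


def ucb_score(mean_reward: float, iteration: int, times_selected: int, c: float) -> float:
    """Upper-confidence-bound value of one option."""
    if times_selected < 1:
        raise ValueError("each policy is assumed to have been sampled at least once")
    return mean_reward + c * math.sqrt(math.log(iteration) / times_selected)


def select_policy(
    mean_rewards: Sequence[float],
    counts: Sequence[int],
    expected_improvements: Sequence[float],
    iteration: int,
    c: float = 1.0,
) -> int:
    """Index of the policy with the highest UCB * EI product."""
    scores = [
        ucb_score(r, iteration, n, c) * ei
        for r, n, ei in zip(mean_rewards, counts, expected_improvements)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Three policies after eight rollouts in total (made-up, normalized rewards).
chosen = select_policy(
    mean_rewards=[0.6, 0.5, 0.4],
    counts=[5, 2, 1],
    expected_improvements=[0.01, 0.02, 0.02],
    iteration=9,
)
print(chosen)  # 2: the least-sampled policy wins through its exploration bonus
```

With equal expected improvement, the exploration bonus favors the policy that has received fewer rollouts; a policy with a high mean reward but little expected improvement is deprioritized.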
V Experiments
In this section, we design a set of experiments to answer the following questions: A. Do both kinematics and dynamics domain randomization transfer well when the actual domain discrepancy is related by type to their randomized aspects? B. Does kinematics domain randomization achieve better transfer performance on unmodelled discrepancies compared to the typical dynamics domain randomization? C. Can a kinematics domain UP optimized with MPBO further improve the transfer performance?
V-A Experiment setup
We use an 18-DoF quadruped robot modelled on the Unitree Laikago [18] as our experiment platform. The robot is simulated in PyBullet [3]. Fig. 1(a) shows its real kinematic structure. To introduce virtual kinematics variations during simulation-based training, we consider the length scale of the two links of the front leg pair as well as the back leg pair, leading to a total of four parameters: the scales of the upper and lower links of the front legs, and the scales of the upper and lower links of the back legs. These parameters are drawn uniformly in $[0.5, 1.5]$ in both Kinematics Randomization and Kinematics UP, that is, ranging from a 50% decrease to a 50% increase in link length. A randomly sampled virtual design is shown in Fig. 1(b).
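A minimal sketch of the sampling step, assuming one new virtual design per training episode: four link-length scales are drawn uniformly between 0.5 and 1.5. The parameter labels and the per-episode schedule are assumptions made for illustration, and the simulator-specific step of rebuilding the robot model is left out.

```python
import numpy as np

# Hypothetical labels for the four scale parameters.
LINK_NAMES = ("front_upper", "front_lower", "back_upper", "back_lower")


def sample_link_scales(rng: np.random.Generator, low: float = 0.5, high: float = 1.5) -> dict:
    """Draw one virtual design: four link-length scales, uniform in [low, high]."""
    scales = rng.uniform(low, high, size=len(LINK_NAMES))
    return dict(zip(LINK_NAMES, scales))


rng = np.random.default_rng(seed=0)
for episode in range(3):
    design = sample_link_scales(rng)
    # A simulator-specific step would rebuild the robot model with these scales here.
    print(episode, {name: round(float(scale), 3) for name, scale in design.items()})
```

For a kinematics UP the same four values would also be appended to the policy input; for plain kinematic randomization they would only change the simulated robot.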
All generated policies consist of a 3-layer feedforward neural network where the input is the robot observation, along with the kinematics parameters in the case of kinematics UPs. The kinematics parameters are constant throughout each rollout, in a manner similar to goal-conditioned policies. We apply position control: the policy output corresponds to the change in desired joint positions, and the position errors are used to compute the required torque with a PD control scheme. The observation vector consists of the joint states along with the root global orientation, root height, and root linear velocities. The control frequency is set to 50 Hz. The reward function is designed to encourage walking forward with smooth joint motion and low energy; its terms include the root forward velocity and the number of joints at their joint limits.
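The position-control scheme can be sketched as follows. The gains, the torque limit, the choice of 12 actuated joints (18 DoF minus a 6-DoF floating base), and the convention that the action is added to the previous joint target are all assumptions made for illustration; the paper does not state these values.

```python
import numpy as np


def pd_torque(q, q_dot, q_target, kp, kd, torque_limit):
    """PD law: torque from position error and joint velocity, clipped to the actuator limit."""
    torque = kp * (q_target - q) - kd * q_dot
    return np.clip(torque, -torque_limit, torque_limit)


num_joints = 12                      # assumed number of actuated joints
q = np.zeros(num_joints)             # current joint positions
q_dot = np.zeros(num_joints)         # current joint velocities
q_target = q.copy()                  # previous desired joint positions

action = np.full(num_joints, 0.05)   # policy output: change in desired joint positions
q_target = q_target + action
tau = pd_torque(q, q_dot, q_target, kp=40.0, kd=1.0, torque_limit=30.0)
print(tau)  # each entry is kp * 0.05 = 2.0 for this toy state
```

At a 50 Hz control frequency, one such target update would be issued every 20 ms of simulated time.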
V-B Target Environments
We consider the following types of gaps between training and target settings: (a) dynamics gaps originating from modeling errors inherent to robot dynamics, (b) kinematic gaps representing discrepancies in the robot kinematics, and (c) environment gaps where unmodelled effects are introduced to the surrounding environment that the robot is moving within. To this end, we evaluate the ability of policies to transfer to unseen environments by introducing five target environment variants:

Dynamics – Low Power: The maximum allowed torque in the front left leg is reduced by half, to mimic broken motors.

Dynamics – Back-EMF: Increased joint angular velocity reduces the ability to apply torque, mimicking back-EMF motor forces.

Kinematics – Joint Orientation: All set points (zero angle) of the robot joints are perturbed by different constants. Note that this is not in the list of kinematics randomization parameters, which only includes link lengths.

Environment – Deform: Here, the fixed and solid floor is replaced by a large deformable cube; see Fig. 2(a). Note that deformable bodies are simulated in PyBullet independently using FEM, and the parameters controlling their properties are thus not related to randomized rigid-body parameters such as restitution.

Environment – Soft: In this target environment the floor is soft and the robot legs sink in, similar to muddy terrain; see Fig. 2(b).
These gaps are made challenging enough that a policy trained to convergence without randomization performs poorly in them (Table I, first row). More importantly, these gaps emphasize unmodelled effects not covered by the selected randomized parameters, which mimics the realistic challenge that there are always aspects of the real hardware that we cannot randomize well.
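The two dynamics gaps can be illustrated with simple torque-limit modifications. This is a sketch under stated assumptions rather than the paper's implementation: the joint indices of the front left leg, the nominal limit, and in particular the linear speed-dependent model used for back-EMF are hypothetical.

```python
import numpy as np


def low_power_limits(nominal_limits, weak_joint_indices, factor=0.5):
    """Scale the torque limit of selected joints, e.g. halve it for one leg."""
    limits = np.array(nominal_limits, dtype=float)
    limits[list(weak_joint_indices)] *= factor
    return limits


def back_emf_limits(nominal_limits, q_dot, damping=0.5):
    """Hypothetical model: available torque shrinks linearly with joint speed."""
    reduced = np.asarray(nominal_limits, dtype=float) - damping * np.abs(q_dot)
    return np.maximum(reduced, 0.0)


def apply_limits(torque, limits):
    """Clip commanded torques to the currently available range."""
    return np.clip(torque, -limits, limits)


nominal = np.full(12, 30.0)                       # assumed nominal limit per joint
weak_leg = low_power_limits(nominal, (0, 1, 2))   # assumed indices of the front left leg
fast_motion = back_emf_limits(nominal, q_dot=np.full(12, 10.0))
print(apply_limits(np.full(12, 40.0), weak_leg))
print(apply_limits(np.full(12, 40.0), fast_motion))
```

Neither modification corresponds to a parameter that is randomized during training, which is what makes these gaps unmodelled from the policy's point of view.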
V-C Baselines
To evaluate the performance of kinematics randomization used to create robust policies for zero-shot transfer, we utilized the following baselines:

No Domain Randomization (No DR): a conventional PPO policy trained without domain randomization.

Dynamic parameter randomization (Dynamic DR): we train a robust policy with randomized robot dynamics, using the same randomization settings as Peng et al. which have been demonstrated on a real robot [10].
For evaluating kinematic parameter UPs and MPBO, we introduce the following baselines that perform adaptation in the target environment:

No DR with fine-tuning: we fine-tune the No DR policy with additional PPO steps in the target environment, using approximately 20 rollouts for each seed.

DynUP + 10 rollouts: UPs conditioned on dynamic parameters during training. We follow the latent dimension scheme of Yu et al. [21], where the high-dimensional dynamics parameters (52 in our case) are mapped to a low-dimensional latent space (4 in our experiments, to match KinUP) and the adaptation is performed directly in the latent space. Three policies are generated for three seeds, and each UP is adapted in its latent space using regular BO for 10 rollouts, for a total budget of 30 rollouts.

KinUP, pre-adapt. / KinUP + 10 rollouts: UPs conditioned on kinematic parameters during training. Three policies are generated for three seeds. For the pre-adapt. version, the best performance for nominal parameter input values among the three UPs is reported (no adaptation), whereas for the 10 rollouts version, each of the three UPs is adapted for 10 rollouts, for a total budget of 30 rollouts.
The proposed adaptation scheme, KinUP + MPBO, consists of the aforementioned KinUPs for three seeds, with a total adaptation budget of 30 rollouts spent unequally among them in accordance with MPBO. MPBO uses three policies trained from different random seeds to arrive at one final policy selection. To provide a fair comparison, we therefore trained three policies for all baseline methods, and for each target environment we report the return of the best-performing policy among the three. Final performance evaluations are averaged over 15 rollouts to provide a more accurate performance estimate.
Method  No gap  Dynamic Gap  Kinematic Gap  Environment Gap
Low power  EMF  Joint orientation  Deform  Soft
No DR  9458  597  4980  79  1636  1368  1732
Dynamic DR  8346  8010  5635  220  1122  508  3099
Kinematic DR  9141  7463  5112  5401  5795  7305  6215
Adapt.
No DR + Fine-tuning  -  600  5267  80  1386  4334  2334
DynUP + 10 rollouts per policy  -  7861  4721  223  1407  1538  3150
KinUP pre-adapt.  -  2262  4967  657  2657  477  2204
KinUP + 10 rollouts per policy  -  7249  5564  7040  3246  6927  6005
KinUP + MPBO  -  7768  5572  7277  8250  7375  7248
V-D Results for Kinematic Domain Randomization
We first present our results for training a kinematic domain randomization (kinematic DR) policy and compare them to No DR and dynamics DR as described in Section V-C. The results can be found in the top part of Table I. Unsurprisingly, training without any randomization (No DR) performs the worst in the target setting despite the high reward obtained in the training environment. This validates that our target environments are substantially different from the training setting. Dynamics DR, on the other hand, achieves decent performance for the target environments where the discrepancies lie mainly in the robot dynamics. This is likely because the policy has been trained with a variety of robot dynamics and is thus robust to dynamics-related variations, even though our target environments contain dynamics gaps that were not present during training. However, the results also indicate that when we apply dynamics DR to a target setting where the kinematics and the environment parameters are varied, it is not able to achieve successful transfer. In contrast, our proposed kinematic DR approach is able to reach high performance for all the target scenarios, and furthermore significantly outperforms the baseline methods in target settings with kinematic and environment gaps.
V-E Results for Multi-Policy Bayesian Optimization (MPBO)
Having seen the promising performance of Kinematic DR, a natural question to ask is: can we further improve the transfer performance of the algorithm by allowing policies to adapt in the target environment? In the bottom part of Table I, we present our experiment results for our proposed MPBO algorithm applied to a universal policy (UP) conditioned on kinematics parameters of the robot. To illustrate the effect of the adaptation process, along with MPBO, we also report the performance of the trained UP with the nominal design as input, i.e., with no input adaptation performed. As this is analogous to training a single policy at the nominal design, the performance is as poor as that of No DR. However, as shown in our results, performing adaptation using MPBO for 30 rollouts in the target environment outperforms regular BO for 10 rollouts on each of the UPs individually. We also compare KinUP + MPBO to two baselines: fine-tuning the No DR policy with at least an equal number of rollouts, and the three DynUPs adapted with 10 rollouts each (Section V-C). We observe that fine-tuning the No DR policy with PPO typically achieves little improvement, and DynUP performs slightly better in only one environment and trails behind in the rest.
VI Conclusions and Discussion
The combined results of Table I suggest that introducing variations in kinematics during training in simulation can benefit policy transfer to novel settings that are not seen during training. Perhaps surprisingly, this approach substantially outperforms the commonly used dynamics domain randomization in a variety of novel target environment settings. One hypothesis behind this unintuitive observation is that varying kinematic parameters during training leads to more global changes in the dynamic equations, which in turn leads to a wider area of the state space being visited. Further investigation is needed to validate this hypothesis.
Another key result in our experiments is that with a limited amount of data in the target environment, we can significantly improve the transfer performance using an improved version of strategy optimization [22]. By exploiting inexpensive computation in simulation, we can turn the variance due to random seeds in policy optimization to our advantage and extensively explore multiple universal policies at different local minima. We then rely on MPBO to find the most effective strategy among multiple families of policies.
To investigate the mechanism behind the success of MPBO, we depict the reward landscape of a kinematic UP over the kinematic parameters. Specifically, we obtain a 2D slice of the 4D landscape by tying the parameters in groups of two and varying each group jointly. We then evaluated the KinUPs on the resulting 2D grid, with each sampled parameter input being tested for 15 rollouts. The results for the training environment and three target environments (joint orientation, deformable, and soft floor) are depicted in Fig. 3. For the training environment, maximum performance is obtained in an area near the actual kinematic parameters of the robot, as expected. However, for the target environments the default kinematic parameter lies in a low-reward region, and performance can be greatly increased by modifying the kinematic parameter input to the policies. Furthermore, the reward landscape shows some underlying structure; thus, the performance increases due to MPBO seen in Table I are not merely attributable to randomness, but are rather the outcome of a diversified policy, whose tuning knobs exert enough influence over the policy to successfully bridge the gaps.
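The landscape evaluation can be sketched as a grid sweep over two tied values, with each grid point averaged over repeated rollouts. Which parameters are tied together is not recoverable from the text, so the grouping used here (first two and last two parameters) is an assumption, as are the toy rollout function and the grid resolution.

```python
import numpy as np


def tied_grid(values):
    """All (a, b) pairs on a square grid; each value drives one group of two parameters."""
    a, b = np.meshgrid(values, values, indexing="ij")
    return np.stack([a.ravel(), b.ravel()], axis=1)


def expand_to_four(pair, groups=((0, 1), (2, 3))):
    """Map a 2D grid point to the 4D policy input; the grouping is an assumption."""
    params = np.empty(4)
    params[list(groups[0])] = pair[0]
    params[list(groups[1])] = pair[1]
    return params


def landscape(evaluate, values, rollouts=15):
    """Mean return at every grid point, averaged over repeated rollouts."""
    grid = tied_grid(values)
    means = [np.mean([evaluate(expand_to_four(p)) for _ in range(rollouts)]) for p in grid]
    return np.array(means).reshape(len(values), len(values))


def toy_rollout(params):
    """Made-up stand-in for a target-domain rollout return."""
    return -float(np.sum((params - np.array([0.8, 0.8, 1.2, 1.2])) ** 2))


values = np.linspace(0.5, 1.5, 5)
surface = landscape(toy_rollout, values, rollouts=3)
best = np.unravel_index(np.argmax(surface), surface.shape)
print(values[best[0]], values[best[1]])  # grid point closest to the toy optimum
```

In this toy example the best grid point lies away from the nominal input of 1.0 for both groups, mirroring the qualitative observation that the nominal parameters need not be the best policy input in a target environment.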
This work opens a few interesting future directions. While our experiments show strong signals that randomizing kinematic parameters is beneficial, it requires broader investigation on other types of robot tasks, such as manipulation, to validate whether our results can be generalized. The hypothesis we put forth is based on our intuition of multibody dynamic systems. A formal proof or empirical evidence to support our hypothesis is also an important future direction we would like to investigate.
References
 [1] (2019) Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113.
 [2] (2019) Closing the sim-to-real loop: adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979.
 [3] (2017) PyBullet, a Python module for physics simulation in robotics, games and machine learning.
 [4] (2017) Grounded action transformation for robot learning in simulation. In AAAI, pp. 3834–3840.
 [5] (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26), pp. eaau5872.
 [6] (2012) Bayesian approach to global optimization: theory and applications. Vol. 37, Springer Science & Business Media.
 [7] (2017) Why off-the-shelf physics simulators fail in evaluating feedback controller performance - a case study for quadrupedal robots. In Advances in Cooperative Robotics, pp. 464–472.
 [8] (2018) DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–14.
 [9] (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
 [10] (2020) Learning agile robotic locomotion skills by imitating animals. arXiv preprint arXiv:2004.00784.
 [11] (2017) Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702.
 [12] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 [13] (2020-07) Learning Memory-Based Control for Human-Scale Bipedal Locomotion. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA.
 [14] (2020) Rapidly adaptable legged robots via evolutionary meta-learning. arXiv preprint arXiv:2003.01239.
 [15] (2018) Reinforcement learning: an introduction (2nd ed.). MIT Press.
 [16] (2018-06) Sim-to-real: learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania.
 [17] (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 23–30.
 [18] (2018) Laikago: let’s challenge new possibilities.
 [19] (2006) Gaussian processes for machine learning. Vol. 2, MIT Press, Cambridge, MA.
 [20] (2017) How to train your dragon: example-guided control of flapping flight. ACM Transactions on Graphics (TOG) 36 (6), pp. 1–13.
 [21] (2019) Sim-to-real transfer for biped locomotion. arXiv preprint arXiv:1903.01390.
 [22] (2019) Policy transfer with strategy optimization. In International Conference on Learning Representations.
 [23] (2019) Learning fast adaptation with meta strategy optimization. arXiv preprint arXiv:1909.12995.
 [24] (2018) Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (TOG) 37 (4), pp. 144.