Sensorimotor Robot Policy Training using RL (163 references, 90 pages): very long notes

CreateAMind

Published on 2018-07-20 14:51:59

Collected in the column: CreateAMind

https://www.diva-portal.org/smash/get/diva2:1208897/FULLTEXT01.pdf

The Chinese text in the original notes was machine-translated with Sogou Translate.

In other words, instead of decomposing a policy to perception, planning and control, the behavior of a robot should be governed by a number of complete action-observation processing units [17, 18].


i.e., the way we perceive a 3D, colorful view of the world, is not merely associated to photoreceptor cells of the retina of our eyes, but is also associated to all muscles that move the eyes. The concept was earlier introduced by Ballard [23] as active vision.


As an example, it is possible for a bodiless computer program to understand a notion of color or size of objects it is observing by a camera, but rather impossible to get a feeling of softnesses of different surfaces unless it touches them by a physical hand equipped with tactile sensing. There are many such physical notions that are linked to actions which may not be understood without a body making the required actions feasible.


Furthermore, developmental robotics studies emergence of intelligence by (1) an incremental learning process based on which former knowledge and skills are reused to gradually implement more advanced ones, (2) a constraint-based optimization in which constraints on sensations and motor abilities are diminished gradually to impose the learner at a proper level of complexity, (3) motivational systems which produce internal rewards either to satisfy extrinsic desires, such as finding a power supply, or intrinsic curiosity, e.g., to reduce an uncertainty, (4) integration of sensorimotor data to acquire anticipation and categorization abilities and, (5) social interaction to guide the agent to learn novel skills by, e.g., imitation or demonstration [24, 25]


resulted in more advanced deep CNN architectures which even surpass the performance of humans. Although this is a significant achievement in the history of AI, still human-level visual perception is far from being realized. This level of intelligence would require drastic paradigm shift, since our proficiency to interpret a visual scene is highly intertwined with our motor abilities to direct the gaze to infer more information.


Furthermore, we exploit internal models to continuously make predictions about what is likely to come next, and therefore, unlike in the ImageNet challenge, visual objects do not appear to us out of context.


However, in order to obtain more human-like physical skills, we should focus on methods that (1) design incremental learning algorithms which exploit earlier skills to construct more complicated ones given limited and sparse data, (2) design architectural hierarchies to abstract sensorimotor regularities for knowledge sharing among different tasks, (3) comprise model uncertainties when making action-decisions and also explain why such decisions are made based on the internal state of the model and the latest action-observation data and, (4) construct perception models that actively process observations over a period of time to alleviate issues such as the difference in viewpoints and depth prediction ambiguities and partial observabilities [41].

We proposed to divide the deep network into three blocks (super-layers) as perception, policy and behavior super-layers and either train each block individually (paper (E)) or the whole network, end-to-end, by the use of the expectation-maximization method (paper (F)).

An important novelty in our proposed framework is the introduction of the behavior super-layer as a generative model to produce sequences of motor actions given a point in a low-dimensional latent variable space called an action manifold. We demonstrated that using this technique, we are able to convert a sequential action-decision process, that is hardly possible to be solved by reinforcement learning approaches, to a manageable multi-armed bandit problem which can be solved very efficiently using standard policy search techniques.

Furthermore, we demonstrated that we can efficiently train the gigantic behavior and perception super-layers, respectively, by motor samples collected in simulation and visual data either generated synthetically or borrowed from computer vision datasets, e.g., ImageNet. We illustrated experimentally that the discrepancy between the synthetic and real-robot data can be compensated by training the policy super-layer, which maps the low-dimensional state representation, extracted by the perception super-layer, to the low-dimensional action manifold, as the input of the behavior super-layer, using policy search approaches by real-robot data.

We validated our proposed framework and architecture on a number of complicated visuomotor tasks, such as training a deep policy network implemented on a PR2 robot to throw a ball to different visual objects given qualitative terminal rewards. To the best of our knowledge, this is the first time that a policy search method is implemented on a real robot to train such complicated visuomotor tasks merely based on qualitative terminal rewards.

Chapter 2

We also review related work to active perception, state representation learning, model-based policy acquisition and reinforcement learning.


Applying RL methods to a physical robot is very challenging. There are many studies dealing with this problem (see [45, 46]). Here, we list a number of such challenges:

Data inefficiency: RL approaches generally require a huge amount of action-observation data to train a policy properly. However, in most cases, it is not sensible to run a physical robot long enough to generate sufficient data.
Reward sparsity and reward shaping: Random explorations in complicated robotic problems rarely result in a reward outcome. This problem, referred to as the reward sparsity problem, hinders an RL agent from learning a proper action policy. A common solution, known as reward shaping [47], is to devise a reward function which guides the agent to a solution. In practice, devising such functions requires expert knowledge and often introduces biases causing suboptimal solutions.
High-dimensional sensorimotor spaces: Unlike many standard RL benchmarks, robotic problems are often best represented by high-dimensional continuous sensorimotor spaces, which render many standard RL solutions impractical for training a policy with a reasonable amount of robot data.
Explorations: An important question for an RL agent is how to explore the action space efficiently to obtain an optimal action policy. In general, the problem is to find a balance between exploitation of the actions found to be rewarding versus exploration of new actions which potentially give rise to higher rewards. Therefore, exploration is an inevitable part of reinforcement learning. However, exploring a new action in a sequential decision making process alters the distribution of the sensorimotor data, which hinders estimating the return value statistically because of insufficient samples drawn from the new distribution.
Credit assignment problem: The RL paradigm is based on reinforcing actions which are found to be effective in generating rewards. However, in a sequential decision making process, it is not intuitive to assign a received reward to a particular action in the sequence. This is known as the credit assignment problem in reinforcement learning. The common solution is to reinforce all the actions in the sequence based on their temporal difference to the reward event. However, this would reinforce poor actions in the episode as well, therefore resulting in a suboptimal solution.
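
The common answer to credit assignment mentioned above, reinforcing every action according to its temporal distance from the reward event, can be illustrated with a discounted-return computation. This is a minimal sketch rather than code from the thesis; the function name `discounted_credit` and the discount factor `gamma` are assumptions introduced here.

```python
def discounted_credit(rewards, gamma=0.9):
    """Return the discounted return G_t assigned to each step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the (discounted) future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse, terminal-only reward: every earlier action still receives some credit,
# including actions that may have been poor choices.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
credit = discounted_credit(episode_rewards, gamma=0.5)
print(credit)
```

Every action in the episode receives positive credit, which is exactly why poor actions can also be reinforced.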

2.3 Model predictive control

Model predictive control (MPC) (see [23]) is a closely related method to reinforcement learning to obtain an action-selection policy based on which a sequence of action decisions are made to optimize an objective function. Similar to RL, the objective function can be defined as the accumulated reward/cost over a period of time. However, unlike RL which reinforces good actions in a trial-and-error phase, MPC directly searches for the optimal actions over a period of time which is shorter than the length of the episode.

MPC refers to a range of methods to obtain action-selection policies (controllers) consisting of (1) a dynamic forward model which predicts state evolutions over a period (horizon) of future time instants given a sequence of actions, (2) an objective function which maps distributions over the predicted states to a cost value and, (3) a receding strategy to update the optimization time-horizon based on which the next action decision (control signal) is found.
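
The three MPC components listed above (forward model, objective function, receding horizon) can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the 1-D integrator used as `forward_model`, the quadratic cost, and the random-shooting optimizer, which is a simple stand-in and not the optimizer used in the thesis.

```python
import numpy as np


def forward_model(state, action):
    # Hypothetical stand-in for a learned forward model: a 1-D integrator.
    return state + 0.5 * action


def plan_first_action(state, goal, rng, horizon=5, n_candidates=200):
    # Random shooting: sample action sequences, roll them out through the model,
    # and score each one with the accumulated quadratic cost over the horizon.
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    states = np.full(n_candidates, state, dtype=float)
    costs = np.zeros(n_candidates)
    for t in range(horizon):
        states = forward_model(states, actions[:, t])
        costs += (states - goal) ** 2
    # Receding strategy: only the first action of the best sequence is executed.
    return actions[np.argmin(costs), 0]


def run_mpc(start, goal, steps=30, seed=0):
    rng = np.random.default_rng(seed)
    state = start
    for _ in range(steps):
        action = plan_first_action(state, goal, rng)
        state = forward_model(state, action)  # the "real" system equals the model here
    return state


final_state = run_mpc(start=0.0, goal=3.0)
print(final_state)
```

The planner re-optimizes at every step, so errors in one plan are corrected by the next one as long as the model stays reasonably accurate near the current state.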

In the following sections, we provide a short explanation of each component of MPC. Before doing that though, we have to clarify our assumptions about the MPC paradigm used in this thesis. We assume that:

• the dynamic model of the system is not known a priori,

• the states of the system are not directly measurable and instead the robot perceives the world through its sensory observations,

• the objective function is defined over the sensory observations.

In such scenarios, instead of learning a dynamic model representing state evolutions, we learn sensorimotor contingencies. Such contingencies represent distributions over the actions conditioned on the observations (policy) and, distributions over future observations given the current action-observation pair (forward model).


Model-based policy training highly depends on the quality of the model to obtain a proper policy. Finding a perfect model of a physical robot is nearly impossible due to the sparsity of data in high-dimensional sensorimotor spaces, noisy measurements and partial observability. Furthermore, the model itself can be too restrictive to represent the underlying process. Since the policy is trained based on the information provided by the model, a small error in the chain of predictions made by the model results in a large error to update the policy. Hence, extra considerations should be taken into account when updating a policy based on a model. We explain our contributions to formulate an efficient model-based policy training in Chapter 3.

2.4 Literature review

In this section, we provide a general overview of the related work. The sensorimotor foundations of the current thesis are based on the mechanisms of biological systems demonstrating skillful behaviors reviewed by Wolpert et al. in Nature Reviews Neuroscience [48]. They highlighted the main components of motor learning to acquire skilled behaviors as (1) efficient gathering of task-related sensory data, a process which is intertwined with the motor actions to provide the most informative data, (2) three different types of motor learning such as gradient-based and gradient-free reinforcement learning and use-dependent learning and, (3) efficiently representing the acquired knowledge, in terms of motor primitives, credit assignments and the structure of a given task. In this thesis, we studied mostly the first and the second components introduced in the review. Here, we introduce the related work in these areas.

Active perception

Active perception, which refers to deliberate use of motor capabilities to support efficient processing of the sensory observations, has a long tradition in psychological, neuroscience and robotic studies [22, 49, 50, 51, 52, 53, 54]. O'Regan and Noë [22] explained emergence of different sensations, e.g., seeing or hearing in humans, as the consequence of mastery of the corresponding sensorimotor contingencies. As an example, the neural mechanism which enables seeing a red patch is developed by sensorimotor experiences when acting on red objects, e.g., gazing on the objects or moving the eyes away from them. The idea is that the sensation is not the neural influx but how the influx alters as the result of the motor actions. According to this theory, perception is defined as mastery of object-related sensorimotor contingencies.

Perception through anticipation, proposed by Möller [55], focuses on active perception and the role of internal models to anticipate high-level semantics based on low-level sensorimotor contingencies. Hoffman [56], also Möller and Schenck [57] trained sensorimotor contingencies in a mobile robotic application to classify an arrangement of obstacles as a passage or a dead-end by internally exercising the learned sensorimotor contingencies. Schenck [58, 59] applied a similar concept to control robotic manipulators to acquire visuomotor skills based on anticipating sensorimotor regularities by training internal models. He demonstrated that spatial understanding can be achieved merely based on sensorimotor modeling without any high-level symbolic representation.

We studied advantages of formulating the perception problem in robotics as an active process in a number of our works. In [60, 61] we demonstrated that mastery of sensorimotor contingencies endows a robot with the capabilities to distinguish sensory cues caused externally by other agents or a change in the environment from the ones caused by the robot's own motor actions. This capability is then used by the agent to quickly adapt its behavior w.r.t. new conditions. In another study [62], we presented a learning framework based on the perception through anticipation concept which grounds spatial reasoning about objects in the robot's own sensorimotor space. This enables acquiring visuomotor reaching skill, from scratch, purely based on sensorimotor data collected in an active process. Furthermore in [63], we explained how a robot can acquire high-level understanding of a human partner through its mastery of sensorimotor contingencies in a physical collaboration task.

Finally, it is worth mentioning that state-of-the-art approaches to address visual perception problems, e.g., visual object detection, are based on processing single images using deep convolutional neural networks without any considerations of actions. We argue that these approaches owe much of their success to the availability of a tremendous number of hand-labeled images. Furthermore, such methods are applicable to problems where the objective is to conclude something about an image, e.g., classifying image contents as a cup on a table top, but they do not say much about how to interact with the object, i.e., how to pick up the cup in the given example.

State representation learning

Studies on human motor learning have shown that skilled motor behaviors require efficient gathering and processing of task-relevant sensory information [4, 48, 64]. Due to limited computational resources, it is very important to abstract high-dimensional sensations from multiple sources to a task-relevant representation which is suitable to make action decisions. The same argument is applicable to robotic systems with real-time action decision processes. Here, it is vital to represent a task with a set of states holding sufficient information to train optimal action policies.

Unlike computer vision problems [65], it is very hard to find one suitable representation which is applicable to many tasks. A satisfactory representation developed for a manipulation problem does not suit a navigation task, as an example. In many cases it is not even possible to transfer representations to similar problems, e.g., a representation learned for picking up a cup may not be transferable to a task which involves pouring into the same cup [31]. Similarly, there is no approach which can be applied to many different tasks to learn a suitable state representation. End-to-end training is by far the most generalizable method to learn a task-related state representation from raw sensory observations. However, features learned by end-to-end training do not transfer well to other similar tasks and have to be learned from scratch for every new task. Also, such methods require too much data with high diversity to learn a proper representation, or otherwise, are very prone to overfitting, i.e., learning a set of irrelevant features that are just common in the limited training dataset [66, 67].

Therefore, state representation learning is challenging, task dependent, weakly transferable and moreover hard to be evaluated before being assessed on a policy training problem [68].


To deal with these challenges, many heuristics are devised to characterize desirable behaviors of the extracted features. Generally these behaviors are included in representation learning problems by a number of priors. Priors refer to preliminary knowledge regarding desirable attributes of extracted features formulated explicitly in the objective function to be optimized. They act as task-dependent regularizers to compensate for the limited sampled data to acquire an abstraction of high-dimensional observations. The use of such priors is very common in the representation learning literature. We list a number of most practical ones in the following:

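1. Slow feature analysis (SFA) [69]: SFA builds on the assumption that a set of invariant (task-independent) states can be extracted by inferring slow features from rapidly varying observations. The underlying assumption is that perceived objects or phenomena change slowly compared with the rapidly varying sensory signals. SFA has been studied in different works, e.g., [70, 71, 72], to extract invariant features from video recordings. It is an optimization problem that solves for the parameters of a set of functions, each forming one dimension of the state vector, given raw sensory observations. The parameters are updated to extract a set of features that vary as slowly as possible over the observation sequences given as unsupervised training data. To avoid trivial solutions and to find a minimal representation, the features are constrained to have zero mean, unit variance and no cross-correlation.
2. Physical priors: besides the temporal coherence explained in the previous item, other physical laws also govern the interaction of a robot with its real-world environment. Jonschkowski and Brock [73, 74] summarized such laws as (1) a simplicity law, which favors modeling the task-relevant aspects with a minimal representation, (2) a proportionality law, which relates the change in task properties to the magnitude of the corresponding action, (3) a causality law, which enforces a difference between two states in which the same action leads to different outcomes, and (4) a repeatability law, which favors state representations in which similar actions yield repeatable outcomes.
3. System dynamics priors: the extracted set of states, together with the robot actions, is expected to form a dynamical system with well-studied properties, such as linearity [75, 76] or the MDP property [77]. Such preferences can also be formulated as priors in the representation learning problem.
4. Comprehensiveness priors: these priors favor state representations which, together with a decoder model, include all the information needed to reconstruct the corresponding observations. Auto-encoders (described in Chapter 4) are the most popular architecture to realize this kind of abstraction [78, 79, 80, 75, 81, 82]. Auto-encoders compress the data into a low-dimensional manifold from which the original data can be recovered. It is also beneficial to construct an embedding of some sensory observation channels from which another channel can be recovered [83].
5. Predictability priors: as in a Markov decision process, state-action pairs are required to contain sufficient information to predict the distribution over successor states [77]. Moreover, to improve task relevance, the extracted states are expected to include data that, together with the action data, can be used to predict future rewards [84, 40], observations [85], or returns according to a value function [83], which estimates the expected return given a state or a state-action pair. Such priors can be included explicitly in task learning as general task-agnostic priors that improve the task relevance of the extracted features.
6. Semantic priors: since many physical tasks involve manipulating real objects, it is reasonable to treat high-level semantics such as objectness [31] or, more explicitly, the classes of task-relevant objects [32] as prior knowledge. In this way, the policy can be updated more efficiently to focus on task-relevant features. Moreover, such semantic data can be provided to the policy training problem to compensate for the lack of interactive robot data.
7. Task-specific priors: other heuristics have been proposed in the literature that exploit different task-specific priors to improve the efficiency of policy training methods. Such priors apply to one specific problem and, compared with the other priors in this list, cannot be regarded as general. As an example, loop-closure prediction for navigating a maze [83] improves the spatial reasoning of the agent but is only applicable to very similar problems.
8. Auxiliary tasks: auxiliary tasks can be formulated as auxiliary policy search problems that improve the main policy training task by sharing parameters and representations. Such auxiliary tasks can be defined at different complexity levels of the same task [86], or as different tasks that are believed to benefit the original task [40, 83], e.g., controlling the observations.
9. Multi-task multi-robot learning: in general, an end-to-end deep policy can be seen as a concatenation of a task-specific module and a robot-specific module [33]. The task-specific components can be shared among different robots, even with different morphologies [34], to improve feature invariance. Moreover, the robot-specific components can be shared among different tasks to enhance the internal representation of the deep policy [37, 36].
10. Domain randomization: domain randomization is a technique mainly used to transfer policies obtained in simulation to the real world [87, 88, 89, 86]. The basic idea is to randomize non-essential aspects of the observations (mostly in simulation) to obtain an invariant representation. However, manipulating real observations is not straightforward. Moreover, the randomization process requires a decent amount of prior knowledge about the task and may introduce biases into the extracted features.
11. Feature shaping: another type of prior knowledge commonly provided to a learning agent is to specify explicitly the type of features that form the state representation. However, unlike the other priors introduced above, which encode knowledge into the objective function, here the features are usually dictated by the architecture of the perception model. A good example of such feature shaping can be found in the work of Levine et al. [29, 78], where the states are chosen a priori by the architecture to be spatial image points.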
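
The slow feature analysis prior from the list above can be sketched in its simplest linear form. This is an illustrative sketch and not the implementation used in the cited works; it assumes linear feature functions, and the names `linear_sfa`, `observations` and the synthetic slow/fast signals are made up for the example.

```python
import numpy as np


def linear_sfa(X, n_features=1):
    """Linear SFA: find unit-variance, decorrelated projections that vary slowest."""
    X = X - X.mean(axis=0)                      # zero mean
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    whitener = eigvec / np.sqrt(eigval)         # scale each eigenvector column
    Z = X @ whitener                            # whitened signals: identity covariance
    dZ = np.diff(Z, axis=0)                     # temporal differences
    _, dvec = np.linalg.eigh(np.cov(dZ, rowvar=False))
    W = whitener @ dvec[:, :n_features]         # smallest eigenvalues = slowest features
    return X @ W, W


# Two observed channels that mix a slow and a fast latent signal.
t = np.linspace(0.0, 2.0 * np.pi, 500)
slow, fast = np.sin(t), np.sin(20.0 * t)
observations = np.column_stack([slow + fast, slow - fast])

slow_feature, weights = linear_sfa(observations)
print(abs(np.corrcoef(slow_feature[:, 0], slow)[0, 1]))
```

Whitening enforces the zero-mean, unit-variance and decorrelation constraints, so the remaining eigenproblem only has to pick the directions with the smallest temporal variation.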

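However, we argue that the representation is as important as the mapping, because (1) the internal representation can be shaped to improve the data efficiency of the policy training problem and to ensure that the acquired policy applies to unseen situations, (2) representations can be shared among different tasks or organized into a hierarchy to acquire more complex skills, and (3) the way sensory observations are represented indicates how the policy makes action decisions, and is therefore an important clue for studying the quality of the acquired policy.

... a dynamic forward model, from which the information for updating the policy is obtained. In general, policy training is an ill-posed problem, i.e., there are several possible combinations of actions that minimize a given objective function. As a result, a proper low-variance estimate of the gradient of the objective function cannot be obtained from observed data alone. This makes model-free methods less data-efficient than model-based methods, which construct a forward model from data to provide a reliable estimate of the gradient.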

Error-based skill learning

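This type of learning is known as error-based learning in biological studies.

... the ability of the agent to sense the actual outcome of its own motor actions and to compare it with the desired outcome. In this case, the agent is able not only to determine the error but also to estimate the gradient of the error. Estimating the error gradient requires a forward model to provide the information needed to backpropagate the error. Moreover, the forward model can be used to provide an estimate of the gradient with respect to the actions directly. Hence, optimal actions can be found directly without the need for a policy parameterization. In the following sections, we discuss related studies on direct action optimization and on parametric policy optimization.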

Parametric policy training using forward models

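The work by Jordan and Rumelhart [102] can be regarded as a pioneering study on training neural network policies based on gradient information provided by a dynamic forward model. They accommodated the action policy and the forward model in a single neural network architecture and trained the model in two steps. In the first step, the forward model is trained on a set of state-action transition data. In the second step, the forward model is kept fixed and the policy is trained end-to-end by backpropagating the task error through the pre-trained forward model.

Given the limited data available in robotic problems and the dimensionality of the state-action space, training an accurate forward model is infeasible in many cases. To address this issue, Abbeel et al. [103] proposed constructing a crude (inaccurate) forward model to update the policy parameters while using real-robot data to evaluate the updated policy, as a trade-off between model-based and model-free policy training.

Piecewise-linear models can be used effectively to represent complex nonlinear robotic systems around several sampled state-action trajectories. Linear systems can be combined efficiently with optimal control strategies, such as the iterative linear quadratic regulator (iLQR) [104], to obtain optimal actions. Levine and Abbeel [105] proposed a method that iteratively constructs local time-varying linear models by refitting the model parameters given the latest data. They combined the local linear forward model with iLQR to obtain a sequence of actions that solves a given task.

Deisenroth et al. [106, 107, 108] proposed learning the forward model incrementally with non-parametric Gaussian processes. They introduced a framework called probabilistic inference for learning control (PILCO), which updates a parametric policy based on stochastic long-term predictions made by the forward model. In practice, the framework estimates the expected long-term cost by making a sequence of predictions given uncertain state-action values. The gradient of the expected long-term cost with respect to the policy parameters is then computed to update the policy in each iteration. PILCO showed that policy training can be sped up by extracting more relevant information from the provided data. However, its computational complexity prevents it from being applied to robotic tasks with high-dimensional sensory inputs.

In all of these methods, the forward model provides the direction in which to update the policy parameters. Most methods rely on constructing a local forward model that is updated incrementally with newly sampled trajectories. In general, since the distribution of the sensorimotor data keeps changing as the policy parameters change, the forward model should be adapted continually to represent a non-stationary data distribution.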

Direct action optimization using forward models

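See model predictive control, as described in Section 2.3. We only deal with methods that build the forward model incrementally from data. Many psychological studies address the importance of long-term prediction, by exploiting sensorimotor contingencies, for realizing cognitive behaviors. Here, we introduce some that are more relevant to this thesis. Möller and Schenck [57] trained a visual forward model that estimates changes in the visual observations and a tactile forward model that predicts collisions with obstacles, in order to learn a simple navigation task on a simulated mobile agent. They demonstrated that the agent can distinguish passages from dead-ends merely based on the learned sensorimotor contingencies, by trying to find a sequence of internally simulated actions that does not lead to a collision predicted by the forward models. Schenck et al. [59] studied the potential role of visual forward models in grasping objects that lie outside the foveal region, i.e., the part of the retina without full visual acuity. They hypothesized that a visual fixation is simulated internally before the arm movement for grasping the object is planned. They validated their hypothesis on a robotic platform by training a visual forward model and using it to grasp objects outside the fovea following the proposed paradigm.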

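Sun and Scassellati proposed a data-driven approach to visual servoing, i.e., controlling a robotic manipulator through visual feedback. They trained a forward model that maps the joint positions of the manipulator to the corresponding visual points in the camera image. The Jacobian matrix is obtained from the derivative of the forward model and is used to find the optimal motor actions that control the position of the manipulator.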
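
The idea of differentiating a forward model to obtain a Jacobian for servoing can be sketched as below. This is only an illustration under stated assumptions: `forward_model` is a hypothetical planar two-link arm standing in for a learned joint-to-image mapping, the Jacobian is taken by finite differences, and the pseudo-inverse update with a fixed `gain` is a generic choice rather than the cited authors' controller.

```python
import numpy as np


def forward_model(q):
    # Hypothetical stand-in for a learned mapping from joint angles to an image point:
    # a planar two-link arm with unit link lengths.
    return np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                     np.sin(q[0]) + np.sin(q[0] + q[1])])


def numerical_jacobian(f, q, eps=1e-6):
    # Finite-difference derivative of the forward model with respect to the joints.
    base = f(q)
    J = np.zeros((base.size, q.size))
    for i in range(q.size):
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (f(q + dq) - base) / eps
    return J


def servo(target, q0, gain=0.5, iterations=100):
    q = np.array(q0, dtype=float)
    for _ in range(iterations):
        error = target - forward_model(q)
        J = numerical_jacobian(forward_model, q)
        q = q + gain * np.linalg.pinv(J) @ error   # step that reduces the visual error
    return q


target_point = np.array([1.0, 1.0])
q_final = servo(target_point, q0=[0.3, 0.5])
print(forward_model(q_final))
```

With a learned model, `forward_model` would be replaced by the trained network or regressor, and its derivative would typically come from automatic differentiation instead of finite differences.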

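Compared with training task-specific policies, an important benefit of obtaining action-selection policies by training forward models is that the resulting solution applies to a wider range of tasks and does not depend on a reward function. This is especially important when implementing visuomotor skills, since visual tasks require a large amount of detailed human supervision [111], i.e., learning to act by predicting the future.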

In Chapter 3, we introduce our methods which obtain an action policy by constructing a forward model and directly optimizing the model to find optimal actions. Similar to the introduced approaches, the idea is to update a local model incrementally by sampling new data according to a policy which seeks to reach to a given goal by optimizing the forward model directly. We demonstrate that long-term action-selection strategies can be achieved by incrementally refining the forward model using only a handful of trials by the techniques we developed in our studies.


Chapter 3

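The main advantage of model-based policy search methods is that learning a dynamic forward model from empirical data is simpler than training a policy directly. The reason is that, unlike policy training, the target values for training a forward model are directly observable (although usually corrupted by measurement noise). Moreover, training a forward model involves no temporal complexity. The model depends only on individual state-action transition data and, for a stationary process (i.e., a process that does not change over time), the forward model can be trained on state-action transition data regardless of their time stamps.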

3.1 Forward dynamic models

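... networks with weight uncertainty [132], which explicitly model the uncertainty of their predictions,

In the next section, we present Gaussian processes as a regression tool to represent the forward model, and then discuss the sources of model uncertainty, determining the relevant dimensions of the sensorimotor space, and making a chain of predictions under uncertainty.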

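The main advantages of GP forward dynamic models are: (1) good generalization due to the inherent Bayesian regularization, (2) explicit modeling of uncertainty, (3) the model is easy to update, and (4) the gradient of the model with respect to its inputs can be derived analytically. A GP is specified by its mean function m(x) and its covariance function k(x_i, x_j), where x denotes the input of the GP model. We assume that the forward model is represented by a zero-mean GP, f ~ GP(m(x) = 0, k), with a squared exponential covariance function.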
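
A zero-mean GP with a squared exponential covariance can be sketched as follows. The notes do not reproduce the covariance equation, so the code assumes the common isotropic form k(x_i, x_j) = s^2 exp(-|x_i - x_j|^2 / (2 l^2)); the hyperparameters `signal_var`, `length_scale` and `noise_var`, and the sine toy data, are assumptions for illustration and may differ from the thesis.

```python
import numpy as np


def se_kernel(A, B, signal_var=1.0, length_scale=1.0):
    # Squared exponential covariance between the rows of A and the rows of B.
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_predict(X_train, y_train, X_query, noise_var=1e-4):
    # Standard GP regression equations, solved through a Cholesky factorization.
    K = se_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = se_kernel(X_train, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = se_kernel(X_query, X_query).diagonal() - np.sum(v**2, axis=0)
    return mean, var


X_train = np.linspace(-3.0, 3.0, 15)[:, None]
y_train = np.sin(X_train[:, 0])
X_query = np.array([[0.5], [10.0]])   # one point near the data, one far away

pred_mean, pred_var = gp_predict(X_train, y_train, X_query)
print(pred_mean, pred_var)
```

The predictive variance stays small near the training data and returns to the prior variance far from it, which is the explicit uncertainty estimate referred to in advantage (2).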

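Model uncertainties

Modeling physical systems (such as embodied robots) is very challenging because of the high dimensionality of the state-action space and the limited amount of training data. In such cases, a proper model should not only give a sufficiently good estimate but also report how certain it is when making a prediction. The model uncertainty is then passed to the policy training algorithm to make action decisions that (1) improve the success rate by sticking to state-action trajectories with known outcomes, (2) improve safety by avoiding actions with high uncertainty, and (3) explore the state-action space efficiently to reduce the uncertainty in regions where high rewards are likely. Two types of uncertainty are studied in this thesis: (1) uncertainty due to noisy measurements and (2) uncertainty due to insufficient data. Distinguishing these two sources of uncertainty is important for improving the model accuracy as more data are observed. In case (2), the GP model is updated with new training data to improve the accuracy of future predictions. In case (1), however, adding more data to the training dataset only increases the model complexity (since the size of the matrix K grows) without improving the accuracy of the predictions. In the included publications [61, 62, 63], we developed an incremental model learning method using Gaussian processes that improves the model as more training data are observed while taking the computational complexity of the model into account. This chapter briefly introduces these methods.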

Relevance determination

Making chain predictions

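Policy search methods generally update a parameterized policy π_θ by running several trials of the policy on the robot and using the collected state-action rollouts to estimate the objective function. Model-based methods simulate internal rollouts by using the forward model and the policy to make a chain of predictions of the state-action values. PILCO [106], for example, estimates the objective function by finding the distribution over the whole trajectory given the policy and the GP forward model, p(τ | π_θ, f).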
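
Chaining predictions through an uncertain forward model can be sketched with particle sampling. Note the swap: PILCO propagates the state distribution analytically, whereas the sketch below uses Monte Carlo particles because they are easier to show in a few lines. The linear `forward_model`, its fixed predictive standard deviation and the proportional `policy` are hypothetical.

```python
import numpy as np


def forward_model(state, action):
    # Hypothetical learned model: predictive mean and standard deviation of the next state.
    mean = 0.9 * state + action
    std = 0.1 * np.ones_like(state)
    return mean, std


def policy(state, gain=-0.2):
    # Hypothetical parameterized policy (a proportional controller).
    return gain * state


def simulate_rollouts(initial_state=1.0, horizon=15, n_particles=1000, seed=0):
    rng = np.random.default_rng(seed)
    states = np.full(n_particles, initial_state)
    means, stds = [states.mean()], [states.std()]
    for _ in range(horizon):
        mean, std = forward_model(states, policy(states))
        states = rng.normal(mean, std)       # sample the next state of every particle
        means.append(states.mean())
        stds.append(states.std())
    return np.array(means), np.array(stds)


traj_means, traj_stds = simulate_rollouts()
print(traj_means[-1], traj_stds[-1])
```

The spread of the particles grows along the chain, which shows how one-step model uncertainty accumulates in long-term predictions.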

In the first part of our studies, we developed two data-efficient frameworks to obtain action-selection policies from scratch with no prior expert knowledge. The first framework was developed in two stages and published in IROS2015 [62] and ICRA2016 [61]. The framework was applied to learn kinematic and dynamic models of a PR2 robot to control the robot from scratch using a small amount of interactive training data. The second framework, published in IROS2016 [63], was applied to solve a physical human-robot collaboration problem. We demonstrated that our PR2 robot learns to collaborate with a human partner to control a ball on a jointly held plank after a few minutes of interaction. In the next two sections, we outline these two frameworks.

3.2

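The important features of our solution are: (1) action decisions are made within a trust region, defined as the subset of the state-action space with known transitions; (2) the forward model is trained incrementally and adapted continually with newly observed data; (3) a general action planner is used for long-term planning, combined with a local optimization method to move between states.

We assume that the model is valid only in the vicinity of known trajectories, which are improved continually to converge to a solution. In this case, the model cannot be trusted anywhere else in the state-action space, and actions selected based on untrusted regions of the model are doomed to fail. To avoid selecting actions under an uncertain dynamic model, the model uncertainty first has to be quantified. Several approaches have been proposed for this problem, most of which apply to model-free parametric policy search problems. These methods implicitly avoid untrusted regions by constraining the policy distribution with the Kullback-Leibler (KL) divergence measure (e.g., [29, 44]) or by estimating the KL divergence through a temperature measure (e.g., [98, 135]). We directly use the variance output of the Gaussian process as a measure of model uncertainty and integrate it to find optimal actions efficiently.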

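Incremental learning

As mentioned earlier, the forward model has to be updated incrementally to keep up with newly sampled trajectories as they converge to the final solution. Gaussian processes are well suited to this kind of incremental learning. A model represented by GPs can be updated simply by adding samples to, or removing samples from, the training dataset. In the following, we introduce different schemes for updating a GP model:
1. If the uncertainty is due to a lack of sufficient training data in the neighborhood of the queried sample point, a prediction failure on a newly observed sample caused by high uncertainty is compensated by appending the new sample to the GP dataset. The GP covariance function can be used as a distance measure to determine the neighborhood.
2. Otherwise, when the uncertainty is not due to a lack of data but to noisy measurements, appending the new sample only increases the model complexity without improving the accuracy or the certainty of the predictions. In this case, given the list of training data in the neighborhood of the new sample, the new sample can replace the closest or the oldest entry in the list.
3. An incorrect prediction made by the GP model with high certainty is most likely an indication of a change in the behavior of the system. A confident GP prediction results from many training data points being present near the query point; hence, assuming a proper distance measure for determining nearby training data, the incorrect prediction is caused by a change in the underlying process from which the samples are drawn. In this case, all training data that conflict with the latest observation should be removed from the dataset and replaced by the new observation.
4. The complexity of a GP model grows as training data are collected. As the state-action trajectories evolve over time to converge to a solution, older GP training data corresponding to earlier trajectories become obsolete. We can therefore use a forgetting mechanism [53] to prune old data and control the model complexity in the incremental learning scheme.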
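
The four update schemes above can be combined into one dataset-maintenance rule, sketched below. This is an interpretation for illustration only: the Euclidean neighborhood test (instead of the GP covariance), all thresholds (`radius`, `min_neighbours`, `error_tol`, `std_tol`, `max_size`), the choice to skip samples that are predicted confidently and correctly, and the function name `update_dataset` are assumptions, not the thesis implementation.

```python
import numpy as np


def update_dataset(X, y, x_new, y_new, pred_mean, pred_std, radius=0.5,
                   min_neighbours=3, error_tol=0.2, std_tol=0.3, max_size=50):
    """Update GP training data (rows ordered oldest first) with one new observation."""
    neighbours = np.where(np.linalg.norm(X - x_new, axis=1) < radius)[0]
    wrong = abs(pred_mean - y_new) > error_tol
    uncertain = pred_std > std_tol
    if uncertain and len(neighbours) < min_neighbours:
        action = "append"                        # scheme 1: too little data nearby
    elif uncertain:
        action = "replace_oldest"                # scheme 2: noise, keep the size fixed
        X, y = np.delete(X, neighbours[0], axis=0), np.delete(y, neighbours[0])
    elif wrong:
        action = "reset_neighbourhood"           # scheme 3: the system has changed
        X, y = np.delete(X, neighbours, axis=0), np.delete(y, neighbours)
    else:
        return X, y, "keep"                      # confident and correct: nothing to add
    X, y = np.vstack([X, x_new]), np.append(y, y_new)
    if len(X) > max_size:                        # scheme 4: forget the oldest data
        X, y = X[-max_size:], y[-max_size:]
    return X, y, action


X0 = np.array([[0.0], [0.1], [0.2], [5.0]])
y0 = np.zeros(4)
Xa, ya, act_a = update_dataset(X0, y0, np.array([3.0]), 0.0, pred_mean=0.0, pred_std=1.0)
Xb, yb, act_b = update_dataset(X0, y0, np.array([0.1]), 0.0, pred_mean=0.0, pred_std=1.0)
Xc, yc, act_c = update_dataset(X0, y0, np.array([0.1]), 1.0, pred_mean=0.0, pred_std=0.01)
print(act_a, act_b, act_c)
```

After each call, the GP would be refitted (or its kernel matrix updated) on the returned dataset.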

Many robotic tasks can only be formulated as a feedforward control problem due to (1) the lack of informative feedback that is provided immediately after each action, (2) the latencies of the sensory observations and, (3) the necessity of realizing a skilled behavior.

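As humans, we perform many basic physical activities, such as opening a door, quickly and without much attention. A robot under classical feedback control, however, has to adjust its actions continually based on the received feedback and a given control law. In the door-opening example, the robot reaches the door handle by continuously measuring the distance between the end-effector and the handle using 2D images captured at every time step (image-based visual servoing [139]) and finding the motor commands that minimize this distance. Once the robot has grasped the handle, it has to push the door carefully to estimate its center of rotation based on the gravity-compensated force/torque feedback received at the wrist [140].

The time spent on this estimation slows down the task execution significantly. A feedforward policy, on the other hand, does not halt the task execution to receive and process sensory feedback and to keep updating the motor commands to complete the task. To improve control robustness, i.e., the resilience of the control policy to small errors and disturbances, the feedforward control mechanism can be equipped with a forward dynamic model, as explained in the previous chapter, to predict the sensory outcomes of the motor actions and, when a mismatch is sensed, to execute, e.g., a feedback correction policy.

The problem is therefore to find the policy parameters such that, given an observation o, the policy assigns higher probability to trajectories with more likely rewarding outcomes. Since no analytic model of the reward probability is available, the integral is estimated by Monte Carlo sampling. However, given the dimensionality of the motor trajectory space, estimating the integral by direct sampling is very inefficient, if not impossible. Moreover, the policy assigns a distribution over motor trajectories conditioned on high-dimensional visual input. The high dimensionality of the observations also hinders estimating the integral by direct sampling.

Our solution to this problem is based on reducing the dimensionality of the sensorimotor space of the robot. We find a task-relevant low-dimensional state-action space such that the integral in Equation 4.1 can be approximated efficiently by sampling from this new space. In the following sections, we describe in detail how the dimensionality of the sensorimotor space is reduced for data-efficient deep feedforward policy training.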

4.3 Trajectory generative models

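As mentioned earlier, the integral introduced in Equation 4.1 cannot be estimated efficiently because of the high dimensionality of the motor trajectory space. Because of the sparsity of the reward signal, hardly any sampled trajectory performs the task with an acceptable outcome. In this case, given a limited number of samples, direct sampling yields only negative samples, which carry little information for updating the policy parameters.

The probabilistic model denoted by g(τ|α) is a generative model [148], in which the observed variable τ is generated by sampling from g conditioned on a latent variable α whose dimensionality is considerably lower than that of the trajectory space. We call this latent space the action manifold, since it relates directly to actions defined as sequences of motor trajectories. Based on this intuition, and to deal with the high dimensionality of the motor trajectory space, we assume that motor trajectories can only be produced by g. Furthermore, we consider a deterministic generative model

Since the latent variable α is low-dimensional, such an integral can be estimated more efficiently by sampling. Moreover, the problem of reward sparsity is alleviated, since sampling from g leads to significantly more rewarding outcomes.

Training the generative model

So far, we have explained that a generative model improves feedforward policy training by providing plausible trajectories and potentially reduces the search space by introducing a low-dimensional action manifold. However, training the generative model itself requires task-relevant trajectory samples. This seems to contradict the reason why a generative model is needed in the first place. Two questions may arise here: (1) what makes collecting data to train the generative model more efficient than collecting data to train the policy directly? (2) why can the data not be used to train the policy directly? To answer these questions, we first emphasize that the trick introduced in Equation 4.2 is not only a restriction imposed on the trajectories but also the result of representing trajectories with low-dimensional latent variables. The generative model arranges trajectories according to their temporal properties, such that two neighboring points sampled in the latent space also yield trajectories that are close in the time domain. This is the basic property on which stochastic policy search solutions are designed [42]. The generative model is therefore also part of the solution to the deep feedforward policy search problem.

Our goal is to obtain a generative model τ ← g(α) that maps a point in the latent space to a complete trajectory of motor commands τ.

Variational auto-encoder (VAE): a variational auto-encoder [150] is a particular type of auto-encoder that places an additional constraint on the encoder distribution over the latent space. It penalizes the deviation of the encoder output from a normal distribution through an additional loss term.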

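This additional constraint forces the generative model to recover trajectories for given samples drawn from a normal distribution. Assigning a normal distribution to the latent samples is beneficial for two reasons:
1. Sampling from a normal distribution is more efficient than sampling from the complicated unknown distribution that would otherwise be assigned to the data without the variational penalty term.
2. Training the generative model guarantees neighborhood preservation. In the training phase, for each training datum τ_i, the generative model is trained to minimize |g(α_{i,j}) - τ_i|, where the α_{i,j} are different samples drawn from the normal distribution given by the encoder ψ(α|τ_i). In effect, this forces the generative model to map neighboring points in the latent space (sampled from the normal distribution) to neighboring trajectories under the given distance measure.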
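
The two loss terms described above, a reconstruction term and a penalty on the deviation of the encoder output from a normal distribution, can be sketched as the usual VAE objective. The notes do not give the exact loss, so the squared-error reconstruction term and the closed-form KL divergence to a standard normal (for a diagonal Gaussian encoder) are assumptions; `vae_loss` and its arguments are hypothetical names.

```python
import numpy as np


def vae_loss(trajectory, reconstruction, mu, log_var):
    """Reconstruction error plus KL divergence between N(mu, exp(log_var)) and N(0, I)."""
    recon = np.sum((trajectory - reconstruction) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + kl, recon, kl


def sample_latent(mu, log_var, rng):
    # Reparameterized sample alpha_{i,j} from the encoder distribution.
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)


tau = np.array([0.0, 0.5, 1.0])          # a toy motor trajectory
tau_hat = np.array([0.0, 0.5, 1.0])      # a perfect reconstruction
total, recon, kl = vae_loss(tau, tau_hat, mu=np.array([1.0, 0.0]), log_var=np.zeros(2))
alpha = sample_latent(np.zeros(2), np.zeros(2), np.random.default_rng(0))
print(total, recon, kl)
```

The KL term is zero only when the encoder outputs exactly a standard normal, so it pulls the latent codes toward a region that is easy to sample from at policy-training time.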

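More complex generative models that implement multiple behaviors in a single neural network architecture (e.g., walking, running and jumping for a biped robot) would benefit from multi-modal distributions. It is worth noting that the encoder model, which is trained end-to-end together with the generative model (decoder), is not used afterwards in the policy network. In fact, the encoder is only an auxiliary model for training the generative model.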

4.4

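The idea is to train perception as a dimensionality-reducing mapping that extracts a low-dimensional, task-relevant state representation from high-dimensional sensory observations. Then, with the low-dimensional state representation and the low-dimensional latent variable as the input and output spaces of the policy super-layer, the policy super-layer is trained efficiently with policy search reinforcement learning methods.

Vision is one of the richest sensory modalities and, in many cases, combined with a visuomotor model, it carries all the information required to perform a motor task. In general, however, it contains a large amount of detail and task-irrelevant data, which makes processing inefficient. The solution found in many robotic tasks and in biological organisms [48, 64, 152] is to filter out task-irrelevant sensory observations.

Several methods have been introduced to learn autonomously a representation suitable for acquiring a given task, using only sensorimotor experience and possibly a set of priors [78, 73, 95, 75]. The advantages of such methods are:
1. They remove the need for a human expert to extract a set of symbolic state representations for a new task, which improves the autonomy of the task learning process.
2. They make it possible to adapt to new situations that the human expert did not foresee.
3. They improve the robustness of the final policy by training the whole processing chain end-to-end as a single unit [29]. Otherwise, the final policy is considerably more sensitive than under an end-to-end training scheme, because a small error in one of the blocks is amplified by the subsequent blocks and leads to task failure.

Furthermore, state representation learning methods should also be able to handle sensations received from multiple sensor modalities (e.g., depth sensors, laser scanners, microphones, force/torque and tactile sensors), since a suitable representation is found by integrating the information received through all the sensorimotor channels of the robot. The problem can be stated as follows: given a sequence of sensorimotor experiences with observations denoted by o_t, train a perception model to extract a set of task-relevant states s_t, based on which the policy can make near-optimal action decisions.

A prominent approach to learning task-relevant representations is to backpropagate the policy loss to all layers through end-to-end training of a deep policy network. We outline these approaches in a later section of this chapter.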

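Training the policy super-layer

So far, we have explained how the motor trajectory and perception super-layers of the deep feedforward policy network are trained separately using different auto-encoder architectures. In this way, we abstract the high-dimensional sensory and motor spaces into a state space and an action manifold. In this part, we describe the training of the policy super-layer, which maps a state s to an action variable α. This mapping can be trained very efficiently with policy search methods for the following reasons:
1. Low-dimensional input-output spaces: the policy super-layer maps abstract states to a low-dimensional representation of the motor commands. Hence, assuming that the perception and motor super-layers are pre-trained and kept fixed in the early stage of training, training the policy super-layer involves finding a mapping between two low-dimensional spaces.
2. Small number of parameters: the policy super-layer contains fewer than 0.05% of the parameters of the whole deep network. This is due to the low dimensionality of its input-output spaces and to the simplicity of the mapping that has to be realized. Because the mappings realized by the perception and motor super-layers are continuous and topology-preserving, the policy super-layer does not need much flexibility to learn the optimal mapping.
3. Solving a multi-armed bandit problem: the original policy search problem requires a policy that makes optimal sequential action decisions with delayed rewards but, because the motor super-layer abstracts the motor trajectories, the problem is reduced to a multi-armed bandit problem. Bandit problems are easier to solve because (1) there is no credit assignment problem to solve, (2) the reward signal is received immediately after an action decision is made, and (3) long-term planning is no longer needed. Hence, even if only terminal rewards are provided to train the deep feedforward policy, the policy super-layer can be trained very efficiently with standard model-free policy search methods (e.g., [44, 135, 153, 154]).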
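
The bandit view of policy search over the action manifold can be sketched as a search in the latent space with terminal rewards only. The sketch makes several assumptions that are not from the thesis: `generate_trajectory` is a toy stand-in for a pre-trained generative model g(α), the terminal reward is a continuous distance rather than a qualitative label, the state input is omitted, and the cross-entropy method is used as one generic example of a model-free search.

```python
import numpy as np


def generate_trajectory(alpha, n_steps=20):
    # Hypothetical stand-in for the pre-trained generative model g(alpha).
    t = np.linspace(0.0, 1.0, n_steps)
    return alpha[0] * t + alpha[1] * t**2


def terminal_reward(trajectory, target=0.7):
    # The reward is given once, for the end of the trial (no per-step feedback).
    return -abs(trajectory[-1] - target)


def bandit_search(n_iterations=20, n_samples=50, n_elite=10, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(2), np.ones(2)          # search distribution over the latent space
    for _ in range(n_iterations):
        alphas = rng.normal(mean, std, size=(n_samples, 2))
        rewards = np.array([terminal_reward(generate_trajectory(a)) for a in alphas])
        elite = alphas[np.argsort(rewards)[-n_elite:]]   # keep the best-rewarded samples
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean


best_alpha = bandit_search()
print(generate_trajectory(best_alpha)[-1])
```

Each sample is one complete trial with one reward, so there is no credit assignment across time steps; the search only has to cover the two latent dimensions instead of the full trajectory space.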

The goal is to obtain a variational policy and then a final policy that behaves like the variational one but uses only raw visual data.

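We applied the introduced methods to train deep feedforward policies to perform complicated visuomotor tasks, e.g., grasping objects, throwing a ball [141], and pouring liquid from a bottle into a cup [142]. The main results are as follows:
1. Data efficiency: with a few hundred real-robot trials, we validated the data efficiency of the proposed method for training deep policy networks on different complicated tasks.
2. Training with qualitative terminal rewards: we demonstrated experimentally that the method can train deep feedforward policies using qualitative terminal rewards. To the best of our knowledge, this is the first time that a deep neural network has been trained to perform such complicated tasks given qualitative rewards, e.g., good or excellent, provided only at the end of each trial.
3. Generalizable policy training: our experiments showed that, by introducing auxiliary visual tasks, a generalizable policy can be obtained that applies not only to the object instances it was trained on but also to other instances of the same object class. As an example, we showed that a policy trained to pour into a cup can perform the same task with a variety of cups it has never seen.
4. Transferring learning from simulation to the real robot: we validated experimentally our hypothesis that the generative model can be trained in simulation and the learned model transferred to the real robot. Although the dynamic behavior of the simulated robot differs from that of the real robot, the generative model can still be used by the real robot, and the discrepancy is compensated by training the policy super-layer with real-robot data.

In summary, the experimental results presented in [141, 142] demonstrate the applicability of the proposed method for training deep feedforward policies from high-dimensional visual input given only qualitative terminal rewards. This makes it possible to acquire new visuomotor tasks within a few hours with limited expertise in robot programming.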

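For more details, please refer to the original thesis.

If you have read this far, feel free to send your résumé or add me on WeChat to share your thoughts on the approach of this thesis.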


Originally published: 2018-05-29. In case of infringement, please contact cloudcommunity@tencent.com for removal.


This article is shared from the CreateAMind WeChat official account.
