为你分享73篇论文解决深度强化学习的18个关键问题

来源:PaperWeekly

作者:王凌霄

本文共2434字,建议阅读5分钟。 本文为大家分享了73篇论文,介绍深度学习的方法策略以及关键问题分析。

这两天我阅读了两篇篇猛文 A Brief Survey of Deep Reinforcement Learning 和 Deep Reinforcement Learning: An Overview,作者排山倒海的引用了 200 多篇文献,阐述强化学习未来的方向。

论文:A Brief Survey of Deep Reinforcement Learning 作者:Nevertiree链接:http://www.paperweekly.site/papers/922

论文:Deep Reinforcement Learning: An Overview 作者:Nevertiree 链接:http://www.paperw2eekly.site/papers/137

原文归纳出深度强化学习中的常见科学问题,并列出了目前解法与相关综述,我在这里做出整理,抽取了相关的论文。 这里精选 18 个关键问题,涵盖空间搜索、探索利用、策略评估、内存使用、网络设计、反馈激励等等话题。

本文精选了 73 篇论文(其中 2017 年的论文有 27 篇,2016 年的论文有 21 篇),为了方便阅读,原标题放在文章最后,可以根据索引找到。

问题一:预测与策略评估

prediction, policy evaluation

万变不离其宗,Temporal Difference 方法仍然是策略评估的核心哲学【Sutton 1988】。TD的拓展版本和她本身一样鼎鼎大名—1992 年的 Q-learning 与 2015 年的 DQN。

美中不足,TD Learning 中很容易出现 Over-Estimate(高估)问题,具体原因如下:

The max operator in standard Q-learning and DQN use the same values both to select and to evaluate an action. — van Hasselt

旷世猛将 van Hasselt 先生很喜欢处理 Over-Estimate 问题,他先搞出一个 Double Q-learning【van Hasselt 2010】大闹 NIPS,六年后搞出深度学习版本的 Double DQN【van Hasselt 2016a】。

问题二:控制与最佳策略选择

control, finding optimal policy

目前解法有三个流派,一图胜千言:

图1:台大李宏毅教授的 Slide

最传统的方法是 Value-Based,就是选择有最优 Value 的 Action。最经典方法有:Q-learning 【Watkins and Dayan 1992】、SARSA 【Sutton and Barto 2017】。

后来 Policy-Based 方法引起注意,最开始是 REINFORCE 算法【Williams 1992】,后来策略梯度 Policy Gradient【Sutton 2000】出现。

最时行的 Actor-Critic 【Barto et al 1983】把两者做了结合。楼上 Sutton 老爷子的好学生、AlphaGo 的总设计师 David Silver 同志提出了 Deterministic Policy Gradient,表面上是 PG,实际讲了一堆 AC,这个改进史称 DPG【Silver 2014】。

图2:Actor-Critic 的循环促进过程

问题三:不稳定与不收敛问题

Instability and Divergence when combining off-policy,function approximation,bootstrapping

早在 1997 年 Tsitsiklis 就证明了如果 Function Approximator 采用了神经网络这种非线性的黑箱,那么其收敛性和稳定性是无法保证的。

分水岭论文 Deep Q-learning Network【Mnih et al 2013】中提到:虽然我们的结果看上去很好,但是没有任何理论依据(原文很狡猾的反过来说一遍)。

This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in stable manner.

图3:征服 Atari 游戏的 DQN

DQN 的改良主要依靠两个 Trick:

  • 经验回放【Lin 1993】 虽然做不到完美的独立同分布,但还是要尽力减少数据之间的关联性 。
  • Target Network【Mnih 2015】 Estimated Network 和 Target Network 不能同时更新参数,应该另设 Target Network 以保证稳定性。 Since the network Q being updated is also used in calculating the target value, the Q update is prone to divergence.(为什么我们要用 Target Network)

下面几篇论文都是 DQN 相关话题的:

  • 经验回放升级版:Prioritized Experience Replay 【Schaul 2016】
  • 更好探索策略 【Osband 2016】
  • DQN 加速 【He 2017a】
  • 通过平均减少方差与不稳定性 Averaged-DQN 【Anschel 2017】

下面跳出 DQN 的范畴:

  • Duel DQN【Wang 2016c】(ICML 2016 最佳论文)
  • Tips:阅读此文请掌握 DQN、Double DQN、Prioritized Experience Replay 这三个背景。
  • 异步算法 A3C 【Mnih 2016】
  • TRPO (Trust Region Policy Optimization)【Schulman 2015】
  • Distributed Proximal Policy Optimization 【Heess 2017】
  • Policy gradient 与 Q-learning 的结合【O'Donoghue 2017、Nachum 2017、 Gu 2017、Schulman 2017】
  • GTD 【Sutton 2009a、Sutton 2009b、Mahmood 2014】
  • Emphatic-TD 【Sutton 2016】

问题四:End-to-End 下的训练感知与控制

train perception and control jointly end-to-end

现有解法:

  • Guided Policy Search 【Levine et al 2016a】

问题五:数据利用效率

data/sample efficiency

现有解法有:

  • Q-learning 与 Actor-Critic
  • 经验回放下的actor-critic 【Wang et al 2017b】
  • PGQ,policy gradient and Q-learning 【O'Donoghue et al 2017】
  • Q-Prop, policy gradient with off-policy critic 【Gu et al 2017】
  • return-based off-policy control, Retrace 【Munos et al 2016】, Reactor 【Gruslyset al 2017】
  • learning to learn, 【Duan et al 2017、Wang et al 2016a、Lake et al 2015】

问题六:无法取得激励

reward function not available

现有解法基本上围绕模仿学习:

  • 吴恩达的逆强化学习【Ng and Russell 2000】
  • learn from demonstration 【Hester et al 2017】
  • imitation learning with GANs 【Ho and Ermon 2016、Stadie et al 2017】 (附TensorFlow 实现 [1])
  • train dialogue policy jointly with reward model 【Su et al 2016b】

问题七:探索-利用问题

exploration-exploitation tradeoff

现有解法有:

  • unify count-based exploration and intrinsic motivation 【Bellemare et al 2017】
  • under-appreciated reward exploration 【Nachum et al 2017】
  • deep exploration via bootstrapped DQN 【Osband et al 2016)】
  • variational information maximizing exploration 【Houthooft et al 2016】

问题八:基于模型的学习

model-based learning

现有解法:

  • Sutton 老爷子教科书里的经典安利:Dyna-Q 【Sutton 1990】
  • model-free 与 model-based 的结合使用【Chebotar et al 2017】

问题九:无模型规划

model-free planning

比较新的解法有两个:

  • Value Iteration Networks【Tamar et al 2016】是勇夺 NIPS2016 最佳论文头衔的猛文。 知乎上有专门的文章解说:Value iteration Network [2],还有作者的采访:NIPS 2016 最佳论文作者:如何打造新型强化学习观 [3]。VIN 的 TensorFlow 实现 [4]。

图4:Value Iteration Network 的框架

  • DeepMind 的 Silver 大神发表的 Predictron 方法 【Silver et al 2016b】,附 TensorFlow 实现 [5]。

问题十:它山之石可以攻玉

focus on salient parts

@贾扬清 大神曾经说过:伯克利人工智能方向的博士生,入学一年以后资格考试要考这几个内容:强化学习和 Robotics、 统计和概率图模型、 计算机视觉和图像处理、 语音和自然语言处理、 核方法及其理论、 搜索,CSP,逻辑,Planning 等。

如果真的想做人工智能,建议都了解一下,不是说都要搞懂搞透,但是至少要达到开会的时候和人在 poster 前面谈笑风生不出错的程度吧。

因此,一个很好的思路是从计算机视觉与自然语言处理领域汲取灵感,例如下文中将会提到的 unsupervised auxiliary learning 方法借鉴了 RNN+LSTM 中的大量操作。

下面是 CV 和 NLP 方面的几个简介:物体检测 【Mnih 2014】、机器翻译 【Bahdanau 2015】、图像标注【Xu 2015】、用 Attention 代替 CNN 和 RNN【Vaswani 2017】等等。

问题十一:长时间数据储存

data storage over long time, separating from computation

最出名的解法是:

  • 在 Nature 上大秀一把的 Differentiable Neural Computer【Graves et al 2016】

问题十二:无回报训练

benefit from non-reward training signals in environments

现有解法围绕着无监督学习开展:

  • Horde 【Sutton et al 2011】
  • 极其优秀的工作:
  • unsupervised reinforcement and auxiliary learning 【Jaderberg et al 2017】
  • learn to navigate with unsupervised auxiliary learning 【Mirowski et al 2017】
  • 大名鼎鼎的 GANs 【Goodfellow et al 2014】

问题十三:跨领域学习

learn knowledge from different domains

现有解法全部围绕迁移学习走:

  • 【Taylor and Stone, 2009、Pan and Yang 2010、Weiss et al 2016】,learn invariant features to transfer skills 【Gupta et al 2017】

问题十四:有标签数据与无标签数据混合学习

benefit from both labelled and unlabelled data

现有解法全部围绕半监督学习:

  • 【Zhu and Goldberg 2009】
  • learn with MDPs both with and without reward functions 【Finn et al 2017)】
  • learn with expert's trajectories and those may not from experts 【Audiffren et al 2015】

问题十五:多层抽象差分空间的表示与推断

learn, plan, and represent knowledge with spatio-temporal abstraction at multiple levels

现有解法:

  • 多层强化学习 【Barto and Mahadevan 2003】
  • strategic attentive writer to learn macro-actions 【Vezhnevets et al 2016】
  • integrate temporal abstraction with intrinsic motivation 【Kulkarni et al 2016】
  • stochastic neural networks for hierarchical RL 【Florensa et al 2017】
  • lifelong learning with hierarchical RL 【Tessler et al 2017】

问题十六:不同任务环境快速适应

adapt rapidly to new tasks

现有解法基本上是 learn to learn learn:

  • a flexible RNN model to handle a family of RL tasks 【Duan et al 2017、Wang et al 2016a】
  • one/few/zero-shot learning 【Duan et al 2017、Johnson et al 2016、 Kaiser et al 2017b、Koch et al 2015、Lake et al 2015、Li and Malik 2017、Ravi and Larochelle, 2017、Vinyals et al 2016】

问题十七:巨型搜索空间

gigantic search space

现有解法:

  • 依然是蒙特卡洛搜索,详情可以参考初代 AlphaGo 的实现【Silver et al 2016a】

问题十八:神经网络架构设计

neural networks architecture design

现有的网络架构搜索方法:

  • 【Baker et al 2017、Zoph and Le 2017】,其中 Zoph 的工作分量非常重。

新的架构:

  • 有【Kaiser et al 2017a、Silver et al 2016b、Tamar et al 2016、Vaswani et al 2017、Wang et al 2016c】。

相关链接

[1] imitation learning with GANs 实现

https://github.com/openai/imitation

[2] Value iteration Network

https://zhuanlan.zhihu.com/p/24478944

[3] 如何打造新型强化学习观

http://www.sohu.com/a/121100017_465975

[4] Value Iteration Networks 实现

https://github.com/TheAbhiKumar/tensorflow-value-iteration-networks

[5] Predictron 实现

https://github.com/zhongwen/predictron

参考文献

[1] Anschel, O., Baram, N., and Shimkin, N. (2017). Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[2] Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. (2015). Maximum entropy semisupervised inverse reinforcement learning. In the International Joint Conference on Artificial Intelligence (IJCAI).

[3] Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In the International Conference on Learning Representations (ICLR).

[4] Baker, B., Gupta, O., Naik, N., and Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In the International Conference on Learning Representations (ICLR).

[5] Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.

[6] Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846

[7] Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S.,Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv e-prints.

[8] Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning. In the International Conference on Machine Learning (ICML)

[9] Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J.,Sutskever, I., Abbeel, P., and Zaremba, W. (2017). One-Shot Imitation Learning. ArXiv e-prints.

[10] Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between GANs, inverse reinforcement learning, and energy-based models. In NIPS 2016 Workshop on Adversarial Training.

[11] Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. In the International Conference on Learning Representations (ICLR)

[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., , and Bengio, Y. (2014). Generative adversarial nets. In the Annual Conference on Neural Information Processing Systems (NIPS), page 2672?2680.

[13] Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Col- ´ menarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., nech Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476

[14] Gruslys, A., Gheshlaghi Azar, M., Bellemare, M. G., and Munos, R. (2017). The Reactor: A Sample-Efficient Actor-Critic Architecture. ArXiv e-prints

[15] Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sampleefficient policy gradient with an off-policy critic. In the International Conference on Learning Representations (ICLR).

[16] Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant feature spaces to transfer skills with reinforcement learning. In the International Conference on Learning Representations (ICLR).

[17] He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2017a). Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In the International Conference on Learning Representations (ICLR)

[18] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., and Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. ArXiv e-prints

[19] Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence, 247:170–86.

[20] Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[21] Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In the Annual Conference on Neural Information Processing Systems (NIPS).

[22] Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In the International Conference on Learning Representations (ICLR).

[23] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viegas, F., Watten- ´berg, M., Corrado, G., Hughes, M., and Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. ArXive-prints.

[24] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017a). One Model To Learn Them All. ArXiv e-prints.

[25] Kaiser, Ł., Nachum, O., Roy, A., and Bengio, S. (2017b). Learning to Remember Rare Events. In the International Conference on Learning Representations (ICLR).

[26] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In the International Conference on Machine Learning (ICML).

[27] Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In the Annual Conference on Neural Information Processing Systems (NIPS)

[28] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

[29] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016a). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17:1–40.

[30] Li, K. and Malik, J. (2017). Learning to optimize. In the International Conference on Learning Representations (ICLR).

[31] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., & Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. Computer Science, 8(6), A187.

[32] Lin, L. J. (1993). Reinforcement learning for robots using neural networks.

[33] Mahmood, A. R., van Hasselt, H., and Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[34] Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. (2017). Learning to navigate in complex environments. In the International Conference on Learning Representations (ICLR).

[35] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wier- stra, Daan, and Riedmiller, Martin. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[36] Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In the Annual Conference on Neural Information Processing Systems (NIPS).

[37] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

[38] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In the International Conference on Machine Learning (ICML)

[39] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G.(2016). Safe and efficient offpolicy reinforcement learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[40] Nachum, O., Norouzi, M., and Schuurmans, D. (2017). Improving policy gradient by exploring under-appreciated rewards. In the International Conference on Learning Representations (ICLR).

[41] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. ArXive-prints.

[42] Ng, A. and Russell, S. (2000).Algorithms for inverse reinforcement learning. In the International Conference on Machine Learning (ICML).

[43] O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2017). PGQ: Combining policy gradient and q-learning. In the International Conference on Learning Representations (ICLR).

[44] Osband, I., Blundell, C., Pritzel, A., and Roy, B. V. (2016). Deep exploration via bootstrapped DQN. In the Annual Conference on Neural Information Processing Systems (NIPS).

[45] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345 – 1359.

[46] Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In the International Conference on Learning Representations (ICLR).

[47] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In the International Conference on Learning Representations (ICLR).

[48] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. In the International Conference on Machine Learning (ICML).

[49] Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. ArXiv e-prints.

[50] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. International Conference on International Conference on Machine Learning (pp.387-395). JMLR.org.

[51] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016a). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.

[52] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. (2016b). The predictron: End-to-end learning and planning. In NIPS 2016 Deep Reinforcement Learning Workshop.

[53] Stadie, B. C., Abbeel, P., and Sutskever, I. (2017).Third person imitation learning. In the International Conference on Learning Representations (ICLR).

[54] Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd Edition, in preparation). MIT Press.

[55] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In the Annual Conference on Neural Information Processing Systems

(NIPS).

[56] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, ´E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In the International Conference on Machine Learning (ICML).

[57] Sutton, R. S., Szepesvari, C., and Maei, H. R. (2009b). A convergent O( ´ n) algorithm for off-policy temporal-difference learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[58] Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, , proc. of 10th. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

[59] Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17:1–29

[60] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning,3(1):9–44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the International Conference on Machine Learning (ICML).

[61] Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In the Annual Conference on Neural Information Processing Systems (NIPS).

[62] Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685.

[63] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2017). A deep hierarchical approach to lifelong learning in minecraft. In the AAAI Conference on Artificial Intelligence (AAAI).

[64] van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems 23:, Conference on Neural Information Processing Systems 2010.

[65] van Hasselt, H., Guez, A., , and Silver, D. (2016a). Deep reinforcement learning with double Qlearning. In the AAAI Conference on Artificial Intelligence (AAAI).

[66] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. ArXiv e-prints.

[67] Vezhnevets, A. S., Mnih, V., Agapiou, J., Osindero, S., Graves, A., Vinyals, O., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions. In the Annual Conference on Neural Information Processing Systems (NIPS).

[68] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[69] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016a). Learning to reinforcement learn. arXiv:1611.05763v1.

[70] Wang, S. I., Liang, P., and Manning, C. D. (2016b). Learning language games through interaction. In the Association for Computational Linguistics annual meeting (ACL)

[71] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016c). Dueling network architectures for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[72] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292

[73] Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(9)

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

[74] Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A.,Salakhutdinov, R., Zemel, R. S., and Bengio,Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In the International Conference on Machine Learning (ICML).

[75] Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Morgan & Claypool

Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In the International Conference on Learning Representations (ICLR)

END.

来源:数据派THU

原文发布于微信公众号 - PPV课数据科学社区(ppvke123)

原文发表时间:2018-04-27

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏专知

【论文推荐】最新六篇目标跟踪相关论文—双重Siamese网络、判别性相关滤波、多目标跟踪、深度多尺度时空判别性、综述、显著性增强

【导读】专知内容组整理了最近六篇目标跟踪(Object Tracking)相关文章,为大家进行介绍,欢迎查看! 1. A Twofold Siamese Net...

66780
来自专栏专知

【专知荟萃23】深度强化学习RL知识资料全集(入门/进阶/论文/综述/代码/专家,附查看)

【AlphaGoZero核心技术】深度强化学习专知荟萃 【AlphaGoZero核心技术】深度强化学习专知荟萃 基础入门 进阶文章 Papers Papers ...

76090
来自专栏专知

【专知荟萃01】深度学习知识资料大全集(入门/进阶/论文/代码/数据/综述/领域专家等)(附pdf下载)

【导读】主题荟萃知识是专知的核心功能之一,为用户提供AI领域系统性的知识学习服务。主题荟萃为用户提供全网关于该主题的精华(Awesome)知识资料收录整理,使得...

51070
来自专栏数据派THU

为你分享73篇论文解决深度强化学习的18个关键问题

来源:PaperWeekly 作者:王凌霄 本文共2434字,建议阅读5分钟。 本文为大家分享了73篇论文,介绍深度学习的方法策略以及关键问题分析。 这两天我阅...

32190
来自专栏目标检测和深度学习

全球最全计算机视觉资料(6:问答|边缘检测|姿态估计|图像生成)

15520
来自专栏专知

【专知荟萃06】计算机视觉CV知识资料大全集(入门/进阶/论文/课程/会议/专家等)(附pdf下载)

【导读】主题荟萃知识是专知的核心功能之一,为用户提供AI领域系统性的知识学习服务。主题荟萃为用户提供全网关于该主题的精华(Awesome)知识资料收录整理,使得...

1.6K140
来自专栏专知

【论文推荐】最新八篇机器翻译相关论文—自注意力残差解码器、条件序列生成式对抗网络、检索译文、域自适应、细粒度注意力机制

30340
来自专栏新智元

ICLR 2017深度学习(提交)论文汇总:NLP、无监督学习、自动编码、RL、RNN(150论文下载)

【新智元导读】ICLR 2017 将于2017年4月24日至26日在法国土伦(toulon)举行,11月4日已经停止接收论文。本文汇总了本年度NLP、无监督学习...

505100
来自专栏专知

【论文推荐】最新六篇机器翻译相关论文— 自注意力残差解码器、SGNMT、级联方法、神经序列预测、Benchmark、人类水平

【导读】专知内容组整理了最近六篇机器翻译(Machine Translation)相关文章,为大家进行介绍,欢迎查看! 1.Self-Attentive Res...

39370
来自专栏AI研习社

126篇殿堂级深度学习论文分类整理 从入门到应用(上)

█ 如果你有非常大的决心从事深度学习,又不想在这一行打酱油,那么研读大牛论文将是不可避免的一步。而作为新人,你的第一个问题或许是:“论文那么多,从哪一篇读起?”...

37080

扫码关注云+社区

领取腾讯云代金券