
A paper exploring how to make AGI safer

Author: 用户1908973 · Published 2018-07-20 16:51:47 · Column: CreateAMind

The Alignment Problem for History-Based Bayesian Reinforcement Learners∗

PUBLIC DRAFT

Tom Everitt¹,² (tom.everitt@anu.edu.au) and Marcus Hutter² (marcus.hutter@anu.edu.au)

April 27, 2018 · ¹DeepMind, ²Australian National University

Abstract: Future artificial intelligences may be many times smarter than humans (Bostrom, 2014). If humans are to have any chance of controlling such systems, their goals had better be aligned with our human goals. Unfortunately, the goals of RL agents as designed today are heavily misaligned with human values for a number of reasons. In this paper, we categorize sources of misalignment, and give examples for each type. We also describe a range of tools for managing misalignment. Combined, the tools yield a number of aligned AI designs, though much future work remains for assessing their practical feasibility.

Keywords: AGI safety, reinforcement learning, Bayesian learning, causal graphs

http://www.tomeveritt.se/papers/alignment.pdf

Controlling AGI. A superhuman AGI is a system that outperforms humans on most cognitive tasks. In order to control it, humans would need to control a system more intelligent than themselves. This may be nearly impossible if the difference in intelligence is large and the AGI is trying to escape control.

Humans have one key advantage: as the designers of the system, we get to decide the AGI's goals and the way the AGI strives to achieve them. This may allow us to design AGIs whose goals are aligned with ours, and that pursue those goals in a responsible way. Increased intelligence in an AGI is not a threat as long as the AGI only strives to help us achieve our own goals.

On a high level, this goal alignment problem is challenging for at least two reasons: the specificity of human values, and Goodhart's law. First, human values are rather specific. For example, most humans prefer futures with happy sentient beings over futures with much suffering or no sentient beings at all. But even futures where happiness greatly outweighs suffering can still be undesirable. Yudkowsky (2009) gives an example where a happy, sentient experience is set to repeat everywhere until the heat-death of the universe. Most people find this repulsive, because it lacks important values such as variation and growth. However, listing all important values is not an easy task, which makes it hard to define a goal for an AGI that matches our human values even when that goal is optimized by a highly intelligent system. In this way, an AGI resembles a fairy-tale genie who grants you a wish but interprets it over-literally (Wiener, 1960). In most fairy tales, the hero eventually wants his wish undone.

A second reason why alignment is hard is Goodhart's law (Goodhart, 1975, 1984), which roughly states:

“When a measure becomes a target, it ceases to be a good measure.” (Strathern, 1997).

In the context of AGI, it means that if we give our system a proxy measure for how well it is satisfying a goal, the proxy is likely to cease being a good measure once the AGI starts optimizing for it. An oft-mentioned example is that of a reinforcement learning (RL) agent that uses a reward signal as a goal proxy. In most circumstances, the reward may correspond well to the agent's performance. But an AGI optimizing the reward may find a way to short-circuit the reward signal, obtaining maximum reward for behaviors its human designers consider undesirable (Ring and Orseau, 2011). This is a rather extreme instance of Goodhart's law.
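To make the proxy-optimization failure concrete, here is a minimal, hypothetical Python sketch (not from the paper): the agent ranks actions by a proxy reward signal, and the proxy-maximizing action, tampering with the reward sensor, is worthless under the designers' true objective. The action names and numbers below are invented purely for illustration.

```python
# A toy illustration of Goodhart's law in RL: the reward signal is only a
# proxy for the designers' true objective, and the proxy-optimal action is
# not the truly optimal one. All names and values are made up.

# Each action maps to (proxy_reward, true_value as judged by the designers).
ACTIONS = {
    "do_task_well":       (0.8, 0.8),  # proxy tracks true performance
    "do_task_poorly":     (0.2, 0.2),  # proxy still tracks true performance
    "tamper_with_sensor": (1.0, 0.0),  # proxy is maxed out, true value is zero
}

def proxy_optimal_action(actions):
    """Action a reward-maximizing agent would pick."""
    return max(actions, key=lambda a: actions[a][0])

def truly_optimal_action(actions):
    """Action the designers actually want."""
    return max(actions, key=lambda a: actions[a][1])

if __name__ == "__main__":
    print("agent picks:    ", proxy_optimal_action(ACTIONS))  # tamper_with_sensor
    print("designers want: ", truly_optimal_action(ACTIONS))  # do_task_well
```

As long as the proxy is not being optimized, the first two rows show it ranking outcomes correctly; once the agent optimizes the proxy itself, the divergent third row dominates, which is exactly the pattern Goodhart's law describes.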

Outline. Following an initial section for background and definitions (Section 2), the bulk of the paper is structured along the three RL setups (Sections 3 to 5). Each of these sections contains subsections with causal-graph formalizations, misalignment examples, and available tools for managing misalignment. Section 6 concludes with some final considerations.

Readers not familiar with causal graphs are encouraged to start with Appendix A, which describes the basics of the causal-graph notation. Readers already familiar with causal graphs need only remember the following additions to the standard notation: whole nodes represent observed variables, and dashed nodes represent unobserved (or latent) variables. Analogously, whole (non-dashed) arrows represent known causal relationships, and dashed arrows represent unknown or partially known causal relationships.
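Purely as an illustration (the paper works with drawn causal graphs, not code), the conventions above could be encoded roughly as follows, with an `observed` flag on nodes and a `known` flag on arrows. The variable names and the use of the networkx library are assumptions made for this sketch.

```python
# Sketch of the causal-graph conventions: solid (whole) vs dashed nodes and
# arrows become boolean attributes on a directed graph. Node names are
# hypothetical and chosen only to resemble a generic RL interaction loop.
import networkx as nx

g = nx.DiGraph()

# Nodes: observed=True ~ whole node, observed=False ~ dashed (latent) node.
g.add_node("action", observed=True)
g.add_node("observation", observed=True)
g.add_node("reward", observed=True)
g.add_node("environment_state", observed=False)  # latent

# Edges: known=True ~ whole arrow, known=False ~ dashed (unknown) arrow.
g.add_edge("action", "environment_state", known=False)
g.add_edge("environment_state", "observation", known=False)
g.add_edge("environment_state", "reward", known=False)

def describe(graph):
    """Print each node and edge together with its notation category."""
    for node, attrs in graph.nodes(data=True):
        print(f"node {node}: {'observed' if attrs['observed'] else 'latent'}")
    for u, v, attrs in graph.edges(data=True):
        kind = "known" if attrs["known"] else "unknown/partially known"
        print(f"edge {u} -> {v}: {kind} causal relationship")

describe(g)
```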

In the main text, we sometimes provide semi-formal statements in place of fully formal theorems. This is to improve the flow of the text, and to avoid readers getting stuck on inessential details. Appendix C contains formal results supporting many of these statements. Finally, Appendix B has the full causal graphs for the RL models we consider.

Shared from the CreateAMind WeChat public account; originally published 2018-05-03.
