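Self-modification by agents embedded in complex environments is hard to avoid, whether it happens directly (e.g., an agent modifying its own code) or indirectly (e.g., influencing the operator, or exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function, so that their future instances keep working toward the same goals, but it is unclear whether this also holds in non-dualistic scenarios, where the agent is embedded in the environment. Bostrom raised the problem of self-modification safety in Superintelligence (2014) in the context of safe AGI deployment. Everitt et al. (2016) formally show that offering an option to self-modify is harmless for perfectly rational agents; in contrast, we show that for agents with bounded rationality, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of the imperfections in the agent's rationality (items 1-4 in the next sentence). We also discuss the model assumptions and the wider problem and framing space. Specifically, we introduce several types of bounded-rational agent, which either (1) does not always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that in cases (2)-(4) the misalignment caused by the agent's imperfection does not worsen over time, whereas in case (1) the misalignment may grow exponentially.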
Original title: Performance of Bounded-Rational Agents With the Ability to Self-Modify
Original abstract: Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). While it has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances will work towards the same goals, it is not clear whether this also applies in non-dualistic scenarios, where the agent is embedded in the environment. The problem of self-modification safety is raised by Bostrom in Superintelligence (2014) in the context of safe AGI deployment.
In contrast to Everitt et al. (2016), who formally show that providing an option to self-modify is harmless for perfectly rational agents, we show that for agents with bounded rationality, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent’s rationality (1-4 below). We also discuss model assumptions and the wider problem and framing space.
Specifically, we introduce several types of a bounded-rational agent, which either (1) doesn’t always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that while in the cases (2)-(4) the misalignment caused by the agent’s imperfection does not worsen over time, with (1) the misalignment may grow exponentially.
Original authors: Jakub Tětek, Marek Sklenka
Original URL: https://arxiv.org/abs/2011.06275