DeepMind's Dreamer Stumbles on This Task

CreateAMind

Published 2023-09-01 08:23:01

Included in the column: CreateAMind

https://github.com/jurgisp/memory-maze

Memory Maze

Memory Maze is a 3D domain of randomized mazes designed for evaluating the long-term memory abilities of RL agents. Memory Maze isolates long-term memory from confounding challenges, such as exploration, and requires the agent to remember several pieces of information: the positions of objects, the wall layout, and the agent's own position.

Key features:

Online RL memory tasks (with baselines)
Offline dataset for representation learning (with baselines)
Verified that memory is the key challenge
Challenging but solvable by human baseline
Easy installation via a simple pip command
Available gym and dm_env interfaces
Supports headless and hardware rendering
Interactive GUI for human players
Hidden state information for probe evaluation

Task Description

The task is based on a game known as scavenger hunt or treasure hunt:

The agent starts in a randomly generated maze, which contains several objects of different colors.
The agent is prompted to find the target object of a specific color, indicated by the border color in the observation image.
Once the agent successfully finds and touches the correct object, it gets a +1 reward and the next random object is chosen as a target.
If the agent touches an object of the wrong color, there is no effect.
Throughout the episode, the maze layout and the locations of the objects do not change.
The episode continues for a fixed amount of time, so the total episode reward equals the number of reached targets.
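
The reward rules above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Memory Maze implementation: the class name ToyScavengerHunt, its parameters, and the choice to draw the next target from the other objects are all hypothetical, and navigation, rendering, and the maze itself are left out. It only shows why the total episode reward equals the number of targets reached.

```python
import random


class ToyScavengerHunt:
    """Toy model of the reward rules only (hypothetical, not the real environment)."""

    def __init__(self, colors, episode_steps, seed=0):
        if len(set(colors)) < 2:
            raise ValueError("need at least two distinct object colors")
        self.colors = list(colors)          # object set is fixed for the whole episode
        self.episode_steps = episode_steps  # fixed episode length
        self.rng = random.Random(seed)
        self.target = self.rng.choice(self.colors)
        self.steps = 0
        self.total_reward = 0

    def step(self, touched_color=None):
        """Advance one step; touched_color is None when no object is touched."""
        if self.steps >= self.episode_steps:
            raise RuntimeError("episode is over")
        self.steps += 1
        reward = 0
        if touched_color == self.target:
            reward = 1
            # Assumption: the next target is drawn from the other objects.
            others = [c for c in self.colors if c != self.target]
            self.target = self.rng.choice(others)
        # Touching a wrong color (or nothing) has no effect.
        self.total_reward += reward
        done = self.steps >= self.episode_steps
        return reward, done


# Usage: an oracle that touches the current target on every step.
env = ToyScavengerHunt(["red", "green", "blue"], episode_steps=10, seed=1)
done = False
while not done:
    _, done = env.step(env.target)
print(env.total_reward)  # equals the number of targets reached
```

In the real task the agent cannot reach a target on every step, because it has to navigate to the object first. An agent that remembers where each object is can take shorter routes and so reach more targets within the fixed time limit.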

Abstract:

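Intelligent agents need to remember important information in order to reason in partially observed environments. For example, an agent with a first-person view should remember the positions of relevant objects even when they are out of view. Similarly, to navigate rooms effectively, an agent needs to remember the floor plan of how the rooms are connected. However, most benchmark tasks in reinforcement learning do not test the long-term memory of agents, which slows progress in this important research direction. In this paper, we introduce Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating the long-term memory of agents. Unlike existing benchmarks, Memory Maze measures long-term memory separately from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording human players establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on large mazes, leaving room for future algorithm designs to be evaluated on Memory Maze.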