RLlib paper: https://www.groundai.com/project/ray-rllib-a-framework-for-distributed-reinforcement-learning1917/
Scales well to 8192 cores. With 8192 cores, a reward of 6000 is reached in a median time of 3.7 minutes, twice as fast as the best published result.
Large-scale tests: We evaluate the performance of RLlib on Evolution Strategies (ES), Proximal Policy Optimization (PPO), and A3C, comparing against specialized systems built specifically for those algorithms [OpenAI (2017); Hesse et al. (2017) Hesse, Plappert, Radford, Schulman, Sidor, and Wu; OpenAI (2016)] that use Redis, OpenMPI, and Distributed TensorFlow. The same hyperparameters were used in all experiments (included in the supplementary material). We use TensorFlow to define the neural networks for the RLlib algorithms evaluated.
Generalization to complex architectures. Ape-X: Ape-X [Horgan et al. (2018) Horgan, Quan, Budden, Barth-Maron, Hessel, van Hasselt, and Silver] is a variant of DQN that leverages distributed experience prioritization to scale to hundreds of cores. To adapt our DQN implementation, we had to add experience prioritization to the DQN evaluators (~20 lines of Python), plus a new high-throughput policy optimizer (~200 lines) that manages sampling and data transfer between replay buffer actors. Our implementation scales almost linearly up to 160k environment frames per second with 256 workers (Figure b), demonstrating the robustness of the policy optimizer abstraction. By contrast, the Ape-X authors implemented Ape-X as a custom distributed system.

PPO-ES: We experimented with implementing a new RL algorithm that runs PPO updates in the inner loop of an ES optimization step, which randomly perturbs the PPO models. Within an hour we were able to deploy it to a small cluster for evaluation. The implementation required only ~50 lines of code and no modification of the PPO implementation, showing the value of the hierarchical control model. In our experiments (see supplementary material), PPO-ES outperformed base PPO, converging faster and to a higher reward on the Walker2d-v1 task. A similarly modified A3C-ES implementation solved PongDeterministic-v4 in 30% less time than the baseline.
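The paper doesn't show the ~50-line implementation. As a rough illustration only, here is a toy numpy sketch (not RLlib code) of the control flow described above: an ES outer loop perturbs the model weights, and a PPO-style update runs in the inner loop before each evaluation. evaluate() and ppo_update() are hypothetical stand-ins.

import numpy as np

def evaluate(weights):
    # Stand-in for an episode rollout returning total reward.
    return -np.sum(weights ** 2)

def ppo_update(weights):
    # Stand-in for one inner-loop PPO gradient step on the perturbed weights.
    return weights * 0.99

def ppo_es(weights, iterations=100, population=8, sigma=0.1, lr=0.01):
    for _ in range(iterations):
        noise = np.random.randn(population, weights.size)
        returns = np.zeros(population)
        for i in range(population):
            perturbed = weights + sigma * noise[i]   # ES perturbation
            perturbed = ppo_update(perturbed)        # inner-loop PPO step
            returns[i] = evaluate(perturbed)
        # Standard ES update from normalized returns.
        advantage = (returns - returns.mean()) / (returns.std() + 1e-8)
        weights = weights + lr / (population * sigma) * noise.T.dot(advantage)
    return weights

weights = ppo_es(np.random.randn(10))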
------------------
Ray documentation:
Second, you can connect to and use an existing Ray cluster: keep a Ray cluster running and submit tasks to it on demand (a minimal connection sketch follows the links below).
http://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html
http://ray.readthedocs.io/en/latest/using-ray-on-a-large-cluster.html
http://ray.readthedocs.io/en/latest/using-ray-and-docker-on-a-cluster.html
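A minimal sketch of attaching a driver to an already-running cluster; the address below is hypothetical and should be the one printed by running "ray start --head" on the head node:

import ray

# Connect to an existing cluster instead of starting a local scheduler.
# The redis_address is illustrative.
ray.init(redis_address="192.168.0.1:6379")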
Ray Docker: http://ray.readthedocs.io/en/latest/install-on-docker.html

Build Docker images by running the script:

cd ray
./build-docker.sh

This script creates several Docker images:
- The ray-project/deploy image is a self-contained copy of code and binaries suitable for end users.
- The ray-project/examples image adds additional libraries for running examples.
- The ray-project/base-deps image builds from Ubuntu Xenial, includes Anaconda and other basic dependencies, and can serve as a starting point for developers.

Ray cluster autoscaling: http://ray.readthedocs.io/en/latest/autoscaling.html
Autoscaling CPU cluster: use a small head node and have Ray auto-scale workers as needed. This can be a cost-efficient configuration for clusters with bursty workloads. You can also request spot workers for additional cost savings.
min_workers: 0
max_workers: 10
head_node:
    InstanceType: m4.large
worker_nodes:
    InstanceMarketOptions:
        MarketType: spot
    InstanceType: m4.16xlarge
Autoscaling GPU cluster: similar to the autoscaling CPU cluster, but with GPU worker nodes instead.
min_workers: 0
max_workers: 10
head_node:
    InstanceType: m4.large
worker_nodes:
    InstanceMarketOptions:
        MarketType: spot
    InstanceType: p2.xlarge
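To launch either cluster, save the YAML to a file such as cluster.yaml and start the autoscaler from your own machine; depending on the Ray version, the command is "ray up cluster.yaml" (newer releases) or "ray create_or_update cluster.yaml" (older releases).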
env:
A multi-agent environment is one which has multiple acting entities per step, e.g., in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment. The model for multi-agent in RLlib is as follows: (1) as a user you define the number of policies available up front, and (2) a function that maps agent ids to policy ids. A sketch of the environment interface follows below.
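A minimal sketch of such an environment, assuming the MultiAgentEnv base class from ray.rllib.env; the agent ids, observations, and rewards are illustrative dummies (a real env would also define per-agent observation and action spaces):

from ray.rllib.env import MultiAgentEnv

class TrafficEnv(MultiAgentEnv):
    # Toy two-car, one-light env showing the per-agent dict interface.

    def reset(self):
        self.steps = 0
        # Observations are returned as a dict keyed by agent id.
        return {"car_1": 0.0, "car_2": 0.0, "traffic_light_1": 0.0}

    def step(self, action_dict):
        # action_dict maps agent id -> that agent's action for this step.
        self.steps += 1
        obs = {agent_id: 0.0 for agent_id in action_dict}
        rewards = {agent_id: 1.0 for agent_id in action_dict}
        dones = {agent_id: False for agent_id in action_dict}
        dones["__all__"] = self.steps >= 100  # ends the episode for everyone
        return obs, rewards, dones, {}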
If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:
trainer = pg.PGAgent(env="my_multiagent_env", config={
    "multiagent": {
        "policy_graphs": {
            "car1": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.85}),
            "car2": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.99}),
            "traffic_light": (PGPolicyGraph, tl_obs_space, tl_act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id:
            # Traffic lights are always controlled by this policy.
            "traffic_light" if agent_id.startswith("traffic_light_")
            # Otherwise, randomly choose from the car policies.
            else random.choice(["car1", "car2"]),
    },
})

while True:
    print(trainer.train())
For more advanced usage, e.g., different classes of policies per agent, or more control over the training process, you can use the lower-level RLlib APIs directly to define custom policy graphs or algorithms.
Different agents can be trained with different policies.
http://ray.readthedocs.io/en/latest/tutorial.html
To use Ray, you need to understand the following:
Ray is a Python-based distributed execution engine. The same code can be run on a single machine to achieve efficient multiprocessing, and it can be used on a cluster for large computations.
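The timing example below assumes an ordinary function f1 and a remote function f2, roughly as defined in the tutorial:

import time
import ray

ray.init()

def f1():
    # A regular Python function; ten calls run serially in this process.
    time.sleep(1)

@ray.remote
def f2():
    # A remote function; invocations become parallel Ray tasks.
    time.sleep(1)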
# The following takes ten seconds.
[f1() for _ in range(10)]

# The following takes one second (assuming the system has at least ten CPUs).
ray.get([f2.remote() for _ in range(10)])
The Ray web UI includes tools for debugging Ray jobs. The following image shows an example of using the task timeline for performance debugging:
The UI supports a search for additional details on Task IDs and Object IDs, a task timeline, a distribution of task completion times, and time series for CPU utilization and cluster usage.
http://ray.readthedocs.io/en/latest/using-ray-with-gpus.html
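A minimal sketch of reserving GPUs for a task, assuming the machine actually has GPUs; the num_gpus values are illustrative:

import ray

ray.init(num_gpus=4)

@ray.remote(num_gpus=1)
def use_gpu():
    # Ray exposes the ids of the GPUs reserved for this task.
    return ray.get_gpu_ids()

print(ray.get(use_gpu.remote()))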
Actors: parallel vs. serial execution
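The Counter actor used below is the standard example from the docs, roughly:

import ray

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        # Methods on the same actor run one at a time and share self.value.
        self.value += 1
        return self.value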
# Create ten Counter actors.
counters = [Counter.remote() for _ in range(10)]

# Increment each Counter once and get the results. These tasks all happen in
# parallel.
results = ray.get([c.increment.remote() for c in counters])
print(results)  # prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Increment the first Counter five times. These tasks are executed serially
# and share state.
results = ray.get([counters[0].increment.remote() for _ in range(5)])
print(results)  # prints [2, 3, 4, 5, 6]