RLlib paper: https://www.groundai.com/project/ray-rllib-a-framework-for-distributed-reinforcement-learning1917/
Scales well to 8192 cores. With 8192 cores, a reward of 6000 is reached in a median time of 3.7 minutes, twice as fast as the best published result.
Large-scale tests: We evaluate the performance of RLlib on Evolution Strategies (ES), Proximal Policy Optimization (PPO), and A3C, comparing against specialized systems built specifically for those algorithms [OpenAI (2017); Hesse et al. (2017) Hesse, Plappert, Radford, Schulman, Sidor, and Wu; OpenAI (2016)] that use Redis, OpenMPI, and Distributed TensorFlow. The same hyperparameters were used in all experiments (included in the supplementary material). We use TensorFlow to define the neural networks for the RLlib algorithms evaluated.
Generalization to complex architectures. Ape-X: Ape-X [Horgan et al. (2018) Horgan, Quan, Budden, Barth-Maron, Hessel, van Hasselt, and Silver] is a variant of DQN that leverages distributed experience prioritization to scale to hundreds of cores. To adapt our DQN implementation, we had to add experience prioritization to the DQN evaluators (~20 lines of Python), plus a new high-throughput policy optimizer (~200 lines) that manages sampling and data transfer between replay buffer actors. Our implementation scales almost linearly up to 160k environment frames per second with 256 workers (Figure b), demonstrating the robustness of the policy optimizer abstraction. By contrast, the Ape-X authors implemented Ape-X as a custom distributed system.

PPO-ES: We experimented with implementing a new RL algorithm that runs PPO updates in the inner loop of an ES optimization step, which randomly perturbs the PPO models. Within an hour we were able to deploy it to a small cluster for evaluation. The implementation required only ~50 lines of code and no modification of the PPO implementation, showing the value of the hierarchical control model. In our experiments (see supplementary material), PPO-ES outperformed base PPO, converging faster and to a higher reward on the Walker2d-v1 task. A similarly modified A3C-ES implementation solved PongDeterministic-v4 in 30% less time than the baseline.
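The paper doesn't show the ~50-line implementation. As a rough illustration only, here is a toy numpy sketch (not RLlib code) of the control flow described above: an ES outer loop perturbs the model weights, and a PPO-style update runs in the inner loop before each evaluation. evaluate() and ppo_update() are hypothetical stand-ins.

import numpy as np

def evaluate(weights):
    # Stand-in for an episode rollout returning total reward.
    return -np.sum(weights ** 2)

def ppo_update(weights):
    # Stand-in for one inner-loop PPO gradient step on the perturbed weights.
    return weights * 0.99

def ppo_es(weights, iterations=100, population=8, sigma=0.1, lr=0.01):
    for _ in range(iterations):
        noise = np.random.randn(population, weights.size)
        returns = np.zeros(population)
        for i in range(population):
            perturbed = weights + sigma * noise[i]   # ES perturbation
            perturbed = ppo_update(perturbed)        # inner-loop PPO step
            returns[i] = evaluate(perturbed)
        # Standard ES update from normalized returns.
        advantage = (returns - returns.mean()) / (returns.std() + 1e-8)
        weights = weights + lr / (population * sigma) * noise.T.dot(advantage)
    return weights

weights = ppo_es(np.random.randn(10))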
------------------
Ray documentation:
Second, you can connect to and use an existing Ray cluster: keep a Ray cluster running and submit tasks to it on demand (a minimal connection sketch follows the links below).
http://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html
http://ray.readthedocs.io/en/latest/using-ray-on-a-large-cluster.html
http://ray.readthedocs.io/en/latest/using-ray-and-docker-on-a-cluster.html
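A minimal sketch of attaching a driver to an already-running cluster; the address below is hypothetical and should be the one printed by running "ray start --head" on the head node:

import ray

# Connect to an existing cluster instead of starting a local scheduler.
# The redis_address is illustrative.
ray.init(redis_address="192.168.0.1:6379")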
Ray Docker: http://ray.readthedocs.io/en/latest/install-on-docker.html

Build Docker images by running the script:

cd ray
./build-docker.sh

This script creates several Docker images:
- The ray-project/deploy image is a self-contained copy of code and binaries suitable for end users.
- The ray-project/examples image adds additional libraries for running examples.
- The ray-project/base-deps image builds from Ubuntu Xenial, includes Anaconda and other basic dependencies, and can serve as a starting point for developers.

Ray cluster autoscaling: http://ray.readthedocs.io/en/latest/autoscaling.html
Autoscaling CPU cluster: use a small head node and have Ray auto-scale workers as needed. This can be a cost-efficient configuration for clusters with bursty workloads. You can also request spot workers for additional cost savings.
min_workers: 0
max_workers: 10
head_node:
    InstanceType: m4.large
worker_nodes:
    InstanceMarketOptions:
        MarketType: spot
    InstanceType: m4.16xlarge
Autoscaling GPU cluster: similar to the autoscaling CPU cluster, but with GPU worker nodes instead.
min_workers: 0
max_workers: 10
head_node:
    InstanceType: m4.large
worker_nodes:
    InstanceMarketOptions:
        MarketType: spot
    InstanceType: p2.xlarge
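To launch either cluster, save the YAML to a file such as cluster.yaml and start the autoscaler from your own machine; depending on the Ray version, the command is "ray up cluster.yaml" (newer releases) or "ray create_or_update cluster.yaml" (older releases).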
env:
A multi-agent environment is one which has multiple acting entities per step, e.g., in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment. The model for multi-agent in RLlib is as follows: (1) as a user you define the number of policies available up front, and (2) a function that maps agent ids to policy ids. A sketch of the environment interface follows below.
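A minimal sketch of such an environment, assuming the MultiAgentEnv base class from ray.rllib.env; the agent ids, observations, and rewards are illustrative dummies (a real env would also define per-agent observation and action spaces):

from ray.rllib.env import MultiAgentEnv

class TrafficEnv(MultiAgentEnv):
    # Toy two-car, one-light env showing the per-agent dict interface.

    def reset(self):
        self.steps = 0
        # Observations are returned as a dict keyed by agent id.
        return {"car_1": 0.0, "car_2": 0.0, "traffic_light_1": 0.0}

    def step(self, action_dict):
        # action_dict maps agent id -> that agent's action for this step.
        self.steps += 1
        obs = {agent_id: 0.0 for agent_id in action_dict}
        rewards = {agent_id: 1.0 for agent_id in action_dict}
        dones = {agent_id: False for agent_id in action_dict}
        dones["__all__"] = self.steps >= 100  # ends the episode for everyone
        return obs, rewards, dones, {}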
If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:
trainer = pg.PGAgent(env="my_multiagent_env", config={
    "multiagent": {
        "policy_graphs": {
            "car1": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.85}),
            "car2": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.99}),
            "traffic_light": (PGPolicyGraph, tl_obs_space, tl_act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id:
            # Traffic lights are always controlled by this policy.
            "traffic_light" if agent_id.startswith("traffic_light_")
            # Otherwise, randomly choose from the car policies.
            else random.choice(["car1", "car2"]),
    },
})

while True:
    print(trainer.train())
For more advanced usage, e.g., different classes of policies per agent, or more control over the training process, you can use the lower-level RLlib APIs directly to define custom policy graphs or algorithms.
Different agents can be trained with different policies.
http://ray.readthedocs.io/en/latest/tutorial.html
To use Ray, you need to understand the following:
Ray is a Python-based distributed execution engine. The same code can be run on a single machine to achieve efficient multiprocessing, and it can be used on a cluster for large computations.
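The timing example below assumes an ordinary function f1 and a remote function f2, roughly as defined in the tutorial:

import time
import ray

ray.init()

def f1():
    # A regular Python function; ten calls run serially in this process.
    time.sleep(1)

@ray.remote
def f2():
    # A remote function; invocations become parallel Ray tasks.
    time.sleep(1)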
# The following takes ten seconds.
[f1() for _ in range(10)]

# The following takes one second (assuming the system has at least ten CPUs).
ray.get([f2.remote() for _ in range(10)])
The Ray web UI includes tools for debugging Ray jobs. The following image shows an example of using the task timeline for performance debugging:
The UI supports a search for additional details on Task IDs and Object IDs, a task timeline, a distribution of task completion times, and time series for CPU utilization and cluster usage.
http://ray.readthedocs.io/en/latest/using-ray-with-gpus.html
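A minimal sketch of reserving GPUs for a task, assuming the machine actually has GPUs; the num_gpus values are illustrative:

import ray

ray.init(num_gpus=4)

@ray.remote(num_gpus=1)
def use_gpu():
    # Ray exposes the ids of the GPUs reserved for this task.
    return ray.get_gpu_ids()

print(ray.get(use_gpu.remote()))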
Actors: parallel vs. serial execution
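The Counter actor used below is the standard example from the docs, roughly:

import ray

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        # Methods on the same actor run one at a time and share self.value.
        self.value += 1
        return self.value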
# Create ten Counter actors.
counters = [Counter.remote() for _ in range(10)]

# Increment each Counter once and get the results. These tasks all happen in
# parallel.
results = ray.get([c.increment.remote() for c in counters])
print(results)  # prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Increment the first Counter five times. These tasks are executed serially
# and share state.
results = ray.get([counters[0].increment.remote() for _ in range(5)])
print(results)  # prints [2, 3, 4, 5, 6]