I am trying to connect to a remote Ray cluster head node with ray.init(address="{node_external_ip}:6379") so that I can load-test remote procedure calls.
I start the head node with the following command:
ray start --head --node-ip-address <node-external-IP>
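(Note: I specify the head node's external IP because, based on my earlier attempts, the client otherwise cannot connect to the remote cluster at all. The TCP port is the default 6379, and I have double-checked that it is open and reachable.)
After that, although the client appears to connect to the remote cluster successfully: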
Connecting to existing Ray cluster at address: <node-external-IP>:6379...
global_state_accessor.cc:357: This node has an IP address of <client-internal-IP>, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
Connected to Ray cluster.
...it then fails with the following message:
Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: .Please see `raylet.out` for more details.
In turn, raylet.out on the remote cluster side contains the following log entry:
The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
...while dashboard_agent.log shows:
ERROR agent.py:473 -- Agent is working abnormally. It will exit immediately.
(...)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1661955376.270755430","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1661955376.270754305","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
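The dashboard's TCP port is also open and reachable. Starting the head node with the --include-dashboard false CLI option gives the same result, and even the entries in dashboard_agent.log are identical.
In addition, when the command is issued with the --block option, the head node dies a few seconds later with the following message: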
Some Ray subprocesses exited unexpectedly:
raylet [exit code=1]
Remaining processes will be killed.
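The log entries are exactly the same.
I made sure that the client and the remote cluster head node use the same versions of Python and Ray (tested with Ray 1.12.0, 1.13.0, 2.0.0 and Python 3.9.13, 3.10.5).
I also tried specifying _node_ip_address and adding "ray://" when calling ray.init(), but it still fails.
Client OS: Manjaro x86_64, kernel 5.10.136-1-MANJARO.
Remote cluster OS: Ubuntu 20.04 x86_64, kernel 5.13.0-1031-aws (it is an AWS EC2 instance). I also tried deploying the remote cluster on a physical machine with the Manjaro setup mentioned above, and got the same result.
Docker is not used.
How can I solve this?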
Answered on 2022-09-06 22:00:05
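It looks like you are using the GCS server port (6379), but what you probably need is the Ray Client port, 10001. Can you try connecting with ray.init("ray://<address>:10001")?
See the Ray Client documentation for more details: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html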
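A minimal sketch of that suggestion, assuming the Ray Client server on the head node listens on the default port 10001 and that Ray is installed on the client machine. The helper names (ray_client_address, port_is_reachable, connect) are hypothetical and only for illustration; <node-external-IP> is the placeholder from the question.

```python
import socket

# Default Ray Client port named in the answer; an assumption if the
# head node was started with a different client server port.
RAY_CLIENT_PORT = 10001


def ray_client_address(host: str, port: int = RAY_CLIENT_PORT) -> str:
    """Build a ray:// address for Ray Client (hypothetical helper)."""
    return f"ray://{host}:{port}"


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def connect(host: str, port: int = RAY_CLIENT_PORT):
    """Connect through Ray Client instead of the GCS port (6379)."""
    # Imported lazily so the helpers above can be used without Ray installed.
    import ray

    if not port_is_reachable(host, port):
        raise ConnectionError(f"{host}:{port} is not reachable from this machine")
    return ray.init(ray_client_address(host, port))


# Usage (replace the placeholder with the head node's external IP):
# connect("<node-external-IP>")
```

Checking the port first separates a plain firewall or security-group problem from a Ray-level failure, which the logs in the question do not make obvious.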
https://stackoverflow.com/questions/73558532