
torchrun command not found for distributed training: do I need to install it separately?

Stack Overflow user
Asked on 2022-02-03 20:39:27
2 answers · 2.1K views · 0 followers · 1 vote

I was following the torchrun tutorial, but it never says how to install torchrun. Since I already have torch installed, should it be there automatically? Or what is going on?

Output:

(meta_learning_a100) [miranda9@hal-dgx ~]$ torchrun --nnodes=1 --nproc_per_node=2 ~/ultimate-utils/tutorials_for_myself/my_l2l/dist_maml_l2l_from_seba.py
bash: torchrun: command not found...

I am asking because the official PyTorch page seems to suggest using it: https://pytorch.org/docs/stable/elastic/run.html

For example:

TORCHRUN (ELASTIC LAUNCH)
torchrun provides a superset of the functionality of torch.distributed.launch, with the following additional functionalities:

Worker failures are handled gracefully by restarting all workers.

Worker RANK and WORLD_SIZE are assigned automatically.

Number of nodes is allowed to change between minimum and maximum sizes (elasticity).

Transitioning from torch.distributed.launch to torchrun
torchrun supports the same arguments as torch.distributed.launch except for --use_env which is now deprecated. To migrate from torch.distributed.launch to torchrun follow these steps:

If your training script already reads local_rank from the LOCAL_RANK environment variable, then you simply need to omit the --use_env flag, e.g.:

torch.distributed.launch:
$ python -m torch.distributed.launch --use_env train_script.py

torchrun:
$ torchrun train_script.py
If your training script reads the local rank from a --local_rank cmd argument, change it to read from the LOCAL_RANK environment variable as demonstrated by the following code snippet:

torch.distributed.launch:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
local_rank = args.local_rank

torchrun:

import os
local_rank = int(os.environ["LOCAL_RANK"])
The aforementioned changes suffice to migrate from torch.distributed.launch to torchrun. To take advantage of new features such as elasticity, fault-tolerance, and error reporting of torchrun please refer to:

Train script for more information on authoring training scripts that are torchrun compliant.

the rest of this page for more information on the features of torchrun.

Usage
Single-node multi-worker

>>> torchrun
    --standalone
    --nnodes=1
    --nproc_per_node=$NUM_TRAINERS
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Fault tolerant (fixed sized number of workers, no elasticity):

>>> torchrun
    --nnodes=$NUM_NODES
    --nproc_per_node=$NUM_TRAINERS
    --rdzv_id=$JOB_ID
    --rdzv_backend=c10d
    --rdzv_endpoint=$HOST_NODE_ADDR
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
HOST_NODE_ADDR, in form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and the port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node that has a high bandwidth.
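To make the quoted docs concrete, here is a minimal sketch of a training script that is torchrun compliant in the sense above. The file name dist_demo.py, the gloo backend, and the print statement are my own illustrative choices, not taken from the tutorial:

import os
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE (plus MASTER_ADDR/MASTER_PORT) in the environment
    local_rank = int(os.environ["LOCAL_RANK"])
    # the default env:// init method picks up rank and world size from those variables
    dist.init_process_group(backend="gloo")  # use "nccl" when each worker owns a GPU
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}, local_rank {local_rank}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched, for example, with:

torchrun --standalone --nnodes=1 --nproc_per_node=2 dist_demo.py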

Cross-posted:


2 answers

Stack Overflow user

Answered on 2022-04-01 07:05:30

I think it is only available from pytorch version 1.11 onwards.
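A quick way to check what you actually have (a minimal sketch; the exact minimum version is discussed more precisely in the other answer):

python -c "import torch; print(torch.__version__)"
python -m torch.distributed.run --help   # module-level launcher, present on recent versions even if the torchrun console script is not on PATH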

Votes: 0

Stack Overflow user

Answered on 2022-06-01 14:53:08

To be more precise, it was added in pytorch version 1.10.0, so yes, if you have pytorch installed it should come with it automatically. https://github.com/pytorch/pytorch/releases/tag/v1.10.0

If you are on an older version, you can use

python -m torch.distributed.run

instead of

torchrun

torchrun is just an entry point for it. https://github.com/pytorch/pytorch/pull/64049
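For example, assuming a placeholder script train_script.py and illustrative flag values, the following two launches should be equivalent, since the torchrun console script simply dispatches to torch.distributed.run:

python -m torch.distributed.run --standalone --nnodes=1 --nproc_per_node=2 train_script.py
torchrun --standalone --nnodes=1 --nproc_per_node=2 train_script.py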

Votes: 0
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/70977980
