I was following the torchrun tutorial, but it never tells us how to install torchrun. Since I already have torch installed, should it be there automatically? Or what is going on?
Output:
(meta_learning_a100) [miranda9@hal-dgx ~]$ torchrun --nnodes=1 --nproc_per_node=2 ~/ultimate-utils/tutorials_for_myself/my_l2l/dist_maml_l2l_from_seba.py
bash: torchrun: command not found...
I am asking because the official PyTorch page seems to recommend using it: https://pytorch.org/docs/stable/elastic/run.html
For example:
TORCHRUN (ELASTIC LAUNCH)
torchrun provides a superset of the functionality of torch.distributed.launch, with the following additional functionalities:
Worker failures are handled gracefully by restarting all workers.
Worker RANK and WORLD_SIZE are assigned automatically.
Number of nodes is allowed to change between minimum and maximum sizes (elasticity).
Transitioning from torch.distributed.launch to torchrun
torchrun supports the same arguments as torch.distributed.launch except for --use_env which is now deprecated. To migrate from torch.distributed.launch to torchrun follow these steps:
If your training script is already reading local_rank from the LOCAL_RANK environment variable, then you simply need to omit the --use_env flag, e.g.:
torch.distributed.launch:
$ python -m torch.distributed.launch --use_env train_script.py
torchrun:
$ torchrun train_script.py
If your training script reads local rank from a --local_rank cmd argument. Change your training script to read from the LOCAL_RANK environment variable as demonstrated by the following code snippet:
torch.distributed.launch:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
local_rank = args.local_rank

torchrun:
import os
local_rank = int(os.environ["LOCAL_RANK"])
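For context, here is a minimal sketch of what a torchrun-compliant training script can look like once it reads LOCAL_RANK from the environment (the choice of the gloo backend and the print statement are illustrative assumptions, not part of the docs):

import os
import torch.distributed as dist

def main():
    # torchrun exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    # for every worker, so the script only needs to read them.
    local_rank = int(os.environ["LOCAL_RANK"])
    # The default "env://" init method picks up the variables torchrun sets.
    dist.init_process_group(backend="gloo")  # use "nccl" for GPU training
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}, local_rank={local_rank}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()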
The aforementioned changes suffice to migrate from torch.distributed.launch to torchrun. To take advantage of new features such as elasticity, fault-tolerance, and error reporting of torchrun please refer to:
Train script for more information on authoring training scripts that are torchrun compliant.
the rest of this page for more information on the features of torchrun.
Usage
Single-node multi-worker
>>> torchrun
--standalone
--nnodes=1
--nproc_per_node=$NUM_TRAINERS
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Fault tolerant (fixed sized number of workers, no elasticity):
>>> torchrun
--nnodes=$NUM_NODES
--nproc_per_node=$NUM_TRAINERS
--rdzv_id=$JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=$HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
HOST_NODE_ADDR, in form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and the port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node that has a high bandwidth.
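As a quick sanity check (my own sketch, not from the docs), a worker launched by torchrun can simply print the environment it receives; RANK and WORLD_SIZE are the automatically assigned values mentioned above, and MASTER_ADDR/MASTER_PORT come from the rendezvous:

import os
# Variables set per worker by the torchrun launcher / rendezvous.
for key in ("LOCAL_RANK", "RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(key, "=", os.environ.get(key))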
Cross-posted:
Posted on 2022-04-01 07:05:30
I think it is only available after PyTorch version 1.11.
Posted on 2022-06-01 14:53:08
To be more precise, it was added in PyTorch version 1.10.0, so yes, if you have PyTorch installed it should be there automatically. https://github.com/pytorch/pytorch/releases/tag/v1.10.0
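If you are unsure which version your environment actually has, a quick check from Python (if it prints something older than 1.10.0, the torchrun console script will not be on your PATH):

import torch
print(torch.__version__)  # torchrun ships as a console script from 1.10.0 on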
If you are using an older version, you can use
python -m torch.distributed.run
instead of
torchrun
https://stackoverflow.com/questions/70977980