I am using PyTorch for distributed training of my model.
distributed/distributed_c10d.py", line 1489, in barrier
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL ver
On a server with a single Titan X, cuDNN 7 and CUDA 9 are installed but NCCL is not, so I downloaded NCCL 2 from NVIDIA and unpacked it to path/to/local/nccl2, then edited line 42 of ./pytorch/conda/integrated/build.sh to read: "export NCCL_ROOT_DIR=path/to/local/nccl2".
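Before rebuilding, it can help to confirm that the directory NCCL_ROOT_DIR points to really looks like an unpacked NCCL tree. The sketch below is only a sanity check under one assumption: that the archive unpacks to an include/nccl.h header and a lib/libnccl.so* library. The helper name looks_like_nccl_root is hypothetical and not part of PyTorch or NCCL.

```python
import os
from pathlib import Path


def looks_like_nccl_root(root):
    """Return True if `root` has include/nccl.h and a lib/libnccl.so* file.

    The expected layout is an assumption about the NCCL archive, not
    something the build script itself guarantees.
    """
    root = Path(root)
    has_header = (root / "include" / "nccl.h").is_file()
    lib_dir = root / "lib"
    has_library = lib_dir.is_dir() and any(lib_dir.glob("libnccl.so*"))
    return has_header and has_library


if __name__ == "__main__":
    # Read the same variable that the edited build.sh exports.
    nccl_root = os.environ.get("NCCL_ROOT_DIR")
    if nccl_root is None:
        print("NCCL_ROOT_DIR is not set")
    else:
        print(nccl_root, "->", looks_like_nccl_root(nccl_root))
```

If the check fails, the path is probably one level too high or too low, for example pointing at the parent of the unpacked folder.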
When running distributed training on 4 A6000 GPUs, I get the following error:
[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out.
I am using the standard
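A common first step with NCCL failures like the ones quoted above is to turn on NCCL's own logging and to make the collective timeout explicit. The following is a minimal sketch, not a fix: it assumes the job is started by a launcher that sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, and the helper names nccl_debug_settings and init_distributed, as well as the 30-minute default, are hypothetical.

```python
import os
from datetime import timedelta


def nccl_debug_settings(timeout_minutes=30):
    """Build environment variables and init_process_group keyword arguments."""
    # NCCL_DEBUG=INFO makes NCCL print its initialization and error details.
    env = {"NCCL_DEBUG": "INFO"}
    init_kwargs = {
        "backend": "nccl",
        "timeout": timedelta(minutes=timeout_minutes),
    }
    return env, init_kwargs


def init_distributed(timeout_minutes=30):
    """Apply the settings, then initialize the default process group."""
    # Imported lazily so the helper above can be used without PyTorch.
    import torch.distributed as dist

    env, init_kwargs = nccl_debug_settings(timeout_minutes)
    for key, value in env.items():
        # Keep any value the user already exported.
        os.environ.setdefault(key, value)
    dist.init_process_group(**init_kwargs)
```

The extra log output usually shows which rank or which transport failed, which narrows down whether the cause is networking, driver versions, or a rank that never reached the collective.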
I tried to run my PyTorch code, but got the following error:
A40 with CUDA capability sm_86 is not compatible with the current PyTorch
If you want to use the A40 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started
distributed/dist
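The sm_86 message above means the installed PyTorch binary was not compiled for the GPU's compute capability. The sketch below compares a device capability with a list of compiled architectures. It uses a simplified rule as an assumption (same major version, compiled minor version not higher than the device's), which is not necessarily the exact check PyTorch performs; the helper names are hypothetical, and the architecture lists in the usage note are illustrative rather than taken from the error message.

```python
import re


def parse_sm(arch):
    """Turn 'sm_86' into (8, 6); return None for entries like 'compute_80'."""
    match = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", arch)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def device_is_covered(capability, arch_list):
    """Simplified check: same major, compiled minor <= device minor."""
    major, minor = capability
    for arch in arch_list:
        sm = parse_sm(arch)
        if sm is not None and sm[0] == major and sm[1] <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print(capability, arch_list, device_is_covered(capability, arch_list))
```

For example, device_is_covered((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"]) is False, while a build that includes "sm_80" or "sm_86" passes. When the check fails, the usual remedy is to install a PyTorch build made for a newer CUDA version.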