首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >如何解决著名的“未处理的cuda错误,NCCL版本2.7.8‘错误?

如何解决著名的“未处理的cuda错误,NCCL版本2.7.8‘错误?
EN

Stack Overflow用户
提问于 2021-03-25 20:28:40
回答 4查看 20.2K关注 0票数 14

我见过很多关于这个问题的问题:

代码语言:javascript
运行
复制
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

但似乎没有人能帮我解决这个问题:

在每个脚本的开头,我都尝试手动执行torch.cuda.set_device(device)。这对我来说似乎不管用。我试过不同的GPUS。我已经试过降低pytorch版本和cuda版本。1.6.0、1.7.1、1.8.0和cuda 10.2、11.0、11.1的不同组合。我不知道还能做些什么。为了解决这个问题,人们做了些什么?

也许关系很密切吧?

更完整的错误消息:

代码语言:javascript
运行
复制
('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2

opts.world_size=2

ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
    tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
  File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
    model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

奖金1:

我仍然有错误:

代码语言:javascript
运行
复制
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1423, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1365, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1385, in train
    args.opt = move_opt_to_cherry_opt_and_sync_params(args) if is_running_parallel(args.rank) else args.opt
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/distributed.py", line 456, in move_opt_to_cherry_opt_and_sync_params
    args.opt = cherry.optim.Distributed(args.model.parameters(), opt=args.opt, sync=syn)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 62, in __init__
    self.sync_parameters()
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 78, in sync_parameters
    dist.broadcast(p.data, src=root)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

其中一个答案建议使用nvcca & pytorch.version.cuda来匹配,但它们不匹配:

代码语言:javascript
运行
复制
(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)"

11.1
(meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

我怎么才能匹配他们?

EN

回答 4

Stack Overflow用户

发布于 2021-04-07 18:46:49

这不是一个非常令人满意的答案,但这似乎是最终为我工作的原因。我只使用了py手电1.7.1,它的cuda版本为10.2。只要加载了Cuda11.0,它就能正常工作。要安装该版本,请执行以下操作:

代码语言:javascript
运行
复制
conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge

如果您在HPC中,请执行module avail以确保加载正确的cuda版本。也许您需要为提交作业提供bash和其他东西。我的设置如下:

代码语言:javascript
运行
复制
#!/bin/bash

echo JOB STARTED

# a submission job is usually empty and has the root of the submission so you probably need your HOME env var
export HOME=/home/miranda9
# to have modules work and the conda command work
source /etc/bashrc
source /etc/profile
source /etc/profile.d/modules.sh
source ~/.bashrc
source ~/.bash_profile

conda activate metalearningpy1.7.1c10.2
#conda activate metalearning1.7.1c11.1
#conda activate metalearning11.1

#module load cuda-toolkit/10.2
module load cuda-toolkit/11.1

#nvidia-smi
nvcc --version
#conda list
hostname
echo $PATH
which python

# - run script
python -u ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py

我也响应其他有用的事情,如nvcc版本,以确保加载工作(注意,顶部的nvidia-smi没有显示正确的库达版本)。

注意,我认为这可能只是一个bug,因为在撰写本文时,Cuda11.1+pyors1.8.1是新的。我确实试过

代码语言:javascript
运行
复制
            torch.cuda.set_device(opts.gpu)  # https://github.com/pytorch/pytorch/issues/54550

但是,我不能说它总是有效的,也不能解释为什么它不起作用。我确实在我的当前代码中有它,但是我认为我仍然会在pytoror1.8.x+Cuda11.x中出错。

请参阅我的conda列表,以防有帮助:

代码语言:javascript
运行
复制
$ conda list


# packages in environment at /home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
absl-py                   0.12.0           py38h06a4308_0  
aioconsole                0.3.1                    pypi_0    pypi
aiohttp                   3.7.4            py38h27cfd23_1  
anatome                   0.0.1                    pypi_0    pypi
argcomplete               1.12.2                   pypi_0    pypi
astunparse                1.6.3                    pypi_0    pypi
async-timeout             3.0.1            py38h06a4308_0  
attrs                     20.3.0             pyhd3eb1b0_0  
beautifulsoup4            4.9.3              pyha847dfd_0  
blas                      1.0                         mkl  
blinker                   1.4              py38h06a4308_0  
boto                      2.49.0                   pypi_0    pypi
brotlipy                  0.7.0           py38h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.17.1               h27cfd23_0  
ca-certificates           2021.1.19            h06a4308_1  
cachetools                4.2.1              pyhd3eb1b0_0  
cairo                     1.14.12              h8948797_3  
certifi                   2020.12.5        py38h06a4308_0  
cffi                      1.14.0           py38h2e261b9_0  
chardet                   3.0.4           py38h06a4308_1003  
click                     7.1.2              pyhd3eb1b0_0  
cloudpickle               1.6.0                    pypi_0    pypi
conda                     4.9.2            py38h06a4308_0  
conda-build               3.21.4           py38h06a4308_0  
conda-package-handling    1.7.2            py38h03888b9_0  
coverage                  5.5              py38h27cfd23_2  
crcmod                    1.7                      pypi_0    pypi
cryptography              3.4.7            py38hd23ed53_0  
cudatoolkit               10.2.89              hfd86e86_1  
cycler                    0.10.0                   py38_0  
cython                    0.29.22          py38h2531618_0  
dbus                      1.13.18              hb2f20db_0  
decorator                 5.0.3              pyhd3eb1b0_0  
dgl-cuda10.2              0.6.0post1               py38_0    dglteam
dill                      0.3.3              pyhd3eb1b0_0  
expat                     2.3.0                h2531618_2  
fasteners                 0.16                     pypi_0    pypi
filelock                  3.0.12             pyhd3eb1b0_1  
flatbuffers               1.12                     pypi_0    pypi
fontconfig                2.13.1               h6c09931_0  
freetype                  2.10.4               h7ca028e_0    conda-forge
fribidi                   1.0.10               h7b6447c_0  
future                    0.18.2                   pypi_0    pypi
gast                      0.3.3                    pypi_0    pypi
gcs-oauth2-boto-plugin    2.7                      pypi_0    pypi
glib                      2.63.1               h5a9c865_0  
glob2                     0.7                pyhd3eb1b0_0  
google-apitools           0.5.31                   pypi_0    pypi
google-auth               1.28.0             pyhd3eb1b0_0  
google-auth-oauthlib      0.4.3              pyhd3eb1b0_0  
google-pasta              0.2.0                    pypi_0    pypi
google-reauth             0.1.1                    pypi_0    pypi
graphite2                 1.3.14               h23475e2_0  
graphviz                  2.40.1               h21bd128_2  
grpcio                    1.32.0                   pypi_0    pypi
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               hb453b48_1  
gsutil                    4.60                     pypi_0    pypi
gym                       0.18.0                   pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
harfbuzz                  1.8.8                hffaf4a1_0  
higher                    0.2.1                    pypi_0    pypi
httplib2                  0.19.0                   pypi_0    pypi
icu                       58.2                 he6710b0_3  
idna                      2.10               pyhd3eb1b0_0  
importlib-metadata        3.7.3            py38h06a4308_1  
intel-openmp              2020.2                      254  
jinja2                    2.11.3             pyhd3eb1b0_0  
joblib                    1.0.1              pyhd3eb1b0_0  
jpeg                      9b                   h024ee3a_2  
keras-preprocessing       1.1.2                    pypi_0    pypi
kiwisolver                1.3.1            py38h2531618_0  
lark-parser               0.6.5                    pypi_0    pypi
lcms2                     2.11                 h396b838_0  
ld_impl_linux-64          2.33.1               h53a641e_7  
learn2learn               0.1.5                    pypi_0    pypi
libarchive                3.4.2                h62408e4_0  
libffi                    3.2.1             hf484d3e_1007  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
liblief                   0.10.1               he6710b0_0  
libpng                    1.6.37               h21135ba_2    conda-forge
libprotobuf               3.14.0               h8c45485_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                h2733197_1  
libuuid                   1.0.3                h1bed415_2  
libuv                     1.40.0               h7b6447c_0  
libxcb                    1.14                 h7b6447c_0  
libxml2                   2.9.10               hb55368b_3  
lmdb                      0.94                     pypi_0    pypi
lz4-c                     1.9.2                he1b5a44_3    conda-forge
markdown                  3.3.4            py38h06a4308_0  
markupsafe                1.1.1            py38h7b6447c_0  
matplotlib                3.3.4            py38h06a4308_0  
matplotlib-base           3.3.4            py38h62a2d02_0  
memory-profiler           0.58.0                   pypi_0    pypi
mkl                       2020.2                      256  
mkl-service               2.3.0            py38h1e0a361_2    conda-forge
mkl_fft                   1.3.0            py38h54f3939_0  
mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
mock                      2.0.0                    pypi_0    pypi
monotonic                 1.5                      pypi_0    pypi
multidict                 5.1.0            py38h27cfd23_2  
ncurses                   6.2                  he6710b0_1  
networkx                  2.5                        py_0  
ninja                     1.10.2           py38hff7bd54_0  
numpy                     1.19.2           py38h54aff64_0  
numpy-base                1.19.2           py38hfa32c7d_0  
oauth2client              4.1.3                    pypi_0    pypi
oauthlib                  3.1.0                      py_0  
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1k               h27cfd23_0  
opt-einsum                3.3.0                    pypi_0    pypi
ordered-set               4.0.2                    pypi_0    pypi
pandas                    1.2.3            py38ha9443f7_0  
pango                     1.42.4               h049681c_0  
patchelf                  0.12                 h2531618_1  
pbr                       5.5.1                    pypi_0    pypi
pcre                      8.44                 he6710b0_0  
pexpect                   4.6.0                    pypi_0    pypi
pillow                    7.2.0                    pypi_0    pypi
pip                       21.0.1           py38h06a4308_0  
pixman                    0.40.0               h7b6447c_0  
pkginfo                   1.7.0            py38h06a4308_0  
progressbar2              3.39.3                   pypi_0    pypi
protobuf                  3.14.0           py38h2531618_1  
psutil                    5.8.0            py38h27cfd23_1  
ptyprocess                0.7.0                    pypi_0    pypi
py-lief                   0.10.1           py38h403a769_0  
pyasn1                    0.4.8                      py_0  
pyasn1-modules            0.2.8                      py_0  
pycapnp                   1.0.0                    pypi_0    pypi
pycosat                   0.6.3            py38h7b6447c_1  
pycparser                 2.20                       py_2  
pyglet                    1.5.0                    pypi_0    pypi
pyjwt                     1.7.1                    py38_0  
pyopenssl                 20.0.1             pyhd3eb1b0_1  
pyparsing                 2.4.7              pyhd3eb1b0_0  
pyqt                      5.9.2            py38h05f1152_4  
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.2                hcf32534_0  
python-dateutil           2.8.1              pyhd3eb1b0_0  
python-libarchive-c       2.9                pyhd3eb1b0_0  
python-utils              2.5.6                    pypi_0    pypi
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.7.1           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
pytz                      2021.1             pyhd3eb1b0_0  
pyu2f                     0.1.5                    pypi_0    pypi
pyyaml                    5.4.1            py38h27cfd23_1  
qt                        5.9.7                h5867ecd_1  
readline                  8.1                  h27cfd23_0  
requests                  2.25.1             pyhd3eb1b0_0  
requests-oauthlib         1.3.0                      py_0  
retry-decorator           1.1.1                    pypi_0    pypi
ripgrep                   12.1.1                        0  
rsa                       4.7.2              pyhd3eb1b0_1  
ruamel_yaml               0.15.100         py38h27cfd23_0  
scikit-learn              0.24.1           py38ha9443f7_0  
scipy                     1.6.2            py38h91f5cce_0  
setuptools                52.0.0           py38h06a4308_0  
sexpdata                  0.0.3                    pypi_0    pypi
sip                       4.19.13          py38he6710b0_0  
six                       1.15.0             pyh9f0ad1d_0    conda-forge
soupsieve                 2.2.1              pyhd3eb1b0_0  
sqlite                    3.35.2               hdfb4753_0  
tensorboard               2.4.0              pyhc547734_0  
tensorboard-plugin-wit    1.6.0                      py_0  
tensorflow                2.4.1                    pypi_0    pypi
tensorflow-estimator      2.4.0                    pypi_0    pypi
termcolor                 1.1.0                    pypi_0    pypi
threadpoolctl             2.1.0              pyh5ca1d4c_0  
tk                        8.6.10               hbc83047_0  
torchaudio                0.7.2                      py38    pytorch
torchmeta                 1.7.0                    pypi_0    pypi
torchtext                 0.8.1                      py38    pytorch
torchvision               0.8.2                py38_cu102    pytorch
tornado                   6.1              py38h27cfd23_0  
tqdm                      4.56.0                   pypi_0    pypi
typing-extensions         3.7.4.3                       0  
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.4             pyhd3eb1b0_0  
werkzeug                  1.0.1              pyhd3eb1b0_0  
wheel                     0.36.2             pyhd3eb1b0_0  
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
yaml                      0.2.5                h7b6447c_0  
yarl                      1.6.3            py38h27cfd23_0  
zipp                      3.4.1              pyhd3eb1b0_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.5                h9ceee32_0 

对于a100s来说,这在某种程度上是可行的:

代码语言:javascript
运行
复制
pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
票数 2
EN

Stack Overflow用户

发布于 2022-03-23 03:13:19

我安装了正确的库达,意思是:

代码语言:javascript
运行
复制
python -c "import torch;print(torch.version.cuda)"

#was equal to 

nvcc -V

代码语言:javascript
运行
复制
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//' 

正在发布某种版本的nccl (例如,2.10.3 )

解决方法是删除nccl:

代码语言:javascript
运行
复制
sudo apt remove libnccl2 libnccl-dev

然后libnccl版本检查没有给出任何版本,但是ddp培训工作正常!

票数 2
EN

Stack Overflow用户

发布于 2022-09-21 11:33:15

你应该在https://pytorch.org/get-started/locally/得到答案

对我来说,它是通过设置这个来起作用的:

pip3安装火炬火炬视觉torchaudio -额外索引-url https://download.pytorch.org/whl/cu116

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/66807131

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档