Installing the NCCL Communication Plugin for Intelligent High-Performance Network (IHN)

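Introduction to the NCCL communication plugin
The NCCL plugin is a communication acceleration capability of Tencent Cloud's Intelligent High-Performance Network (IHN) product. Built on the IHN network architecture, it delivers more efficient network communication for large AI model training, and it provides intelligent operations capabilities that detect network faults quickly and recover from them automatically. Its main features are:
Dual-port load optimization, which pushes bonding devices to their communication performance limit.
Global Hash Routing, which balances load and avoids congestion.
Topology-affinity traffic scheduling, which minimizes traffic detours.
Full-dimension network load monitoring, which minimizes the time needed to detect performance changes.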
Prerequisites
NCCL must already be installed. No other preparation is required.
Installation steps
1. Install the NCCL plugin.
Taking Ubuntu 20.04 as an example, you can install the plugin with the following commands.
# Uninstall the existing tccl and nccl plugins
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Download and install version 1.2 of the nccl plugin
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.2_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.2_amd64.deb

# Make sure every node in the cluster uses the same nccl plugin version.
# The command below installs version 1.0; version 1.2 is recommended for its better stability.
# wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.0_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.0_amd64.deb && rm -f nccl-rdma-sharp-plugins_1.0_amd64.deb
If you use CentOS or TencentOS, follow the steps below to install the plugin.
# Uninstall the existing nccl plugin
rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64

# Download and install version 1.2 of the nccl plugin
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm

# Make sure every node in the cluster uses the same nccl plugin version.
# The command below installs version 1.0; version 1.2 is recommended for its better stability.
# wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rm -f nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm
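Because every node in the cluster must run the same plugin version, it can help to compare the installed versions before starting a job. The following is a minimal sketch of such a check, not part of the plugin itself. The function name check_plugin_versions and its input format (one "host version" pair per line on standard input) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "host version" pairs on stdin and reports
# whether all hosts report the same plugin version.
# Exit status: 0 if exactly one distinct version was seen, 1 otherwise.
check_plugin_versions() {
    awk '
        NF < 2 { next }                        # skip blank or malformed lines
        { count[$2]++; hosts[$2] = hosts[$2] " " $1 }
        END {
            n = 0
            for (v in count) n++               # number of distinct versions
            if (n == 0) { print "no versions reported"; exit 1 }
            if (n == 1) { for (v in count) print "consistent: " v; exit 0 }
            for (v in count) print "mismatch: " v " on" hosts[v]
            exit 1
        }
    '
}
```

One way to produce the input, assuming passwordless SSH to each node, is to print the host name followed by the output of dpkg-query -W -f='${Version}' nccl-rdma-sharp-plugins on Ubuntu, or rpm -q --qf '%{VERSION}' nccl-rdma-sharp-plugins on CentOS and TencentOS, and pipe the collected lines into check_plugin_versions.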
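2. Obtain the topology-sorted IP list.
Without any dependency file, the NCCL plugin already provides two optimizations: dynamic aggregation of bonding ports and global hash routing. To additionally enable network-topology affinity awareness, supply a sorted IP list. You can sort the IPs as follows:
Log in to a machine and prepare a host.txt file that contains the IDs of the GPU instances to be sorted, one instance per line.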
ins-aaa
ins-ccc
ins-bbb
ins-ddd
Run the sorting job.
wget http://mirrors.tencentyun.com/install/ihn/ihn_rdma_ip_order.sh && 
chmod a+x ihn_rdma_ip_order.sh && ./ihn_rdma_ip_order.sh host.txt
Check the generated file hostfile.txt, which lists the sorted instance IDs together with their management IPs.
ins-aaa:10.0.1.2
ins-bbb:10.0.1.10
ins-ccc:10.0.1.12
ins-ddd:10.0.1.20
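Before using the sorted list, you may want to confirm that every instance you submitted appears in it. The sketch below assumes hostfile.txt uses the "instance-id:ip" format shown above; the function name missing_instances is hypothetical, and how the sorting script treats instances it cannot resolve is not specified here, so the check only reports IDs that are absent from the output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every instance ID that is listed in the input
# file ($1, e.g. host.txt) but absent from the sorted file ($2, e.g.
# hostfile.txt, one "instance-id:ip" entry per line).
missing_instances() {
    local requested="$1" sorted="$2"
    awk -F: '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: sorted output
        $1 != "" && !($1 in seen) { print $1 }       # second file: requested IDs
    ' "$sorted" "$requested"
}
```

Running missing_instances host.txt hostfile.txt prints nothing when all requested instances were sorted.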
3. Configure the NCCL environment variables.
# Reference configuration
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0
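The NCCL_IB_HCA value in the reference configuration lists eight bonded devices by name. If you are unsure which bonded devices exist on a given machine, the list can be derived from the RDMA device names that Linux exposes under /sys/class/infiniband. The sketch below assumes the devices follow the mlx5_bond_N naming used above; the function name build_hca_list is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a comma-separated NCCL_IB_HCA value from the
# mlx5_bond_* devices found in a directory (default: /sys/class/infiniband).
# Returns 1 and prints nothing if no matching device exists.
build_hca_list() {
    local dir="${1:-/sys/class/infiniband}"
    local names=() path
    for path in "$dir"/mlx5_bond_*; do
        [ -e "$path" ] || continue          # the glob matched nothing
        names+=("${path##*/}")              # keep only the device name
    done
    [ "${#names[@]}" -gt 0 ] || return 1
    # Sort by the numeric suffix so that mlx5_bond_10 follows mlx5_bond_9.
    printf '%s\n' "${names[@]}" | sort -t_ -k3,3n | paste -sd, -
}
```

A possible use is export NCCL_IB_HCA="$(build_hca_list)", after checking that the printed list matches the devices you intend NCCL to use.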
4. Modify the launch script.
If you start the training processes with the deepspeed launcher, write the sorted IPs into the hostfile in the same order, then start the training processes.
root@vm-3-17-centos:/workspace/ptm/gpt# cat hostfile
10.0.1.2 slots=8
10.0.1.10 slots=8
10.0.1.12 slots=8
10.0.1.20 slots=8

deepspeed --hostfile ./hostfile --master_addr 10.0.1.2 train.py
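The deepspeed hostfile can be generated from the sorted list rather than typed by hand, which keeps the topology order intact. This is a minimal sketch that assumes the "instance-id:ip" format of hostfile.txt shown in step 2; the function name make_deepspeed_hostfile and the default of 8 slots per node are assumptions taken from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a sorted "instance-id:ip" file ($1) into
# deepspeed hostfile lines ("<ip> slots=<n>"), preserving the order.
# $2 is the number of GPUs per node (default: 8).
make_deepspeed_hostfile() {
    local sorted="$1" slots="${2:-8}"
    awk -F: -v slots="$slots" 'NF >= 2 { print $2 " slots=" slots }' "$sorted"
}
```

For example, make_deepspeed_hostfile hostfile.txt > hostfile writes the four lines shown above.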
If you start the training processes with torchrun, use --node_rank to give each node its position in the sorted order. All nodes must pass the same --master_addr, which here is the first node, 10.0.1.2.
# on 10.0.1.2
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=0 --master_addr=10.0.1.2 train.py ...
# on 10.0.1.10
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=1 --master_addr=10.0.1.2 train.py ...
# on 10.0.1.12
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=2 --master_addr=10.0.1.2 train.py ...
# on 10.0.1.20
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=3 --master_addr=10.0.1.2 train.py ...
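The per-node torchrun commands follow mechanically from the sorted list: the node rank is the zero-based position of the node in the list, and torchrun's static rendezvous needs every node to point at the same master address. The sketch below prints one command per node under those rules. It assumes the "instance-id:ip" format of hostfile.txt, uses the first IP as the master address, and the function name print_torchrun_cmds is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the torchrun command for each node listed in a
# sorted "instance-id:ip" file ($1). $2 is the GPU count per node (default: 8).
# node_rank is the zero-based line position; master_addr is the first IP.
print_torchrun_cmds() {
    local sorted="$1" gpus="${2:-8}"
    awk -F: -v gpus="$gpus" '
        NF >= 2 { ip[n++] = $2 }
        END {
            for (i = 0; i < n; i++) {
                printf "# on %s\n", ip[i]
                printf "torchrun --nnodes=%d --nproc_per_node=%d --node_rank=%d --master_addr=%s train.py\n", n, gpus, i, ip[0]
            }
        }
    ' "$sorted"
}
```

Append your own training arguments after train.py before running each printed command on its node.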
If you start the training processes with mpirun, list the IPs in the sorted order.
mpirun \
-np 32 \
-H 10.0.1.2:8,10.0.1.10:8,10.0.1.12:8,10.0.1.20:8 \
--allow-run-as-root \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_IB_TIMEOUT=22 \
-x NCCL_PXN_DISABLE=0 \
-x LD_LIBRARY_PATH -x PATH \
-mca coll_hcoll_enable 0 \
-mca pml ob1 \
-mca btl_tcp_if_include eth0 \
-mca btl ^openib \
all_reduce_perf -b 1G -e 1G -n 1000 -g 1
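The -H and -np arguments can likewise be derived from the sorted list, so that the host order matches the topology order and the process count stays in sync with the node count. This is a sketch under the same assumptions as before: hostfile.txt contains "instance-id:ip" lines, each node contributes 8 slots by default, and the function names mpirun_hosts and mpirun_np are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a sorted "instance-id:ip" file ($1).
# $2 is the number of slots (GPUs) per node (default: 8).

# Print the comma-separated "<ip>:<slots>" list for mpirun -H.
mpirun_hosts() {
    local sorted="$1" slots="${2:-8}"
    awk -F: -v slots="$slots" '
        NF >= 2 { printf "%s%s:%s", sep, $2, slots; sep = "," }
        END { if (sep != "") print "" }     # finish the line if anything was printed
    ' "$sorted"
}

# Print the total process count for mpirun -np (nodes * slots).
mpirun_np() {
    local sorted="$1" slots="${2:-8}"
    awk -F: -v slots="$slots" 'NF >= 2 { n++ } END { print n * slots }' "$sorted"
}
```

With these, the command above could start with mpirun -np "$(mpirun_np hostfile.txt)" -H "$(mpirun_hosts hostfile.txt)", followed by the remaining options unchanged.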