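Background
When handling the CUDA forward compatibility feature, libnvidia-container (the library underlying NVIDIA Container Toolkit) mounts the files under the container's /usr/local/cuda/compat directory into the container's lib directories (such as /usr/lib/x86_64-linux-gnu/). This mount operation is vulnerable to a symlink attack, which can cause an arbitrary host directory to be mounted read-only into the container and thereby lead to container escape. For details, see the community description. NVIDIA has released nvidia-container-toolkit 1.17.4 to fix this vulnerability.
GPU nodes created in Tencent Cloud TKE before March 21, 2025 are all potentially exposed to this vulnerability. We recommend upgrading nvidia-container-toolkit on your GPU nodes promptly to protect your workloads.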
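To decide whether a node needs the upgrade, the installed version can be compared with 1.17.4. The following is a minimal sketch, not part of the official tooling: the function names are hypothetical, it treats any version at or above 1.17.4 as patched (an assumption), and it relies on version sort (`sort -V`) being available.

```shell
#!/usr/bin/env bash
# Assumption: any release at or above this version carries the fix.
FIXED_VERSION="1.17.4"

# version_ge A B: succeed when version A >= version B.
# Strips a package revision such as "-1" and compares with sort -V.
version_ge() {
    local a="${1%%-*}" b="${2%%-*}"
    [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$b" ]
}

# installed_toolkit_version: print the installed nvidia-container-toolkit
# version, or nothing when the package is not installed.
installed_toolkit_version() {
    local out
    if command -v dpkg-query >/dev/null 2>&1; then
        dpkg-query -W -f='${db:Status-Abbrev} ${Version}\n' nvidia-container-toolkit 2>/dev/null |
            awk '$1 ~ /^ii/ { print $2 }'
    elif command -v rpm >/dev/null 2>&1; then
        out="$(rpm -q --queryformat '%{VERSION}' nvidia-container-toolkit 2>/dev/null)" &&
            printf '%s\n' "$out"
    fi
}

# classify_version VERSION: print not-installed, patched, or needs-upgrade.
classify_version() {
    if [ -z "$1" ]; then
        echo "not-installed"
    elif version_ge "$1" "$FIXED_VERSION"; then
        echo "patched"
    else
        echo "needs-upgrade"
    fi
}

echo "nvidia-container-toolkit: $(classify_version "$(installed_toolkit_version)")"
```

A node that reports needs-upgrade should be upgraded as described in this document.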
Remediation Guide
1. Download the nvidia-container-toolkit upgrade tool.
wget https://blake-gz-1251707795.cos.ap-guangzhou.myqcloud.com/nv-runtime-upgrade-v2.tar.gz
2. Extract the tool archive.
tar -zxvf nv-runtime-upgrade-v2.tar.gz
3. Run the upgrade script.
Ubuntu images:
root@VM-4-8-ubuntu:~/nv-runtime-upgrade# ./upgrade-nv-runtime.sh
2025-02-21/16:05:11 INFO Need to upgrade nvidia-container-toolkit(1.14.5-1) to 1.17.4
(Reading database ... 137641 files and directories currently installed.)
Removing libnvidia-container-tools (1.14.5-1) ...
Removing libnvidia-container1:amd64 (1.14.5-1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.16) ...
2025-02-21/16:05:32 INFO start to install nvidia container toolkit
2025-02-21/16:05:33 INFO succeed to upgrade nvidia-container-toolkit
CentOS/TencentOS images:
[root@VM-5-10-tlinux nv-runtime-upgrade]# ./upgrade-nv-runtime.sh
2025-02-21/16:02:44 INFO Need to upgrade nvidia-container-toolkit(1.14.5) to 1.17.4
2025-02-21/16:02:44 INFO This node has been installed qgpu
2025-02-21/16:02:44 INFO current qgpu version is 2.2.0
2025-02-21/16:02:44 INFO backup up qgpu tools
2025-02-21/16:02:44 INFO success backup qgpu
No packages marked for removal.
2025-02-21/16:02:47 INFO start to install nvidia container toolkit
2025-02-21/16:02:48 INFO succeed to upgrade nvidia-container-toolkit
2025-02-21/16:02:48 INFO recover qgpu
2025-02-21/16:02:48 INFO success to recover qgpu
2025-02-21/16:02:48 INFO succeed to upgrade nvidia container toolkit
4. Verify that the upgrade succeeded.
Ubuntu images:
root@VM-4-8-ubuntu:~/nv-runtime-upgrade# dpkg -l |grep nvidia
ii  libnvidia-container-tools      1.17.4-1  amd64  NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64     1.17.4-1  amd64  NVIDIA container runtime library
ii  nvidia-container-runtime       3.14.0-1  all    NVIDIA Container Toolkit meta-package
ii  nvidia-container-toolkit       1.17.4-1  amd64  NVIDIA Container toolkit
ii  nvidia-container-toolkit-base  1.17.4-1  amd64  NVIDIA Container Toolkit Base
ii  nvidia-docker2                 2.14.0-1  all    NVIDIA Container Toolkit meta-package
CentOS/TencentOS images:
[root@VM-5-10-tlinux ~]# rpm -qa|grep nvidia
libnvidia-container1-1.17.4-1.x86_64
nvidia-container-toolkit-1.17.4-1.x86_64
nvidia-container-toolkit-base-1.17.4-1.x86_64
libnvidia-container-tools-1.17.4-1.x86_64
nvidia-docker2-2.14.0-1.noarch
nvidia-container-runtime-3.14.0-1.noarch
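The manual check above can also be scripted. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the four packages shown at 1.17.4-1 in the sample output are the ones that matter and that they should report exactly version 1.17.4. It reads `dpkg -l` or `rpm -qa` output from standard input.

```shell
#!/usr/bin/env bash
# Packages that appear at version 1.17.4-1 in the sample output.
EXPECTED_PKGS="libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit nvidia-container-toolkit-base"
# Assumption: an exact match is required; adjust if a newer release is installed.
EXPECTED_VERSION="1.17.4"

# normalize_pkg_list: read dpkg -l or rpm -qa output on stdin and
# print one "name version" pair per line, without the package revision.
normalize_pkg_list() {
    awk '
        $1 == "ii" {                            # dpkg -l row: ii name[:arch] version ...
            name = $2; sub(/:.*/, "", name)
            ver = $3;  sub(/-.*/, "", ver)
            print name, ver
            next
        }
        match($0, /-[0-9][^-]*-[^-]*$/) {       # rpm row: name-version-release.arch
            name = substr($0, 1, RSTART - 1)
            ver = substr($0, RSTART + 1)
            sub(/-.*/, "", ver)
            print name, ver
        }
    '
}

# verify_upgrade: report each expected package; return 1 on any mismatch.
verify_upgrade() {
    local list pkg ver status=0
    list="$(normalize_pkg_list)"
    for pkg in $EXPECTED_PKGS; do
        ver="$(printf '%s\n' "$list" | awk -v p="$pkg" '$1 == p { print $2; exit }')"
        if [ "$ver" = "$EXPECTED_VERSION" ]; then
            echo "OK       $pkg $ver"
        else
            echo "MISMATCH $pkg ${ver:-not installed}"
            status=1
        fi
    done
    return $status
}
```

After sourcing these functions, run `dpkg -l | verify_upgrade` on Ubuntu images or `rpm -qa | verify_upgrade` on CentOS/TencentOS images; a non-zero exit status indicates that at least one package is not at the expected version.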
Rollback Guide
To roll back the changes above, use the following commands as a reference:
if which yum &>/dev/null; then
    yum remove -y nvidia-container-runtime >/dev/null || true
    yum remove -y nvidia-container-toolkit >/dev/null || true
    yum remove -y nvidia-container-toolkit-base >/dev/null || true
    # Uninstall libnvidia
    yum remove -y libnvidia-container1 >/dev/null || true
elif which dpkg &>/dev/null; then
    apt-get --purge remove -y nvidia-container-runtime >/dev/null || true
    apt-get --purge remove -y nvidia-container-toolkit >/dev/null || true
    # Older versions may not include this package; that is harmless
    apt-get --purge remove -y nvidia-container-toolkit-base >/dev/null || true
    # Remove libnvidia
    apt-get --purge remove -y libnvidia-container1 >/dev/null || true
fi
/usr/local/qcloud/gpu/install.sh
if [ -f /proc/qgpu/version ]; then
    echo "recover qgpu"
    if [ -d /usr/bin/qgpu_backup ]; then
        mv /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-container-toolkit.backup
        mv /usr/bin/nvidia-container-runtime-hook /usr/bin/nvidia-container-runtime-hook.backup
        cp /usr/bin/qgpu_backup/* /usr/bin/
        echo "success to recover qgpu"
    else
        echo "failed to find qgpu backup dir, please restart qgpu-manager to recover qgpu"
    fi
fi