0490 - How to Build TensorFlow 1.8 and 1.12 with CUDA 9.2 for a GPU Environment

Author: 李继武

1

Purpose of This Document

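CDSW has supported GPUs since version 1.1.0; for details, see Fayson's earlier article 《如何在CDSW中使用GPU运行深度学习》 ("How to Run Deep Learning on GPUs in CDSW"). The latest CDSW GPU support page lists the corresponding NVIDIA driver version, CUDA version, and TensorFlow version.

The CUDA version listed there is 9.2, but the officially released prebuilt TensorFlow packages still target CUDA 9.0. To run TensorFlow on GPUs in a CDSW environment, CUDA 9.2 is required, so TensorFlow has to be built from source by hand. This article builds TensorFlow 1.8 and TensorFlow 1.12 as examples, with CUDA 9.2 and cuDNN 7.2.1.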

2

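Install the packages and environment required for the build

This part is the same for both versions.

1. Add JDK 1.8 to the environment variables

2. Run the following commands to install the dependency packages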

# Install EPEL first: on CentOS/RHEL 7, python-pip is provided by the EPEL repository
yum -y install epel-release
yum -y install numpy
yum -y install python-devel
yum -y install python-pip
yum -y install python-wheel
yum -y install gcc-c++
pip install --upgrade pip enum34
pip install keras --user
pip install mock

If a package is not available during installation, download it from the address below and build a local yum repository:

https://pkgs.org/
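The article does not show how to build the local repository. The following is a minimal sketch, assuming the downloaded RPMs sit in one directory and that `createrepo` has already been run on it. The function name, the repository id `local`, and both paths are illustrative choices, not taken from the original.

```shell
# Write a yum .repo file that points at a local directory of RPMs.
# Both arguments are hypothetical examples chosen by the caller.
write_local_repo() {
    rpm_dir="$1"      # directory holding the downloaded .rpm files
    repo_file="$2"    # for example /etc/yum.repos.d/local.repo
    cat > "$repo_file" <<EOF
[local]
name=Local RPM repository
baseurl=file://$rpm_dir
enabled=1
gpgcheck=0
EOF
}
```

A possible sequence is `createrepo /opt/localrepo`, then `write_local_repo /opt/localrepo /etc/yum.repos.d/local.repo`, then `yum clean all`, after which `yum -y install` can resolve packages from that directory.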

3. Download and install CUDA 9.2

Download the CUDA 9.2 installer from the address below:

https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=RHEL&target_version=7&target_type=runfilelocal

Select the runfile (local) version.

Upload it to the server.

Make the file executable and run it:

chmod +x cuda_9.2.148_396.37_linux.run
./cuda_9.2.148_396.37_linux.run 

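Add CUDA to the environment variables (for example, by appending these lines to /etc/profile, which the next step sources):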

export PATH=/usr/local/cuda-9.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH

Run the following commands; they should print the CUDA version:

source /etc/profile
nvcc -V
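The version check can also be scripted. The sketch below assumes that `nvcc -V` prints a line containing `release 9.2,` (as in "Cuda compilation tools, release 9.2, V9.2.148"); the function names are illustrative.

```shell
# Extract the "release X.Y" number from `nvcc -V` output read on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Succeed only when the reported release matches the expected one ($1).
check_cuda_release() {
    expected="$1"
    actual="$(cuda_release)"
    if [ "$actual" = "$expected" ]; then
        echo "CUDA $actual OK"
    else
        echo "expected CUDA $expected, found '${actual:-none}'" >&2
        return 1
    fi
}
```

Usage: `nvcc -V | check_cuda_release 9.2`. A non-zero exit status suggests that another CUDA toolkit comes first on `PATH`.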

4. Download and install cuDNN v7.2.1

Download cuDNN v7.2.1 from the address below; registration is required before downloading:

https://developer.nvidia.com/rdp/cudnn-archive

Upload the archive to the CUDA installation directory /usr/local/cuda on the server and extract it there:

tar -zxvf cudnn-9.2-linux-x64-v7.2.1.38.tgz

In the same directory, run the following commands to copy cuDNN into the CUDA libraries:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
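To confirm that the copied header really is 7.2.1, the version macros can be read back. This is a sketch assuming the cuDNN 7.x header defines `CUDNN_MAJOR`, `CUDNN_MINOR`, and `CUDNN_PATCHLEVEL` on separate `#define` lines; the function name is illustrative.

```shell
# Print MAJOR.MINOR.PATCH as defined in a cuDNN 7.x header.
# $1 is the header path; it defaults to the location used in this article.
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}
```

Calling `cudnn_version` with no argument reads /usr/local/cuda/include/cudnn.h; the printed value is what should be typed at the cuDNN prompt of `./configure`.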

Enter the lib64 directory and create a symbolic link:

cd /usr/local/cuda/lib64
ln -s stubs/libcuda.so libcuda.so.1 

3

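Install the build tool Bazel

Different TensorFlow versions require different Bazel versions; a Bazel release that is too new sometimes causes build errors.

A. TensorFlow 1.12 uses Bazel 0.19.2:

1. Download bazel-0.19.2: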

wget https://github.com/bazelbuild/bazel/releases/download/0.19.2/bazel-0.19.2-installer-linux-x86_64.sh

2. Make the installer executable and run it:

chmod +x bazel-0.19.2-installer-linux-x86_64.sh
./bazel-0.19.2-installer-linux-x86_64.sh --user

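The --user flag installs Bazel into the $HOME/bin directory and sets the .bazelrc path to $HOME/.bazelrc. Use the --help option to see the other installation options.

The installer reports success when the installation completes.

If you ran the Bazel installer with the --user flag as above, the Bazel executable is installed in $HOME/bin. It is a good idea to add this directory to your default path, as follows: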

export PATH=$HOME/bin:$PATH
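If that export line ends up in a profile script that is sourced repeatedly, `PATH` accumulates duplicates. A small guard avoids this; the helper name `path_prepend` is an illustrative choice, not part of Bazel.

```shell
# Prepend a directory to PATH only if it is not already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}
```

Usage: `path_prepend "$HOME/bin"`, after which `bazel version` should resolve to the freshly installed binary.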

B. TensorFlow 1.8 uses Bazel 0.13.0:

1. Download bazel-0.13.0:

wget https://github.com/bazelbuild/bazel/releases/download/0.13.0/bazel-0.13.0-installer-linux-x86_64.sh

The remaining steps are the same as for installing bazel-0.19.2 above.
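Because the two installs differ only in the version number, the pairing can be captured in a small helper. The sketch below encodes only the two pairings stated in this article and builds the installer URL in the same pattern as the `wget` commands above; the function names are illustrative.

```shell
# Map a TensorFlow release to the Bazel version this article pairs it with.
bazel_version_for() {
    case "$1" in
        1.12*) echo "0.19.2" ;;
        1.8*)  echo "0.13.0" ;;
        *)     echo "no Bazel version recorded for TensorFlow $1" >&2
               return 1 ;;
    esac
}

# Build the installer download URL for the matching Bazel release.
bazel_installer_url() {
    v="$(bazel_version_for "$1")" || return 1
    echo "https://github.com/bazelbuild/bazel/releases/download/$v/bazel-$v-installer-linux-x86_64.sh"
}
```

Usage: `wget "$(bazel_installer_url 1.8)"`. Other TensorFlow releases are deliberately rejected rather than guessed.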

4

Download the TensorFlow source code

A. Download the latest TensorFlow:

git clone --recurse-submodules https://github.com/tensorflow/tensorflow

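This command creates a tensorflow directory under the current directory and downloads the latest TensorFlow source code into it.

At the time of writing, the latest TensorFlow version was 1.12.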
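Since the default branch keeps moving, a plain clone no longer yields 1.12. One way to pin the release is to fetch the tagged source archive, following the same URL pattern that this article uses for v1.8.0. This is a sketch; the helper name is illustrative, and the `vX.Y.Z` tag naming is assumed to hold for the release you need.

```shell
# Build the GitHub archive URL for a tagged TensorFlow release, e.g. 1.12.0.
tf_tarball_url() {
    echo "https://github.com/tensorflow/tensorflow/archive/v$1.tar.gz"
}
```

Usage: `wget "$(tf_tarball_url 1.12.0)"`. With git, `git clone --branch v1.12.0 https://github.com/tensorflow/tensorflow` checks out the same tag.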

B. Download TensorFlow 1.8:

wget https://github.com/tensorflow/tensorflow/archive/v1.8.0.tar.gz

Extract it into the current folder:

tar -zxvf v1.8.0.tar.gz

5

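Configure TensorFlow

The configuration differs slightly between versions.

A. TensorFlow 1.12

Enter the TensorFlow 1.12 source directory, run ./configure, and answer the prompts as follows: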

[root@cdh4 tensorflow]# ./configure 
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
INFO: Invocation ID: cc8b0ee2-5e84-4995-ba12-2c922ee3646b
You have bazel 0.19.2 installed.
Please specify the location of python. [Default is /usr/bin/python]: 

Found possible Python library paths:
  /usr/lib/python2.7/site-packages
  /usr/lib64/python2.7/site-packages
Please input the desired Python library path to use.  Default is [/usr/lib/python2.7/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 9.2

Please specify the location where CUDA 9.2 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.2.1

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Please specify the locally installed NCCL version you want to use. [Default is to use https://github.com/nvidia/nccl]: 

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,7.0]: 

Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 

Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: 

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
    --config=mkl            # Build with MKL support.
    --config=monolithic     # Config for mostly static monolithic build.
    --config=gdr            # Build with GDR support.
    --config=verbs          # Build with libverbs support.
    --config=ngraph         # Build with Intel nGraph support.
    --config=dynamic_kernels    # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
    --config=noaws          # Disable AWS S3 filesystem support.
    --config=nogcp          # Disable GCP support.
    --config=nohdfs         # Disable HDFS support.
    --config=noignite       # Disable Apacha Ignite support.
    --config=nokafka        # Disable Apache Kafka support.
    --config=nonccl         # Disable NVIDIA NCCL support.
Configuration finished
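The same answers can be supplied through environment variables so that `./configure` asks fewer questions. The sketch below mirrors the session above; the variable names are assumed from TensorFlow 1.12's configure.py and should be checked against the source tree before relying on them.

```shell
# Answers matching the interactive 1.12 session above.
# Variable names are assumptions based on TensorFlow 1.12's configure.py.
export PYTHON_BIN_PATH=/usr/bin/python
export PYTHON_LIB_PATH=/usr/lib/python2.7/site-packages
export TF_ENABLE_XLA=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=9.2
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export TF_CUDNN_VERSION=7.2.1
export CUDNN_INSTALL_PATH=/usr/local/cuda
export TF_NEED_TENSORRT=0
export TF_CUDA_COMPUTE_CAPABILITIES=3.5,7.0
export TF_CUDA_CLANG=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_MPI=0
export CC_OPT_FLAGS="-march=native -Wno-sign-compare"
export TF_SET_ANDROID_WORKSPACE=0
```

After sourcing these lines, run `./configure` from the source directory. Any prompt without a matching variable here, such as the NCCL question, may still appear and can be answered by pressing Enter for the default.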

B. TensorFlow 1.8

Enter the TensorFlow 1.8 source directory, run ./configure, and answer the prompts as follows:

[root@cdh2 tensorflow-1.8.0]# ./configure 
WARNING: Running Bazel server needs to be killed, because the startup options are different.
You have bazel 0.13.0 installed.
Please specify the location of python. [Default is /usr/bin/python]: 

Found possible Python library paths:
  /usr/lib/python2.7/site-packages
  /usr/lib64/python2.7/site-packages
Please input the desired Python library path to use.  Default is [/usr/lib/python2.7/site-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.2

Please specify the location where CUDA 9.2 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.2.1

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]: 

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]

Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 

Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
    --config=mkl            # Build with MKL support.
    --config=monolithic     # Config for mostly static monolithic build.
Configuration finished

6

Build TensorFlow

Both versions are built with the command below.

bazel build --config=opt --config=cuda --config=monolithic //tensorflow/tools/pip_package:build_pip_package

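Note: run this command from the TensorFlow source directory.

The build then starts.

The build takes a long time; wait until Bazel reports that it completed successfully.

After the build finishes, run the following command: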

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

When it finishes, the built TensorFlow package can be found in the /tmp/tensorflow_pkg directory.
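The wheel file name depends on the TensorFlow and Python versions, so a script can look it up instead of hard-coding it. A minimal sketch, assuming the wheel name starts with `tensorflow-` as in the 1.8 example of this article; the function name is illustrative.

```shell
# Print the most recently written TensorFlow wheel in a directory.
# $1 defaults to the output directory used in this article.
newest_wheel() {
    dir="${1:-/tmp/tensorflow_pkg}"
    # ls -t sorts newest first; head keeps only the most recent wheel.
    ls -t "$dir"/tensorflow-*.whl 2>/dev/null | head -n 1
}
```

Usage: `sudo pip install "$(newest_wheel)"`. An empty result means that no wheel was produced and the build output should be checked.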

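Note: insufficient disk space or insufficient memory will cause the build to fail. When memory runs out, the build can abort with an error; adding swap space resolves this.

Set up the swap space: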

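# Create the parent directory first; it may not exist on a fresh system
sudo mkdir -p /var/cache/swap
# Allocate a 1 GiB swap file (1024 blocks of 1 MiB)
sudo dd if=/dev/zero of=/var/cache/swap/swap0 bs=1M count=1024
sudo chmod 0600 /var/cache/swap/swap0
sudo mkswap /var/cache/swap/swap0
sudo swapon /var/cache/swap/swap0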

After the build finishes, remove the swap space:

swapoff /var/cache/swap/swap0
rm -rf /var/cache/swap/swap0
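Whether extra swap is needed can be estimated before starting the build. The sketch below sums RAM and existing swap from a meminfo-style file and compares the total with a threshold. The article states no required amount, so the threshold is entirely the reader's choice, and the function names are illustrative.

```shell
# Print MemTotal plus SwapTotal in MiB from a meminfo-style file.
# $1 defaults to /proc/meminfo.
total_mem_mib() {
    awk '$1 == "MemTotal:" || $1 == "SwapTotal:" { kb += $2 }
         END { print int(kb / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed when RAM plus swap is below the threshold.
# $1 is a threshold in MiB chosen by the reader; $2 is an optional file path.
needs_swap() {
    [ "$(total_mem_mib "$2")" -lt "$1" ]
}
```

Usage: `needs_swap 8192 && echo "consider adding swap before building"`, where 8192 is only a placeholder value.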

7

Verification

This section verifies TensorFlow 1.8 as an example:

1. Install the built TensorFlow package:

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.8.0-cp27-none-linux_x86_64.whl

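2. After the installation succeeds, open the Python interactive shell, import tensorflow, and check its version and path.

Note: when testing, do not run import tensorflow from inside the TensorFlow source directory, because Python may pick up the package directory there instead of the installed one.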
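One way to follow this advice in a script is to run the check from a throwaway directory. A minimal sketch; the wrapper name is illustrative, and the Python one-liner in the usage note relies only on `tf.__version__` and `tf.__file__`.

```shell
# Run a command from a scratch directory so that Python cannot pick up
# a ./tensorflow source directory from the current working directory.
run_outside_source_tree() {
    scratch="$(mktemp -d)" || return 1
    ( cd "$scratch" && "$@" )
    status=$?
    rmdir "$scratch"
    return "$status"
}
```

Usage: `run_outside_source_tree python -c 'import tensorflow as tf; print(tf.__version__); print(tf.__file__)'`. The printed path should point into site-packages rather than into the source checkout.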


Originally published on the WeChat official account Hadoop实操 (gh_c4c535955d0f)

Original publication date: 2018-12-23
