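Preface
Gluten currently supports two backends: ClickHouse and Velox. This walkthrough uses the Gluten + ClickHouse combination.
This stack is closely tied to the server architecture, the CPU instruction set, and the operating system. Common Hygon and Zhaoxin servers are x86; Kunpeng and Phytium servers are ARM; Loongson and Sunway servers use less common home-grown architectures. x86 CPUs use a CISC (complex instruction set) design, while ARM CPUs use RISC (reduced instruction set). The SIMD extensions available for native acceleration also differ: SSE, AVX, and so on for x86; NEON, SVE, and so on for ARM. Run lscpu to see which instruction-set extensions your CPU supports. To see which ones ClickHouse can use on each architecture, check the corresponding cmake file: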
https://github.com/Kyligence/ClickHouse/blob/clickhouse_backend/cmake/cpu_features.cmake
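As a rough sketch of the lscpu check described above, the snippet below reads /proc/cpuinfo and reports which of a handful of SIMD extensions the CPU advertises. The helper name simd_features and the list of extensions it looks for are illustrative choices, not something ClickHouse or Gluten defines (asimd is how Linux reports NEON on arm64).

```shell
#!/usr/bin/env bash
# simd_features is a hypothetical helper: given a space-separated CPU flag
# string, print the subset of well-known SIMD extensions it contains.
simd_features() {
    local flags=" $1 "
    local found=""
    local f
    # x86 names: sse4_2, avx, avx2, avx512f; arm64 names: asimd (NEON), sve
    for f in sse4_2 avx avx2 avx512f asimd sve; do
        case "$flags" in
            *" $f "*) found="$found $f" ;;
        esac
    done
    echo "${found# }"
}

# On Linux, x86 lists extensions under "flags" and arm64 under "Features".
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 -iE '^(flags|features)' /proc/cpuinfo | cut -d: -f2)
    echo "arch: $(uname -m)"
    echo "simd: $(simd_features "$cpu_flags")"
fi
```

Comparing this output with the options in cpu_features.cmake gives a quick idea of which optimizations the build can actually enable on the machine.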
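Finally, build tools shipped with the operating system may be too old and will block the ClickHouse build, which needs, for example, Clang 16.0+, CMake 3.20+, and ninja-build 1.8.2+. Clang is part of the LLVM project, and building LLVM from source in turn depends on GCC 7.3+, Python 3+, and so on.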
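One way to check these minimums before starting is a small version comparison, sketched here. It assumes GNU sort with the -V option is available; version_ge and check_tool are hypothetical helper names, and the sample versions passed in are the ones reported on the openEuler machine used in this article.

```shell
#!/usr/bin/env bash
# True when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Print whether an installed version meets the required minimum.
# Arguments: tool name, minimum version, installed version.
check_tool() {
    local name=$1 min=$2 actual=$3
    if version_ge "$actual" "$min"; then
        echo "$name $actual: ok (>= $min)"
    else
        echo "$name $actual: too old (need >= $min)"
    fi
}

check_tool cmake 3.20 3.22.0
check_tool ninja-build 1.8.2 1.10.2
check_tool gcc 7.3 10.3.1
```

In practice the third argument would come from the tool itself, for example by parsing the output of cmake --version.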
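Environment preparation (default)
Choose an operating system that suits your needs. I use openEuler by default; the version is as follows: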
[root@FelixZh]# cat /etc/os-release
NAME="openEuler"
VERSION="22.03LTS"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler22.03 LTS"
ANSI_COLOR="0;31"
Install the required build tools from the openEuler repositories:
yum install -y cmake ninja-build yasm nasm gcc gcc-c++ ccache
[root@FelixZh mySourceCode]# cmake --version
cmake version 3.22.0
[root@FelixZh mySourceCode]# g++ --version
g++ (GCC) 10.3.1
[root@FelixZh mySourceCode]# ninja-build --version
1.10.2
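Environment preparation (advanced)
If your operating system (CentOS, for example) ships build tools that are too old, such as gcc 4.8, build them from source: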
# cmake
wget -O CMake-3.22.3.tar.gz https://github.com/Kitware/CMake/archive/refs/tags/v3.22.3.tar.gz
tar -xvf CMake-3.22.3.tar.gz
cd CMake-3.22.3/
./configure --prefix=/usr/local/cmake-3.22.3
make -j8
make install
# gcc
wget https://ftp.gnu.org/gnu/gcc/gcc-11.5.0/gcc-11.5.0.tar.gz
tar -zxvf ./gcc-11.5.0.tar.gz
cd gcc-11.5.0
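Run ./contrib/download_prerequisites to download the dependencies.
If the network is unreachable, download them manually from:
https://gcc.gnu.org/pub/gcc/infrastructure/
# the exact version numbers are listed in the script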
cat contrib/download_prerequisites
gmp='gmp-6.1.0.tar.bz2'
mpfr='mpfr-3.1.6.tar.bz2'
mpc='mpc-1.0.3.tar.gz'
isl='isl-0.18.tar.bz2'
Once downloaded, upload the archives to gcc-11.5.0/ and run ./contrib/download_prerequisites again so that it picks up the local copies.
yum install -y lbzip2 gcc gcc-c++ gmp-devel mpfr-devel libmpc-devel isl-devel
./configure --prefix=/usr/local/gcc-11.5.0 --enable-languages=c,c++ --disable-multilib
make -j8 && make install
vim /etc/profile
export GCC_HOME=/usr/local/gcc-11.5.0
export PATH=${GCC_HOME}/bin:$PATH
Building LLVM
This article uses LLVM 19:
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-19.1.4/llvm-project-19.1.4.src.tar.xz
tar -xvf llvm-project-19.1.4.src.tar.xz
cd llvm-project-19.1.4.src
mkdir build
cmake -S llvm -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local/llvm19 -DLLVM_ENABLE_PROJECTS='bolt;clang;clang-tools-extra;compiler-rt;lld;lldb;cross-project-tests;libclc;polly' -DLLVM_ENABLE_RUNTIMES=all
cd build/ && ninja -j8
ninja install
export PATH=/usr/local/llvm19/bin:$PATH
export CC=clang-19
export CXX=clang++
Verify the result, for example by running clang --version and confirming that it reports 19.1.4.
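A slightly fuller check, as a sketch: the loop below prints where each tool resolves on PATH and its first version line. report_tool is a hypothetical helper, and the tool list is only a guess at what matters for the later ClickHouse build.

```shell
#!/usr/bin/env bash
# Print the resolved path and first version line of a tool,
# or a notice when it is not on PATH.
report_tool() {
    local tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool -> $(command -v "$tool")"
        "$tool" --version 2>/dev/null | head -n1
    else
        echo "$tool: not found on PATH"
    fi
}

for t in clang clang++ ld.lld cmake ninja; do
    report_tool "$t"
done
```

If clang resolves to something other than /usr/local/llvm19/bin, the PATH export above has probably not taken effect in the current shell.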
Building Gluten
git clone -b v1.3.0 https://github.com/apache/incubator-gluten.git
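With the ClickHouse backend, run build_clickhouse.sh to compile. The script automatically fetches ClickHouse at a pinned commit from the Kyligence repository; see the clickhouse.version file for the details: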
[root@FelixZh incubator-gluten]# bash ./ep/build-clickhouse/src/build_clickhouse.sh
/home/mySourceCode/incubator-gluten
CH_ORG=Kyligence
CH_BRANCH=rebase_ch/20250107
CH_COMMIT=01d2a08fb01
-- The C compiler identification is Clang 19.1.4
-- The CXX compiler identification is Clang 19.1.4
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /usr/local/llvm19/bin/clang-19
libch.so is written to:
incubator-gluten/cpp-ch/build/utils/extern-local-engine/
Next, build the Java part of the code with mvn:
mvn clean install -Pbackends-clickhouse -Phadoop-3.2 -Pspark-3.3 -Dhadoop.version=3.2.3 -DskipTests -Dcheckstyle.skip -Pdelta
The resulting JAR is written to:
backends-clickhouse/target/gluten-1.3.0-spark-3.3-jar-with-dependencies.jar
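The configuration later in this article points at /opt/libch-1.3.0.so, so the built library has to be copied there at some point. A minimal sketch of that staging step follows. It assumes the source tree lives under /home/mySourceCode/incubator-gluten (as in the build log) and that the Gluten JAR should go into $SPARK_HOME/jars, which is a common choice but not something this article states. stage_file is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Copy one build artifact to its destination, creating the target directory.
# Returns non-zero when the source file does not exist.
stage_file() {
    local src=$1 dest=$2
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" && cp -f "$src" "$dest" && echo "staged $dest"
}

# Assumed location of the Gluten checkout; override with GLUTEN_SRC.
GLUTEN_SRC=${GLUTEN_SRC:-/home/mySourceCode/incubator-gluten}
JAR=gluten-1.3.0-spark-3.3-jar-with-dependencies.jar

if [ -d "$GLUTEN_SRC" ]; then
    stage_file "$GLUTEN_SRC/cpp-ch/build/utils/extern-local-engine/libch.so" \
        /opt/libch-1.3.0.so
    # Assumption: Spark picks the plugin JAR up from $SPARK_HOME/jars.
    if [ -n "$SPARK_HOME" ]; then
        stage_file "$GLUTEN_SRC/backends-clickhouse/target/$JAR" \
            "$SPARK_HOME/jars/$JAR"
    fi
fi
```

On a cluster, the same library path must exist on every node that runs an executor.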
Verification
Configure spark-env.sh:
export LD_PRELOAD="/opt/libch-1.3.0.so"
Configure spark-defaults.conf:
spark.sql.adaptive.enabled false
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.sql.orc.impl native
spark.plugins org.apache.gluten.GlutenPlugin
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 4G
spark.executorEnv.LD_PRELOAD /opt/libch-1.3.0.so
spark.gluten.sql.columnar.libpath /opt/libch-1.3.0.so
spark.gluten.sql.enable.native.validation false
Run a test SQL statement through spark-sql:
select * from test_orc;
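A successful query alone does not show whether it ran natively or fell back to vanilla Spark. One rough way to tell is to inspect the physical plan. The sketch below assumes that Gluten's offloaded operators carry a "Transformer" suffix in the plan text, which may vary between versions; count_native_nodes is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Count plan lines that mention an operator with the "Transformer" suffix.
# Reads the plan text from stdin. The suffix is an assumption about how
# Gluten names its offloaded operators.
count_native_nodes() {
    grep -c 'Transformer'
}

# Only attempt the check when spark-sql is actually installed.
if command -v spark-sql >/dev/null 2>&1; then
    n=$(spark-sql -e "explain select * from test_orc" 2>/dev/null | count_native_nodes)
    echo "plan lines with native operators: $n"
fi
```

A count of zero suggests the plugin did not load or the query fell back entirely, in which case the driver log is the next place to look.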