Base Environment
The current base runtime environment for Spark on DLC is as follows:
OS = Debian 11 (bullseye)
Python = 3.9.2
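Packages with compiled extensions are built for a specific interpreter version, so it can be useful to confirm that the Python used to build dependencies matches this runtime. The following is a minimal sketch, not part of DLC; `RUNTIME_PYTHON` and `matches_runtime` are illustrative names, and only the major and minor version are compared.

```python
import sys
from typing import Sequence

# Major and minor version of the runtime stated above (Python 3.9.2).
RUNTIME_PYTHON = (3, 9)


def matches_runtime(version: Sequence[int] = tuple(sys.version_info[:3])) -> bool:
    """Return True when the given version has the same major.minor as the runtime."""
    return tuple(version[:2]) == RUNTIME_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {current} matches the DLC runtime: {matches_runtime()}")
```

Running the script inside the build container reports whether that interpreter is a 3.9.x release.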
Base Images
DLC provides the following PySpark images. Choose the one that fits your needs:
spark:3.2.1-python
spark:3.2.1-python-ml
spark:3.2.1-python-ai
spark:3.2.1-python
This image provides the basic runtime environment. Its dependencies are as follows:
Package            Version
------------------ ---------
certifi            2022.6.15
charset-normalizer 2.1.0
greenlet           1.1.2
idna               3.3
numpy              1.23.0
pandas             1.4.3
pip                22.1.2
psycopg2-binary    2.9.3
pyarrow            8.0.0
PyMySQL            1.0.2
python-dateutil    2.8.2
pytz               2022.1
requests           2.28.1
setuptools         63.1.0
six                1.16.0
SQLAlchemy         1.4.39
urllib3            1.26.9
wheel              0.34.2
spark:3.2.1-python-ml
This image provides a runtime environment for lightweight machine learning workloads. Its dependencies are as follows:
Package            Version
------------------ ---------
certifi            2022.6.15
charset-normalizer 2.1.0
greenlet           1.1.2
idna               3.3
joblib             1.1.0
networkx           2.8.4
numpy              1.23.0
packaging          21.3
pandas             1.4.3
patsy              0.5.2
pip                22.1.2
psycopg2-binary    2.9.3
pyarrow            8.0.0
PyMySQL            1.0.2
pyparsing          3.0.9
python-dateutil    2.8.2
pytz               2022.1
requests           2.28.1
scikit-learn       1.1.1
scipy              1.8.1
setuptools         63.1.0
six                1.16.0
SQLAlchemy         1.4.39
statsmodels        0.13.2
threadpoolctl      3.1.0
urllib3            1.26.9
wheel              0.34.2
spark:3.2.1-python-ai
This image provides a runtime environment for artificial intelligence workloads. Its dependencies are as follows:
Package                      Version
---------------------------- ---------
absl-py                      1.1.0
astunparse                   1.6.3
cachetools                   5.2.0
certifi                      2022.6.15
charset-normalizer           2.0.12
flatbuffers                  1.12
gast                         0.4.0
google-auth                  2.8.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.47.0
h5py                         3.7.0
idna                         3.3
importlib-metadata           4.11.4
joblib                       1.1.0
keras                        2.9.0
Keras-Preprocessing          1.1.2
libclang                     14.0.1
Markdown                     3.3.7
networkx                     2.8.4
numpy                        1.23.0
oauthlib                     3.2.0
opencv-python                4.6.0.66
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.4.3
Pillow                       9.1.1
pip                          22.1.2
protobuf                     3.19.4
pyarrow                      8.0.0
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pyparsing                    3.0.9
python-dateutil              2.8.2
pytz                         2022.1
requests                     2.28.0
requests-oauthlib            1.3.1
rsa                          4.8
scikit-learn                 1.1.1
scipy                        1.8.1
setuptools                   62.6.0
six                          1.16.0
tensorboard                  2.9.1
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.9.1
tensorflow-estimator         2.9.0
tensorflow-io-gcs-filesystem 0.26.0
termcolor                    1.1.0
threadpoolctl                3.1.0
torch                        1.11.0
torchvision                  0.12.0
typing_extensions            4.2.0
urllib3                      1.26.9
Werkzeug                     2.1.2
wheel                        0.34.2
wrapt                        1.14.1
zipp                         3.8.0
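When choosing among these images, it can help to check whether one already covers a job's requirements. The following is a minimal sketch, not a DLC tool: the `SPARK_PYTHON_IMAGE` dictionary is copied from the spark:3.2.1-python listing, while the names `normalize` and `unmet_requirements`, the example requirements, and the exact-pin comparison are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Versions copied from the spark:3.2.1-python package listing.
SPARK_PYTHON_IMAGE: Dict[str, str] = {
    "certifi": "2022.6.15",
    "charset-normalizer": "2.1.0",
    "greenlet": "1.1.2",
    "idna": "3.3",
    "numpy": "1.23.0",
    "pandas": "1.4.3",
    "pip": "22.1.2",
    "psycopg2-binary": "2.9.3",
    "pyarrow": "8.0.0",
    "PyMySQL": "1.0.2",
    "python-dateutil": "2.8.2",
    "pytz": "2022.1",
    "requests": "2.28.1",
    "setuptools": "63.1.0",
    "six": "1.16.0",
    "SQLAlchemy": "1.4.39",
    "urllib3": "1.26.9",
    "wheel": "0.34.2",
}


def normalize(name: str) -> str:
    """Lowercase a package name and collapse '-', '_' and '.' runs (PEP 503 style)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unmet_requirements(
    required: Dict[str, Optional[str]], provided: Dict[str, str]
) -> Dict[str, str]:
    """Return {package: reason} for every requirement the image does not satisfy.

    A required version of None means "any version is acceptable".
    Only exact version pins are compared; range specifiers are out of scope.
    """
    installed = {normalize(name): version for name, version in provided.items()}
    problems: Dict[str, str] = {}
    for name, wanted in required.items():
        found = installed.get(normalize(name))
        if found is None:
            problems[name] = "not installed"
        elif wanted is not None and wanted != found:
            problems[name] = f"image has {found}, need {wanted}"
    return problems


if __name__ == "__main__":
    # Hypothetical requirements for an example job.
    needs = {"numpy": "1.23.0", "pymysql": None, "scikit-learn": None}
    print(unmet_requirements(needs, SPARK_PYTHON_IMAGE))
```

An empty result suggests the image is sufficient as it is. Any entries in the result point to packages that would need another image or a custom environment.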
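Virtual Environment
If the default images do not meet your application's needs, you can package your dependencies as a virtual environment. We recommend installing and packaging the dependencies on a Debian-based operating system with Python 3.9.x, as follows: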
#> docker run -it -v {YOUR-WORKING-DIR}:/data --rm python:3.9-slim /bin/bash
root@000000> cd /data
root@000000> python3 -m venv pyspark-venv
root@000000> source pyspark-venv/bin/activate
root@000000 (pyspark-venv)> pip3 install -i https://mirrors.tencent.com/pypi/simple/ {YOUR-DEPENDENCIES}
root@000000 (pyspark-venv)> deactivate
root@000000> tar czvf pyspark-venv.tar.gz pyspark-venv # package the virtual environment
root@000000> exit # exit docker
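After packaging, a quick sanity check of the archive can catch a tar command run from the wrong directory. The following is a minimal sketch under stated assumptions: `EXPECTED_MEMBERS`, `missing_members`, and `check_archive` are illustrative names, and the expected entries are the ones `python3 -m venv` normally creates on Linux, not a list that DLC specifies.

```python
import tarfile
from typing import Iterable, List

# Entries that `python3 -m venv` creates on Linux; used here as a sanity check.
EXPECTED_MEMBERS = ("pyvenv.cfg", "bin/activate", "bin/python")


def missing_members(names: Iterable[str], venv_dir: str = "pyspark-venv") -> List[str]:
    """Return the expected venv entries that are absent from the archive names."""
    # Archives created with `tar ... ./dir` prefix every name with "./".
    present = {name[2:] if name.startswith("./") else name for name in names}
    return [
        member
        for member in EXPECTED_MEMBERS
        if f"{venv_dir}/{member}" not in present
    ]


def check_archive(path: str, venv_dir: str = "pyspark-venv") -> List[str]:
    """Open a gzip-compressed tar archive and report missing venv entries."""
    with tarfile.open(path, "r:gz") as archive:
        return missing_members(archive.getnames(), venv_dir)
```

Calling `check_archive` with the path of the archive produced by the tar step returns an empty list when all three entries are present under the `pyspark-venv` directory.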