
A Spark on Kubernetes Demo on Mac

1 Overview

Honestly, Spark has natively supported K8S since 2.3, but when I followed the Spark 2.4 docs I couldn't get it running at first, and the K8S Dashboard had a pile of problems of its own. Maybe I'm just not good enough at this; it was a headache.

Then I went back and read the official guide more carefully, and found...

2 Start

2.1 Deploy a local K8S cluster

To enjoy running Spark on K8S, you first need a K8S cluster. If you don't have one, no problem: we can install one locally.

I'm using a Mac with the following configuration.

ProductName:	Mac OS X
ProductVersion:	10.12.6
BuildVersion:	16G1114

If you read the official prerequisites carefully, you'll find some conditions that may not be satisfied; for example, Minikube's default resources are not enough to run a Spark app. Locally I used the K8S cluster bundled with Docker Edge; if you want to try it, download it and enable Kubernetes in the settings. One thing to note: allocate more resources than the defaults, otherwise Spark will start up and then just sit there waiting for resources.
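If you go the Minikube route instead of Docker's built-in cluster, a minimal sketch of starting it with more generous resources (the flag names are standard Minikube options; the exact numbers are an assumption, so tune them to your machine):

```shell
# Start Minikube with enough CPU and memory for a Spark
# driver pod plus at least one executor pod
minikube start --cpus 4 --memory 8192
```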

The other thing to watch out for is the example jar. Note this sentence from the official docs:

This URI is the location of the example jar that is already in the Docker image.

Pay attention: the jar in this configuration is a jar file already packaged into the Docker image, not a file on your local machine!
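The Spark distribution ships a helper script for building exactly such an image, with the example jars baked in under /opt/spark/examples. A sketch, assuming you run it from the unpacked Spark directory (the repo prefix and tag here are illustrative; the script tags the result as <repo>/spark:<tag>):

```shell
# Build a Spark container image from the local distribution;
# the examples jar ends up inside the image at
# /opt/spark/examples/jars/
./bin/docker-image-tool.sh -r myrepo -t 2.4.2 build
```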

2.2 Running Spark

➜  spark-2.4.2-bin-hadoop2.7 bin/spark-submit \
  --master k8s://http://localhost:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=spark:2.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-2.4.2.jar
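The master URL above points at a plain-HTTP API endpoint on localhost:8001, which is what `kubectl proxy` exposes by default (this also explains the "Kubernetes master URL uses HTTP instead of HTTPS" warning in the logs below). A sketch, assuming kubectl is already configured against your local cluster:

```shell
# Expose the Kubernetes API server at http://localhost:8001
# so spark-submit can reach it without TLS setup
kubectl proxy --port=8001
```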

2.3 Application logs

First, the logs shown in the terminal. These are printed by LoggingPodStatusWatcherImpl, a class whose job is to watch the status of the Spark app's pods on K8S.

Search the output for the keyword "phase" and you can trace the pod's state transitions:

Pending -> Running -> Succeeded

19/04/29 14:40:14 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/04/29 14:40:21 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1556520019644-driver
	 namespace: default
	 labels: spark-app-selector -> spark-c55df736c1134dd1ac14b67ad6f300b3, spark-role -> driver
	 pod uid: a9395174-6a49-11e9-8af3-025000000001
	 creation time: 2019-04-29T06:40:21Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-97296
	 node name: N/A
	 start time: N/A
	 container images: N/A
	 phase: Pending
	 status: []
19/04/29 14:40:21 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1556520019644-driver
	 namespace: default
	 labels: spark-app-selector -> spark-c55df736c1134dd1ac14b67ad6f300b3, spark-role -> driver
	 pod uid: a9395174-6a49-11e9-8af3-025000000001
	 creation time: 2019-04-29T06:40:21Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-97296
	 node name: docker-desktop
	 start time: N/A
	 container images: N/A
	 phase: Pending
	 status: []
19/04/29 14:40:21 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1556520019644-driver
	 namespace: default
	 labels: spark-app-selector -> spark-c55df736c1134dd1ac14b67ad6f300b3, spark-role -> driver
	 pod uid: a9395174-6a49-11e9-8af3-025000000001
	 creation time: 2019-04-29T06:40:21Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-97296
	 node name: docker-desktop
	 start time: 2019-04-29T06:40:21Z
	 container images: spark:2.3.0
	 phase: Pending
	 status: [ContainerStatus(containerID=null, image=spark:2.3.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
19/04/29 14:40:22 INFO Client: Waiting for application spark-pi to finish...
19/04/29 14:40:24 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1556520019644-driver
	 namespace: default
	 labels: spark-app-selector -> spark-c55df736c1134dd1ac14b67ad6f300b3, spark-role -> driver
	 pod uid: a9395174-6a49-11e9-8af3-025000000001
	 creation time: 2019-04-29T06:40:21Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-97296
	 node name: docker-desktop
	 start time: 2019-04-29T06:40:21Z
	 container images: spark:2.3.0
	 phase: Running
	 status: [ContainerStatus(containerID=docker://93c8f1b06820a2f95c4aa13b498edfc35bd63bc0da83ce4ef6f63dfe6c13eef3, image=spark:2.3.0, imageID=docker://sha256:1352ff0f5275feb3b49248ed4b167659d8d752a143fe40f271c1430829336cbd, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=2019-04-29T06:40:24Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
19/04/29 14:40:46 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1556520019644-driver
	 namespace: default
	 labels: spark-app-selector -> spark-c55df736c1134dd1ac14b67ad6f300b3, spark-role -> driver
	 pod uid: a9395174-6a49-11e9-8af3-025000000001
	 creation time: 2019-04-29T06:40:21Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-97296
	 node name: docker-desktop
	 start time: 2019-04-29T06:40:21Z
	 container images: spark:2.3.0
	 phase: Succeeded
	 status: [ContainerStatus(containerID=docker://93c8f1b06820a2f95c4aa13b498edfc35bd63bc0da83ce4ef6f63dfe6c13eef3, image=spark:2.3.0, imageID=docker://sha256:1352ff0f5275feb3b49248ed4b167659d8d752a143fe40f271c1430829336cbd, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://93c8f1b06820a2f95c4aa13b498edfc35bd63bc0da83ce4ef6f63dfe6c13eef3, exitCode=0, finishedAt=2019-04-29T06:40:45Z, message=null, reason=Completed, signal=null, startedAt=2019-04-29T06:40:24Z, additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
19/04/29 14:40:46 INFO LoggingPodStatusWatcherImpl: Container final statuses:


	 Container name: spark-kubernetes-driver
	 Container image: spark:2.3.0
	 Container state: Terminated
	 Exit code: 0
19/04/29 14:40:46 INFO Client: Application spark-pi finished.
19/04/29 14:40:46 INFO ShutdownHookManager: Shutdown hook called
19/04/29 14:40:46 INFO ShutdownHookManager: Deleting directory /private/var/folders/n8/xsvrzm1964xgwh1mn8hqdglr0000gn/T/spark-0bacf5b1-88d9-41bf-bdcb-23d3e6d4a738
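The same phase transitions can also be watched live with kubectl, filtering on the spark-role label that appears in the log output above:

```shell
# Watch the driver pod as its phase moves
# Pending -> Running -> Succeeded
kubectl get pods -l spark-role=driver -w
```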

Second, you can find the Driver and Executor pod logs in the K8S Dashboard, or fetch them directly with the command kubectl logs <pod name>. I won't go through these logs in detail; they are standard Spark output.

➜  spark-2.4.2-bin-hadoop2.7 kubectl logs spark-pi-1556521806974-driver
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.1.0.23 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
19/04/29 07:10:17 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/04/29 07:10:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/04/29 07:10:19 INFO SparkContext: Running Spark version 2.4.2
19/04/29 07:10:19 INFO SparkContext: Submitted application: Spark Pi
19/04/29 07:10:19 INFO SecurityManager: Changing view acls to: root
19/04/29 07:10:19 INFO SecurityManager: Changing modify acls to: root
19/04/29 07:10:19 INFO SecurityManager: Changing view acls groups to:
19/04/29 07:10:19 INFO SecurityManager: Changing modify acls groups to:
19/04/29 07:10:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
19/04/29 07:10:20 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
...
...
...

3 Summary

Spark has supported K8S as a cluster manager since 2.3. The implementation lives under resource-managers/kubernetes in the Spark source tree, and it mainly works by using the K8S Java client to call the K8S API. I'll dig into the detailed design when I have time.

As for why you would switch to K8S when everything runs fine on YARN, here's an article worth reading:

https://medium.com/@rachit1arora/why-run-spark-on-kubernetes-51c0ccb39c9b

  1. Data-processing pipelines are gradually being containerized. Once Spark itself is containerized, running it on K8S is a natural fit, since K8S is very mature at scheduling Docker-image-based containers.
  2. On K8S there is no longer a notion of physical machines; everything moves to the cloud, which makes both resource utilization and cost accounting clearer.
  3. Through K8S namespaces and quotas, the cluster can be shared by multiple tenants.
