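In Spark Cluster + Akka + Kafka + Scala Development (1): Setting Up the Development Environment, we set up a Spark development environment. The goal of this article is to write a Spark application and test it on a cluster.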
Create the project directories:
mkdir SimpleAPP
mkdir -p SimpleAPP/src/main/scala
Create the source file under SimpleAPP/src/main/scala, for example SimpleApp.scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    squareSum(10000)
  }

  // Sums i * i over the range 1 until n on a Spark RDD; n itself is excluded.
  private def squareSum(n: Long): Long = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val squareSum = sc.parallelize(1L until n).map { i =>
      i * i
    }.reduce(_ + _)
    println(s"============== The square sum of $n is $squareSum. ==============")
    // Release the context once the job has finished.
    sc.stop()
    squareSum
  }
}
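Create the sbt build file in the SimpleAPP directory, for example build.sbt (sbt reads any .sbt file in the project root):
name := "Simple Application Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"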
# move to the project folder
cd SimpleAPP
# build the project
sbt package
Output:
[info] Packaging .../target/scala-2.11/simple-application-project_2.11-1.0.jar ...
[info] Done packaging.
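The part target/scala-2.11/simple-application-project_2.11-1.0.jar is the relative path of the built jar file. Note it down; it is needed when running the application.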
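The jar path is determined by the settings in the sbt build file. The sketch below rebuilds it in plain shell, assuming sbt's default naming: the project name is lowercased, runs of non-word characters become hyphens, and the Scala binary version (2.11 for 2.11.8) and the project version are appended. The variable names are illustrative only and are not part of sbt.

```shell
# Values copied from the sbt build file.
name="Simple Application Project"
version="1.0"
scala_version="2.11.8"

# Lowercase the name and replace runs of non-word characters with a hyphen
# (assumed to mirror sbt's default name normalization).
normalized=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9_]+/-/g')

# Scala binary version: drop the patch component (2.11.8 -> 2.11).
scala_binary=${scala_version%.*}

jar_path="target/scala-${scala_binary}/${normalized}_${scala_binary}-${version}.jar"
echo "$jar_path"
```

If the name, version, or scalaVersion setting changes, the jar path passed to spark-submit changes with it.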
The option --master local[4] runs the application in local mode with 4 threads.
# run the project locally with 4 threads
$SPARK_HOME/bin/spark-submit --master local[4] --class SimpleApp target/scala-2.11/simple-application-project_2.11-1.0.jar
Output (surrounded by a lot of log output):
...
============== The square sum of 10000 is 333283335000. ==============
...
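The printed value can be cross-checked without Spark. Because the application sums over 1L until n, the upper bound is excluded, so the result is the sum of squares from 1 to n - 1. The sketch below compares the closed form m(m+1)(2m+1)/6 with a plain loop; it assumes a shell with 64-bit arithmetic, and the variable names are illustrative.

```shell
n=10000
m=$((n - 1))   # "1L until n" stops at n - 1

# Closed form for 1^2 + 2^2 + ... + m^2.
closed_form=$(( m * (m + 1) * (2 * m + 1) / 6 ))

# Brute-force sum over the same range, for comparison.
loop_sum=0
i=1
while [ "$i" -lt "$n" ]; do
  loop_sum=$(( loop_sum + i * i ))
  i=$(( i + 1 ))
done

echo "closed form: $closed_form, loop: $loop_sum"
```

Both values come out as 333283335000, matching the Spark output. Replacing until with to in the application would include n and add 10000 * 10000 to the result.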
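We have now finished developing a simple Spark project. The next step is to see how to run it on a cluster.
Deploying a standalone cluster is outside the scope of this article, so here we only use the cluster features on a single machine. To deploy a standalone cluster across multiple machines, see the instructions on the official Spark website; the deployment is fairly simple.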
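Master URL: spark://$(hostname):7077
The actual Master URL can be found in the master server's log. This is the value passed to the --master option.
Master Web UI: http://localhost:8080
The actual Master Web UI URL can be found in the master server's log.
Slave Web UI: http://localhost:8081
The actual Slave Web UI URL can be found in the slave (worker) server's log.
# start master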
$SPARK_HOME/sbin/start-master.sh
Output:
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-steven-org.apache.spark.deploy.master.Master-1-sycentos.localdomain.out
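The log file name in the output above follows a fixed pattern: spark-<user>-<daemon class>-<instance>-<host>.out under the logs directory. The sketch below assembles it from those parts, using the values from this article's machine; daemon_log is a hypothetical helper, not a Spark script.

```shell
# Values from this article's machine; on a real system use
# $SPARK_HOME, $(whoami) and $(hostname) instead.
spark_home=/opt/spark
user=steven
host=sycentos.localdomain

# Hypothetical helper: builds the log path for a daemon class and instance number.
daemon_log() {
  printf '%s/logs/spark-%s-%s-%s-%s.out\n' "$spark_home" "$user" "$1" "$2" "$host"
}

master_log=$(daemon_log org.apache.spark.deploy.master.Master 1)
worker_log=$(daemon_log org.apache.spark.deploy.worker.Worker 1)
echo "$master_log"
echo "$worker_log"
```

This is why the commands in this article can locate the logs with $(whoami) and $(hostname) instead of a hard-coded file name.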
# we need to get the Spark master URL
cat $SPARK_HOME/logs/spark-steven-org.apache.spark.deploy.master.Master-1-sycentos.localdomain.out | grep Master:
# or
cat $SPARK_HOME/logs/spark-$(whoami)-org.apache.spark.deploy.master.Master-1-$(hostname).out | grep Master:
Output:
16/09/23 19:45:37 INFO Master: Started daemon with process name: 4604@sycentos.localdomain
16/09/23 19:45:42 INFO Master: Starting Spark master at spark://sycentos.localdomain:7077
16/09/23 19:45:42 INFO Master: Running Spark version 2.0.0
16/09/23 19:45:44 INFO Master: I have been elected leader! New state: ALIVE
16/09/23 19:59:26 INFO Master: Registering worker 10.0.2.15:36442 with 4 cores, 2.7 GB RAM
16/09/23 20:15:13 INFO Master: 10.0.2.15:42662 got disassociated, removing it.
16/09/23 20:15:13 INFO Master: 10.0.2.15:36442 got disassociated, removing it.
16/09/23 20:15:13 INFO Master: Removing worker worker-20160923195923-10.0.2.15-36442 on 10.0.2.15:36442
16/09/23 20:15:39 INFO Master: Registering worker 10.0.2.15:42462 with 4 cores, 2.7 GB RAM
Note the line "Master: I have been elected leader! New state: ALIVE": the master is now active.
The Master URL is spark://sycentos.localdomain:7077, taken from the "Starting Spark master at" line.
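Instead of reading the URL off the log by eye, it can be extracted with grep. The sketch below runs on two sample lines copied from the log above; the variable names are illustrative, and on a real system the input would be the master log file rather than an inline string.

```shell
# Sample lines copied from the master log above.
master_log_sample='16/09/23 19:45:37 INFO Master: Started daemon with process name: 4604@sycentos.localdomain
16/09/23 19:45:42 INFO Master: Starting Spark master at spark://sycentos.localdomain:7077'

# Keep only the "Starting Spark master at" line, then cut out the spark:// URL.
master_url=$(printf '%s\n' "$master_log_sample" \
  | grep 'Starting Spark master at' \
  | grep -o 'spark://[^ ]*' \
  | head -n 1)

echo "$master_url"
```

The extracted value can then be passed to start-slave.sh or to the --master option of spark-submit.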
# start slave
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077
# or
# $SPARK_HOME/sbin/start-slave.sh spark://sycentos.localdomain:7077
Output:
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-steven-org.apache.spark.deploy.worker.Worker-1-sycentos.localdomain.out
cat $SPARK_HOME/logs/spark-$(whoami)-org.apache.spark.deploy.worker.Worker-1-$(hostname).out | grep spark://
# or
# cat $SPARK_HOME/logs/spark-steven-org.apache.spark.deploy.worker.Worker-1-sycentos.localdomain.out
Output:
16/09/23 20:15:39 INFO Worker: Successfully registered with master spark://sycentos.localdomain:7077
At this point, both the Spark master and slave services are running.
For reference, the commands to stop the master and the slave are:
$SPARK_HOME/sbin/stop-master.sh
$SPARK_HOME/sbin/stop-slave.sh
Change into the SimpleAPP directory and run:
# run the project
$SPARK_HOME/bin/spark-submit --master spark://$(hostname):7077 --class SimpleApp target/scala-2.11/simple-application-project_2.11-1.0.jar
Output:
...
16/09/23 20:34:40 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://sycentos.localdomain:7077...
...
16/09/23 20:34:40 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20160923203440-0000/0 on worker-20160923201537-10.0.2.15-42462 (10.0.2.15:42462) with 4 cores
...
Searching the output for the keywords master and worker confirms that the application ran on the cluster.
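The keyword check can also be scripted. The sketch below tests captured spark-submit output for the two lines shown above; ran_on_cluster is a hypothetical helper, and the sample text is copied from this article's output. It assumes the log output has been captured, for example by redirecting stderr with 2>&1, since Spark's default console logging goes to stderr.

```shell
# Sample spark-submit output copied from the run above.
submit_output=$(cat <<'EOF'
16/09/23 20:34:40 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://sycentos.localdomain:7077...
16/09/23 20:34:40 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20160923203440-0000/0 on worker-20160923201537-10.0.2.15-42462 (10.0.2.15:42462) with 4 cores
EOF
)

# Hypothetical helper: succeeds only if the text shows both a connection
# to a spark:// master and an executor placed on a worker.
ran_on_cluster() {
  printf '%s\n' "$1" | grep -q 'Connecting to master spark://' &&
    printf '%s\n' "$1" | grep -q 'Executor added: .* on worker-'
}

if ran_on_cluster "$submit_output"; then
  cluster_run=yes
else
  cluster_run=no
fi
echo "$cluster_run"
```

A run with --master local[4] produces neither line, so the same check reports no for local mode.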
The Master Web UI URL can be found in the master service's log.
# Query master web UI url from master service log.
cat /opt/spark/logs/spark-steven-org.apache.spark.deploy.master.Master-1-sycentos.localdomain.out | grep MasterWebUI
Output:
16/09/23 19:45:43 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://10.0.2.15:8080
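The URL and port can be pulled out of that log line with sed and shell parameter expansion. This is a minimal sketch on the sample line above; the variable names are illustrative.

```shell
# Sample line copied from the master log above.
web_ui_line='16/09/23 19:45:43 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://10.0.2.15:8080'

# Capture the URL that follows "started at".
web_ui_url=$(printf '%s\n' "$web_ui_line" | sed -n 's|.*started at \(http://[^ ]*\).*|\1|p')

# Everything after the last colon is the port.
web_ui_port=${web_ui_url##*:}

echo "$web_ui_url (port $web_ui_port)"
```

The log reports the machine's own address; on the same machine, http://localhost with the same port reaches the same page.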
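Open http://localhost:8080/ in a browser; the page lists one completed application.
We can now run SimpleApp in a cluster environment.
At this point we have written a Scala application for a Spark cluster. Next, see:
Spark Cluster + Akka + Kafka + Scala Development (3): Developing an Akka + Spark Application
Spark Cluster + Akka + Kafka + Scala Development (4): Developing a Kafka + Spark Application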