Spark 2.x Study Notes: 1. Spark 2.2 Quick Start (Local Mode)

1. Spark 2.2 Quick Start (Local Mode)

1.1 Spark Local Mode

When learning Spark, go from easy to hard: start with the simplest deployment, local mode.

Local mode (local) is commonly used for local development and testing. The Spark package is usable as soon as it is unpacked, which is what people mean by "works out of the box".
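To make local mode concrete, the sketch below shows how a Spark 2.x application selects it through the master URL: "local" uses one worker thread, "local[4]" uses four, and "local[*]" uses all available cores. This is only an illustrative, minimal example (the object name and the toy job are made up, and it assumes the spark-sql 2.2.0 artifact is on the classpath); the installation steps that follow do not depend on it.

import org.apache.spark.sql.SparkSession

// Minimal sketch of a self-contained Spark 2.x application in local mode.
object LocalQuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocalQuickStart")   // arbitrary example name
      .master("local[*]")           // local mode: run inside this JVM, using all cores
      .getOrCreate()

    // A trivial job to confirm the setup: sum the integers 1..100
    val sum = spark.sparkContext.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")          // prints 5050

    spark.stop()
  }
}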

1.2 Installing JDK 8

(1) Download. Log in to the Oracle website at http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html, accept the license agreement, and pick the link for the 64-bit Linux tar package. You can download it by clicking the link directly; a multi-threaded download tool (such as Xunlei) is recommended to speed things up.

(2) Upload to the server. Use XShell to upload the JDK 8 package downloaded on Windows to the server 192.168.1.180.

(3) Extract. Here we extract to the /opt directory; for easier management, I install all third-party software under /opt.

[root@master ~]# tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt

(4) Configure the JDK environment variables. They could be set in /etc/profile, but for easier management we create a custom.sh file under the /etc/profile.d/ directory to hold the user environment variables.

[root@master ~]# vi /etc/profile.d/custom.sh
[root@master ~]# cat /etc/profile.d/custom.sh
#java path
export JAVA_HOME=/opt/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
[root@master ~]#

(5) Make the environment variables take effect

[root@master ~]# source /etc/profile.d/custom.sh

(6) Run java -version to verify the JDK

[root@master ~]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
[root@master ~]# 

1.3 Downloading the Spark 2.x Package

(1) Go to the Spark download page at http://spark.apache.org/downloads.html

(2) For the first option choose the Spark release (2.2.0), for the second choose the package type (pre-built for Hadoop 2.7), and for the third choose the download type (downloading directly is slow, so choose "Select Apache Mirror").

(3) Click the spark-2.2.0-bin-hadoop2.7.tgz link and choose a mirror in China.

(4) Download with a multi-threaded download tool, or pick the nearest mirror and fetch the file directly with wget. Here we use the Tsinghua University mirror: wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz

[root@master ~]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
--2017-08-29 22:43:51--  http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Resolving mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)... 101.6.6.177, 2402:f000:1:416:101:6:6:177
Connecting to mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)|101.6.6.177|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203728858 (194M) [application/octet-stream]
Saving to: ‘spark-2.2.0-bin-hadoop2.7.tgz’

100%[============================================================================================================>] 203,728,858 9.79MB/s   in 23s    

2017-08-29 22:44:15 (8.32 MB/s) - ‘spark-2.2.0-bin-hadoop2.7.tgz’ saved [203728858/203728858]

[root@master ~]#

(5) Extract the package to the /opt directory. By convention, third-party packages on our Linux machines all go under /opt.

[root@master ~]# tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt

(6) Since the Spark root directory name is rather long, we rename it. This step is optional.

[root@master ~]# mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark-2.2.0

1.4 Spark Directory Layout

[root@master ~]# cd /opt/spark-2.2.0/
[root@master spark-2.2.0]# ll
total 84
drwxr-xr-x. 2 500 500  4096 Jun 30 19:09 bin
drwxr-xr-x. 2 500 500   230 Jun 30 19:09 conf
drwxr-xr-x. 5 500 500    50 Jun 30 19:09 data
drwxr-xr-x. 4 500 500    29 Jun 30 19:09 examples
drwxr-xr-x. 2 500 500 12288 Jun 30 19:09 jars
-rw-r--r--. 1 500 500 17881 Jun 30 19:09 LICENSE
drwxr-xr-x. 2 500 500  4096 Jun 30 19:09 licenses
-rw-r--r--. 1 500 500 24645 Jun 30 19:09 NOTICE
drwxr-xr-x. 8 500 500   240 Jun 30 19:09 python
drwxr-xr-x. 3 500 500    17 Jun 30 19:09 R
-rw-r--r--. 1 500 500  3809 Jun 30 19:09 README.md
-rw-r--r--. 1 500 500   128 Jun 30 19:09 RELEASE
drwxr-xr-x. 2 500 500  4096 Jun 30 19:09 sbin
drwxr-xr-x. 2 500 500    42 Jun 30 19:09 yarn
[root@master spark-2.2.0]# 

Directory    Description
bin          Executable scripts for the Spark commands
conf         Spark configuration files
data         Data used by the bundled examples
examples     Bundled example programs
jars         Spark's jar dependencies
sbin         Scripts for starting and stopping a cluster (Spark ships with its own standalone cluster manager)

Notes on the scripts in Spark's bin directory:

  • spark-shell: script that starts the interactive Spark shell
  • spark-submit: script that submits a Spark application for execution
  • run-example: script that runs one of the bundled Spark example programs
  • spark-sql: script that starts the Spark SQL command-line interface

1.5 Running an Example Program

[root@master1 spark-2.2.0]# bin/run-example SparkPi 4 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/29 01:27:26 INFO SparkContext: Running Spark version 2.2.0
17/08/29 01:27:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/29 01:27:26 INFO SparkContext: Submitted application: Spark Pi
17/08/29 01:27:27 INFO SecurityManager: Changing view acls to: root
17/08/29 01:27:27 INFO SecurityManager: Changing modify acls to: root
17/08/29 01:27:27 INFO SecurityManager: Changing view acls groups to: 
17/08/29 01:27:27 INFO SecurityManager: Changing modify acls groups to: 
17/08/29 01:27:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
17/08/29 01:27:27 INFO Utils: Successfully started service 'sparkDriver' on port 40549.
17/08/29 01:27:27 INFO SparkEnv: Registering MapOutputTracker
17/08/29 01:27:27 INFO SparkEnv: Registering BlockManagerMaster
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/08/29 01:27:27 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-719136e3-dc4e-4061-a07a-e5f04d679ad1
17/08/29 01:27:27 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/08/29 01:27:27 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/29 01:27:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/29 01:27:27 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.180:4040
17/08/29 01:27:27 INFO SparkContext: Added JAR file:/opt/spark-2.2.0/examples/jars/scopt_2.11-3.3.0.jar at spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar with timestamp 1503984447798
17/08/29 01:27:27 INFO SparkContext: Added JAR file:/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar at spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar with timestamp 1503984447798
17/08/29 01:27:27 INFO Executor: Starting executor ID driver on host localhost
17/08/29 01:27:27 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43952.
17/08/29 01:27:27 INFO NettyBlockTransferService: Server created on 192.168.1.180:43952
17/08/29 01:27:27 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/29 01:27:27 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.180:43952 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:28 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark-2.2.0/spark-warehouse').
17/08/29 01:27:28 INFO SharedState: Warehouse path is 'file:/opt/spark-2.2.0/spark-warehouse'.
17/08/29 01:27:29 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/08/29 01:27:29 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
17/08/29 01:27:29 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 4 output partitions
17/08/29 01:27:29 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
17/08/29 01:27:29 INFO DAGScheduler: Parents of final stage: List()
17/08/29 01:27:29 INFO DAGScheduler: Missing parents: List()
17/08/29 01:27:29 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
17/08/29 01:27:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
17/08/29 01:27:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
17/08/29 01:27:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.180:43952 (size: 1172.0 B, free: 366.3 MB)
17/08/29 01:27:29 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/08/29 01:27:29 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/08/29 01:27:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/08/29 01:27:29 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/29 01:27:29 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/08/29 01:27:29 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/08/29 01:27:29 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/08/29 01:27:29 INFO Executor: Fetching spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar with timestamp 1503984447798
17/08/29 01:27:29 INFO TransportClientFactory: Successfully created connection to /192.168.1.180:40549 after 34 ms (0 ms spent in bootstraps)
17/08/29 01:27:29 INFO Utils: Fetching spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar to /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/fetchFileTemp1808807623002630899.tmp
17/08/29 01:27:29 INFO Executor: Adding file:/tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/scopt_2.11-3.3.0.jar to class loader
17/08/29 01:27:29 INFO Executor: Fetching spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar with timestamp 1503984447798
17/08/29 01:27:29 INFO Utils: Fetching spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar to /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/fetchFileTemp3327801226116360399.tmp
17/08/29 01:27:29 INFO Executor: Adding file:/tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/spark-examples_2.11-2.2.0.jar to class loader
17/08/29 01:27:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 436 ms on localhost (executor driver) (1/4)
17/08/29 01:27:30 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 867 bytes result sent to driver
17/08/29 01:27:30 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 423 ms on localhost (executor driver) (2/4)
17/08/29 01:27:30 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 424 ms on localhost (executor driver) (3/4)
17/08/29 01:27:30 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 428 ms on localhost (executor driver) (4/4)
17/08/29 01:27:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/08/29 01:27:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.482 s
17/08/29 01:27:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.766385 s
Pi is roughly 3.1493878734696836
17/08/29 01:27:30 INFO SparkUI: Stopped Spark web UI at http://192.168.1.180:4040
17/08/29 01:27:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/08/29 01:27:30 INFO MemoryStore: MemoryStore cleared
17/08/29 01:27:30 INFO BlockManager: BlockManager stopped
17/08/29 01:27:30 INFO BlockManagerMaster: BlockManagerMaster stopped
17/08/29 01:27:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/08/29 01:27:30 INFO SparkContext: Successfully stopped SparkContext
17/08/29 01:27:30 INFO ShutdownHookManager: Shutdown hook called
17/08/29 01:27:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8
[root@master1 spark-2.2.0]# 

The result can be seen in the output: Pi is roughly 3.1493878734696836
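For reference, SparkPi estimates π with a simple Monte Carlo method: it scatters random points over the square [-1, 1] × [-1, 1] and counts how many land inside the unit circle; that fraction approximates π/4. The snippet below is a simplified sketch of the idea, not the exact source of the bundled example. It is meant to be pasted into spark-shell (introduced in the next section), where the SparkContext sc is already defined; the sample count and partition count are arbitrary.

// Paste into spark-shell; 'sc' is predefined there.
val n = 400000                          // number of random samples (arbitrary)
val count = sc.parallelize(1 to n, 4).map { _ =>
  val x = math.random * 2 - 1           // random point in [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0      // 1 if the point falls inside the unit circle
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")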

1.6 First Look at spark-shell

Start spark-shell:

[root@master spark-2.2.0]# bin/spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/28 23:32:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/28 23:32:50 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.180:4040
Spark context available as 'sc' (master = local[*], app id = local-1503977564935).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

The spark-shell log above contains the line "Spark context Web UI available at http://192.168.1.180:4040", which means spark-shell has started a web UI; open http://192.168.1.180:4040 in your browser to view it.
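Here are a few small things you can try at the scala> prompt, as an illustrative sketch only: it assumes spark-shell was started from the Spark root directory /opt/spark-2.2.0, so the bundled README.md resolves relative to the current working directory.

// 'sc' (SparkContext) and 'spark' (SparkSession) are created for you by spark-shell.
val readme = sc.textFile("README.md")             // load the README shipped with Spark
readme.count()                                    // number of lines in the file
readme.filter(_.contains("Spark")).count()        // lines that mention "Spark"
spark.range(1, 11).selectExpr("sum(id)").show()   // the SparkSession entry point works too

Type :quit (or press Ctrl+D) to leave the shell.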
