
Spark 2.x Study Notes: 1. Spark 2.2 Quick Start (Local Mode)

1. Spark 2.2 Quick Start (Local Mode)

1.1 Spark Local Mode

When learning Spark, start with the easy parts before the hard ones, and the easiest place to begin is local mode.

Local mode (local) is commonly used for local development and testing. Simply unpack the Spark distribution and it is ready to run, which is why it is often described as working "out of the box".

1.2 Installing JDK 8

(1) Download. Go to the Oracle website at http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html, accept the license agreement, and choose the 64-bit Linux tar package. You can click the link to download directly, or use a multi-threaded download tool (such as Xunlei) to speed things up.
(2) Upload to the server. Use XShell to upload the JDK 8 package downloaded on Windows to the server 192.168.1.180.
(3) Extract. Here we extract to the /opt directory; for easier management, all third-party software is installed under /opt.

[root@master ~]# tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt

(4) Configure the JDK environment variables. They could be set in /etc/profile, but for easier management we create a custom.sh file under the /etc/profile.d/ directory to hold user environment variables.

[root@master ~]# vi /etc/profile.d/custom.sh
[root@master ~]# cat /etc/profile.d/custom.sh
#java path
export JAVA_HOME=/opt/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
[root@master ~]#

(5) Apply the environment variables

[root@master ~]# source /etc/profile.d/custom.sh

(6) Run java -version to verify the JDK installation

[root@master ~]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
[root@master ~]# 

1.3 Downloading the Spark 2.x Package

(1) Open the Spark download page: http://spark.apache.org/downloads.html
(2) Make three choices on the page: first the Spark release (choose 2.2.0), second the package type (choose the build for Hadoop 2.7), and third the download type (a direct download is slow, so choose "Select Apache Mirror").

(3) Click the spark-2.2.0-bin-hadoop2.7.tgz link and choose a domestic mirror.

(4) You can speed up the download with a multi-threaded download tool, or simply pick a nearby mirror. Here we choose the Tsinghua University mirror and download directly with wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz:

[root@master ~]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
--2017-08-29 22:43:51--  http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Resolving mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)... 101.6.6.177, 2402:f000:1:416:101:6:6:177
Connecting to mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)|101.6.6.177|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203728858 (194M) [application/octet-stream]
Saving to: ‘spark-2.2.0-bin-hadoop2.7.tgz’

100%[============================================================================================================>] 203,728,858 9.79MB/s   in 23s    

2017-08-29 22:44:15 (8.32 MB/s) - ‘spark-2.2.0-bin-hadoop2.7.tgz’ saved [203728858/203728858]

[root@master ~]#

(5) Then extract the package to the /opt directory. By convention, we place all third-party software on Linux under /opt.

[root@master ~]# tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt

(6) The Spark root directory name is rather long, so we rename it. This step is optional.

[root@master ~]# mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark-2.2.0

1.4 Spark Directory Structure

[root@master ~]# cd /opt/spark-2.2.0/
[root@master spark-2.2.0]# ll
total 84
drwxr-xr-x. 2 500 500  4096 Jun 30 19:09 bin
drwxr-xr-x. 2 500 500   230 Jun 30 19:09 conf
drwxr-xr-x. 5 500 500    50 Jun 30 19:09 data
drwxr-xr-x. 4 500 500    29 Jun 30 19:09 examples
drwxr-xr-x. 2 500 500 12288 Jun 30 19:09 jars
-rw-r--r--. 1 500 500 17881 Jun 30 19:09 LICENSE
drwxr-xr-x. 2 500 500  4096 Jun 30 19:09 licenses
-rw-r--r--. 1 500 500 24645 Jun 30 19:09 NOTICE
drwxr-xr-x. 8 500 500   240 Jun 30 19:09 python
drwxr-xr-x. 3 500 500    17 Jun 30 19:09 R
-rw-r--r--. 1 500 500  3809 Jun 30 19:09 README.md
-rw-r--r--. 1 500 500   128 Jun 30 19:09 RELEASE
drwxr-xr-x. 2 500 500  4096 Jun 30 19:09 sbin
drwxr-xr-x. 2 500 500    42 Jun 30 19:09 yarn
[root@master spark-2.2.0]# 

Directory   Description
---------   -----------
bin         Executable scripts for Spark commands
conf        Spark configuration files
data        Data used by the bundled example programs
examples    Spark's bundled example programs
jars        Jar packages that Spark depends on (note: in Spark 2.x this directory is jars, not lib)
sbin        Scripts to start and stop a cluster, since Spark ships with its own standalone cluster manager

Notes on the scripts in Spark's bin directory:

  • spark-shell: script that launches the interactive Spark shell
  • spark-submit: script for submitting Spark applications
  • run-example: script that runs one of Spark's bundled example programs
  • spark-sql: script that launches the Spark SQL command line
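To see how these scripts relate, note that run-example is essentially a convenience wrapper around spark-submit. The SparkPi example run in the next section could also be submitted directly, along the following lines (the jar path matches this install's layout, as shown in the logs below; adjust it for other Spark versions):

```
cd /opt/spark-2.2.0

# Equivalent in spirit to "bin/run-example SparkPi 4": name the main
# class explicitly and run locally with 4 worker threads.
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[4]" \
  examples/jars/spark-examples_2.11-2.2.0.jar \
  4
```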

1.5 Running an Example Program

[root@master1 spark-2.2.0]# bin/run-example SparkPi 4 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/29 01:27:26 INFO SparkContext: Running Spark version 2.2.0
17/08/29 01:27:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/29 01:27:26 INFO SparkContext: Submitted application: Spark Pi
17/08/29 01:27:27 INFO SecurityManager: Changing view acls to: root
17/08/29 01:27:27 INFO SecurityManager: Changing modify acls to: root
17/08/29 01:27:27 INFO SecurityManager: Changing view acls groups to: 
17/08/29 01:27:27 INFO SecurityManager: Changing modify acls groups to: 
17/08/29 01:27:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
17/08/29 01:27:27 INFO Utils: Successfully started service 'sparkDriver' on port 40549.
17/08/29 01:27:27 INFO SparkEnv: Registering MapOutputTracker
17/08/29 01:27:27 INFO SparkEnv: Registering BlockManagerMaster
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/08/29 01:27:27 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-719136e3-dc4e-4061-a07a-e5f04d679ad1
17/08/29 01:27:27 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/08/29 01:27:27 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/29 01:27:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/29 01:27:27 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.180:4040
17/08/29 01:27:27 INFO SparkContext: Added JAR file:/opt/spark-2.2.0/examples/jars/scopt_2.11-3.3.0.jar at spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar with timestamp 1503984447798
17/08/29 01:27:27 INFO SparkContext: Added JAR file:/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar at spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar with timestamp 1503984447798
17/08/29 01:27:27 INFO Executor: Starting executor ID driver on host localhost
17/08/29 01:27:27 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43952.
17/08/29 01:27:27 INFO NettyBlockTransferService: Server created on 192.168.1.180:43952
17/08/29 01:27:27 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/29 01:27:27 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.180:43952 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:28 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark-2.2.0/spark-warehouse').
17/08/29 01:27:28 INFO SharedState: Warehouse path is 'file:/opt/spark-2.2.0/spark-warehouse'.
17/08/29 01:27:29 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/08/29 01:27:29 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
17/08/29 01:27:29 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 4 output partitions
17/08/29 01:27:29 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
17/08/29 01:27:29 INFO DAGScheduler: Parents of final stage: List()
17/08/29 01:27:29 INFO DAGScheduler: Missing parents: List()
17/08/29 01:27:29 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
17/08/29 01:27:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
17/08/29 01:27:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
17/08/29 01:27:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.180:43952 (size: 1172.0 B, free: 366.3 MB)
17/08/29 01:27:29 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/08/29 01:27:29 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/08/29 01:27:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/08/29 01:27:29 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/29 01:27:29 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/08/29 01:27:29 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/08/29 01:27:29 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/08/29 01:27:29 INFO Executor: Fetching spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar with timestamp 1503984447798
17/08/29 01:27:29 INFO TransportClientFactory: Successfully created connection to /192.168.1.180:40549 after 34 ms (0 ms spent in bootstraps)
17/08/29 01:27:29 INFO Utils: Fetching spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar to /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/fetchFileTemp1808807623002630899.tmp
17/08/29 01:27:29 INFO Executor: Adding file:/tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/scopt_2.11-3.3.0.jar to class loader
17/08/29 01:27:29 INFO Executor: Fetching spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar with timestamp 1503984447798
17/08/29 01:27:29 INFO Utils: Fetching spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar to /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/fetchFileTemp3327801226116360399.tmp
17/08/29 01:27:29 INFO Executor: Adding file:/tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/spark-examples_2.11-2.2.0.jar to class loader
17/08/29 01:27:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 436 ms on localhost (executor driver) (1/4)
17/08/29 01:27:30 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 867 bytes result sent to driver
17/08/29 01:27:30 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 423 ms on localhost (executor driver) (2/4)
17/08/29 01:27:30 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 424 ms on localhost (executor driver) (3/4)
17/08/29 01:27:30 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 428 ms on localhost (executor driver) (4/4)
17/08/29 01:27:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/08/29 01:27:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.482 s
17/08/29 01:27:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.766385 s
Pi is roughly 3.1493878734696836
17/08/29 01:27:30 INFO SparkUI: Stopped Spark web UI at http://192.168.1.180:4040
17/08/29 01:27:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/08/29 01:27:30 INFO MemoryStore: MemoryStore cleared
17/08/29 01:27:30 INFO BlockManager: BlockManager stopped
17/08/29 01:27:30 INFO BlockManagerMaster: BlockManagerMaster stopped
17/08/29 01:27:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/08/29 01:27:30 INFO SparkContext: Successfully stopped SparkContext
17/08/29 01:27:30 INFO ShutdownHookManager: Shutdown hook called
17/08/29 01:27:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8
[root@master1 spark-2.2.0]# 

The result can be seen in the output: Pi is roughly 3.1493878734696836
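SparkPi estimates Pi with a Monte Carlo method: it scatters random points over the unit square and counts the fraction that land inside the inscribed circle, parallelizing the sampling across the given number of partitions. A minimal single-machine sketch of the same calculation, written in plain Python for illustration (no Spark; the sample count and seed are arbitrary choices):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte Carlo estimate of Pi: the fraction of random points in the
    square [-1, 1] x [-1, 1] that fall inside the unit circle tends to Pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print("Pi is roughly", estimate_pi(100000))
```

The estimate converges slowly (the error shrinks roughly as the inverse square root of the sample count), which is exactly why SparkPi spreads the sampling over many partitions.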

1.6 A First Look at spark-shell

Start spark-shell:

[root@master spark-2.2.0]# bin/spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/28 23:32:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/28 23:32:50 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.180:4040
Spark context available as 'sc' (master = local[*], app id = local-1503977564935).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

The spark-shell log above includes the line "Spark context Web UI available at http://192.168.1.180:4040", which means spark-shell has started a Web UI; enter http://192.168.1.180:4040 in a browser's address bar to open it.
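As the startup banner notes, a SparkContext (sc) and a SparkSession (spark) are already available at the scala> prompt. A hypothetical first session might look like the following (the comments show the expected results; the REPL's own echoed output will be more verbose):

```scala
// Create an RDD from a local collection and run a few basic actions
val rdd = sc.parallelize(1 to 100)
rdd.count()                      // 100 elements
rdd.sum()                        // 5050.0
rdd.filter(_ % 2 == 0).take(5)   // Array(2, 4, 6, 8, 10)
```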

This article is part of the Tencent Cloud self-media sharing program.
