
Kettle and Hadoop (9): Submitting Spark Jobs


Goal: configure Kettle to submit jobs to a Spark cluster.

Environment:

Spark History Server: 172.16.1.126

Spark Gateway: 172.16.1.124, 172.16.1.125, 172.16.1.126, 172.16.1.127

PDI: 172.16.1.105

Hadoop version: CDH 6.3.1; Spark version: 2.4.0-cdh6.3.1; PDI version: 8.3

For connecting Kettle to CDH, see https://wxy0327.blog.csdn.net/article/details/106406702.

Configuration steps:

1. Copy the Spark library files from CDH to the PDI host

# Run on 172.16.1.126
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark
scp -r * 172.16.1.105:/root/spark/
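
A quick way to confirm the copy is complete is to compare file counts on both ends (a minimal sketch using the paths above):

# On 172.16.1.126
find /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark -type f | wc -l
# On 172.16.1.105: the count under /root/spark should match
find /root/spark -type f | wc -l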

2. Configure Spark for Kettle

All of the following operations are performed on 172.16.1.105 as the root user.

(1) Back up the original configuration files

cd /root/spark/conf
cp spark-defaults.conf spark-defaults.conf.bak
cp spark-env.sh spark-env.sh.bak

(2) Edit the spark-defaults.conf file

vim /root/spark/conf/spark-defaults.conf

The content is as follows:

spark.yarn.archive=hdfs://manager:8020/user/spark/lib/spark_jars.zip
spark.hadoop.yarn.timeline-service.enabled=false
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://manager:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://node2:18088
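
Note that spark.yarn.archive points at hdfs://manager:8020/user/spark/lib/spark_jars.zip, which must already exist on HDFS. If your cluster does not have it yet, a sketch for building and uploading it (assuming the CDH parcel layout used in step 1) looks like this:

# Run on a cluster node, e.g. 172.16.1.126
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/jars
zip -r -0 /tmp/spark_jars.zip .    # -0 stores the jars without recompressing them
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put /tmp/spark_jars.zip /user/spark/lib/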

(3) Edit the spark-env.sh file

vim /root/spark/conf/spark-env.sh

The content is as follows:

#!/usr/bin/env bash

# Hadoop client configuration: the CDH shim that ships with PDI
HADOOP_CONF_DIR=/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61
# The Spark libraries copied from the cluster in step 1
SPARK_HOME=/root/spark
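
Before wiring anything into Kettle, it is worth verifying that this client can submit to YARN at all. A minimal check, using the example jar copied over in step 1 (the same jar path appears in the job log further down):

/root/spark/bin/spark-submit \
  --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar 10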

(4) Edit the core-site.xml file

vim /root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/core-site.xml

Uncomment the following section:

代码语言:javascript
复制
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>
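
This property references a script on the local filesystem. If /etc/hadoop/conf.cloudera.yarn/topology.py does not exist on the PDI host (an assumption worth checking; the PDI install does not create it), one option is to copy it from a cluster node:

mkdir -p /etc/hadoop/conf.cloudera.yarn
scp 172.16.1.126:/etc/hadoop/conf.cloudera.yarn/topology.py /etc/hadoop/conf.cloudera.yarn/
scp 172.16.1.126:/etc/hadoop/conf.cloudera.yarn/topology.map /etc/hadoop/conf.cloudera.yarn/
chmod +x /etc/hadoop/conf.cloudera.yarn/topology.py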

Submitting the Spark job:

1. Modify the Spark sample that ships with PDI

cp /root/data-integration/samples/jobs/Spark\ Submit/Spark\ submit.kjb /root/big_data/

Open the file /root/big_data/Spark submit.kjb in Kettle, as shown in Figure 1.

Figure 1

Edit the Spark Submit Sample job entry, as shown in Figure 2.

Figure 2

2. Save and run the job

The log is as follows:

2020/06/10 10:12:19 - Spoon - Starting job...
2020/06/10 10:12:19 - Spark submit - Start of job execution
2020/06/10 10:12:19 - Spark submit - Starting entry [Spark PI]
2020/06/10 10:12:19 - Spark PI - Submitting Spark Script
2020/06/10 10:12:20 - Spark PI - Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO client.RMProxy: Connecting to ResourceManager at manager/172.16.1.124:8032
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO conf.Configuration: resource-types.xml not found
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up container launch context for our AM
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up the launch environment for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Preparing resources for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://manager:8020/user/spark/lib/spark_jars.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/spark-examples_2.11-2.4.0-cdh6.3.1.jar
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/tmp/spark-281973dd-8233-4f12-b416-36d28b74159c/__spark_conf__2533521329006469303.zip -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/__spark_conf__.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO conf.HiveConf: Found configuration file file:/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/hive-site.xml
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO security.YARNHadoopDelegationTokenManager: Attempting to load user's ticket cache.
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Submitting application application_1591323999364_0060 to ResourceManager
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO impl.YarnClientImpl: Submitted application application_1591323999364_0060
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client: Application report for application_1591323999364_0060 (state: ACCEPTED)
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client:
2020/06/10 10:12:23 - Spark PI -      client token: N/A
2020/06/10 10:12:23 - Spark PI -      diagnostics: AM container is launched, waiting for AM container to Register with RM
2020/06/10 10:12:23 - Spark PI -      ApplicationMaster host: N/A
2020/06/10 10:12:23 - Spark PI -      ApplicationMaster RPC port: -1
2020/06/10 10:12:23 - Spark PI -      queue: root.users.root
2020/06/10 10:12:23 - Spark PI -      start time: 1591755142818
2020/06/10 10:12:23 - Spark PI -      final status: UNDEFINED
2020/06/10 10:12:23 - Spark PI -      tracking URL: http://manager:8088/proxy/application_1591323999364_0060/
2020/06/10 10:12:24 - Spark submit - Starting entry [Success]
2020/06/10 10:12:24 - Spark submit - Finished job entry [Success] (result=[true])
2020/06/10 10:12:24 - Spark submit - Finished job entry [Spark PI] (result=[true])
2020/06/10 10:12:24 - Spark submit - Job execution finished
2020/06/10 10:12:24 - Spoon - Job has ended.
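
Note from the timestamps that the job entry reports success right after YARN accepts the application, before the Spark application itself finishes. The final state can also be checked from the command line (using the application ID from the log above):

yarn application -status application_1591323999364_0060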

The Spark History Server web UI is shown in Figure 3.

Figure 3

Click “application_1591323999364_0061”, as shown in Figure 4.

Figure 4

