Zeppelin Resources
Zeppelin Project Information
Zeppelin Documentation
Flink on Zeppelin Live Stream Series
Flink on Zeppelin Article Series
Zeppelin Quick Start
Installation
Adjusting Memory
Configuring Remote Access and Port
Configuring Multi-Tenancy
Starting and Stopping
Logging In
Flink on Zeppelin
Configuring the Flink Interpreter
Hive Integration Parameters
Spark on Zeppelin
Configuring the Spark Interpreter
Zeppelin Resources
Zeppelin Project Information
Zeppelin source code: https://github.com/apache/zeppelin
Zeppelin JIRA: https://issues.apache.org/jira/projects/ZEPPELIN/summary
Zeppelin Documentation
Flink on Zeppelin documentation hub: https://www.yuque.com/jeffzhangjianfeng/gldg8w
Flink on Zeppelin FAQ: https://www.yuque.com/jeffzhangjianfeng/gldg8w/yf63y7
Other Zeppelin usage documentation: https://www.yuque.com/jeffzhangjianfeng/ggi5ys
Flink on Zeppelin Live Stream Series
Flink on Zeppelin short video tutorial collection: https://www.bilibili.com/video/BV1Te411W73b/
Flink on Zeppelin: The Ultimate Experience (1) Getting Started + Batch: https://www.bilibili.com/video/av91740063
Flink on Zeppelin: The Ultimate Experience (2) Streaming + Advanced Applications: https://www.bilibili.com/video/av93631574
Flink on Zeppelin Article Series
Flink on Zeppelin (1) Getting Started: https://mp.weixin.qq.com/s/a6Zau9c1ZWTSotl_dMg0Xg
Flink on Zeppelin (2) Batch: https://mp.weixin.qq.com/s/K9rPXqqaPuhnIT_TZN8M3w
Flink on Zeppelin (3) Streaming: https://mp.weixin.qq.com/s/k_0NgJinpK0VVTXw_Jd7ag
Flink on Zeppelin (4) Machine Learning: https://mp.weixin.qq.com/s/ccyptHGgB_PQ0e6V8B9UKQ
Flink on Zeppelin (5) Advanced Features: https://mp.weixin.qq.com/s/jZV6gua8ypqdiGPBulOw6Q
Zeppelin Quick Start
Installation
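If the binary distribution is not on the machine yet, download it first; the URL below is an assumption based on the standard Apache archive layout for Zeppelin 0.10.0.
# assumed download URL (Apache archive layout for Zeppelin 0.10.0)
[song@cdh68 ~]$ wget https://archive.apache.org/dist/zeppelin/zeppelin-0.10.0/zeppelin-0.10.0-bin-all.tgz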
[song@cdh68 ~]$ tar -zxvf zeppelin-0.10.0-bin-all.tgz -C app/
[song@cdh68 ~]$ cd app/zeppelin-0.10.0-bin-all/
Adjusting Memory
Zeppelin memory is adjusted in two places: the Zeppelin server process and the interpreter processes.
Set ZEPPELIN_MEM in zeppelin-env.sh to change the Zeppelin server memory.
Set ZEPPELIN_INTP_MEM in zeppelin-env.sh to change the interpreter process memory.
export ZEPPELIN_MEM="-Xms1024m -Xmx8192m -XX:MaxMetaspaceSize=1024m"
export ZEPPELIN_INTP_MEM="-Xms1024m -Xmx2048m -XX:MaxMetaspaceSize=512m"
# JDK home directory
export JAVA_HOME=/opt/jdk1.8.0_172
# Needed later for the Interpreter on YARN mode. Hadoop must be installed and the hadoop command must be on the system PATH
export USE_HADOOP=true
# Hadoop configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/hadoop-conf
Configuring Remote Access and Port
Set the Zeppelin server's bind address and port, and optionally the notebook and recovery storage, in zeppelin-site.xml.
The relevant properties are:
zeppelin.server.addr = 0.0.0.0 (server binding address)
zeppelin.server.port = 8080 (server port)
zeppelin.notebook.storage = org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo (Hadoop-compatible file system notebook persistence layer, e.g. local file system, HDFS, Azure WASB, S3)
zeppelin.notebook.dir = /zeppelin/notebook (path or URI where notebooks are persisted)
zeppelin.recovery.storage.class = org.apache.zeppelin.interpreter.recovery.FileSystemRecoveryStorage (RecoveryStorage implementation based on the Hadoop FileSystem)
zeppelin.recovery.dir = /zeppelin/recovery (location where recovery metadata is stored)
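Put together, a minimal sketch of these entries inside the configuration element of zeppelin-site.xml (values copied from the list above; adjust the storage paths for your cluster):
<!-- sketch: server binding plus notebook and recovery storage on a Hadoop-compatible file system -->
<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8080</value>
</property>
<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo</value>
</property>
<property>
  <name>zeppelin.notebook.dir</name>
  <value>/zeppelin/notebook</value>
</property>
<property>
  <name>zeppelin.recovery.storage.class</name>
  <value>org.apache.zeppelin.interpreter.recovery.FileSystemRecoveryStorage</value>
</property>
<property>
  <name>zeppelin.recovery.dir</name>
  <value>/zeppelin/recovery</value>
</property>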
Configuring Multi-Tenancy
Multi-user Support[1]
Disable anonymous access
In zeppelin-site.xml, set zeppelin.anonymous.allowed to false (anonymous access is allowed by default).
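The corresponding entry inside zeppelin-site.xml, a one-property sketch in the same Hadoop-style format:
<property>
  <name>zeppelin.anonymous.allowed</name>
  <value>false</value>
</property>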
Shiro
Apache Shiro authentication for Apache Zeppelin[2]
Configuring hashed passwords
Use the Shiro Command Line Hasher to hash user passwords.
## fetch the Command Line Hasher tool from Maven Central
mvn dependency:get -DgroupId=org.apache.shiro.tools -DartifactId=shiro-tools-hasher -Dclassifier=cli -Dversion=1.7.0
## use the downloaded tool to hash a user's password
java -jar ~/.m2/repository/org/apache/shiro/tools/shiro-tools-hasher/1.7.0/shiro-tools-hasher-1.7.0-cli.jar -p
Password to hash:
Password to hash (confirm):
$shiro1$SHA-256$500000$ybTZ7NhAdqsYUyD8ytJ95A==$+LP9EVgd/Dnokwp6V1n8cg1BQHx1J1LlxwCAGX+QLMY=
In the [main] section of shiro.ini, configure the implicit iniRealm with a CredentialsMatcher that knows how to validate these securely hashed passwords, as in the sketch below.
If you do not use hashed passwords, comment those lines out.
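A sketch of the relevant shiro.ini entries using the hash generated above; the PasswordMatcher wiring follows the pattern from the Shiro documentation.
[main]
# CredentialsMatcher that knows how to validate $shiro1$ hashed passwords
passwordMatcher = org.apache.shiro.authc.credential.PasswordMatcher
iniRealm.credentialsMatcher = $passwordMatcher

[users]
# store the hash in place of the plain-text password; roles are listed as before
song = $shiro1$SHA-256$500000$ybTZ7NhAdqsYUyD8ytJ95A==$+LP9EVgd/Dnokwp6V1n8cg1BQHx1J1LlxwCAGX+QLMY=, admin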
Using plain-text passwords
Add a user named song:
[users]
song = password1, admin, role1
user2 = password3, role3
user3 = password4, role2
[roles]
role1 = *
role2 = *
role3 = *
admin = *
Starting and Stopping
Start:
bin/zeppelin-daemon.sh start
Stop:
bin/zeppelin-daemon.sh stop
Restart:
bin/zeppelin-daemon.sh restart
Logging In
Open the configured address and port in a browser and log in with one of the usernames and passwords defined above.
Flink on Zeppelin
Configuring the Flink Interpreter
Flink support in Zeppelin[3]
Flink Interpreter for Apache Zeppelin[4]
First, set the interpreter binding mode to Isolated Per Note.
In this mode each note launches its own interpreter process, similar to the per-job mode of Flink on YARN, which is the best fit for production use.
Click the user menu --> Interpreter, search for flink, and click edit to configure it.
The configuration is as follows:
FLINK_HOME=/home/song/app/flink-1.13.2
HADOOP_CONF_DIR=/etc/hadoop/conf
HIVE_CONF_DIR=/etc/hive/conf
flink.execution.mode=yarn
jobmanager.memory.process.size=8192m
taskmanager.memory.process.size=20480m
taskmanager.numberOfTaskSlots=5
yarn.application.name=Zeppelin Flink yarn
yarn.application.queue=default
zeppelin.flink.run.asLoginUser=true
zeppelin.flink.enableHive=true
zeppelin.flink.hive.version=2.1.1
On the interpreter edit page this corresponds to the binding option "The interpreter will be instantiated Per Note in isolated process".
We use the yarn execution mode here; the yarn-application mode failed with an error.
Hive Integration Parameters
To query Hive data and manage Flink SQL metadata with the HiveCatalog, the Hive integration must also be configured:
HIVE_CONF_DIR: directory containing the Hive configuration file (hive-site.xml);
zeppelin.flink.enableHive: set to true to enable the Hive integration;
zeppelin.flink.hive.version: the Hive version.
Copy the Hive-integration dependencies into $FLINK_HOME/lib (match the jar versions to your Flink release), including:
flink-connector-hive_2.11-1.11.0.jar
flink-hadoop-compatibility_2.11-1.11.0.jar
hive-exec-*.jar
If the Hive version is 1.x, also add hive-metastore-1.*.jar, libfb303-0.9.2.jar and libthrift-0.9.2.jar.
Make sure the Hive Metastore service is running. It must not run in embedded mode, i.e. an external database (MySQL, PostgreSQL, etc.) must be used as the metadata store.
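Once the Flink interpreter starts, a quick way to verify the Hive integration is a batch Flink SQL paragraph; this is only a sketch, and the table name is hypothetical.
%flink.bsql
-- the HiveCatalog is typically registered under the name "hive" when zeppelin.flink.enableHive is true
use catalog hive;
-- tables from the Hive metastore should show up here
show tables;
-- hypothetical table name, replace with a table that exists in your metastore
select count(1) from my_hive_table;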
Spark on Zeppelin
Spark Interpreter for Apache Zeppelin[5]
Configuring the Spark Interpreter
Click the user menu --> Interpreter, search for spark, and click edit to configure it.
The configuration is as follows:
SPARK_HOME=/home/song/app/spark/spark-2.4.8-bin-hadoop2.7
spark.master=yarn-cluster
spark.submit.deployMode=cluster
spark.app.name=Zeppelin On Spark
spark.driver.memory=4g
spark.executor.memory=4g
spark.executor.instances=4
PYSPARK_PYTHON=/data1/app/anaconda3/bin/python
PYSPARK_DRIVER_PYTHON=/data1/app/anaconda3/bin/python
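After saving these settings, a minimal sanity-check paragraph (a sketch; the output depends on your environment) is:
%spark.sql
-- should return at least the default database once the interpreter is up on YARN
show databases;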
We use Python 3.7 here; with Python 3.9, which Spark 2.4's PySpark does not support, the following error occurs:
TypeError: required field "type_ignores" missing from Module
The first run failed with an error. To fix it, add the following to zeppelin-env.sh:
# set hadoop conf dir
export HADOOP_CONF_DIR=/etc/hadoop/conf
Running again produced another error.
Cause:
Spark 2 ships Jersey 2.22, while YARN still depends on Jersey 1.9; the version mismatch causes the failure.
Solution:
/usr/hdp/current/hadoop-client/lib/jersey-core-1.19.jar
/usr/hdp/current/hadoop-yarn-client/lib/jersey-client-1.19.jar
/usr/hdp/current/hadoop-yarn-client/lib/jersey-guice-1.19.jar
Copy the jars listed above into ${SPARK_HOME}/jars, for example with the command sketch below.
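A command sketch for copying them; the source paths are the ones listed above, and ${SPARK_HOME} is the directory configured in the interpreter settings.
# copy the Jersey 1.x jars shipped with the HDP Hadoop/YARN clients into Spark's jars directory
cp /usr/hdp/current/hadoop-client/lib/jersey-core-1.19.jar \
   /usr/hdp/current/hadoop-yarn-client/lib/jersey-client-1.19.jar \
   /usr/hdp/current/hadoop-yarn-client/lib/jersey-guice-1.19.jar \
   ${SPARK_HOME}/jars/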
Running once more produced a "bad substitution" error:
22/09/29 17:10:46 INFO cluster.YarnClientSchedulerBackend: Stopped
22/09/29 17:10:46 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/09/29 17:10:46 INFO memory.MemoryStore: MemoryStore cleared
22/09/29 17:10:46 INFO storage.BlockManager: BlockManager stopped
22/09/29 17:10:46 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
22/09/29 17:10:46 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
22/09/29 17:10:46 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/09/29 17:10:46 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Application application_1664440338037_0006 failed 2 times due to AM Container for appattempt_1664440338037_0006_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2022-09-29 17:10:45.771]Exception from container-launch.
Container id: container_e12_1664440338037_0006_02_000001
Exit code: 1
[2022-09-29 17:10:45.772]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/data/disk02/hadoop/yarn/local/usercache/deployer/appcache/application_1664440338037_0006/container_e12_1664440338037_0006_02_000001/launch_container.sh: line 37: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/3.1.5.0-152/hadoop/*:/usr/hdp/3.1.5.0-152/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
[2022-09-29 17:10:45.773]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/data/disk02/hadoop/yarn/local/usercache/deployer/appcache/application_1664440338037_0006/container_e12_1664440338037_0006_02_000001/launch_container.sh: line 37: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/3.1.5.0-152/hadoop/*:/usr/hdp/3.1.5.0-152/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
For more detailed output, check the application tracking page: http://mc-hh401-col10:8088/cluster/app/application_1664440338037_0006 Then click on links to logs of each attempt.
. Failing the application.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:94)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:188)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2526)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:855)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/09/29 17:10:46 INFO util.ShutdownHookManager: Shutdown hook called
22/09/29 17:10:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-0501ccfa-ab0d-46cf-968a-f4329c82011b
22/09/29 17:10:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-aa14b0c3-e557-4201-9e74-8e5f07bf82c0
This error occurs because the Apache build of Spark cannot resolve the hdp.version variable when submitting to YARN; add the hdp.version variable in the MapReduce2 configuration on the HDP (Ambari) management page.
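A commonly used alternative is to pass hdp.version to Spark directly instead of changing the cluster configuration, for example in spark-defaults.conf or the Spark interpreter settings; this is a sketch, and 3.1.5.0-152 is the HDP version taken from the log above, so adjust it to your release.
# set hdp.version explicitly for the driver and the YARN application master
spark.driver.extraJavaOptions   -Dhdp.version=3.1.5.0-152
spark.yarn.am.extraJavaOptions  -Dhdp.version=3.1.5.0-152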
References
[1] Multi-user Support: http://zeppelin.apache.org/docs/latest/setup/basics/multi_user_support.html
[2] Apache Shiro authentication for Apache Zeppelin: http://zeppelin.apache.org/docs/latest/setup/security/shiro_authentication.html
[3] Flink support in Zeppelin: http://zeppelin.apache.org/docs/latest/quickstart/flink_with_zeppelin.html
[4] Flink Interpreter for Apache Zeppelin: http://zeppelin.apache.org/docs/latest/quickstart/flink_with_zeppelin.html
[5] Spark Interpreter for Apache Zeppelin: https://zeppelin.apache.org/docs/0.10.0/interpreter/spark.html