Apache DolphinScheduler is a distributed, easily extensible, visual DAG workflow task scheduling system. It is dedicated to untangling the complex dependencies in data processing pipelines so that the scheduler works out of the box.
Official site:
https://dolphinscheduler.apache.org/en-us/
GitHub:
https://github.com/apache/incubator-dolphinscheduler
Recently, the DolphinScheduler community released version 1.2.1. The release notes group the changes into Features and Enhancements, and several of them are worth a close look (see the release notes for the full lists).
Combined with the features already shipped in 1.2.0 (cross-project dependencies, the Flink and HTTP task components, and workflow import/export), DolphinScheduler 1.2.1 is well worth upgrading to for community users.
(P.S.: the DataX and Sqoop task components have already been merged into the dev branch of DolphinScheduler, so stay tuned.)
The most important file in the bin directory is the dolphinscheduler-daemon script. In earlier versions it was a frequent source of "JDK not found" errors; in the current version the script exports the machine's $JAVA_HOME, so you no longer need to worry about the JDK going missing.
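As a usage reminder, the daemon script starts and stops each role individually. A minimal sketch, assuming the 1.2.x service names (verify them against your own release):

# start / stop a single role on the current machine
sh ./bin/dolphinscheduler-daemon.sh start master-server
sh ./bin/dolphinscheduler-daemon.sh stop master-server
# other roles: worker-server, api-server, alert-server, logger-server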
The conf directory is the critically important configuration directory!!! Three files in it deserve special attention and are walked through below: dolphinscheduler_env.sh, application.properties, and quartz.properties.
dolphinscheduler_env.sh exports the runtime environment that tasks execute with; adjust each path to your own cluster (note that the original snippet listed $PATH twice, which is redundant and is collapsed here):

export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hadoop/etc/hadoop
export SPARK_HOME1=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME2=/opt/cloudera/parcels/SPARK2/lib/spark2
export PYTHON_HOME=/usr/local/anaconda3/bin/python
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export FLINK_HOME=/opt/soft/flink
export PATH=$HADOOP_HOME/bin:$SPARK_HOME1/bin:$SPARK_HOME2/bin:$PYTHON_HOME:$JAVA_HOME/bin:$HIVE_HOME/bin:$FLINK_HOME/bin:$PATH
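A quick sanity check helps here. A minimal sketch that, after sourcing the env file, flags any configured location that does not exist on the host (the variable list simply mirrors the exports above):

# flag any configured home directory or binary missing on this host
for p in "$HADOOP_HOME" "$HADOOP_CONF_DIR" "$SPARK_HOME1" "$SPARK_HOME2" \
         "$PYTHON_HOME" "$JAVA_HOME" "$HIVE_HOME" "$FLINK_HOME"; do
  [ -e "$p" ] || echo "missing: $p"
done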
application.properties configures the metadata source plus the master and worker thread pools:

# base spring data source configuration
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
# postgre
spring.datasource.driver-class-name=org.postgresql.Driver
spring.datasource.url=jdbc:postgresql://localhost:5432/dolphinscheduler
# mysql
#spring.datasource.driver-class-name=com.mysql.jdbc.Driver
#spring.datasource.url=jdbc:mysql://192.168.xx.xx:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8
spring.datasource.username=test
spring.datasource.password=test

# master settings
# master execute thread num
master.exec.threads=100

# worker settings
# worker execute thread num
worker.exec.threads=100
# the worker only takes tasks while free memory stays above this reserve. default value: physical memory * 1/6, unit: G
worker.reserved.memory=0.1
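Before the services can start, the metadata database referenced above has to exist. A minimal sketch for PostgreSQL, assuming a superuser connection and reusing the test/test credentials from the snippet (swap in your own):

# create the metadata database and grant the configured account access
psql -h localhost -U postgres <<'SQL'
CREATE DATABASE dolphinscheduler;
CREATE USER test WITH PASSWORD 'test';
GRANT ALL PRIVILEGES ON DATABASE dolphinscheduler TO test;
SQL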
quartz.properties must agree with the metadata database chosen above; when switching between PostgreSQL and MySQL, swap both the delegate class and the JDBC driver:

#org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
#org.quartz.dataSource.myDs.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.myDs.driver = org.postgresql.Driver
#org.quartz.dataSource.myDs.URL = jdbc:mysql://192.168.xx.xx:3306/dolphinscheduler?characterEncoding=utf8
org.quartz.dataSource.myDs.URL = jdbc:postgresql://localhost:5432/dolphinscheduler?characterEncoding=utf8
org.quartz.dataSource.myDs.user = test
org.quartz.dataSource.myDs.password = test
The install.sh deployment script is the centerpiece of a DS deployment. Its parameters are analyzed below, group by group.
# for example postgresql or mysql ...
dbtype="postgresql"
# db config
# db address and port
dbhost="192.168.xx.xx:5432"
# db name
dbname="dolphinscheduler"
# db username
username="xx"
# db password
# Note: if the password contains special characters, escape them with \
# (the variable name "passowrd" is spelled this way in the script)
passowrd="xx"
The dbtype parameter can be set to postgresql or mysql; this group supplies the JDBC details DS uses to connect to its metadata database.
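For illustration, a MySQL variant of the same group might look like the sketch below (hypothetical host and credentials). Remember to also switch the commented driver/URL lines in application.properties and quartz.properties to their MySQL counterparts:

# hypothetical MySQL metadata store
dbtype="mysql"
dbhost="192.168.xx.xx:3306"
dbname="dolphinscheduler"
username="ds_user"
passowrd="ds_pass"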
# conf/config/install_config.conf config
# Note: the installation path is not the same as the current path (pwd)
installPath="/opt/ds-agent"
# deployment user
# Note: the deployment user needs sudo privileges and permission to operate hdfs.
# If hdfs is enabled, the root directory needs to be created by the user beforehand
deployUser="dolphinscheduler"
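If the deployment user does not exist yet, a minimal sketch for creating it on every node (run as root; the NOPASSWD sudo rule is one common choice, so adapt it to your own security policy):

# create the deployment user, set a password, grant passwordless sudo
useradd dolphinscheduler
echo 'dolphinscheduler:dolphinscheduler' | chpasswd
echo 'dolphinscheduler ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/dolphinscheduler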
When configuring the ZooKeeper cluster, pay special attention: use the ip:2181 form, and never omit the port.
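Before filling in zkQuorum, it is worth confirming each node actually answers on that port. A minimal check using ZooKeeper's four-letter ruok command (requires nc; swap in your own IPs):

# each healthy ZooKeeper node should answer "imok"
for zk in 192.168.xx.xx 192.168.xx.xx 192.168.xx.xx; do
  echo ruok | nc "$zk" 2181
done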
DS comprises four roles: master, worker, alert, and api. The alert and api roles each need only a single machine, while master and worker can be deployed across multiple machines. The example below deploys 2 masters, 2 workers, 1 alert server, and 1 api server on 4 machines.
By adjusting the zkRoot parameter, a single ZooKeeper cluster can host multiple DS clusters, e.g. zkRoot="/dspro" for one and zkRoot="/dstest" for another.
# zk cluster
zkQuorum="192.168.xx.xx:2181,192.168.xx.xx:2181,192.168.xx.xx:2181"

# install hosts
# Note: hostname list of the machines to install on. For a pseudo-distributed setup, just write that single hostname
ips="192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4"

# ssh port, default 22
# Note: if the ssh port is not the default, modify here
sshPort=22

# run master machine
# Note: list of hostnames for deploying master
masters="192.168.0.1,192.168.0.2"

# run worker machine
# Note: list of hostnames for deploying workers
workers="192.168.0.3,192.168.0.4"

# run alert machine
# Note: hostname for deploying the alert server
alertServer="192.168.0.1"

# run api machine
# Note: hostname for deploying the api server
apiServers="192.168.0.1"

# zk config
# zk root directory
zkRoot="/dolphinscheduler"
# zk session timeout
zkSessionTimeout="300"
# zk connection timeout
zkConnectionTimeout="300"
# zk retry interval
zkRetryMaxSleep="100"
# zk retry maximum number of times
zkRetryMaxtime="5"
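install.sh pushes the package to every host in ips over ssh/scp, so the deployment user on the install machine needs passwordless SSH to all of them. A minimal sketch, reusing the example IPs above:

# run once as the deployment user on the machine where install.sh is executed
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for h in 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4; do
  ssh-copy-id "dolphinscheduler@$h"
done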
# QQ mailbox configuration
# alert config
# mail protocol
mailProtocol="SMTP"
# mail server host
mailServerHost="smtp.qq.com"
# mail server port
mailServerPort="465"
# sender
mailSender="783xx8369@qq.com"
# user
mailUser="783xx8369@qq.com"
# sender password (for QQ mail, this is the mailbox authorization code, not the login password)
mailPassword="mailbox authorization code"
# TLS mail protocol support
starttlsEnable="false"
sslTrust="smtp.qq.com"
# SSL mail protocol support
# note: the SSL protocol is enabled by default;
# only one of TLS and SSL may be true at the same time.
sslEnable="true"
# download excel path
xlsFilePath="/tmp/xls"
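A quick way to confirm the SSL SMTP endpoint is reachable before wiring it into DS; on success, openssl prints the certificate chain followed by an SMTP banner:

# probe smtp.qq.com over implicit SSL on port 465
openssl s_client -connect smtp.qq.com:465 -quiet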
# resource center upload and storage method: HDFS, S3, NONE
resUploadStartupType="NONE"
# if resUploadStartupType is HDFS, defaultFS is the namenode address; for HA,
# put core-site.xml and hdfs-site.xml in the conf directory.
# if S3, write the S3 address, for example: s3a://dolphinscheduler
# Note: for S3, be sure to create the root directory /dolphinscheduler
defaultFS="hdfs://mycluster:8020"
# if S3 is configured, the following configuration is required.
s3Endpoint="http://192.168.xx.xx:9010"
s3AccessKey="xxxxxxxxxx"
s3SecretKey="xxxxxxxxxx"
# resourcemanager HA configuration; for a single resourcemanager, set yarnHaIps=""
yarnHaIps="192.168.xx.xx,192.168.xx.xx"
# for a single resourcemanager, configure just that one hostname; for resourcemanager HA, the default configuration is fine
singleYarnIp="ark1"
# hdfs root path; the owner of the root path must be the deployment user.
# versions prior to 1.1.0 do not automatically create the hdfs root directory; create it yourself.
hdfsPath="/dolphinscheduler"
# a user with permission to create directories under the hdfs root path /
# Note: if kerberos is enabled, hdfsRootUser="" can be used directly.
hdfsRootUser="hdfs"
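When switching resUploadStartupType to HDFS, the root path must exist and be owned by the deployment user. A minimal sketch run via the hdfs superuser, with the path and users taken from the snippet above:

# create the resource center root and hand it to the deployment user
sudo -u hdfs hdfs dfs -mkdir -p /dolphinscheduler
sudo -u hdfs hdfs dfs -chown -R dolphinscheduler:dolphinscheduler /dolphinscheduler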
devState can be set to true for test environments; for production deployments it is recommended to set it to false.
# development status. If true, for SHELL tasks you can inspect the wrapped SHELL script under the execPath directory.
# If false, the wrapped script is deleted immediately after execution.
devState="true"
The following parameters mainly tune what ends up in application.properties, covering the master, the worker, and the api server.
# master config
# maximum number of master execution threads, i.e. the maximum parallelism of process instances
masterExecThreads="100"
# maximum number of master task execution threads, i.e. the maximum parallelism within each process instance
masterExecTaskNum="20"
# master heartbeat interval
masterHeartbeatInterval="10"
# master task submission retries
masterTaskCommitRetryTimes="5"
# master task submission retry interval
masterTaskCommitInterval="1000"
# maximum average cpu load, used to decide whether the master still has execution capacity
masterMaxCpuLoadAvg="100"
# master reserved memory, used to decide whether the master still has execution capacity
masterReservedMemory="0.1"

# worker config
# worker execution threads
workerExecThreads="100"
# worker heartbeat interval
workerHeartbeatInterval="10"
# number of tasks the worker fetches at a time
workerFetchTaskNum="3"
# worker reserved memory, used to decide whether the worker still has execution capacity
workerReservedMemory="0.1"

# api config
# api server port
apiServerPort="12345"
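After install.sh finishes, a quick jps on each host shows whether the roles landed where intended. The process names below follow the 1.2.x main classes as an assumption and may differ slightly in your release:

# on the combined master/api/alert host (192.168.0.1 in the example layout)
jps
# expected output along the lines of:
#   MasterServer
#   ApiApplicationServer
#   AlertServer
# a worker host (e.g. 192.168.0.3) should instead show WorkerServer and LoggerServer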
This article was originally shared via the WeChat public account Eights (Eights-Yelli), author: Eights.
Originally published: 2020-02-25