[Dr. Elephant Chinese Documentation, Part 4] Developer Guide

Author: 一条老狗

Last modified: 2019-12-27 17:09:09

Published in the column: 极客运维

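1. Setting up `Dr.Elephant`

Follow the quick setup instructions.

2. Prerequisites

2.1. Play/Activator

See Step 3 of the quick setup instructions.

2.2. Hadoop/Spark on Yarn

To test Dr. Elephant locally, you need Hadoop (version 2.x) or Spark (Yarn mode, version > 1.4.0) installed, together with the resource manager and job history server daemons (a pseudo-distributed setup is sufficient). Instructions for running MapReduce jobs on YARN in pseudo-distributed mode can be found here (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html)[1].

If the environment variables are not set yet, export HADOOP_HOME and HADOOP_CONF_DIR: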

$> export HADOOP_HOME=/path/to/hadoop/home
$> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Add Hadoop's bin directory to the system path, because Dr. Elephant uses some of Hadoop's class libraries:

$> export PATH=$HADOOP_HOME/bin:$PATH

Make sure the job history server is running, because Dr. Elephant depends on it:

$> $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

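2.3. Database

Dr. Elephant needs a database to store job information and analysis results.

Configure and start a MySQL instance locally. The latest version of MySQL is available at https://www.mysql.com/downloads/. Dr. Elephant supports MySQL 5.5 and later; questions can be discussed with Alex (wget.null@gmail.com) in the Google Groups forum. Create a database named drelephant.

$> mysql -u root -p
mysql> create database drelephant;

The database URL, database name, user name, and password are configured in Dr. Elephant's configuration file, app-conf/elephant.conf.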

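Using another database: Currently, Dr. Elephant works with MySQL by default. The DDL statements can be found in the evolution files. To configure a different database, follow the instructions here (https://www.playframework.com/documentation/2.6.x/ScalaDatabase)[2].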

3. Testing `Dr.Elephant`

You can test the project by invoking the compile script, which runs all the unit tests.

4. Project structure

app                             → Contains all the source files
 └ com.linkedin.drelephant      → Application Daemons
 └ org.apache.spark             → Spark Support
 └ controllers                  → Controller logic
 └ models                       → Includes models that Map to DB
 └ views                        → Page templates

app-conf                        → Application Configurations
 └ elephant.conf                → Port, DB, Keytab and other JVM Configurations (Overrides application.conf)
 └ FetcherConf.xml              → Fetcher Configurations
 └ HeuristicConf.xml            → Heuristic Configurations
 └ JobTypeConf.xml              → JobType Configurations

conf                            → Configuration files
 └ evolutions                   → DB Schema
 └ application.conf             → Main configuration file
 └ log4j.properties             → log configuration file
 └ routes                       → Routes definition

images
 └ wiki                         → Contains the images used in the wiki documentation

public                          → Public assets
 └ assets                       → Library files
 └ css                          → CSS files
 └ images                       → Image files
 └ js                           → Javascript files

scripts
 └ start.sh                     → Starts Dr. Elephant
 └ stop.sh                      → Stops Dr. Elephant

test                            → Source folder for unit tests

compile.sh                      → Compiles the application

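5. Heuristics

Dr. Elephant ships with a set of heuristics for MapReduce and Spark. For details on each of them, see the heuristics guide. The heuristics are pluggable modules and are easy to configure.

5.1. Adding a new heuristic

You can add your own heuristics to Dr. Elephant.
Create the new heuristic and test it.
Create a new view page for the heuristic, for example helpMapperSpill.scala.html.
Add the details of the heuristic to the HeuristicConf.xml file.
The HeuristicConf.xml entry should contain the following:
- applicationtype: the application type, either mapreduce or spark
- heuristicname: the name of the heuristic
- classname: the fully qualified class name
- viewname: the fully qualified name of the view page
- hadoopversions: the Hadoop versions the heuristic is compatible with
Run Dr. Elephant; it should now include the heuristic you added.

Example HeuristicConf.xml entry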

<heuristic>
<applicationtype>mapreduce</applicationtype>
<heuristicname>Mapper GC</heuristicname>
<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperGCHeuristic</classname>
<viewname>views.html.help.mapreduce.helpGC</viewname>
</heuristic>
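
Before starting Dr. Elephant with a new entry, it can help to check that the entry carries the fields listed above. The following is a minimal sketch of such a check using only the JDK's XML parser. The class HeuristicEntryCheck and its methods are hypothetical and not part of Dr. Elephant; hadoopversions is not checked because the sample entry does not set it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Dr. Elephant.
class HeuristicEntryCheck {
    // Element names taken from the field list in the text.
    static final String[] FIELDS = {"applicationtype", "heuristicname", "classname", "viewname"};

    // Parses one <heuristic> element and returns the fields that are present.
    static Map<String, String> parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            Map<String, String> found = new LinkedHashMap<>();
            for (String name : FIELDS) {
                NodeList nodes = root.getElementsByTagName(name);
                if (nodes.getLength() > 0) {
                    found.put(name, nodes.item(0).getTextContent().trim());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed heuristic entry", e);
        }
    }

    // Lists the expected fields that are absent or empty.
    static List<String> missing(Map<String, String> found) {
        List<String> result = new ArrayList<>();
        for (String name : FIELDS) {
            String value = found.get(name);
            if (value == null || value.isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<heuristic>"
            + "<applicationtype>mapreduce</applicationtype>"
            + "<heuristicname>Mapper GC</heuristicname>"
            + "</heuristic>";
        // Prints the names of the fields this incomplete entry still needs.
        System.out.println("Missing fields: " + missing(parse(xml)));
    }
}
```

The check only confirms that the fields exist; it does not verify that the class or view they name can actually be loaded.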

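5.2. Configuring heuristics

To override the severity thresholds that a heuristic uses, specify the values in the HeuristicConf.xml file. The following example configures severity thresholds: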

<heuristic>
<applicationtype>mapreduce</applicationtype>
<heuristicname>Mapper Data Skew</heuristicname>
<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperDataSkewHeuristic</classname>
<viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
<params>
  <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
  <deviation_severity>2, 4, 8, 16</deviation_severity>
  <files_severity>1/8, 1/4, 1/2, 1</files_severity>
</params>
</heuristic>
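
To illustrate how a four-value threshold list such as those in the params block could be read, here is a small sketch. It assumes the four values are ascending limits for the LOW, MODERATE, SEVERE, and CRITICAL levels and that a measured value takes the highest level whose limit it reaches. That mapping rule, the level names other than NONE and LOW, and the class SeverityThresholds are assumptions for illustration, not Dr. Elephant's actual code.

```java
// Illustrative sketch only; not Dr. Elephant's implementation.
class SeverityThresholds {
    // Assumed severity levels, ordered from least to most severe.
    enum Level { NONE, LOW, MODERATE, SEVERE, CRITICAL }

    // Parses a comma-separated list such as "10, 50, 100, 200" or "1/8, 1/4, 1/2, 1".
    static double[] parse(String csv) {
        String[] parts = csv.split(",");
        double[] limits = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i].trim();
            int slash = p.indexOf('/');
            if (slash >= 0) {
                // Fractions are written as numerator/denominator.
                double num = Double.parseDouble(p.substring(0, slash).trim());
                double den = Double.parseDouble(p.substring(slash + 1).trim());
                limits[i] = num / den;
            } else {
                limits[i] = Double.parseDouble(p);
            }
        }
        return limits;
    }

    // Returns the highest level whose limit the value reaches (ascending limits assumed).
    static Level levelFor(double value, double[] limits) {
        if (limits.length != 4) {
            throw new IllegalArgumentException("Expected exactly four thresholds");
        }
        int index = 0;
        for (int i = 0; i < limits.length; i++) {
            if (value >= limits[i]) {
                index = i + 1;
            }
        }
        return Level.values()[index];
    }

    public static void main(String[] args) {
        double[] numTasks = parse("10, 50, 100, 200");
        System.out.println(levelFor(75, numTasks));
    }
}
```

Some metrics may be more severe when the value is smaller; in that case the comparison would need to be reversed, which this sketch does not cover.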

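6. Schedulers

Dr. Elephant currently supports three workflow schedulers: Azkaban, Airflow, and Oozie. All of them are enabled by default and work out of the box, except that Airflow and Oozie need some configuration.

6.1. Scheduler configuration

The schedulers and all their parameters are configured in the SchedulerConf.xml file under the app-conf directory. The sample SchedulerConf.xml file below shows the configuration and properties for each scheduler.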

<!-- Scheduler configurations -->
<schedulers>
    <scheduler>
        <name>azkaban</name>
        <classname>com.linkedin.drelephant.schedulers.AzkabanScheduler</classname>
    </scheduler>

    <scheduler>
        <name>airflow</name>
        <classname>com.linkedin.drelephant.schedulers.AirflowScheduler</classname>
        <params>
            <airflowbaseurl>http://localhost:8000</airflowbaseurl>
        </params>
    </scheduler>

    <scheduler>
        <name>oozie</name>
        <classname>com.linkedin.drelephant.schedulers.OozieScheduler</classname>
        <params>
            <!-- URL of oozie host -->
            <oozie_api_url>http://localhost:11000/oozie</oozie_api_url>

            <!-- ### Non mandatory properties ###
            ### choose authentication method
            <oozie_auth_option>KERBEROS/SIMPLE</oozie_auth_option>
            ### override oozie console url with a template (only parameter will be the id)
            <oozie_job_url_template></oozie_job_url_template>
            <oozie_job_exec_url_template></oozie_job_exec_url_template>
            ### (if scheduled jobs are expected make sure to add following templates since oozie doesn't provide their URLS on server v4.1.0)
            <oozie_workflow_url_template>http://localhost:11000/oozie/?job=%s</oozie_workflow_url_template>
            <oozie_workflow_exec_url_template>http://localhost:11000/oozie/?job=%s</oozie_workflow_exec_url_template>
            ### Use true if you can assure all app names are unique.
            ### When true dr-elephant will unite all coordinator runs (in case of coordinator killed and then run again)
            <oozie_app_name_uniqueness>false</oozie_app_name_uniqueness>
            -->
        </params>
    </scheduler>
</schedulers>

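6.2. Contributing a new scheduler

To make use of the full functionality of Dr. Elephant, the following four IDs must be provided:

Job definition ID: a unique ID for the job as defined in the overall flow. Filtering on this ID returns all historical runs of the job
Job execution ID: a unique ID for a single execution of the job
Workflow definition ID: a unique ID for the whole flow, independent of any execution
Workflow execution ID: a unique ID for a specific execution of the flow

Dr. Elephant expects these IDs so that it can integrate with any scheduler. Without them, Dr. Elephant cannot integrate with a scheduler the way it does with Azkaban. For example, if the job definition ID is not provided, Dr. Elephant cannot capture the history of a job. Likewise, if the workflow definition ID is not provided, it cannot capture the history of a workflow. If none of the IDs above are provided, Dr. Elephant can only show job performance data at the execution level (the MapReduce job level).

In addition to the four IDs above, Dr. Elephant accepts an optional job name and four optional links that let users jump easily from Dr. Elephant to the corresponding job in the scheduler. Note that these do not affect Dr. Elephant's functionality.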

Flow Definition Url
Flow Execution Url
Job Definition Url
Job Execution Url
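
As a rough illustration of what a scheduler integration has to supply, the sketch below models the four IDs and the two consequences stated in this section: no job history without a job definition ID, and no workflow history without a workflow definition ID. The class SchedulerIds and its method names are hypothetical and do not correspond to Dr. Elephant's scheduler interface.

```java
// Hypothetical value holder; not Dr. Elephant's scheduler API.
class SchedulerIds {
    final String jobDefinitionId;
    final String jobExecutionId;
    final String flowDefinitionId;
    final String flowExecutionId;

    SchedulerIds(String jobDefinitionId, String jobExecutionId,
                 String flowDefinitionId, String flowExecutionId) {
        this.jobDefinitionId = jobDefinitionId;
        this.jobExecutionId = jobExecutionId;
        this.flowDefinitionId = flowDefinitionId;
        this.flowExecutionId = flowExecutionId;
    }

    private static boolean present(String id) {
        return id != null && !id.isEmpty();
    }

    // Job history needs the job definition ID to group executions of the same job.
    boolean canTrackJobHistory() {
        return present(jobDefinitionId);
    }

    // Workflow history needs the workflow definition ID in the same way.
    boolean canTrackFlowHistory() {
        return present(flowDefinitionId);
    }

    public static void main(String[] args) {
        SchedulerIds ids = new SchedulerIds(null, "job-exec-1", "flow-def", "flow-exec-1");
        System.out.println("job history: " + ids.canTrackJobHistory());
        System.out.println("flow history: " + ids.canTrackFlowHistory());
    }
}
```

The optional job name and the four URLs are left out because, as noted above, they only affect navigation.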

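7. Scoring

In Dr. Elephant, the heuristics analyze each completed job and produce a score. The calculation is simple: the severity value is multiplied by the number of tasks. As the snippet below shows, severities of NONE and LOW contribute nothing to the score.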

int score = 0;
if (severity != Severity.NONE && severity != Severity.LOW) {
    score = severity.getValue() * tasks;
}
return score;

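We define the following score types:

Application score: the sum of the severity-based scores computed for one application (a single MapReduce or Spark job)
Job score: the sum of the scores of all applications in the job
Flow score: the sum of the scores of all jobs in the flow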
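
The scoring rule and the roll-up can be tried out with a self-contained sketch. The enum below stands in for Dr. Elephant's Severity; the numeric values 0 to 4 and the names ScoreSketch, heuristicScore, and sum are assumptions made for this example. Each higher-level score is simply the sum of the scores one level below it.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; the severity values are assumed, not taken from Dr. Elephant.
class ScoreSketch {
    enum Severity {
        NONE(0), LOW(1), MODERATE(2), SEVERE(3), CRITICAL(4);

        private final int value;

        Severity(int value) {
            this.value = value;
        }

        int getValue() {
            return value;
        }
    }

    // Same rule as the snippet in the text: NONE and LOW score zero,
    // anything else scores severity value times task count.
    static int heuristicScore(Severity severity, int tasks) {
        int score = 0;
        if (severity != Severity.NONE && severity != Severity.LOW) {
            score = severity.getValue() * tasks;
        }
        return score;
    }

    // Rolls scores up one level, for example heuristic scores into one total.
    static int sum(List<Integer> scores) {
        int total = 0;
        for (int s : scores) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        int total = sum(Arrays.asList(
            heuristicScore(Severity.SEVERE, 10),
            heuristicScore(Severity.LOW, 500),
            heuristicScore(Severity.MODERATE, 4)));
        System.out.println("total score: " + total);
    }
}
```

Because LOW results score zero regardless of task count, a job with many mildly flagged tasks can still rank below a job with a few severely flagged ones.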

References

[1] Hadoop: Setting up a Single Node Cluster. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

[2] Play Framework 2.6.x documentation, ScalaDatabase. https://www.playframework.com/documentation/2.6.x/ScalaDatabase

Originally published: 2019-10-29 (shared from the 极客运维 WeChat official account)