
[Dr.Elephant Documentation 6] Metrics and Heuristics

1. Metrics

1.1. Resource Usage

Resource usage is the amount of resources your job consumes, measured in GB Hours.

1.1.1. Computation

We define a task's resource usage as the product of its container size and its runtime. A job's resource usage is then the sum of the resource usage of all of its mapper and reducer tasks.

1.1.2. Example

Consider a job with:
4 mappers with runtimes of {12, 15, 20, 30} mins.
4 reducers with runtimes of {10, 12, 15, 18} mins.
Container size of 4 GB
Then,
Resource used by all mappers: 4 * ((12 + 15 + 20 + 30) / 60) GB Hours = 5.133 GB Hours
Resource used by all reducers: 4 * ((10 + 12 + 15 + 18) / 60) GB Hours = 3.666 GB Hours
Total resource used by the job = 5.133 + 3.666 = 8.799 GB Hours
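The arithmetic above can be sketched in a few lines of Python (the function name and argument shapes are ours for illustration, not part of Dr.Elephant):

```python
def resource_usage_gb_hours(runtimes_mins, container_size_gb):
    # Resource usage = container size (GB) x total task runtime (hours)
    return container_size_gb * sum(runtimes_mins) / 60.0

# Example from the text, with 4 GB containers:
mappers = resource_usage_gb_hours([12, 15, 20, 30], 4)   # ~5.133 GB Hours
reducers = resource_usage_gb_hours([10, 12, 15, 18], 4)  # ~3.666 GB Hours
total = mappers + reducers                               # ~8.8 GB Hours
```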

1.2. Wasted Resources

This shows the amount of resources wasted by your job, in GB Hours or as a percentage of the resources used.

1.2.1. Computation

To calculate the resources wasted, we calculate the following:
The minimum memory wasted by the tasks (Map and Reduce)
The runtime of the tasks (Map and Reduce)
The minimum memory wasted by a task is the difference between the container size and the maximum task memory (peak memory) among all tasks. The resource wasted by a task is then its minimum wasted memory multiplied by its duration. The total resource wasted by the job is the sum of the wasted resources of all its tasks.

Let us define the following for each task:

peak_memory_used := The upper bound on the memory used by the task.
runtime := The run time of the task.

The peak_memory_used for any task is calculated from the maximum physical memory used by all the tasks (max_physical_memory) and the virtual memory (virtual_memory) used by the task.
Since peak_memory_used for each task is upper bounded by max_physical_memory, we can say for each task:

peak_memory_used = Max(max_physical_memory, virtual_memory/2.1)
where 2.1 is the cluster memory factor.

The minimum memory wasted by each task can then be calculated as:

wasted_memory = Container_size - peak_memory_used

The minimum resource wasted by each task can then be calculated as:

wasted_resource = wasted_memory * runtime
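Putting the definitions above together, a minimal sketch (the task-tuple layout and units are our illustrative assumptions, not Dr.Elephant's actual code):

```python
def wasted_resources(tasks, container_size, memory_factor=2.1):
    # tasks: list of (physical_memory, virtual_memory, runtime) per task
    # max physical memory across ALL tasks, as defined above
    max_physical_memory = max(phys for phys, _, _ in tasks)
    total = 0.0
    for _, virtual_memory, runtime in tasks:
        peak_memory_used = max(max_physical_memory, virtual_memory / memory_factor)
        wasted_memory = container_size - peak_memory_used
        total += wasted_memory * runtime  # minimum resource wasted by this task
    return total
```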

1.3. Runtime

The runtime metric shows the total time your job took to run.

1.3.1. Computation

The job runtime is the time difference between the job's submission to the resource manager and the job's completion.

1.3.2. Example

For a job submitted at 1461837302868 ms that finished at 1461840952182 ms, the runtime is 1461840952182 - 1461837302868 = 3649314 ms, i.e. about 1.01 hours.
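The same computation in Python:

```python
submit_time_ms = 1461837302868
finish_time_ms = 1461840952182

runtime_ms = finish_time_ms - submit_time_ms   # 3649314 ms
runtime_hours = runtime_ms / (1000 * 60 * 60)  # ~1.01 hours
```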

1.4. Wait Time

Wait time is the time the job spends in a waiting state.

1.4.1. Computation

For each task, let us define the following:

ideal_start_time := The ideal time when all the tasks should have started
finish_time := The time when the task finished
task_runtime := The runtime of the task

- Map tasks
For map tasks, we have

ideal_start_time := The job submission time

We will find the mapper task with the longest runtime (task_runtime_max) and the task which finished last (finish_time_last).
The total wait time of the job due to mapper tasks would be:

mapper_wait_time = finish_time_last - (ideal_start_time + task_runtime_max)

- Reduce tasks
For reducer tasks, we have

ideal_start_time := Computed from the reducer slow start percentage (mapreduce.job.reduce.slowstart.completedmaps) as the finish time of the map task after which the first reducer should have started

We will find the reducer task with the longest runtime (task_runtime_max) and the task which finished last (finish_time_last).

The total wait time of the job due to reducer tasks would be:
reducer_wait_time = finish_time_last - (ideal_start_time + task_runtime_max)
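Both formulas share the same shape, so a single helper covers them (a sketch; times can be in any consistent unit):

```python
def phase_wait_time(ideal_start_time, finish_times, runtimes):
    # Wait time = last finish - (ideal start + longest task runtime)
    task_runtime_max = max(runtimes)
    finish_time_last = max(finish_times)
    return finish_time_last - (ideal_start_time + task_runtime_max)

# e.g. mapper_wait_time, with the job submission time as ideal_start_time:
# phase_wait_time(submit_time, map_finish_times, map_runtimes)
```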

2. Heuristics

2.1. Map-Reduce

2.1.1. Mapper Data Skew

The Mapper Data Skew heuristic shows whether your job suffers from data skew. It divides all the mappers into two groups, where the first group's average input size is smaller than the second's.

For example, the first group might contain 900 mappers averaging 7 MB of input each, while the other contains 1200 mappers averaging 500 MB each.

2.1.1.1. Computation

The heuristic grades the job by recursively computing two task groups based on their average memory consumption. The deviation is the difference between the two groups' averages divided by the smaller of the two averages.

Let us define the following variables,

    deviation: the deviation in input bytes between two groups
    num_of_tasks: the number of map tasks
    file_size: the average input size of the larger group

    num_tasks_severity: List of severity thresholds for the number of tasks. e.g., num_tasks_severity = {10, 20, 50, 100}
    deviation_severity: List of severity threshold values for the deviation of input bytes between two groups. e.g., deviation_severity: {2, 4, 8, 16}
    files_severity: The severity threshold values for the fraction of HDFS block size. e.g. files_severity = { ⅛, ¼, ½, 1}

Let us define the following functions,

    func avg(x): returns the average of a list x
    func len(x): returns the length of a list x
    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

We’ll compute two groups recursively based on average memory consumed by them.

Let us call the two groups: group_1 and group_2

Without loss of generality, let us assume that
    avg(group_1) > avg(group_2) and len(group_1) < len(group_2), then

    deviation = (avg(group_1) - avg(group_2)) / min(avg(group_1), avg(group_2))
    file_size = avg(group_1)
    num_of_tasks = len(group_1) + len(group_2)

The overall severity of the heuristic can be computed as,
    severity = min(
        getSeverity(deviation, deviation_severity)
        , getSeverity(file_size,files_severity)
        , getSeverity(num_of_tasks,num_tasks_severity)
    )
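A runnable sketch of the severity computation above. The getSeverity encoding (the count of ascending thresholds crossed, giving levels 0-4 from NONE to CRITICAL), the HDFS block size, and the single non-recursive split are our assumptions for illustration:

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size, in bytes

def get_severity(value, thresholds):
    # Severity 0-4: how many ascending thresholds the value reaches
    return sum(1 for t in thresholds if value >= t)

def data_skew_severity(group_1, group_2,
                       deviation_severity=(2, 4, 8, 16),
                       files_severity=(1/8, 1/4, 1/2, 1),
                       num_tasks_severity=(10, 20, 50, 100)):
    # group_1 / group_2: input bytes of the tasks in each group
    avg_1 = sum(group_1) / len(group_1)
    avg_2 = sum(group_2) / len(group_2)
    deviation = abs(avg_1 - avg_2) / min(avg_1, avg_2)
    file_size = max(avg_1, avg_2)          # average of the larger group
    num_of_tasks = len(group_1) + len(group_2)
    return min(get_severity(deviation, deviation_severity),
               get_severity(file_size / HDFS_BLOCK_SIZE, files_severity),
               get_severity(num_of_tasks, num_tasks_severity))
```

With the 900 x 7 MB vs. 1200 x 500 MB example from above, all three component severities max out, so the heuristic reports the highest level.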


2.1.1.2. Configuration

The threshold parameters deviation_severity, num_tasks_severity and files_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.2. Mapper GC

Mapper GC analyzes a task's GC efficiency by computing the ratio of GC time to total CPU time.

2.1.2.1. Computation

The heuristic computes the Mapper GC severity as follows. First, it calculates the average CPU time, average runtime and average garbage collection time across all tasks. The Mapper GC severity is then the minimum of the severities of the average runtime and of the ratio of average GC time to average CPU time.

Let us define the following variables:

    avg_gc_time: average time spent garbage collecting
    avg_cpu_time: average cpu time of all the tasks
    avg_runtime: average runtime of all the tasks
    gc_cpu_ratio: avg_gc_time/ avg_cpu_time

    gc_ratio_severity: List of severity threshold values for the ratio of  avg_gc_time to avg_cpu_time.
    runtime_severity: List of severity threshold values for the avg_runtime.

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as,

    severity = min(getSeverity(avg_runtime, runtime_severity), getSeverity(gc_cpu_ratio, gc_ratio_severity))

2.1.2.2. Configuration

The threshold parameters gc_ratio_severity and runtime_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.3. Mapper Memory Usage

This heuristic checks a mapper's memory usage: the ratio of the memory a task consumes to the memory its container requested. The memory consumed is the average of the peak physical memory snapshots of all tasks. The container memory is the job's mapreduce.map/reduce.memory.mb setting, the maximum physical memory the job can request.

2.1.3.1. Computation
Let us define the following variables,

    avg_physical_memory: Average of the physical memories of all tasks.
    container_memory: Container memory

    container_memory_severity: List of threshold values for the average container memory of the tasks.
    memory_ratio_severity: List of threshold values for the ratio of avg_physical_memory to container_memory

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity can then be computed as,

    severity = min(getSeverity(avg_physical_memory/container_memory, memory_ratio_severity)
               , getSeverity(container_memory,container_memory_severity)
              )
2.1.3.2. Configuration

The threshold parameters container_memory_severity and memory_ratio_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.4. Mapper Speed

This heuristic analyzes how efficiently the mapper code runs: whether the mappers are CPU-bound, or are processing too much data. It examines the relationship between how fast the mappers run and how much data they process.

2.1.4.1. Computation

The severity of this heuristic is the smaller of the severity of the mappers' speed and the severity of the mappers' runtime.

Let us define the following variables,

    median_speed: median of speeds of all the mappers. The speeds of mappers are found by taking the ratio of input bytes to runtime.
    median_size: median of size of all the mappers
    median_runtime: median of runtime of all the mappers.

    disk_speed_severity: List of threshold values for the median_speed.
    runtime_severity: List of severity threshold values for median_runtime.

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as,

    severity = min(getSeverity(median_speed, disk_speed_severity), getSeverity(median_runtime, runtime_severity))

2.1.4.2. Configuration

The threshold parameters disk_speed_severity and runtime_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.5. Mapper Spill

This heuristic gauges mapper performance from disk I/O. The mapper spill ratio (records spilled / total output records) is a key indicator of mapper performance: a value close to 2 means nearly every record was spilled and temporarily written to disk twice (once when the in-memory sort buffer filled up, and once when merging the spilled chunks). When this happens, the mappers are producing too much output.

2.1.5.1. Computation
Let us define the following parameters,

    total_spills: The sum of spills from all the map tasks.
    total_output_records: The sum of output records from all the map tasks.
    num_tasks: Total number of tasks.
    ratio_spills: total_spills/ total_output_records

    spill_severity: List of the threshold values for ratio_spills
    num_tasks_severity: List of threshold values for total number of tasks.

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as,

    severity = min(getSeverity(ratio_spills, spill_severity), getSeverity(num_tasks, num_tasks_severity))
2.1.5.2. Configuration

The thresholds spill_severity and num_tasks_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.6. Mapper Time

This heuristic analyzes whether the number of mappers is appropriate, which helps tune the job's mapper count. The count needs tuning when either of the following occurs:

  • Mapper runtimes are very short. This usually means the job has:
    • too many mappers
    • a very short average mapper runtime
    • files that are too small
  • Large or unsplittable files. This usually means the job has:
    • too few mappers
    • a very long average mapper runtime
    • files that are too large (individual files reaching GB scale)
2.1.6.1. Computation
Let us define the following variables,
    avg_size: average size of input data for all the mappers
    avg_time: average of runtime of all the tasks.
    num_tasks: total number of tasks.

    short_runtime_severity: The list of threshold values for tasks with short runtime
    long_runtime_severity: The list of threshold values for tasks with long runtime.
    num_tasks_severity: The list of threshold values for number of tasks.

Let us define the following functions,
    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as,
    short_task_severity = min(getSeverity(avg_time,short_runtime_severity), getSeverity(num_tasks, num_tasks_severity))
    severity = max(getSeverity(avg_time, long_runtime_severity), short_task_severity)
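A sketch of the combination above. The threshold values are made-up illustrations (in minutes), and the descending comparison for short runtimes (where a smaller value is worse) is our assumption:

```python
def get_severity(value, thresholds, descending=False):
    # Severity 0-4: number of thresholds crossed; descending=True for
    # metrics where a SMALLER value is worse (e.g. short runtimes).
    if descending:
        return sum(1 for t in thresholds if value <= t)
    return sum(1 for t in thresholds if value >= t)

def mapper_time_severity(avg_time, num_tasks,
                         short_runtime_severity=(10, 4, 2, 1),
                         long_runtime_severity=(15, 30, 60, 120),
                         num_tasks_severity=(50, 101, 500, 1000)):
    # Short tasks are only a problem when there are also MANY of them,
    # hence the min(); either failure mode can dominate via the max().
    short_task_severity = min(
        get_severity(avg_time, short_runtime_severity, descending=True),
        get_severity(num_tasks, num_tasks_severity))
    return max(get_severity(avg_time, long_runtime_severity),
               short_task_severity)
```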
2.1.6.2. Configuration

The thresholds short_runtime_severity, long_runtime_severity and num_tasks_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.7. Reducer Data Skew

This heuristic analyzes whether the reducer input data is skewed: it divides the reducers into two groups and checks whether one group's input is significantly larger than the other's.

2.1.7.1. Computation

The heuristic grades the job by recursively splitting the tasks into two groups based on their average memory consumption. The deviation is the difference between the two groups' average memory consumption divided by the smaller of the two averages.

Let us define the following variables:
  deviation: deviation in input bytes between two groups
  num_of_tasks: number of reduce tasks
  file_size: average of larger group
  num_tasks_severity: List of severity threshold values for the number of tasks.
  e.g. num_tasks_severity = {10,20,50,100}
  deviation_severity: List of severity threshold values for the deviation of input bytes between two groups.
  e.g. deviation_severity = {2,4,8,16}
  files_severity: The severity threshold values for the fraction of HDFS block size
  e.g. files_severity = { ⅛, ¼, ½, 1}

Let us define the following functions:
  func avg(x): returns the average of a list x
  func len(x): returns the length of a list x
  func min(x,y): returns minimum of x and y
  func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

We’ll compute two groups recursively based on average memory consumed by them.
Let us call the two groups: group_1 and group_2
Without loss of generality, let us assume that:
  avg(group_1) > avg(group_2) and len(group_1) < len(group_2), then
  deviation = (avg(group_1) - avg(group_2)) / min(avg(group_1), avg(group_2))
  file_size = avg(group_1)
  num_of_tasks = len(group_1) + len(group_2)

The overall severity of the heuristic can be computed as,
  severity = min(getSeverity(deviation,deviation_severity),getSeverity(file_size,files_severity),getSeverity(num_of_tasks,num_tasks_severity))

2.1.7.2. Configuration

The thresholds deviation_severity, num_tasks_severity and files_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.8. Reducer GC

This heuristic analyzes a task's GC efficiency by computing the ratio of GC time to total CPU time.

2.1.8.1. Computation

First, the average CPU time, average runtime and average garbage collection time are computed across all tasks. The severity is then the minimum of the severities of the average runtime and of the ratio of GC time to average CPU time.

Let us define the following variables:

    avg_gc_time: average time spent garbage collecting
    avg_cpu_time: average cpu time of all the tasks
    avg_runtime: average runtime of all the tasks
    gc_cpu_ratio: avg_gc_time/ avg_cpu_time

    gc_ratio_severity: List of severity threshold values for the ratio of  avg_gc_time to avg_cpu_time.
    runtime_severity: List of severity threshold values for the avg_runtime.

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as,
    severity = min(getSeverity(avg_runtime, runtime_severity), getSeverity(gc_cpu_ratio, gc_ratio_severity))

2.1.8.2. Configuration

The thresholds gc_ratio_severity and runtime_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.9. Reducer Memory Usage

This heuristic shows a task's memory utilization. It compares the memory consumed by the tasks with the container memory requested. The memory consumed is the average of the peak memory consumed by each task. The container memory is the task's mapreduce.map/reduce.memory.mb setting, the maximum physical memory a task may use.

2.1.9.1. Computation
Let us define the following variables,

    avg_physical_memory: Average of the physical memories of all tasks.
    container_memory: Container memory

    container_memory_severity: List of threshold values for the average container memory of the tasks.
    memory_ratio_severity: List of threshold values for the ratio of avg_physical_memory to container_memory

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity can then be computed as,

    severity = min(getSeverity(avg_physical_memory/container_memory, memory_ratio_severity)
               , getSeverity(container_memory,container_memory_severity)
              )
2.1.9.2. Configuration

The thresholds container_memory_severity and memory_ratio_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.10. Reducer Time

This heuristic analyzes reducer efficiency and helps tune the number of reducers for a job. The reducer count needs tuning when either of the following occurs:

  • Too many reducers. The Hadoop job may show:
    • an excessive number of reducers
    • very short reducer runtimes
  • Too few reducers. The Hadoop job may show:
    • too few reducers
    • very long reducer runtimes
2.1.10.1. Computation
Let us define the following variables,

    avg_size: average size of input data for all the tasks
    avg_time: average of runtime of all the tasks.
    num_tasks: total number of tasks.

    short_runtime_severity: The list of threshold values for tasks with short runtime
    long_runtime_severity: The list of threshold values for tasks with long runtime.
    num_tasks_severity: The list of threshold values for the number of tasks.

Let us define the following functions,

    func min(x,y): returns minimum of x and y
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as,

    short_task_severity = min(getSeverity(avg_time,short_runtime_severity), getSeverity(num_tasks, num_tasks_severity))
    severity = max(getSeverity(avg_time, long_runtime_severity), short_task_severity)
2.1.10.2. Configuration

The threshold parameters short_runtime_severity, long_runtime_severity and num_tasks_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.1.11. Shuffle & Sort

This heuristic evaluates reducer efficiency from the total time the reducers run and the time they spend shuffling and sorting.

2.1.11.1. Computation
Let’s define following variables,

    avg_exec_time: average time spent in execution by all the tasks.
    avg_shuffle_time: average time spent in shuffling.
    avg_sort_time: average time spent in sorting.

    runtime_ratio_severity: List of threshold values for the ratio of twice of average shuffle or sort time to average execution time.
    runtime_severity: List of threshold values for the runtime for shuffle or sort stages.

The overall severity can then be found as,

	severity = max(shuffle_severity, sort_severity)

	where shuffle_severity and sort_severity can be found as:

	shuffle_severity = min(getSeverity(avg_shuffle_time, runtime_severity), getSeverity(avg_shuffle_time*2/avg_exec_time, runtime_ratio_severity))

	sort_severity = min(getSeverity(avg_sort_time, runtime_severity), getSeverity(avg_sort_time*2/avg_exec_time, runtime_ratio_severity))
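Expressed as a runnable sketch (the get_severity encoding, counting ascending thresholds crossed, is our assumption; thresholds are supplied by the caller):

```python
def get_severity(value, thresholds):
    # Severity 0-4: how many ascending thresholds the value reaches
    return sum(1 for t in thresholds if value >= t)

def shuffle_sort_severity(avg_exec_time, avg_shuffle_time, avg_sort_time,
                          runtime_severity, runtime_ratio_severity):
    # A stage is flagged only if BOTH its absolute time and its share of
    # execution time are high (min); the worse stage wins (max).
    shuffle = min(get_severity(avg_shuffle_time, runtime_severity),
                  get_severity(avg_shuffle_time * 2 / avg_exec_time,
                               runtime_ratio_severity))
    sort = min(get_severity(avg_sort_time, runtime_severity),
               get_severity(avg_sort_time * 2 / avg_exec_time,
                            runtime_ratio_severity))
    return max(shuffle, sort)
```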
2.1.11.2. Configuration

The threshold parameters runtime_ratio_severity and runtime_severity are easily configurable. More details on configuring these parameters can be found in the developer guide.

2.2. Spark

2.2.1. Spark Event Log Limit

Spark's event log processor currently cannot handle very large log files: Dr.Elephant would take a long time to process one, which could affect its own stability. Therefore, a log size limit (100 MB) is currently enforced; if a log exceeds it, a separate process is spawned to handle it.

2.2.1.1. Computation

If the data was throttled, the heuristic reports the highest severity, CRITICAL; otherwise, it reports no severity.

2.2.2. Spark Executor Load Balance

Map/Reduce任务的执行机制不同,Spark应用在启动后会一次性分配它所需要的所有资源,直到整个任务结束才会释放这些资源。根据这个机制,对Spark的处理器的负载均衡就显得非常重要,可以避免集群中个别节点压力过大。

2.2.2.1. Computation
Let us define the following variables:

    peak_memory: List of peak memories for all executors
    durations: List of durations of all executors
    inputBytes: List of input bytes of all executors
    outputBytes: List of output bytes of all executors.

    looser_metric_deviation_severity: List of threshold values for deviation severity, loose bounds.
    metric_deviation_severity: List of threshold values for deviation severity, tight bounds.

Let us define the following functions:

    func getDeviation(x): returns max(|maximum-avg|, |minimum-avg|)/avg, where
        x = list of values
        maximum = maximum of values in x
        minimum = minimum of values in x
        avg = average of values in x

    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
    func max(x,y): returns the maximum value of x and y.
    func Min(l): returns the minimum of a list l.

The overall severity can be found as,

    severity = Min( getSeverity(getDeviation(peak_memory), looser_metric_deviation_severity),
               getSeverity(getDeviation(durations), metric_deviation_severity),
               getSeverity(getDeviation(inputBytes), metric_deviation_severity),
               getSeverity(getDeviation(outputBytes), looser_metric_deviation_severity)
               )
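The getDeviation helper defined above is straightforward to sketch:

```python
def get_deviation(values):
    # max(|maximum - avg|, |minimum - avg|) / avg
    avg = sum(values) / len(values)
    return max(abs(max(values) - avg), abs(min(values) - avg)) / avg

# Perfectly balanced executors deviate by 0; the more one executor's
# load departs from the mean, the larger the deviation.
```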
2.2.2.2. Configuration

The threshold parameters looser_metric_deviation_severity and metric_deviation_severity are easily configurable. To learn more about configuring these parameters, see the developer guide.

2.2.3. Spark Job Runtime

This heuristic tunes the runtime of a Spark application's jobs. Each Spark application can be split into multiple jobs, and each job can in turn be split into multiple stages.

2.2.3.1. Computation
Let us define the following variables,

    avg_job_failure_rate: Average job failure rate
    avg_job_failure_rate_severity: List of threshold values for average job failure rate

Let us define the following variables for each job,

    single_job_failure_rate: Failure rate of a single job
    single_job_failure_rate_severity: List of threshold values for single job failure rate.

The severity of the job can be found as maximum of single_job_failure_rate_severity for all jobs and avg_job_failure_rate_severity.

i.e. severity = max(getSeverity(single_job_failure_rate, single_job_failure_rate_severity),
                    getSeverity(avg_job_failure_rate, avg_job_failure_rate_severity)
                )

where single_job_failure_rate is computed for all the jobs.
2.2.3.2. Configuration

The threshold parameters single_job_failure_rate_severity and avg_job_failure_rate_severity are easily configurable. For more details, see the developer guide.

2.2.4. Spark Memory Limit

Spark applications currently lack dynamic resource allocation. Unlike Map/Reduce jobs, which allocate resources for each map/reduce process and gradually release them during execution, a Spark application requests all of its resources up front and holds them until it finishes. Excessive memory use affects the stability of cluster nodes, so the fraction of memory a Spark application may use needs to be limited.

2.2.4.1. Computation
Let us define the following variables,

    total_executor_memory: total memory of all the executors
    total_storage_memory: total memory allocated for storage by all the executors
    total_driver_memory: total driver memory allocated
    peak_memory: total memory used at peak

    mem_utilization_severity: The list of threshold values for the memory utilization.
    total_memory_severity_in_tb: The list of threshold values for total memory.

Let us define the following functions,

    func max(x,y): Returns maximum of x and y.
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity can then be computed as,

    severity = max(getSeverity(total_executor_memory,total_memory_severity_in_tb),
                   getSeverity(peak_memory/total_storage_memory, mem_utilization_severity)
               )
2.2.4.2. Configuration

The threshold parameters total_memory_severity_in_tb and mem_utilization_severity are easily configurable. For more details, see the developer guide.

2.2.5. Spark Stage Runtime

As with the Spark job runtime heuristic, a Spark application can be split into multiple jobs, and each job can in turn be split into multiple stages.

2.2.5.1. Computation
Let us define the following variable for each spark job,

    stage_failure_rate: The stage failure rate of the job
    stage_failure_rate_severity: The list of threshold values for stage failure rate.

Let us define the following variables for each stage of a spark job,

    task_failure_rate: The task failure rate of the stage
    runtime: The runtime of a single stage

    single_stage_tasks_failure_rate_severity: The list of threshold values for task failure of a stage
    stage_runtime_severity_in_min: The list of threshold values for stage runtime.

Let us define the following functions,

    func max(x,y): returns the maximum value of x and y.
    func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.

The overall severity can be found as:

    severity_stage = max(getSeverity(task_failure_rate, single_stage_tasks_failure_rate_severity),
                   getSeverity(runtime, stage_runtime_severity_in_min)
               )
    severity_job = getSeverity(stage_failure_rate,stage_failure_rate_severity)

    severity = max(severity_stage, severity_job)

where task_failure_rate is computed for all the tasks.
2.2.5.2. Configuration

The threshold parameters single_stage_tasks_failure_rate_severity, stage_runtime_severity_in_min and stage_failure_rate_severity are easily configurable. For more details, see the developer guide.

This chapter is fairly long; the terms of art and parameter functions used here can be explored in the Dr.Elephant Dashboard.

References

[1] Developer guide

Shared from the WeChat official account 极客运维 (hypernetworker); author: hyperxu.


Originally published: 2019-11-01.
