用户指南

最佳实践

API 文档

高负载

最近更新时间:2020-04-16 10:10:49

本文档介绍如何在 TKE 集群中,通过工具定位异常是否由高负载造成,请按照以下步骤进行问题排查。

现象描述

节点高负载将会导致进程无法获得足够运行所需的 CPU 时间片,通常表现为网络 timeout、健康检查失败或服务不可用。

问题定位及解决思路

有时节点在低 cpu ‘us’ (user) 、高 cpu ‘id’ (idle) 的条件下,仍会出现负载很高的情况。通常是由于文件 IO 性能达到瓶颈,导致 IO Wait 过多,使节点整体负载升高,影响其它进程的性能。
本文以 top、atop 及 iotop 工具为例,来判断磁盘 I/O 是否正在降低应用性能。

查看平均负载及等待时间

  1. 登录节点,执行 top 命令查看当前负载。返回结果如下:

    说明:

    可通过较高的 load average 值得知该节点正在承接大量的请求,也可以通过 Cpu(s)Mem%CPU%MEM 列的数据获取哪些进程正在占用大多数资源。

     top - 19:42:06 up 23:59,  2 users,  load average: 34.64, 35.80, 35.76
     Tasks: 679 total,   1 running, 678 sleeping,   0 stopped,   0 zombie
     Cpu(s): 15.6%us,  1.7%sy,  0.0%ni, 74.7%id,  7.9%wa,  0.0%hi,  0.1%si,  0.0%st
     Mem:  32865032k total, 30989168k used,  1875864k free,   370748k buffers
     Swap:  8388604k total,     5440k used,  8383164k free,  7982424k cached
    
       PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
      9783 mysql     20   0 17.3g  16g 8104 S 186.9 52.3   3752:33 mysqld
      5700 nginx     20   0 1330m  66m 9496 S  8.9  0.2   0:20.82 php-fpm
      6424 nginx     20   0 1330m  65m 8372 S  8.3  0.2   0:04.97 php-fpm
      6573 nginx     20   0 1330m  64m 7368 S  8.3  0.2   0:01.49 php-fpm
      5927 nginx     20   0 1320m  56m 9272 S  7.6  0.2   0:12.54 php-fpm
      5956 nginx     20   0 1330m  65m 8500 S  7.6  0.2   0:12.70 php-fpm
      6126 nginx     20   0 1321m  57m 8964 S  7.3  0.2   0:09.72 php-fpm
      6127 nginx     20   0 1319m  54m 9520 S  6.6  0.2   0:08.73 php-fpm
      6131 nginx     20   0 1320m  56m 9404 S  6.6  0.2   0:09.43 php-fpm
      6174 nginx     20   0 1321m  56m 8444 S  6.3  0.2   0:08.92 php-fpm
      5790 nginx     20   0 1319m  54m 9468 S  5.6  0.2   0:17.33 php-fpm
      6575 nginx     20   0 1320m  55m 8212 S  5.6  0.2   0:02.11 php-fpm
      6160 nginx     20   0 1310m  44m 8296 S  4.0  0.1   0:10.05 php-fpm
      5597 nginx     20   0 1310m  46m 9556 S  3.6  0.1   0:21.03 php-fpm
      5786 nginx     20   0 1310m  45m 8528 S  3.6  0.1   0:15.53 php-fpm
      5797 nginx     20   0 1310m  46m 9444 S  3.6  0.1   0:14.02 php-fpm
      6158 nginx     20   0 1310m  45m 8324 S  3.6  0.1   0:10.20 php-fpm
      5698 nginx     20   0 1310m  46m 9184 S  3.3  0.1   0:20.62 php-fpm
      5779 nginx     20   0 1309m  44m 8336 S  3.3  0.1   0:15.34 php-fpm
      6540 nginx     20   0 1306m  40m 7884 S  3.3  0.1   0:02.46 php-fpm
      5553 nginx     20   0 1300m  36m 9568 S  3.0  0.1   0:21.58 php-fpm
      5722 nginx     20   0 1310m  45m 8552 S  3.0  0.1   0:17.25 php-fpm
      5920 nginx     20   0 1302m  36m 8208 S  3.0  0.1   0:14.23 php-fpm
      6432 nginx     20   0 1310m  45m 8420 S  3.0  0.1   0:05.86 php-fpm
      5285 nginx     20   0 1302m  38m 9696 S  2.7  0.1   0:23.41 php-fpm
  2. 在返回结果界面,可查看核的 wa 值,wa(wait)表示 IO WAIT 的 CPU 占用。默认显示所有核的平均值,按1查看每个核的 wa 值。如下所示:

    说明:

    wa 通常为0%,如果经常浮动在1%之上,说明存储设备的速度已经太慢,无法跟上 CPU 的处理速度。

     top - 19:42:08 up 23:59,  2 users,  load average: 34.64, 35.80, 35.76
     Tasks: 679 total,   1 running, 678 sleeping,   0 stopped,   0 zombie
     Cpu0  : 29.5%us,  3.7%sy,  0.0%ni, 48.7%id, 17.9%wa,  0.0%hi,  0.1%si,  0.0%st
     Cpu1  : 29.3%us,  3.7%sy,  0.0%ni, 48.9%id, 17.9%wa,  0.0%hi,  0.1%si,  0.0%st
     Cpu2  : 26.1%us,  3.1%sy,  0.0%ni, 64.4%id,  6.0%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu3  : 25.9%us,  3.1%sy,  0.0%ni, 65.5%id,  5.4%wa,  0.0%hi,  0.1%si,  0.0%st
     Cpu4  : 24.9%us,  3.0%sy,  0.0%ni, 66.8%id,  5.0%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu5  : 24.9%us,  2.9%sy,  0.0%ni, 67.0%id,  4.8%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu6  : 24.2%us,  2.7%sy,  0.0%ni, 68.3%id,  4.5%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu7  : 24.3%us,  2.6%sy,  0.0%ni, 68.5%id,  4.2%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu8  : 23.8%us,  2.6%sy,  0.0%ni, 69.2%id,  4.1%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu9  : 23.9%us,  2.5%sy,  0.0%ni, 69.3%id,  4.0%wa,  0.0%hi,  0.3%si,  0.0%st
     Cpu10 : 23.3%us,  2.4%sy,  0.0%ni, 68.7%id,  5.6%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu11 : 23.3%us,  2.4%sy,  0.0%ni, 69.2%id,  5.1%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu12 : 21.8%us,  2.4%sy,  0.0%ni, 60.2%id, 15.5%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu13 : 21.9%us,  2.4%sy,  0.0%ni, 60.6%id, 15.2%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu14 : 21.4%us,  2.3%sy,  0.0%ni, 72.6%id,  3.7%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu15 : 21.5%us,  2.2%sy,  0.0%ni, 73.2%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu16 : 21.2%us,  2.2%sy,  0.0%ni, 73.6%id,  3.0%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu17 : 21.2%us,  2.1%sy,  0.0%ni, 73.8%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu18 : 20.9%us,  2.1%sy,  0.0%ni, 74.1%id,  2.9%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu19 : 21.0%us,  2.1%sy,  0.0%ni, 74.4%id,  2.5%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu20 : 20.7%us,  2.0%sy,  0.0%ni, 73.8%id,  3.4%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu21 : 20.8%us,  2.0%sy,  0.0%ni, 73.9%id,  3.2%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu22 : 20.8%us,  2.0%sy,  0.0%ni, 74.4%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
     Cpu23 : 20.8%us,  1.9%sy,  0.0%ni, 74.4%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
     Mem:  32865032k total, 30209248k used,  2655784k free,   370748k buffers
     Swap:  8388604k total,     5440k used,  8383164k free,  7986552k cached

监视磁盘 IO 统计信息

  1. 执行命令 atop ,查看当前磁盘 IO 状态。本例中,磁盘 sda 显示 busy 100% ,表示已达到严重性能瓶颈。

     ATOP - lemp              2017/01/23  19:42:32              ---------                10s elapsed
     PRC | sys    3.18s | user  33.24s | #proc    679 | #tslpu    28 | #zombie    0 | #exit      0 |
     CPU | sys      29% | user    330% | irq       1% | idle   1857% | wait    182% | curscal  69% |
     CPL | avg1   33.00 | avg5   35.29 | avg15  35.59 | csw    62610 | intr   76926 | numcpu    24 |
     MEM | tot    31.3G | free    2.1G | cache   7.6G | dirty  41.0M | buff  362.1M | slab    1.2G |
     SWP | tot     8.0G | free    8.0G |              |              | vmcom  23.9G | vmlim  23.7G |
     DSK |          sda | busy    100% | read       4 | write   1789 | MBw/s   2.84 | avio 5.58 ms |
     NET | transport    | tcpi   10357 | tcpo    9065 | udpi       0 | udpo       0 | tcpao    174 |
     NET | network      | ipi    10360 | ipo     9065 | ipfrw      0 | deliv  10359 | icmpo      0 |
     NET | eth0      4% | pcki    6649 | pcko    6136 | si 1478 Kbps | so 4115 Kbps | erro       0 |
     NET | lo      ---- | pcki    4082 | pcko    4082 | si 8967 Kbps | so 8967 Kbps | erro       0 |
    
     PID   TID  THR  SYSCPU  USRCPU  VGROW  RGROW  RDDSK  WRDSK ST EXC S CPUNR  CPU CMD       1/12
      9783     -  156   0.21s  19.44s     0K  -788K     4K  1344K --   - S     4 197% mysqld
      5596     -    1   0.10s   0.62s 47204K 47004K     0K   220K --   - S    18   7% php-fpm
      6429     -    1   0.06s   0.34s 19840K 19968K     0K     0K --   - S    21   4% php-fpm
      6210     -    1   0.03s   0.30s -5216K -5204K     0K     0K --   - S    19   3% php-fpm
      5757     -    1   0.05s   0.27s 26072K 26012K     0K     4K --   - S    13   3% php-fpm
      6433     -    1   0.04s   0.28s -2816K -2816K     0K     0K --   - S    11   3% php-fpm
      5846     -    1   0.06s   0.22s -2560K -2660K     0K     0K --   - S     7   3% php-fpm
      5791     -    1   0.05s   0.21s  5764K  5692K     0K     0K --   - S    22   3% php-fpm
      5860     -    1   0.04s   0.21s 48088K 47724K     0K     0K --   - S     1   3% php-fpm
      6231     -    1   0.04s   0.20s  -256K    -4K     0K     0K --   - S     1   2% php-fpm
      6154     -    1   0.03s   0.21s -3004K -3184K     0K     0K --   - S    21   2% php-fpm
      6573     -    1   0.04s   0.20s  -512K  -168K     0K     0K --   - S     4   2% php-fpm
      6435     -    1   0.04s   0.19s -3216K -2980K     0K     0K --   - S    15   2% php-fpm
      5954     -    1   0.03s   0.20s     0K   164K     0K     4K --   - S     0   2% php-fpm
      6133     -    1   0.03s   0.19s 41056K 40432K     0K     0K --   - S    18   2% php-fpm
      6132     -    1   0.02s   0.20s 37836K 37440K     0K     0K --   - S    11   2% php-fpm
      6242     -    1   0.03s   0.19s -12.2M -12.3M     0K     4K --   - S    12   2% php-fpm
      6285     -    1   0.02s   0.19s 39516K 39420K     0K     0K --   - S     3   2% php-fpm
      6455     -    1   0.05s   0.16s 29008K 28560K     0K     0K --   - S    14   2% php-fpm
  2. 查看占用磁盘 IO 的进程,有如下两种方法:

    • 继续在该界面按 d,可查看哪些进程正在使用磁盘 IO。返回结果如下:

         ATOP - lemp               2017/01/23  19:42:46               ---------               2s elapsed
         PRC | sys    0.24s | user   1.99s | #proc    679 | #tslpu    54 | #zombie    0 | #exit      0 |
         CPU | sys      11% | user    101% | irq       1% | idle   2089% | wait    208% | curscal  63% |
         CPL | avg1   38.49 | avg5   36.48 | avg15  35.98 | csw     4654 | intr    6876 | numcpu    24 |
         MEM | tot    31.3G | free    2.2G | cache   7.6G | dirty  48.7M | buff  362.1M | slab    1.2G |
         SWP | tot     8.0G | free    8.0G |              |              | vmcom  23.9G | vmlim  23.7G |
         DSK |          sda | busy    100% | read       2 | write    362 | MBw/s   2.28 | avio 5.49 ms |
         NET | transport    | tcpi    1031 | tcpo     968 | udpi       0 | udpo       0 | tcpao     45 |
         NET | network      | ipi     1031 | ipo      968 | ipfrw      0 | deliv   1031 | icmpo      0 |
         NET | eth0      1% | pcki     558 | pcko     508 | si  762 Kbps | so 1077 Kbps | erro       0 |
         NET | lo      ---- | pcki     406 | pcko     406 | si 2273 Kbps | so 2273 Kbps | erro       0 |
      
           PID          TID         RDDSK         WRDSK        WCANCL         DSK        CMD         1/5
          9783            -            0K          468K           16K         40%        mysqld
          1930            -            0K          212K            0K         18%        flush-8:0
          5896            -            0K          152K            0K         13%        nginx
           880            -            0K          148K            0K         13%        jbd2/sda5-8
          5909            -            0K           60K            0K          5%        nginx
          5906            -            0K           36K            0K          3%        nginx
          5907            -           16K            8K            0K          2%        nginx
          5903            -           20K            0K            0K          2%        nginx
          5901            -            0K           12K            0K          1%        nginx
          5908            -            0K            8K            0K          1%        nginx
          5894            -            0K            8K            0K          1%        nginx
          5911            -            0K            8K            0K          1%        nginx
          5900            -            0K            4K            4K          0%        nginx
          5551            -            0K            4K            0K          0%        php-fpm
          5913            -            0K            4K            0K          0%        nginx
          5895            -            0K            4K            0K          0%        nginx
          6133            -            0K            0K            0K          0%        php-fpm
          5780            -            0K            0K            0K          0%        php-fpm
          6675            -            0K            0K            0K          0%        atop
    • 也可以执行命令 iotop -oPa 查看哪些进程占用磁盘 IO。返回结果如下:

         Total DISK READ: 15.02 K/s | Total DISK WRITE: 3.82 M/s
           PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
          1930 be/4 root          0.00 B   1956.00 K  0.00 % 83.34 % [flush-8:0]
          5914 be/4 nginx         0.00 B      0.00 B  0.00 % 36.56 % nginx: cache manager process
           880 be/3 root          0.00 B     21.27 M  0.00 % 35.03 % [jbd2/sda5-8]
          5913 be/2 nginx        36.00 K   1000.00 K  0.00 %  8.94 % nginx: worker process
          5910 be/2 nginx         0.00 B   1048.00 K  0.00 %  8.43 % nginx: worker process
          5896 be/2 nginx        56.00 K    452.00 K  0.00 %  6.91 % nginx: worker process
          5909 be/2 nginx        20.00 K   1144.00 K  0.00 %  6.24 % nginx: worker process
          5890 be/2 nginx        48.00 K    692.00 K  0.00 %  6.07 % nginx: worker process
          5892 be/2 nginx        84.00 K    736.00 K  0.00 %  5.71 % nginx: worker process
          5901 be/2 nginx        20.00 K    504.00 K  0.00 %  5.46 % nginx: worker process
          5899 be/2 nginx         0.00 B    596.00 K  0.00 %  5.14 % nginx: worker process
          5897 be/2 nginx        28.00 K   1388.00 K  0.00 %  4.90 % nginx: worker process
          5908 be/2 nginx        48.00 K    700.00 K  0.00 %  4.43 % nginx: worker process
          5905 be/2 nginx        32.00 K   1140.00 K  0.00 %  4.36 % nginx: worker process
          5900 be/2 nginx         0.00 B   1208.00 K  0.00 %  4.31 % nginx: worker process
          5904 be/2 nginx        36.00 K   1244.00 K  0.00 %  2.80 % nginx: worker process
          5895 be/2 nginx        16.00 K    780.00 K  0.00 %  2.50 % nginx: worker process
          5907 be/2 nginx         0.00 B   1548.00 K  0.00 %  2.43 % nginx: worker process
          5903 be/2 nginx        36.00 K   1032.00 K  0.00 %  2.34 % nginx: worker process
          6130 be/4 nginx         0.00 B     72.00 K  0.00 %  2.18 % php-fpm: pool www
          5906 be/2 nginx        12.00 K    844.00 K  0.00 %  2.10 % nginx: worker process
          5889 be/2 nginx        40.00 K   1164.00 K  0.00 %  2.00 % nginx: worker process
          5894 be/2 nginx        44.00 K    760.00 K  0.00 %  1.61 % nginx: worker process
          5902 be/2 nginx        52.00 K    992.00 K  0.00 %  1.55 % nginx: worker process
          5893 be/2 nginx        64.00 K    972.00 K  0.00 %  1.22 % nginx: worker process
          5814 be/4 nginx        36.00 K     44.00 K  0.00 %  1.06 % php-fpm: pool www
          6159 be/4 nginx         4.00 K      4.00 K  0.00 %  1.00 % php-fpm: pool www
          5693 be/4 nginx         0.00 B      4.00 K  0.00 %  0.86 % php-fpm: pool www
          5912 be/2 nginx        68.00 K    300.00 K  0.00 %  0.72 % nginx: worker process
          5911 be/2 nginx        20.00 K    788.00 K  0.00 %  0.72 % nginx: worker process

      通过 man iotop 命令可以查看以下几个参数的含义。返回结果如下:

         -o, --only
                Only show processes or threads actually doing I/O, instead of showing all processes or threads. This can be dynamically toggled by pressing o.
         -P, --processes
                Only show processes. Normally iotop shows all threads.
      
         -a, --accumulated
                Show accumulated I/O instead of bandwidth. In this mode, iotop shows the amount of I/O processes have done since iotop started.

其他原因

节点上部署了其他非 K8S 管理的服务可能造成节点高负载问题。例如,在节点上安装不被 K8S 所管理的数据库。

目录