最近几天我们几位同事一起在做业务压测,TPS始终上不去,始终在100上下,平均响应时间也在5、6秒左右,性能差得可想而知,但是数据库层压测期间所有的SQL平均响应时间均在30ms以下,CPU使用率也才达到40%而已,RDS整体性能并没有达到瓶颈。
由于是PHP程序,没法用现有的压测平台进行全链路压测跟踪到具体每一步的耗时。在经历了各种linux调优、PHP参数调优以后,效果仍然没有起色,气氛一度陷入尴尬。
其实利用之前文章《MySQL DBA如何利用strace/pstack/gdb来定位问题》中的提到的strace也可以跟踪到某个线程的全部调用情况,但是同事推荐了另外一个项目:phptrace,可以专门针对PHP程序做链路追踪,效果跟strace类似。
项目地址如下:https://github.com/Qihoo360/phptrace
它可以跟踪PHP在运行时的函数调用、请求信息、执行流程,并且提供有过滤器、统计信息、当前状态等实用功能。在任何环境下,它都能很好的定位阻塞问题以及在高负载下Debug,尤其是线上生产环境。
具有以下特性:
你可以利用如下方式来跟踪PHP程序
$ phptrace -p 3600
[pid 3600] > Me->run() called at [example.php:57]
[pid 3600] > Me->say("good night") called at [example.php:33]
[pid 3600] < Me->say("good night") = NULL called at [example.php:33] ~ 0.000s 0.000s
[pid 3600] > Me->sleep() called at [example.php:34]
[pid 3600] > Me->say("sleeping...") called at [example.php:27]
[pid 3600] < Me->say("sleeping...") = NULL called at [example.php:27] ~ 0.000s 0.000s
[pid 3600] > sleep(2) called at [example.php:28]
[pid 3600] < sleep(2) = 0 called at [example.php:28] ~ 2.000s 2.000s
[pid 3600] < Me->sleep() = NULL called at [example.php:34] ~ 2.000s 0.000s
[pid 3600] > Me->say("wake up") called at [example.php:35]
[pid 3600] < Me->say("wake up") = NULL called at [example.php:35] ~ 0.000s 0.000s
[pid 3600] < Me->run() = NULL called at [example.php:57] ~ 2.000s 0.000s
你也可以根据函数名或者url进行过滤
$ phptrace -p 3600 -f type=class,content=Me
[pid 3600]> Me->run() called at [example.php:57]
[pid 3600]> Me->say("good night") called at [example.php:33]
[pid 3600]< Me->say("good night") = NULL called at [example.php:33] ~ 0.000s 0.000s
[pid 3600]> Me->sleep() called at [example.php:34]
[pid 3600]> Me->say("sleeping...") called at [example.php:27]
[pid 3600]< Me->say("sleeping...") = NULL called at [example.php:27] ~ 0.000s 0.000s
[pid 3600]< Me->sleep() = NULL called at [example.php:34] ~ 2.000s 2.000s
[pid 3600]> Me->say("wake up") called at [example.php:35]
[pid 3600]< Me->say("wake up") = NULL called at [example.php:35] ~ 0.000s 0.000s
[pid 3600]< Me->run() = NULL called at [example.php:57] ~ 2.001s 0.000s
在压测期间,我们利用phptrace做了跟踪,通过阅读trace日志,发现所有的调用时间基本上集中在了一个函数上,而其他部分消耗时间都很短。
[pid 22745] < file_put_contents("/var/www/cp-activity/activity/da"..., "2020-07-17 04:35:28 front_money_"..., 8) = 254 called at [/var/www/cp-activity/activity/framework/class/service/cache/fcache.class.php:140] ~ 1.338s 1.338s
经过仔细勘察,这个file_put_contents
函数的作用其实就是往nas上写日志,看到这里真的是有点内伤的感觉,写本地磁盘、写ES难道不香吗,为啥想不开写nas呢???
基于上述判断,我们在其中一台机器上注释掉了该函数的调用,然后再次做了压测(10000并发情况),压测期间监控到alive线程基本上保持在10左右,总的TPS峰值达到了接近400左右,这还是在只修改了一台的情况下的结果,相信如果6台应用全部注释掉该方法,TPS应该能够上升一个数量级(待继续验证)。
strace
、phptrace
、perf
等工具