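Investigating Linux Performance Problems with Off-CPU Flame Graphs

Source

https://www.memsql.com/blog/linux-off-cpu-investigation/

"Investigating Linux Performance with Off-CPU Flame Graphs"

This article tells the story of using off-CPU flame graphs to analyze a program's latency (spent mostly on acquiring a lock), find the bottleneck, and eliminate it. It is well worth reading, but Linux阅码场 did not have enough time to translate it into Chinese, so readers are encouraged to read the English text directly.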

The Setup

As a performance engineer at MemSQL, one of my primary responsibilities is to ensure that customer Proof of Concepts (POCs) run smoothly. I was recently asked to assist with a big POC, where I was surprised to encounter an uncommon Linux performance issue. I was running a synthetic workload of 16 threads (one for each CPU core). Each one simultaneously executed a very simple query (select count(*) from t where i > 5) against a columnstore table.

In theory, this ought to be a CPU bound operation since it would be reading from a file that was already in disk buffer cache. In practice, our cores were spending about 50% of their time idle.

In this post, I’ll walk through some of the debugging techniques and reveal exactly how we reached resolution.

What were our threads doing?

After confirming that our workload was indeed using 16 threads, I looked at the state of our various threads. In every refresh of my htop window, I saw that a handful of threads were in the D state corresponding to “Uninterruptible sleep”.
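The same check can be scripted rather than read off an htop window. The sketch below is an illustration, not part of the original investigation: it assumes a Linux /proc filesystem, and the helper names parse_state and thread_states are invented for this example. It reads the state field from each thread's stat file and tallies the states.

```python
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or ')', so the state is the first token after the LAST ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def thread_states(pid: int) -> Counter:
    """Count the threads of `pid` by scheduler state (R, S, D, ...)."""
    counts = Counter()
    for stat_file in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            counts[parse_state(stat_file.read_text())] += 1
        except OSError:
            pass  # the thread exited between listing and reading
    return counts


# Example (hypothetical PID): print(thread_states(12345))
```

Sampling thread_states a few times per second against the server's PID and seeing a nonzero D count on most samples would correspond to the htop observation described above.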

Why were we going off CPU?

At this point, I generated an off-cpu flamegraph using Linux perf_events to see why we entered this state. Off-CPU means that instead of looking at what is keeping the CPU busy, you look at what is preventing it from being busy by things happening elsewhere (e.g. waiting for IO or a lock). The normal way to generate these visualizations is to use perf inject -s, but the machine I tested on did not have a new enough version of perf. Instead I had to use an awk script I had previously written:

$ sudo perf record --call-graph=fp -e 'sched:sched_switch' -e 'sched:sched_stat_sleep' -e 'sched:sched_stat_blocked' --pid $(pgrep memsqld | head -n 1) -- sleep 1

[ perf record: Woken up 1 times to write data ]

[ perf record: Captured and wrote 1.343 MB perf.data (~58684 samples) ]

$ sudo perf script -f time,comm,pid,tid,event,ip,sym,dso,trace -i perf.data | ~/FlameGraph/stackcollapse-perf-sched.awk | ~/FlameGraph/flamegraph.pl --color=io --countname=us >off-cpu.svg

Note: recording scheduler events via perf record can have a very large overhead and should be used cautiously in production environments. This is why I wrap the perf record around a sleep 1 to limit the duration.

In an off-cpu flamegraph, the width of a bar is proportional to the total time spent off cpu. Here we see a lot of time is spent in rwsem_down_write_failed.
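To make the "width is proportional to total off-CPU time" point concrete, here is a minimal sketch of the aggregation step that sits between raw events and flamegraph.pl. It is not the awk script used above; fold_off_cpu is a hypothetical helper, and the stacks and durations are invented for illustration rather than measured. It sums microseconds per unique stack and emits the "folded" text format (frames joined by semicolons, then a space and a count) that flamegraph.pl takes as input.

```python
from collections import defaultdict


def fold_off_cpu(intervals):
    """Sum off-CPU microseconds per call stack.

    `intervals` is an iterable of (stack, off_cpu_us) pairs, where `stack`
    is a sequence of frame names ordered from root to leaf.
    Returns lines in the folded format: "root;...;leaf <total_us>".
    """
    totals = defaultdict(int)
    for stack, off_cpu_us in intervals:
        totals[";".join(stack)] += off_cpu_us
    return [f"{stack} {us}" for stack, us in sorted(totals.items())]


# Hypothetical samples, invented for illustration (not measured data).
samples = [
    (("memsqld", "sys_mmap", "down_write", "rwsem_down_write_failed"), 12000),
    (("memsqld", "sys_mmap", "down_write", "rwsem_down_write_failed"), 9000),
    (("memsqld", "page_fault", "down_read", "rwsem_down_read_failed"), 3000),
]
for line in fold_off_cpu(samples):
    print(line)
```

With these made-up numbers, the rwsem_down_write_failed stack accumulates 21000 us against 3000 us for the read path, so its bar would be drawn seven times wider.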

From the repeated calls to rwsem_down_read_failed and rwsem_down_write_failed, we see that the culprit was mmap contending in the kernel on the mm->mmap_sem lock:

down_write(&mm->mmap_sem);

ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);

up_write(&mm->mmap_sem);

This was causing every mmap syscall to take 10-20ms (almost half the latency of the query itself). MemSQL was so fast that we had inadvertently written a benchmark for Linux mmap!

The fix was simple: we switched from using mmap to using the traditional file read interface. After this change, we nearly doubled our throughput and became CPU bound, as we expected.
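The difference between the two access patterns can be sketched as follows. This is not MemSQL's code: the function names are hypothetical, and the "column" is simply treated as raw bytes. The mmap variant creates and tears down a mapping for each file, which changes the address space shared by all threads of the process (the operation guarded by mmap_sem in the kernel excerpt above). The read variant copies data into ordinary buffers and creates no per-file mapping.

```python
import mmap
import os
import tempfile

CHUNK = 1 << 16  # bytes processed per step


def _count_gt(chunk: bytes, threshold: int) -> int:
    """Count byte values in `chunk` that are greater than `threshold`."""
    return sum(1 for b in chunk if b > threshold)


def count_with_mmap(path: str, threshold: int) -> int:
    """Scan the file through a read-only memory mapping."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, len(m), CHUNK):
                total += _count_gt(m[off:off + CHUNK], threshold)
    return total


def count_with_read(path: str, threshold: int) -> int:
    """Scan the file with plain read() calls, one chunk at a time."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(CHUNK):
            total += _count_gt(chunk, threshold)
    return total


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 100)
    try:
        # Both strategies must agree on the answer.
        assert count_with_mmap(tmp.name, 5) == count_with_read(tmp.name, 5)
    finally:
        os.unlink(tmp.name)
```

Both functions return the same count; what differs is the set of syscalls behind them. Whether the read path is faster in a given setup depends on thread count and kernel version, so it is worth measuring rather than assuming.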

For more information and discussion around Linux performance, check out the original post on my personal blog.

Download MemSQL Community Edition to run your own performance tests for free today: memsql.com/download

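Alex Reece is a systems and performance engineer. He believes in active benchmarking, root cause analysis, and fast code.

Linux阅码场 is a professional community for Linux and systems software technology, connecting companies with Linux talent.

This article was shared from the WeChat official account Linux阅码场 (LinuxDev). Author: Alex Reece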

Originally published: 2018-12-18
