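A NameNode Failure Caused by Dirty Memory Pages on a Linux Host

Lu说 · Big data operations engineer

About 10,789 characters in total; roughly 3 minutes to read.

If you have ever operated a big data cluster, you know that memory is a scarce resource on the cluster's master nodes. The larger the cluster, the more obvious and frequent the problems caused by host memory become. Across the many big data clusters I manage, this kind of issue keeps recurring. Below is a summary of one such incident.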

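Symptoms

1) A host hung for no apparent reason; 2) the NameNode service on it stopped responding and a failover to the standby was triggered.

The troubleshooting sequence was:

Received an alert that the host was unreachable over the network;
Received an alert that the NameNode service on it had stopped;
Checked whether the standby node had successfully taken over as the active node;
Checked the zkfc log to determine the reason for the failover;
Rebooted the failed host (if needed), backed up the logs, started the nn service, and checked the host logs to confirm the cause of the host failure.

In fact, the proper handling path is: reboot the host -> start nn -> check the logs to locate the cause. This article does not dwell on the handling procedure; it focuses only on why the host hung in this case.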

When a host hangs, the usual first step is to check the /var/log/messages log, where the following hint appears:

echo 0 > /proc/sys/kernel/hung_task_timeout_secs

The full log is as follows:

Mar 28 03:12:47 namenode01 collectd[1731]: write_graphite plugin: send to epc-graphite-server:2003 (tcp) failed with status 110 (Connection timed out)
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status -1.
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Some write plugin is back to normal operation. `write' succeeded.
Mar 28 03:14:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:14:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:14:31 namenode01 kernel: kworker/15:2    D ffff88208618bdd8     0 22246      2 0x00000080
Mar 28 03:14:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:14:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:14:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8
Mar 28 03:14:31 namenode01 kernel: Call Trace:
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff8105aeae>] ? physflat_send_IPI_mask+0xe/0x10
Mar 28 03:14:31 namenode01 kernel: [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c0548>] ? resched_curr+0xa8/0xc0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a8f8b>] create_worker+0xeb/0x200
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a9216>] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a9743>] worker_thread+0x383/0x3c0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/2:2:11678 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/2:2     D ffff88003922bdd8     0 11678      2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88003922bbf0 0000000000000046 ffff881fffd86eb0 ffff88003922bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88003922bfd8 ffff88003922bfd8 ffff881fffd86eb0 ffff88003922bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88003922bd58 ffff881fffd86eb0 ffff88003922bdd8
Mar 28 03:16:31 namenode01 kernel: Call Trace:
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810c8f18>] ? __enqueue_entity+0x78/0x80
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810cf90c>] ? enqueue_entity+0x26c/0xb60
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a8f8b>] create_worker+0xeb/0x200
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a9216>] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a9743>] worker_thread+0x383/0x3c0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/15:2    D ffff88208618bdd8     0 22246      2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8
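
When the log is long, it can help to first see which tasks the kernel reported as blocked and how often. The following is a minimal sketch, not part of the original investigation: the function name summarize_hung_tasks is hypothetical, and it assumes the syslog line format shown in the excerpt above.

```shell
#!/bin/sh
# Summarize hung-task warnings from a syslog-style stream on stdin.
# Prints "<count> <task name>" for every task the kernel reported as
# blocked, most frequently reported first.
# summarize_hung_tasks is a hypothetical helper name for illustration.
summarize_hung_tasks() {
    grep 'blocked for more than' |
        # Keep only the task name; drop the trailing ":<pid>".
        sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than.*/\1/p' |
        sort | uniq -c | sort -rn |
        # Strip the padding that uniq -c puts before the count.
        sed 's/^[ 	]*//'
}

# Typical use on the affected host (log path as used in this article):
#   summarize_hung_tasks < /var/log/messages
```

Applied to the excerpt above, this would list kworker/15:2 twice and kworker/2:2 once, which makes it easy to see that the same worker stayed blocked across several checks.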

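The cause is that, by default, Linux uses up to 40% of available memory for file system caching (this is the vm.dirty_ratio threshold; the actual default varies by kernel version and distribution). Once this threshold is exceeded, the file system flushes all outstanding cached data to disk, and all subsequent I/O requests become synchronous. The 120 seconds in the log is the default value of kernel.hung_task_timeout_secs: a task that stays blocked longer than that, for example while waiting for this flush, is reported as hung.

The problem above occurs because the I/O subsystem is not fast enough to write all cached data to disk within 120 seconds. As I/O responds slowly, more and more requests pile up until all system memory is consumed and the system stops responding.

This is a problem introduced by Linux's delayed-write mechanism, and the more memory a host has, the more likely it is to occur. A search shows that this is a known bug: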

This is a known bug. By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data within 120 seconds. This especially happens on systems with a lot of memory. The problem is solved in later kernels and there is no “fix” from Oracle. I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting “vm.dirty_ratio=10” in /etc/sysctl.conf. This setting does not influence overall database performance since you hopefully use Direct IO and bypass the file system cache completely.

Link: nfo-task-blocked-for-more-than-120-seconds

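Several settings control dirty data:

vm.dirty_background_ratio is the percentage of memory that may fill with dirty data before background writeback begins. This dirty data is written to disk later by background flusher processes such as pdflush/flush/kdmflush. For example, with 32 GB of memory and a setting of 10, up to 3.2 GB of dirty data can stay in memory; beyond 3.2 GB, the background processes start cleaning it up.

vm.dirty_ratio is the absolute limit on dirty data: the percentage of memory holding dirty data cannot exceed this value. Once it is exceeded, new I/O requests are blocked until dirty data has been written to disk. This is a major cause of I/O stalls, but it is also a safeguard that keeps excessive dirty data from accumulating in memory.

vm.dirty_expire_centisecs specifies how long dirty data may live. Here its value is 30 seconds (3000 centiseconds). When pdflush/flush/kdmflush runs, it checks for data older than this limit and, if any is found, writes it to disk asynchronously. After all, data that stays in memory too long is at risk of being lost.

vm.dirty_writeback_centisecs specifies how often the pdflush/flush/kdmflush processes wake up.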
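
To make the 32 GB example concrete, the two ratio settings can be turned into rough sizes. This is only a sketch under a stated assumption: like the example above, it applies the ratio to total memory, whereas the kernel applies it to the memory it considers available for dirtying, so real thresholds are somewhat lower. The function names dirty_threshold_kb and mem_total_kb are hypothetical.

```shell
#!/bin/sh
# Estimate a dirty-data threshold in kB as total_kb * ratio / 100.
# ASSUMPTION: the ratio is applied to MemTotal for simplicity; the kernel
# uses "dirtyable" memory (roughly free plus reclaimable pages) instead.
# Usage: dirty_threshold_kb TOTAL_KB RATIO_PERCENT
dirty_threshold_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Read MemTotal (in kB) from /proc/meminfo-style input on stdin.
mem_total_kb() {
    awk '$1 == "MemTotal:" { print $2 }'
}

# Example on a live Linux host:
#   total=$(mem_total_kb < /proc/meminfo)
#   bg=$(cat /proc/sys/vm/dirty_background_ratio)
#   hard=$(cat /proc/sys/vm/dirty_ratio)
#   echo "background writeback starts near $(dirty_threshold_kb "$total" "$bg") kB"
#   echo "writers are blocked near $(dirty_threshold_kb "$total" "$hard") kB"
```

The current values of vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs can be read the same way from /proc/sys/vm/.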

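Tuning

We can tune along the following lines:

Reduce the dirty data ratio so that flushes do not time out
Shorten the time dirty data stays in memory so that it does not pile up

Modify the corresponding parameters

Temporary change (sysctl -w takes effect immediately and lasts until the next reboot)

# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5

Permanent change

# vi /etc/sysctl.conf
Add the following lines:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

# sysctl -p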
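
Before reloading, it may be worth checking that the file really contains the intended values and that the background ratio sits below the hard limit, so that background writeback starts before writers are blocked. The sketch below is an illustration only; sysctl_conf_value and check_dirty_ratios are hypothetical helper names, and the parser handles just simple "key = value" lines.

```shell
#!/bin/sh
# Print the value assigned to a key in sysctl.conf-style input on stdin.
# Lines starting with '#' or ';' are treated as comments; if the key is
# assigned more than once, the last assignment is printed.
# Usage: sysctl_conf_value KEY < /etc/sysctl.conf
sysctl_conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)
            gsub(/[ \t]/, "", v)
            if (k == key) last = v
        }
        END { if (last != "") print last }'
}

# Succeed only when the background ratio is below the hard ratio.
# Usage: check_dirty_ratios BACKGROUND_RATIO DIRTY_RATIO
check_dirty_ratios() {
    [ "$1" -lt "$2" ]
}

# Example:
#   bg=$(sysctl_conf_value vm.dirty_background_ratio < /etc/sysctl.conf)
#   hard=$(sysctl_conf_value vm.dirty_ratio < /etc/sysctl.conf)
#   check_dirty_ratios "$bg" "$hard" || echo "background ratio is not below dirty_ratio"
```

After sysctl -p, the values in effect can be confirmed with sysctl vm.dirty_ratio vm.dirty_background_ratio.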

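A brief note on dirty data:

Dirty data exists because of the page cache, the main disk cache implemented by the Linux kernel. Because the page cache sits in front of the disk, write operations are actually deferred. When data in the page cache is newer than the data on the backing store, that data is called dirty data.

References

For the Linux kernel delayed-write mechanism, see: http://ilinuxkernel.com/?p=1578
If the page cache, dirty data, and other I/O terms are unfamiliar, see this overview of I/O terms: https://blog.51cto.com/qixue/1906775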
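
The amount of dirty data currently held in the page cache is exposed in /proc/meminfo. As a small sketch (the helper name dirty_stats is hypothetical), the relevant counters can be pulled out like this:

```shell
#!/bin/sh
# Print the Dirty and Writeback counters from /proc/meminfo-style input
# on stdin. Dirty is data waiting to be written back; Writeback is data
# being written back right now.
dirty_stats() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }'
}

# On a Linux host:
#   dirty_stats < /proc/meminfo
# To watch the trend during heavy writes (the 5-second interval is arbitrary):
#   while sleep 5; do dirty_stats < /proc/meminfo; done
```

A Dirty value that keeps climbing toward the configured limit during heavy writes suggests that the disks cannot keep up with the write rate.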
