前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >一次kvm上kernel crash分析记录

一次kvm上kernel crash分析记录

作者头像
惠伟
发布2021-02-24 11:28:03
1.6K0
发布2021-02-24 11:28:03
举报
文章被收录于专栏:虚拟化笔记虚拟化笔记

起因

有同事联系我说公司内部云上一台虚拟机crash,有环境可以看vmcore。

代码语言:javascript
复制
[root@p32486v ~]# crash /usr/lib/debug/lib/modules/3.10.0-957.21.3.el7.x86_64/vmlinux vmcore
crash> bt
PID: 0      TASK: ffff8a4d3a338000  CPU: 4   COMMAND: "swapper/4"
 #0 [ffff8a4fffd03670] machine_kexec at ffffffffa6263934
 #1 [ffff8a4fffd036d0] __crash_kexec at ffffffffa631d162
 #2 [ffff8a4fffd037a0] crash_kexec at ffffffffa631d250
 #3 [ffff8a4fffd037b8] oops_end at ffffffffa696d778
 #4 [ffff8a4fffd037e0] no_context at ffffffffa695bdbe
 #5 [ffff8a4fffd03830] __bad_area_nosemaphore at ffffffffa695be55
 #6 [ffff8a4fffd03880] bad_area_nosemaphore at ffffffffa695bfc6
 #7 [ffff8a4fffd03890] __do_page_fault at ffffffffa69706d0
 #8 [ffff8a4fffd03900] trace_do_page_fault at ffffffffa69709d6
 #9 [ffff8a4fffd03940] do_async_page_fault at ffffffffa696ff62
#10 [ffff8a4fffd03960] async_page_fault at ffffffffa696c798
    [exception RIP: __kmalloc+148]
    RIP: ffffffffa641d8e4  RSP: ffff8a4fffd03a18  RFLAGS: 00010282
    RAX: 0000000000000000  RBX: ffff8a4fcb2487c0  RCX: 000000000000a64f
    RDX: 000000000000a64e  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff8a4fffd03a48   R8: 000000000001f040   R9: ffffffffc03e049b
    R10: ffff8a4d3fc03e00  R11: 0000000000000000  R12: 0000000000000020
    R13: ffff8a007365735f  R14: 0000000000000008  R15: ffff8a4d3fc03e00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#11 [ffff8a4fffd03a50] tcp_v4_syn_recv_sock_toa at ffffffffc03e049b [toa]
#12 [ffff8a4fffd03ac0] tcp_check_req at ffffffffa68abfcf
#13 [ffff8a4fffd03b30] tcp_v4_do_rcv at ffffffffa68a999f
#14 [ffff8a4fffd03b78] tcp_v4_rcv at ffffffffa68aaf6c
#15 [ffff8a4fffd03c10] ip_local_deliver_finish at ffffffffa68838dd
#16 [ffff8a4fffd03c38] ip_local_deliver at ffffffffa6883bc9
#17 [ffff8a4fffd03c90] ip_rcv_finish at ffffffffa6883540
#18 [ffff8a4fffd03cb8] ip_rcv at ffffffffa6883ef9
#19 [ffff8a4fffd03d20] __netif_receive_skb_core at ffffffffa683a579
#20 [ffff8a4fffd03d90] __netif_receive_skb at ffffffffa683a878
#21 [ffff8a4fffd03db0] netif_receive_skb_internal at ffffffffa683a900
#22 [ffff8a4fffd03de0] napi_gro_receive at ffffffffa683b588
#23 [ffff8a4fffd03e08] virtnet_poll at ffffffffc03ed1f5 [virtio_net]
#24 [ffff8a4fffd03e78] net_rx_action at ffffffffa683af1f
#25 [ffff8a4fffd03ef8] __do_softirq at ffffffffa62a1075
#26 [ffff8a4fffd03f68] call_softirq at ffffffffa697932c
#27 [ffff8a4fffd03f80] do_softirq at ffffffffa622e675
#28 [ffff8a4fffd03fa0] irq_exit at ffffffffa62a13f5
#29 [ffff8a4fffd03fb8] do_IRQ at ffffffffa697a606
--- <IRQ stack> ---
#30 [ffff8a4d3a383df8] ret_from_intr at ffffffffa696c362
    [exception RIP: native_safe_halt+11]
    RIP: ffffffffa696af5b  RSP: ffff8a4d3a383ea8  RFLAGS: 00000246
    RAX: ffffffffa696ad30  RBX: ffff8a4fffd15ec0  RCX: 0100000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000046
    RBP: ffff8a4d3a383ea8   R8: 0000000000000000   R9: ffff8a4fe728bda8
    R10: 0000000000000000  R11: 7fffffffffffffff  R12: 0000000000000004
    R13: 000005fb53bfaa00  R14: cc9a9800d079cbbe  R15: 0000000000000086
    ORIG_RAX: ffffffffffffff3c  CS: 0010  SS: 0018
#31 [ffff8a4d3a383eb0] default_idle at ffffffffa696ad4e
#32 [ffff8a4d3a383ed0] arch_cpu_idle at ffffffffa62366f0
#33 [ffff8a4d3a383ee0] cpu_startup_entry at ffffffffa62fc6da
#34 [ffff8a4d3a383f28] start_secondary at ffffffffa6258047
#35 [ffff8a4d3a383f50] start_cpu at ffffffffa62000d5
crash>

草率+迷糊

为什么cpu在内核收包时突然发生了异常,异常号是14,执行async_page_fault处理异常,处理异常时发现发现cr2中地址处于内核空间中,内核是不会发生page fault的,async_page_fault MMU会触发,kvm环境下kvm也会触发,那一定是kvm的问题了。

guest初始化时设置异常处理函数,并且write msr把地址的一块物理内相给了kvm,说你要给我触发异常就写这个物理地址。

代码语言:javascript
复制
kvm_guest_init
           if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
		x86_init.irqs.trap_init = kvm_apf_trap_init;


static void __init kvm_apf_trap_init(void)
{
	set_intr_gate(14, async_page_fault);
}


static void kvm_guest_cpu_init(void)
{
	if (!kvm_para_available())
		return;

	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));

#ifdef CONFIG_PREEMPT
		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
#endif
		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
		__this_cpu_write(apf_reason.enabled, 1);
		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
		       smp_processor_id());
	}
}

异常来了,kvm开始处理,发现reason不是kvm,然后进入普通处理流程,发现pagefault地址在内核空间,内核不能发生pagefault,crash了。

代码语言:javascript
复制
async_page_fault->do_async_page_fault


u32 kvm_read_and_reset_pf_reason(void)
{
	u32 reason = 0;

	if (__this_cpu_read(apf_reason.enabled)) {
		reason = __this_cpu_read(apf_reason.reason);
		__this_cpu_write(apf_reason.reason, 0);
	}

	return reason;
}

do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	enum ctx_state prev_state;

	switch (kvm_read_and_reset_pf_reason()) {
	default:
		trace_do_page_fault(regs, error_code);
		break;
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* page is swapped out by the host. */
		prev_state = exception_enter();
		exit_idle();
		kvm_async_pf_task_wait((u32)read_cr2());
		exception_exit(prev_state);
		break;
	case KVM_PV_REASON_PAGE_READY:
		rcu_irq_enter();
		exit_idle();
		kvm_async_pf_task_wake((u32)read_cr2());
		rcu_irq_exit();
		break;
	}
}

dotraplinkage void __kprobes
trace_do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	enum ctx_state prev_state;
	/*
	 * The exception_enter and tracepoint processing could
	 * trigger another page faults (user space callchain
	 * reading) and destroy the original cr2 value, so read
	 * the faulting address now.
	 */
	unsigned long address = read_cr2();

	prev_state = exception_enter();
	trace_page_fault_entries(regs, error_code);
	__do_page_fault(regs, error_code, address);
	exception_exit(prev_state);
}

最终决定把kvm apf关了尝试一下,guest kernel加参数no-kvmapf,guest不再写MSR_KVM_ASYNC_PF_EN,那么kvm也就不再给guest inject这样的irq,guest就不会执行到async_page_fault。

第二天早上一看,又crash重启了,决定回头再分析一下。

如果host把guest的page swap out了, guest vcpu上运行的进程触发了EPT violation要等kvm把page swap in才能恢复运行,要等很长时间,所以asyn page fault就想着把发生EPT violation的vcpu换个进程运行,就注入一个异常通知vcpu切换进程,等page swap in了再注入异常通知vcpu可以运行刚才被切换走的进程了,host上内存足够用,没有把任何内存swap out出去。

想不明白为什么要把这种机制和内核pagefault handler混在一起,看了代码感觉cpu 发生pagefault和kvm有关系,其实cpu发生pagefault和kvm EPT violation没什么关系,各干各的活,不细想就迷糊了,第一天就是没想明白,有点迷糊。

再次尝试

toa是什么东西,网上搜索发现是配合lvs/dpvs用的,正好组里有做lvs和dpvs的同事,要了代码看了看,其实代码是开源的,最早开源的toa只有ipv4的,ipv4地址只有32bit,系统调用一个寄存器就传到用户态了,可ipv6地址有128 bit,一个寄存器放不下,只能内核kmalloc暂存,copy到用户态,然后内核再找个时机kfree,公司同事就是这样修改了。

怀疑toa模块中kmalloc后memcpy时长度不对,把重要数据覆盖了。地址不对,长度不对,非法报文都排查了一遍,看代码没问题。

代码语言:javascript
复制
while (length > 0) {
			int opcode = *ptr++;
			int opsize;
			switch (opcode) {
			case TCPOPT_EOL:
				return NULL;
			case TCPOPT_NOP:	/* Ref: RFC 793 section 3.1 */
				length--;
				continue;
			default:
				opsize = *ptr++;
				if (opsize < 2)	/* "silly options" */
					return NULL;
				if (opsize > length)
					return NULL;	/* don't parse partial options */
				if ((TCPOPT_TOA == opcode && TCPOLEN_TOA == opsize) ||
                     (TCPOPT_TOA_V6 == opcode && TCPOLEN_TOA_V6 == opsize) ) {
					memcpy(&tdata, ptr - 2, opsize);
					//TOA_DBG("find toa data: ip = %u.%u.%u.%u, port = %u\n", NIPQUAD(tdata.ip),
						//ntohs(tdata.port));
					ret_ptr = kmalloc(opsize, GFP_ATOMIC);
                    if (! ret_ptr)
                        return NULL;
                    memcpy(ret_ptr, &tdata, opsize);
					return ret_ptr;
				}
				ptr += opsize - 2;
				length -= opsize;
			}
		}

怀疑cpu barrie或者aligned不对,然后看汇编似乎也没问题。

代码语言:javascript
复制
crash>  dis tcp_v4_syn_recv_sock_toa
0xffffffffc03e043b <tcp_v4_syn_recv_sock_toa+299>:      cmp    $0x1,%esi
0xffffffffc03e043e <tcp_v4_syn_recv_sock_toa+302>:      jle    0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa+232>
0xffffffffc03e0440 <tcp_v4_syn_recv_sock_toa+304>:      cmp    %r13d,%esi
0xffffffffc03e0443 <tcp_v4_syn_recv_sock_toa+307>:      jg     0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa+232>
0xffffffffc03e0445 <tcp_v4_syn_recv_sock_toa+309>:      cmp    $0x8,%esi
0xffffffffc03e0448 <tcp_v4_syn_recv_sock_toa+312>:      je     0xffffffffc03e04dc <tcp_v4_syn_recv_sock_toa+460>
0xffffffffc03e044e <tcp_v4_syn_recv_sock_toa+318>:      cmp    $0x14,%esi
0xffffffffc03e0451 <tcp_v4_syn_recv_sock_toa+321>:      je     0xffffffffc03e0470 <tcp_v4_syn_recv_sock_toa+352>
0xffffffffc03e0453 <tcp_v4_syn_recv_sock_toa+323>:      lea    -0x2(%rdi,%rdx,1),%rax
0xffffffffc03e0458 <tcp_v4_syn_recv_sock_toa+328>:      sub    %esi,%r13d
0xffffffffc03e045b <tcp_v4_syn_recv_sock_toa+331>:      jmpq   0xffffffffc03e03d7 <tcp_v4_syn_recv_sock_toa+199>
0xffffffffc03e0460 <tcp_v4_syn_recv_sock_toa+336>:      add    $0x14,%rax
0xffffffffc03e0464 <tcp_v4_syn_recv_sock_toa+340>:      jne    0xffffffffc03e03d7 <tcp_v4_syn_recv_sock_toa+199>
0xffffffffc03e046a <tcp_v4_syn_recv_sock_toa+346>:      jmp    0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa+232>
0xffffffffc03e046c <tcp_v4_syn_recv_sock_toa+348>:      nopl   0x0(%rax)
0xffffffffc03e0470 <tcp_v4_syn_recv_sock_toa+352>:      cmp    $0xfd,%ecx
0xffffffffc03e0476 <tcp_v4_syn_recv_sock_toa+358>:      jne    0xffffffffc03e0453 <tcp_v4_syn_recv_sock_toa+323>
0xffffffffc03e0478 <tcp_v4_syn_recv_sock_toa+360>:      lea    -0x5c(%rbp),%r13
0xffffffffc03e047c <tcp_v4_syn_recv_sock_toa+364>:      movzbl %dl,%r12d
0xffffffffc03e0480 <tcp_v4_syn_recv_sock_toa+368>:      mov    %rax,%rsi
0xffffffffc03e0483 <tcp_v4_syn_recv_sock_toa+371>:      mov    %r12,%rdx
0xffffffffc03e0486 <tcp_v4_syn_recv_sock_toa+374>:      mov    %r13,%rdi
0xffffffffc03e0489 <tcp_v4_syn_recv_sock_toa+377>:      callq  0xffffffffa6586b60 <__memcpy>
0xffffffffc03e048e <tcp_v4_syn_recv_sock_toa+382>:      mov    $0x20,%esi
0xffffffffc03e0493 <tcp_v4_syn_recv_sock_toa+387>:      mov    %r12,%rdi
0xffffffffc03e0496 <tcp_v4_syn_recv_sock_toa+390>:      callq  0xffffffffa641d850 <__kmalloc>
0xffffffffc03e049b <tcp_v4_syn_recv_sock_toa+395>:      test   %rax,%rax
0xffffffffc03e049e <tcp_v4_syn_recv_sock_toa+398>:      je     0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa+232>
0xffffffffc03e04a4 <tcp_v4_syn_recv_sock_toa+404>:      mov    %r12,%rdx
0xffffffffc03e04a7 <tcp_v4_syn_recv_sock_toa+407>:      mov    %r13,%rsi
0xffffffffc03e04aa <tcp_v4_syn_recv_sock_toa+410>:      mov    %rax,%rdi
0xffffffffc03e04ad <tcp_v4_syn_recv_sock_toa+413>:      callq  0xffffffffa6586b60 <__memcpy>

看汇编没看出问题,突然想到既然kmalloc用的地方没问题,哪什么地方kfree的,会不会kfree有问题。

代码语言:javascript
复制
crash> dis __kmalloc
0xffffffffa641d8b1 <__kmalloc+97>:      mov    (%r10),%r8 //r10是s,取s->cpu_slab到r8
0xffffffffa641d8b4 <__kmalloc+100>:     add    %gs:0x59bf38e4(%rip),%r8        # 0x111a0
0xffffffffa641d8bc <__kmalloc+108>:     mov    0x8(%r8),%rdx  //r8是c, rdx是tid
0xffffffffa641d8c0 <__kmalloc+112>:     mov    (%r8),%r13  //r13是object
0xffffffffa641d8c3 <__kmalloc+115>:     mov    0x10(%r8),%rax //取page
0xffffffffa641d8c7 <__kmalloc+119>:     test   %r13,%r13  //if !object
0xffffffffa641d8ca <__kmalloc+122>:     je     0xffffffffa641d9f9 <__kmalloc+425>
0xffffffffa641d8d0 <__kmalloc+128>:     test   %rax,%rax  //if !page
0xffffffffa641d8d3 <__kmalloc+131>:     je     0xffffffffa641d9f9 <__kmalloc+425>
0xffffffffa641d8d9 <__kmalloc+137>:     movslq 0x20(%r10),%rax  //r10是s,取offset
0xffffffffa641d8dd <__kmalloc+141>:     lea    0x1(%rdx),%rcx  //next_tid
0xffffffffa641d8e1 <__kmalloc+145>:     mov    (%r10),%r8 //取s->cpu_slab到r8
0xffffffffa641d8e4 <__kmalloc+148>:     mov    0x0(%r13,%rax,1),%rbx
0xffffffffa641d8e9 <__kmalloc+153>:     mov    %r13,%rax
0xffffffffa641d8ec <__kmalloc+156>:     cmpxchg16b %gs:(%r8)
0xffffffffa641d8f1 <__kmalloc+161>:     sete   %al
0xffffffffa641d8f4 <__kmalloc+164>:     test   %al,%al
0xffffffffa641d8f6 <__kmalloc+166>:     je     0xffffffffa641d8b1 <__kmalloc+97>  //goto redo;
0xffffffffa641d8f8 <__kmalloc+168>:     movslq 0x20(%r10),%rax //s->offset
0xffffffffa641d8fc <__kmalloc+172>:     test   $0x8000,%r12d  //gfpflags & __GFP_ZERO
0xffffffffa641d903 <__kmalloc+179>:     prefetcht0 (%rbx,%rax,1)

出问题时RIP在这一行

代码语言:javascript
复制
mov    0x0(%r13,%rax,1),%rbx 

对应着如下代码,难就难在全都是inline函数,汇编和代码不好对应。

代码语言:javascript
复制
__kmalloc->slab_alloc->slab_alloc_node->get_freepointer_safe->get_freepointer
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
	return *(void **)(object + s->offset);
}

简化一下就是*(object + s->offset),把object指针偏移一下,然后取指针指向的内容,指针有问题,指向的地址没有物理page对应,发生了pagefault。object来自于freelist,大概率就是freelist有问题,为什么freelist有问题,基本可以确定是重复kfree了。

代码语言:javascript
复制
crash> kmem -s ffff8a4d3fc03e00
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
kmem: kmalloc-8: slab: ffffd3ca90ffaf00 duplicate freelist object: ffff8a4fffebca10
kmem: kmalloc-8: slab: ffffd3ca90b335c0 duplicate freelist object: ffff8a4feccd79a8
kmem: kmalloc-8: slab: ffffd3ca90b0c600 duplicate freelist object: ffff8a4fec318fe8
kmem: kmalloc-8: slab: ffffd3ca8cf65e00 duplicate freelist object: ffff8a4efd978e38
kmem: kmalloc-8: slab: ffffd3ca8cf4e100 duplicate freelist object: ffff8a4efd384d80
kmem: kmalloc-8: slab: ffffd3ca8cf679c0 duplicate freelist object: ffff8a4efd9e7b70
ffff8a4d3fc03e00 kmalloc-8 

看代码kfree(sk->sk_user_data)后没有置空sk->sk_user_data = NULL。

把sk->sk_user_data = NULL加到代码,全新编译内核模块,结果后来又crash了,看了toa的统计我发现free次数还是多于malloc次数。

代码语言:javascript
复制
cat /proc/1/task/1/net/toa_stats  > toa_stats.txt
j="0"; for i in $(awk 'NR==2{print $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9}' toa_stats.txt); do  j=$[$i+$j];  done;
k="0"; for i in $(awk 'NR==4{print $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9}' toa_stats.txt); do  k=$[$i+$k];  done; 
echo "kmalloc $j kfree $k"

进一步怀疑是没有加锁导致的,一个线程accept后,另两个线程同时调用getpeername就会出现重复kfree,得加锁保护。

还有一个问题就是没有调用getpeername就会存在内存泄露。

看dpvs中toa已经修改sk_free时kfree(sk->sk_user_data),并且做了保护,kfree和getpeername互斥,但加的锁是per cpu spin lock,用的是spin_lock_bh和spin_unlock_bh,访问是在process context中,kmalloc是在soft irq context中,kfree有可能在soft irq也有可能在process context中,但不能保证一定是同一个cpu,比如一个cpu线程getpeername,另一个cpu线程close那么也存在问题,假如不允许这样,会不会一个cpu线程getpeername,此时收到了reset包,另一个cpu在soft irq中kfree,感觉这样的情况是会出现的。

看汇编是很有意思的一件事,三分基础七分观察。

遗留问题

alloc_percpu的数据在crash中怎么查看?想看toa的kmalloc和kfree的统计数据,统计数据是动态alloc的percpu数据,怎么找到它的内存,看汇编和gs寄存器有关系,推理了好久没搞明白,有懂的请留言指教。

参考文献

iqiyi/dpvs

汇编语言基础 - 文集 - 简书

铁甲万能狗:第9篇-C/C++ x86_64的函数栈

本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 起因
  • 草率+迷糊
  • 再次尝试
  • 遗留问题
  • 参考文献
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档