RAS (Part 4) Intel MCA-Uncorrected Recoverable

Linux阅码场

Published 2023-08-21 15:48:12


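Recovery of uncorrected recoverable (UCR) errors is an MCA enhancement. For some errors that the hardware cannot correct, it gives software a chance to isolate the fault and recover. A UCR error means that hardware has detected the error and signaled it to software; once software has taken the appropriate recovery action (so that the corrupted data is neither consumed nor propagated), the system can keep running.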

UCR Feature Support

The MCG_SER_P bit (bit 24) of IA32_MCG_CAP indicates whether software error recovery is supported. When the bit is set, the processor supports software error recovery.
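The capability check can be sketched in a few lines of C. This is only an illustration working on a raw IA32_MCG_CAP value that the caller has already read; the helper name `mcg_cap_has_ser_p` is hypothetical and not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCG_SER_P is bit 24 of IA32_MCG_CAP. */
#define MCG_SER_P (1ULL << 24)

/*
 * Hypothetical helper: returns true when the given IA32_MCG_CAP value
 * advertises software error recovery support.
 */
bool mcg_cap_has_ser_p(uint64_t mcg_cap)
{
	return (mcg_cap & MCG_SER_P) != 0;
}
```

How the raw value is obtained (for example through an MSR read in kernel code) is outside the scope of this sketch.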

UCR Error

Error Reporting

The IA32_MCi_STATUS MSR reports UCR errors as well as the existing corrected and uncorrected errors.

UCR errors can be signaled through either a corrected machine check interrupt (CMCI) or a machine check exception (MCE).

Identifying a UCR Error

When IA32_MCG_CAP[24] is set, the following bit settings in IA32_MCi_STATUS indicate that a UCR error has occurred:

• Valid (bit 63) = 1

• UC (bit 61) = 1

• PCC (bit 57) = 0
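The three conditions above translate directly into a bit test. The sketch below assumes the caller has already confirmed that MCG_SER_P is set; the function name `status_is_ucr` is made up for illustration, while the bit positions are the ones listed above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63) /* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */

/* Hypothetical helper: true when Valid = 1, UC = 1 and PCC = 0. */
bool status_is_ucr(uint64_t status)
{
	const uint64_t mask = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_PCC;

	return (status & mask) == (MCI_STATUS_VAL | MCI_STATUS_UC);
}
```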

Fault Address

When both the ADDRV and MISCV flags in IA32_MCi_STATUS are set to 1, the address information can be read from the IA32_MCi_MISC and IA32_MCi_ADDR registers.
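As a rough sketch of how the two registers combine: ADDRV and MISCV are bits 58 and 59 of IA32_MCi_STATUS, and bits 5:0 of IA32_MCi_MISC give the least significant valid bit of the address. The function name `mce_fault_address` is hypothetical, and the sketch assumes the MISC register uses that recoverable-address layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_MISCV (1ULL << 59) /* IA32_MCi_MISC is valid */
#define MCI_STATUS_ADDRV (1ULL << 58) /* IA32_MCi_ADDR is valid */

/* IA32_MCi_MISC bits 5:0: least significant valid bit of the address. */
#define MISC_ADDR_LSB(misc) ((unsigned)((misc) & 0x3f))

/*
 * Hypothetical helper: stores the fault address, with the invalid low
 * bits cleared, in *out. Returns false when ADDRV or MISCV is not set.
 */
bool mce_fault_address(uint64_t status, uint64_t addr, uint64_t misc,
		       uint64_t *out)
{
	if (!(status & MCI_STATUS_ADDRV) || !(status & MCI_STATUS_MISCV))
		return false;

	unsigned lsb = MISC_ADDR_LSB(misc);

	*out = (addr >> lsb) << lsb;
	return true;
}
```

With an LSB of 12, for example, the result is aligned to a 4 KiB page, which is the granularity a page-offlining recovery path works with.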

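UCR Error Classification

Bits 56:55 of IA32_MCi_STATUS describe the type of the UCR error and the recovery action:

• S (Signaling) flag, bit 56: when set, a machine check exception was generated for the UCR error recorded in this MC bank, and software must check the AR flag to determine the recovery action to take;

• AR (Action Required) flag, bit 55: when set, software must take the recovery action that matches this error. The action must complete successfully before any further work is scheduled on this processor. This matters because otherwise a program may consume the corrupted data and spread it, breaking data integrity, which is fatal for cloud providers.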

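UCR errors fall into the following classes:

• Uncorrected no action required (UCNA): reported through CMCI. A UCNA error means that some data in the system is corrupted but has not been consumed (i.e. not yet read), the processor state is still valid, and programs can continue to run on the processor.

• Software recoverable action optional (SRAO): reported through MCE or CMCI. An SRAO error means that some data in the system is corrupted but has not been consumed. Recovery is optional; software can choose a recovery strategy based on MCACOD.

• Software recoverable action required (SRAR): reported through MCE. An SRAR error means that some data in the system is corrupted and is being consumed. Software must take a recovery action before scheduling further work on this CPU (typically killing the process running on it, but not limited to that). If recovery is impossible, for example because the address or task information cannot be obtained, the kernel should panic.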
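The classification can be sketched from the S and AR flags alone. The names `enum ucr_class` and `classify_ucr` are hypothetical. The sketch looks only at the flags, not at MCACOD, so an SRAO error that was reported through CMCI with S = 0 would show up here as UCNA.

```c
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)
#define MCI_STATUS_UC  (1ULL << 61)
#define MCI_STATUS_PCC (1ULL << 57)
#define MCI_STATUS_S   (1ULL << 56) /* signaled through MCE */
#define MCI_STATUS_AR  (1ULL << 55) /* recovery action required */

enum ucr_class { UCR_NONE, UCR_UCNA, UCR_SRAO, UCR_SRAR };

/* Hypothetical flag-only classifier for a raw IA32_MCi_STATUS value. */
enum ucr_class classify_ucr(uint64_t status)
{
	const uint64_t mask = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_PCC;

	/* Not a UCR error at all: invalid, corrected, or PCC = 1. */
	if ((status & mask) != (MCI_STATUS_VAL | MCI_STATUS_UC))
		return UCR_NONE;

	if (status & MCI_STATUS_S)
		return (status & MCI_STATUS_AR) ? UCR_SRAR : UCR_SRAO;

	/* S = 0 with AR = 1 is not one of the classes described above. */
	return (status & MCI_STATUS_AR) ? UCR_NONE : UCR_UCNA;
}
```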

UCR Overwrite Rules

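As described earlier, if another UCR error occurs in the same bank while the previous one is still being handled, the result is an error overflow and possibly an overwrite. Each bank has only one set of registers for recording error information, so when this happens the hardware has to discard one of the two records. Which one does it keep? MCA defines the following rules: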

•UCR errors will overwrite corrected errors.

•Uncorrected (PCC=1) errors overwrite UCR (PCC=0) errors.

•UCR errors are not written over previous UCR errors.

•Corrected errors do not write over previous UCR errors

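In summary:

• A more severe error overwrites a less severe one, and a less severe error never overwrites a more severe one;

• Between UCR errors of the same level, the later one does not overwrite the earlier one.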
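The four quoted rules amount to a strict ordering: corrected, then UCR (PCC=0), then uncorrected (PCC=1). A minimal model is sketched below; `enum err_kind`, `overwrites`, and `bank_after` are illustrative names. The rules do not say what happens when a corrected error follows another corrected error, so keeping the first record in that case is an assumption of this sketch.

```c
#include <stdbool.h>

/* Ordered from least to most severe, as in the rules above. */
enum err_kind { ERR_NONE, ERR_CORRECTED, ERR_UCR, ERR_UC_PCC };

/* A new error replaces the logged one only if it is strictly more severe. */
bool overwrites(enum err_kind logged, enum err_kind incoming)
{
	return incoming > logged;
}

/* Which record the bank holds after the incoming error is processed. */
enum err_kind bank_after(enum err_kind logged, enum err_kind incoming)
{
	return overwrites(logged, incoming) ? incoming : logged;
}
```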

[Table: detailed overwrite rules. The table image is not reproduced here.]

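In practice, on Intel x86 servers running cloud workloads, some memory-related crashes are caused by a memory double UCE. The author has also reproduced this kind of crash (https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/+/60c3182214feb4b192234eb980f30e109bbde5cd).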

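Kernel Handling of UC/SRAR/SRAO Errors

The previous article, which covered how the kernel handles CE errors reported through CMCI, also touched on UCNA errors. Both CE and UCNA are reported through CMCI and handled in largely the same way, so that flow is not repeated here. The rest of this article focuses on the kernel's MCE handler, do_machine_check() (Linux v6.3, arch/x86/kernel/cpu/mce/core.c).

1. Reading the Comment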

```c
/*
 * The actual machine check handler. This only handles real exceptions when
 * something got corrupted coming in through int 18.
 *
 * This is executed in #MC context not subject to normal locking rules.
 * This implies that most kernel services cannot be safely used. Don't even
 * think about putting a printk in there!
 *
 * On Intel systems this is entered on all CPUs in parallel through
 * MCE broadcast. However some CPUs might be broken beyond repair,
 * so be always careful when synchronizing with others.
 *
 * Tracing and kprobes are disabled: if we interrupted a kernel context
 * with IF=1, we need to minimize stack usage. There are also recursion
 * issues: if the machine check was due to a failure of the memory
 * backing the user stack, tracing that reads the user stack will cause
 * potentially infinite recursion.
 *
 * Currently, the #MC handler calls out to a number of external facilities
 * and, therefore, allows instrumentation around them. The optimal thing to
 * have would be to do the absolutely minimal work required in #MC context
 * and have instrumentation disabled only around that. Further processing can
 * then happen in process context where instrumentation is allowed. Achieving
 * that requires careful auditing and modifications. Until then, the code
 * allows instrumentation temporarily, where required.
 *
 */
noinstr void do_machine_check(struct pt_regs *regs)
{
	...
```

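The comment on this function is packed with information, so it is well worth reading carefully.

• This function is the handler for the machine check exception (int 18). It runs in exception (#MC) context, which is not subject to the normal locking rules, so most kernel services cannot be used safely. Do not even think about calling printk in it.

• On Intel, #MC is broadcast to all CPUs at the same time. Some CPUs, however, may be broken beyond repair and unable to run the handler, so synchronizing with the other CPUs must be done carefully. Two points follow. First, every CPU receives #MC, so every healthy CPU runs do_machine_check(). Second, a CPU with damaged hardware may not be able to run do_machine_check() properly, and the code has to allow for that.

• Tracing and kprobes are disabled: if the exception interrupted a kernel context with IF=1, stack usage has to be kept to a minimum.

• Recursion: if the machine check was caused by a failure of the memory backing the user stack, reading the user stack again can lead to potentially infinite recursion. In other words, the machine check path must not read the program's stack memory again, because it may hit the faulty memory, trigger another #MC, and loop forever.

• The last paragraph of the comment (quoted below) should be read together with the do_machine_check() code. The function has two halves. The first half runs in exception context and does the core work: reading the registers, classifying the error, and resetting the registers. The second half uses task_work_add() to create a task work that carries out the remaining recovery steps, such as offlining the page and killing the task. The task work runs in process context, where operations that are not allowed in exception context become possible. This is similar to the top-half/bottom-half split of interrupt handling.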

Currently, the #MC handler calls out to a number of external facilities and, therefore, allows instrumentation around them. The optimal thing to have would be to do the absolutely minimal work required in #MC context and have instrumentation disabled only around that. Further processing can then happen in process context where instrumentation is allowed. Achieving that requires careful auditing and modifications. Until then, the code allows instrumentation temporarily, where required.

2. Gathering Register Information

```c
noinstr void do_machine_check(struct pt_regs *regs)
{
	...
	this_cpu_inc(mce_exception_count);

	mce_gather_info(&m, regs);
	m.tsc = rdtsc();

	final = this_cpu_ptr(&mces_seen);
	*final = m;

	no_way_out = mce_no_way_out(&m, &msg, valid_banks, regs);

	barrier();
```

The machine check handler first calls mce_gather_info() and mce_no_way_out() to read register information such as MCG_STATUS, the instruction pointer (RIP), and MCA_STATUS, and stores it in struct mce, which is defined in arch/x86/include/uapi/asm/mce.h:

```c
struct mce {
	__u64 status;		/* Bank's MCi_STATUS MSR */
	__u64 misc;		/* Bank's MCi_MISC MSR */
	__u64 addr;		/* Bank's MCi_ADDR MSR */
	__u64 mcgstatus;	/* Machine Check Global Status MSR */
	__u64 ip;		/* Instruction Pointer when the error happened */
	__u64 tsc;		/* CPU time stamp counter */
	__u64 time;		/* Wall time_t when error was detected */
	...
```

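Note that in mce_no_way_out(), if mce_severity() grades the error as MCE_PANIC_SEVERITY, i.e. an unrecoverable error, the function returns 1 right away. The value is stored in no_way_out, and later code that sees no_way_out = 1 goes straight to mce_panic(), which prints the error information and panics.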

```c
static __always_inline int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
					  struct pt_regs *regs)
{
	...
	for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
		...
		m->bank = i;
		if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
			mce_read_aux(m, i);
			*msg = tmp;
			return 1;
		}
	}
	return 0;
}
```
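To make the early-exit logic concrete, here is a user-space model that can be run outside the kernel. It is a sketch under stated assumptions: `toy_severity()` is a hypothetical, heavily simplified stand-in for mce_severity() (the real function walks a much larger rule table and considers context), and `toy_no_way_out()` works on an array of raw status values instead of reading MSRs. The enum mirrors the kernel's enum severity_level.

```c
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)
#define MCI_STATUS_UC  (1ULL << 61)
#define MCI_STATUS_PCC (1ULL << 57)
#define MCI_STATUS_S   (1ULL << 56)
#define MCI_STATUS_AR  (1ULL << 55)

enum severity_level {
	MCE_NO_SEVERITY,
	MCE_DEFERRED_SEVERITY,
	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
	MCE_KEEP_SEVERITY,
	MCE_SOME_SEVERITY,
	MCE_AO_SEVERITY,
	MCE_UC_SEVERITY,
	MCE_AR_SEVERITY,
	MCE_PANIC_SEVERITY,
};

/* Hypothetical, simplified grading based on status flags only. */
enum severity_level toy_severity(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return MCE_NO_SEVERITY;
	if (!(status & MCI_STATUS_UC))
		return MCE_KEEP_SEVERITY;	/* corrected error */
	if (status & MCI_STATUS_PCC)
		return MCE_PANIC_SEVERITY;	/* context corrupt */
	if (!(status & MCI_STATUS_S))
		return MCE_UCNA_SEVERITY;
	return (status & MCI_STATUS_AR) ? MCE_AR_SEVERITY : MCE_AO_SEVERITY;
}

/*
 * Returns 1 and stores the bank index in *fatal_bank as soon as one valid
 * bank grades at panic level, mirroring the loop in mce_no_way_out().
 */
int toy_no_way_out(const uint64_t *bank_status, int nbanks, int *fatal_bank)
{
	for (int i = 0; i < nbanks; i++) {
		if (!(bank_status[i] & MCI_STATUS_VAL))
			continue;
		if (toy_severity(bank_status[i]) >= MCE_PANIC_SEVERITY) {
			*fatal_bank = i;
			return 1;
		}
	}
	return 0;
}
```

Because the enum is ordered by severity, a single `>=` comparison is enough to ask whether an error is at least as bad as a given level, which is the same pattern the kernel code uses.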

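MCE_PANIC_SEVERITY is defined in enum severity_level in arch/x86/kernel/cpu/mce/internal.h, which grades errors by severity. Besides the DE, UCNA, AO, UC, AR, and PANIC levels mentioned above, it also contains levels such as KEEP and SOME.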

```c
enum severity_level {
	MCE_NO_SEVERITY,
	MCE_DEFERRED_SEVERITY,
	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
	MCE_KEEP_SEVERITY,
	MCE_SOME_SEVERITY,
	MCE_AO_SEVERITY,
	MCE_UC_SEVERITY,
	MCE_AR_SEVERITY,
	MCE_PANIC_SEVERITY,
};
```

3. Local and Non-Local Machine Check Handling

```c
	/*
	 * Local machine check may already know that we have to panic.
	 * Broadcast machine check begins rendezvous in mce_start()
	 * Go through all banks in exclusion of the other CPUs. This way we
	 * don't report duplicated events on shared banks because the first one
	 * to see it will clear it.
	 */
	if (lmce) {
		if (no_way_out)
			mce_panic("Fatal local machine check", &m, msg);
	} else {
		order = mce_start(&no_way_out);
	}
```

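The function first checks lmce to decide whether this is a local machine check, i.e. one confined to the local processor. If it is and no_way_out is 1, it calls mce_panic() to print the error information and panic. If it is not, the banks have to be scanned to find the processor that actually faulted and the associated error information.

As noted above, the #MC exception is delivered to all CPUs, so all of them run do_machine_check(). How is this parallel execution handled? The MCE driver designates one Monarch CPU and treats all the others as Subject CPUs. The CPUs then go through their banks one at a time, starting with the Monarch, while the others wait for their turn, so that an error in a shared bank is seen and cleared by only one CPU. The core functions are mce_start() and mce_end().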

```c
static noinstr int mce_start(int *no_way_out)
{
	/*
	 * Wait until every CPU has arrived here.
	 */
	while (arch_atomic_read(&mce_callin) != num_online_cpus()) {
		if (mce_timed_out(&timeout,
				  "Timeout: Not all CPUs entered broadcast exception handler")) {
			arch_atomic_set(&global_nwo, 0);
			goto out;
		}
		ndelay(SPINUNIT);
	}
	...
	if (order == 1) {
		/*
		 * The Monarch CPU proceeds and sets mce_executing = 1.
		 */
		arch_atomic_set(&mce_executing, 1);
	} else {
		/*
		 * Subject CPUs wait here until it is their turn.
		 */
		while (arch_atomic_read(&mce_executing) < order) {
			if (mce_timed_out(&timeout,
					  "Timeout: Subject CPUs unable to finish machine check processing")) {
				arch_atomic_set(&global_nwo, 0);
				goto out;
			}
			ndelay(SPINUNIT);
		}
	}
```
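The ticket-style ordering used by mce_start() can be modeled without threads. The following single-threaded simulation is only a sketch: `struct rendezvous`, `mce_enter`, `may_scan`, `scan_done`, and `simulate` are hypothetical names, plain integers stand in for the kernel's atomics, and timeouts are left out. It shows that CPUs scan in call-in order no matter in which order they are polled.

```c
#define MAX_CPUS 8

struct rendezvous {
	int callin;	/* number of CPUs that have entered the handler */
	int executing;	/* order value of the CPU allowed to scan now */
};

/* Each CPU takes a ticket on entry; ticket 1 makes it the Monarch. */
int mce_enter(struct rendezvous *r)
{
	return ++r->callin;
}

/* A CPU may scan once every CPU with a lower order has finished. */
int may_scan(const struct rendezvous *r, int order)
{
	return r->executing >= order;
}

/* Finishing a scan lets the next CPU in line proceed. */
void scan_done(struct rendezvous *r)
{
	r->executing++;
}

/*
 * entry[] lists CPU ids (a permutation of 0..ncpus-1) in arrival order.
 * CPUs are polled in id order; scanned[] receives the ids in the order
 * in which they actually scanned. Returns the number of CPUs scanned,
 * or -1 for an unsupported CPU count.
 */
int simulate(const int *entry, int ncpus, int *scanned)
{
	struct rendezvous r = { 0, 0 };
	int order[MAX_CPUS];
	int done[MAX_CPUS] = { 0 };
	int n = 0;

	if (ncpus < 1 || ncpus > MAX_CPUS)
		return -1;

	for (int i = 0; i < ncpus; i++)
		order[entry[i]] = mce_enter(&r);

	/* Everyone has called in: the Monarch (order 1) starts. */
	r.executing = 1;

	while (n < ncpus) {
		for (int cpu = 0; cpu < ncpus; cpu++) {
			if (done[cpu] || !may_scan(&r, order[cpu]))
				continue;
			scanned[n++] = cpu;
			done[cpu] = 1;
			scan_done(&r);
		}
	}
	return n;
}
```

If CPU 2 arrives first, followed by CPU 0 and CPU 1, then CPU 2 acts as the Monarch and the scan order is 2, 0, 1.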

The code that follows is then run by one CPU at a time, starting with the Monarch CPU:

```c
	taint = __mc_scan_banks(&m, regs, final, toclear, valid_banks, no_way_out, &worst);

	if (!no_way_out)
		mce_clear_state(toclear);

	/*
	 * Do most of the synchronization with other CPUs.
	 * When there's any problem use only local no_way_out state.
	 */
	if (!lmce) {
		if (mce_end(order) < 0) {
			if (!no_way_out)
				no_way_out = worst >= MCE_PANIC_SEVERITY;

			if (no_way_out)
				mce_panic("Fatal machine check on current CPU", &m, msg);
		}
```

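In mce_end()->mce_reign(), the Monarch CPU walks the register information recorded for every CPU, reads each severity, and picks the worst one. The assumption is that normally only one UCE occurs at a time, so the record with the highest severity identifies the faulty processor; the records of CPUs that saw no error all have severity MCE_NO_SEVERITY.

mces_seen is a per-CPU variable. On entry to the machine check handler, every CPU saves its m into mces_seen, so per_cpu(mces_seen, cpu) gives access to the register information of every CPU without reading the registers again.

Taking the maximum also works in the overflow case. With AR + AO, for example, handling the AR error is enough and the AO error can be ignored. With AR + AR or AR + UC, the overflow bit in the MCA_STATUS register is set to 1, and mce_severity() takes that into account and returns MCE_PANIC_SEVERITY.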

```c
static void mce_reign(void)
{
	...
	/*
	 * This CPU is the Monarch and the other CPUs have run
	 * through their handlers.
	 * Grade the severity of the errors of all the CPUs.
	 */
	for_each_possible_cpu(cpu) {
		struct mce *mtmp = &per_cpu(mces_seen, cpu);

		if (mtmp->severity > global_worst) {
			global_worst = mtmp->severity;
			m = &per_cpu(mces_seen, cpu);
		}
	}

	/*
	 * Panic right away if recovery is impossible.
	 */
	if (m && global_worst >= MCE_PANIC_SEVERITY) {
		/* call mce_severity() to get "msg" for panic */
		mce_severity(m, NULL, &msg, true);
		mce_panic("Fatal machine check", m, msg);
	}
	...
	if (global_worst <= MCE_KEEP_SEVERITY)
		mce_panic("Fatal machine check from unknown source", NULL, NULL);
```
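The grading step in mce_reign() reduces to a maximum search followed by two threshold checks. The model below works on a plain array of severities, one per CPU; `enum reign_decision` and `toy_reign` are hypothetical names, and the kernel's panic calls are replaced by return values so the logic can be exercised in user space. The severity constants repeat the kernel's enum severity_level ordering.

```c
enum severity_level {
	MCE_NO_SEVERITY,
	MCE_DEFERRED_SEVERITY,
	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
	MCE_KEEP_SEVERITY,
	MCE_SOME_SEVERITY,
	MCE_AO_SEVERITY,
	MCE_UC_SEVERITY,
	MCE_AR_SEVERITY,
	MCE_PANIC_SEVERITY,
};

enum reign_decision {
	REIGN_CONTINUE,		/* a recoverable error was found */
	REIGN_PANIC_FATAL,	/* "Fatal machine check" */
	REIGN_PANIC_UNKNOWN,	/* "... from unknown source" */
};

/*
 * severity[cpu] stands in for per_cpu(mces_seen, cpu).severity.
 * *worst_cpu receives the CPU holding the worst record, or -1 if none.
 */
enum reign_decision toy_reign(const int *severity, int ncpus, int *worst_cpu)
{
	int global_worst = 0;

	*worst_cpu = -1;
	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (severity[cpu] > global_worst) {
			global_worst = severity[cpu];
			*worst_cpu = cpu;
		}
	}

	if (*worst_cpu >= 0 && global_worst >= MCE_PANIC_SEVERITY)
		return REIGN_PANIC_FATAL;

	/* No CPU recorded anything worth acting on. */
	if (global_worst <= MCE_KEEP_SEVERITY)
		return REIGN_PANIC_UNKNOWN;

	return REIGN_CONTINUE;
}
```

The second check captures a case that is easy to miss: an #MC was raised, yet no CPU recorded a meaningful error, and the handler treats that as fatal as well.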

Once a CPU has finished scanning and has the register information it needs, it releases the next waiting CPU in mce_end():

```c
	/*
	 * Allow others to run.
	 */
	atomic_inc(&mce_executing);
```

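The CPU that actually took the fault also has the error information in its MCA STATUS registers. Will the Monarch CPU and the faulting CPU both run the recovery code, so that recovery happens twice? No. Each CPU decides whether to recover based on the result of its own bank scan (its local m and worst), so only the CPU that saw the error goes on to the recovery code. At the end of mce_reign(), the Monarch CPU also clears every mces_seen record so that stale entries do not reappear at the next MCE.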

```c
	/*
	 * Now clear all the mces_seen so that they don't reappear on
	 * the next mce.
	 */
	for_each_possible_cpu(cpu)
		memset(&per_cpu(mces_seen, cpu), 0, sizeof(struct mce));
```

4. User-Mode and Kernel-Mode Recovery

```c
	/* Fault was in user mode and we need to take some action */
	if ((m.cs & 3) == 3) {
		/* If this triggers there is no way to recover. Die hard. */
		BUG_ON(!on_thread_stack() || !user_mode(regs));

		if (kill_current_task)
			queue_task_work(&m, msg, kill_me_now);
		else
			queue_task_work(&m, msg, kill_me_maybe);
	} else {
		if (m.kflags & MCE_IN_KERNEL_RECOV) {
			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
				mce_panic("Failed kernel mode recovery", &m, msg);
		}

		if (m.kflags & MCE_IN_KERNEL_COPYIN)
			queue_task_work(&m, msg, kill_me_never);
	}
```

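Finally, the handler uses the cs register to determine whether the fault occurred in user mode or kernel mode and takes the matching recovery action:

• In user mode, it calls queue_task_work()->kill_me_maybe()->memory_failure() or queue_task_work()->kill_me_now(). memory_failure() performs the main recovery steps: hard-offlining the page, remapping the physical address, and killing tasks (by sending SIGBUS). kill_me_now() simply sends SIGBUS to the current task.

• In kernel mode, it calls fixup_exception() or queues kill_me_never() to recover when the faulty memory was accessed in a context such as uaccess, copyin, copyout, or bpf. If the fault is not in such a context, or recovery fails, it calls mce_panic() and the machine goes down.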
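The branch structure described above can be summarized as a pure decision function. This is a sketch rather than kernel code: `recovery_actions` and the `ACT_*` and `KFLAG_*` names are hypothetical, the flag values are illustrative, and the function only reports which steps would be taken instead of performing them.

```c
#include <stdint.h>

/* Illustrative stand-ins for MCE_IN_KERNEL_RECOV / MCE_IN_KERNEL_COPYIN. */
#define KFLAG_IN_KERNEL_RECOV  (1ULL << 6)
#define KFLAG_IN_KERNEL_COPYIN (1ULL << 7)

/* Bit mask of the steps the handler would take. */
enum {
	ACT_NONE          = 0,
	ACT_KILL_ME_NOW   = 1 << 0, /* queue kill_me_now() */
	ACT_KILL_ME_MAYBE = 1 << 1, /* queue kill_me_maybe() */
	ACT_FIXUP         = 1 << 2, /* call fixup_exception() */
	ACT_KILL_ME_NEVER = 1 << 3, /* queue kill_me_never() */
};

/*
 * cs: saved code segment selector; its low two bits hold the privilege
 * level, and 3 means the fault happened in user mode.
 */
int recovery_actions(uint64_t cs, uint64_t kflags, int kill_current_task)
{
	int actions = ACT_NONE;

	if ((cs & 3) == 3)
		return kill_current_task ? ACT_KILL_ME_NOW : ACT_KILL_ME_MAYBE;

	if (kflags & KFLAG_IN_KERNEL_RECOV)
		actions |= ACT_FIXUP;
	if (kflags & KFLAG_IN_KERNEL_COPYIN)
		actions |= ACT_KILL_ME_NEVER;

	return actions;
}
```

Note that in kernel mode the two checks are independent, so a fault in a copy-in path can result in both an exception fixup and a queued task work.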

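This article covered the definitions, registers, and software handling flow of the MCA UCR enhancement. For reasons of space, some parts were not fully expanded, such as the functions mce_severity(), mce_panic(), memory_failure(), and fixup_exception(), or the variables kill_current_task, lmce, and worst. Some of these are simple enough to understand from the surrounding code; others are too complex and may be covered separately later.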

Reference: Intel® 64 and IA-32 Architectures Software Developer's Manual

Originally published on 2023-07-12 via a WeChat official account.
