
SuperMicro BMC watchdog-induced reboots

Server Fault user
Asked 2015-05-31 15:46:46
Answers: 1, Views: 16.7K, Followers: 0, Votes: 7

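I recently bought a SuperMicro X10SLL-F motherboard, which has a built-in BMC (ASPEED AST2400 chip). I would like to use the built-in watchdog controller while running Linux (Gentoo Hardened) on the server.

I enabled the watchdog function in the BIOS, then switched the motherboard jumper from hard reset to NMI (this selects the watchdog timeout action; I chose NMI to avoid reboots). On the software side, I installed the watchdog program (sys-apps/watchdog) and added it to the default runlevel. It is configured to ping the watchdog device (/dev/watchdog, which is present) every 10 seconds. The watchdog timeout is set to 250 seconds.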
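
For readers unfamiliar with what "pinging" the device means, here is a minimal sketch of the keepalive loop. It is not the sys-apps/watchdog implementation: the helper name `pet_watchdog` and its parameters are hypothetical. It assumes the usual Linux watchdog device behaviour, where any write resets the countdown and writing `V` as the last write before closing asks the driver to disarm the timer (drivers ignore this when nowayout is set).

```python
import time


def pet_watchdog(dev, interval_s=10.0, beats=3, sleep=time.sleep):
    """Write `beats` keepalives to an already opened watchdog device.

    `dev` is any binary file-like object. On real hardware it would be
    open("/dev/watchdog", "wb", buffering=0). Hypothetical helper.
    """
    for i in range(beats):
        dev.write(b"\0")       # any write resets the countdown
        dev.flush()
        if i < beats - 1:
            sleep(interval_s)  # wait before the next keepalive
    dev.write(b"V")            # "magic close": ask the driver to disarm
    dev.flush()
```

Passing the device and the sleep function in as arguments keeps the loop testable without real hardware. Only try it against /dev/watchdog on a machine where an unexpected reset is acceptable.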

The tools clearly see the watchdog hardware (ipmitool, with OpenIPMI enabled):

# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      254 sec
Present Countdown:      253 sec

FreeIPMI:

# bmc-watchdog --get
Timer Use:                   SMS/OS
Timer:                       Running
Logging:                     Enabled
Timeout Action:              Hard Reset
Pre-Timeout Interrupt:       None
Pre-Timeout Interval:        0 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Set
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           254 seconds
Current Countdown:           253 seconds
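
Both tools print plain `key: value` lines, so the countdown can be watched from a script. The sketch below assumes the output format exactly as shown above. `parse_watchdog_status` and `seconds` are hypothetical helper names, not part of ipmitool or FreeIPMI.

```python
def parse_watchdog_status(text):
    """Turn 'Key: value' lines into a dict (format assumed from the output above)."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the shell prompt line
            status[key.strip()] = value.strip()
    return status


def seconds(value):
    """Extract the leading integer from values like '253 sec' or '254 seconds'."""
    return int(value.split()[0])


sample = """\
Timer Use:                   SMS/OS
Timer:                       Running
Timeout Action:              Hard Reset
Initial Countdown:           254 seconds
Current Countdown:           253 seconds
"""
status = parse_watchdog_status(sample)
print(status["Timeout Action"], seconds(status["Current Countdown"]))
```

Logging these values periodically would show whether the countdown is really being reset right up to the moment the NMI arrives.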

However, after a certain amount of time (while the tools above still report good "current countdown" values):

[  294.107534] Uhhuh. NMI received for unknown reason 21 on CPU 0.
[  294.107998] Do you have a strange power saving mode enabled?
[  294.108437] Dazed and confused, but trying to continue

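This is the NMI, apparently caused by the watchdog timeout. Less than a minute after that, the machine is hard reset.

Where is the problem, and in which direction should I dig?

Edit: IPMI-related kernel messages: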

[    0.353090] ipmi message handler version 39.2
[    0.353353] ipmi device interface
[    0.353623] IPMI System Interface driver.
[    0.353898] ipmi_si: probing via ACPI
[    0.354172] ipmi_si 00:08: [io  0x0ca2] regsize 1 spacing 1 irq 0
[    0.354444] ipmi_si: Adding ACPI-specified kcs state machine
[    0.354790] ipmi_si: probing via SMBIOS
[    0.355051] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[    0.355317] ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
[    0.355836] ipmi_si: probing via SPMI
[    0.356095] ipmi_si: SPMI: io 0xca2 regsize 1 spacing 1 irq 0
[    0.356362] ipmi_si: Adding SPMI-specified kcs state machine duplicate interface
[    0.356906] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
[    0.390536] ipmi_si: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
[    0.418476] ipmi_si 00:08: Found new BMC (man_id: 0x002a7c, prod_id: 0x0801, dev_id: 0x20)
[    0.419004] ipmi_si 00:08: IPMI kcs interface initialized
[    0.419272] IPMI SSIF Interface driver
[    0.420350] IPMI Watchdog: driver initialized
[    0.420635] Copyright (C) 2004 MontaVista Software - IPMI Powerdown via sys_reboot.
[    0.421444] IPMI poweroff: ATCA Detect mfg 0x2A7C prod 0x801
[    0.421710] IPMI poweroff: Found a chassis style poweroff function

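Edit: I tried configuring bmc-watchdog with "-u 4 -p 2 -a 0 -F -P -L -O -i 300 -e 10". So only the SMS/OS timer use is selected, the pre-timeout interrupt is set to NMI, and the timeout action is set to none: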

# bmc-watchdog --get
Timer Use:                   SMS/OS
Timer:                       Running
Logging:                     Enabled
Timeout Action:              None
Pre-Timeout Interrupt:       NMI / Diagnostic Interrupt
Pre-Timeout Interval:        0 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Set
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           300 seconds
Current Countdown:           290 seconds

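But this did not change anything.

Edit. Also, when I trigger the watchdog timer by echoing \0x00 to /dev/watchdog and then leave it alone, the system is correctly rebooted after the default 10-second timeout. So the watchdog itself works, yet the system still reboots about 350 seconds after startup.

Edit. I checked the BMC System Event Log (SEL) and found this after the reboot: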

Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; Timer use at expiration = SMS/OS ; Interrupt type = none

Interestingly, this event is marked as "status only". Even so, the system reboots. When I deliberately trigger a watchdog timeout, the log is different:

Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; Timer use at expiration = SMS/OS ; Interrupt type = none
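
To compare the two cases mechanically, the SEL lines can be split into fields. This is only a sketch, assuming the `|` and `;` separated layout shown in the excerpts above. `parse_sel_line` is a hypothetical helper, and other BMCs or tools may format SEL entries differently.

```python
def parse_sel_line(line):
    """Split one SEL line of the form shown above into a dict of fields."""
    sensor, name, direction, details = [p.strip() for p in line.split("|")]
    # The first ';'-separated item is the event, the rest are 'key = value' pairs.
    event, *extras = [p.strip() for p in details.split(";")]
    fields = dict(e.split(" = ", 1) for e in extras)
    return {"sensor": sensor, "name": name, "direction": direction,
            "event": event, **fields}


unexpected = parse_sel_line(
    "Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; "
    "Timer use at expiration = SMS/OS ; Interrupt type = none"
)
deliberate = parse_sel_line(
    "Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; "
    "Timer use at expiration = SMS/OS ; Interrupt type = none"
)
print(unexpected["event"], "vs", deliberate["event"])
```

With the fields separated, the only difference between the two second entries is the event itself: "Timer expired, status only" for the unexpected reboot versus "Hard Reset" for the deliberate one.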

1 Answer

Server Fault user

Accepted answer

Answered 2015-05-31 19:13:30

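In the end I found a somewhat odd solution: simply leave the watchdog jumper (JWD1) open, selecting neither NMI nor hard reset. The watchdog stays enabled in the BIOS setup.

In this configuration the watchdog works as expected: the system was stable for 25 minutes with bmc-watchdog running, and it rebooted after the watchdog program was terminated.

Votes: 5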
Original page content provided by Server Fault.
Original link:

https://serverfault.com/questions/695650
