T630 GPU server crash and automatic reboot: log record

A T630 GPU server crashed and rebooted automatically. The log recorded: A fatal error was detected on a component at bus 128 device 3 function 0
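The bus, device, and function numbers in this log message are decimal, while tools such as lspci print PCI addresses in hexadecimal. A minimal sketch of the conversion, assuming a Linux host; the function name `pci_addr` is made up for illustration:

```shell
#!/bin/sh
# Convert decimal bus/device/function numbers (as printed in the log
# message) to the hexadecimal bus:device.function form used by lspci.
# pci_addr is a hypothetical helper name, not part of any tool.
pci_addr() {
    printf '%02x:%02x.%x\n' "$1" "$2" "$3"
}

# Bus 128, device 3, function 0 from the log entry.
pci_addr 128 3 0
```

The resulting address can be passed to `lspci -s` to see which component sits there; it may turn out to be an upstream bridge rather than the GPU itself.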

Cause of the failure:

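The machine went down because, with multiple GPUs running under heavy load, the GPU temperature reached the threshold (95 °C), which triggered the bus fatal error and caused the crash and reboot.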

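The root cause was an abnormal iDRAC thermal control process: it could not report the GPUs' actual operating temperature accurately and in real time, so the GPUs overheated and the machine went down.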
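Since iDRAC was not tracking the real GPU temperature, it can help to watch it from the operating system. The sketch below assumes NVIDIA GPUs with `nvidia-smi` installed, which the original post does not state; `check_gpu_temps` is a hypothetical helper name, and the limit passed to it is the operator's choice.

```shell
#!/bin/sh
# Read one GPU temperature (degrees C) per line on stdin and print a
# warning for every GPU at or above the limit given as $1.
# Returns 1 if any GPU is at or above the limit, 0 otherwise.
# check_gpu_temps is a hypothetical helper name.
check_gpu_temps() {
    limit=$1
    awk -v limit="$limit" '
        $1 ~ /^[0-9]+$/ {
            if ($1 + 0 >= limit + 0) {
                # NR - 1 assumes one line per GPU, in index order.
                printf "GPU %d: %d C (limit %d C)\n", NR - 1, $1, limit
                hot = 1
            }
        }
        END { exit hot ? 1 : 0 }
    '
}
```

One possible invocation is `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | check_gpu_temps 90`, where 90 is an arbitrary margin below the 95 °C threshold mentioned above.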

Adjusting the fan speed directly with racadm:

Check the current value:

[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedoffset

Security Alert: Certificate is invalid - self signed certificate

Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.

[Key=System.Embedded.1#ThermalSettings.1]

FanSpeedOffset=Off
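When this check is scripted, only the `FanSpeedOffset=` line matters; the certificate warnings and the `[Key=...]` line can be ignored. A minimal sketch, assuming the output format shown above; `fan_offset_value` is a hypothetical helper name:

```shell
#!/bin/sh
# Extract the FanSpeedOffset value from racadm "get" output on stdin.
# Certificate warnings and the [Key=...] line are skipped; any carriage
# returns are stripped in case the output uses CRLF line endings.
# fan_offset_value is a hypothetical helper name.
fan_offset_value() {
    tr -d '\r' | sed -n 's/^FanSpeedOffset=//p'
}
```

For example, `racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedoffset | fan_offset_value` would print just the value for the session shown here.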

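Set the fan speed offset to 3. The author lists the values as: 0 = low fan speed, 1 = medium fan speed, 2 = high fan speed, 3 = max fan speed. Dell's RACADM reference for some iDRAC versions assigns 1 to high and 2 to medium, so confirm the mapping for your firmware; 3 is the maximum in both listings.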

[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx set System.ThermalSettings.FanSpeedoffset 3

Security Alert: Certificate is invalid - self signed certificate

Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.

[Key=System.Embedded.1#ThermalSettings.1]

Object value modified successfully

Check again after the change:

[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedoffset

Security Alert: Certificate is invalid - self signed certificate

Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.

[Key=System.Embedded.1#ThermalSettings.1]

FanSpeedOffset=Max Fan Speed
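The set-then-verify sequence above can be wrapped in one function so that it fails loudly if the value did not take effect. This is a sketch under assumptions: `BMC_HOST`, `BMC_USER`, `BMC_PASS`, and `set_max_fan_offset` are hypothetical names, and it relies on racadm returning a non-zero exit status on failure and printing `FanSpeedOffset=Max Fan Speed` as in the session above.

```shell
#!/bin/sh
# Set the fan speed offset to 3 and confirm that a follow-up "get"
# reports "Max Fan Speed", mirroring the manual session.
# BMC_HOST, BMC_USER, and BMC_PASS are hypothetical variable names;
# set_max_fan_offset is a hypothetical helper name.
set_max_fan_offset() {
    attr=System.ThermalSettings.FanSpeedoffset
    # Apply the new value; give up if racadm reports an error.
    racadm -r "$BMC_HOST" -u "$BMC_USER" -p "$BMC_PASS" \
        set "$attr" 3 >/dev/null || return 1
    # Read the value back and keep only the FanSpeedOffset= line.
    value=$(racadm -r "$BMC_HOST" -u "$BMC_USER" -p "$BMC_PASS" \
        get "$attr" | sed -n 's/^FanSpeedOffset=//p')
    # Succeed only if the readback matches the expected label.
    [ "$value" = "Max Fan Speed" ]
}
```

A caller might run `BMC_HOST=BMCIP BMC_USER=xxx BMC_PASS=xxx set_max_fan_offset || echo "fan offset not applied"`.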

After the fan speed offset was raised, the server ran normally.

  • Original link: https://kuaibao.qq.com/s/20180913G0DHW500?refer=cp_1026