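I use hardware RAID with a PERC810 controller in my servers and recently ran into a metric I am unsure about. Until now I have used the smartctl metric "Elements in grown defect list" as a hint that a drive is failing and should be removed, but when I use perccli (or storcli/megacli), the drive also shows a metric called "Media Error Count". The problem is that, from what I have read, these two metrics are essentially the same: both show reallocated sectors or physical defects on the disk. However, some of my HDDs show a number greater than zero for Elements in grown defect list but zero for Media Error Count, and vice versa. For example, this disk: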
perccli /c0/e37/s7 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive /c0/e37/s7 :
================
----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
----------------------------------------------------------------------------
37:7     72 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -
----------------------------------------------------------------------------
EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild
Drive /c0/e37/s7 - Detailed Information :
=======================================
Drive /c0/e37/s7 State :
======================
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
Drive Temperature = 41C (105.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
Drive /c0/e37/s7 Device attributes :
==================================
SN = WMC1F0D41KD5
Manufacturer Id = WD
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01F55DD1
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01
It shows Media Error Count = 38, but when I run smartctl against the same disk:
smartctl -a -d megaraid,72 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VR08
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01f55dd0
Serial number: WMC1F0D41KD5
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Fri Jan 28 14:14:51 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 41 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 60298:10
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 18
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 118
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    2538437     9298     76289   2547735       9392     215124.761          94
write:   5550372  5405661   5407707  10956033    5405661     571404.363           0
verify:      184        0         0       184          0        352.277           0
Non-medium error count: 202249
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [518.7 minutes]
It shows Elements in grown defect list: 0
Here is another example from the same server, just a different HDD:
perccli /c0/e37/s4 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive /c0/e37/s4 :
================
----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
----------------------------------------------------------------------------
37:4     63 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -
----------------------------------------------------------------------------
EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild
Drive /c0/e37/s4 - Detailed Information :
=======================================
Drive /c0/e37/s4 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 118060
Drive Temperature = 35C (95.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
Drive /c0/e37/s4 Device attributes :
==================================
SN = WMC1F0D222KF
Manufacturer Id = WD
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01352C35
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01
Drive /c0/e37/s4 Policies/Settings :
==================================
Drive position = DriveGroup:1, Span:1, Row:0
Enclosure position = 0
Connected Port Number = 0(path0)
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No
Port Information :
================
-----------------------------------------
Port Status Linkspeed SAS address
-----------------------------------------
   0 Active 6.0Gb/s   0x50000c0f01352c36
   1 Active Unknown   0x0
-----------------------------------------
Inquiry Data =
00 00 06 12 5b 01 10 02 57 44 20 20 20 20 20 20
57 44 34 30 30 31 46 59 59 47 2d 30 31 53 4c 33
56 52 30 38 57 44 2d 57 4d 43 31 46 30 44 32 32
32 4b 46 20 20 20 20 20 00 00 00 a0 0c 40 20 c0
04 60 04 c0 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
But smartctl:
smartctl -a -d megaraid,63 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VR08
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01352c34
Serial number: WMC1F0D222KF
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Fri Jan 28 14:39:52 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 35 C
Drive Trip Temperature: 40 C
Accumulated power on time, hours:minutes 60299:24
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 18
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 118
Elements in grown defect list: 44
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4899063        1         1   4899064          1     215489.217           0
write:   6593514      494       496   6594008        499     571584.348           0
verify:      345        0         0       345          0        349.197           0
Non-medium error count: 202287
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]
Long (extended) Self-test duration: 31120 seconds [518.7 minutes]
shows Elements in grown defect list: 44
Can you explain the difference between these two metrics? Which one should be used to identify a faulty drive? Thanks.
Answered 2023-03-01 14:31:32
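The discrepancy arises because the two metrics measure similar things but operate at different layers.
Media Error Count
counts media errors as seen by the RAID card.
Elements in grown defect list
shows the size of the grown defect list, i.e. the number of remapped sectors as seen by the drive itself.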
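Since the two counters are kept at different layers, a monitoring check may want to read both rather than rely on either one. Below is a minimal sketch that pulls them out of the text output shown in the question. The function names and the sample strings are illustrative assumptions; the regular expressions only assume the field labels exactly as they appear in the perccli and smartctl output above.

```python
import re

def parse_perccli(text):
    """Controller-side counter from `perccli /cX/eY/sZ show all` text output."""
    m = re.search(r"^\s*Media Error Count\s*=\s*(\d+)", text, re.MULTILINE)
    return int(m.group(1)) if m else None

def parse_smartctl(text):
    """Drive-side counters from `smartctl -a` text output of a SAS drive."""
    grown = re.search(r"^\s*Elements in grown defect list:\s*(\d+)",
                      text, re.MULTILINE)
    # In the error counter log, the last column of the 'read:' row
    # is 'Total uncorrected errors'.
    read_row = re.search(r"^\s*read:\s+(.*)$", text, re.MULTILINE)
    uncorrected = int(read_row.group(1).split()[-1]) if read_row else None
    return {
        "grown_defects": int(grown.group(1)) if grown else None,
        "uncorrected_reads": uncorrected,
    }

# Excerpts copied from the output of the first disk in the question.
PERCCLI_SAMPLE = """\
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
"""

SMARTCTL_SAMPLE = """\
Elements in grown defect list: 0
Error counter log:
read: 2538437 9298 76289 2547735 9392 215124.761 94
write: 5550372 5405661 5407707 10956033 5405661 571404.363 0
"""

if __name__ == "__main__":
    print(parse_perccli(PERCCLI_SAMPLE))
    print(parse_smartctl(SMARTCTL_SAMPLE))
```

In practice the two sample strings would be replaced by the captured output of the actual commands; parsing plain text like this is fragile across tool versions, so treat it as a starting point only.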
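The two values can fail to match for several reasons:
A read fails on the drive (counted under total uncorrected errors), and the rewrite of the same sector succeeds, so the RAID array logs a media error but the disk does not remap the sector (I would consider a disk that does not remap such sectors defective, but I have seen them in the wild). https://serverfault.com/questions/1091480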
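To make the case above concrete, here is a rough sketch that labels a drive from three counters that have already been read: the controller's Media Error Count, the drive's grown defect list size, and the uncorrected read errors from the smartctl error counter log. The function name and the wording of the labels are hypothetical, and the labels only describe what each layer reports; they are not a replacement policy.

```python
def describe_counters(media_errors, grown_defects, uncorrected_reads):
    """Describe which layer reports defects (illustrative labels only)."""
    if media_errors > 0 and grown_defects == 0:
        if uncorrected_reads > 0:
            # The case described above: reads failed, the controller logged
            # them, but the drive did not remap the affected sectors.
            return "controller saw failed reads; drive did not remap"
        return "controller logged media errors; drive counters are clean"
    if media_errors == 0 and grown_defects > 0:
        return "drive remapped sectors; controller logged no media error"
    if media_errors == 0 and grown_defects == 0:
        return "no defects reported at either layer"
    return "both layers report defects"

if __name__ == "__main__":
    # Values taken from the two disks in the question.
    print(describe_counters(38, 0, 94))  # slot 7
    print(describe_counters(0, 44, 0))   # slot 4
```

With the numbers from the question, the disk in slot 7 falls into the first branch and the disk in slot 4 into the "drive remapped sectors" branch.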