Starting with Oracle 10g, Oracle introduced ASM (Automatic Storage Management) to replace host-based volume management software, so that Oracle RAC no longer depends on third-party storage management products such as HACMP or SF RAC. In 10g, ASM was still immature in both functionality and stability and saw little large-scale adoption, but by 11g it was widely deployed and quickly became the core storage management solution for clusters. As the internals of this black box have gradually become better understood, today we share a case in which an ASM disk group could not be mounted.
ASM disk groups offer three redundancy levels: External Redundancy, Normal Redundancy and High Redundancy. We will not dwell on the mirroring mechanics here; the real question is whether redundancy alone lets us rest easy. Certainly not: it covers only part of the failure space. As today's case shows, even a Normal-redundancy disk group can leave you helpless, especially in OLAP systems, where tens of terabytes are routine and the time needed to restore from backup is simply intolerable. Only a solid understanding of ASM internals makes it possible to repair such failures with non-standard techniques.
Like an ordinary file system, ASM keeps its own metadata, and that metadata can be viewed and even edited directly with KFED. The structure this case hinges on is the PST (Partnership and Status Table), described briefly below:
The PST is critical to ASM: it is checked before any other ASM metadata is read. When a disk group is mounted, the GMON process reads every disk in the group to locate and validate the PST, which records which disks are ONLINE and usable and which are OFFLINE.
The PST lives in the first AU of a disk, but not every disk carries a copy; the number of copies varies with the disk group's redundancy, as follows:
External Redundancy: 1 PST copy
Normal Redundancy: 3 PST copies
High Redundancy: 5 PST copies
Now for today's protagonist:
In a NORMAL-redundancy disk group, one failgroup goes offline unexpectedly (for example, a storage cell of an engineered system reboots). Before that failgroup comes back online, a disk in another failgroup is damaged. Now the tragedy is complete: even after the offlined failgroup is restored, the disk group cannot be mounted, because the PST metadata introduced above no longer considers those disks normally accessible.
1. Build a NORMAL-redundancy disk group with 3 failgroups, 2 disks per failgroup:
SQL> select GRPNUM_KFDSK,NUMBER_KFDSK,MODE_KFDSK,FAILNAME_KFDSK,PATH_KFDSK from x$kfdsk where GRPNUM_KFDSK=2;
GRPNUM_KFDSK NUMBER_KFDSK MODE_KFDSK FAILNAME_KFDSK PATH_KFDSK
------------ ------------ ---------- -------------------- --------------------
2 0 127 FG2 /dev/asm-test-diske
2 1 127 FG2 /dev/asm-test-diskf
2 2 127 FG1 /dev/asm-test-diskc
2 3 127 FG1 /dev/asm-test-diskd
2 4 127 FG3 /dev/asm-test-diskg
2 5 127 FG3 /dev/asm-test-diskh
To observe the recovery and track a single row through the exercise, the plan is: offline the disk holding the row's primary extent, update the row, then damage the disk holding its secondary extent, and at the end verify whether the transaction was lost. Create a test table named rescureora and find the physical location of one of its rows.
SQL> select object_id,object_name,
2 dbms_rowid.rowid_block_number(rowid) block#,
3 dbms_rowid.rowid_relative_fno(rowid) file#
4 from sys.rescureora where rownum=1;
OBJECT_ID OBJECT_NAME BLOCK# FILE#
---------- ---------------------------------------- ---------- ----------
20 ICOL$ 131 5
A helper script maps the data block to its ASM disks. Because the group uses normal redundancy, two copies appear: the row with LXN_KFFXP = 0 is the primary extent, on disk 1; the row with LXN_KFFXP = 1 is the secondary extent, on disk 4. Shortly we will offline the failgroup containing disk 1 and then damage disk 4.
SQL> @asm_block
Enter value for block: 131
Enter value for file_number: 256
Enter value for file_type: DATAFILE
Enter value for filename: TEST.256.1034246527
GROUP_KFFXP LXN_KFFXP AU_KFFXP DISK_KFFXP PXN_KFFXP
----------- ---------- ---------- ---------- ----------
2 1 24 4 3
2 0 30 1 2
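The mapping above can be sanity-checked with simple arithmetic, assuming the defaults used in this test (8 KB database blocks, 1 MB allocation units):

```shell
# Which virtual extent of datafile TEST.256 holds database block 131?
db_block=131
block_size=8192          # database block size (assumed default)
au_size=1048576          # ASM allocation unit size (assumed default, 1 MB)

byte_off=$((db_block * block_size))
vextent=$((byte_off / au_size))     # virtual extent number within the file
off_in_au=$((byte_off % au_size))   # byte offset inside that extent's AU

echo "vextent=$vextent off_in_au=$off_in_au"
```

Block 131 falls in virtual extent 1, whose two mirrored physical extents (PXN_KFFXP 2 and 3) are exactly the AU 30 / disk 1 and AU 24 / disk 4 rows returned by the query.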
2. Locate the PST from the GMON trace file
The gmon trace shows that this disk group's PST copies live on disks 0, 2 and 4, which kfed confirms:
=============== PST ====================
grpNum: 2
state: 1
callCnt: 25
(lockvalue) valid=1 ver=0.0 ndisks=3 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next: 29
last: 29
pst count: 3 --number of PST copies
pst locations: 4 2 0 --disks holding the copies
incarn: 25
dta size: 6
version: 1
ASM version: 186646528 = 11.2.0.0.0
contenttype: 1
partnering pattern: [ ]
--------------- LOC MAP ----------------
0: dirty 0 cur_loc: 0 stable_loc: 0
1: dirty 0 cur_loc: 0 stable_loc: 0
--------------- DTA --------------------
0: sts v v(rw) p(rw) a(x) d(x) fg# = 1 addTs = 2429200834 parts: 5 (amp) 4 (amp) 3 (amp) 2 (amp)
1: sts v v(rw) p(rw) a(x) d(x) fg# = 1 addTs = 2429200834 parts: 4 (amp) 5 (amp) 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 1 (amp) 4 (amp) 5 (amp) 0 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 5 (amp) 0 (amp) 1 (amp) 4 (amp)
4: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2429203972 parts: 1 (amp) 0 (amp) 2 (amp) 3 (amp)
5: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2429203972 parts: 0 (amp) 1 (amp) 3 (amp) 2 (amp)
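The parts lists above encode ASM disk partnership. A quick cross-check (failgroup numbers and partner lists copied from the DTA dump: disks 0-1 in fg#1, 2-3 in fg#2, 4-5 in fg#3) confirms that no disk is partnered with a disk in its own failgroup, which is what lets the mirror survive the loss of an entire failgroup:

```shell
# Failgroup of each disk number, copied from the DTA dump above.
fg=(1 1 2 2 3 3)
# Partner list of each disk, copied from the same dump.
partners=("5 4 3 2" "4 5 2 3" "1 4 5 0" "5 0 1 4" "1 0 2 3" "0 1 3 2")

cross_fg_only=1
for d in 0 1 2 3 4 5; do
  for p in ${partners[$d]}; do
    # A partner inside the same failgroup would defeat the mirroring.
    [ "${fg[$d]}" = "${fg[$p]}" ] && cross_fg_only=0
  done
done
echo "cross_fg_only=$cross_fg_only"
```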
[grid@rescureora1 trace]$ kfed read /dev/asm-test-diske aun=1 blkn=1|more
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 17 ; 0x002: KFBTYP_PST_META
kfbh.datfmt: 2 ; 0x003: 0x02
kfbh.block.blk: 257 ; 0x004: blk=257
kfbh.block.obj: 2147483648 ; 0x008: disk=0
kfbh.check: 837788407 ; 0x00c: 0x31efa2f7
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdpHdrPairBv1.first.super.time.hi:33098987 ; 0x000: HOUR=0xb DAYS=0x7 MNTH=0x3 YEAR=0x7e4
kfdpHdrPairBv1.first.super.time.lo:3089312768 ; 0x004: USEC=0x0 MSEC=0xcb SECS=0x2 MINS=0x2e
kfdpHdrPairBv1.first.super.last: 28 ; 0x008: 0x0000001c
kfdpHdrPairBv1.first.super.next: 29 ; 0x00c: 0x0000001d
kfdpHdrPairBv1.first.super.copyCnt: 3 ; 0x010: 0x03 --3 PST copies, on disks 0, 2 and 4
kfdpHdrPairBv1.first.super.version: 1 ; 0x011: 0x01
kfdpHdrPairBv1.first.super.ub2spare: 0 ; 0x012: 0x0000
kfdpHdrPairBv1.first.super.incarn: 25 ; 0x014: 0x00000019
kfdpHdrPairBv1.first.super.copy[0]: 4 ; 0x018: 0x0004 --disk 4
kfdpHdrPairBv1.first.super.copy[1]: 2 ; 0x01a: 0x0002 --disk 2
kfdpHdrPairBv1.first.super.copy[2]: 0 ; 0x01c: 0x0000 --disk 0
kfdpHdrPairBv1.first.super.copy[3]: 0 ; 0x01e: 0x0000
kfdpHdrPairBv1.first.super.copy[4]: 0 ; 0x020: 0x0000
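kfed addresses metadata by (aun, blkn). With the 1 MB AU used by this disk group and ASM's 4 KB metadata block size, the raw byte offset of the block read above can be computed directly (a sketch; au_size must match the disk group's actual AU size):

```shell
au_size=1048576    # 1 MB allocation unit, this disk group's setting
meta_block=4096    # ASM metadata block size
aun=1; blkn=1      # the arguments passed to kfed above

offset=$((aun * au_size + blkn * meta_block))
echo "offset=$offset"   # raw byte offset on the disk that kfed reads
```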
3. Simulate the failure
3.1 Offline fg2, the failgroup holding the primary extent, to mimic a storage node server shutting down in production.
SQL> ALTER DISKGROUP TEST offline disks in failgroup fg2;
Diskgroup altered.
The gmon trace now records an updated PST:
GMON updating disk modes for group 2 at 27 for pid 26, osid 3324
dsk = 0/0xe96887d4, mask = 0x7e, op = clear
dsk = 1/0xe96887d7, mask = 0x7e, op = clear
=============== PST ====================
grpNum: 2
state: 1
callCnt: 27
(lockvalue) valid=1 ver=0.0 ndisks=2 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next: 31
last: 31
pst count: 2 --only 2 PST copies remain
pst locations: 4 2
incarn: 30
dta size: 6
version: 1
ASM version: 186646528 = 11.2.0.0.0
contenttype: 1
partnering pattern: [ ]
--------------- LOC MAP ----------------
0: dirty 0 cur_loc: 0 stable_loc: 0
1: dirty 0 cur_loc: 0 stable_loc: 0
--------------- DTA --------------------
0: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2429200834 parts: 5 (amp) 4 (amp) 3 (amp) 2 (amp) --disk 0 is now offline
1: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2429200834 parts: 4 (amp) 5 (amp) 2 (amp) 3 (amp) --disk 1 is now offline
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 1 (amp) 4 (amp) 5 (amp) 0 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 5 (amp) 0 (amp) 1 (amp) 4 (amp)
4: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2429203972 parts: 1 (amp) 0 (amp) 2 (amp) 3 (amp)
5: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2429203972 parts: 0 (amp) 1 (amp) 3 (amp) 2 (amp)
3.2 Update the row. This update exists only so we can verify data integrity at the end.
SQL> update sys.rescureora set object_name='rescureora' where rownum=1;
1 row updated.
SQL> commit;
Commit complete.
SQL> select object_id,object_name,
2 dbms_rowid.rowid_block_number(rowid) block#,
3 dbms_rowid.rowid_relative_fno(rowid) file#
4 from sys.rescureora where rownum=1;
OBJECT_ID OBJECT_NAME BLOCK# FILE#
---------- ---------------------------------------- ---------- ----------
20 rescureora 131 5
SQL> alter system checkpoint;
System altered.
3.3 Damage disk 4 manually with dd. (In 12c with AFD enabled, stray dd writes like this are filtered out automatically.)
[grid@rescureora1 trace]$ dd if=/dev/zero of=/dev/asm-test-diskg bs=4096 count=1 conv=notrunc
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000306284 s, 13.4 MB/s
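Since dd against a raw device is irreversible, it is worth saving the block you are about to destroy before any such test. A minimal sketch, using a scratch file as a stand-in for /dev/asm-test-diskg:

```shell
disk=$(mktemp)                                   # stand-in for the ASM disk
dd if=/dev/urandom of="$disk" bs=4096 count=4 2>/dev/null

before=$(dd if="$disk" bs=4096 count=1 2>/dev/null | cksum)

dd if="$disk" of=header.bak bs=4096 count=1 2>/dev/null              # back up block 0
dd if=/dev/zero of="$disk" bs=4096 count=1 conv=notrunc 2>/dev/null  # the "damage"
dd if=header.bak of="$disk" bs=4096 count=1 conv=notrunc 2>/dev/null # restore it

after=$(dd if="$disk" bs=4096 count=1 2>/dev/null | cksum)
[ "$before" = "$after" ] && echo "header restored"
rm -f "$disk" header.bak
```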
3.4 The disk group crashes and cannot be mounted, even after the other failgroup comes back (the storage node that shut down abnormally is started again):
SQL> alter diskgroup test mount;
alter diskgroup test mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "2"
SQL> alter diskgroup test mount force;
alter diskgroup test mount force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "4" in group "TEST" may result in a data loss
ORA-15042: ASM disk "4" is missing from group number "2"
The gmon trace shows the PST layout at this point:
=============== PST ====================
grpNum: 2
state: 2
callCnt: 39
(lockvalue) valid=1 ver=0.0 ndisks=2 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next: 35
last: 35
pst count: 2
pst locations: 2 5 --disks 2 and 5
incarn: 34
dta size: 6
version: 1
ASM version: 186646528 = 11.2.0.0.0
contenttype: 1
partnering pattern: [ ]
--------------- LOC MAP ----------------
0: dirty 0 cur_loc: 0 stable_loc: 0
1: dirty 1 cur_loc: 0 stable_loc: 0
--------------- DTA --------------------
0: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2429200834 parts: 5 (amp) 4 (amp) 3 (amp) 2 (amp)
1: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2429200834 parts: 4 (amp) 5 (amp) 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 1 (amp) 4 (amp) 5 (amp) 0 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 5 (amp) 0 (amp) 1 (amp) 4 (amp)
4: sts v v(-w) p(-w) a(-) d(-) fg# = 3 addTs = 2429203972 parts: 1 (amp) 0 (amp) 2 (amp) 3 (amp)
5: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2429203972 parts: 0 (amp) 1 (amp) 3 (amp) 2 (amp)
4. The fix
4.1 Inspect the disk states recorded in the PST
[grid@rescureora1 ~]$ kfed read /dev/asm-test-diskc aun=1 blkn=2|grep "status"|grep -v "I=0"
kfdpDtaEv1[0].status: 21 ; 0x000: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[1].status: 21 ; 0x030: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[2].status: 127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status: 127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status: 127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status: 127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@rescureora1 ~]$ kfed read /dev/asm-test-diskh aun=1 blkn=2|grep "status"|grep -v "I=0"
kfdpDtaEv1[0].status: 21 ; 0x000: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[1].status: 21 ; 0x030: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[2].status: 127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status: 127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status: 127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status: 127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@rescureora1 ~]$ kfed read /dev/asm-test-diskc aun=1 blkn=2|grep "status"|grep -v "I=0" > repair.txt
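The status byte is a bit field; the bit order (LSB first: I, V, V, P, P, A, D) is inferred from the kfed output above, where 21 = 0b0010101 decodes to exactly I=1 V=0 V=1 P=0 P=1 A=0 D=0. A small decoder reproduces kfed's flag display:

```shell
# Decode a kfdpDtaEv1 status byte into the flags kfed prints.
decode_status() {
  local s=$1 i=0 out="" name
  for name in I V V P P A D; do     # LSB-first flag order, per the dump above
    out+="$name=$(( (s >> i) & 1 )) "
    i=$((i + 1))
  done
  printf '%s\n' "${out% }"
}

decode_status 21    # the two offlined disks
decode_status 127   # a fully valid disk (all flags set)
```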
4.2 Patch the disk states. The two entries stuck at 21, disks 0 and 1, just need their status set back to 127; edit repair.txt so it reads:
kfdpDtaEv1[0].status: 127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status: 127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status: 127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status: 127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status: 127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status: 127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@rescureora1 ~]$ kfed merge /dev/asm-test-diskh aun=1 blkn=2 text=repair.txt
[grid@rescureora1 ~]$ kfed merge /dev/asm-test-diskc aun=1 blkn=2 text=repair.txt
[grid@rescureora1 ~]$ kfed merge /dev/asm-test-diskh aun=1 blkn=3 text=repair_3.txt
[grid@rescureora1 ~]$ kfed merge /dev/asm-test-diskc aun=1 blkn=3 text=repair_3.txt
4.3 With the PST repaired, try to mount the disk group
SQL> alter diskgroup test mount force;
alter diskgroup test mount force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15096: lost disk write detected
ORA-15042: ASM disk "4" is missing from group number "2"
The error has changed: ORA-15096 (lost disk write) is now raised while the disk group replays the ACD during mount, so the fix is simply to skip that recovery. The alert log shows recovery starting from checkpoint seq=7, block=1474:
NOTE: starting recovery of thread=1 ckpt=7.1474 group=2 (TEST)
NOTE: BWR validation signaled ORA-15096
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_4035.trc:
ORA-15096: lost disk write detected
ORA-15042: ASM disk "4" is missing from group number "2"
Dump the ACD checkpoint block:
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 3 ; 0x008: file=3
kfbh.check: 1111750266 ; 0x00c: 0x4243f67a
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfracdc.eyec[0]: 65 ; 0x000: 0x41
kfracdc.eyec[1]: 67 ; 0x001: 0x43
kfracdc.eyec[2]: 68 ; 0x002: 0x44
kfracdc.eyec[3]: 67 ; 0x003: 0x43
kfracdc.thread: 1 ; 0x004: 0x00000001
kfracdc.lastAba.seq: 4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk: 4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0: 1 ; 0x010: 0x00000001
kfracdc.blks: 10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq: 7 ; 0x018: 0x00000007 --checkpoint sequence
kfracdc.ckpt.blk: 1474 ; 0x01c: 0x000005c2 --checkpoint block
kfracdc.fcn.base: 6657 ; 0x020: 0x00001a01
kfracdc.fcn.wrap: 0 ; 0x024: 0x00000000
kfracdc.bufBlks: 256 ; 0x028: 0x00000100
kfracdc.strt112.seq: 2 ; 0x02c: 0x00000002
kfracdc.strt112.blk: 0 ; 0x030: 0x00000000
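The edit itself can be scripted: dump the block with kfed, then bump ckpt.seq past its on-disk value so mount-time recovery finds nothing to replay. Shown here against a stand-in file; on the real system the input would come from `kfed read /dev/asm-test-diske aun=4 blkn=0 > acd.txt`, and kfed merge applies the decimal value field, so verify the dump format before editing real metadata:

```shell
# Stand-in for one line of the kfed dump of the ACD checkpoint block.
printf 'kfracdc.ckpt.seq: 7 ; 0x018: 0x00000007\n' > acd.txt

# Bump the checkpoint sequence from 7 to 9, as done in this case.
sed -i 's/^kfracdc.ckpt.seq: *7 /kfracdc.ckpt.seq: 9 /' acd.txt
cat acd.txt
```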
4.4 Edit the ACD checkpoint (bump ckpt.seq from 7 to 9 so mount no longer replays from the stale checkpoint) and merge it back into ASM:
[grid@rescureora1 trace]$ cat acd.txt
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 3 ; 0x008: file=3
kfbh.check: 1111750266 ; 0x00c: 0x4243f67a
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfracdc.eyec[0]: 65 ; 0x000: 0x41
kfracdc.eyec[1]: 67 ; 0x001: 0x43
kfracdc.eyec[2]: 68 ; 0x002: 0x44
kfracdc.eyec[3]: 67 ; 0x003: 0x43
kfracdc.thread: 1 ; 0x004: 0x00000001
kfracdc.lastAba.seq: 4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk: 4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0: 1 ; 0x010: 0x00000001
kfracdc.blks: 10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq: 9 ; 0x018: 0x00000007
kfracdc.ckpt.blk: 1474 ; 0x01c: 0x000005c2
kfracdc.fcn.base: 6657 ; 0x020: 0x00001a01
kfracdc.fcn.wrap: 0 ; 0x024: 0x00000000
kfracdc.bufBlks: 256 ; 0x028: 0x00000100
kfracdc.strt112.seq: 2 ; 0x02c: 0x00000002
kfracdc.strt112.blk: 0 ; 0x030: 0x00000000
kfed merge /dev/asm-test-diske aun=4 blkn=0 text=acd.txt
4.5 The disk group mounts normally
SQL> alter diskgroup test mount force;
Diskgroup altered.
4.6 Start the database
SQL> startup
ORACLE instance started.
Total System Global Area 839282688 bytes
Fixed Size 2257880 bytes
Variable Size 541068328 bytes
Database Buffers 289406976 bytes
Redo Buffers 6549504 bytes
Database mounted.
ORA-01113: file 5 needs media recovery
ORA-01110: data file 5: '+TEST/rescureora/datafile/test.256.1034246527'
In a real environment you may well see inconsistency at this point; but as long as the archived logs are intact, media recovery resolves it cleanly, as it does in this case.
SQL> recover database;
Media recovery complete.
SQL> alter database open;
5. Verify the data: the transaction was not lost
SQL> select object_id,object_name,
2 dbms_rowid.rowid_block_number(rowid) block#,
3 dbms_rowid.rowid_relative_fno(rowid) file#
4 from sys.rescureora where rownum=1;
OBJECT_ID OBJECT_NAME BLOCK# FILE#
---------- ------------------------------ ---------- ----------
20 rescureora 131 5
At this year's Data Technology Carnival, Li Xiangyu will present "Deep Analysis of Oracle Internals Through Case Studies", exploring some internal mechanisms of the CBO and ASM rebalance. Not to be missed!