前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >DBA生存警示:保护现场不要让事情更糟

DBA生存警示:保护现场不要让事情更糟

作者头像
数据和云
发布2018-03-06 10:26:08
1.1K0
发布2018-03-06 10:26:08
举报
文章被收录于专栏:数据和云数据和云

张乐奕

云和恩墨副总经理 Oracle ACE 总监

ITPUB Oracle数据库管理版版主、Oracle高可用版版主、ACOUG联合创始人

今天收到一个发过来请求帮助的 case,Oracle 数据库无法启动,请求帮助恢复。仔细阅读了发过来的告警日志,这是一个典型的“事情越弄越糟”的案例。

作为一个专业的DBA,在遇到问题时,一定要思考:如何保护现场,不让事情变得更糟。这是基本要求,保护现场以使得其他人接手工作时,可以从原有状态开始。

以下就来根据告警日志,一条一条地回顾这位 DBA 是如何将数据库弄到完全启动不了的。

故障最开始是从1月11日的凌晨 3:30 开始出现,数据库在归档的时候,意外发现某个控制文件的头块全部被清零了,这可能是存储本身的问题,并非人为。

代码语言:javascript
复制
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Master background archival failure: 227
Fri Jan 11 03:31:24 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

接下来,数据库痛苦地挣扎了半小时,期间不停地报相同的 ORA-00227 错误。一直到凌晨 4:01,终于 CKPT 进程也发现无法更新控制文件头部,于是强势地将数据库直接关闭了。

代码语言:javascript
复制
Fri Jan 11 04:01:25 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
CKPT: terminating instance due to error 227
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_pmon_2997.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Fri Jan 11 04:01:26 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_psp0_2999.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Instance terminated by CKPT, pid = 3007

接下来的5个小时,数据库静静地躺在机房里,没有人知道这个数据库已经挂掉了,一直到上午 DBA 来上班。他发现数据库无法访问,于是尝试重新启动数据库。

代码语言:javascript
复制
Fri Jan 11 09:15:51 2013
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on. 

自然数据库无法正常启动,连 mount 状态都无法进入,因为某个控制文件头部已经损坏了。告警日志的信息明确地说明了无法读取 control03.ctl 文件的头块,因此在尝试 mount 数据库的时候报了 ORA-00205 错误。

代码语言:javascript
复制
Fri Jan 11 09:15:56 2013
ALTER DATABASE   MOUNT
Fri Jan 11 09:15:56 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:15:59 2013
ORA-205 signalled during: ALTER DATABASE   MOUNT...
Fri Jan 11 09:19:31 2013
Starting ORACLE instance (normal)
Fri Jan 11 09:19:43 2013
alter database mount
Fri Jan 11 09:19:43 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:19:43 2013
ORA-205 signalled during: alter database mount

接下来,这位 DBA 开始反复地关闭实例,又启动实例。这样的操作一直持续了1个小时,一直到上午的 10:28,可以想象这是多么纠结的一个小时。

忠告: 除非十分特殊的恢复案例,否则反复起停数据库实例是于事无补的。

代码语言:javascript
复制
Shutting down instance: further logons disabled
Fri Jan 11 09:23:47 2013
Stopping background process CJQ0
……
Fri Jan 11 09:38:02 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 09:43:00 2013
Shutting down instance: further logons disabled
……
Fri Jan 11 09:43:58 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 09:55:34 2013
ALTER DATABASE CLOSE NORMAL
……
Fri Jan 11 09:56:55 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 10:28:10 2013
ALTER DATABASE CLOSE NORMAL

接下来 10:29 的再次重新启动数据库实例之前,DBA 终于意识到可能是控制文件出现了问题,因此修改了初始化参数,将报错的 control03.ctl 文件从初始化参数 control_files 中去掉了。这次数据库得以正常启动。

代码语言:javascript
复制
Fri Jan 11 10:29:20 2013
Starting ORACLE instance (normal)
……
control_files            = /data1/oradata/dpdata/control01.ctl, /data1/oradata/dpdata/control02.ctl
……
Fri Jan 11 10:29:37 2013
Completed: ALTER DATABASE OPEN

而DBA也迅速地作了一次备份控制文件的操作,但是正是这个操作引导到了后面噩梦一般的结果。

代码语言:javascript
复制
Fri Jan 11 10:36:14 2013
alter database backup controlfile to trace
Fri Jan 11 10:36:14 2013
Completed: alter database backup controlfile to trace

数据库又平稳地运行了一个上午,这种宁静到下午14:16的时候被打破,数据库开始报 ORA-600 错误,并且在 CKPT 进程作检查点的时候,报数据文件10和31的文件头部无法被正确读取。如果不是更深层次的原因,那么这可能仍然是跟凌晨时候控制文件意外损坏时候的故障一样,也许是存储子系统本身的问题。

代码语言:javascript
复制
Fri Jan 11 14:16:07 2013
Errors in file /oracle/admin/dpdata/udump/dpdata_ora_22240.trc:
ORA-00600: internal error code, arguments: [6002], [0], [0], [3], [0], [], [], []
Fri Jan 11 14:19:44 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 10 going offline due to error advancing checkpoint
ORA-01122: database file 10 failed verification check
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-01251: Unknown File Header Version read for file number 10
Fri Jan 11 14:19:59 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 31 going offline due to error advancing checkpoint
ORA-01122: database file 31 failed verification check
ORA-01110: data file 31: '/data3/ts2_dpcis.dbf'
ORA-01251: Unknown File Header Version read for file number 31

紧接着,应用系统的某个 JOB 也由于数据文件无法访问,而开始报错。

代码语言:javascript
复制
Fri Jan 11 14:30:19 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_j001_12993.trc:
ORA-12012: error on auto execute of job 88
ORA-00376: file 10 cannot be read at this time
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-06512: at "DECTR.P_MOVE_CONTS_SHIP", line 77
ORA-06512: at line 1

相同的报错一直持续了四十多分钟,之后 DBA 又再次介入了。但是 DBA 很奇怪地断然执行了 offline 这两个数据文件的操作,并在2分多钟之后,又尝试将两个数据文件再次 online 。而由于文件损坏,自然在 online 的时候遇到了 ORA-1122 错误,而无法成功 online 。

代码语言:javascript
复制
Fri Jan 11 15:16:23 2013
alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:16:23 2013
Completed: alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:17:05 2013
alter database datafile  '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:17:05 2013
Completed: alter database datafile  '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:19:41 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:19:41 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...
Fri Jan 11 15:21:10 2013
alter database datafile  '/data2/DECTR_HIS2.dbf' online
Fri Jan 11 15:21:10 2013
ORA-1122 signalled during: alter database datafile  '/data2/DECTR_HIS2.dbf' online...

这才仅仅是噩梦的开始,接下来的一切属于危险动作,请勿轻易模仿。

遇到 ORA-1122 错误以后,DBA 考虑了 9 秒钟,再次断然地关闭了数据库,并随之又重新启动。由于仅仅是用户表空间数据文件损坏,并且之前也已经被 offline 了,因此数据库实例毫无障碍地 OPEN 成功。

代码语言:javascript
复制
Fri Jan 11 15:21:19 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:21:19 2013
Stopping background process QMNC
Fri Jan 11 15:21:19 2013
Stopping background process CJQ0
Fri Jan 11 15:21:21 2013
Stopping background process MMNL
Fri Jan 11 15:21:22 2013
Stopping background process MMON
Fri Jan 11 15:21:23 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:22:59 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:23:13 2013
Completed: ALTER DATABASE OPEN

DBA再次尝试online数据文件的操作,同样的ORA-1122错误。

代码语言:javascript
复制
Fri Jan 11 15:23:31 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:23:31 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...

考虑了2分半钟之后,DBA 也许是想起上午的时候做过控制文件的备份,因此决定进行数据库恢复。重启数据库到 nomount 状态,并开始进行RECOVER DATABASE USING BACKUP CONTROLFILE,ORA-1507 错误的意思是告知如果要使用备份的控制文件进行数据库恢复,那么应该要先使用备份的控制文件将数据库启动到 mount 状态。

代码语言:javascript
复制
Fri Jan 11 15:25:05 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:25:05 2013
Stopping background process QMNC
Fri Jan 11 15:25:05 2013
Stopping background process CJQ0
Fri Jan 11 15:25:07 2013
Stopping background process MMNL
Fri Jan 11 15:25:08 2013
Stopping background process MMON
Fri Jan 11 15:25:09 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:26:32 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:26:46 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 15:26:46 2013
ORA-1507 signalled during: ALTER DATABASE RECOVER  database using backup controlfile    until cancel  ...

DBA 于是将数据库启动到 mount 状态,继续进行数据库恢复。这其中的几个ORA 错误都是正常的,ORA-279 提示需要一个归档文件来完成恢复,ORA-308提示打不开 1_87749_604491553.dbf 归档文件,根据前面的告警日志,可以知道实际上 87749 这个重做日志是当前日志,还没有归档,自然找不到。ORA-1547错误表示恢复已经完成,但是 OPEN RESETLOGS 的时候仍然要遇到错误。

代码语言:javascript
复制
Fri Jan 11 15:26:56 2013
alter database mount
Fri Jan 11 15:27:00 2013
Setting recovery target incarnation to 2
Fri Jan 11 15:27:00 2013
Successful mount of redo thread 1, with mount id 560899584
Fri Jan 11 15:27:00 2013
Database mounted in Exclusive Mode
Completed: alter database mount
Fri Jan 11 15:27:10 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Media Recovery Start
 parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER CANCEL 
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...

DBA忽略了这个错误,尝试将数据库打开,很显然会遇到O RA-1589 错误,之后又尝试用 NORESTLOGS 方式 OPEN 数据库,这也很显然会遇到 ORA-1588 错误。不完全恢复的数据库必须要以 RESETLOGS 方式打开。

代码语言:javascript
复制
Fri Jan 11 15:29:52 2013
alter database open
Fri Jan 11 15:29:52 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:30:11 2013
alter database open NORESETLOGS
Fri Jan 11 15:30:11 2013
ORA-1588 signalled during: alter database open NORESETLOGS

之后, DBA 作了一个艰难的决定,再次关闭并重启了数据库。又再次尝试相同的 OPEN 步骤。当然,Oracle 也给与了相同的报错。数据库仍然无法打开。至此,数据库无法提供服务已经1个多小时。

代码语言:javascript
复制
Fri Jan 11 15:30:42 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:30:42 2013
Stopping background process CJQ0
Fri Jan 11 15:30:42 2013
Stopping background process MMNL
Fri Jan 11 15:30:43 2013
Stopping background process MMON
Fri Jan 11 15:30:44 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:30:59 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:31:08 2013
ALTER DATABASE OPEN
ORA-1589 signalled during: ALTER DATABASE OPEN...
Fri Jan 11 15:31:28 2013
alter database open NORESETLOGS
Fri Jan 11 15:31:28 2013
ORA-1588 signalled during: alter database open NORESETLOGS...
Fri Jan 11 15:31:41 2013
alter database open RESETLOGS
Fri Jan 11 15:31:41 2013
ORA-1122 signalled during: alter database open RESETLOGS...

再接下来,是一团混乱, DBA 多次重启数据库,尝试了多种恢复手段。offline数据文件,recover 数据文件,recover 数据库,online 数据文件,再recover,再 offline,再 open,但是一切尝试都是徒劳的。一直到晚上18:35,在数据库宕机4个多小时以后,开始求助我们帮助其恢复数据库。

代码语言:javascript
复制
Fri Jan 11 15:41:28 2013
alter database datafile '/data2/DECTR_HIS2.dbf' offline
Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline
Fri Jan 11 15:41:35 2013
alter database open
Fri Jan 11 15:41:35 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:42:20 2013
alter database  open resetlogs
Fri Jan 11 15:42:20 2013
ORA-1245 signalled during: alter database  open resetlogs...
Fri Jan 11 15:43:40 2013
ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'  
Fri Jan 11 15:43:40 2013
Media Recovery Start
Fri Jan 11 15:43:40 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'  ...
Fri Jan 11 15:46:09 2013
ALTER DATABASE RECOVER  datafile 10  
Fri Jan 11 15:46:09 2013
Media Recovery Start
Fri Jan 11 15:46:09 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 10  ...
……
Fri Jan 11 16:37:51 2013
ALTER DATABASE RECOVER  database  
Fri Jan 11 16:37:51 2013
Media Recovery Start
Fri Jan 11 16:37:51 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  database  ...
Fri Jan 11 16:39:29 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 16:39:29 2013
Media Recovery Start
 parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 16:39:43 2013
ALTER DATABASE RECOVER    CANCEL  
Fri Jan 11 16:39:44 2013
ORA-1547 signalled during: ALTER DATABASE RECOVER    CANCEL  ...
Fri Jan 11 16:39:44 2013
ALTER DATABASE RECOVER CANCEL 
ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ...
Fri Jan 11 16:40:15 2013
alter database datafile 10 online
Fri Jan 11 16:40:15 2013
Completed: alter database datafile 10 online
Fri Jan 11 16:40:25 2013
alter database datafile 31 online
Completed: alter database datafile 31 online
Fri Jan 11 16:40:47 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 16:40:47 2013
Media Recovery Start
Fri Jan 11 16:40:47 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 16:47:12 2013
WARNING: inbound connection timed out (ORA-3136)
Fri Jan 11 17:44:47 2013
ALTER DATABASE RECOVER  datafile 10  
Fri Jan 11 17:44:47 2013
Media Recovery Start
Fri Jan 11 17:44:47 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 10  ...
Fri Jan 11 17:45:19 2013
ALTER DATABASE RECOVER  database until cancel using backup controlfile  
Fri Jan 11 17:45:19 2013
Media Recovery Start
Fri Jan 11 17:45:19 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER  database until cancel using backup controlfile  ...
Fri Jan 11 17:46:39 2013
alter database datafile 10 offline
Fri Jan 11 17:46:40 2013
Completed: alter database datafile 10 offline
Fri Jan 11 17:47:18 2013
ALTER DATABASE RECOVER  database until cancel  
Fri Jan 11 17:47:18 2013
Media Recovery Start
Fri Jan 11 17:47:18 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  database until cancel  ...
Fri Jan 11 18:11:31 2013
alter database open
Fri Jan 11 18:11:31 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 18:35:29 2013
Starting ORACLE instance (normal)
Fri Jan 11 18:35:43 2013
alter database open
Fri Jan 11 18:35:43 2013
ORA-1589 signalled during: alter database open...

这是一个没有备份的数据库,实际上如果是存储字系统的问题导致了数据文件损坏,那么可能与DBA的关系并不大,但是在经过一下午的折腾,将一个其实仅仅是坏了2个数据文件而可以轻松 OPEN 的数据库恢复到无论如何也无法轻易打开的状态,这就与 DBA 有很大的关系了。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2016-03-05,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 数据和云 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
对象存储
对象存储(Cloud Object Storage,COS)是由腾讯云推出的无目录层次结构、无数据格式限制,可容纳海量数据且支持 HTTP/HTTPS 协议访问的分布式存储服务。腾讯云 COS 的存储桶空间无容量上限,无需分区管理,适用于 CDN 数据分发、数据万象处理或大数据计算与分析的数据湖等多种场景。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档