第一篇的内容可以参考。ASM无法启动的问题分析(一),有不少的朋友给了一些建议,我也糅合了进来。一并感谢。
当然重启服务发现CSSD服务是Online,但是ASM是无法启动。
[grid@testbiadmin]$ crs_stat -t Name Type Target State Host ------------------------------------------------------------ ora.DATA01.dg ora....up.type ONLINE ONLINE rolequery ora.FLASH01.dgora....up.type ONLINE ONLINE rolequery ora....ER.lsnrora....er.type ONLINE ONLINE rolequery ora.asm ora.asm.type ONLINE OFFLINE ora.cssd ora.cssd.type ONLINE ONLINE rolequery ora.testbi.db ora....se.type ONLINE OFFLINE ora.testbi1.dbora....se.type OFFLINE OFFLINE ora.diskmon ora....on.type OFFLINE OFFLINE ora.evmd ora.evm.type ONLINE OFFLINE ora.ons ora.ons.type OFFLINE OFFLINE ora....rydb.dbora....se.type ONLINE OFFLINE
手工启动ASM的服务也是报错
[grid@testbiadmin]$ srvctl start asm PRCR-1079: Failed to start resource ora.asm CRS-5017:The resource action "ora.asm start" encountered the following error: ORA-01078:failure in processing system parameters ORA-29701:unable to connect to Cluster Synchronization Service .For details refer to "(:CLSN00107:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log". CRS-2674:Start of 'ora.asm' on 'rolequery' failed
对应的trace日志也没有发现特别的异常之处。
2016-01-2221:36:46.513: [ora.DATA01.dg][1444443904] {0:10:2} [check] abort command: check 2016-01-2221:36:46.513: [ora.FLASH01.dg][1442342656] {0:10:2} [check] abort command:check 2016-01-2221:36:46.513: [ora.DATA01.dg][1444443904] {0:10:2} [check] tryActionLock { 2016-01-2221:36:46.513: [ora.FLASH01.dg][1442342656] {0:10:2} [check] tryActionLock { 2016-01-2221:36:48.506: [ora.DATA01.dg][1457051392] {0:10:2} [check] DgpAgent::getConnxnabort 2016-01-2221:36:48.506: [ AGENT][1457051392]{0:10:2} UserErrorException: Locale is 2016-01-2221:36:48.506: [ora.DATA01.dg][1457051392] {0:10:2} [check] clsnUtils::errorException type=2 string= CRS-5017:The resource action "ora.DATA01.dg check" encountered the followingerror: DgpAgent::getConnxnaborted. For details refer to "(:CLSN00109:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log". 2016-01-2221:36:48.506: [ora.FLASH01.dg][1452848896] {0:10:2} [check] DgpAgent::getConnxnabort 2016-01-2221:36:48.506: [ AGENT][1452848896]{0:10:2} UserErrorException: Locale is 2016-01-2221:36:48.507: [ora.FLASH01.dg][1452848896] {0:10:2} [check] clsnUtils::errorException type=2 string= CRS-5017:The resource action "ora.FLASH01.dg check" encountered the followingerror: DgpAgent::getConnxnaborted. For details refer to "(:CLSN00109:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log". 2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check] LsnrAgent::check{ 2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check] lsnrctl statusLISTENER 2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check]getOracleHomeAttrib: oracle_home = /home/U01/app/grid/11.2.3 2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check]getOracleHomeAttrib: oracle_home = /home/U01/app/grid/11.2.3 2016-01-2221:36:48.508: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check]Utils::getCrsHome crsHome /home/U01/app/grid/11.2.3 2016-01-2221:36:48.508: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check] Utils:execCmdaction = 3 flags = 38 ohome = (null) cmdname = lsnrctl.
不过看起来似乎是在LISTENER的部分出现了问题。
查看grid中的listener配置,发现ASM实例的状态是UNKNOWN,这个着实有些奇怪。
LSNRCTL>status Connectingto (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521))) ListenerParameter File /home/U01/app/grid/11.2.3/network/admin/listener.ora ListenerLog File /home/U01/app/grid/diag/tnslsnr/rolequery/listener/alert/log.xml ListeningEndpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521))) (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rolequery.test.com)(PORT=1521))) ServicesSummary... Service"+ASM" has 1 instance(s). Instance "+ASM", status UNKNOWN,has 1 handler(s) for this service... Service"testbi" has 1 instance(s). Instance "testbi", status UNKNOWN,has 1 handler(s) for this service... Thecommand completed successfully LSNRCTL>service Connectingto (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521))) ServicesSummary... Service"+ASM" has 1 instance(s). Instance "+ASM", status UNKNOWN,has 1 handler(s) for this service... Handler(s): "DEDICATED" established:494refused:0 LOCAL SERVER Service"testbi" has 1 instance(s). Instance "testbi", status UNKNOWN,has 1 handler(s) for this service... Handler(s): "DEDICATED" established:6069refused:748 LOCAL SERVER Thecommand completed successfully
这个时候开始查看从什么时候开始出现这个ASM的报错信息,看看系统层面有什么特别的改动。
ThuJan 14 17:03:00 2016 LNS:Standby redo logfile selected for thread 1 sequence 544 for destinationLOG_ARCHIVE_DEST_3 ArchivedLog entry 1261 added for thread 1 sequence 543 ID 0xb380aad0 dest 1: ThuJan 14 17:46:18 2016 ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 16626 ThuJan 14 17:48:18 2016 ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 16847 ThuJan 14 17:50:08 2016 ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 16888
可以看到是在1月14日 开始出现这个问题。
那么1月14日还有什么变更呢,系统级响应的配置也非常有限了。/etc/hosts是其中一个,目前发现最有可疑,因为时间点比较接近,当然也有一些疑点需要说服自己。
[oracle@testbi~]$ ll /etc/hosts -rw-r--r--1 root root 921 Jan 14 14:28 /etc/hosts
那我们来看看/etc/hosts,到底有什么改动。当然重点是下面的部分。
[grid@rolequery~]$ cat /etc/hosts 127.0.0.1 localhost localhost.localdomain localhost4localhost4.localdomain4 ::1 localhost localhost.localdomain localhost6localhost6.localdomain6 10.11.64.174 dbbak_yq_10.11.64.174_tel # db_ccs 10.11.1.26 testmon.test.com 10.11.54.190 testmon.test.tc 192.168.97.143 xxx.no.sohu.com 10.11.2.86 testmon.test.com 10.11.64.28 testbi.test.com 10.11.64.28 rolequery.test.com 10.11.0.68 s2testbi.test.com
从配置里面看给这个服务器配置了多个主机名。看起来这个问题最可疑,可以简单验证一下是否是这样。把多余的那个主机名给注释掉。#10.11.64.28 testbi.test.com
然后重启cssd的服务,虽然报错,但是似乎有了reload的功效。
[grid@rolequery~]$ crs_start ora.cssd CRS-5702:Resource 'ora.cssd' is already running on 'rolequery' CRS-0223:Resource 'ora.cssd' has placement error.
再次启动asm就成功了,看来还真是这个主机名的配置导致的问题。
[grid@rolequery~]$ crs_start ora.asm Attemptingto start `ora.asm` on member `rolequery` Startof `ora.asm` on member `rolequery` succeeded.
所以通过上面的情况就可以基本判定就是这个主机名导致的问题了,在这个基础上数据库实例启动就很顺利。
使用kfod查看asm中的文件情况,这个时候都是正常的了。
[grid@rolequery~]$ kfod asm_diskstring='/dev/asm-sda9' disk=all -------------------------------------------------------------------------------- Disk Size Path User Group ================================================================================ 1: 153605 Mb /dev/asm-sda9 grid asmadmin -------------------------------------------------------------------------------- ORACLE_SIDORACLE_HOME ================================================================================ +ASM /home/U01/app/grid/11.2.3
这个时候有一个疑问是为什么1月14号的问题,到了21号才开始出现发现真正有问题。
所以还是需要说服自己。
我们可以从alert日志看出来,其实后台是在强制终止了一些进程,那么这些进程的情况是怎么样的呢。
得到了问题时间段的一些日志情况,可以看到大多数都是j000和m000的进程。会有个别的ora进程会被终止,但是数量很少。对于统计的系统来说还是很难发现的了。
-rw-r----- 1 oracle asmadmin 3252 Jan 14 23:58 testbi_j000_12684.trc -rw-r-----1 oracle asmadmin 73 Jan 15 00:00 testbi_m000_12855.trm -rw-r-----1 oracle asmadmin 3163 Jan 15 00:00 testbi_m000_12855.trc -rw-r-----1 oracle asmadmin 73 Jan 15 00:00 testbi_m001_12853.trm -rw-r-----1 oracle asmadmin 3529 Jan 15 00:00 testbi_m001_12853.trc -rw-r-----1 oracle asmadmin 84 Jan 15 00:00 testbi_j000_12874.trm -rw-r-----1 oracle asmadmin 3254 Jan 15 00:00 testbi_j000_12874.trc -rw-r-----1 oracle asmadmin 73 Jan 15 00:01 testbi_ora_12934.trm -rw-r-----1 oracle asmadmin 1339 Jan 15 00:01 testbi_ora_12934.trc -rw-r-----1 oracle asmadmin 73 Jan 15 00:01 testbi_ora_12937.trm -rw-r-----1 oracle asmadmin 1325 Jan 15 00:01 testbi_ora_12937.trc -rw-r-----1 oracle asmadmin 73 Jan 15 00:02 testbi_j000_13142.trm -rw-r-----1 oracle asmadmin 3255 Jan 15 00:02 testbi_j000_13142.trc -rw-r-----1 oracle asmadmin 73 Jan 15 00:04 testbi_j000_13201.trm
具体的日志情况如下。对于j000的进程,对于的日志如下:
testbi_j000_12874.trc kgxgncin:CLSS init failed with status 3 kgxgncin:return status 3 (1311719766 SKGXN not av) from CLSS NOTE:kfmsInit: ASM failed to initialize group services ErrorORA-29701 signaled atksedsts()+461<-ksf_short_stack()+77<-kge_snap_callstack()+63<-kge_sigtrace_dump()+69<-kgepop()+750<-kgeselv()+290<-ksesecl0()+162<-kfmsInit()+175<-kfmsSlvReg()+1142<-kfmdSlvOpPriv()+2207<-kfmdWriteSubmitted()+414<-kfk_process_an_ioq()+210<-kfk_submit_io()+74<-kfk_io1()+1089<-kfkRequest()+14<-kfk_transitIO()+1696<-kfioSubmitIO()+4740<-kfioRequestPriv()+182<-kfioRequest()+706<-ksfd_kfioRequest()+649<-ksfd_osmgo()+256<-ksfdgo()+861<-ksfdaio()+2504<-kcflbi()+908<-kcbldio()+3036<-kcblio()+318<-kdblsync()+436<-kdblcmtt()+3216<-klclisync()+256<-qerltFetch()+799<-insdlexe()+409<-insExecStmtExecIniEngine()+82<-insexe()+1845<-qes3tExecSQL()+710<-qerleStart()+176<-rwsstd()+249<-qerjoStart()+577<-selexe0()+818<-opiexe()+5726<-opipls()+2135<-opiodr()+916<-rpidrus()+211<-skgmstack()+148<-rpiswu2()+638<-rpidrv()+1384<-psddr0()+473<-psdnal()+457<-pevm_EXIM()+308<-pfrinstr_EXIM()+53<-pfrrun_no_tool()+63<-pfrrun()+627<-plsql_run()+649<-peicnt()+301<-kkxexe()+525<-opiexe()+17677<-opiodr()+916<-rpidrus()+211<-skgmstack()+148<- ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 12874 -----Abridged Call Stack Trace ----- ksedsts()+461<-kfk_io1()+2863<-kfkRequest()+14<-kfk_transitIO()+1696<-kfioSubmitIO()+4740<-kfioRequestPriv()+182<-kfioRequest()+706<-ksfd_kfioRequest()+649<-ksfd_osmgo()+256<-ksfdgo()+861<-ksfdaio()+2504<-kcflbi()+908<-kcbldio()+3036<-kcblio()+318<-kdblsync()+436 <-kdblcmtt()+3216<-klclisync()+256<-qerltFetch()+799<-insdlexe()+409<-insExecStmtExecIniEngine()+82<-insexe()+1845<-qes3tExecSQL()+710<-qerleStart()+176<-rwsstd()+249<-qerjoStart()+577<-selexe0()+818<-opiexe()+5726<-opipls()+2135<-opiodr()+916<-rpidrus()+211 <-skgmstack()+148<-rpiswu2()+638<-rpidrv()+1384<-psddr0()+473<-psdnal()+457<-pevm_EXIM()+308<-pfrinstr_EXIM()+53<-pfrrun_no_tool()+63<-pfrrun()+627<-plsql_run()+649<-peicnt()+301<-kkxexe()+525<-opiexe()+17677<-opiodr()+916<-rpidrus()+211<-skgmstack()+148<-rpiswu2()+638 <-rpidrv()+1384<-rpiexe()+74<-kkjex1e()+8650<-kkjsexe()+711<-kkjrdp()+694<-opirip()+958<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36 -----End of Abridged Call Stack Trace -----
对于应用连接进程,仔细查看发现竟然是rman的进程,所以也可以排除。
$less testbi_ora_12934.trc Tracefile /home/U01/app/oracle/diag/rdbms/testbi/testbi/trace/testbi_ora_12934.trc OracleDatabase 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production Withthe Partitioning, Automatic Storage Management, OLAP, Data Mining andReal Application Testing options ORACLE_HOME= /home/U01/app/oracle/product/11.2.3/db_1 Systemname: Linux Nodename: testbi.test.com Release: 2.6.32-279.el6.x86_64 Version: #1 SMP Wed Jun 13 18:24:36 EDT 2012 Machine: x86_64 Instancename: testbi Redothread mounted by this instance: 1 Oracleprocess number: 55 Unixprocess pid: 12934, image: oracle@testbi.test.com (TNS V1-V3) ***2016-01-15 00:01:02.534 ***SESSION ID:(1988.8917) 2016-01-15 00:01:02.534 ***CLIENT ID:() 2016-01-15 00:01:02.534 ***SERVICE NAME:(SYS$USERS) 2016-01-15 00:01:02.534 ***MODULE NAME:(rman@testbi.test.com (TNS V1-V3)) 2016-01-15 00:01:02.534 ***ACTION NAME:(0000001 FINISHED70) 2016-01-15 00:01:02.534
最后好不容易找到一个JDBC连接的进程trace日志,已经是几天以后的一个日志了,所以这个问题还是比较难发现的,当然到了21号开始,因为后期有一些大的推广活动,大批量的应用要连进来,没想到在这个时候就出现了问题。
[oracle@testbitrace]$ cat testbi_ora_28550.trc Tracefile /home/U01/app/oracle/diag/rdbms/testbi/testbi/trace/testbi_ora_28550.trc OracleDatabase 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production Withthe Partitioning, Automatic Storage Management, OLAP, Data Mining andReal Application Testing options ORACLE_HOME= /home/U01/app/oracle/product/11.2.3/db_1 Systemname: Linux Nodename: testbi.test.com Release: 2.6.32-279.el6.x86_64 Version: #1 SMP Wed Jun 13 18:24:36 EDT 2012 Machine: x86_64 Instancename: testbi Redothread mounted by this instance: 1 Oracleprocess number: 45 Unixprocess pid: 28550, image: oracle@testbi.test.com ***2016-01-16 00:52:34.100 ***SESSION ID:(3687.4155) 2016-01-16 00:52:34.100 ***CLIENT ID:() 2016-01-16 00:52:34.100 ***SERVICE NAME:(SYS$USERS) 2016-01-16 00:52:34.100 ***MODULE NAME:(JDBC Thin Client) 2016-01-16 00:52:34.100 ***ACTION NAME:() 2016-01-16 00:52:34.100 CursorDiagnosticsNodes: ChildNode: ChildNumber=3 ID=34 reason=Rolling Invalidate Window Exceeded(2)size=0x0 details=already_processed ChildNode: ChildNumber=4 ID=34 reason=Rolling Invalidate Window Exceeded(3)size=2x4 invalidation_window=1452700223 ksugctm=1452700354 ChildNode: ChildNumber=1 ID=34 reason=Rolling Invalidate Window Exceeded(2)size=0x0 details=already_processed ChildNode: ChildNumber=3 ID=34 reason=Rolling Invalidate Window Exceeded(3)size=2x4 invalidation_window=1452620854 ksugctm=1452621139 ChildNode: ChildNumber=5 ID=34 reason=Rolling Invalidate Window Exceeded(2)size=0x0 details=already_processed
当然对于这个问题,尽管是找到了原因是主机名,但是还是带有一定的运气。怎么来验证这个问题呢。可以简单做个测试。我们创建一个数据文件,当然这个时候肯定是没有问题的。
在修复之后的$ sqlplus / assysdba
SQL>create tablespace testdata datafile size 10M; Tablespacecreated.
我们修改主机名,这个时候/etc/hosts里面就有了两个主机名的配置信息。然后修改主机名
[root@testbidev]# hostname testbi.test.com [root@testbidev]# su - oracle [ [oracle@testbi~]$ sqlplus / as sysdba SQL>create tablespace testdata2 datafile size 10M; createtablespace testdata2 datafile size 10M * ERRORat line 1: ORA-03113:end-of-file on communication channel ProcessID: 21248 SessionID: 2266 Serial number: 13
然后查看alert日志发现了期望之中的ASM报错日志。
VKRMstarted with pid=37, OS id=20767 SatJan 23 19:08:09 2016 createtablespace testdata datafile size 10M Completed:create tablespace testdata datafile size 10M SatJan 23 19:09:07 2016 createtablespace testdata2 datafile size 10M ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 21248 SatJan 23 19:09:33 2016 Startingbackground process SMCO SatJan 23 19:09:33 2016 SMCOstarted with pid=24, OS id=21280
当然这些思路也参考了一些材料,短短几个小时,其实花费的时间其实也不多。但是解决了这个问题,就很释然了。至少可以很理性的规避这个问题了。
当然对于这个问题的追溯其实也是一个画蛇添足的操作,比如原来有一台服务器A,主机名为rolequery,但是后面移作他用,但是主机名一直没有改动,最后为了规范,想改动一下,就配置了两个主机名,对于备库来说还是能够识别的,结果在主库这么做就会有一些奇怪的问题,当然在线修改主机名还是一个敏感的操作,尽量避免,而且需要综合评估。
参考资料如下:
HighOCSSD CPU Usage/ ERROR: unrecoverable error ORA-29701 raised in ASM I/O path;terminating process {nnnnn} (Doc ID 1491852.1)
ASMInstance Cannot Start Due To ORA-29701 Standalone (non-Rac) (Doc ID1489552.1)
Oracle Restart: ORA-29701 raised in ASM I/Opath; terminating process, Permission denied (13) (Doc ID 1328629.1)
http://www.freelists.org/post/oracle-l/unrecoverable-error-ORA29701-raised-in-ASM-IO-path-terminating-process
http://database.com.mk/wordpress/2011/02/19/error-unrecoverable-error-ora-29701-raised-in-asm-io-path-when-using-goldengate-replicat-process/
http://blog.csdn.net/rgb_rgb/article/details/7937928
本文分享自微信公众号 - 杨建荣的学习笔记(jianrong-notes),作者:杨建荣
原文出处及转载信息见文内详细说明,如有侵权,请联系 yunjia_community@tencent.com 删除。
原始发表时间:2016-01-24
本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。
我来说两句