RAC node fails to start with ORA-29702: the problem and analysis (day 70)

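Today I started the RAC on my virtual machines and found that one node would not come up no matter what I tried. The other node was fine.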

SQL> startup nomount 
ORA-29702: error occurred in Cluster Group Service operation

I tried crs_stat to check the state of the CRS components, and that failed as well.

-bash-4.1$ crs_stat -t 
CRS-0184: Cannot communicate with the CRS daemon.

The alert log shows that the instance was terminated at the very end of startup because of error 29702.

SMON started with pid=20, OS id=12344 
Sun May 11 04:10:28 2014 
RECO started with pid=21, OS id=12346 
Sun May 11 04:10:28 2014 
MMON started with pid=22, OS id=12348 
Sun May 11 04:10:28 2014 
MMNL started with pid=23, OS id=12350 
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... 
starting up 1 shared server(s) ... 
USER (ospid: 12242): terminating the instance due to error 29702 
Instance terminated by USER, pid = 12242

Oracle's explanation of this error is as follows.

-bash-4.1$ oerr ora 29702 
29702, 00000, "error occurred in Cluster Group Service operation" 
// *Cause: An unexpected error occurred while performing a CGS operation. 
// *Action: Verify that the LMON process is still active. 
//          Check the Oracle LMON trace files for errors. 
//          Also, check the related CSS trace file for errors.

The LMON trace file contains the following:

Trace file /u04/app/11.2.0/db/diag/rdbms/racdb/RACDB1/trace/RACDB1_lmon_12324.trc 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production 
With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP, 
Data Mining and Real Application Testing options 
ORACLE_HOME = /u04/app/11.2.0/db/product/11.2.0/dbhome_1 
System name:    Linux 
Node name:      rac1 
Release:        2.6.32-71.el6.x86_64 
Version:        #1 SMP Wed Sep 1 01:33:01 EDT 2010 
Machine:        x86_64 
VM name:        VMWare Version: 6 
Instance name: RACDB1 
Redo thread mounted by this instance: 0 <none> 
Oracle process number: 11 
Unix process pid: 12324, image: oracle@rac1 (LMON)
*** 2014-05-11 04:10:27.777 
*** SESSION ID:(130.1) 2014-05-11 04:10:27.777 
*** CLIENT ID:() 2014-05-11 04:10:27.777 
*** SERVICE NAME:() 2014-05-11 04:10:27.777 
*** MODULE NAME:() 2014-05-11 04:10:27.777 
*** ACTION NAME:() 2014-05-11 04:10:27.777 
GES resources 5720 pool 3 
GES enqueues 8361 
GES IPC: Receivers 2  Senders 2 
GES IPC: Buffers  Receive 1000  Send (i:1030 b:471) Reserve 301 
GES IPC: Msg Size  Regular 1176  Batch 8376 
Batching factor: enqueue replay 206, ack 229 
Batching factor: cache replay 128 size per lock 64
*** 2014-05-11 04:10:28.644 
kjxggin: CGS tickets = 1000 
kgxgncin: CLSS init failed with status 3 
kgxgncin: return status 3 (1311719766 SKGXN not av) from CLSS 
kjxgmin: kgxgncin fails - (2) 
kjxggin: generic group layer init fails
*** 2014-05-11 04:10:28.655 
Global Enqueue Service Shutdown
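
The interesting part of this trace is the handful of lines where the group layer fails to initialize against CLSS. On a longer trace they are easy to miss, so here is a minimal sketch of a filter for them. The function name scan_lmon_trace is hypothetical, and the patterns are only the strings seen in the trace above, not an exhaustive list of possible failures.

```shell
#!/bin/bash
# Hypothetical helper: print the CGS/CLSS initialization failures from an LMON trace.
# The patterns are copied from the trace shown in this post (an assumption that
# other occurrences of the problem log the same function names).
scan_lmon_trace() {
    # $1 is the path to the LMON trace file
    grep -E 'kgxgncin|kjxgmin|kjxggin: generic group layer' "$1"
}
```

For example, scan_lmon_trace /u04/app/11.2.0/db/diag/rdbms/racdb/RACDB1/trace/RACDB1_lmon_12324.trc would print only the four failure lines from the trace above.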

On this node, neither crs_stat nor crsctl operations got anywhere.

-bash-4.1$ crsctl check crs 
CRS-4638: Oracle High Availability Services is online 
CRS-4535: Cannot communicate with Cluster Ready Services 
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon 
CRS-4534: Cannot communicate with Event Manager
-bash-4.1$ crs_start -all 
CRS-0184: Cannot communicate with the CRS daemon.

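Checking the processes showed that ohasd and several of its agents and daemons were indeed running, although ocssd.bin, crsd.bin and evmd.bin do not appear in the list.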

-bash-4.1$ ps -ef|grep d.bin 
root      2103     1  0 May10 ?        00:00:51 /u04/app/11.2.0/grid/bin/ohasd.bin reboot 
grid      2297     1  0 May10 ?        00:00:32 /u04/app/11.2.0/grid/bin/oraagent.bin 
grid      2309     1  0 May10 ?        00:00:01 /u04/app/11.2.0/grid/bin/mdnsd.bin 
grid      2320     1  0 May10 ?        00:00:36 /u04/app/11.2.0/grid/bin/gpnpd.bin 
root      2330     1  0 May10 ?        00:00:14 /u04/app/11.2.0/grid/bin/orarootagent.bin 
grid      2333     1  0 May10 ?        00:02:39 /u04/app/11.2.0/grid/bin/gipcd.bin 
root      2348     1  1 May10 ?        00:12:00 /u04/app/11.2.0/grid/bin/osysmond.bin 
root      2569     1  0 May10 ?        00:03:55 /u04/app/11.2.0/grid/bin/ologgerd -M -d /u04/app/11.2.0/grid/crf/db/rac1 
grid     12569  9580  0 04:25 pts/1    00:00:00 grep d.bin

As root, I tried to stop CRS, but it reported errors.

[root@rac1 bin]# ./crsctl disable crs 
CRS-4621: Oracle High Availability Services autostart is disabled.
[root@rac1 bin]# ./crsctl stop crs 
CRS-2796: The command may not proceed when Cluster Ready Services is not running 
CRS-4687: Shutdown command has completed with errors. 
CRS-4000: Command Stop failed, or completed with errors.

Trying to start it again also failed.

[root@rac1 bin]# ./crsctl enable crs 
CRS-4622: Oracle High Availability Services autostart is enabled. 
[root@rac1 bin]# ./crsctl start crs 
CRS-4640: Oracle High Availability Services is already active 
CRS-4000: Command Start failed, or completed with errors.

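Eventually I found a workaround on MOS: manually kill the leftover CRS processes. In a production environment, of course, the proper fix is still to apply the PSU.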

[root@rac1 bin]# ps -fea | grep ohasd.bin | grep -v grep 
root      2103     1  0 May10 ?        00:00:52 /u04/app/11.2.0/grid/bin/ohasd.bin reboot 
[root@rac1 bin]# ps -fea | grep gipcd.bin | grep -v grep 
grid      2333     1  0 May10 ?        00:02:41 /u04/app/11.2.0/grid/bin/gipcd.bin 
[root@rac1 bin]# ps -fea | grep mdnsd.bin | grep -v grep 
grid      2309     1  0 May10 ?        00:00:01 /u04/app/11.2.0/grid/bin/mdnsd.bin 
[root@rac1 bin]# ps -fea | grep gpnpd.bin | grep -v grep 
grid      2320     1  0 May10 ?        00:00:37 /u04/app/11.2.0/grid/bin/gpnpd.bin 
[root@rac1 bin]# ps -fea | grep evmd.bin | grep -v grep 
[root@rac1 bin]# ps -fea | grep crsd.bin | grep -v grep 
[root@rac1 bin]# kill -9 2103 2333  2309 2320 
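
Collecting the PIDs one grep at a time is error-prone. As a rough sketch, the same list can be pulled out of a single ps -ef listing. The function name leftover_gi_pids is hypothetical, and the daemon list simply mirrors the processes checked above; whether the same set applies in another environment is an assumption, so review the output before passing anything to kill.

```shell
#!/bin/bash
# Hypothetical helper: read `ps -ef` output on stdin and print the PIDs of the
# leftover Grid Infrastructure daemons, one per line.
# Assumes the Linux `ps -ef` layout (UID PID PPID C STIME TTY TIME CMD), so the
# executable path is field 8 and the PID is field 2.
leftover_gi_pids() {
    awk '$8 ~ /\/(ohasd|gipcd|mdnsd|gpnpd|evmd|crsd)\.bin$/ { print $2 }'
}
```

Running ps -ef | leftover_gi_pids on this node would have listed 2103, 2309, 2320 and 2333, the same four PIDs that were killed by hand.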

Then try starting CRS again.

[root@rac1 bin]# ./crsctl start crs 
CRS-4123: Oracle High Availability Services has been started.
[root@rac1 bin]# ./crs_stat -t 
CRS-0184: Cannot communicate with the CRS daemon.

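The stack is somewhat slow to start. After waiting a little, I started the database by hand, and this time it came up without problems.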

-bash-4.1$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Sun May 11 04:41:03 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup nomount 
ORACLE instance started.
Total System Global Area  638853120 bytes 
Fixed Size                  2231072 bytes 
Variable Size             482346208 bytes 
Database Buffers          146800640 bytes 
Redo Buffers                7475200 bytes 
SQL> alter database mount;
Database altered.
SQL> alter database open;
Database altered.
SQL>
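
Because the stack needs some time after crsctl start crs, polling can replace guessing how long to wait. The sketch below is only an illustration: wait_for_crs is a hypothetical name, and it decides purely from the message texts seen earlier in this post ("is online", "Cannot communicate", "Communications failure"), which is an assumption about what a healthy check looks like.

```shell
#!/bin/bash
# Hypothetical helper: run a status command repeatedly until its output reports
# something online and contains no communication failures, or give up.
# Usage: wait_for_crs <max_tries> <sleep_seconds> <command...>
wait_for_crs() {
    max_tries="$1"; sleep_secs="$2"; shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Run the check command (for example: crsctl check crs) and capture all output.
        out=$("$@" 2>&1)
        if printf '%s\n' "$out" | grep -q 'is online' &&
           ! printf '%s\n' "$out" | grep -Eq 'Cannot communicate|Communications failure'; then
            return 0    # online lines present and no failure lines
        fi
        sleep "$sleep_secs"
        try=$((try + 1))
    done
    return 1            # still failing after max_tries attempts
}
```

A call such as wait_for_crs 30 10 crsctl check crs would check every 10 seconds for up to 30 attempts and return success once the check output is clean.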

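Checking the CRS status, everything that should be up is up. I created a small table and tested it from both nodes, with no problems. The details of the workaround are in MOS Doc ID 1233580.1.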

-bash-4.1$ crs_stat -t 
Name           Type           Target    State     Host        
------------------------------------------------------------ 
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1        
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac2        
ora.asm        ora.asm.type   OFFLINE   OFFLINE               
ora.cvu        ora.cvu.type   OFFLINE   OFFLINE               
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE               
ora....network ora....rk.type ONLINE    ONLINE    rac1        
ora.oc4j       ora.oc4j.type  OFFLINE   OFFLINE               
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1        
ora....SM1.asm application    OFFLINE   OFFLINE               
ora....C1.lsnr application    ONLINE    ONLINE    rac1        
ora.rac1.gsd   application    OFFLINE   OFFLINE               
ora.rac1.ons   application    ONLINE    ONLINE    rac1        
ora.rac1.vip   ora....t1.type ONLINE    ONLINE    rac1        
ora....SM2.asm application    OFFLINE   OFFLINE               
ora....C2.lsnr application    ONLINE    ONLINE    rac2        
ora.rac2.gsd   application    OFFLINE   OFFLINE               
ora.rac2.ons   application    ONLINE    ONLINE    rac2        
ora.rac2.vip   ora....t1.type ONLINE    ONLINE    rac2        
ora.racdb.db   ora....se.type ONLINE    ONLINE    rac2        
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac2

Originally published on the WeChat public account 杨建荣的学习笔记 (jianrong-notes)

Original publication date: 2014-05-12
