由报警邮件分析发现的备库oracle bug(r7笔记第12天)

jeanron100

发布于 2018-03-16 16:50:18

6940

发布于 2018-03-16 16:50:18

文章被收录于专栏：杨建荣的学习笔记

昨天到公司之后，收到两份封报警邮件，可以看到在早晨6:30左右主库的v$dataguard_status检查时发现了一个错误。然后再2分钟后就自动恢复了。

一般这种问题很自然想到可能是网络出现了一些问题。因为自动恢复了，所以也不同太着急处理，于是细细看了下。

报警邮件如下：

ZABBIX-监控系统:

------------------------------------

报警内容: DG_issue

------------------------------------

报警级别: PROBLEM

------------------------------------

监控项目: dg_issue：2015-11-06 06:30:06.0Log Transport ServicesErrorPING[ARC0]: Error 1089 when pinging standby (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=test.cyou.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=stest2_XPT)(INSTANCE_NAME=test)(SERVER=dedicated))).

------------------------------------

报警时间:2015.11.06-06:31:05

报警恢复邮件如下:

ZABBIX-监控系统:

-------------------------------

监控项目: dg_issue_none

------------------------------------

主机名称：ora_test_p_yangjr@10.127.2.132

------------------------------------

恢复时间:2015.11.06-06:33:07

这是套一主两备的10gR2的环境，初步感觉是不是备库的连接数超出了。查看了资源使用情况，发现session数也都少得很。

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- --------------------
processes                                       37              40       1000                 1000
sessions                                        43              46       3000                 3000
enqueue_locks                                   61              98      32860                32860
enqueue_resources                               46              46      13420            UNLIMITED

查看process确实是1000个，还远没有达到。

idle> show parameter process
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes                      integer     0
db_writer_processes                  integer     2
gcs_server_processes                 integer     0
job_queue_processes                  integer     10
log_archive_max_processes            integer     2
processes                            integer     1000

这个时候直接到日志目录下，看到产生了大量的日志文件。日志文件似乎有并行的迹象。

-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p014_10824.trc
-rw-r----- 1 oracle oinstall       912 Nov  6 02:50 test_p013_10822.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p012_10820.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p011_10818.trc
-rw-r----- 1 oracle oinstall       912 Nov  6 02:50 test_p010_10816.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p009_10814.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p008_10812.trc
-rw-r----- 1 oracle oinstall       912 Nov  6 02:50 test_p007_10810.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p006_10808.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p005_10806.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p004_10804.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p003_10802.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p002_10800.trc
-rw-r----- 1 oracle oinstall       912 Nov  6 02:50 test_p001_10798.trc
-rw-r----- 1 oracle oinstall       913 Nov  6 02:50 test_p000_10796.trc
-rw-r----- 1 oracle oinstall      3005 Nov  6 02:50 test_mrp0_10794.trc
-rw-r----- 1 oracle oinstall   1013705 Nov  6 06:29 test_mmon_26762.trc
-rw-r----- 1 oracle oinstall     30361 Nov  6 06:30 test_rsm0_28626.trc
-rw-r----- 1 oracle oinstall     25111 Nov  6 06:30 test_dmon_26766.trc
-rw-r----- 1 oracle oinstall 565972219 Nov  6 06:30 drctest.log
-rw-r----- 1 oracle oinstall      1876 Nov  6 06:31 test_mrp0_11131.trc
-rw-r----- 1 oracle oinstall    106067 Nov  6 06:31 alert_test.log

这个时候有几个问题，一个是为什么会产生如此多的并行，但是通过资源的监控视图中没有看到。

另外一个就是为什么偏偏是6:30开始报错。

最后一个是一主两备，一个备库有报错，另外一个却没有。

首先排除资源的使用情况，这个库是出现了大量的看似并行的日志，我们来看看并行日志怎么说。随便抓取一个。发现应该没有关系，里面还有slave的字样。

Redo thread mounted by this instance: 1
Oracle process number: 29
Unix process pid: 19714, image: oracle@test.test.com (P007)
*** 2015-11-06 15:29:13.896
*** SERVICE NAME:() 2015-11-06 15:29:13.896
*** SESSION ID:(2972.1) 2015-11-06 15:29:13.896
Slave exiting with ORA-10388 exception
KCBR: Number of read descriptors = 1024
KCBR: Media recovery blocks read (ASYNC) = 513
KCBR: Redo cache copies/changes = 4284/4284
KCBR: Influx buffers flushed = 8 times
KCBR: Reads = 31 reap (29 no-op, 0 one), 30 all

那么为什么是6:30起开始报错呢，一种可能就是刚好在哪个时间点触发了。另外一种可能就是在特定的时间里有了触发条件。

这个一种很自然的想法就是crontab了。

查看crontab的内容，会发现两行内容

50 2 * * * . $HOME/.profile;$HOME/dbadmin/scripts/open_standby.sh

30 6 * * * . $HOME/.profile;$HOME/dbadmin/scripts/close_standby.sh

脚本内容如下：

$ less $HOME/dbadmin/scripts/open_standby.sh
#!/bin/bash
cd ~
. ~oracle/.bash_profile
home=~oracle
dgmgrl / "edit database stest2 set state='READ-ONLY'"
  if [ $? -ne 0 ]; then
  echo "set read-only failed!"
  exit 1
  fi
$ less $HOME/dbadmin/scripts/close_standby.sh
#!/bin/bash
cd ~
. ~oracle/.bash_profile
home=~oracle
dgmgrl / "edit database stest2 set state='ONLINE'"
  if [ $? -ne 0 ]; then
  echo "set read-only failed!"
  exit 1
  fi

这个时候查看数据库日志发现里面其实已经有不少的错误信息。截取一段。

RFS[112]: Archived Log: '/U01/app/oracle/flash_recovery_area/Stest2/archivelog/2015_11_06/o1_mf_1_7608_c3q98op1_.arc'
Fri Nov  6 02:51:16 2015
ksvcreate: Process(m000) creation failed
Fri Nov  6 02:52:16 2015
ksvcreate: Process(m000) creation failed
Fri Nov  6 02:53:16 2015
ksvcreate: Process(m000) creation failed
...
ksvcreate: Process(m000) creation failed
Fri Nov  6 06:26:20 2015
ksvcreate: Process(m000) creation failed
Fri Nov  6 06:27:20 2015
ksvcreate: Process(m000) creation failed
Fri Nov  6 06:28:20 2015
ksvcreate: Process(m000) creation failed
Fri Nov  6 06:29:20 2015
ksvcreate: Process(m000) creation failed

对于这个问题感觉还是有些奇怪，看起来是m000进程启动不了，那么是内核配置过低了。查看了一下，没有发现问题。

这个时候在metalink上搜寻，发现一个相符的文章。

'Ksvcreate: Process(m000) creation failed' after Standby Database Open Read Only Multiple Times (Doc ID 418553.1)

对于这个问题，oracle认为是基于一个bug。

Cause

The Cause of this Problem has been identified in Bug 5583049.

至于解决方案，也提了两个建议。

Solution
There are two Workarounds available:
    Restart the Instance..
or
     Disable ADDM - Should be re-enabled if Standby takes up the Primary Role
          * Set SGA_TARGET=0 and set shared_pool_size, db_cache_size, etc if using
            Automatic SGA Memory Management (ASMM)
          * Set STATISTICS_LEVEL=BASIC to disable statistics gathering

对于这个问题，先不说是个bug,疑问就是为什么在一个备库有，另外一个备库没有报错呢?

查看了一下，发现另外一个备库中是没有crontab的设置，一直是处于apply状态的。

这样问题似乎就清楚了，看来确实是一个bug导致的。那么这个问题既然是一个bug，为什么之前就从来没有发现过呢。

说来惭愧，自己在bdump目录下搜寻了一下，这个问题可以追溯到2013年了。看来是一个遗留问题了。

bdump]$ grep "ksvcreate: Process(m000) creation failed" alert*|uniq|less
alert_test.log:ksvcreate: Process(m000) creation failed
alert_test.log.20130701:ksvcreate: Process(m000) creation failed
alert_test.log.20131001:ksvcreate: Process(m000) creation failed
alert_test.log.20131101:ksvcreate: Process(m000) creation failed

按照这个报错的频率，基本是每天都有，那么报错的次数有多少，差不多30多万次。

$ grep -B1 "ksvcreate: Process(m000) creation failed" alert*|uniq|wc -l

316539

那么根据第一个建议，重启数据库即可解决，到底好不好呢，再来一个grep来验证一下。发现这些年来，这个备库因为各种操作，重启了也不少次

$ grep "Starting ORACLE instance (normal)" alert*
alert_test.log:Starting ORACLE instance (normal)
alert_test.log.20111201:Starting ORACLE instance (normal)
alert_test.log.20111201:Starting ORACLE instance (normal)
alert_test.log.20111201:Starting ORACLE instance (normal)
alert_test.log.20120601:Starting ORACLE instance (normal)
alert_test.log.20130901:Starting ORACLE instance (normal)
alert_test.log.20130901:Starting ORACLE instance (normal)

。。。。

可以看出，其实已经重启了很多次，但是报警还是依旧，所以说重启还是不能解决问题。

那么来看看第二个建议，就是设置sga_target=0来取消自动管理，还有addm的设置，这些还需要再评估，但我认为sga_target的影响可能更大。

所以简单来说，修复就是设置sga_target=0

如果要设置sga_target为0，则要手工指定其它内存组件的大小，那么先来看看目前的设置，作为参考。

$ sh showbuffer.sh
COMPONENT                       CURRENT_M      MIN_M      MAX_M SPECCIFIED_M LAST_OPER LAST_OPER_TYP  GRANULE_M
------------------------------ ---------- ---------- ---------- ------------ --------- ------------- ----------
shared pool                          2080       2064          0            0 IMMEDIATE GROW                  16
large pool                             16         16          0            0           STATIC                16
java pool                              16         16          0            0           STATIC                16
streams pool                           32         32          0            0           STATIC                16
DEFAULT buffer cache                11152      11152          0            0 IMMEDIATE SHRINK                16

可以看见buffer_cache大概在11g，加上shared_pool 的2G，大概修是13G左右

但是show sga的时候发现，database buffers有差不多34G左右这个就有点差别太悬殊了吧。

idle> show sga
Total System Global Area 4.8318E+10 bytes
Fixed Size                  2173768 bytes
Variable Size            1.3304E+10 bytes
Database Buffers         3.4997E+10 bytes
Redo Buffers               14602240 bytes

查看参数sga的设置大小

idle> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 45G
sga_target                           big integer 36G

发现sga_target还确实设置是36G左右，但是似乎备库上只启用了13G左右。

idle> alter system set shared_pool_size=4G scope=spfile;

System altered.

这个时候设置buffer cache为30g，改动写进spfile就报错了。

idle> alter system set db_cache_size=30G scope=spfile;

alter system set db_cache_size=30G scope=spfile

ERROR at line 1:

ORA-32017: failure in updating SPFILE

ORA-00384: Insufficient memory to grow cache

那么设置略低一些，改成25G。竟然就可以了。

idle> alter system set db_cache_size=25G scope=spfile;

System altered.

重启之后，发现buffer_cache基本在设定的范围之内。

idle> shutdown immediate
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
idle> startup nomount
ORACLE instance started.
Total System Global Area 4.8318E+10 bytes
Fixed Size                  2173768 bytes
Variable Size            2.0368E+10 bytes
Database Buffers         2.7934E+10 bytes
Redo Buffers               14602240 bytes

这个时候想改大一些还是不行。

idle> alter system set db_cache_size=30G scope=spfile;
alter system set db_cache_size=30G scope=spfile
*
ERROR at line 1:
ORA-32017: failure in updating SPFILE
ORA-00384: Insufficient memory to grow cache

关键是查看内存组件的大小使用v$dynamic_components发现buffer_cache还是11G

这种问题就有些奇怪了。查看metalink发现还有一个bug

至于这个问题为什么只启用了13G的sga，我查看主库的配置发现似乎是有关联的。

sys@test> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 13G
sga_target                           big integer 13G

这一点在metalink还没有找到相关的说明，不过要想避免这个bug，还是需要设置sga_target=0

看来在10g中还是有不少的小问题，还是需要考虑考虑升级到11g了。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2015-11-07，如有侵权请联系 cloudcommunity@tencent.com 删除

其他

本文分享自杨建荣的学习笔记微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

其他

登录后参与评论

0 条评论

热度

由报警邮件分析发现的备库oracle bug(r7笔记第12天)

由报警邮件分析发现的备库oracle bug(r7笔记第12天)

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐