ORA-16191?又是密码惹的祸?

DataGuard是甲骨文推出的一种高可用性数据库方案,它是在主节点与备用节点间通过日志同步来保证数据的同步,可以实现数据库快速切换与灾难性恢复,因此被不少企业用作数据容灾解决方案。日常维护Dataguard过程中,我们也会遇到各种各样的问题。例如这个ORA-16191,你可能以为又是密码问题。但实际上,到底是不是呢?

背景

就在上周,我们的一个生产Dataguard报错,主库两个RAC节点的alert log中每隔一分钟就会报一次ORA-16191。

大家可关注下两个节点的报错开始时间,后面会有分析。

主库节点一:

Thu Jul 26 01:21:29 2018

Error 1017 received logging on to the standby

------------------------------------------------------------

Check that the primary and standby are using a password file

and remote_login_passwordfile is set to SHARED or EXCLUSIVE,

and that the SYS password is same in the password files.

returning error ORA-16191

------------------------------------------------------------

FAL[server, ARC6]: Error 16191 creating remote archivelog file 'PRODX'

FAL[server, ARC6]: FAL archive failed, see trace file.

ARCH: FAL archive failed. Archiver continuing

ORACLE Instance PROD1 - Archival Error. Archiver continuing.

主库节点二:

Thu Jul 26 10:00:32 2018

Error 1017 received logging on to the standby

------------------------------------------------------------

Check that the primary and standby are using a password file

and remote_login_passwordfile is set to SHARED or EXCLUSIVE,

and that the SYS password is same in the password files.

returning error ORA-16191

------------------------------------------------------------

FAL[server, ARC4]: Error 16191 creating remote archivelog file 'PRODX'

FAL[server, ARC4]: FAL archive failed, see trace file.

ARCH: FAL archive failed. Archiver continuing

ORACLE Instance PROD2 - Archival Error. Archiver continuing.

同时,备库不再有日志的接收和APPLY。

原因分析及尝试

查看ORA-16191报错原因

错误描述很清楚,由于密码或者密码文件问题导致日志不能传输到备库。可是,我们并没有修改密码呀?什么原因呢?

先改个密码试试。

主库节点一使用orapwd修改sys密码

$cd $ORACLE_HOME/dbs

$ orapwd file=orapwPROD1

将新生成的密码文件拷贝到主库节点二和备库,并修改文件名

看看节点一告警日志,是否还有报错

还是报错

那就不是密码问题

尝试重启下备库

shutdown immediate

startup mount

alter database open

alter database recover managed standby database using current logfile disconnect from session;

继续报错

再试试将主节点传输改成defer再enable?

alter system set LOG_ARCHIVE_DEST_STATE_2=defer;

alter system set LOG_ARCHIVE_DEST_STATE_2=enable;

报错依旧

图片来源:123rf.com.cn

MOS来帮忙

还是上MOS搜一把。搜搜搜,突然眼前一亮

ORA-16191 ORA-01017 and ORA-16000 shipping the logs to Standby database (文档ID 2129339.1)

正好这个环境这几天加了个登录审计的trigger,难道真是这个trigger影响?

怪不得刚刚重启备库的时候,备库alertlog中有ORA-16000: database open for read-only access报错。我还纳闷本来就是read only,什么东东在写库。

Thu Jul 26 17:52:53 2018

Errors in file /oracle/diag/rdbms/PRODx/PROD1/trace/PROD1_ora_8978826.trc:

ORA-00604: error occurred at recursive SQL level 1

ORA-16000: database open for read-only access

ORA-06512: at line 29

试试先。

停备库,在参数文件中增加_system_trig_enabled=false。起备库。

startup mount

alter database open

alter database recover managed standby database using current logfile disconnect from session;

查看alertlog,日志已经开始传输了。

Fri Jul 27 09:43:05 2018

alter database recover managed standby database using current logfile disconnect from session

Attempt to start background Managed Standby Recovery process (PROD1)

Fri Jul 27 09:43:05 2018

MRP0 started with pid=62, OS id=11796536

MRP0: Background Managed Standby Recovery process started (PROD1)

Fri Jul 27 09:43:10 2018

Archived Log entry 6489 added for thread 1 sequence 76934 rlc 801185643 ID 0xf2d8f210 dest 2:

RFS[6]: Opened log for thread 1 sequence 76949 dbid -395156247 branch 801185643

started logmerger process

Fri Jul 27 09:43:10 2018

Managed Standby Recovery starting Real Time Apply

问题解决。

图片来源:123rf.com.cn

深度分析

还记得背景分析那部分让大家关注报错时间么?有没有发现主库节点一和节点二的报错时间并不一致?

主库节点一:

Thu Jul 26 01:21:29 2018

主库节点二:

Thu Jul 26 10:00:32 2018

查看主库trigger创建时间

竟然跟上面两个时间都不一致?到底是为什么呢?

SR工程师给了如下解释:

From primary alert logs, we can see before it reported Error 1017. They were using standby redo log to do real time apply, but when applying service was not able to keep up with redo rate, they automatically went to the archive log files to do recovery on standby. Then this time they found issue to transport the archive logs to standby and reported error. Please refer to doc "Data Guard Real-Time Apply FAQ ( Doc ID 828274.1 )" for more details.

===============================================

What will happen if Log apply Service cannot cope with the Real Time Apply ?

If for some Reason the Apply Service is unable to keep up with the Redo Rate then the Apply Service it will

automatically go to the Archive log Files as needed.

However, the Apply Service will automatically resume Real-Time Apply (reading the Standby Redo log Files) as soon as the Problem is resolved

===============================================

因为我们主备库之间同步用的是“standby redo log to do real time apply”,只有在备库apply赶不上redo的速度的时候,才会去找归档日志,此时才会触发ORA-16191的报错,所以才造成了上面三个节点报错时间不一致。

图片来源:123rf.com.cn

---END---

  • 发表于:
  • 原文链接https://kuaibao.qq.com/s/20180801G18D6500?refer=cp_1026
  • 腾讯「云+社区」是腾讯内容开放平台帐号(企鹅号)传播渠道之一,根据《腾讯内容开放平台服务协议》转载发布内容。
  • 如有侵权,请联系 yunjia_community@tencent.com 删除。

扫码关注云+社区

领取腾讯云代金券