记一次数据库重启后归档急剧增加的问题(98天)

在本地的环境中测试外部表的性能,由于空间有限,不一会儿归档的空间就爆了。然后文件貌似出现了系统级的问题,刚刚rm掉的归档日志文件。隔了几秒钟再ls,就出现了。怎么删都删不掉。尝试了多次之后,无奈尝试shutdown immediate结果等了好半天还是没反应,然后采用shutdown abort后重启,库直接起不来了。报了ora错误,然后库就起不来了。

查看日志显示,和之前碰到的归档空间不足导致的问题一致。清除有问题的归档之后,重启库就起来了。可以参见日志:http://blog.itpub.net/23718752/viewspace-1167163/

SMON: enabling cache recovery 
ARC3: Archival started 
ARC0: STARTING ARCH PROCESSES COMPLETE 
ARC1: Encountered disk I/O error 19502 
ARC1: Closing local archive destination LOG_ARCHIVE_DEST_1: '/u03/ora11g/product/11.2.0/dbhome_1/dbs/arch1_339_837590339.dbf' (error 19502) (TEST01) 
ARC1: I/O error 19502 archiving log 3 to '/u03/ora11g/product/11.2.0/dbhome_1/dbs/arch1_339_837590339.dbf' 
ARC2: Encountered disk I/O error 19502 
ARC2: Closing local archive destination LOG_ARCHIVE_DEST_1: '/u03/ora11g/product/11.2.0/dbhome_1/dbs/arch1_337_837590339.dbf' (error 19502) (TEST01) 
ARC2: I/O error 19502 archiving log 2 to '/u03/ora11g/product/11.2.0/dbhome_1/dbs/arch1_337_837590339.dbf' 
ARCH: Archival stopped, error occurred. Will continue retrying 
ORACLE Instance TEST01 - Archival Error 
ORA-16038: log 3 sequence# 339 cannot be archived 
ORA-19502: write error on file "", block number  (block size=) 
ORA-00312: online log 3 thread 1: '/u03/ora11g/oradata/TEST01/redo03.log' 
Sun Jun 08 00:37:11 2014 
[2867] Successfully onlined Undo Tablespace 2. 
Undo initialization finished serial:0 start:1051584 end:1055564 diff:3980 (39 seconds) 
Verifying file header compatibility for 11g tablespace encryption.. 
Verifying 11g file header compatibility for tablespace encryption completed 
SMON: enabling tx recovery 
Database Characterset is AL32UTF8 
ARC3: Encountered disk I/O error 19502 
ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1: '/u03/ora11g/product/11.2.0/dbhome_1/dbs/arch1_337_837590339.dbf' (error 19502) (TEST01) 
No Resource Manager plan active 
Sun Jun 08 00:37:13 2014 
Non critical error ORA-48181 caught while writing to trace file "/u03/ora11g/diag/rdbms/test01/TEST01/trace/TEST01_dbrm_2840.trc" 
Error message: Linux-x86_64 Error: 28: No space left on device 
Additional information: 1

然而这仅仅是个开始。我检查文件的使用情况,马上发现有一个目录下空间只剩下几百k了,排查空间的使用情况,最后定为是Undo的自增长造成的,本来500M左右的Undo现在涨到了900多M。

因为库是刚起来的,也没什么其他的操作,于是就做了Undo文件的resize,结果让我大跌眼镜。

SQL> alter database datafile '/u03/ora11g/oradata/TEST01/undotbs01.dbf' resize 500M; 
alter database datafile '/u03/ora11g/oradata/TEST01/undotbs01.dbf' resize 500M 
* 
ERROR at line 1: 
ORA-03297: file contains used data beyond requested RESIZE value

resize不行,再也没有其他的多余空间,而且目前遇到的情况更奇怪,归档生成的极为频繁。本来空间就紧张的虚拟机几乎不能做任何操作。

我最后尝试更改归档路径,把归档指到空间稍大的一个分区。然后再查查看到底是什么在后台消耗了这么多的资源,这个库自启动以来没做任何其他的操作。

先更改了归档路径,然后shutdown immediate还是没反应,尝试shutdown abort重启。这次重启没有其他的问题。库起来了,

但是在短时间内生成了相当多的归档文件。有很多已经被自己手工删除了。

rw-r----- 1 ora11g dba 99933184 Jun  8 01:04 1_358_837590339.dbf 
-rw-r----- 1 ora11g dba 88097280 Jun  8 01:05 1_359_837590339.dbf 
-rw-r----- 1 ora11g dba 99902976 Jun  8 01:06 1_360_837590339.dbf 
-rw-r----- 1 ora11g dba 93864960 Jun  8 01:07 1_361_837590339.dbf 
-rw-r----- 1 ora11g dba 97522176 Jun  8 01:08 1_362_837590339.dbf 
-rw-r----- 1 ora11g dba 95759360 Jun  8 01:09 1_363_837590339.dbf 
-rw-r----- 1 ora11g dba 92051968 Jun  8 01:10 1_364_837590339.dbf 
[ora11g@rac1 arch]$

查看系统级的进程。里面有好几个并行相关的进程,目前没有其他的操作,是在做后台的回滚吗?

我记得重启之前做数据加载测试的时候是用了并行。

top - 01:10:00 up  1:06,  3 users,  load average: 3.52, 3.27, 2.99 
Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie 
Cpu(s): 10.8%us,  4.6%sy,  0.0%ni, 15.1%id, 68.7%wa,  0.2%hi,  0.7%si,  0.0%st 
Mem:   2030124k total,  1965484k used,    64640k free,     2408k buffers 
Swap:  4063224k total,        0k used,  4063224k free,  1578300k cached
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                   
3176 ora11g    20   0  530m  76m  74m D 10.9  3.9   2:16.24 ora_p003_TEST01                                                                                           
3158 ora11g    20   0  531m  79m  77m D  9.6  4.0   2:11.00 ora_p001_TEST01                                                                                           
3174 ora11g    20   0  530m  73m  71m S  5.3  3.7   2:13.23 ora_p002_TEST01                                                                                           
3128 ora11g    20   0  549m  91m  71m D  3.6  4.6   0:50.43 ora_dbw0_TEST01 
3936 ora11g    20   0 14940 1248  904 R  1.0  0.1   0:00.15 top -c                                                                                                    
3130 ora11g    20   0  545m  22m  20m D  0.7  1.2   0:09.51 ora_lgwr_TEST01                                                                                           
2655 grid      20   0 97440 1556  744 S  0.3  0.1   0:00.45 sshd: grid@pts/0                                                                                          
3114 ora11g    -2   0  529m  15m  13m S  0.3  0.8   0:01.41 ora_vktm_TEST01                                                                                           
3140 ora11g    20   0  530m  31m  28m S  0.3  1.6   0:00.89 ora_mmnl_TEST01  

查看当前是否有session在操作

SELECT s.USERNAME,s.SID,s.SERIAL#,t.UBAFIL "UBA filenum",  t.UBABLK 
  "UBA Block number",t.USED_UBLK "Number os undo Blocks Used", 
  t.START_TIME,t.STATUS,t.START_SCNB,t.XIDUSN RollID,r.NAME RollName 
   FROM v$session s,v$transaction t,v$rollname r 
   WHERE s.SADDR=t.SES_ADDR  AND t.XIDUSN=r.usn;
no rows selected

查看undo的使用情况

SELECT 
seg.tablespace_name "Tablespace Name", 
ts.bytes/1024/1024 "TS Size(MB)",   
ue.status "UNDO Status", 
count(*)  "Used Extents",   
round(sum(ue.bytes)/1024/1024, 2) "Used Size(MB)",   
round(sum(ue.bytes)/ts.bytes*100, 2) "Used Rate(%)" 
FROM dba_segments seg, DBA_UNDO_EXTENTS ue,   
(SELECT tablespace_name, sum(bytes) bytes     
   FROM dba_data_files GROUP BY tablespace_name) ts 
    WHERE ue.segment_NAME=seg.segment_NAME and seg.tablespace_name=ts.tablespace_name 
    GROUP BY seg.tablespace_name, ts.bytes, ue.status ORDER BY seg.tablespace_name;
Tablespace Name                TS Size(MB) UNDO Stat Used Extents Used Size(MB) Used Rate(%) 
------------------------------ ----------- --------- ------------ ------------- ------------ 
UNDOTBS                            933.125 ACTIVE             283        929.88        99.65 
UNDOTBS                            933.125 EXPIRED             12           .75          .08 
UNDOTBS                            933.125 UNEXPIRED            9           1.5          .16

使用ash来查看一些到底在那几分钟里发生了什么。

Sid,Srl# (Inst) % Activity SQL ID        Event                          % Event 
--------------- ---------- ------------- ----------------------------- -------- 
User                 Program 
-------------------- ------------------------------ 
 243,    1(1)      75.35               wait for a undo record           21.62 
SYS                  oracle@rac1 (SMON)
                                        db file sequential read          18.32 
SYS                  oracle@rac1 (SMON)
                                        wait for stopper event to be     16.08 
SYS                  oracle@rac1 (SMON)
         -------------------------------------------------------------
Top DB Objects                 DB/Inst: TEST01/TEST01  (Jun 08 01:04 to 01:09) 
-> With respect to Application, Cluster, User I/O and buffer busy waits only.
     Object ID % Activity Event                             % Event 
--------------- ---------- ------------------------------ ---------- 
Object Name (Type)                                    Tablespace 
----------------------------------------------------- ------------------------- 
          15390       7.00 db file sequential read              6.83 
N1.T (TABLE)                                          POOL_DATA
          15391       3.92 db file sequential read              2.86 
N1.SYS_C005621 (INDEX)                                POOL_DATA
                          buffer busy waits                    1.06
         -------------------------------------------------------------
Top DB Files                   DB/Inst: TEST01/TEST01  (Jun 08 01:04 to 01:09) 
-> With respect to Cluster and User I/O events only.
       File ID % Activity Event                             % Event 
--------------- ---------- ------------------------------ ---------- 
File Name                                             Tablespace 
----------------------------------------------------- ------------------------- 
              3       9.30 db file sequential read              8.63 
/u03/ora11g/oradata/TEST01/undotbs01.dbf              UNDOTBS
            10       2.13 db file sequential read              2.02 
/u01/ora11g/pool_data06.dbf                           POOL_DATA
             8       1.46 db file sequential read              1.46 
/u03/ora11g/oradata/TEST01/pool_data04.dbf            POOL_DATA
            11       1.40 db file sequential read              1.40 
/u01/ora11g/pool_data07.dbf                           POOL_DATA
             7       1.34 db file sequential read              1.34 
/u03/ora11g/oradata/TEST01/pool_data02.dbf            POOL_DATA
         -------------------------------------------------------------
Top Latches                    DB/Inst: TEST01/TEST01  (Jun 08 01:04 to 01:09)
                 No data exists for this section of the report. 
          -------------------------------------------------------------
Activity Over Time             DB/Inst: TEST01/TEST01  (Jun 08 01:04 to 01:09) 
-> Analysis period is divided into smaller time slots 
-> Top 3 events are reported in each of those slots 
-> 'Slot Count' shows the number of ASH samples in that slot 
-> 'Event Count' shows the number of ASH samples waiting for 
   that event in that slot 
-> '% Event' is 'Event Count' over all ASH samples in the analysis period
                        Slot                                   Event 
Slot Time (Duration)    Count Event                             Count % Event 
-------------------- -------- ------------------------------ -------- ------- 
01:04:00    (0 secs)        7 write complete waits                  3    0.17 
                              db file async I/O submit              1    0.06 
                              free buffer waits                     1    0.06 
01:04:00   (1.0 min)      343 wait for a undo record               81    4.54 
                              db file sequential read              67    3.75 
                              wait for stopper event to be i       59    3.31 
01:05:00   (1.0 min)      330 wait for a undo record               69    3.87 
                              db file async I/O submit             58    3.25 
                              wait for stopper event to be i       55    3.08 
01:06:00   (1.0 min)      363 wait for a undo record               97    5.43 
                              db file async I/O submit             59    3.31 
                              wait for stopper event to be i       52    2.91 
01:07:00   (1.0 min)      377 db file sequential read              87    4.87 
                              wait for a undo record               77    4.31 
                              wait for stopper event to be i       61    3.42 
01:08:00   (1.0 min)      365 db file sequential read              80    4.48 
                              wait for a undo record               61    3.42 
                              db file async I/O submit             59    3.31 
          -------------------------------------------------------------

这样看果然一目了然,全部的问题都指向了一个表t和对应的索引。

这个表是做数据加载测试使用的表,加载的数据量有千万。会产生很多的redo。

查看n1.t这个表的情况,看表里面,目前是没有数据,但是查询会持续相当长的时间。简单验证一下。

SQL> select count(*)from t where rownum<2;
 COUNT(*) 
---------- 
         0

然后查看表t的统计信息。还是现实千万的数据条数。

******************************************* 
OWNER                          TABLE_NAME 
------------------------------ ------------------------------ 
N1                             T    
******************************************* 
********** TABLE GENERAL INFO *****************
TABLE_NAME                     PAR TABLESPACE STATUS   NUM_ROWS     BLOCKS EMPTY_BLOCKS LOG MON ROW_MOVE LAST_ANAL 
------------------------------ --- ---------- ------ ---------- ---------- ------------ --- --- -------- --------- 
T                              NO  POOL_DATA  VALID    13240320      74364            0 NO  YES DISABLED 07-JUN-14
********** TABLE STORAGE INFO *****************
 INITEXT    NXTEXT     MINEXT      MAXEXT  FREELISTS  AVG_SPACE  CHAIN_CNT AVG_ROW_LEN CACHE                T DEPENDEN COMPRES 
--------- --------- ---------- ----------- ---------- ---------- ---------- ----------- -------------------- - -------- ---------- 
    65536   1048576          1  2147483645                     0          0          36     N                N DISABLED DISABLED

在我查完问题之后,问题也好像自动解决了,归档也不在生成了。查看alert文件,确实是在做事务的并行回滚,不过刚刚做完。

Thread 1 advanced to log sequence 360 (LGWR switch) 
  Current log# 3 seq# 360 mem# 0: /u03/ora11g/oradata/TEST01/redo03.log 
Sun Jun 08 01:05:28 2014 
Archived Log entry 304 added for thread 1 sequence 359 ID 0xd9428f03 dest 1: 
Sun Jun 08 01:06:09 2014 
Thread 1 cannot allocate new log, sequence 361 
Checkpoint not complete 
  Current log# 3 seq# 360 mem# 0: /u03/ora11g/oradata/TEST01/redo03.log 
Sun Jun 08 01:06:22 2014 
Thread 1 advanced to log sequence 361 (LGWR switch) 
  Current log# 1 seq# 361 mem# 0: /u03/ora11g/oradata/TEST01/redo01.log 
Sun Jun 08 01:06:26 2014 
Archived Log entry 305 added for thread 1 sequence 360 ID 0xd9428f03 dest 1: 
Sun Jun 08 01:07:13 2014 
Thread 1 advanced to log sequence 362 (LGWR switch) 
  Current log# 2 seq# 362 mem# 0: /u03/ora11g/oradata/TEST01/redo02.log 
Sun Jun 08 01:07:16 2014 
Archived Log entry 306 added for thread 1 sequence 361 ID 0xd9428f03 dest 1: 
Sun Jun 08 01:08:17 2014 
Thread 1 cannot allocate new log, sequence 363 
Checkpoint not complete 
  Current log# 2 seq# 362 mem# 0: /u03/ora11g/oradata/TEST01/redo02.log 
Sun Jun 08 01:08:30 2014 
Thread 1 advanced to log sequence 363 (LGWR switch) 
  Current log# 3 seq# 363 mem# 0: /u03/ora11g/oradata/TEST01/redo03.log 
Sun Jun 08 01:08:33 2014 
Archived Log entry 307 added for thread 1 sequence 362 ID 0xd9428f03 dest 1: 
Sun Jun 08 01:09:28 2014 
Thread 1 advanced to log sequence 364 (LGWR switch) 
  Current log# 1 seq# 364 mem# 0: /u03/ora11g/oradata/TEST01/redo01.log 
Sun Jun 08 01:09:30 2014 
Archived Log entry 308 added for thread 1 sequence 363 ID 0xd9428f03 dest 1: 
Sun Jun 08 01:10:20 2014 
Thread 1 advanced to log sequence 365 (LGWR switch) 
  Current log# 2 seq# 365 mem# 0: /u03/ora11g/oradata/TEST01/redo02.log 
Sun Jun 08 01:10:24 2014 
Archived Log entry 309 added for thread 1 sequence 364 ID 0xd9428f03 dest 1: 
Sun Jun 08 01:10:27 2014 
SMON: Parallel transaction recovery tried

我现在要做的就是把表t的高水位线放下来。

SQL> truncate table t; 
Table truncated.
SQL> set timing on 
select count(*)from t 
SQL> /
 COUNT(*) 
---------- 
         0
Elapsed: 00:00:00.00

原文发布于微信公众号 - 杨建荣的学习笔记(jianrong-notes)

原文发表时间:2014-06-09

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏数据和云

追本溯源:Oracle 只读表空间的探索实践

作者简介 ? 胡中豪 云和恩墨西区交付工程师,多年一线 DBA 经验,曾服务于运营商、电网、政府行业、银行等行业客户;擅长数据库故障处理、性能优化、实施升级 本...

2583
来自专栏乐沙弥的世界

Oracle 基于 RMAN 的不完全恢复(incomplete recovery by RMAN)

      Oracle 数据库可以实现数据库不完全恢复与完全恢复。完全恢复是将数据库恢复到最新时刻,也就是无损恢复,保证数据库无丢失的恢复。而不完全恢复则是根...

431
来自专栏杨建荣的学习笔记

关于dblink锁定带来的问题(r3笔记第20天)

可能在一些分布式环境中,有一些数据访问都需要用到db link。从某种程度上来说dblink是很方便,但是从性能上来说还是有一些的隐患。如果两个环境之间的网络情...

2635
来自专栏蓝天

零停重启程序工具Huptime研究

零停重启目标程序,比如一个网络服务程序,不用丢失和中断任何消息实现重新启动,正在处理的消息也不会中断和丢失,重启的方法是给目标程序的进程发SIGHUP信号。...

561
来自专栏乐沙弥的世界

基于 RMAN 的同机数据库克隆

Oracle数据库克隆,也叫着Oracle数据库复制,可以通过基于用户管理的方式来完成,也可以基于RMAN方式来实现。而且Oracle建议使用RMAN方式来实现...

491
来自专栏杨建荣的学习笔记

临时表相关 (r4笔记第52天)

临时表在日常工作中可能使用比较多,但是大家都对临时表相关的一些知识了解比较少。我们来简单说数理一下。 首先是临时表空间,临时表都存储在临时表空间中,对于临时表...

2535
来自专栏乐沙弥的世界

Oracle 审计失败的用户登陆(Oracle audit)

       对于在线交易系统,且Oracle用户在使用缺省的profile的情形下,多用户共享相同的数据库用户及密码,任意用户输入错误密码累计达到10次以上,...

623
来自专栏杨建荣的学习笔记

备库归档删除策略失效的问题分析 (r7笔记第6天)

最近碰到了一个有些奇怪的问题,自己当时排查问题时间紧,没有细细琢磨,今天抽空看了下,终于发现了端倪。 首先是在早晨收到了报警邮件,报警邮件内容如下: ZABBI...

3468
来自专栏杨建荣的学习笔记

关于db_files和maxdatafiles的问题(r4笔记第31天)

昨天在做生产监控的时候发现有个库的表空间不够了,就发邮件给客户的dba去处理,但是得到的反馈是尝试添加的时候发现已经超过了数据文件的最大数限制。这个错误毫无疑问...

3016
来自专栏杨建荣的学习笔记

一次DB time抖动发现的expdp的bug(r6笔记第86天)

最近收到一封报警邮件,提示还是DB time突然提高,时间发生在早晨的时候,想必大过节的也不会有人这么卖力工作把数据库负载弄上去。 ############ D...

2754

扫描关注云+社区