前几天在做巡检的时候发现有个库的负载在某一个时间段内极高,高达100倍。一个10分钟的awr报告,得到的db time 却有1000分钟。
Snap Id | Snap Time | Sessions | Cursors/Session | |
---|---|---|---|---|
Begin Snap: | 27745 | 13-Jan-15 06:00:02 | 3500 | 6.4 |
End Snap: | 27746 | 13-Jan-15 06:10:03 | 3512 | 6.4 |
Elapsed: | 10.01 (mins) | |||
DB Time: | 1,010.21 (mins) |
发生的时间段是早上6点左右。想必不大可能是人为操作的原因。 但是查看time model的时候却让人感觉很奇怪。 Time Model Statistics
Statistic Name | Time (s) | % of DB Time |
---|---|---|
sql execute elapsed time | 7,303.18 | 12.05 |
DB CPU | 1,233.88 | 2.04 |
PL/SQL execution elapsed time | 111.37 | 0.18 |
parse time elapsed | 14.47 | 0.02 |
inbound PL/SQL rpc elapsed time | 11.08 | 0.02 |
RMAN cpu time (backup/restore) | 5.57 | 0.01 |
hard parse elapsed time | 4.08 | 0.01 |
connection management call elapsed time | 2.41 | 0.00 |
sequence load elapsed time | 2.03 | 0.00 |
repeated bind elapsed time | 0.03 | 0.00 |
failed parse elapsed time | 0.02 | 0.00 |
hard parse (sharing criteria) elapsed time | 0.01 | 0.00 |
hard parse (bind mismatch) elapsed time | 0.01 | 0.00 |
PL/SQL compilation elapsed time | 0.00 | 0.00 |
DB time | 60,612.63 | |
background elapsed time | 1,770.76 | |
background cpu time | 40.97 |
OS级别的统计信息如下。
Statistic | Value | End Value |
---|---|---|
BUSY_TIME | 197,613 | |
IDLE_TIME | 4,611,782 | |
IOWAIT_TIME | 10,540 | |
NICE_TIME | 0 | |
SYS_TIME | 34,788 | |
USER_TIME | 150,433 | |
LOAD | 10 | 1 |
Host CPU (CPUs: 80 Cores: 40 Sockets: 4)
Load Average Begin | Load Average End | %User | %System | %WIO | %Idle |
---|---|---|---|---|---|
9.92 | 10.06 | 3.1 | 0.7 | 0.2 | 95.9 |
Instance CPU
%Total CPU | %Busy CPU | %DB time waiting for CPU (Resource Manager) |
---|---|---|
2.7 | 64.5 | 0.0 |
OS Busy% = 197613/(197613+4611782) = 25% Inst CPU% (using DB CPU) = 1234*100/ (197613+4611782)=2.57% 查看等待事件的时候发现 Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
db file sequential read | 1,286,918 | 4,395 | 3 | 7.25 | User I/O |
DB CPU | 1,234 | 2.04 | |||
direct path read | 68,781 | 870 | 13 | 1.43 | User I/O |
SQL*Net more data from dblink | 2,860 | 438 | 153 | 0.72 | Network |
db file parallel read | 15,649 | 315 | 20 | 0.52 | User I/O |
Inst CPU% (using db file sequential read) = 4395*1000/ (197613+4611782) = 9% 可以很明显的看到top 5的等待事件只占用了将近10%左右的DB time,剩下的90%左右的到哪去了呢? DB CPU+Foreground events=1234+4395+870+438+315=7252 而DB time=1010.21*60=60612, 差别实在是太大了。 对于这个问题,需要首先关注的就是sql部分,DB time是会把程序等待的时间计算进去的,但是DB cpu不会。在短短的这十分钟里看来还是有不少的文章的。 查看sql的统计情况时,可以明显看到executions为0的sql语句。但是占用的总体比例看起来很低,是因为这部分的时间没有算到DB cpu里面去。
Elapsed Time (s) | Executions | Elapsed Time per Exec (s) | %Total | SQL Id | SQL Module | SQL Text | ||
---|---|---|---|---|---|---|---|---|
602.35 | 0 | 0.99 | 9yn2xdw7uy0bq | JDBC Thin Client | select /*+ index(s SERVICE_AGR... | |||
528.06 | 0 | 0.87 | 6kaqc0d7sgvx8 | SQL Developer | select bc.charge_seq_no, bc.se... | |||
460.43 | 1 | 460.43 | 0.76 | 280b97jfdzvcm | DECLARE job BINARY_INTEGER := ... | |||
460.36 | 1 | 460.36 | 0.76 | f4s5k0j1whjwx | create table RATED_EVENT NOLOG... | |||
458.74 | 1 | 458.74 | 0.76 | d6078fp7h1jb6 | JDBC Thin Client | SELECT CD.TARGET_PCN, SE.AGREE... | ||
446.97 | 1 | 446.97 | 0.74 | cm7r7w9vvbcbd | PL/SQL Developer | declare cnt number; begin cnt ... | ||
446.97 | 1 | 446.97 | 0.74 | fayqg3gtg6ygs | PL/SQL Developer | declare cnt number; begin cnt ... | ||
446.97 | 1 | 446.97 | 0.74 | 3gbq16harc68h | PL/SQL Developer | declare cnt number; begin cnt ... | ||
446.94 | 1 | 446.94 | 0.74 | asc03yk438k67 | PL/SQL Developer | declare cnt number; begin cnt ... | ||
378.29 | 8 | 47.29 | 0.62 | 00jrngwmz2vxx | JDBC Thin Client | /* AR9_AR9GetPaymentSMS_select... |
继续查看其它的部分,就会发现executions为0的sql远不止那两个。
Buffer Gets | Executions | %Total | Elapsed Time (s) | SQL Id | SQL Module | SQL Text | |
---|---|---|---|---|---|---|---|
20,877,607 | 304 | 15.86 | 146.80 | aty7a3bvqfxxx | JDBC Thin Client | SELECT COUNT(1) FROM (/* null_... | |
15,363,930 | 5 | 11.67 | 10.84 | 1hg2wcuapy3y3 | JDBC Thin Client | select bl1_run_request.run_mod... | |
7,445,354 | 8 | 5.66 | 378.29 | 00jrngwmz2vxx | JDBC Thin Client | /* AR9_AR9GetPaymentSMS_select... | |
6,700,190 | 53 | 5.09 | 48.73 | 4pbpdtw14s7bg | JDBC Thin Client | /* null_CL9CommActivity_update... | |
6,491,924 | 1 | 4.93 | 446.97 | fayqg3gtg6ygs | PL/SQL Developer | declare cnt number; begin cnt ... | |
5,813,177 | 1 | 4.42 | 446.94 | asc03yk438k67 | PL/SQL Developer | declare cnt number; begin cnt ... | |
5,350,250 | 1 | 4.06 | 446.97 | cm7r7w9vvbcbd | PL/SQL Developer | declare cnt number; begin cnt ... | |
5,305,308 | 1 | 4.03 | 446.97 | 3gbq16harc68h | PL/SQL Developer | declare cnt number; begin cnt ... | |
5,029,500 | 1 | 3.82 | 458.74 | d6078fp7h1jb6 | JDBC Thin Client | SELECT CD.TARGET_PCN, SE.AGREE... | |
4,440,670 | 1 | 3.37 | 302.84 | 7s8bcw9axdjhy | JDBC Thin Client | SELECT CD.TARGET_PCN, SE.AGREE... | |
4,424,502 | 35 | 3.36 | 31.27 | 2uqmgpqwnbu8y | JDBC Thin Client | /* null_CL9CommActivity_select... | |
4,179,264 | 5,709 | 3.17 | 17.78 | g1p5bjdgv0wkq | JDBC Thin Client | /* AR9_SearchInvoices_searchIn... | |
4,162,849 | 5,631 | 3.16 | 46.95 | axjaa2n7rx9b0 | JDBC Thin Client | SELECT COUNT(1) FROM (/* AR9_S... | |
4,101,017 | 280,418 | 3.12 | 358.90 | 182gmhv5f5zqv | PL/SQL Developer | UPDATE TABLE_BPM_STEP_INST SET... | |
3,916,711 | 272,996 | 2.98 | 347.54 | bjqz15a4rqcf4 | PL/SQL Developer | UPDATE TABLE_BPM_STEP_INST SET... | |
3,806,950 | 268,600 | 2.89 | 334.82 | ad50sx1wjnsxv | PL/SQL Developer | UPDATE TABLE_BPM_STEP_INST SET... | |
3,806,852 | 268,649 | 2.89 | 334.17 | 7499mkmfmu1fr | PL/SQL Developer | UPDATE TABLE_BPM_STEP_INST SET... | |
3,792,292 | 31 | 2.88 | 177.30 | 98dyqu3xsgyf5 | JDBC Thin Client | /* null_CL9CrdMntrExtract_sele... | |
2,563,257 | 33 | 1.95 | 65.68 | 67mpfp9dfdcvn | JDBC Thin Client | SELECT /*+ ordered use_nl (bl1... | |
2,552,705 | 33 | 1.94 | 4.19 | a331y50gxz8gb | JDBC Thin Client | SELECT COUNT(*) FROM (SELECT /... | |
2,388,043 | 0 | 1.81 | 81.54 | dyn656kdc8w98 | PL/SQL Developer | SELECT /*+ PARALELL(cl1_coll_e... | |
2,159,680 | 389 | 1.64 | 7.30 | cysbbg2h86xc6 | JDBC Thin Client | /* MSD_SubLogSelect_selectNonQ... | |
1,967,921 | 32,264 | 1.49 | 7.24 | bwjxuug65z19n | JDBC Thin Client | /* BL_LogicalDate_selectOnLine... | |
1,893,601 | 0 | 1.44 | 93.11 | 0afh4ddt9bxu7 | PL/SQL Developer | SELECT /*+ PARALELL(cl1_coll_e... | |
1,777,723 | 0 | 1.35 | 528.06 | 6kaqc0d7sgvx8 | SQL Developer | select bc.charge_seq_no, bc.se... | |
1,540,694 | 0 | 1.17 | 106.64 | 5f43xqgtmz3f2 | PL/SQL Developer | SELECT /*+ PARALELL(cl1_coll_e... | |
1,495,550 | 0 | 1.14 | 105.95 | 5z436m8azttg1 | PL/SQL Developer | SELECT /*+ PARALELL(cl1_coll_e... | |
1,319,463 | 30 | 1.00 | 5.39 | gy7n3dzhh90y4 | Ac1FtcManager@ccbstbpr1 (TNS V1-V3) | /* */ UPDATE /*+ INDEX (AC1_CO... |
同时这部分的信息都指向了一个program,都是通过客户端pl/sql Deverloper或者SQL Developer连接的。 查看相应的sql都使用了并行,并行度都在4以上,这样问题似乎就清晰很多了。这部分executions为0的语句通过客户端程序连接,手工启用了并行,这些时间都没有计算进DB CPU中。 稍后和现场,开发部门的同事进行了沟通和确认,他们说在昨天有一个严重的问题,最后由开发提供了脚本进行了修复。 因为持续的时间较短,只有10分钟,所以对系统的负载没有造成严重的影响,最后就在内部做一个总结和分析。