
A Case of Excessive Memory Usage on a Production Database System

Author: 数据和云
Published: 2020-07-29 14:40:44
This article is part of the column 数据和云.

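Editor's note (墨墨导读): The monitoring platform showed that memory usage on node 2 of a database platform had reached 98%. The system returned to normal operation after we identified the processes consuming the most memory, checked the TFA status, and synchronized the TFA configuration.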

Overview

According to the monitoring platform, memory usage on node 2 of a database platform was too high, reaching 98%.

1. Identify the processes using the most memory

grid     280483 183124  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280493 155171  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280497 104733  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280499 187375  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280533 239249  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280534 157752  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280536 281960  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280545  69656  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280552 128541  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280553  63409  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280558 108705  0 18:21 ?        00:00:00 [asmcmd daemon]
grid     280575 194378  0 18:21 ?        00:00:00 [asmcmd daemon]
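
The listing above is in `ps -ef` format, which has no memory column. A minimal Python sketch of one way to rank commands by resident memory follows. It assumes a Linux `ps` that accepts `-eo pid,rss,args`, and the function name `rss_by_command` is hypothetical, not something taken from the original investigation.

```python
from collections import defaultdict


def rss_by_command(ps_text):
    """Sum resident memory (KiB) per command line, largest first.

    Expects text shaped like the output of `ps -eo pid,rss,args`:
    a header line, then one "PID RSS COMMAND..." line per process.
    """
    totals = defaultdict(int)
    for line in ps_text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # ignore malformed lines
        totals[parts[2].strip()] += int(parts[1])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Feeding it the captured output of `ps -eo pid,rss,args` groups many small identical processes (such as hundreds of `[asmcmd daemon]` entries) into a single total, which makes this kind of accumulation easier to spot than scanning individual rows.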

The processes using the most memory are the asmcmd daemon processes. What are they doing that consumes so much memory?

Next, we trace one of these processes.

wait4(163639, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 163639
open("/tmp/clsecho_stderr_file.txt", O_RDONLY) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc8333d180) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR)                   = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=285, ...}) = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
read(4, "Can't open '/oracle/app/12.2.0/g"..., 8192) = 285
stat("/oracle/app/12.2.0/grid/bin/clsecho", {st_mode=S_IFREG|0755, st_size=11405, ...}) = 0
geteuid()                               = 1001
geteuid()                               = 1001
getegid()                               = 501
lseek(4, 99, SEEK_SET)                  = 99
lseek(4, 0, SEEK_CUR)                   = 99
pipe([6, 7])                            = 0
pipe([8, 9])                            = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f059005ea10) = 164739
close(9)                                = 0
close(7)                                = 0
read(8, "", 4)                          = 0
close(8)                                = 0
ioctl(6, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc8333d0f0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(6, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
brk(0)                                  = 0x81493d000
brk(0x81495e000)                        = 0x81495e000
read(6, "20-Jul-20 18:14 ASMCMD Backgroun"..., 8192) = 102
read(6, "", 8192)                       = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=164739, si_status=0, si_utime=0, si_stime=195} ---
fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
close(6)                                = 0
brk(0)                                  = 0x81495e000
brk(0)                                  = 0x81495e000
brk(0x81495c000)                        = 0x81495c000
brk(0)                                  = 0x81495c000
wait4(164739, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 164739
close(4)                                = 0
open("/tmp/clsecho_stderr_file.txt", O_RDONLY) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc8333d180) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR)                   = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=285, ...}) = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
read(4, "Can't open '/oracle/app/12.2.0/g"..., 8192) = 285
stat("/oracle/app/12.2.0/grid/bin/clsecho", {st_mode=S_IFREG|0755, st_size=11405, ...}) = 0
geteuid()                               = 1001
geteuid()                               = 1001
getegid()                               = 501
lseek(4, 99, SEEK_SET)                  = 99
lseek(4, 0, SEEK_CUR)                   = 99
pipe([6, 7])                            = 0
pipe([8, 9])                            = 0
clone(^CProcess 219522 detached
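
A trace like the one above is easier to read once the failing calls are counted. The following Python sketch does that for saved strace output. The regular expression and the name `failed_syscalls` are assumptions made for illustration, and the pattern only covers the plain `call(args) = -1 ERRNO (message)` line shape seen in this trace.

```python
import re
from collections import Counter

# Matches lines such as: lseek(6, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
FAILED_CALL = re.compile(r"^(\w+)\(.*\)\s+= -1 ([A-Z0-9]+) ")


def failed_syscalls(strace_text):
    """Count failed system calls in strace output, keyed by (syscall, errno)."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

The counts are only a starting point. Some failures, such as `ENOTTY` from a terminal check on a regular file or a pipe, are normally harmless, so the repeated pattern of calls matters more than any single error.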

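Tracing these processes with strace revealed a large number of futile calls: the processes could not connect to the ASM instance, and the socket files they expected did not exist.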

"ASMCMD Background (PID = 118768):  Invalid file handle for pipe /tmp/asmcmd_fg_118436" 2> /tmp/clsecho_stderr_file.txt

Further analysis of /tmp/clsecho_stderr_file.txt showed that, as CPU usage increased, these processes were steadily taking more swap space from the system.

[root@ tmp]# more clsecho_stderr_file.txt
Can't open '/oracle/app/12.2.0/grid/log/diag/asmcmd/user_grid/xxssd2/alert/alert.log' for append
CLSU-00100: operating system function: open failed failed with error data: 2
CLSU-00101: operating system error message: No such file or directory
CLSU-00103: error location: SlfFopen1

Tellingly, this directory is a TFA diagnostic directory, which suggests that TFA currently has a problem.
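
The check behind this conclusion can be sketched in Python: extract each path from the "Can't open ... for append" messages and test whether its parent directory exists. The function name `missing_log_dirs` is hypothetical, and the pattern assumes the message format shown in the file above.

```python
import os
import re

# Matches: Can't open '/some/dir/alert.log' for append
CANT_OPEN = re.compile(r"Can't open '([^']+)' for append")


def missing_log_dirs(stderr_text):
    """Return parent directories of unopenable log files that do not exist."""
    missing = []
    for path in CANT_OPEN.findall(stderr_text):
        parent = os.path.dirname(path)
        if not os.path.isdir(parent) and parent not in missing:
            missing.append(parent)
    return missing
```

Run on the affected node against the contents of /tmp/clsecho_stderr_file.txt, a non-empty result would be consistent with the "No such file or directory" error reported by CLSU-00101.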

2. Check the TFA status

[grid@~]$ tfactl
TFA-00104 Cannot establish connection with TFA Server. Please check TFA Certificates

Sure enough, node 2 has a problem: it cannot connect to the TFA service. What about node 1, which did not show high memory usage at the time?

TFA on node 1:

.------------------------------------------------------------------------------------.
|                                      xxssd1                                      |
+-----------------------------------------------------------------------+------------+
| Configuration Parameter                                               | Value      |
+-----------------------------------------------------------------------+------------+
| TFA Version                                                           | 19.2.1.0.0 |
| Java Version                                                          | 1.8        |
| Public IP Network                                                     | true       |
| Automatic Diagnostic Collection                                       | true       |
| Alert Log Scan                                                        | true       |
| Disk Usage Monitor                                                    | true       |
| Managelogs Auto Purge                                                 | false      |
| Trimming of files during diagcollection                               | true       |
| Inventory Trace level                                                 | 1          |
| Collection Trace level                                                | 1          |
| Scan Trace level                                                      | 1          |
| Other Trace level                                                     | 1          |
| Granular Tracing                                                      | false      |
| Debug Mask (Hex)                                                      | 0          |
| Repository current size (MB)                                          | 6908       |
| Repository maximum size (MB)                                          | 10240      |
| Max Size of TFA Log (MB)                                              | 50         |
| Max Number of TFA Logs                                                | 10         |
| Max Size of Core File (MB)                                            | 50         |
| Max Collection Size of Core Files (MB)                                | 500        |
| Max File Collection Size (MB)                                         | 5120       |
| Minimum Free Space to enable Alert Log Scan (MB)                      | 500        |
| Time interval between consecutive Disk Usage Snapshot(minutes)        | 60         |
| Time interval between consecutive Managelogs Auto Purge(minutes)      | 60         |
| Logs older than the time period will be auto purged(days[d]|hours[h]) | 30d        |
| Automatic Purging                                                     | true       |
| Age of Purging Collections (Hours)                                    | 12         |
| TFA IPS Pool Size                                                     | 5          |
| TFA ISA Purge Age (seconds)                                           | 604800     |
| TFA ISA Purge Mode                                                    | profile    |
| TFA ISA Purge Thread Delay (minutes)                                  | 60         |
| Setting for ACR redaction (none|SANITIZE|MASK)                        | none       |
| Email Notification will be sent for CHA EVENTS if address is set      | false      |
| AUTO Collection will be generated for CHA EVENTS                      | false      |

tfactl> status

.-----------------------------------------------------------------------------------------------.
| Host     | Status of TFA | PID  | Port | Version    | Build ID             | Inventory Status |
+----------+---------------+------+------+------------+----------------------+------------------+
| xxssd1   | RUNNING       | 8075 | 5000 | 19.2.1.0.0 | 19210020190425110550 | COMPLETE         |
| xxssd2   | NOT RUNNING   | -    |      |            |                      |                  |
'----------+---------------+------+------+------------+----------------------+------------------'

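TFA is running normally on node 1 but is not running on node 2. Repeated attempts to start it manually had no effect and produced the following error: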

Unable to determine the status of TFA in other nodes.

This indicates that TFA communication between the nodes had broken down.
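
For routine monitoring, the status table shown earlier can be checked by a script instead of by eye. The following Python sketch assumes the ASCII table layout from that output, with "|"-separated cells and a header row starting with "Host". The function name `tfa_nodes_not_running` is hypothetical.

```python
def tfa_nodes_not_running(status_table):
    """Return hosts whose "Status of TFA" column is not RUNNING."""
    down = []
    for line in status_table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines start with ".", "+" or "'"
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 2 or cells[0] == "Host":
            continue  # skip the header row
        if cells[1] != "RUNNING":
            down.append(cells[0])
    return down
```

Applied to the table above, this returns only `xxssd2`, the node reported as NOT RUNNING.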

3. Synchronize the TFA configuration

If TFA on another node has a problem, its configuration can be synchronized from a healthy node.

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.


Current Node List in TFA :
1. xxssd1
2. xxssd2

Node List in Cluster :
1. xxssd1
2. xxssd2

Node List to sync TFA Certificates :
     1  xxssd2

Do you want to update this node list? [Y|N] [N]:

Syncing TFA Certificates on xxssd2 :

TFA_HOME on xxssd2 : /oracle/app/12.2.0/grid/tfa/xxssd2/tfa_home

Please Enter the password for xxssd2 :

Is password same for all the nodes? [Y|N] [Y]: Y

Shutting down TFA on xxssd2...
Copying TFA Certificates to xxssd2...
Copying SSL Properties to xxssd2...
Shutting down TFA on xxssd2...
Sleeping for 5 seconds...
Starting TFA on xxssd2...

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.

.-------------------------------------------------------------------------------------------------.
| Host     | Status of TFA | PID    | Port | Version    | Build ID             | Inventory Status |
+----------+---------------+--------+------+------------+----------------------+------------------+
| xxssd1   | RUNNING       |   8075 | 5000 | 19.2.1.0.0 | 19210020190425110550 | COMPLETE         |
| xxssd2   | RUNNING       | 230525 | 5000 | 19.2.1.0.0 | 19210020190425110550 | COMPLETE         |
'----------+---------------+--------+------+------------+----------------------+------------------'

4. Follow-up

Once the TFA configuration had been synchronized, memory usage began to fall as memory was released.

[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         942          16           5          49          54
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         907          50           5          49          89
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         827         131           5          48         169
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         820         137           5          48         176
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         745         213           5          48         251
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         745         213           5          48         251
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         745         213           5          48         251
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         745         213           5          48         251
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         745         213           5          48         251
Swap:            31           0          31
[root@xxssd2 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:           1007         745         213           5          48         251
Swap:            31           0          31
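
To put the recovery in percentage terms, the "Mem:" lines above can be converted to used/total ratios. This is a small Python sketch; the function name `memory_used_percent` is hypothetical, and it assumes the column order printed by `free -g` above (total first, then used).

```python
def memory_used_percent(free_output):
    """Return used/total memory as a percentage for each "Mem:" line."""
    percents = []
    for line in free_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "Mem:":
            total, used = int(fields[1]), int(fields[2])
            percents.append(round(100 * used / total, 1))
    return percents
```

For the first and last samples above (942 GB and 745 GB used out of 1007 GB), this gives roughly 93.5% and 74.0%.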

5. Summary

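TFA (Trace File Analyzer Collector) is a tool introduced in version 11.2 for collecting diagnostic logs in Grid Infrastructure/RAC environments. With a few simple commands it helps users gather RAC logs for further diagnosis. TFA is an Oracle cluster log collector similar to diagcollection, but its centralized, automated collection of diagnostic information is more capable.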

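We recommend disabling TFA's automatic collection and analysis feature (Autodiagcollect) on all production databases, so that similar incidents do not disrupt normal database operation.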

.------------------------------------------------------------------------------------.
|                                      gatzyca1                                      |
+-----------------------------------------------------------------------+------------+
| Configuration Parameter                                               | Value      |
+-----------------------------------------------------------------------+------------+
| TFA Version                                                           | 19.2.1.0.0 |
| Java Version                                                          | 1.8        |
| Public IP Network                                                     | true       |
| Automatic Diagnostic Collection                                       | true       |

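Note: Disabling automatic collection and analysis does not affect normal database operation, nor does it affect TFA's log collection, consolidation, and packaging functions.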

Run as the root user:

tfactl set autodiagcollect=OFF
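
After changing the setting, it can be confirmed by reading the configuration table again. The following Python sketch parses a table in the layout shown earlier into a dictionary. The function name `parse_tfa_config` is hypothetical, and it is an assumption that the "Automatic Diagnostic Collection" row is the one that reflects `autodiagcollect`, as the table placement in this article suggests.

```python
def parse_tfa_config(config_table):
    """Map each "Configuration Parameter" to its "Value" from the ASCII table."""
    config = {}
    for line in config_table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines
        # Split on the last "|" only: some parameter names contain "|".
        cells = [cell.strip() for cell in line.strip("|").rsplit("|", 1)]
        if len(cells) != 2 or cells[0] == "Configuration Parameter":
            continue  # skip the host name row and the header row
        config[cells[0]] = cells[1]
    return config
```

With the table captured as text, `parse_tfa_config(text).get("Automatic Diagnostic Collection")` returns the current value, which is `true` in the output shown above.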
Originally published on 2020-07-27 on the 数据和云 WeChat official account.
