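Symptoms: Network Connections (ncpa.cpl) will not open, Task Manager (taskmgr) will not open, cmd or powershell hangs, eventvwr will not open, browsers hang, RDP sessions hang, ping by IP works but ping by domain name gets no response, the cloud monitoring agent reports errors and the basic monitoring graphs are lost, business services such as openvpn report errors, and the system services CryptSvc and NlaSvc report errors. (All of these symptoms disappear after a reboot.)
Blunt fix: reboot the machine, after which everything recovers.
Trigger conditions: a specific workload (such as openvpn) + an upgrade of Host Security from a lower 2.2 sub-version to a higher 2.2 sub-version + a specific configuration (such as a low-spec machine with little memory) + a specific OS (older Windows versions are more likely to be hit, e.g. 2008R2, 2012R2, 2016).
Investigation: for a machine that is still in the faulty state, first take a memory snapshot on the hypervisor host, then use elf2dmp to convert it into a .dmp file that WinDbg can analyze (the conversion needs public internet access to download Microsoft symbols, and the converted dump still has to be analyzed in WinDbg).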
Memory snapshot reference:
elf2dmp download link:
Link: https://pan.baidu.com/s/1R5fEkjXoFRMd__JARvHfxA?pwd=7ots
Extraction code: 7ots
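Although the system was sluggish and network communication was abnormal, the machine was not under load: every CPU was idle and there was no scheduling problem. So some thread had to be blocked, for example by a deadlock.
There are several kinds of locks to check: PnP locks (pnplocks), queued spinlocks (qlocks), ERESOURCE locks, spinlocks, and user-mode critical sections (CriticalSectionLock). The first four showed no problems.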
!pnplocks
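Shows PnP locks. For example, if disabling, enabling, uninstalling or scanning for devices in Device Manager hangs forever (the OS as a whole is fine, only Device Manager operations hang), some PnP lock is very likely held by a thread that never released it.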
!qlocks
Shows queued spinlocks. A deadlock of this kind usually also shows up as a hung system, but it is not common.
!locks
Shows ERESOURCE locks. A deadlock of this kind also shows up as a hung system.
!running -it (~0;k; 、~1;k;、~2;k;……)
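There is no dedicated command for kernel-mode spinlocks. However, once a kernel-mode spinlock is entered, preemption is disabled, so the threads holding or waiting for the lock are almost always running on some CPU at that moment. Use ~0;k; ~1;k; ~2;k; and so on to see what each CPU is doing. When there are many CPUs, !running -it is more convenient, but it can fail on dumps from some OS versions because of symbol problems, whereas the ~0;k; ~1;k; ~2;k; form works regardless of OS version.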
!for_each_process ".process /r /p @#Process;!ntsdexts.locks"
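Shows CriticalSection locks in user-mode processes. If such a deadlock happens inside an ordinary third-party process, it generally affects only that process. If it happens inside a system service process, the impact depends on what the service does: it may break a system component, or it may hang the whole system.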
!ready
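Shows threads in the ready state, that is, waiting to be scheduled but not yet scheduled. If many threads are in the ready state, scheduling is very likely the problem.
Analyzing the critical sections with !for_each_process ".process /r /p @#Process;!ntsdexts.locks" shows that
every other CritSec has a LockCount of 0, while CritSec ntdll!LdrpLoaderLock+0 at 00000000772380d8 has a LockCount of 6.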
!for_each_process ".process /r /p @#Process;!ntsdexts.locks"
CritSec ntdll!LdrpLoaderLock+0 at 00000000772380d8
WaiterWoken No
LockCount 6
RecursionCount 1
OwningThread 1563c
EntryCount 0
ContentionCount 755
*** Locked
CritSec +1b56330 at 0000000001b56330
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount a1
*** Locked
CritSec +1b57d20 at 0000000001b57d20
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 1
*** Locked
CritSec +2140150 at 0000000002140150
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
CritSec +1b5b140 at 0000000001b5b140
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
CritSec +1b5caa0 at 0000000001b5caa0
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
CritSec +1b5e400 at 0000000001b5e400
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
CritSec +2146140 at 0000000002146140
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 7
*** Locked
CritSec +1546618 at 0000000001546618
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
CritSec +2147aa0 at 0000000002147aa0
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
CritSec +214c690 at 000000000214c690
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 1
*** Locked
CritSec +1549308 at 0000000001549308
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 4b1c
EntryCount 0
ContentionCount 0
*** Locked
CritSec bcrypt!g_csLoaderLock+0 at 000007fefc5bd6c0
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
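When this output runs to hundreds of entries, picking out the critical sections with a non-zero LockCount by eye is error-prone. The following is a minimal Python sketch for post-processing saved output. It assumes the text layout shown above and assumes the numeric fields are hexadecimal (ContentionCount a1 suggests so); the function names are hypothetical and not part of any debugger API.

```python
import re

# Fields assumed to be printed in hexadecimal by the debugger extension.
NUMERIC_FIELDS = {"LockCount", "RecursionCount", "EntryCount", "ContentionCount"}

def parse_critsecs(text):
    """Split saved '!ntsdexts.locks' output into one dict per CritSec entry."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"CritSec\s+(\S+)\s+at\s+([0-9A-Fa-f`]+)$", line)
        if header:
            current = {"name": header.group(1), "address": header.group(2), "locked": False}
            sections.append(current)
        elif current is None:
            continue  # text before the first CritSec header
        elif line == "*** Locked":
            current["locked"] = True
        else:
            parts = line.split()
            if len(parts) == 2 and parts[0] in NUMERIC_FIELDS:
                current[parts[0]] = int(parts[1], 16)
            elif len(parts) == 2 and parts[0] == "OwningThread":
                current["OwningThread"] = parts[1]
    return sections

def with_waiters(sections):
    """Keep only the entries whose LockCount is non-zero."""
    return [s for s in sections if s.get("LockCount", 0) > 0]

# Three entries copied from the output above.
SAMPLE = """
CritSec ntdll!LdrpLoaderLock+0 at 00000000772380d8
WaiterWoken No
LockCount 6
RecursionCount 1
OwningThread 1563c
EntryCount 0
ContentionCount 755
*** Locked
CritSec +1b56330 at 0000000001b56330
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount a1
*** Locked
CritSec bcrypt!g_csLoaderLock+0 at 000007fefc5bd6c0
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 72d4
EntryCount 0
ContentionCount 0
*** Locked
"""

suspects = with_waiters(parse_critsecs(SAMPLE))
for s in suspects:
    print(s["name"], "owner", s["OwningThread"], "LockCount", s["LockCount"])
```

On the three sample entries only ntdll!LdrpLoaderLock is reported, owned by thread 1563c.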
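As the output above shows, the OwningThread is 1563c.
Setting bits 1, 2, 3 and 4 gives 0x1e, the flags value to pass to !thread. Without 0x1e the stack may be displayed incorrectly, so always include it: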
2^1+2^2+2^3+2^4=2+4+8+16=30
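The arithmetic can be double-checked with a few lines of Python. This is only a sketch of how a mask is built from bit positions; the helper name is hypothetical, and it makes no claim about what each individual bit means to the debugger.

```python
def flags_from_bits(bits):
    """OR together one bit per listed position (bit 0 is the lowest bit)."""
    mask = 0
    for position in bits:
        mask |= 1 << position
    return mask

flags = flags_from_bits([1, 2, 3, 4])
print(flags, hex(flags))  # 30 0x1e
```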
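Next, dump that thread with !thread -t 0x1563c 0x1e.
The result suggests that the deadlock may be related to loading C:\Windows\system32\teneyes\ydm.dll. ydm.dll belongs to Host Security (YDService), and there are two copies of it, one under system32 and one in the YDService directory: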
C:\Program Files\QCloud\YunJing\YDEyes\ydm.dll
C:\Windows\System32\teneyes\ydm.dll
The YDService directory (C:\Program Files\QCloud\YunJing\YDEyes\) contains three .exe files.
Whatever calls ydm.dll is most likely one of those three .exe files.
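The process list from !mex.tl -r (!mex tasklist -r) shows that only loader.exe matches. Dumping it further reveals its TimeStamp, which is close to the dates on which the problem recently occurred.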
0: kd> !process 0 1 loader.exe
PROCESS fffffa8007b41060
SessionId: 0 Cid: fc4c Peb: 7fffffd5000 ParentCid: 24cc
DirBase: 695da000 ObjectTable: fffff8a004a4df90 HandleCount: 36.
Image: loader.exe
VadRoot fffffa8005d97010 Vads 29 Clone 0 Private 167. Modified 0. Locked 0.
DeviceMap fffff8a000008b30
Token fffff8a003b22060
ElapsedTime 2 Days 00:29:17.239
UserTime 00:00:00.000
KernelTime 00:00:00.000
QuotaPoolUsage[PagedPool] 20488
QuotaPoolUsage[NonPagedPool] 3488
Working Set Sizes (now,min,max) (678, 50, 345) (2712KB, 200KB, 1380KB)
PeakWorkingSetSize 692
VirtualSize 11 Mb
PeakVirtualSize 14 Mb
PageFaultCount 689
MemoryPriority BACKGROUND
BasePriority 8
CommitCharge 189
0: kd> .process /p fffffa8007b41060;!peb 7fffffd5000
Implicit process is now fffffa80`07b41060
PEB at 000007fffffd5000
InheritedAddressSpace: No
ReadImageFileExecOptions: No
BeingDebugged: No
ImageBaseAddress: 000000013f050000
NtGlobalFlag: 0
NtGlobalFlag2: 0
Ldr 0000000077242e40
Ldr.Initialized: Yes
Ldr.InInitializationOrderModuleList: 00000000000e29d0 . 00000000001099c0
Ldr.InLoadOrderModuleList: 00000000000e28c0 . 00000000001099a0
Ldr.InMemoryOrderModuleList: 00000000000e28d0 . 00000000001099b0
Base TimeStamp Module
13f050000 6698c005 Jul 18 15:11:01 2024 C:\Windows\system32\teneyes\loader.exe
77110000 5dc1e7f7 Nov 06 05:21:59 2019 C:\Windows\SYSTEM32\ntdll.dll
76ff0000 5dc1e834 Nov 06 05:23:00 2019 C:\Windows\system32\kernel32.dll
7fefcd40000 5dc1e835 Nov 06 05:23:01 2019 C:\Windows\system32\KERNELBASE.dll
7fefd810000 5dc1e784 Nov 06 05:20:04 2019 C:\Windows\system32\ADVAPI32.dll
7fefd330000 4eeb033f Dec 16 16:37:19 2011 C:\Windows\system32\msvcrt.dll
7fefd720000 55636728 May 26 02:17:12 2015 C:\Windows\SYSTEM32\sechost.dll
7fefd020000 5dc1e7a2 Nov 06 05:20:34 2019 C:\Windows\system32\RPCRT4.dll
7fef8ca0000 5caeb94d Apr 11 11:49:33 2019 C:\Windows\system32\api-ms-win-core-synch-l1-2-0.DLL
SubSystemData: 0000000000000000
ProcessHeap: 00000000000e0000
ProcessParameters: 00000000000e1e80
CurrentDirectory: 'C:\Program Files\QCloud\YunJing\YDEyes\'
WindowTitle: 'C:\Windows\system32\teneyes\loader.exe'
ImageFile: 'C:\Windows\system32\teneyes\loader.exe'
CommandLine: 'C:\Windows\system32\teneyes\loader.exe -i 924'
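The TimeStamp column in the module list above is the hexadecimal build timestamp of each image, counted in seconds since 1970-01-01 UTC. A minimal Python sketch for decoding it is shown below. The UTC+8 display offset is an assumption, chosen because it reproduces the date the debugger printed for loader.exe; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the debugger rendered timestamps in UTC+8 local time.
DISPLAY_TZ = timezone(timedelta(hours=8))

def pe_timestamp_to_datetime(hex_stamp, tz=DISPLAY_TZ):
    """Convert a hex image timestamp (seconds since the Unix epoch) to a datetime."""
    return datetime.fromtimestamp(int(hex_stamp, 16), tz=tz)

loader_built = pe_timestamp_to_datetime("6698c005")
print(loader_built.isoformat())  # 2024-07-18T15:11:01+08:00
```

Decoded this way, loader.exe was built on 2024-07-18, two days before the 2024-7-20 incident discussed in this article.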
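Analysis of two affected machines
For example, this machine started having the problem after the Host Security update on 2024-7-7. A log excerpt follows: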
[2024-7-7 02:55:52.294][98388][INF][DNS] Old version: 18082001 new: 18082002
[2024-7-7 02:55:52.294][98388][INF][DNS] Old pid: 936 new: 936
[2024-7-7 02:55:52.294][98388][INF][DNS] Old create time: 132352064492422529 new: 132352064492422529
[2024-7-7 02:55:52.294][98388][INF][DNS] Install provider: C:\Windows\system32\teneyes
[2024-7-7 02:55:52.295][98388][INF][DNS] YDM name: ydm64.dll loader name: loader64.exe
[2024-7-7 02:55:52.303][100040][INF][NET] Send packet: 1
[2024-7-7 02:55:52.340][59312][INF][USER] Report full
[2024-7-7 02:55:52.340][21744][INF][LOGON] CRoutineASync class CEventTraceClient thread starts, thread ID: 21744
[2024-7-7 02:55:52.340][21744][INF][LOGON] Client start
[2024-7-7 02:55:52.340][21744][INF][LOGON] Client register trace: NT Kernel Logger
[2024-7-7 02:55:52.340][21744][INF][LOGON] Trace: NT Kernel Logger already exists
[2024-7-7 02:55:52.340][21744][INF][LOGON] Try stop trace: NT Kernel Logger
[2024-7-7 02:55:52.341][27616][INF][PORT] Report full
[2024-7-7 02:55:52.341][27616][INF][PORT] Get net table
[2024-7-7 02:55:52.341][27616][INF][PORT] Get net table(tcp)
[2024-7-7 02:55:52.341][54376][INF] Cache get file md5, file: open failed with error:3
[2024-7-7 02:55:52.346][98388][INF][DNS] Install command: C:\Windows\system32\wevtutil.exe im C:\Windows\system32\teneyes\LogMAN.man
[2024-7-7 02:55:52.370][21744][INF][LOGON] Stop trace successfully
[2024-7-7 02:55:52.370][21744][INF][LOGON] Try start trace: NT Kernel Logger
[2024-7-7 02:55:52.392][57048][INF] CRoutineASync class CRemoteTaskRoutine thread starts, thread ID: 57048
[2024-7-7 02:55:52.392][82316][INF] CRoutineASync class CCommonRoutine thread starts, thread ID: 82316
[2024-7-7 02:55:52.392][57048][INF] Routine start
[2024-7-7 02:55:52.392][82316][INF] Common routine start
[2024-7-7 02:55:52.392][43608][INF] Start push
[2024-7-7 02:55:52.393][43608][INF][Push] Period scan day internal: 1
[2024-7-7 02:55:52.393][43608][INF][Push] Period scan start hour: 2 (2,1)
[2024-7-7 02:55:52.393][43608][INF][Push] Period scan delay second: 2748
[2024-7-7 02:55:52.394][21744][INF][LOGON] Start trace successfully
[2024-7-7 02:55:52.394][21744][INF][LOGON] Enable provider
[2024-7-7 02:55:52.394][21744][INF][LOGON] Enable provider successfully
[2024-7-7 02:55:52.394][21744][INF][LOGON] Client connect to trace
[2024-7-7 02:55:52.394][21744][INF][LOGON] Process trace: NT Kernel Logger
[2024-7-7 02:55:52.401][54376][INF] Cache get file md5, file:System open failed with error:2
[2024-7-7 02:55:52.410][59312][INF][USER] Check count: 1
[2024-7-7 02:55:52.447][43608][INF][Push] Init successfully
[2024-7-7 02:55:52.451][54376][INF] Cache get file md5, file:smss.exe open failed with error:2
[2024-7-7 02:55:52.461][59312][INF][USER] Check count: 2
[2024-7-7 02:55:52.468][98388][INF][DNS] Install result: **** 警告: 系统上安装了发布者 {bcb6085b-1c39-477f-8f2c-aa7cc85f4007}。
For example, this machine started having the problem after the Host Security update on 2024-7-20. A log excerpt follows:
[2024-7-20 21:11:08.865][11408][INF][DNS] Old version: 18082001 new: 18082002
[2024-7-20 21:11:08.865][11408][INF][DNS] Old pid: 948 new: 948
[2024-7-20 21:11:08.865][11408][INF][DNS] Old create time: 133340590491093750 new: 133340590491093750
[2024-7-20 21:11:08.865][11408][INF][DNS] Install provider: C:\Windows\system32\teneyes
[2024-7-20 21:11:08.865][11408][INF][DNS] YDM name: ydm64.dll loader name: loader64.exe
[2024-7-20 21:11:08.881][11408][INF][DNS] Install command: C:\Windows\system32\wevtutil.exe im C:\Windows\system32\teneyes\LogMAN.man
[2024-7-20 21:11:08.912][17024][INF][PORT] Report full
[2024-7-20 21:11:08.912][17024][INF][PORT] Get net table
[2024-7-20 21:11:08.912][17024][INF][PORT] Get net table(tcp)
[2024-7-20 21:11:08.912][ 6168][INF] Cache get file md5, file: open failed with error:3
[2024-7-20 21:11:08.912][ 8828][INF][LOGON] CRoutineASync class CEventTraceClient thread starts, thread ID: 8828
[2024-7-20 21:11:08.912][ 8828][INF][LOGON] Client start
[2024-7-20 21:11:08.912][ 8828][INF][LOGON] Client register trace: NT Kernel Logger
[2024-7-20 21:11:08.912][ 8828][INF][LOGON] Trace: NT Kernel Logger already exists
[2024-7-20 21:11:08.912][ 8828][INF][LOGON] Try stop trace: NT Kernel Logger
[2024-7-20 21:11:08.990][ 7228][INF] CRoutineASync class CRemoteTaskRoutine thread starts, thread ID: 7228
[2024-7-20 21:11:08.990][14332][INF] CRoutineASync class CCommonRoutine thread starts, thread ID: 14332
[2024-7-20 21:11:08.990][ 7228][INF] Routine start
[2024-7-20 21:11:08.990][14332][INF] Common routine start
[2024-7-20 21:11:08.990][ 8636][INF] Start push
[2024-7-20 21:11:08.990][ 8636][INF][Push] Period scan day internal: 1
[2024-7-20 21:11:08.990][ 8636][INF][Push] Period scan start hour: 2 (2,1)
[2024-7-20 21:11:08.990][ 8636][INF][Push] Period scan delay second: 1132
[2024-7-20 21:11:09.053][ 8636][INF][Push] Init successfully
[2024-7-20 21:11:09.053][ 6168][INF] Cache get file md5, file:System open failed with error:2
[2024-7-20 21:11:09.115][ 7588][INF][Push] CRoutineASync class CPushEngine thread starts, thread ID: 7588
[2024-7-20 21:11:09.115][ 7588][INF] Handler routine start
[2024-7-20 21:11:09.115][ 4760][INF][Push] CRoutineASync class CFileUploadRoutine thread starts, thread ID: 4760
[2024-7-20 21:11:09.115][ 4760][INF] File upload routine start
[2024-7-20 21:11:09.115][ 7588][INF][Push] Send bruteforce rule pull successfully
[2024-7-20 21:11:09.115][ 6168][INF] Cache get file md5, file:C:\Windows\System32\smss.exe open failed with error:2
[2024-7-20 21:11:09.180][ 6168][INF] Cache get file md5, file:C:\Windows\System32\csrss.exe open failed with error:2
[2024-7-20 21:11:09.242][ 6168][INF] Cache get file md5, file:C:\Windows\System32\csrss.exe open failed with error:2
[2024-7-20 21:11:09.273][11408][INF][DNS] Install result: **** Warning: Publisher {bcb6085b-1c39-477f-8f2c-aa7cc85f4007} is installed on the system.
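The "create time" values in the [DNS] lines of both logs have the magnitude of a Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC). That interpretation is an assumption; the logs do not state the unit. Under that assumption, a short Python sketch can turn them into readable dates, which helps to judge how long the process with the logged PID had been running when the update happened.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(ticks):
    """Convert 100 ns ticks since 1601-01-01 UTC to a datetime (microsecond precision)."""
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# Values copied from the 2024-7-7 and 2024-7-20 log excerpts above.
for ticks in (132352064492422529, 133340590491093750):
    print(ticks, "->", filetime_to_datetime(ticks).isoformat())
```

If the FILETIME assumption holds, the first value falls on 2020-05-29 and the second on 2023-07-17 (both UTC), so in both cases the process was created long before the update.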
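Combining the symptoms with the Host Security logs and the OS logs, dnscache looked strongly implicated (ping by IP works, ping by domain name gets no response). Killing the dnscache process was tried and, somewhat unexpectedly, it worked.
Temporary workaround:
If PowerShell can still be opened, run the following PowerShell commands: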
sc.exe queryex DNSCache
sc.exe queryex cryptsvc
$DNSCachePID=(((sc.exe queryex DNSCache) |findstr PID) -split ":")[1].Trim()
taskkill.exe /f /pid $DNSCachePID
sc.exe start DNSCache 2>&1 >$null
sc.exe start cryptsvc 2>&1 >$null
tasklist /svc|findstr /i "dns cryptsvc"
sc.exe queryex DNSCache
sc.exe queryex cryptsvc
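When the output of sc.exe queryex has been saved to a file or collected from many machines, the PID and the process-sharing type can also be extracted offline. The Python sketch below assumes the usual "KEY : value" layout of that output; the sample text is illustrative and not taken from an affected machine, and the function name is hypothetical.

```python
# Illustrative sample in the usual layout of `sc.exe queryex` output (assumed).
SAMPLE = """
SERVICE_NAME: Dnscache
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        PID                : 1064
        FLAGS              :
"""

def parse_queryex(text):
    """Return the 'KEY : value' pairs of the output as a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon (continuation lines) are skipped
            fields[key.strip()] = value.strip()
    return fields

info = parse_queryex(SAMPLE)
pid = int(info["PID"])
shares_process = "SHARE_PROCESS" in info["TYPE"]
print(info["SERVICE_NAME"], pid, "shared" if shares_process else "own")
```

A TYPE of WIN32_SHARE_PROCESS means that killing the PID also takes down every other service hosted in the same svchost.exe.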
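If PowerShell does not work, open cmd and run:
tasklist /svc|findstr /i "dns cryptsvc"
Note the PID of the process hosting dnscache, then run the following command, replacing <PID> with that number. After the kill, the service is brought back up automatically under a new PID.
taskkill.exe /f /pid <PID>
If commands cannot be issued through PowerShell, cmd or TAT, and the dnscache service cannot be restarted from the GUI either, the only way to recover is to reboot the machine.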
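When key services (such as Dnscache, CryptSvc and NlaSvc) are decoupled, each one runs in its own svchost.exe. When they are coupled, they share a single svchost.exe, for example: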
tasklist /svc|findstr /i "NlaSvc CryptSvc"
svchost.exe 1064 CryptSvc, Dnscache, LanmanWorkstation, NlaSvc, WinRM
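To check programmatically which services would be taken down together with Dnscache, a line of tasklist /svc output can be parsed as in the following Python sketch. It assumes the whole service list sits on one line, as in the example above (real output may wrap long lists onto continuation lines); the function names are hypothetical.

```python
def parse_tasklist_svc_line(line):
    """Split 'image pid svc1, svc2, ...' into (image, pid, [services])."""
    image, pid, services = line.split(None, 2)
    return image, int(pid), [name.strip() for name in services.split(",")]

def cohosted_with(line, service):
    """Return the other services in the same process, or None if 'service' is absent."""
    _, _, services = parse_tasklist_svc_line(line)
    if service.lower() not in (name.lower() for name in services):
        return None
    return [name for name in services if name.lower() != service.lower()]

LINE = "svchost.exe 1064 CryptSvc, Dnscache, LanmanWorkstation, NlaSvc, WinRM"
print(cohosted_with(LINE, "Dnscache"))
```

An empty list would mean Dnscache runs alone in its process; here it shares PID 1064 with four other services.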
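What exactly do decoupled and coupled mean here? The details follow.
Further analysis based on Windows internals and the system logs
Cryptographic Services (CryptSvc) provides four management services: Catalog Database Service, which confirms the signatures of Windows files and allows new programs to be installed; Protected Root Service, which adds and removes Trusted Root Certification Authority certificates from this computer; Automatic Root Certificate Update Service, which retrieves root certificates from Windows Update and enables scenarios such as SSL; and Key Service, which helps enroll this computer for certificates. If this service is stopped, these management services will not function properly. If this service is disabled, any services that explicitly depend on it will fail to start.
Network Location Awareness (NlaSvc) collects and stores configuration information for the network and notifies programs when this information is modified. If this service is stopped, configuration information might be unavailable. If this service is disabled, any services that explicitly depend on it will fail to start.
On version 1703 and later (that is, Server 2019 and 2022), svchost.exe by default decouples key system services on machines with ≥3.5 GB of memory; on machines with <3.5 GB, svchost.exe defaults to coupled mode.
On older systems before 1703 (2008R2, 2012R2, 2016), svchost.exe defaults to coupled mode no matter how much memory the machine has, meaning that key system services such as Dnscache, CryptSvc and NlaSvc are hosted together in one svchost.exe with a single PID.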
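The default behaviour described in the two paragraphs above can be written down as a small decision rule. The Python sketch below only encodes the article's statement; it uses OS build numbers as a stand-in for "1703 or later" (15063 is the build of version 1703), and it ignores any registry overrides. The function name is hypothetical.

```python
BUILD_1703 = 15063          # first build with per-service svchost splitting
SPLIT_THRESHOLD_GB = 3.5    # memory threshold quoted in the article

def default_svchost_mode(os_build, memory_gb):
    """Return 'decoupled' or 'coupled' per the default rule described above."""
    if os_build >= BUILD_1703 and memory_gb >= SPLIT_THRESHOLD_GB:
        return "decoupled"
    return "coupled"

# Builds: 7601 = 2008R2 SP1, 9600 = 2012R2, 14393 = 2016, 17763 = 2019, 20348 = 2022
for build, mem in [(7601, 16), (14393, 8), (17763, 2), (17763, 4), (20348, 8)]:
    print(build, mem, default_svchost_mode(build, mem))
```

This makes the risk profile visible at a glance: every 2008R2, 2012R2 and 2016 machine, and every low-memory newer machine, ends up in coupled mode by default.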
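In coupled mode, a problem in Dnscache affects CryptSvc and network communication (for example, intranet name resolution fails, which in turn stops intranet services such as cloud monitoring from reporting data), and it makes the whole system sluggish and remote sessions hang, especially in scenarios that involve user logon or certificate validation, such as RDP and browsers. The affected machines are almost all older Windows versions, and on the machines still in the faulty state the system log also contains errors from Cryptographic Services (CryptSvc) and Network Location Awareness (NlaSvc), which fits this explanation exactly.
In addition, the temporary workaround (killing the PID that hosts dnscache) works, and the Host Security logs support the same conclusion (they contain dnscache-related records).
The inference is that this is an intermittent problem that first appeared around February 2024. With the same version, the same size and the same location of ydm.dll, not every machine is affected; the probability of hitting it is fairly low. The temporary workaround is to kill the PID that hosts dnscache.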
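Extension 1:
To decouple a coupled key service so that it runs in its own process (own mode), it can be reconfigured like this, taking the dhcp service as an example: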
sc config dhcp type= own obj= LocalSystem
Extension 2: beware of changes that some VPN software makes to the registry of the dnscache service
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters\DnsPolicyConfig
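The trigger may well be a specific workload such as openvpn. At least as far as the registry is concerned, openvpn modifies the registry entries of the dnscache service, and the svchost.exe hosting dnscache is a critical system process that in some cases hosts several key system services at once.
In a specific workload environment (such as openvpn), the svchost.exe hosting dnscache has a higher chance of deadlocking. Network communication then hangs (for example, ping to an intranet IP works but ping to an intranet domain name gets no response), and so do some key system functions (for example, eventvwr, chrome, taskmgr and ncpa.cpl hang when opened).
To summarize, this problem is triggered by a combination of conditions: a specific workload (such as openvpn) + an upgrade of Host Security from a lower 2.2 sub-version to a higher 2.2 sub-version + a specific configuration (such as a low-spec machine with little memory) + a specific OS (older Windows versions are more likely to be hit, e.g. 2008R2, 2012R2, 2016).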
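Original-work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com to have it removed.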