一、背景
1.讲故事前段时间有位朋友微信找到我,说他生产机器上的 Console 服务看起来像是卡死了,也不生成日志,对方也收不到我的httpclient请求,不知道程序出现什么情况了,特来寻求帮助。
哈哈,一般来说卡死的情况在窗体程序(WinForm,WPF) 上特别多,在 Console,WebApi 中相对较少,既然找到我,那就上 WinDbg 分析吧。
二、WinDbg 分析
1. 程序真的卡死了吗程序之所以能跑的梭梭响,全靠线程帮忙,如果玩不转可能就是线程上出了点什么问题,接下来使用 !t 展示下线程列表。
0 : 000 > ! t ThreadCount : 124 UnstartedThread : 0 BackgroundThread : 105 PendingThread : 0 DeadThread : 18 Hosted Runtime : no Lock DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception 0 1 65 ec 000002097 A4BF390 2 a020 Preemptive 0000000000000000 : 0000000000000000 000002097 a4bea20 - 00001 MTA 3 2 15 afc 000002097 A50BA10 2 b220 Preemptive 00000209060 BE8F8 : 00000209060 BEBE8 000002097 a4bea20 - 00001 MTA ( Finalizer ) 4 3 11 d20 000002097 A524EC0 102 a220 Preemptive 0000000000000000 : 0000000000000000 000002097 a4bea20 - 00001 MTA ( Threadpool Worker ) 5 9 b8a4 000002097 DE10290 3029220 Preemptive 0000000000000000 : 0000000000000000 000002097 a4bea20 - 00001 MTA ( Threadpool Worker ) 6 13 1 b22c 000002097 DE0ADB0 1029220 Preemptive 0000000000000000 : 0000000000000000 000002097 a4bea20 - 00001 MTA ( Threadpool Worker ) 7 16 1 b0e8 000002097 DE1A030 202 b220 Preemptive 00000209063 CB630 : 00000209063 CC1F8 000002097 a4bea20 - 00001 MTA 8 17 138 c8 000002097 DE175C0 202 b220 Preemptive 00000209063 F71C0 : 00000209063 F7BC8 000002097 a4bea20 - 00001 MTA 9 18 1 afd0 000002097 DE181E0 202 b220 Preemptive 00000209064627E0 : 0000020906463598 000002097 a4bea20 - 00001 MTA 10 19 1 ac48 000002097 DE13310 202 b220 Preemptive 000002090632 D6C8 : 000002090632E0 D8 000002097 a4bea20 - 00001 MTA 11 20 18704 000002097 DE16390 202 b220 Preemptive 00000209063 FB5A8 : 00000209063 FBBC8 000002097 a4bea20 - 00001 MTA 12 21 1 ade4 000002097 DE187F0 202 b220 Preemptive 00000209062 EA138 : 00000209062 EA708 000002097 a4bea20 - 00001 MTA 13 22 13164 000002097 DE13920 202 b220 Preemptive 0000020906392108 : 0000020906392 A38 000002097 a4bea20 - 00001 MTA 14 23 1 b334 000002097 DE169A0 202 b220 Preemptive 00000209063 CD9E8 : 00000209063 CE1F8 000002097 a4bea20 - 00001 MTA ... 106 168 19e18 0000020927066770 a029220 Preemptive 0000000000000000 : 0000000000000000 000002097 a4bea20 - 00001 MTA ( Threadpool Completion Port ) 108 136 af74 0000020928590290 8029220 Preemptive 00000209063 B60E0 : 00000209063 B6408 000002097 a4bea20 - 00001 MTA ( Threadpool Completion Port ) 107 84 e7f0 000002097 AD6B600 8029220 Preemptive 00000209063E3300 : 00000209063E3 DD8 000002097 a4bea20 - 00001 MTA ( Threadpool Completion Port )
一般来说卦中的 Lock Count 列表示当前线程所持有的托管锁个数,现在显示的 -00001 应该是命令不起效果了。。。不过没关系,我们还可以通过 !syncblk 来看下 lock 的情况,毕竟 95% 的锁场景都会用到它。
0 : 000 > ! syncblk Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner 176 000002097E0 DA1D8 17 1 00000209216 A3500 132 c8 71 00000209002 c6a68 System .Object 191 000002097E0 DAC88 27 1 000002097 ADC54F0 1 ae90 53 00000209000185e8 System .IO .TextWriter + SyncTextWriter ----------------------------- Total 270 CCW 0 RCW 0 ComClassFactory 0 Free 32
从卦中的 MonitorHeld=27 来看,表示这个 SyncTextWriter 对象当前有 13 个线程在等待,有 1 个线程在持有,那这个线程为什么没有退出呢?接下来可以切到 53 号线程上,查看下它的线程栈。
0 : 053 > ~~ [ 1 ae90 ] s ntdll ! NtWriteFile + 0x14 : 00007 ffd`a70df774 c3 ret 0 : 053 > ! clrstack OS Thread Id : 0x1ae90 ( 53 ) Child SP IP Call Site 00000070505 BDE88 00007 ffda70df774 [ InlinedCallFrame : 00000070505 bde88 ] Interop + Kernel32 .WriteFile ( IntPtr , Byte * , Int32 , Int32 ByRef , IntPtr ) 00000070505 BDE88 00007 ffd32a7cbe2 [ InlinedCallFrame : 00000070505 bde88 ] Interop + Kernel32 .WriteFile ( IntPtr , Byte * , Int32 , Int32 ByRef , IntPtr ) 00000070505 BDE50 00007 ffd32a7cbe2 ILStubClass .IL_STUB_PInvoke ( IntPtr , Byte * , Int32 , Int32 ByRef , IntPtr ) 00000070505 BDF20 00007 ffd3a2d6971 System .ConsolePal + WindowsConsoleStream .WriteFileNative ( IntPtr , Byte [ ] , Int32 , Int32 , Boolean ) 00000070505 BDF80 00007 ffd3a2d672f System .ConsolePal + WindowsConsoleStream .Write ( Byte [ ] , Int32 , Int32 ) [ / _ / src / libraries / System .Console / src / System / ConsolePal .Windows .cs @ 1131 ] 00000070505 BDFD0 00007 ffd377c05d9 System .IO .StreamWriter .Flush ( Boolean , Boolean ) [ / _ / src / libraries / System .Private .CoreLib / src / System / IO / StreamWriter .cs @ 260 ] 00000070505 BE050 00007 ffd3a2d6687 System .IO .StreamWriter .WriteLine ( System .String ) 00000070505 BE0D0 00007 ffd3a2d6472 System .IO .TextWriter + SyncTextWriter .WriteLine ( System .String ) [ / _ / src / libraries / System .Private .CoreLib / src / System / IO / TextWriter .cs @ 880 ] 00000070505 BE130 00007 ffd3a2d640d System .Console .WriteLine ( System .String ) [ / _ / src / libraries / System .Console / src / System / Console .cs @ 716 ] 00000070505 BE160 00007 ffd3a2375ab HCloud .xxx + d__2 .MoveNext ( ) 00000070505 BEAE0 00007 ffd37e19365 System .Runtime .CompilerServices .AsyncMethodBuilderCore .Start [ [ System .__Canon , System .Private .CoreLib ] ] ( System .__Canon ByRef ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Runtime / CompilerServices / AsyncMethodBuilderCore .cs @ 63 ] 00000070505 BEB40 00007 ffd3a235cc3 HCloud .xxxx .Execute ( Quartz .IJobExecutionContext ) 00000070505 BEBB0 00007 ffd3a231264 Quartz .xxx + d__9 .MoveNext ( ) 00000070505 BEE90 00007 ffd3a230183 System .Runtime .CompilerServices .AsyncMethodBuilderCore .Start [ [ Quartz .Core .JobRunShell + d__9 , Quartz ] ] ( d__9 ByRef ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Runtime / CompilerServices / AsyncMethodBuilderCore .cs @ 63 ] 00000070505 BEF10 00007 ffd3a2300db System .Runtime .CompilerServices .AsyncTaskMethodBuilder .Start [ [ Quartz .Core .JobRunShell + d__9 , Quartz ] ] ( d__9 ByRef ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Runtime / CompilerServices / AsyncTaskMethodBuilder .cs @ 33 ] 00000070505 BEF40 00007 ffd3a230082 Quartz .xxxxl .Run ( System .Threading .CancellationToken ) 00000070505 BF020 00007 ffd3a22ffc3 Quartz .Core .QuartzSchedulerThread + c__DisplayClass28_0 .b__0 ( ) 00000070505 BF060 00007 ffd3939b71a System .Threading .Tasks .Task `1 [ [ System .__Canon , System .Private .CoreLib ] ] .InnerInvoke ( ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Threading / Tasks / Future .cs @ 507 ] 00000070505 BF0C0 00007 ffd37d54431 System .Threading .ExecutionContext .RunFromThreadPoolDispatchLoop ( System .Threading .Thread , System .Threading .ExecutionContext , System .Threading .ContextCallback , System .Object ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Threading / ExecutionContext .cs @ 300 ] 00000070505 BF110 00007 ffd37d53657 System .Threading .Tasks .Task .ExecuteWithThreadLocal ( System .Threading .Tasks .Task ByRef , System .Threading .Thread ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Threading / Tasks / Task .cs @ 2352 ] 00000070505 BF1C0 00007 ffd37d50e04 System .Threading .ThreadPoolWorkQueue .Dispatch ( ) [ / _ / src / libraries / System .Private .CoreLib / src / System / Threading / ThreadPool .cs @ 677 ] 00000070505 BF690 00007 ffd925ea7a3 [ DebuggerU2MCatchHandlerFrame : 00000070505 bf690 ]
仔细观察线程栈会很惊讶的发现,居然还能卡在 System.Console.WriteLine 方法上,挺奇怪的,为了探究原因,我们使用 k 命令看下非托管栈。
0 : 053 > k 5 # Child - SP RetAddr Call Site 00 00000070 `505bddd8 00007 ffd`a32febda ntdll ! NtWriteFile + 0x14 01 00000070 `505bdde0 00007 ffd`32a7cbe2 KERNELBASE ! WriteFile + 0x7a 02 00000070 `505bde50 00007 ffd`3a2d6971 0x00007ffd `32a7cbe2 03 00000070 `505bdf20 00007 ffd`3a2d672f System_Console ! System .ConsolePal .WindowsConsoleStream .WriteFileNative + 0x61 04 00000070 `505bdf80 00007 ffd`377c05d9 System_Console ! System .ConsolePal .WindowsConsoleStream .Write + 0x3f 0 : 053 > ub ntdll ! NtWriteFile + 0x14 ntdll ! NtDeviceIoControlFile + 0x15 : 00007 ffd`a70df755 cd2e int 2 Eh 00007 ffd`a70df757 c3 ret 00007 ffd`a70df758 0 f1f840000000000 nop dword ptr [ rax + rax ] ntdll ! NtWriteFile : 00007 ffd`a70df760 4 c8bd1 mov r10 , rcx 00007 ffd`a70df763 b808000000 mov eax , 8 00007 ffd`a70df768 f604250803fe7f01 test byte ptr [ SharedUserData + 0x308 ( 00000000 `7ffe0308 ) ] , 1 00007 ffd`a70df770 7503 jne ntdll ! NtWriteFile + 0x15 ( 00007 ffd`a70df775 ) 00007 ffd`a70df772 0 f05 syscall
从上面的 syscall 系统调用关键词看,代码是将用户态的 ntdll!NtWriteFile 切到入了内核态的 nt!NtWriteFile 方法,那进入了内核态为什么没有返回呢?这又是一个值得思索的问题。
2. 为什么进入了内核态无返回其实 ntdll!NtWriteFile 这个 win32 api 方法的第一个参数是一个 handle 的文件句柄,签名如下。
__kernel_entry NTSYSCALLAPI NTSTATUS NtWriteFile ( [ in ] HANDLE FileHandle , [ in , optional ] HANDLE Event , [ in , optional ] PIO_APC_ROUTINE ApcRoutine , [ in , optional ] PVOID ApcContext , [ out ] PIO_STATUS_BLOCK IoStatusBlock , [ in ] PVOID Buffer , [ in ] ULONG Length , [ in , optional ] PLARGE_INTEGER ByteOffset , [ in , optional ] PULONG Key ) ;
可能 handle 在内核中被别人占用了,可以用 !handle 查看下 rcx 寄存器。
0 : 053 > r rax = 0000000000000008 rbx = 00000070505 bdf50 rcx = 0000000000000418 rdx = 0000000000000000 rsi = 0000000000000000 rdi = 0000000000000418 rip = 00007 ffda70df774 rsp = 00000070505 bddd8 rbp = 00000070505 bdf10 r8 = 0000000002000805 r9 = 0000000000000000 r10 = 0000020906191620 r11 = 00000070505 bc8f8 r12 = 0000000000000100 r13 = 0000000000000053 r14 = 0000000000000077 r15 = 000002097 adc54f0 iopl = 0 nv up ei pl zr na po nc cs = 0033 ss = 002 b ds = 002 b es = 002 b fs = 0053 gs = 002 b efl = 00000246 ntdll ! NtWriteFile + 0x14 : 00007 ffd`a70df774 c3 ret 0 : 053 > ! handle 0000000000000418 f Handle 0000000000000418 Type File Attributes 0 GrantedAccess 0x120196 : ReadControl , Synch Write / Add , Append / SubDir / CreatePipe , WriteEA , ReadAttr , WriteAttr HandleCount 3 PointerCount 65483 No object specific information available
哈哈,其实也看不出什么,也没法进入内核态,所以下一步只能到网上搜搜看,其实有经验的朋友肯定猜出来了,应该是控制台启用了 快捷编辑窗口 ,截图如下:
快捷编辑窗口 简而言之就是用户可以在控制台上独占这个窗口,编辑一些内容, 可一旦被用户独占,那程序侧就没法输出内容到 控制台窗口 上了,只能在 内核态 傻傻等等,这应该就是形成原因,画个图大概就像下面这样。
将信息告诉朋友后,朋友说他用的是 Windows 服务部署,但不管是什么模式部署,注释掉 Console.WriteLine 肯定没错。
三:总结
这次卡死的事故,主要还是开发人员大量的使用 Console.WriteLine 来输出日志,在某个时刻输出端窗口因为各种原因被他人独占,导致程序侧无法输出内容到窗口而一直被迫等待,之后朋友将日志输出切换到文件模式,问题得以解决。
其实这个问题很多新手朋友都会犯,特此记录下来。
原文地址:https://mp.weixin.qq.com/s/_0eTb_MiQYC0OW9RYVWiGw
查看更多关于记一次 .NET 某企业OA后端服务 卡死分析的详细内容...