好得很程序员自学网

<tfoot draggable='sEl'></tfoot>

记一次 .NET 某企业OA后端服务 卡死分析

一、背景

1.讲故事

前段时间有位朋友微信找到我,说他生产机器上的 Console 服务看起来像是卡死了,也不生成日志,对方也收不到我的httpclient请求,不知道程序出现什么情况了,特来寻求帮助。

哈哈,一般来说卡死的情况在窗体程序(WinForm,WPF) 上特别多,在 Console,WebApi 中相对较少,既然找到我,那就上 WinDbg 分析吧。

二、WinDbg 分析

1. 程序真的卡死了吗

程序之所以能跑的梭梭响,全靠线程帮忙,如果玩不转可能就是线程上出了点什么问题,接下来使用 !t 展示下线程列表。

 0  :  000  >   ! t
ThreadCount :   124  UnstartedThread :   0  BackgroundThread :   105  PendingThread :   0  DeadThread :   18  Hosted Runtime :  no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain  Count  Apt Exception  0   1   65 ec  000002097 A4BF390  2 a020 Preemptive  0000000000000000  :  0000000000000000   000002097 a4bea20  -  00001  MTA  3   2   15 afc  000002097 A50BA10  2 b220 Preemptive  00000209060 BE8F8 :  00000209060 BEBE8  000002097 a4bea20  -  00001  MTA  ( Finalizer )   4   3   11 d20  000002097 A524EC0  102 a220 Preemptive  0000000000000000  :  0000000000000000   000002097 a4bea20  -  00001  MTA  ( Threadpool Worker )   5   9  b8a4  000002097 DE10290  3029220  Preemptive  0000000000000000  :  0000000000000000   000002097 a4bea20  -  00001  MTA  ( Threadpool Worker )   6   13   1 b22c  000002097 DE0ADB0  1029220  Preemptive  0000000000000000  :  0000000000000000   000002097 a4bea20  -  00001  MTA  ( Threadpool Worker )   7   16   1 b0e8  000002097 DE1A030  202 b220 Preemptive  00000209063 CB630 :  00000209063 CC1F8  000002097 a4bea20  -  00001  MTA  8   17   138 c8  000002097 DE175C0  202 b220 Preemptive  00000209063 F71C0 :  00000209063 F7BC8  000002097 a4bea20  -  00001  MTA  9   18   1 afd0  000002097 DE181E0  202 b220 Preemptive  00000209064627E0  :  0000020906463598   000002097 a4bea20  -  00001  MTA  10   19   1 ac48  000002097 DE13310  202 b220 Preemptive  000002090632 D6C8 :  000002090632E0 D8  000002097 a4bea20  -  00001  MTA  11   20   18704   000002097 DE16390  202 b220 Preemptive  00000209063 FB5A8 :  00000209063 FBBC8  000002097 a4bea20  -  00001  MTA  12   21   1 ade4  000002097 DE187F0  202 b220 Preemptive  00000209062 EA138 :  00000209062 EA708  000002097 a4bea20  -  00001  MTA  13   22   13164   000002097 DE13920  202 b220 Preemptive  0000020906392108  :  0000020906392 A38  000002097 a4bea20  -  00001  MTA  14   23   1 b334  000002097 DE169A0  202 b220 Preemptive  00000209063 CD9E8 :  00000209063 CE1F8  000002097 a4bea20  -  00001  MTA 
  ...  106   168   19e18   0000020927066770  a029220 Preemptive  0000000000000000  :  0000000000000000   000002097 a4bea20  -  00001  MTA  ( Threadpool Completion Port )   108   136  af74  0000020928590290   8029220  Preemptive  00000209063 B60E0 :  00000209063 B6408  000002097 a4bea20  -  00001  MTA  ( Threadpool Completion Port )   107   84  e7f0  000002097 AD6B600  8029220  Preemptive  00000209063E3300  :  00000209063E3 DD8  000002097 a4bea20  -  00001  MTA  ( Threadpool Completion Port )  

一般来说卦中的 Lock Count 列表示当前线程所持有的托管锁个数,现在显示的 -00001 应该是命令不起效果了。。。不过没关系,我们还可以通过 !syncblk 来看下 lock 的情况,毕竟 95% 的锁场景都会用到它。

 0  :  000  >   ! syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner  176   000002097E0 DA1D8  17   1   00000209216 A3500  132 c8  71   00000209002 c6a68 System .Object   191   000002097E0 DAC88  27   1   000002097 ADC54F0  1 ae90  53   00000209000185e8  System .IO  .TextWriter  + SyncTextWriter  -----------------------------  Total  270  CCW  0  RCW  0  ComClassFactory  0  Free  32  

从卦中的 MonitorHeld=27 来看,表示这个 SyncTextWriter 对象当前有 13 个线程在等待,有 1 个线程在持有,那这个线程为什么没有退出呢?接下来可以切到 53 号线程上,查看下它的线程栈。

 0  :  053  >   ~~  [  1 ae90 ] s
ntdll ! NtWriteFile +  0x14  :   00007 ffd`a70df774 c3              ret  0  :  053  >   ! clrstack 
OS Thread Id :   0x1ae90   (  53  )  Child SP               IP Call Site  00000070505 BDE88  00007 ffda70df774  [ InlinedCallFrame :   00000070505 bde88 ]  Interop + Kernel32 .WriteFile  ( IntPtr ,  Byte *  ,  Int32 ,  Int32 ByRef ,  IntPtr )   00000070505 BDE88  00007 ffd32a7cbe2  [ InlinedCallFrame :   00000070505 bde88 ]  Interop + Kernel32 .WriteFile  ( IntPtr ,  Byte *  ,  Int32 ,  Int32 ByRef ,  IntPtr )   00000070505 BDE50  00007 ffd32a7cbe2 ILStubClass .IL_STUB_PInvoke  ( IntPtr ,  Byte *  ,  Int32 ,  Int32 ByRef ,  IntPtr )   00000070505 BDF20  00007 ffd3a2d6971 System .ConsolePal  + WindowsConsoleStream .WriteFileNative  ( IntPtr ,  Byte [  ]  ,  Int32 ,  Int32 ,   Boolean  )   00000070505 BDF80  00007 ffd3a2d672f System .ConsolePal  + WindowsConsoleStream .Write  ( Byte [  ]  ,  Int32 ,  Int32 )   [  / _ / src / libraries / System .Console  / src / System / ConsolePal .Windows  .cs  @  1131  ]   00000070505 BDFD0  00007 ffd377c05d9 System .IO  .StreamWriter  .Flush  (  Boolean  ,   Boolean  )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / IO / StreamWriter .cs  @  260  ]   00000070505 BE050  00007 ffd3a2d6687 System .IO  .StreamWriter  .WriteLine  ( System .String  )   00000070505 BE0D0  00007 ffd3a2d6472 System .IO  .TextWriter  + SyncTextWriter .WriteLine  ( System .String  )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / IO / TextWriter .cs  @  880  ]   00000070505 BE130  00007 ffd3a2d640d System .Console  .WriteLine  ( System .String  )   [  / _ / src / libraries / System .Console  / src / System / Console .cs  @  716  ]   00000070505 BE160  00007 ffd3a2375ab HCloud .xxx  + d__2 .MoveNext  (  )   00000070505 BEAE0  00007 ffd37e19365 System .Runtime  .CompilerServices  .AsyncMethodBuilderCore  .Start  [  [ System .__Canon  ,  System .Private  .CoreLib  ]  ]  ( System .__Canon  ByRef )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Runtime / CompilerServices / AsyncMethodBuilderCore .cs  @  63  ]   00000070505 BEB40  00007 ffd3a235cc3 HCloud .xxxx  .Execute  ( Quartz .IJobExecutionContext  )   00000070505 BEBB0  00007 ffd3a231264 Quartz .xxx  + d__9 .MoveNext  (  )   00000070505 BEE90  00007 ffd3a230183 System .Runtime  .CompilerServices  .AsyncMethodBuilderCore  .Start  [  [ Quartz .Core  .JobRunShell  + d__9 ,  Quartz ]  ]  ( d__9 ByRef )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Runtime / CompilerServices / AsyncMethodBuilderCore .cs  @  63  ]   00000070505 BEF10  00007 ffd3a2300db System .Runtime  .CompilerServices  .AsyncTaskMethodBuilder  .Start  [  [ Quartz .Core  .JobRunShell  + d__9 ,  Quartz ]  ]  ( d__9 ByRef )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Runtime / CompilerServices / AsyncTaskMethodBuilder .cs  @  33  ]   00000070505 BEF40  00007 ffd3a230082 Quartz .xxxxl  .Run  ( System .Threading  .CancellationToken  )   00000070505 BF020  00007 ffd3a22ffc3 Quartz .Core  .QuartzSchedulerThread  + c__DisplayClass28_0 .b__0  (  )   00000070505 BF060  00007 ffd3939b71a System .Threading  .Tasks  .Task `1 [  [ System .__Canon  ,  System .Private  .CoreLib  ]  ]  .InnerInvoke  (  )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Threading / Tasks / Future .cs  @  507  ]   00000070505 BF0C0  00007 ffd37d54431 System .Threading  .ExecutionContext  .RunFromThreadPoolDispatchLoop  ( System .Threading  .Thread  ,  System .Threading  .ExecutionContext  ,  System .Threading  .ContextCallback  ,  System .Object  )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Threading / ExecutionContext .cs  @  300  ]   00000070505 BF110  00007 ffd37d53657 System .Threading  .Tasks  .Task  .ExecuteWithThreadLocal  ( System .Threading  .Tasks  .Task  ByRef ,  System .Threading  .Thread  )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Threading / Tasks / Task .cs  @  2352  ]   00000070505 BF1C0  00007 ffd37d50e04 System .Threading  .ThreadPoolWorkQueue  .Dispatch  (  )   [  / _ / src / libraries / System .Private  .CoreLib  / src / System / Threading / ThreadPool .cs  @  677  ]   00000070505 BF690  00007 ffd925ea7a3  [ DebuggerU2MCatchHandlerFrame :   00000070505 bf690 ]  

仔细观察线程栈会很惊讶的发现,居然还能卡在 System.Console.WriteLine 方法上,挺奇怪的,为了探究原因,我们使用 k 命令看下非托管栈。

 0  :  053  >  k  5  # Child - SP          RetAddr               Call Site  00   00000070 `505bddd8  00007 ffd`a32febda     ntdll ! NtWriteFile +  0x14   01   00000070 `505bdde0  00007 ffd`32a7cbe2     KERNELBASE ! WriteFile +  0x7a   02   00000070 `505bde50  00007 ffd`3a2d6971  0x00007ffd `32a7cbe2  03   00000070 `505bdf20  00007 ffd`3a2d672f     System_Console ! System .ConsolePal  .WindowsConsoleStream  .WriteFileNative  +  0x61   04   00000070 `505bdf80  00007 ffd`377c05d9     System_Console ! System .ConsolePal  .WindowsConsoleStream  .Write  +  0x3f   0  :  053  >  ub ntdll ! NtWriteFile +  0x14  ntdll ! NtDeviceIoControlFile +  0x15  :   00007 ffd`a70df755 cd2e  int   2 Eh  00007 ffd`a70df757 c3              ret  00007 ffd`a70df758  0 f1f840000000000 nop     dword ptr  [ rax + rax ]  ntdll ! NtWriteFile :   00007 ffd`a70df760  4 c8bd1          mov     r10 , rcx  00007 ffd`a70df763 b808000000      mov     eax ,  8   00007 ffd`a70df768 f604250803fe7f01 test    byte ptr  [ SharedUserData +  0x308   (  00000000 `7ffe0308 )  ]  ,  1   00007 ffd`a70df770  7503  jne     ntdll ! NtWriteFile +  0x15   (  00007 ffd`a70df775 )   00007 ffd`a70df772  0 f05            syscall

从上面的 syscall 系统调用关键词看,代码是将用户态的 ntdll!NtWriteFile 切到入了内核态的 nt!NtWriteFile 方法,那进入了内核态为什么没有返回呢?这又是一个值得思索的问题。

2. 为什么进入了内核态无返回

其实 ntdll!NtWriteFile 这个 win32 api 方法的第一个参数是一个 handle 的文件句柄,签名如下。

__kernel_entry NTSYSCALLAPI NTSTATUS NtWriteFile (   [  in  ]  HANDLE           FileHandle ,   [  in  ,  optional ]  HANDLE           Event ,   [  in  ,  optional ]  PIO_APC_ROUTINE  ApcRoutine ,   [  in  ,  optional ]  PVOID            ApcContext ,   [ out ]  PIO_STATUS_BLOCK IoStatusBlock ,   [  in  ]  PVOID            Buffer ,   [  in  ]  ULONG            Length ,   [  in  ,  optional ]  PLARGE_INTEGER   ByteOffset ,   [  in  ,  optional ]  PULONG           Key  )  ;  

可能 handle 在内核中被别人占用了,可以用 !handle 查看下 rcx 寄存器。

 0  :  053  >  r
rax =  0000000000000008  rbx =  00000070505 bdf50 rcx =  0000000000000418  rdx =  0000000000000000  rsi =  0000000000000000  rdi =  0000000000000418  rip =  00007 ffda70df774 rsp =  00000070505 bddd8 rbp =  00000070505 bdf10
 r8 =  0000000002000805  r9 =  0000000000000000  r10 =  0000020906191620  r11 =  00000070505 bc8f8 r12 =  0000000000000100  r13 =  0000000000000053  r14 =  0000000000000077  r15 =  000002097 adc54f0
iopl =  0  nv up ei pl zr na po nc
cs =  0033  ss =  002 b  ds =  002 b  es =  002 b  fs =  0053  gs =  002 b             efl =  00000246  ntdll ! NtWriteFile +  0x14  :   00007 ffd`a70df774 c3              ret  0  :  053  >   ! handle  0000000000000418  f
Handle  0000000000000418  Type          File
  Attributes  0  GrantedAccess  0x120196  :  ReadControl , Synch
         Write / Add , Append / SubDir / CreatePipe , WriteEA , ReadAttr , WriteAttr
  HandleCount  3  PointerCount  65483  No object specific information available

哈哈,其实也看不出什么,也没法进入内核态,所以下一步只能到网上搜搜看,其实有经验的朋友肯定猜出来了,应该是控制台启用了 快捷编辑窗口 ,截图如下:

快捷编辑窗口 简而言之就是用户可以在控制台上独占这个窗口,编辑一些内容, 可一旦被用户独占,那程序侧就没法输出内容到 控制台窗口 上了,只能在 内核态 傻傻等等,这应该就是形成原因,画个图大概就像下面这样。

将信息告诉朋友后,朋友说他用的是 Windows 服务部署,但不管是什么模式部署,注释掉 Console.WriteLine 肯定没错。

三:总结

这次卡死的事故,主要还是开发人员大量的使用 Console.WriteLine 来输出日志,在某个时刻输出端窗口因为各种原因被他人独占,导致程序侧无法输出内容到窗口而一直被迫等待,之后朋友将日志输出切换到文件模式,问题得以解决。

其实这个问题很多新手朋友都会犯,特此记录下来。

原文地址:https://mp.weixin.qq.com/s/_0eTb_MiQYC0OW9RYVWiGw

查看更多关于记一次 .NET 某企业OA后端服务 卡死分析的详细内容...

  阅读:56次