Sunday, June 03, 2007

Production Debugging for Hung ASP.Net 2 applications – a crash course

If your app is down right now skip the intro – I’ll understand.

Introduction


First up: debugging – even in the development environment – should never be a substitute for detailed diagnostic logging and appropriate – meaningful – error handling. Any time I fix an exception, if the problem isn’t initially obvious, I fix the logging code first. Then and only then do I fix the exception.

Baking in a support mentality early in the development process is the only way to avoid an application that’s hard (and by that I mean expensive) to support. The customer might be paying for the support time, but if you can’t turn around support issues quickly – and practically any issue in a production business application needs turning around quickly – they start questioning your competence.

But sometimes logs aren’t enough, and even with the best will in the world, sometimes they’re not sufficiently detailed (they won’t – for example – be much help tracking down that memory leak). Sometimes you need to know what’s thrashing right now; to see the smoking gun.

Enter the world of crash dump debugging.

I’ve been following Tess’s blog for a while now, but I still had a big hole in my understanding. Sure, I knew I could use the !clrstack command to start looking at managed thread stacks, but where the hell was I typing that? This, then, is my introduction to Production Debugging for .Net applications (or ‘how to understand what Tess is talking about’).

I’m going to be talking about debugging hung .Net 2 ASP.Net applications, but much of this is relevant to debugging any .Net application in a production or testing environment.

Getting the Dump


Assuming you don’t have time to read a more detailed article on this (and the Microsoft guide is 170+ pages!), just do this:

WARNING: Installing the debugging tools on your production server isn’t recommended. That’s not to say you have any choice in the matter, but you should be aware there are security and availability implications that – if your app is hung – you probably don’t care about right now.

Install the Microsoft Debugging Tools for Windows and IISState on the machine where the hang occurs. You’ll need the Debugging Tools on your workstation too, but you can do this later if it’s happening right now.

If your app is thrashing the CPU, use Task Manager to get the process ID (PID) of the W3WP process that’s using the most CPU (there may be more than one). NB: PIDs aren’t displayed by default: use View\Select Columns to turn them on.

Otherwise, at the command prompt, enter cscript iisapp.vbs to determine the process ID that matches your app pool.
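
On Windows Server 2003 the IIS admin scripts live in system32, so the full command is something like the first line below (and tasklist makes a reasonable cross-check if you just want every w3wp.exe and its PID):

cscript %systemroot%\system32\iisapp.vbs
tasklist /FI "IMAGENAME eq w3wp.exe"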

At the command prompt, cd to wherever you installed IISState (C:\IISState is the default). If there is existing content in the output folder, delete or rename the folder. Then execute the following:
IISState -p <PID> -d

(…replacing <PID> with the PID you determined above)

WARNING: At this point your app will become unresponsive until the dump completes

IISState will put a dump file (.dmp) and a text summary (.log) of the process’s threads etc… in the output directory.

Rename the output folder to something that reflects the date/time, so future dumps don’t overwrite it.
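
Nothing clever needed here – something along these lines does the job (the folder name is just an example):

ren C:\IISState\output output_20070603_1415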

Since we’re doing a ‘hang’, ideally at this point you’d wait a few minutes, and then take another dump. If you don’t, you may regret it later and/or have to wait for the issue to recur before you can definitively pin it down. However, if the world and his dog want the site back up right now then skip it – it’s your call.

Kill that process. IIS will recycle it, and things should go back to normal for now. Panic over.
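
If you’d rather do the killing from the same command prompt than dig around in Task Manager, taskkill is built into Windows Server 2003 (the /F forces it, since a hung process may well ignore a polite request):

taskkill /PID <PID> /F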

Analyzing the Dump


Copy the renamed output directory down onto your machine. For the process ID you dumped there will also be an IISState summary file, eg IISState-5090.log. This log is a dump of the stack traces of all the threads in the process, and it attempts to perform some additional diagnostics for you. Unfortunately it doesn’t (at present) support .Net 2, so most likely it’s not going to tell you what the problem is. You should at least be able to see a few threads that look something like this:
Thread ID: 13
System Thread ID: d10
Kernel Time: 0:0:4.186
User Time: 0:20:25.702
Thread Type: Managed Thread. Possible ASP.Net page or other .Net worker
No valid SOS data table found.


Begin System Thread Information

# ChildEBP RetAddr 
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 01a9e76c 06ee91cd 0x6eea301
01 01a9eb6c 06cc6e91 0x6ee91cd
02 01a9ee44 06cc6278 0x6cc6e91
03 01a9ee80 06cc3ddd 0x6cc6278
04 01a9eeb8 06cc381c 0x6cc3ddd
05 01a9ef3c 0736fa0f 0x6cc381c
06 01a9ef84 0736f65f 0x736fa0f
07 01a9efec 0725853d 0x736f65f
08 01a9f050 79e9272b 0x725853d
09 01a9f060 79e926ad mscorwks!DllRegisterServerInternal+0x1d6b
0a 01a9f0e0 79e9450b mscorwks!DllRegisterServerInternal+0x1ced
0b 01a9f220 79e943f3 mscorwks!DllRegisterServerInternal+0x3b4b
0c 01a9f37c 07258408 mscorwks!DllRegisterServerInternal+0x3a33
0d 05b30b78 00000008 0x7258408

What does this mean? Well, IISState doesn’t understand what’s going on with the stack below that mscorwks dll, which screams .Net 2 stack to you (as does the comment against Thread Type). But the likelihood is there’s not much else we’re going to get out of that file[1]. So it’s time to look at the mini dump.

Open a Visual Studio 2005 Command Prompt window, and from there launch C:\Program Files\Debugging Tools for Windows\windbg.exe (this ensures the .Net 2 framework is in the PATH for the WinDBG process, and makes life simpler – if you don’t have VS2005 installed, set the PATH up manually). Do File \ Open Crash Dump and browse to the .dmp file you created with IISState. If it asks you about saving workspace information just hit no.
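
As an aside, you can also open the dump straight from that command prompt using WinDBG’s -z switch (the path here is just an example):

"C:\Program Files\Debugging Tools for Windows\windbg.exe" -z C:\dumps\w3wp_hang.dmp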

It’ll open a text window with some summary information about the dump. This window has a little command bar at the bottom. This is where you do stuff.

The first thing it probably says is “Symbol search path is: *** Invalid ***”. So we type .symfix into the command bar to set it up to dynamically load symbols from Microsoft as required. You can set the _NT_SYMBOL_PATH environment variable so you don’t have to do this each time, but I’ve not got round to that yet.
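
For reference, .symfix just points the debugger at the public Microsoft symbol server. The hand-rolled equivalent looks something like the following (C:\symbols is an arbitrary local cache folder), and putting the same SRV* string into _NT_SYMBOL_PATH is what makes it stick across sessions:

.sympath SRV*C:\symbols*http://msdl.microsoft.com/download/symbols
.reload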

It probably also says
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(13cc.84): Wake debugger - code 80000007 (first/second chance not available)

…but this is exactly what it says it is – it’s the console debugger attaching (which is how IISState got the dump), and not a ‘real’ exception per se.

You can type ~ to get a list of the threads, or ~* kb to get them with a bit of stack trace too, but we’re still only looking at the unmanaged stacks here, so it’s pretty meaningless to me.

Type .load SOS to load the managed code debugging extension SOS.dll, then type !threads to see all the managed threads. Now that’s more like it.
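
If .load SOS complains it can’t find the DLL (ie the framework directory isn’t on the PATH after all), two standard alternatives are to load it from wherever mscorwks.dll was loaded from, or by its full path (shown here assuming a default framework install):

.loadby sos mscorwks
.load C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll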

NB: If you’re debugging on a different platform to the original crash (eg server was Win2003, you’re on XP), then you’re going to get an error at this point “Failed to load data access DLL, 0x80004005”. The error message contains the fix: execute .cordll -ve -u -l and the correct version of mscordacwks.dll for the dump will automatically be loaded, eg:
0:000> .cordll -ve -u -l
CLRDLL: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscordacwks.dll:2.0.50727.42 f:0 doesn't match desired version 2.0.50727.91 f:0
CLRDLL: Loaded DLL C:\Program Files\Debugging Tools for Windows\sym\mscordacwks_x86_x86_2.0.50727.91.dll\442337B7562000\mscordacwks_x86_x86_2.0.50727.91.dll
CLR DLL status: Loaded DLL C:\Program Files\Debugging Tools for Windows\sym\mscordacwks_x86_x86_2.0.50727.91.dll\442337B7562000\mscordacwks_x86_x86_2.0.50727.91.dll


Your !threads output might look like this:

ThreadCount: 9
UnstartedThread: 0
BackgroundThread: 9
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
       ID OSID ThreadOBJ    State   GC       Context       Domain   Count APT Exception
      12    1  d10 000dce98   1808220 Disabled 00000000:00000000 000d4008     2 Ukn (Threadpool Worker)
      14    2  9f4 000ede10      b220 Enabled  00000000:00000000 000daa20     0 MTA (Finalizer)
      15    3  6d0 00103ad8    80a220 Enabled  00000000:00000000 000daa20     0 MTA (Threadpool Completion Port)
      16    4 14b0 001074d0      1220 Enabled  00000000:00000000 000daa20     0 Ukn
      10    5  4f8 0012e408   880a220 Enabled  00000000:00000000 000daa20     0 MTA (Threadpool Completion Port)
      18    9 1184 06a32e20    800220 Enabled  00000000:00000000 000daa20     0 Ukn (Threadpool Completion Port)
      26    6 15a0 06ada200   180b220 Enabled  00000000:00000000 000daa20     0 MTA (Threadpool Worker)
      28    b   e0 06a5bd40   880b220 Enabled  00000000:00000000 000daa20     0 MTA (Threadpool Completion Port)
      29    c  fd0 06a3c368   180b220 Enabled  00000000:00000000 000daa20     0 MTA (Threadpool Worker)


Notice how the last column tells you what type of thread it is. You can get the managed stack trace for each one of these threads by selecting which thread you’re interested in from the Processes and Threads window (up on the toolbar) and typing !clrstack, but I find it more useful to just look at all of them straight off: ~*e !clrstack

You’ll see a lot of this:
OS Thread Id: 0x84 (0)
Unable to walk the managed stack. The current thread is likely not a managed thread.
You can run !threads to get a list of managed threads in the process


...because what you're doing is looking at the managed stack trace for all threads (and many of them won't be .Net threads remember), but eventually you’ll see some managed stack traces pop up.

In my case it was a bit of an open-and-shut case, because there was only one .Net thread actually doing anything in the process – kind of a giveaway:

OS Thread Id: 0xd10 (12)
ESP       EIP    
[… trimmed 3rd party vendor stack trace…]
01a9eec4 06cc381c Aspose.Pdf.Pdf.Save(System.IO.Stream)
01a9eed4 069d4a05 MyApp.Server.Utils.WordToPdf.AsposeImpl.Generate(System.IO.MemoryStream)
01a9ef44 0736fa0f MyApp.Server.Document.Implementations.AbstractWordToPdfDocumentGenerationImpl.Generate(...)
01a9ef8c 0736f65f MyApp.Server.Document.GenerationManager.Generate(...)
01a9eff4 0725853d MyApp.Server.ServiceImplementation.DocumentService.GenerateDocumentById(GenerateDocumentByIdRequest)
[… removed ASP.Net framework stack trace …]
01a9f710 65fbe244 System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
01a9f744 65fbde92 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
01a9f750 65fbc567 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
01a9f900 79f2d05b [ContextTransitionFrame: 01a9f900]
01a9f950 79f2d05b [GCFrame: 01a9f950]
01a9faa8 79f2d05b [ComMethodFrame: 01a9faa8]


[NB: I've had to edit that for brevity and wrapping]

Ok, so we’re dying in a 3rd party component. But we’re still going to need to replicate this somehow because (presumably) it doesn’t do this all the time. Fortunately in this case all we need to know is the arguments supplied to our GenerateDocumentById() method to attempt to reproduce in dev. They’re in the dump too, so let’s see what they were.

Select that thread (12) using ~12s (or the Processes and Threads window – the ‘s’ suffix just makes it the current thread) and use !clrstack -p to view the stack with parameters. It’s a big list, but in there somewhere is the bit I’m interested in:

0:012> !clrstack -p
[… blah blah …]
01a9eff4 0725853d MyApp.Server.ServiceImplementation.DocumentService.GenerateDocumentById(GenerateDocumentByIdRequest)
PARAMETERS:
this = 0x02e596b0
generateDocumentByIdRequest = 0x02e595b4


!dumpobj (or !do) then tells me the contents of that generateDocumentByIdRequest instance given its address as shown above (0x02e595b4):

0:012> !do 0x02e595b4
Name: MyApp.Server.ServiceContracts.GenerateDocumentByIdRequest
MethodTable: 05dcb5e0
EEClass: 06f935ec
Size: 12(0xc) bytes
   (C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files\blah blah blah\MyApp.Server.ServiceContracts.DLL)
Fields:
MT Field   Offset    Type VT Attr    Value Name
790fedc4  40001c4  4         System.Int32 0 instance  8229991 documentId


The bit I care about on this is the documentId field itself, and being a value-type field (Int32) I can see the value right there: 8229991 (for a reference type like System.String I'd get another address to the instance itself, ie have to perform another !do).
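
So, purely as a hypothetical illustration, if the request had also contained a string field, the drill would just be to feed the address shown in the Value column into another !do:

!do <address from the Value column>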

And in this case, that was that. Well, it turned out to affect more than just this document ID, but at least I had enough to reproduce it in the dev environment. And just in case anyone read the stack trace carefully: by the time we got to this stage Aspose already had a fix out for this issue, so I can’t really knock them. Finally, whilst I was at it, I had a look at the objects in memory and began to wonder whether I’ve got a memory leak, so there’s probably a follow-up to come.

Other SOS commands to look at:


!threadpool (also, this is where the CPU utilization hides)
!eeheap -gc (see memory allocations that are subject to the GC)
!dumpheap -stat (see object allocations by type – a couple of follow-ups are sketched below)
!help (gives help on the last extension loaded, so SOS commands in our case)
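
And if !dumpheap -stat does suggest a leak, the natural next steps (shown here with placeholder arguments) are to drill into a suspicious type and then ask what’s keeping one of its instances alive:

!dumpheap -mt <MethodTable from the -stat output>
!gcroot <object address from the -mt output>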

Other Resources:


Installing and Configuring IISState

SOS: It’s not just an ABBA Song Anymore: A good introductory article on MSDN with a dreadful title

If Broken It Is… Fix it You Should [Tess’s Blog]: This will all make a lot more sense now

Production Debugging for .NET Framework Applications: This 170 page document from Microsoft’s Patterns and Practices team is allegedly out of date, but is an invaluable reference, and what I based most of this article on

Debugging Microsoft® .NET 2.0 Applications: I haven’t read this MSPress book, but the v1 version got quite a lot of citations when I was researching this stuff




[1] So why did we take it? Because if it turns out it’s an IIS issue, and not your code, then someone will be able to get something out of it. Just not you.
