Investigating an InvalidProgramException from a memory dump (part 1 of 3)
Datadog's automatic instrumentation for .NET works by rewriting the IL of interesting methods to emit traces that are then sent to the backend. This is a complex piece of logic, written using the profiler API, and riddled with corner cases. As always with complex code, bugs are bound to happen, and they can be very difficult to diagnose.
As it turns out, we had customer reports of applications throwing
InvalidProgramException when using our instrumentation. This exception is thrown when the JIT encounters invalid IL code, in this case most likely emitted by our profiler. The symptoms were always the same: upon starting, the application had a random chance of ending up in a state where the exception was thrown every time one particular method was called. When that happened, restarting the application fixed the issue. The affected method changed from one occurrence to the next. The issue was bad enough that customers felt the need to report it, and rare enough that it couldn't be reproduced at will. Yikes.
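To get a feel for the failure mode, here is a minimal, self-contained sketch (not our tracer's code, just an illustration) that feeds the JIT deliberately invalid IL through a DynamicMethod. The JIT only compiles the body on first invocation, which is when the InvalidProgramException surfaces:

```csharp
using System;
using System.Reflection.Emit;

// Build a dynamic method whose IL is deliberately invalid: it promises
// to return an int, but 'ret' executes with an empty evaluation stack.
var method = new DynamicMethod("BrokenMethod", typeof(int), Type.EmptyTypes);
var il = method.GetILGenerator();
il.Emit(OpCodes.Ret); // nothing was pushed before ret -> unbalanced stack

var broken = (Func<int>)method.CreateDelegate(typeof(Func<int>));

bool jitRejected = false;
try
{
    // The body is only JIT-compiled here, on first invocation.
    broken();
}
catch (InvalidProgramException)
{
    jitRejected = true;
}
Console.WriteLine(jitRejected);
```

This also matches the observed symptom: once the JIT has rejected a method's IL, every subsequent call to that method fails the same way until the process is restarted.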
Since we couldn’t reproduce the issue ourselves, I decided to ask for a memory dump, and received one a few weeks later (random issues are random). This was my first time debugging this kind of problem, and it proved to be quite tricky, so I figured it would make a nice subject for an article.
The article ended up being a lot longer than I expected, so I divided it into three parts.
- Part 1: Preliminary exploration
- Part 2: Finding the generated IL
- Part 3: Identifying the error and fixing the bug
The memory dump had been captured on a Linux instance. I opened it with dotnet-dump, running on WSL2. The first step was to find which method was throwing the exception. Usually, this kind of memory dump is captured when the first-chance exception is thrown, and that exception is visible in the last column of the
clrthreads command's output. But I couldn’t find any:
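For reference, if the dump had been captured at throw time, the pending exception would have shown up in the last column, roughly like this (an illustrative session with placeholder addresses, not the customer's dump):

```text
> clrthreads
ThreadCount:      4
...
   ID  OSID  ThreadOBJ         State  ...  Apt  Exception
    1     1  00007F4A2C01D000  20020  ...  MTA  System.InvalidProgramException
```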
I then decided to have a look at the notes sent along with the dump (yeah, I know, I should have started there), and understood why I couldn’t see the exception: the customer confirmed that the issue was occurring and just captured the memory dump at a random point in time. Can’t blame them: I don’t even know how to capture a memory dump on first-chance exceptions on Linux, and it doesn’t seem to be supported by procdump. If somebody from the .NET diagnostics team reads me…
That’s OK though. If no garbage collection had happened since the exception was thrown, it should still be hanging around somewhere on the heap. To find out, I used the
dumpheap -stat command:
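For readers following along at home, the search can be narrowed with the -type filter; an illustrative session (the MT value below is a placeholder) would look like:

```text
> dumpheap -stat -type System.InvalidProgramException
Statistics:
              MT    Count    TotalSize Class Name
00007f8bc12d3e80        3          264 System.InvalidProgramException
```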
Three of them, great. I used the
dumpheap -mt command to get their addresses, with the value from the “MT” column. Then I used the
pe (printexception) command to get the stack trace associated with the exception:
Note: since this is the output from an actual memory dump sent by a customer, I redacted the namespaces containing business code and replaced them with “Customer”.
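To make the steps concrete, here is the general shape of such a session, with placeholder addresses instead of the customer's values:

```text
> dumpheap -mt 00007f8bc12d3e80
         Address               MT     Size
00007f4a2d8e5b20 00007f8bc12d3e80      136
...

> pe 00007f4a2d8e5b20
Exception object: 00007f4a2d8e5b20
Exception type:   System.InvalidProgramException
Message:          Common Language Runtime detected an invalid program.
StackTrace (generated):
    ...
```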
We see that the exception was thrown from
<>c__ indicates a closure, so this was probably a lambda function declared in the
Since I didn’t have the source code, I used the
ip2md command to convert the instruction pointer (second column of the stack trace) to a MethodDesc, then fed it to the
dumpil command to print the raw IL:
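The shape of that two-step lookup, with placeholder addresses and a made-up method name in the redacted “Customer” namespace:

```text
> ip2md 00007f4a2c9a1b2c
MethodDesc:   00007f8bc1a4d5e8
Method Name:  Customer.SomeClass+<>c.<SomeMethod>b__1_0()
...

> dumpil 00007f8bc1a4d5e8
IL_0000: ldarg.0
IL_0001: ...
```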
I’m not going to show the full IL for obvious privacy reasons, but one thing struck me: the code was not calling any method that we instrument, so there was no reason for the IL to have been rewritten (and indeed, there were no traces of rewriting).
Digging deeper, I got a hint at what was happening:
Translated into C#, this is pretty much equivalent to:
Basically, the method was there to log a few things (that I didn’t include in the snippet) and then rethrow the exception. But since they didn’t wrap the exception before rethrowing it (and didn’t use
ExceptionDispatchInfo), the rethrow overwrote the original call stack!
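The difference is easy to demonstrate. In this self-contained sketch (hypothetical method names, not the customer's code), `throw ex;` resets the stack trace to the rethrow site, while ExceptionDispatchInfo.Capture(ex).Throw() preserves the original frames:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.ExceptionServices;

// NoInlining so the frame reliably appears in stack traces.
[MethodImpl(MethodImplOptions.NoInlining)]
static void Throwing() => throw new InvalidOperationException("boom");

bool naiveKeepsFrame = false;
try
{
    try { Throwing(); }
    catch (Exception ex) { throw ex; } // resets the stack trace here
}
catch (Exception ex)
{
    naiveKeepsFrame = ex.StackTrace!.Contains(nameof(Throwing));
}

bool ediKeepsFrame = false;
try
{
    try { Throwing(); }
    catch (Exception ex) { ExceptionDispatchInfo.Capture(ex).Throw(); }
}
catch (Exception ex)
{
    ediKeepsFrame = ex.StackTrace!.Contains(nameof(Throwing));
}

Console.WriteLine($"{naiveKeepsFrame} {ediKeepsFrame}");
```

In the ExceptionDispatchInfo case, the trace also gains an "End of stack trace from previous location" marker, but the original frames survive.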
The caller code would have been something like:
The use of
ContinueWith is confirmed by the presence of
ContinuationResultTaskFromTask`1 in the call stack.
That was really bad for me, because it meant that the original exception could have been thrown anywhere in whatever methods were called by
By looking at the raw IL of the caller method, I figured out that
System.Data.Common.DbConnection::OpenAsync. Since the application used PostgreSQL with the Npgsql library, and since our tracer automatically instruments that library, it made sense that the rewritten method would be somewhere in there. Normally, I could have checked our logs to quickly find which Npgsql method had been rewritten, but the customer hadn’t retrieved them before killing the instance, and waiting for the problem to happen again could have taken weeks (random issues are random). So I decided to bite the bullet and start the painstaking process of cross-checking the source code of Npgsql against the state of the objects in the memory dump, to find the exact place where execution stopped and the exception was originally thrown.
For instance, at some point the method
PostgresDatabaseInfoFactory.Load is called:
There were instances of
PostgresDatabaseInfo on the heap, so I knew this method had run properly. From there, the
LoadBackendTypes method is called, and the result is assigned to the
But when inspecting the instances of
PostgresDatabaseInfo (using the
do (dumpobj) command with the address returned by the
dumpheap -mt command), we can see that the
_types field has no value:
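In the do (dumpobj) output, a null reference shows up as an all-zero value; schematically (placeholder addresses, abbreviated columns):

```text
> do 00007f4a2d911c40
Name:        Npgsql.PostgresDatabaseInfo
...
              MT    Field   Offset   Type   VT   Attr       Value            Name
00007f8bc1b2a3f0  400012a       30   ...     0   instance   0000000000000000 _types
```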
Therefore, the execution stopped somewhere in the
LoadBackendTypes method. That method has calls to
NpgsqlCommand.ExecuteReaderAsync, a method that our tracer instruments, and is therefore a candidate for rewriting.
Great! At this point, I just had to dump the generated IL, find the error, and call it a day. Right? Well, it got more complicated than I anticipated, as we’ll see in the next article.