Investigating an InvalidProgramException from a memory dump (part 1 of 3)

Datadog automated instrumentation for .NET works by rewriting the IL of interesting methods to emit traces that are then sent to the back-end. This is a complex piece of logic, written using the profiler API, and ridden with corner-cases. And as always with complex code, bugs are bound to happen, and those can be very difficult to diagnose.

As it turns out, we had customer reports of applications throwing InvalidProgramException when using our instrumentation. This exception is thrown when the JIT encounters invalid IL code, most likely emitted by our profiler. The symptoms were always the same: upon starting, the application had a random probability of ending up in a state where the exception would be thrown every time a method in particular was called. When that happened, restarting the application fixed the issue. The affected method would change from one time to the other. The issue was bad enough that the customers felt the need to report it, and rare enough that it couldn’t be reproduced at will. Yikes.

Since we couldn’t reproduce the issue ourselves, I decided to ask for a memory dump, and received one a few weeks later (random issues are random). This was my first time debugging this kind of problem, and it proved to be quite dodgy, so I figured out it would make a nice subject for an article.

The article ended up being a lot longer than I thought, so I divided it into 3 different parts.

Preliminary exploration

I then decided to have a look at the notes sent along with the dump (yeah, I know, I should have started there), and understood why I couldn’t see the exception: the customer confirmed that the issue was occurring and just captured the memory dump at a random point in time. Can’t blame them: I don’t even know how to capture a memory dump on first-chance exceptions on Linux, and it doesn’t seem to be supported by procdump. If somebody from the .NET diagnostics team reads me…

That’s OK though. If no garbage collection happened since the exception was thrown, it should still be hanging somewhere around on the heap. To find out, I used the dumpheap -stat command:

Three of them, great. I used the dumpheap -mt command to get their address, with the value from the “MT” column. Then I used the printexception (pe) command to get the stacktrace associated to the exception:

Note: Since this is the output from an actual memory dump sent by a customer, I redacted the namespaces containing business code and replaced by “Customer”

We see that the exception was thrown from Customer.DataAccess.Generic.AsyncDataAccessBase`2+<>c__DisplayClass1_0.<CreateConnectionAsync>b__0. The <>c__ indicates a closure, so this was probably a lambda function declared in the CreateConnectionAsync method.

Since I didn’t have the source code, I used theip2md command to convert the instruction pointer (second column of the stacktrace) to a MethodDescriptor, then fed it to the dumpil command to print the raw IL:

I’m not going to show the full IL for obvious privacy reasons, but one thing struck me: the code was not calling any method that we instrument so there was no reason the IL would have been rewritten (and it turns out there was no traces of rewriting).

Digging deeper, I got a hint at what was happening:

Translated into C#, this is pretty much equivalent to:

Basically, the method was there to log a few things (that I didn’t include in the snippet) then rethrow the exception. But since they didn’t wrap the exception before rethrowing it (and didn’t use ExceptionDispatchInfo), it overwrote the original callstack!

The caller code would have been something like:

The usage of ContinueWith is confirmed by the presence of ContinuationResultTaskFromTask`1 in the callstack.

That was really bad for me, because it meant that the original exception could have been thrown anywhere in whatever methods were called by SomethingAsync.

By looking at the raw IL of the caller method, I figured out that SomethingAsync was System.Data.Common.DbConnection::OpenAsync. Since the application used PostgreSQL with the Npgsql library, and since our tracer automatically instruments that library, it made sense that the rewritten method would be somewhere in there. Normally, I could have checked our logs to quickly find what Npgsql method was rewritten, but the customer didn’t retrieve them before killing the instance, and waiting for the problem to happen again could have taken weeks (random issue is random). So I decided to bite the bullet and start the painstaking process of cross-checking the source code of Npgsql with the state of the objects in the memory dump to find the exact place where the execution stopped and the exception was originally thrown.

For instance, at some point the method PostgresDatabaseInfoFactory.Load is called:

There were instances of PostgresDatabaseInfo on the heap, so I knew this method ran properly. From there, the LoadBackendTypes method is called, and the result is assigned to the _types field:

But when inspecting the instances of PostgresDatabaseInfo (using the dumpobj or do command with the address returned by the dumpheap -mt command), we can see that the _types field has no value:

Therefore, the execution stopped somewhere in the LoadBackendTypes method. That method has calls to NpgsqlCommand.ExecuteReader and NpgsqlCommand.ExecuteReaderAsync, which are two method that our tracer instruments, and therefore is a candidate for rewriting.

Great! At this point I just had to dump the generated IL, find the error, and call it a day. Right? Well it got more complicated than I anticipated, as we’ll see in the next article.

Written by

Software developer passionate about .NET, performance, and debugging

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store