Investigating an InvalidProgramException from a memory dump (part 2 of 3)
In this series of articles, we’re retracing how I debugged an
InvalidProgramException, caused by a bug in the Datadog profiler, from a memory dump sent by a customer.
- Part 1: Preliminary exploration
- Part 2: Finding the generated IL
- Part 3: Identifying the error and fixing the bug
Let’s start with a quick reminder. The profiler works by rewriting the IL of interesting methods to inject instrumentation code. The
InvalidProgramException is thrown by the JIT when trying to compile the IL emitted by the profiler, which must be somehow invalid. The first part was about identifying in what method the exception was thrown, and I ended up concluding that
Npgsql.PostgresDatabaseInfo.LoadBackendTypes was the culprit. The second part is going to be about how to find the generated IL for that method.
Finding the generated IL
Npgsql.PostgresDatabaseInfo.LoadBackendTypes is an asynchronous method. The logic of an async method is stored in the
MoveNext method of its state-machine, so that’s the one I was interested in.
There’s a function in
dotnet-dump to display the IL of a method:
dumpil. This command requires the MethodDescriptor (MD) of the target method, so I needed to find it for the MoveNext method.
I started by dumping all the types in the module, using the
dumpmodule -mt command, to find the state-machine:
This gave me the MT (MethodTable) of the state-machine type:
00007fcf16509c10. Then I fed it to the
dumpmt -md command to get the MD:
The command outputs all the methods of the given type, and from there we can spot the MD of our MoveNext method. Unfortunately, the
dumpil command returned the original IL, not the rewritten one. Looking for ideas, I used the
dumpmd command to get more information about the method:
Interestingly, the method was marked as not jitted. In hindsight, that makes sense. We rewrite the method using the
JitCompilationStarted event of the profiler API. The JIT then tries to compile it, fails, and throws away the rewritten IL.
Fun fact for those who know about tiered compilation: you may have noticed in the
dumpmd output that there are two versions of the method, QuickJitted and OptimizedTier1, despite the
IsJitted flag being false. I managed to reproduce this on a test app with a profiler emitting bad IL: after calling the method 30 times, the tiered JIT promotes it to tier 1, even though the method was never jitted successfully.
Dead-end? I really didn’t want to give up after going through the tedious process of finding the method, so I decided to get creative. The same way that I managed to find the
InvalidProgramException on the heap even though it wasn’t referenced anymore, I figured that there could still be traces of the generated IL somewhere.
To give the rewritten IL to the JIT, the profiler uses the
SetILFunctionBody API. What’s interesting about it is that the buffer used to write the IL is provided by the JIT’s own allocator. Quoting the documentation:
Use the ICorProfilerInfo::GetILFunctionBodyAllocator method to allocate space for the new method to ensure that the buffer is compatible.
I have no clue how the
new operator works in C++, so I decided to follow another lead. What happens to that buffer after it’s given to the
SetILFunctionBody method? I’m not going to show the method implementation, but the interesting bit is how it ends up calling
Module::SetDynamicIL. In turn,
SetDynamicIL stores the IL body in an internal table (this time I’m showing the implementation because it’ll be important for later):
fTemporaryOverride is false in this codepath, so
m_debuggerSpecificData.m_pDynamicILBlobTable is used to store the IL. If I could find the address of that table in the memory dump, then maybe I could retrieve the generated IL!
As I showed in a previous article, it’s possible to export all the symbols of a module on Linux by using the
nm command. So I tried looking for
m_debuggerSpecificData, but no luck:
> nm -C libcoreclr.so | grep m_debuggerSpecificData
How could I possibly find this structure without symbols?
I firmly believe that debugging is a creative process. So I took a step back and started thinking. When
Module::SetDynamicIL is called, the runtime is capable, somehow, of locating that structure. So the answer, whatever it is, must be somewhere in the assembly code of that method.
Reading this article makes it sound like an instantaneous process, but locating
m_debuggerSpecificData without symbols was the result of two hours of trial and error, bouncing ideas back and forth with my former coworkers Christophe Nasarre and Gregory Leocadie.
In the process, I also discovered that an ISOSDacInterface7 interface was being implemented for .NET 5, and it has all the facilities I needed to find the dynamic IL. *sigh*
Fortunately, that method is exported in the symbols:
> nm -C libcoreclr.so | grep Module::SetDynamicIL
0000000000543da0 t Module::SetDynamicIL(unsigned int, unsigned long, int)
I used gdb to disassemble it:
OK, that’s a lot to process. Especially if, like me, you’re not that familiar with native disassembly. The trick is to compare it with the original source code (that’s why I posted
SetDynamicIL earlier), and focus exclusively on what you’re looking for.
First, we need to locate the
this parameter. Object-oriented programming does not exist at the assembly level, so the
this pointer that we magically use must be given to the target function somehow. By convention, when calling an instance method,
this is the first argument of the function.
Next, we need to know how arguments are given to the function. Calling Wikipedia to the rescue, we learn that Linux uses the “System V AMD64 ABI” calling convention. In that convention, the first argument of a function is stored in the rdi register.
Now we need some kind of “anchor”. A well-identified point within the function that we can focus on. Right at the beginning of
SetDynamicIL, we find this condition:
This is great because it uses
m_debuggerSpecificData (the field we're looking for), it has a condition, and it calls a method (
InitializeDynamicILCrst). This makes it very easy to spot in the disassembly. Now we know we have to focus on this bit:
this is stored in the
rdi register. This register is copied to rbx.
Then we reuse this register here:
This code reads the memory at the address
rbx+0x568, pushes the contents to the
r14 register, then tests something:
test r14,r14. Testing a register against itself is the assembly way of checking whether a value is zero. That’s our
if (m_debuggerSpecificData.m_pDynamicILCrst == NULL) check! This means that
m_debuggerSpecificData.m_pDynamicILCrst is located at the offset
0x568 from the address of the module instance.
That’s great, but I actually needed
m_debuggerSpecificData.m_pDynamicILBlobTable, not m_pDynamicILCrst. So I had a look at the structure stored in the m_debuggerSpecificData field:
Fields are stored in memory in the same order as they are declared in the code. So
m_pDynamicILBlobTable is the pointer stored right after
m_pDynamicILCrst in the memory.
To test this, I first needed the address of the module containing
LoadBackendTypes. If you scroll all the way back to where I called
dumpmd, you can find it in the output:
I was looking for the content of the memory at the offset
0x568 of the Module, so I added that to the Module address to get
0x7FCF13EC67D8 + 0x568 = 0x7FCF13EC6D40
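Little address computations like this are easy to get wrong, so it’s worth scripting them. Here is a quick sanity check in Python; the `+8` for the next field rests on the assumption stated above, namely that m_pDynamicILBlobTable sits one pointer-size after m_pDynamicILCrst in a 64-bit process:

```python
MODULE_ADDRESS = 0x7FCF13EC67D8  # Module instance, taken from the dumpmd output
CRST_OFFSET = 0x568              # offset of m_pDynamicILCrst, deduced from the disassembly
POINTER_SIZE = 8                 # 64-bit process

crst_address = MODULE_ADDRESS + CRST_OFFSET
# m_pDynamicILBlobTable is declared right after m_pDynamicILCrst,
# so it should sit one pointer further.
blob_table_address = crst_address + POINTER_SIZE

print(hex(crst_address))        # 0x7fcf13ec6d40
print(hex(blob_table_address))  # 0x7fcf13ec6d48
```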
I then used LLDB to dump the memory at that address:
(lldb) memory read --count 4 --size 8 --format x 0x7FCF13EC6D40
0x7fcf13ec6d40: 0x00007fce8c90c820 0x00007fce8c938380
0x7fcf13ec6d50: 0x0000000000000000 0x0000000000000000
Assuming my reasoning was correct,
0x00007fce8c90c820 would be the address of m_pDynamicILCrst, and
0x00007fce8c938380 the address of
m_pDynamicILBlobTable. There was no way to be completely sure, but I could check if the values in memory matched the layout of the table in the source code:
One pointer to a table, followed by 4 integers indicating the size and the occupancy of the table.
I first dumped the pointer:
(lldb) memory read --count 2 --size 8 --format x 0x00007fce8c938380
0x7fce8c938380: 0x00007fce8c938a90 0x000000010000000b
0x00007fce8c938a90 sure looks like a pointer to the heap. Then I checked the integers (using
size 4 instead of
size 8):
(lldb) memory read --count 8 --size 4 --format x 0x00007fce8c938380
0x7fce8c938380: 0x8c938a90 0x00007fce 0x0000000b 0x00000001
0x7fce8c938390: 0x00000001 0x00000008 0x00000045 0x00000000
The pointer was still there (looking backward because of endianness), then a few small values. I mapped everything to the fields of the table and got:
m_table = 0x00007fce8c938a90
m_tableSize = 0x0b
m_tableCount = 0x1
m_tableOccupied = 0x1
m_tableMax = 0x8
Once again, there was no way to be sure, but the values were consistent with the expected layout, and seemed to indicate that there was one element stored in the table!
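Decoding those dwords by hand is error-prone, so the mapping can be scripted with Python’s struct module. The layout (one 8-byte pointer followed by 32-bit counters) is my reading of the table declaration, and the dwords are copied verbatim from the lldb output:

```python
import struct

# The 8 dwords printed by lldb at 0x00007fce8c938380, in output order.
dwords = [0x8C938A90, 0x00007FCE, 0x0000000B, 0x00000001,
          0x00000001, 0x00000008, 0x00000045, 0x00000000]
raw = struct.pack("<8I", *dwords)  # rebuild the little-endian byte stream

# Assumed layout: one 64-bit pointer followed by four 32-bit integers.
m_table, m_tableSize, m_tableCount, m_tableOccupied, m_tableMax = \
    struct.unpack_from("<QIIII", raw)

print(hex(m_table))  # 0x7fce8c938a90
print(m_tableSize, m_tableCount, m_tableOccupied, m_tableMax)  # 11 1 1 8
```

Note how unpacking with `<Q` reassembles the pointer that lldb printed as two reversed-looking dwords.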
m_table is a hashtable associating method tokens and pointers to IL code:
I must confess I had a bit of trouble figuring out the layout of the hashtable from the source code (full of templates and other C++ magic), so I cheated a bit. I knew from the
dumpmd output that my method token was
0x6000D24. So I just dumped a bunch of memory at the memory location of the hashtable and looked for that value:
(lldb) memory read --count 32 --size 4 --format x 0x00007fce8c938a90
0x7fce8c938a90: 0x00000000 0x00007fce 0x00000000 0x00000000
0x7fce8c938aa0: 0x00000000 0x00007fcf 0x00000000 0x00000000
0x7fce8c938ab0: 0x00000000 0x6265645b 0x00000000 0x00000000
0x7fce8c938ac0: 0x00000000 0x6974616c 0x00000000 0x00000000
0x7fce8c938ad0: 0x00000000 0x74636e75 0x00000000 0x00000000
0x7fce8c938ae0: 0x00000000 0x36303437 0x00000000 0x00000000
0x7fce8c938af0: 0x06000d24 0x00000000 0x8c983790 0x00007fce
0x7fce8c938b00: 0x00000000 0x7367704e 0x00000000 0x00000000
It turned out that the value was next to a pointer (
0x00007fce8c983790, again read backward because of endianness), so there was a good chance it pointed to the IL I was looking for!
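That scan can be replayed offline: treat the dumped dwords as raw memory, look for the method token, and read the pointer stored next to it. In this sketch, the 16-byte “token then pointer” entry layout is an assumption inferred from the dump, not the runtime’s actual hashtable entry definition:

```python
import struct

# The 32 dwords printed by lldb at 0x00007fce8c938a90, in output order.
dwords = [
    0x00000000, 0x00007FCE, 0x00000000, 0x00000000,
    0x00000000, 0x00007FCF, 0x00000000, 0x00000000,
    0x00000000, 0x6265645B, 0x00000000, 0x00000000,
    0x00000000, 0x6974616C, 0x00000000, 0x00000000,
    0x00000000, 0x74636E75, 0x00000000, 0x00000000,
    0x00000000, 0x36303437, 0x00000000, 0x00000000,
    0x06000D24, 0x00000000, 0x8C983790, 0x00007FCE,
    0x00000000, 0x7367704E, 0x00000000, 0x00000000,
]
memory = struct.pack("<32I", *dwords)

METHOD_TOKEN = 0x06000D24  # token of the method, from the dumpmd output

def find_il_pointer(buf, token):
    """Scan 8-byte-aligned slots for the token, return the pointer stored after it."""
    for offset in range(0, len(buf) - 16 + 1, 8):
        if struct.unpack_from("<Q", buf, offset)[0] == token:
            return struct.unpack_from("<Q", buf, offset + 8)[0]
    return None

il_pointer = find_il_pointer(memory, METHOD_TOKEN)
print(hex(il_pointer))  # 0x7fce8c983790
```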
How to confirm it? Every IL method has a header, so I decompiled the original
PostgresDatabaseInfo.LoadBackendTypes method with dnSpy to look for a recognizable value. The
LocalVarSig token had a value of 0x11000275.
I then dumped a few bytes at the address I found and looked for the value:
(lldb) memory read --count 8 --size 4 --format x 0x00007fce8c983790
0x7fce8c983790: 0x0002301b 0x000005fe 0x11000275 0x08007b02
0x7fce8c9837a0: 0x020a0400 0x0008037b 0x19060b04 0x0d167936
And sure enough, it matched!
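As a final consistency check, the first 12 bytes at that address decode cleanly as a fat IL method header, which per ECMA-335 is: 2 bytes of flags and header size, 2 bytes of max stack, a 4-byte code size, then the 4-byte LocalVarSig token. A short decoding sketch:

```python
import struct

# The first 4 dwords dumped at the candidate IL address 0x00007fce8c983790.
raw = struct.pack("<4I", 0x0002301B, 0x000005FE, 0x11000275, 0x08007B02)

# Fat header: flags+size (2 bytes), max stack (2), code size (4), LocalVarSig (4).
flags_and_size, max_stack, code_size, local_var_sig = struct.unpack_from("<HHII", raw)

print(hex(flags_and_size & 0x3))  # 0x3 -> CorILMethod_FatFormat
print(max_stack)                  # 2
print(hex(code_size))             # 0x5fe
print(hex(local_var_sig))         # 0x11000275, the LocalVarSig token seen in dnSpy
```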
The next and final step was to dump the IL and try to understand why it was causing the
InvalidProgramException. That will be the subject of the next article.