Investigating an InvalidProgramException from a memory dump (part 2 of 3)
In this series of article, we’re retracing how I debugged an InvalidProgramException
, caused by a bug in the Datadog profiler, from a memory dump sent by a customer.
- Part 1: Preliminary exploration
- Part 2: Finding the generated IL
- Part 3: Identifying the error and fixing the bug
Let’s start with a quick reminder. The profiler works by rewriting the IL of interesting methods to inject instrumentation code. The InvalidProgramException
is thrown by the JIT when trying to compile the IL emitted by the profiler, which must be somehow invalid. The first part was about identifying in what method the exception was thrown, and I ended up concluding that Npgsql.PostgresDatabaseInfo.LoadBackendTypes
was the culprit. The second part is going to be about how to find the generated IL for that method.
Finding the generated IL
Npgsql.PostgresDatabaseInfo.LoadBackendTypes
is an asynchronous method. The logic of an async method is stored in the MoveNext
method of its state-machine, so that’s the one I was interested in.
There’s a function in dotnet-dump
to display the IL of a method: dumpil
. This method requires the MethodDescriptor (MD) of the target method, so I needed to find it for MoveNext
.
I started by dumping all the types in the module, using the dumpmodule -mt
command, to find the state-machine:
This gave me the MT (MethodTable) of the state-machine type: 00007fcf16509c10
. Then I fed it to the dumpmt -md
command to get the MD:
The command outputs all the methods from the given type, and from there we can see that our MD is 00007FCF16509B58
.
Unfortunately, the dumpil
command returned the original IL, not the rewritten one. Looking for ideas, I used the dumpmd
command to get more information about the method:
Interestingly, the method was marked as not jitted. In hindsight, that makes sense. We rewrite the method using the JitCompilationStarted
event of the profiler API. The JIT then tries to compile it, fails, and throws away the rewritten IL.
Fun fact for those who know about tiered compilation: you may have noticed in the
dumpmd
output that there are two versions of the method, QuickJitted and OptimizedTier1, despite theIsJitted
flag being false. I managed to reproduce this on a test app with a profiler emitting bad IL: after calling the method 30 times, the tiered JIT promotes it to tier 1, even though the method was never jitted successfully
Dead-end? I really didn’t want to give up after going through the tedious process of finding the method, so I decided to get creative. The same way that I managed to find the InvalidProgramException
on the heap even though it wasn’t referenced anymore, I figured out that there could still be traces of the generated IL somewhere.
To give the rewritten IL to the JIT, the profiler uses the SetILFunctionBody
API. What’s interesting about it is that the buffer, used to write the IL, is provided by the JIT own’s allocator. Quoting the documentation:
Use the ICorProfilerInfo::GetILFunctionBodyAllocator method to allocate space for the new method to ensure that the buffer is compatible.
Maybe I could find traces of the IL in whatever data structure is used internally by the body allocator? Unfortunately, the allocator is just a call to the new
operator:
I have no clue how the new
operator works in C++, so I decided to follow another lead. What happens to that buffer after it’s given to the SetILFunctionBody
method? I’m not going to show the method implementation, but the interesting bit is how it ends up calling Module::SetDynamicIL
. In turn, SetDynamicIL
stores the IL body in an internal table (this time I’m showing the implementation because it’ll be important for later):
fTemporaryOverride
is false in this codepath, so m_debuggerSpecificData.m_pDynamicILBlobTable
is used to store the IL. If I could find the address of that table in the memory dump, then maybe I could retrieve the generated IL!
As I shown in a previous article, it’s possible to export all the symbols of a module on Linux by using the nm
command. So I tried looking for m_debuggerSpecificData
, but no luck:
> nm -C libcoreclr.so | grep m_debuggerSpecificData
>
How could I possibly find this structure without symbols?
I firmly believe that debugging is a creative process. So I took a step back and started thinking. When Module::SetDynamicIL
is called, the runtime is capable, somehow, of locating that structure. So the answer, whatever it is, must be somewhere in the assembly code of that method.
Reading this article makes it sound like it’s an instantaneous process, but locating
m_debuggerSpecificData
without symbols is the result of 2 hours of trial and error and bouncing ideas back and forth with my former coworkers Christophe Nasarre and Gregory Leocadie
In the process, I also discovered that a ISOSDacInterface7 is being implemented for .NET 5, and it has all the facilities I needed to find the dynamic IL. *sigh*
Fortunately, that method is exported in the symbols:
> nm -C libcoreclr.so | grep Module::SetDynamicIL
0000000000543da0 t Module::SetDynamicIL(unsigned int, unsigned long, int)
I used gdb to decompile it:
OK, that’s a lot to process. Especially if, like me, you’re not that familiar with native disassembly. The trick is to compare it with the original source code (that’s why I posted SetDynamicIL
earlier), and focus exclusively on what you’re looking for.
First, we need to locate the this
parameter. Object-oriented programming does not exist at the assembly level, so the this
pointer that we magically use must be given to the target function somehow. By convention, when calling an instance method, this
is the first argument of the function.
Next, we need to know how arguments are given to the function. Calling Wikipedia to the rescue, we learn that Linux uses the “System V AMD64 ABI” calling convention. In that convention, the first argument of a function is stored in the rdi
register.
Now we need some kind of “anchor”. A well-identified point within the function that we can focus on. Right at the beginning of SetDynamicIL
, we find this condition:
This is great because it uses m_debuggerSpecificData
(the field we're looking for), it has a condition, and it calls a method (InitializeDynamicILCrst
). This makes it very easy to spot in the disassembly. Now we know we have to focus on this bit:
Remember, this
is stored in the rdi
register. This register is copied to rbx
:
Then we reuse this register here:
This code reads the memory at the address rbx+0x568
, pushes the contents to the r14
register, then tests something: test r14,r14
. Testing a register against itself is the assembly way of checking if a value is empty. That’s our if (m_debuggerSpecificData.m_pDynamicILCrst == NULL)
check! This means that m_debuggerSpecificData.m_pDynamicILCrst
is located at the offset 0x568
from the address of the module instance.
That’s great, but I needed m_debuggerSpecificData.m_pDynamicILBlobTable
, not m_debuggerSpecificData.m_pDynamicILCrst
. So I had a look at the structure stored in the m_debuggerSpecificData
field:
Fields are stored in memory in the same order as they are declared in the code. So m_pDynamicILBlobTable
is the pointer stored right after m_pDynamicILCrst
in the memory.
To test this, I first needed the address of the module containing LoadBackendTypes
. If you scroll all the way back to where I called dumpmd
, you can find it in the output:
I was looking for the content of the memory at the offset 0x568
of the Module, so I added that to the Module address to get 0x7FCF13EC67D8 + 0x568 = 0x7FCF13EC6D40
I then used LLDB to dump the memory at that address:
(lldb) memory read --count 4 --size 8 --format x 7FCF13EC6D400x7fcf13ec6d40: 0x00007fce8c90c820 0x00007fce8c938380
0x7fcf13ec6d50: 0x0000000000000000 0x0000000000000000
Assuming my reasoning was correct, 0x00007fce8c90c820
would be the address of m_pDynamicILCrst
, and 0x00007fce8c938380
the address of m_pDynamicILBlobTable
. There was no way to be completely sure, but I could check if the values in memory matched the layout of the table in the source code:
One pointer to a table, followed by 4 integers indicating the size and the occupancy of the table.
I first dumped the pointer:
(lldb) memory read --count 2 --size 8 --format x 0x00007fce8c9383800x7fce8c938380: 0x00007fce8c938a90 0x000000010000000b
0x00007fce8c938a90
sure looks like a pointer to the heap. Then I checked the integers (using size 4
instead of size 8
):
(lldb) memory read --count 8 --size 4 --format x 0x00007fce8c9383800x7fce8c938380: 0x8c938a90 0x00007fce 0x0000000b 0x00000001
0x7fce8c938390: 0x00000001 0x00000008 0x00000045 0x00000000
The pointer was still there (looking backward because of endianness), then a few small values. I mapped everything to the fields of the table and got:
m_table = 0x00007fce8c938a90
m_tableSize = 0x0b
m_tableCount = 0x1
m_tableOccupied = 0x1
m_tableMax = 0x8
Once again, there was no way to be sure, but the values were consistent with the expected layout, and seemed to indicate that there was one element stored in the table!
m_table
is a hashtable associating method tokens and pointers to IL code:
I must confess I had a bit of trouble figuring out the layout of the hashtable from the source code (full of templates and other C++ magic), so I cheated a bit. I knew from the dumpmd
output that my method token was 0x6000D24
. So I just dumped a bunch of memory at the memory location of the hashtable and looked for that value:
(lldb) memory read --count 32 --size 4 --format x 0x00007fce8c938a90
0x7fce8c938a90: 0x00000000 0x00007fce 0x00000000 0x00000000
0x7fce8c938aa0: 0x00000000 0x00007fcf 0x00000000 0x00000000
0x7fce8c938ab0: 0x00000000 0x6265645b 0x00000000 0x00000000
0x7fce8c938ac0: 0x00000000 0x6974616c 0x00000000 0x00000000
0x7fce8c938ad0: 0x00000000 0x74636e75 0x00000000 0x00000000
0x7fce8c938ae0: 0x00000000 0x36303437 0x00000000 0x00000000
0x7fce8c938af0: 0x06000d24 0x00000000 0x8c983790 0x00007fce
0x7fce8c938b00: 0x00000000 0x7367704e 0x00000000 0x00000000
It turned out that the value was next to a pointer (0x00007fce8c983790
, backwards), so there was a good probability that it was pointing to the IL I was looking for!
How to confirm it? Every IL method has a header, so I decompiled the original PostgresDatabaseInfo.LoadBackendTypes
method with dnSpy to look for a remarkable value. The LocalVarSig
token had a value of 0x11000275
:
I then dumped a few bytes at the address I found and looked for the value:
(lldb) memory read --count 8 --size 4 --format x 0x00007fce8c9837900x7fce8c983790: 0x0002301b 0x000005fe 0x11000275 0x08007b02
0x7fce8c9837a0: 0x020a0400 0x0008037b 0x19060b04 0x0d167936
And sure enough, it matched!
The next and final step was to dump the IL and try to understand why it was causing the InvalidProgramException
. That will be the subject of the next article.