An unconventional way of investigating a NullReferenceException
The crash
This one started when trying to understand why an integration test was failing only on Linux ARM64.
As I had no ARM64 dev environment available, I first tried adding more and more traces and letting the test run in the CI, without much success.
Eventually, I realized this was leading nowhere, and took the time to set up an ARM64 VM to investigate further. After running the test with LLDB (see my previous article to learn how to fetch the symbols for the CLR), I found out that the process was raising two segmentation faults, and the second one caused the crash:
I opened an issue on the dotnet/runtime repository, and David Mason was quick to track it down to a missing null check in AdjustContextForVirtualStub.
It turns out that the assertion “.NET checks for nullity when doing a virtual call” is not exactly true. In some situations, when checking the type of an object, the runtime assumes the value is not null, then catches the access violation/segmentation fault when trying to dereference the instance. The fault is then converted to a NullReferenceException. The end result is the same as if .NET was explicitly checking for nullity. So what happened in my test application was:
- I was trying to call a virtual method on a null reference.
- This caused a segmentation fault, which was caught by the runtime and converted into a NullReferenceException. An important point to understand is that the fault/exception occurred during the dispatch of the virtual call, which is not considered managed code.
- The exception was rethrown in the catch block of the method.
- When unwinding the stack, it hit a special case in UnwindManagedExceptionPass1 used when the exception originates from native code. That code path was missing a null check and caused the fatal segmentation fault.
I built a custom version of the CLR with the additional null check, and as predicted this fixed the crash. End of story?
Writing a repro
The story could have ended here, but I felt like I was still missing something to get the full picture. .NET isn’t widely used on ARM64, but I figured that if the issue were as simple as “crashes when invoking a virtual method on a null reference”, the bug would have been found much sooner.
To understand the exact conditions for the crash, I decided to try and write a repro. I started with a simple virtual call on a null reference:
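The original snippet isn’t reproduced here, so here is a minimal sketch of what that first attempt may have looked like. The Request method name comes from later in the article; the Client/GetResponse names and the exact structure are assumptions:

```csharp
using System;

public class Client
{
    // A plain virtual method, so the call goes through regular vtable dispatch
    public virtual string GetResponse() => "OK";
}

public class Program
{
    public static void Request(bool isNull)
    {
        Client client = isNull ? null : new Client();

        try
        {
            // Virtual call on a null reference: the runtime converts the
            // resulting segmentation fault into a NullReferenceException
            client.GetResponse();
        }
        catch (NullReferenceException)
        {
            // Rethrow from the catch block, like in the original failing test
            throw;
        }
    }

    public static void Main()
    {
        Request(true);
    }
}
```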
Without much surprise, this program didn’t crash. So there was more to the problem than just “a virtual call on a null reference”.
When running the repro program with LLDB, it broke on a segmentation fault:
The segmentation fault occurred in the Request method, which was expected. But if I tried the same thing in my crashing app, the segfault would happen in a non-managed method:
What was this method? It wasn’t exported in the .NET CLR symbols, yet it wasn’t a managed method either. To get more information, I enabled the perf map generation (by setting the COMPlus_PerfMapEnable environment variable). The perf map is a simple text file in which the JIT stores the name and address of the methods it compiles. Luckily, I found the address of the mysterious method in there, with the name GenerateDispatchStub<GenerateDispatchStub>.
I then looked into the code of the CLR to understand what this name was associated with.
The CLR was creating a DispatchHolder:
Then calling Initialize to emit some code:
That method had a specific implementation for ARM64:
The instructions in the comment matched exactly what LLDB was showing me:
But this was different from what I was getting in my repro app:
So it seemed like the crashing app was using a “dispatch stub”, but not my repro app. Did it matter?
Looking back at the method in which the crash happened:
The crash occurred because pExceptionRecord was null, line 48. If my repro app didn’t crash, it either meant that the method wasn’t called, that pExceptionRecord wasn’t null, or that the method exited earlier. I confirmed by setting a breakpoint that the method was called with a null argument. So it would mean that either pThread was null (which seemed incredibly unlikely), or the return value of VirtualCallStubManager::FindStubManager was different from SK_DISPATCH or SK_LOOKUP. “SK_DISPATCH”? Whatever that was, it seemed consistent with the dispatch stub mentioned earlier.
Learning about stubs
While looking for more information about what those stubs were, I ended up in the Book of The Runtime. There’s a lot to digest in there, but the bottom line is that whenever a method is called on an interface (and only an interface), a special resolution mechanism is used, named “virtual stub dispatch”. That resolution mechanism uses three types of stubs: lookup stubs, resolve stubs, and the dispatch stubs we were looking for (see the sketch after the list below).
- The lookup stub just calls the resolver with the right parameters to resolve the address of the target method.
- The dispatch stub is an optimistic dispatch mechanism: it’s hardcoded with the expected implementation type of the interface and the corresponding address of the method. When invoked, it performs a quick type check and jumps to the address. If the type check fails, it falls back to the resolve stub instead.
- The resolve stub is pretty much a cache. It checks whether the address of the target method is already in the cache. If yes, it jumps to it. If not, it calls the resolver and adds the new entry to the cache.
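To make the relationship between the three stubs more concrete, here is a rough conceptual model written as C#. This is purely illustrative: the real stubs are small pieces of machine code emitted by the CLR, and all the names below (ExpectedType, ImplTarget, Resolver, and so on) are made up for the sketch:

```csharp
using System;
using System.Collections.Generic;

// Illustrative model only: the real stubs are raw machine code, not C#.
static class VirtualStubDispatchSketch
{
    // Hardcoded into the dispatch stub when it is emitted (made-up values)
    static readonly Type ExpectedType = typeof(string);
    static readonly IntPtr ImplTarget = new IntPtr(0x1000);

    // Per-call-site cache used by the resolve stub
    static readonly Dictionary<Type, IntPtr> Cache = new Dictionary<Type, IntPtr>();

    // Full (slow) resolution logic
    static IntPtr Resolver(object instance) => new IntPtr(0x2000);

    // Lookup stub: always defers to the resolver
    static IntPtr LookupStub(object instance) => Resolver(instance);

    // Dispatch stub: optimistic fast path. Note the absence of a null check:
    // reading the type of a null instance is what causes the segmentation fault.
    static IntPtr DispatchStub(object instance)
    {
        if (instance.GetType() == ExpectedType)
            return ImplTarget;

        return ResolveStub(instance); // unexpected type: fall back
    }

    // Resolve stub: essentially a cache in front of the resolver
    static IntPtr ResolveStub(object instance)
    {
        var type = instance.GetType();
        if (!Cache.TryGetValue(type, out var target))
        {
            target = Resolver(instance);
            Cache[type] = target;
        }
        return target;
    }
}
```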
With that information in hand, I modified my repro to use interfaces:
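Again, the original snippet isn’t reproduced here. A plausible version, reusing the IClient/Client/GetResponse names mentioned later in the article (the exact code may have differed), would be:

```csharp
using System;

public interface IClient
{
    string GetResponse();
}

public class Client : IClient
{
    public string GetResponse() => "OK";
}

public class Program
{
    public static void Request(bool isNull)
    {
        // Interface call, so resolution goes through virtual stub dispatch
        IClient client = isNull ? null : new Client();

        try
        {
            client.GetResponse();
        }
        catch (NullReferenceException)
        {
            throw;
        }
    }

    public static void Main()
    {
        Request(true);
    }
}
```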
But it still wouldn’t crash. Worse, it wouldn’t even cause a segmentation fault anymore!
After reading a bit more about virtual dispatch stubs, I learned that the type of stub used for a given call site changes throughout the life of a process. When compiling a method, the JIT emits a lookup stub at the call site. When invoked, that stub will emit both a dispatch stub and a resolve stub, and will backpatch the call site to use the new stubs. The reason the JIT does not directly emit a dispatch/resolve stub is apparently that it’s missing some contextual information that is only available at invocation time.
In my repro app, since I was calling the method only once, a lookup stub was used. I needed to call the method multiple times to make sure the dispatch stub was emitted, and then cause the NullReferenceException:
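In the sketch above, that would mean changing Main to something like this (names and structure are assumed, as before):

```csharp
public static void Main()
{
    // Warm up the call site so the dispatch stub gets emitted...
    Request(false);
    Request(false);

    // ...then trigger the NullReferenceException through the dispatch stub
    Request(true);
}
```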
And this time it had the expected result:
$ ./bin/Release/net5.0/testconsole
Segmentation fault
There was still one thing puzzling me: I needed to call Request(false) two times for the dispatch stub to be emitted. I was expecting the JIT to emit a lookup stub during compilation, then the lookup stub to emit a dispatch stub during the first invocation. So only one Request(false) would have been needed. Why did I need two?
I found the answer in a comment in the source code of the resolver (the resolver is the bit of code called by the lookup stub):
Note, if we encounter a method that hasn’t been jitted yet, we will return the prestub, which should cause it to be jitted and we will be able to build the dispatching stub on a later call thru the call site
Therefore, the sequence of events was:
- Request is compiled by the JIT. At this point, contextual information is missing and a lookup stub is emitted.
- First call to Request(false): when calling IClient.GetResponse, the lookup stub is invoked. It resolves the call to Client.GetResponse but notices that this method hasn’t been JITted. Since it doesn’t know the final location of the code, it just returns the address of the prestub. When the prestub is executed, it triggers the JIT compilation of the method.
- Second call to Request(false): when calling IClient.GetResponse, the lookup stub is invoked. It resolves the call to Client.GetResponse and emits a dispatch stub that points to the address of the JITted code.
- Call to Request(true): the dispatch stub is used for resolution, but the instance is null. This causes a segmentation fault inside the stub, which in turn causes the crash when the stack is unwound.
If my analysis was correct, I would need only one call to Request(false) if I made sure Client.GetResponse was already JIT-compiled at the moment of the invocation. I confirmed that by making this change to the repro:
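The original change isn’t shown here. One way to achieve this, assuming the repro sketched above, is to ask the runtime to compile the method up front with RuntimeHelpers.PrepareMethod; the actual change in the original repro may have used a different approach:

```csharp
using System.Runtime.CompilerServices;

public static void Main()
{
    // Force Client.GetResponse to be JIT-compiled before the first interface call
    RuntimeHelpers.PrepareMethod(
        typeof(Client).GetMethod(nameof(Client.GetResponse)).MethodHandle);

    // A single warm-up call is now enough for the dispatch stub to be emitted
    Request(false);
    Request(true);
}
```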
And indeed, it crashed with a segmentation fault.
Back to my NullReferenceException
Meanwhile, even though it wasn’t causing a crash anymore with the patched CLR, my integration test was still throwing a NullReferenceException somewhere.
Under normal conditions, debugging a null reference is hardly noteworthy. But for various reasons, such as the remote ARM64 VM or the fact that the issue happened only with our profiler attached, I couldn’t attach the Visual Studio debugger, and it was very difficult for me to make any change in the code. All I knew was that the error occurred in this method from the Elasticsearch.net client:
There were a lot of things in there that could be null. And the fact that it came from an external library made things even harder. I needed to figure out a way to debug that issue from LLDB without any code change.
So I ran the application again with LLDB, until the location of the first segmentation fault. This time, thanks to everything I learned above, I knew the segmentation fault was occurring in the dispatch stub, with the now familiar ldr x13, [x0] instruction:
The comment in the stub implementation indicates:
So by decompiling the stub, I should be able to retrieve the hardcoded _expectedMT and _implTarget:
We can see the ldp x10, x12, [x9] instruction. According to the comments, it loads the _expectedMT and _implTarget values we’re looking for. ldp is an ARM instruction that loads two words from the target address (stored in x9) and stores them in the designated registers (x10 and x12). The value of x9 was set by the instruction adr x9, #0x1c. That means “get the address of the current instruction, add the offset 0x1c, and store it in x9”. The address of that instruction is 0xffff7d866304, so it stores the value 0xffff7d866304 + 0x1c = 0xffff7d866320 in the register. At 0xffff7d866320, we can see a sequence of values:
0x7fdd0f50
0xffff
0x80be2a98
0xffff
Stitched together, it means that the instruction ldp x10, x12, [x9] stores 0xffff7fdd0f50 into x10 and 0xffff80be2a98 into x12. At that point I had all the information I needed! I could then inspect the expected MT, and from there see what method was stored at address 0xffff80be2a98:
From there, I knew the NullReferenceException occurred when trying to call the getter of the ApiCall property on an instance of Nest.CreateIndexResponse. With that information, finding the exact line of failure in the source code of the method was trivial:
Thanks to that, I was able to understand what caused that value to be null, fix the issue, and finally close the pull request.