This one started when I was trying to understand why an integration test was failing, but only on Linux ARM64.
As I had no ARM64 dev environment available, I first tried adding more and more traces and letting the test run in the CI, without much success.
Eventually, I realized this was leading nowhere, and took the time to set up an ARM64 VM to investigate further. After running the test with LLDB (see my previous article to learn how to fetch the symbols for the CLR), I found out that the process was raising two segmentation faults, and the second one caused the crash:
I opened an issue on the dotnet/runtime repository, and David Mason was quick to track it down to a missing null check in
It turns out that the assertion “.NET checks for nullity when doing a virtual call” is not exactly true. In some situations, when checking the type of an object, the runtime assumes the value is not null, then catches the access violation/segmentation fault when trying to dereference the instance. The fault is then converted to a
NullReferenceException. The end result is the same as if .NET was explicitly checking for nullity. So what happened in my test application was:
- I was trying to call a virtual method on a null reference
- This caused a segmentation fault, which was caught by the runtime and converted into a
NullReferenceException. An important point to understand is that the fault/exception occurred during the dispatch of the virtual call, which is not considered managed code.
- The exception was rethrown in the catch block of the method
- When unwinding the stack, it hit a special case in
UnwindManagedExceptionPass1 when the exception originates from native code. That code path was missing a null check and caused the fatal segmentation fault.
I built a custom version of the CLR with the additional null check, and as predicted this fixed the crash. End of story?
Writing a repro
The story could have ended here, but I felt like I was still missing something to get the full picture. .NET isn’t widely used on ARM64, but I figured that if the issue was as simple as “crashes when invoking a virtual method on a null reference”, the bug would have been found much sooner.
To understand the exact conditions for the crash, I decided to try and write a repro. I started with a simple virtual call on a null reference:
Without much surprise, this program didn’t crash. So there was more to the problem than just “a virtual call on a null reference”.
When running the repro program with LLDB, it broke on a segmentation fault:
The segmentation fault occurred in the
Request method, which was expected. But if I tried the same thing in my crashing app, the segfault would happen in a non-managed method:
What was this method? It wasn’t exported in the .NET CLR symbols, yet it wasn’t a managed method either. To get more information, I enabled the perf map generation (by setting the
COMPlus_PerfMapEnable environment variable). The perf map is a simple text file in which the JIT stores the name and address of the methods it compiles. Luckily, I found the address of the mysterious method in there, with the name
I then looked into the code of the CLR to understand what this name was associated with.
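To illustrate how an address can be matched against a perf map, here is a minimal C++ sketch. It assumes the usual perf map layout, one `START SIZE name` entry per line with `START` and `SIZE` written in hexadecimal. The helper names `ParsePerfMapLine` and `FindSymbol` are made up for this example and are not part of any runtime or tool.

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// One perf map entry: "<start> <size> <name>", start and size in hex.
struct PerfMapEntry {
    std::uint64_t start;
    std::uint64_t size;
    std::string name;
};

// Parses a single perf map line; returns nullopt if the line is malformed.
std::optional<PerfMapEntry> ParsePerfMapLine(const std::string& line) {
    std::istringstream in(line);
    PerfMapEntry entry{};
    if (!(in >> std::hex >> entry.start >> entry.size)) return std::nullopt;
    // The rest of the line is the symbol name (it may contain spaces).
    std::getline(in >> std::ws, entry.name);
    if (entry.name.empty()) return std::nullopt;
    return entry;
}

// Returns the name of the first entry whose [start, start + size) range
// contains `address`, or nullopt if no entry matches.
std::optional<std::string> FindSymbol(std::istream& perfMap, std::uint64_t address) {
    std::string line;
    while (std::getline(perfMap, line)) {
        auto entry = ParsePerfMapLine(line);
        if (entry && address >= entry->start && address - entry->start < entry->size) {
            return entry->name;
        }
    }
    return std::nullopt;
}
```

Feeding it a `std::ifstream` opened on the perf map file and the faulting address reported by the debugger would return the name of the enclosing method or stub, if the JIT recorded one.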
The CLR was creating a
Initialize to emit some code:
That method had a specific implementation for ARM64:
The instructions in the comment matched exactly what LLDB was showing me:
But this was different from what I was getting in my repro app:
So it seemed like the crashing app was using a “dispatch stub”, but not my repro app. Did it matter?
Looking back at the method in which the crash happened:
The crash occurred because
pExceptionRecord was null, line 48. If my repro app didn’t crash, it either meant that the method wasn’t called,
pExceptionRecord wasn’t null, or the method exited earlier. I confirmed by setting a breakpoint that the method was called with a null argument. So it would mean that either
pThread was null (which seemed incredibly unlikely), or the return value of
VirtualCallStubManager::FindStubManager was different from
SK_LOOKUP. “SK_DISPATCH”? Whatever that was, it seemed consistent with the dispatch stub mentioned earlier.
Learning about stubs
While looking for more information about what those stubs were, I ended up in the Book of The Runtime. There’s a lot to digest in there, but the bottom line is that whenever a method is called on an interface (and only an interface), a special resolution mechanism is used, named “virtual stub dispatch”. That resolution mechanism uses three types of stubs: lookup stubs, resolve stubs, and the dispatch stub we were looking for.
- The lookup stub just calls the resolver with the right parameters to resolve the address of the target method.
- The dispatch stub is an optimistic dispatch mechanism: it’s hardcoded with the expected implementation type of the interface and the corresponding address of the method. When invoked, it performs a quick type check and jumps to the address. If the type check fails, it falls back to the resolve stub.
- The resolve stub is pretty much a cache. It checks whether the address of the target method is already in the cache. If so, it jumps to it. If not, it calls the resolver and adds the new entry to the cache.
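The relationship between the dispatch stub and the resolve stub can be modeled in a few lines of C++. This is only a conceptual sketch: `TypeId`, `Object`, `Resolve`, `ResolveStub` and `DispatchStub` are invented for the illustration and do not correspond to the actual CLR types, which are emitted as machine code rather than written as classes.

```cpp
#include <string>
#include <unordered_map>

// Stand-ins for runtime concepts; none of these are real CLR types.
using TypeId = int;
using Method = std::string (*)();

struct Object {
    TypeId type;  // plays the role of the method table pointer
};

std::string ClientGetResponse() { return "Client.GetResponse"; }
std::string OtherGetResponse() { return "Other.GetResponse"; }

// Slow path: the "resolver" maps a concrete type to its implementation.
Method Resolve(TypeId type) {
    return type == 1 ? ClientGetResponse : OtherGetResponse;
}

// Resolve stub: a cache in front of the resolver.
struct ResolveStub {
    std::unordered_map<TypeId, Method> cache;
    int misses = 0;  // counts how often the slow resolver was needed

    Method Lookup(TypeId type) {
        auto it = cache.find(type);
        if (it != cache.end()) return it->second;
        ++misses;
        Method target = Resolve(type);
        cache.emplace(type, target);
        return target;
    }
};

// Dispatch stub: hardcoded with one expected type and its target.
struct DispatchStub {
    TypeId expectedType;
    Method implTarget;
    ResolveStub* fallback;

    std::string Invoke(const Object* instance) {
        // The type is read without any null check, so a null instance
        // would fault right here (undefined behavior in this model).
        if (instance->type == expectedType) return implTarget();
        return fallback->Lookup(instance->type)();
    }
};
```

Calling `Invoke` with an object of the expected type never touches the cache, while an object of another type goes through `ResolveStub::Lookup` and only reaches the resolver on the first miss.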
With that information in hand, I modified my repro to use interfaces:
But it still wouldn’t crash. Worse, it wouldn’t even cause a segmentation fault anymore!
After reading a bit more about virtual dispatch stubs, it turns out the type of stub used for a given call site changes over the life of a process. When compiling a method, the JIT emits a lookup stub at the call site. When invoked, that stub will emit both a dispatch stub and a resolve stub, and will backpatch the call site to use the new stubs. The reason the JIT does not directly emit a dispatch/resolve stub is apparently that it’s missing some contextual information that is only available at invocation time.
In my repro app, since I was calling the method only once, a lookup stub was used. I needed to call the method multiple times to make sure the dispatch stub was emitted, and then cause the
And this time it had the expected result:
There was still one thing puzzling me: I needed to call
Request(false) two times for the dispatch stub to be emitted. I was expecting the JIT to emit a lookup stub during compilation, then the lookup stub to emit a dispatch stub during the first invocation. So only one
Request(false) would have been needed. Why did I need two?
I found the answer in a comment in the source code of the resolver (the resolver is the bit of code called by the lookup stub):
Note, if we encounter a method that hasn’t been jitted yet, we will return the prestub, which should cause it to be jitted and we will be able to build the dispatching stub on a later call thru the call site
Therefore, the sequence of events was:
- Request is compiled by the JIT. At this point, contextual information is missing and a lookup stub is emitted.
- First call to
Request(false): when calling
IClient.GetResponse, the lookup stub is invoked. It resolves the call to
Client.GetResponse but notices that this method hasn’t been JITted. Since it doesn’t know the final location of the code, it just returns the address of the prestub. When the prestub is executed, it triggers the JIT compilation of the method.
- Second call to
Request(false): when calling
IClient.GetResponse, the lookup stub is invoked. It resolves the call to
Client.GetResponse and emits a dispatch stub that points to the address of the JITted code.
- Call to
Request(true): the dispatch stub is used for resolution, but the instance is null. This causes a segmentation fault inside of the stub, which in turn will cause the crash when the stack is unwound.
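The sequence above can be simulated with a tiny state machine. This is a simplified model under stated assumptions, not runtime code: `CallSite`, `TargetMethod` and `StubKind` are hypothetical names, and the resolve stub is left out to focus on when the dispatch stub appears.

```cpp
#include <string>

enum class StubKind { Lookup, Dispatch };

// Hypothetical model of the target method and its JIT state.
struct TargetMethod {
    bool jitted = false;
};

// Hypothetical model of an interface call site.
struct CallSite {
    StubKind stub = StubKind::Lookup;  // the JIT initially emits a lookup stub

    // Simulates one call through the site and reports which path served it.
    std::string Call(TargetMethod& target) {
        if (stub == StubKind::Dispatch) return "dispatch stub";
        if (!target.jitted) {
            // The resolver can only return the prestub; running the prestub
            // compiles the method, but the call site is left untouched.
            target.jitted = true;
            return "lookup stub -> prestub";
        }
        // The final code address is known: emit a dispatch stub and
        // backpatch the call site to use it.
        stub = StubKind::Dispatch;
        return "lookup stub -> jitted code";
    }
};
```

With a fresh `TargetMethod`, two calls are needed before the site switches to `StubKind::Dispatch`. If `jitted` is already true before the first call, a single call is enough, which is what the next experiment checks.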
If my analysis was correct, I would need only one call to
Request(false) if I made sure
Client.GetResponse was already JIT-compiled at the moment of the invocation. I confirmed that by making this change to the repro:
And indeed, it crashed with a segmentation fault.
Back to my NullReferenceException
Meanwhile, even though it wasn’t causing a crash anymore with the patched CLR, my integration test was still throwing a
In normal conditions, debugging a null reference is hardly something noteworthy. But for various reasons, such as the remote ARM64 VM or the fact that the issue happened only with our profiler attached, I couldn’t attach the Visual Studio debugger and it was very difficult for me to make any change in the code. All I knew was that the error occurred in this method from the Elasticsearch.net client:
There were a lot of things in there that could be null. And the fact that it came from an external library made things even harder. I needed to figure out a way to debug that issue from LLDB without any code change.
So I ran the application again with LLDB, until the location of the first segmentation fault. This time, thanks to everything I learned above, I knew the segmentation fault was occurring in the dispatch stub, with the now familiar
ldr x13, [x0] instruction:
The comment in the stub implementation indicates:
So by decompiling the stub, I should be able to retrieve the hardcoded
We can see the
ldp x10, x12, [x9] instruction. According to the comments, it loads the
_implTarget values we’re looking for.
ldp is an ARM64 instruction that loads two 64-bit values from the target address (stored in
x9) and stores them in the designated registers (
x10 and x12). The value of
x9 was set by the instruction
adr x9, #0x1c. That means “get the address of the current instruction, add the offset
0x1c, and store it in
x9”. The address of that instruction is
0xffff7d866304, so it stores the value
0xffff7d866304 + 0x1c = 0xffff7d866320 in the register. At
0xffff7d866320, we can see a sequence of values:
Stitched together, it means that the instruction
ldp x10, x12, [x9] stores
x12. At that point I had all the information I needed! I could then inspect the expected MT, and from there see what method was stored at address
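The address arithmetic and the “stitching” step can be written down as a short C++ sketch. It assumes the memory dump is displayed as 32-bit little-endian words, so that each 64-bit value loaded by ldp is made of two consecutive words. The helper names `Adr`, `Stitch` and `Ldp` are invented for this example.

```cpp
#include <cstdint>
#include <utility>

// `adr xN, #offset`: PC-relative address, i.e. the address of the
// adr instruction itself plus the offset.
constexpr std::uint64_t Adr(std::uint64_t instructionAddress, std::int64_t offset) {
    return instructionAddress + static_cast<std::uint64_t>(offset);
}

// Joins two consecutive 32-bit words (little-endian order: low word first)
// into one 64-bit value.
constexpr std::uint64_t Stitch(std::uint32_t low, std::uint32_t high) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// `ldp xA, xB, [xN]`: loads two consecutive 64-bit values, seen here as
// four 32-bit words. The first value goes to xA, the second to xB.
inline std::pair<std::uint64_t, std::uint64_t> Ldp(const std::uint32_t (&words)[4]) {
    return {Stitch(words[0], words[1]), Stitch(words[2], words[3])};
}
```

For instance, `Adr(0xffff7d866304, 0x1c)` evaluates to `0xffff7d866320`, the address computed above. Passing the four words read at that address to `Ldp` gives the two values that end up in x10 and x12.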
From there, I knew the
NullReferenceException occurred when trying to call the getter of the
ApiCall property on an instance of
Nest.CreateIndexResponse. With that information, finding the exact line of failure in the source code of the method was trivial:
Thanks to that, I was able to understand what caused that value to be null and fix the issue. And finally close the pull request.