This is the second part of an investigation where I tried to understand why an application was randomly crashing with an
If you haven’t read it, you can find part 1 of the investigation here.
As a reminder, here is what we uncovered so far:
- The server runs Orchard, with the Datadog .NET tracer, and crashes about once or twice per day
- The crash dump indicated an access violation in method
clr!ObjectNative::IsLockHeld, itself called by
- In WinDbg, the
!syncblkcommand failed with an error
Part 2 starts when, as I ran out of easy things to try, I decided to map the assembly code of the
IsLockHeld method to the original C++ code to understand exactly where it crashed. But first I wanted to confirm with a new memory dump that the symptoms were identical, and fortunately a new crash occurred by the time I reached that point.
Into the second memory dump
At first glance, the new crash was identical to the first one: an access violation in
clr!ObjectNative::IsLockHeld, trying to dereference the address
0x2B. Just like before, the method was called from
OutputCacheFilter.Dipose. However, this time the
!syncblk command ran without any error:
One of the locks had no associated thread. After closer inspection of the
OutputCacheFilter code, it became apparent that the lock was acquired in the
OnActionExecuting event of the ASP.NET MVC framework, and released in the
OnResultExecuted event. This could only work if both events were called from the same thread, and ASP.NET MVC does not make any guarantee about that. I concluded that we could end up with orphaned locks, and that would explain why a lock had no associated thread.
Still, that was bad but it didn’t explain the crash. So I decided to focus on the assembly code to understand where the wrong value came from.
The disassembly code, from the beginning of the method to the point where it failed, was:
The source code of
How to proceed? As I mentioned in a previous article, the key is to find some kind of “anchor”, some easily-recognizable part in the assembly and work your way from there. The
mov rax, qword ptr [clr!g_pSyncTable (00007ffb`62ef4ac8)] instruction was interesting, because it refers to
g_pSyncTable. Unfortunately there was no such symbol in the
IsLockHeld method, which probably meant that a method was inlined. So I needed to dig further into the source code. The first method called was
GetThreadOwningMonitorLock, what did it look like?
That looked really good. The method did indeed use
g_pSyncTable. Also, it started with two bit test operations,
bits & BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX and
bits & BIT_SBLK_IS_HASHCODE, which mapped nicely to the
bt ebx, 1Bh and
bt ebx, 1Ah instructions. Just to be sure, I confirmed that the value of
0x08000000, so to test that flag you would need to check the 28th bit (
bt ebx, 1Bh). Likewise,
0x04000000, so the 27th bit (
bt ebx, 1Ah).
What else could I use to map to the original source code? The
mov rax, qword ptr [clr!g_pSyncTable (00007ffb`62ef4ac8)] instruction was dereferencing
mov rdx, qword ptr [rax+rcx*8] read the table at index
rcx*8. This looked a lot like
g_pSyncTable[(int)index]. Further down,
test rcx, rcx looked like a null check, so probably
if(pThread == NULL).
To confirm it, let’s look up a bit. The line just before the null check was:
Thread* pThread = psb->GetMonitor()->m_HoldingThread;
m_Monitor, which is the first field of the syncblock (therefore located at the base address of the syncblock).
m_Monitor is an
AwareLock with the following structure:
ULONG are 4 bytes wide, so
m_holdingThread would be at offset 8, which was consistent with the
mov rcx, qword ptr [rdx+8] instruction. So at this point it seemed reasonable to assume that the code was failing at the
*pThreadId = pThread->GetThreadId(); instruction.
GetThreadId returned the
m_ThreadId field of the thread. Given that the crash occurred when dereferencing
mov ebx, dword ptr [rcx+2Ch], I could have the definite confirmation if
m_ThreadId was located at offset
2C of the
Thread class. The layout of the
Thread class is:
ThreadStateis an enum (4 bytes)
ULONG: 4 bytes
PTR_*objects are pointer to an object, so 8 bytes each (64 bits process)
DWORD: 4 bytes
4 + 4 + (3 * 8) + 4 = 36
Offset 2C would be 44 in decimal, so we’re missing 8 bytes! Did I miss something?
Actually, yes: the signature of the
Thread inherits from
IUnknown, which declares virtual methods. It means that
Thread has a vtable, and the pointer of the vtable is stored at the beginning of the objects. That’s the missing 8 bytes!
Now if we take the vtable into account: 8 + 4 + 4 + (3 * 8) + 4 = 44 = 2C
So I could conclude that the crash occured at the
*pThreadId = pThread->GetThreadId(); line, which meant that for some reason
0xffffffff. Could I confirm that from the memory dump?
Following my previous analysis,
mov rdx, qword ptr [rax+rcx*8] was reading the syncblock from the synctable and storing the address into
rdx. Since the register was untouched past this point, I could expect that the value would still be there.
Inspecting the register gave me the value
0000020c6dd4ec38. I then dumped the value of all syncblocks and…
Score! The syncblock at index 1069 had the same address,
0000020c6dd4ec38. As mentioned earlier, the first field of a syncblock is an
AwareLock with the following structure:
So if we dump the memory at the address
0000020c6dd4ec38, we can map the values accordingly:
At this point there was no doubt left that the thread address associated to the syncblock was
ffffffff`ffffffff, and this caused the crash. The next step was finding how this value ended up there.
ffffffff`ffffffff is a very remarkable value, equal to
-1. So I scanned through the CLR source code to find where the
-1 value was assigned to the
m_HoldingThread field of the syncblock. Luckily there was only one place, in
ObjHeader::GetSyncBlock(). This piece of code gets executed when a thin-lock is promoted to a syncblock (you can refer to part 1 of the article for a refresher on thin locks) :
The lock is orphaned comment was very encouraging, as I already knew that Orchard was orphaning locks. But not all orphaned lock cause a crash, so I was still missing a piece of the puzzle.
The thread address came from
g_pThinLockThreadIdDispenser->IdToThreadWithValidation. Upon further inspection, it looked like a table to convert some kind of thread ids to addresses. What kind of ids?
If you remember from part 1, thin locks store the information in the object header. Headers are 32 bits wide, and contain only 26 bits of usable space for the thin lock. Monitors need to keep track of the owning thread, for instance to allow recursive access. How to store that information in 26 bits? That’s not enough for the thread address, so instead the runtime stores the managed thread id and maintains a table to map that id back to the address of the thread object. That table is
Back to our
ObjHeader::GetSyncBlock() method, it meant that
-1 was used as a fallback value for the thread address when the thread id couldn’t be found in the table. So the last missing bit was to figure out when ids are removed from that table.
After more time spent reading the CLR code, I concluded that the table was cleaned by the finalizer of the thread object. At that point, I had all the pieces of the puzzle and could write a small repro:
The program starts by creating a bunch of objects, then starts threads to lock on those objects. At that point, a thin lock is used and the thread ids are added to the
g_pThinLockThreadIdDispenser table. Then the GC is triggered, which causes the thread finalizers to run, which in turn will remove the thread ids from the table. Line 27, a loop call
GetHashCode on the objects, forcing the thin locks to be promoted to syncblocks. While doing so, the runtime will try to fetch the thread addresses from the table, but won’t find them since they were removed by the finalizers. Instead, the runtime will store the
-1 fallback value in the syncblocks. Finally, the loop line 33 starts new threads and make them inspect the state of the lock. This causes the runtime to dereference the thread addresses stored in the syncblocks (that is,
-1) and triggers the AccessViolationException.
Wrapping things up
This is a runtime bug, that has probably been around for a very long time. I’ve reported it on Github, so it may be fixed in time for .NET 6.0. Still, there were a few questions left:
- If orphaned locks could trigger a crash, how come the Orchard maintainers never noticed it?
- At Datadog we have a test server running our instrumentation code, and a control server running without. If that was a runtime bug, why did we detect it on the test server but not on the control server?
The reason for 1/ was simply that we weren’t running vanilla Orchard. The locking scheme used by Orchard relies on the assumption that ASP.NET MVC will execute the
OnResultExecuted events in the same thread. This is actually true when using synchronous controllers. But to better test our code, we added some asynchronous workload to Orchard, and therefore switched to asynchronous controllers to follow best practices. This broke the assumption.
For 2/, I must commend the intuition of OzCode’s Omer Raviv, who sensed immediately after reading part 1 that something was fishy:
Remember that the crash is caused by some error condition that has a probability of happening every time an orphaned lock is promoted. The locks in question are used by Orchard to update its internal cache. Orchard has a finite number of caches, and therefore a finite number of locks. Once all of them are orphaned, the conditions for the crash will never happen again.
Put another way: every time the application restarted, it had some probability of crashing in the following minutes. Past that window, the crash could not happen again because the locks for all the caches would have been orphaned. The test server was redeployed every time a new change was merged in our GitHub repository, that is a few times per day. On the other hand, the control server was almost never restarted. So what caused the test server to crash wasn’t the fact that our code was running on it, but simply that it was restarted more often.