AccessViolation in ObjectNative::IsLockHeld (part 2 of 2)

This is the second part of an investigation where I tried to understand why an application was randomly crashing with an AccessViolationException.

If you haven’t read it, you can find part 1 of the investigation here.

As a reminder, here is what we uncovered so far:

  • The server runs Orchard, with the Datadog .NET tracer, and crashes about once or twice per day
  • The crash dump indicated an access violation in method clr!ObjectNative::IsLockHeld, itself called by Orchard.OutputCache.Filters.OutputCacheFilter.Dispose
  • In WinDbg, the !syncblk command failed with an error

Part 2 starts when, as I ran out of easy things to try, I decided to map the assembly code of the IsLockHeld method to the original C++ code to understand exactly where it crashed. But first I wanted to confirm with a new memory dump that the symptoms were identical, and fortunately a new crash occurred by the time I reached that point.

Into the second memory dump

At first glance, the new crash was identical to the first one: an access violation in clr!ObjectNative::IsLockHeld, trying to dereference the address 0x2B. Just like before, the method was called from OutputCacheFilter.Dipose. However, this time the !syncblk command ran without any error:

One of the locks had no associated thread. After closer inspection of the OutputCacheFilter code, it became apparent that the lock was acquired in the OnActionExecuting event of the ASP.NET MVC framework, and released in the OnResultExecuted event. This could only work if both events were called from the same thread, and ASP.NET MVC does not make any guarantee about that. I concluded that we could end up with orphaned locks, and that would explain why a lock had no associated thread.

Still, that was bad but it didn’t explain the crash. So I decided to focus on the assembly code to understand where the wrong value came from.

The disassembly code, from the beginning of the method to the point where it failed, was:

The source code of IsLockHeld is:

How to proceed? As I mentioned in a previous article, the key is to find some kind of “anchor”, some easily-recognizable part in the assembly and work your way from there. The mov rax, qword ptr [clr!g_pSyncTable (00007ffb`62ef4ac8)] instruction was interesting, because it refers to g_pSyncTable. Unfortunately there was no such symbol in the IsLockHeld method, which probably meant that a method was inlined. So I needed to dig further into the source code. The first method called was GetThreadOwningMonitorLock, what did it look like?

I removed some bits for brevity

That looked really good. The method did indeed use g_pSyncTable. Also, it started with two bit test operations, bits & BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX and bits & BIT_SBLK_IS_HASHCODE, which mapped nicely to the bt ebx, 1Bh and bt ebx, 1Ah instructions. Just to be sure, I confirmed that the value of BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX was 0x08000000, so to test that flag you would need to check the 28th bit (bt ebx, 1Bh). Likewise, BIT_SBLK_IS_HASHCODE was 0x04000000, so the 27th bit (bt ebx, 1Ah).

What else could I use to map to the original source code? The mov rax, qword ptr [clr!g_pSyncTable (00007ffb`62ef4ac8)] instruction was dereferencing g_pSyncTable into rax. Then mov rdx, qword ptr [rax+rcx*8] read the table at index rcx*8. This looked a lot like g_pSyncTable[(int)index]. Further down, test rcx, rcx looked like a null check, so probably if(pThread == NULL).

To confirm it, let’s look up a bit. The line just before the null check was:

Thread* pThread = psb->GetMonitor()->m_HoldingThread;

GetMonitor() returns m_Monitor, which is the first field of the syncblock (therefore located at the base address of the syncblock). m_Monitor is an AwareLock with the following structure:

Both LONG and ULONG are 4 bytes wide, so m_holdingThread would be at offset 8, which was consistent with the mov rcx, qword ptr [rdx+8] instruction. So at this point it seemed reasonable to assume that the code was failing at the *pThreadId = pThread->GetThreadId(); instruction. GetThreadId returned the m_ThreadId field of the thread. Given that the crash occurred when dereferencing mov ebx, dword ptr [rcx+2Ch], I could have the definite confirmation if m_ThreadId was located at offset 2C of the Thread class. The layout of the Thread class is:

Let’s count:

  • ThreadState is an enum (4 bytes)
  • ULONG : 4 bytes
  • PTR_* objects are pointer to an object, so 8 bytes each (64 bits process)
  • DWORD : 4 bytes

4 + 4 + (3 * 8) + 4 = 36

Offset 2C would be 44 in decimal, so we’re missing 8 bytes! Did I miss something?

Actually, yes: the signature of the Thread class.

Thread inherits from IUnknown, which declares virtual methods. It means that Thread has a vtable, and the pointer of the vtable is stored at the beginning of the objects. That’s the missing 8 bytes!

Now if we take the vtable into account: 8 + 4 + 4 + (3 * 8) + 4 = 44 = 2C

So I could conclude that the crash occured at the *pThreadId = pThread->GetThreadId(); line, which meant that for some reason pThread was 0xffffffff. Could I confirm that from the memory dump?

Following my previous analysis, mov rdx, qword ptr [rax+rcx*8] was reading the syncblock from the synctable and storing the address into rdx. Since the register was untouched past this point, I could expect that the value would still be there.

Inspecting the register gave me the value 0000020c6dd4ec38. I then dumped the value of all syncblocks and…

Score! The syncblock at index 1069 had the same address, 0000020c6dd4ec38. As mentioned earlier, the first field of a syncblock is an AwareLock with the following structure:

So if we dump the memory at the address 0000020c6dd4ec38, we can map the values accordingly:

  • m_MonitorHeld = 1
  • m_Recursion = 1
  • m_HoldingThread = ffffffff`ffffffff

At this point there was no doubt left that the thread address associated to the syncblock was ffffffff`ffffffff, and this caused the crash. The next step was finding how this value ended up there.

ffffffff`ffffffff is a very remarkable value, equal to -1. So I scanned through the CLR source code to find where the -1 value was assigned to the m_HoldingThread field of the syncblock. Luckily there was only one place, in ObjHeader::GetSyncBlock(). This piece of code gets executed when a thin-lock is promoted to a syncblock (you can refer to part 1 of the article for a refresher on thin locks) :

The The lock is orphaned comment was very encouraging, as I already knew that Orchard was orphaning locks. But not all orphaned lock cause a crash, so I was still missing a piece of the puzzle.

The thread address came from g_pThinLockThreadIdDispenser->IdToThreadWithValidation. Upon further inspection, it looked like a table to convert some kind of thread ids to addresses. What kind of ids?

If you remember from part 1, thin locks store the information in the object header. Headers are 32 bits wide, and contain only 26 bits of usable space for the thin lock. Monitors need to keep track of the owning thread, for instance to allow recursive access. How to store that information in 26 bits? That’s not enough for the thread address, so instead the runtime stores the managed thread id and maintains a table to map that id back to the address of the thread object. That table is g_pThinLockThreadIdDispenser.

Back to our ObjHeader::GetSyncBlock() method, it meant that -1 was used as a fallback value for the thread address when the thread id couldn’t be found in the table. So the last missing bit was to figure out when ids are removed from that table.

After more time spent reading the CLR code, I concluded that the table was cleaned by the finalizer of the thread object. At that point, I had all the pieces of the puzzle and could write a small repro:

The program starts by creating a bunch of objects, then starts threads to lock on those objects. At that point, a thin lock is used and the thread ids are added to the g_pThinLockThreadIdDispenser table. Then the GC is triggered, which causes the thread finalizers to run, which in turn will remove the thread ids from the table. Line 27, a loop call GetHashCode on the objects, forcing the thin locks to be promoted to syncblocks. While doing so, the runtime will try to fetch the thread addresses from the table, but won’t find them since they were removed by the finalizers. Instead, the runtime will store the -1 fallback value in the syncblocks. Finally, the loop line 33 starts new threads and make them inspect the state of the lock. This causes the runtime to dereference the thread addresses stored in the syncblocks (that is, -1) and triggers the AccessViolationException.

Wrapping things up

This is a runtime bug, that has probably been around for a very long time. I’ve reported it on Github, so it may be fixed in time for .NET 6.0. Still, there were a few questions left:

  1. If orphaned locks could trigger a crash, how come the Orchard maintainers never noticed it?
  2. At Datadog we have a test server running our instrumentation code, and a control server running without. If that was a runtime bug, why did we detect it on the test server but not on the control server?

The reason for 1/ was simply that we weren’t running vanilla Orchard. The locking scheme used by Orchard relies on the assumption that ASP.NET MVC will execute the OnActionExecuting and OnResultExecuted events in the same thread. This is actually true when using synchronous controllers. But to better test our code, we added some asynchronous workload to Orchard, and therefore switched to asynchronous controllers to follow best practices. This broke the assumption.

We were just following best practices :(

For 2/, I must commend the intuition of OzCode’s Omer Raviv, who sensed immediately after reading part 1 that something was fishy:

Remember that the crash is caused by some error condition that has a probability of happening every time an orphaned lock is promoted. The locks in question are used by Orchard to update its internal cache. Orchard has a finite number of caches, and therefore a finite number of locks. Once all of them are orphaned, the conditions for the crash will never happen again.
Put another way: every time the application restarted, it had some probability of crashing in the following minutes. Past that window, the crash could not happen again because the locks for all the caches would have been orphaned. The test server was redeployed every time a new change was merged in our GitHub repository, that is a few times per day. On the other hand, the control server was almost never restarted. So what caused the test server to crash wasn’t the fact that our code was running on it, but simply that it was restarted more often.

Software developer passionate about .NET, performance, and debugging