Debugging a native deadlock in a .NET Linux application

This story begins when one of our integration tests started getting stuck on a PR that seemingly impacted unrelated code. It’s a nice excuse to cover some concepts I haven’t touched in my previous articles, such as downloading the .NET symbols on Linux.

Preliminary inspection

The failure was occurring in a Linux test. After a while, I managed to reproduce the issue locally in a Docker container. Usually the first step would be to attach a debugger, but I didn’t want to spend time figuring out how to inject one into the container, so I started by capturing a memory dump with dotnet-dump:

$ dotnet-dump collect -p <pid> --type Full

Note that a heap-only memory dump should be enough in theory, but I’ve had so much trouble with those in the past that I now always capture a full dump unless I have a compelling reason not to.
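For completeness, the dump type is controlled by the --type switch of dotnet-dump collect, so the heap-only variant would look like this (with <pid> replaced by the process id):

```
$ dotnet-dump collect -p <pid> --type Heap
```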

I then used the analyze command to open it:

$ dotnet-dump analyze core_20210205_190420
Loading core dump: core_20210205_190420 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.

The integration test was getting stuck, so I was looking for a deadlock of some kind. I started by dumping the callstacks of all the threads to get an overview:
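As a sketch of that step: in a dotnet-dump analysis session, clrstack -all dumps the managed callstacks of every thread (newer SOS builds also offer parallelstacks for a condensed view):

```
> clrstack -all
```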

This revealed one thread executing a method that performs no synchronization, so there was no obvious reason for it to get stuck. Just to be extra sure it was actually stuck (verifying your assumptions can save you a lot of time in the long run), I captured a second memory dump and confirmed that the same thread was blocked in the same method.

Looking for more information, I decided to have a closer look at the callstack. The output indicated the OS thread id of the stuck thread; unfortunately, that’s not a value I could use directly. Instead, I ran clrthreads to dump the list of managed threads:
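For illustration, a clrthreads listing looks something like this (the values below are made up; the relevant columns are DBG, the debugger thread index, and OSID, the OS thread id in hexadecimal):

```
> clrthreads
ThreadCount:      3
...
       DBG   ID     OSID ThreadOBJ         State GC Mode     GC Alloc Context Domain  Lock Count Apt Exception
        15    2      c79 00007F6BC8021000  ...   Cooperative ...              ...              0 Ukn
```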

I looked for the matching value in the OSID column, then took the value of the DBG column (15). I could then feed that value to the setthread command. An alternative is to convert the OS thread id to decimal (c79 = 3193), then use the -t parameter of the setthread command:
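The hex-to-decimal conversion itself is a shell one-liner:

```shell
# clrthreads prints the OS thread id in hexadecimal; convert it to decimal
printf '%d\n' 0xc79    # prints 3193
```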

Then I could run the clrstack command to get the detailed callstack:
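If the default view doesn’t show where a thread is really parked, clrstack’s -f switch (full mode) also displays the native and runtime frames, which is how stub frames become visible:

```
> clrstack -f
```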

This uncovered an interesting piece of data: the thread was not actually stuck in the method itself, but in a stub frame pointing to it. I guessed this was the stub used to trigger the JIT-compilation of the method. To confirm that hypothesis, I checked whether the method was already JIT-compiled:
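One way to check (the module and method names below are hypothetical) is to resolve the method with name2ee, then dump the resulting MethodDesc with dumpmd; recent SOS builds report the compilation status through an IsJitted field:

```
> name2ee MyAssembly.dll MyNamespace.MyType.MyMethod
> dumpmd <MethodDesc address>
```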

An alternative would have been to use the dumpmt command with the pointer to the method table:
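A sketch of that alternative, assuming you already have the MethodTable address at hand: dumpmt -md lists every method of the type along with its JIT status (NONE for methods that haven’t been compiled yet):

```
> dumpmt -md <MethodTable address>
```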

You can fetch the MT by dumping the module, and the module address by dumping the appdomains… Yeah, the method used above is usually faster.

Anyway, both commands revealed that the method wasn’t JIT-compiled yet. Given that our product patches the code during JIT compilation, it wasn’t really surprising that something could go wrong in that part of the process. In any case, it was time to switch to native debugging to get more information.

One layer deeper

LLDB is the recommended debugger to use with the CLR on Linux, though GDB works fine in most scenarios.

On most Linux distributions, you can get LLDB using your package manager:

$ sudo apt install lldb

If not (for instance, if you’re using CentOS), good luck building it from source.

Even though I didn’t use it for this investigation, I also recommend installing dotnet-sos and running the dotnet-sos install command. This updates your LLDB profile to automatically load SOS, which is very convenient.
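Assuming the .NET SDK is available, the setup is two commands:

```
$ dotnet tool install --global dotnet-sos
$ dotnet-sos install
```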

Then you can load the memory dump in LLDB:

$ lldb -c core_20210205_190420
Added Microsoft public symbol server
(lldb) target create --core "core_20210205_190420"
Core file '/temp/core_20210205_190420' (x86_64) was loaded.

I used the bt (backtrace) command to display the callstack of the first thread:

There’s a lot of noise, but I could still deduce that code from our native library was waiting for… something inside the runtime.

I then looked for the thread that we spotted earlier in dotnet-dump (the one with thread id c79 / 3193) and inspected its callstack:

This thread was also executing code from our native library, waiting on a global lock that we use to protect our data structures. So it looked like the first thread was holding the lock while calling into the runtime, and the other thread was waiting for that lock to be freed. The last missing piece was figuring out why the runtime code invoked by the first thread was taking so long.

.NET Core ships without the internal symbols, but they can be downloaded using the dotnet-symbol tool. It’s very simple to use: just indicate the location of the memory dump and the folder to write the symbols to:

$ dotnet-symbol /temp/core_20210205_190420 -o /symbols/

Downloading all the symbols takes a few minutes. You may get an error while downloading some of them (“Unable to read data from the transport connection: Operation canceled”). That’s because the tool has a hardcoded 4-minute timeout, and the symbol server is very slow when accessed from outside of the United States. There is already an issue tracking that; until it’s fixed, I suggest you clone the github repository of the tool and change the hardcoded timeout value:

Then you can use the provided build script to compile your changes, and you’ll find your fixed version of dotnet-symbol in the output folder.

Then you can feed those symbols to LLDB:

$ lldb -O "settings set target.debug-file-search-paths /symbols/" -c core_20210205_190420

Or, if like me you’re too lazy to remember/find/type this thing, just copy all the symbols into the same folder as the memory dump and it’ll work just fine.

With the symbols, I could get more information about what the first thread was doing:

The symbols showed the thread calling into the runtime’s thread-suspension machinery. From there I had the full picture: the first thread was holding the global lock and indirectly calling a method that blocks until all the managed threads are suspended. At the same time, the other thread was waiting on that same global lock to JIT a method: the suspension could never complete, which caused the deadlock.

The fix was simply to run our ReJIT operations on a dedicated native thread, outside of the global shared lock.

Interestingly, this behavior and the recommended fix are actually documented, so I guess we should just have spent more time reading the documentation.
