Debugging a native deadlock in a .NET Linux application
This story begins when one of our integration tests started getting stuck on a PR that seemingly impacted unrelated code. This is a nice excuse to cover some concepts I haven’t touched in my previous articles, such as downloading the .NET symbols on Linux.
Preliminary inspection
The failure was occurring in a Linux test. After a while, I managed to reproduce the issue locally in a Docker container. Usually the first step would be to attach a debugger, but I didn’t want to spend the time to find the commands to inject the debugger into the container. So I started by capturing a memory dump with dotnet-dump:
$ dotnet-dump collect -p <pid> --type Full
Note that a heap-only memory dump (--type Heap) should be enough in theory, but I’ve had so much trouble with those in the past that I now always capture a full dump unless I have a compelling reason not to.
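For reference, the heap-only variant only changes the --type switch:
$ dotnet-dump collect -p <pid> --type Heap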
I then used the dotnet-dump analyze command to open it:
$ dotnet-dump analyze core_20210205_190420
Loading core dump: core_20210205_190420 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
>
The integration test was getting stuck, so I was looking for a deadlock of sorts. I started by using the pstacks command to get an overview of the callstacks.
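The output isn’t reproduced here, but pstacks takes no arguments and groups threads that share similar stacks into a single tree, which is what makes it handy for a quick overview:
> pstacks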
This revealed one thread executing Datadog.Trace.Agent.Api.SendTracesAsync. This method has no synchronization, so there was no obvious reason for it to get stuck. Just to be extra sure it was actually stuck (verifying your assumptions can save you a lot of time in the long run), I captured a second memory dump and confirmed that the same thread was stuck in the same method.
Looking for more information, I decided to have a closer look at the callstack using clrstack. pstacks indicated that the OS thread id was c79; unfortunately, that’s not a value I could use directly. Instead, I ran clrthreads to dump the list of managed threads.
I looked for the one with the value c79 in the OSID column, then took the value of the DBG column (15). Then I could feed that value to the setthread command. An alternative is to convert the OS thread id to decimal (c79 = 3193), then use the -t parameter of the setthread command.
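As a sketch, either form works at the analysis prompt (both 15 and 3193 come from the clrthreads output above):
> clrthreads
> setthread 15
or, using the decimal OS thread id:
> setthread -t 3193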
Then I could run the clrstack command to get the detailed callstack.
This uncovered an interesting piece of data: the thread was not actually stuck on Datadog.Trace.Agent.Api.SendTracesAsync but on a “PrestubMethodFrame” pointing to Datadog.Trace.Agent.Transports.ApiWebRequestFactory.Create. I guessed this was the stub used to trigger the JIT-compilation of the method. To confirm that hypothesis, I checked if the method was already JIT-compiled.
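The quick way to do that is name2ee, which resolves a method by name and reports whether it has a JIT-compiled code address. A sketch, assuming the method lives in the Datadog.Trace.dll module (adjust the module name if needed):
> name2ee Datadog.Trace.dll Datadog.Trace.Agent.Transports.ApiWebRequestFactory.Create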
An alternative would have been to use the dumpmt -md command with the pointer to the method table.
You can fetch the MT using dumpmodule -mt <module address>, and the module address using dumpdomain… Yeah, name2ee is usually faster.
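For completeness, that longer route looks roughly like this, with placeholder addresses:
> dumpdomain
> dumpmodule -mt <module address>
> dumpmt -md <method table address>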
Anyway, both name2ee and dumpmt revealed that the method wasn’t JIT-compiled yet. Given that our product patches the code during JIT compilation, it wasn’t really surprising to me that something could go wrong in that part of the process. In any case, it was time to switch to native debugging to get more information.
One layer deeper
LLDB is the recommended debugger to use with the CLR on Linux, though GDB works fine in most scenarios.
On most Linux distributions, you can get LLDB using your package manager:
$ sudo apt install lldb
If not (for instance, if you’re using CentOS), good luck building it from source.
Even though I didn’t use it for this investigation, I also recommend installing dotnet-sos and running the dotnet-sos install command. This will update your LLDB profile to automatically load SOS, which is very convenient.
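It’s a regular .NET global tool:
$ dotnet tool install --global dotnet-sos
$ dotnet-sos install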
Then you can load the memory dump in LLDB:
$ lldb -c core_20210205_190420
Added Microsoft public symbol server
(lldb) target create --core "core_20210205_190420"
Core file '/temp/core_20210205_190420' (x86_64) was loaded.
I used the thread backtrace command to display the callstack of the first thread.
There’s a lot of noise, but I could still deduce that code from Datadog.Trace.ClrProfiler.Native was waiting for… something in libcoreclr.so.
I then looked for the thread that we spotted in dotnet-dump (the one with thread id c79 / 3193) and inspected its callstack.
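A sketch of the LLDB side: list the threads, find the index of the one with tid 3193 (0x0c79), select it, and dump its backtrace:
(lldb) thread list
(lldb) thread select <index>
(lldb) thread backtrace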
This thread was also executing code in Datadog.Trace.ClrProfiler.Native, and was waiting on a global lock that we use to protect our data structures. So it looked like the first thread was holding the lock and calling some code in libcoreclr, and the other thread was waiting for this lock to be freed. The last missing piece was figuring out why the code in libcoreclr was taking so long.
.NET Core ships without the internal symbols, but they can be downloaded using the dotnet-symbol tool. It’s very simple to use: just indicate the location of the memory dump and the folder to write the symbols to:
$ dotnet-symbol /temp/core_20210205_190420 -o /symbols/
Downloading all the symbols will take a few minutes. You may get an error when downloading the symbols for libcoreclr.so (“Unable to read data from the transport connection: Operation canceled”). That’s because the tool has a hardcoded 4-minute timeout, and the symbol server is very slow when accessed from outside the United States. There is already an issue to track that; until it’s fixed, I suggest you clone the dotnet/symstore GitHub repository and change the value of the timeout in the HttpSymbolStore class.
Then you can use the provided build.sh script to compile your changes, and you’ll find your fixed version of dotnet-symbol in the artifacts/bin/dotnet-symbol/ folder.
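Roughly, assuming the build script sits at the root of the repository:
$ git clone https://github.com/dotnet/symstore
$ cd symstore
$ ./build.sh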
Then you can feed those symbols to LLDB:
$ lldb -O "settings set target.debug-file-search-paths /symbols/" -c core_20210205_190420
Or if, like me, you’re too lazy to remember/find/type this settings set target.debug-file-search-paths thing, just copy all the symbols into the same folder as the memory dump and it’ll work just fine.
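In other words, something like this (with the dump sitting in /temp/, as earlier):
$ cp /symbols/* /temp/
$ lldb -c /temp/core_20210205_190420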
With the symbols, I could get more information about what the first thread was doing.
The thread was calling ProfToEEInterfaceImpl::RequestReJIT, which in turn was calling ThreadSuspend::SuspendRuntime. From there I had the full picture: the first thread was holding the global lock and indirectly calling ThreadSuspend::SuspendRuntime. This method blocks until all the managed threads are suspended. At the same time, the other thread was waiting on the global lock to JIT a method, which caused the deadlock.
The fix was simply to run our ReJIT operations on a dedicated native thread, outside of the global shared lock.
Interestingly, this behavior and the recommended fix are actually documented, so I guess we should just have spent more time reading the documentation.