This story begins when one of our integration tests got stuck on a PR that seemingly impacted unrelated code. This is a nice excuse to cover some concepts I haven’t touched in my previous articles, such as downloading the .NET symbols on Linux.
The failure was occurring in a Linux test. After a while, I managed to reproduce the issue locally in a Docker container. Usually the first step would be to attach a debugger, but I didn’t want to spend time finding the commands to inject the debugger into the container, so I started by capturing a memory dump with dotnet-dump:
$ dotnet-dump collect -p <pid> --type Full
Note that a heap-only memory dump (--type Heap) should be enough in theory, but I’ve had so much trouble with those in the past that I now always capture a full dump unless I have a compelling reason not to.
I then used the dotnet-dump analyze command to open it:
$ dotnet-dump analyze core_20210205_190420
Loading core dump: core_20210205_190420 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
The integration test was getting stuck, so I was looking for a deadlock of sorts. I started by using the pstacks command to get an overview of the callstacks:
This revealed one thread executing Datadog.Trace.Agent.Api.SendTracesAsync. This method has no synchronization, so there was no obvious reason for it to get stuck. Just to be extra sure it was actually stuck (verifying your assumptions can save you a lot of time in the long run), I captured a second memory dump and confirmed that the same thread was stuck in the same method.
Looking for more information, I decided to have a closer look at that thread’s callstack. pstacks indicated that the OS thread id was c79; unfortunately, that’s not a value I could use directly. Instead, I ran clrthreads to dump the list of managed threads:
I looked for the one with the value c79 in the OSID column, then took the value of the DBG column (15). I could then feed that value to the setthread command. An alternative is to convert the OS thread id to decimal (c79 = 3193), then use the -t parameter of the setthread command.
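If you don’t want to do the hex-to-decimal conversion in your head, the shell can do it for you (c79 being the OS thread id reported by pstacks):

```shell
# Convert the hexadecimal OS thread id reported by SOS to decimal
printf '%d\n' 0xc79
# 3193
```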
Then I could run the clrstack command to get the detailed callstack:
This uncovered an interesting piece of data: the thread was not actually stuck on Datadog.Trace.Agent.Api.SendTracesAsync but on a “PrestubMethodFrame” pointing to Datadog.Trace.Agent.Transports.ApiWebRequestFactory.Create. I guessed this was the stub used to trigger the JIT-compilation of the method. To confirm that hypothesis, I checked if the method was already JIT-compiled:
An alternative would have been to use the dumpmt -md command with the pointer to the method table. You can fetch the MT using dumpmodule -mt <module address>, but name2ee is usually faster.
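For reference, the sequence inside the dotnet-dump analyze session would look roughly like this (the module name is from this investigation, but the exact invocation is illustrative rather than copied from the actual session):

```
> name2ee Datadog.Trace.dll Datadog.Trace.Agent.Transports.ApiWebRequestFactory.Create
> dumpmt -md <method table address>
```

If the method hasn’t been JIT-compiled yet, dumpmt -md shows the method descriptor without a native code address.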
dumpmt revealed that the method wasn’t JIT-compiled yet. Given that our product patches the code during JIT compilation, it wasn’t really surprising to me that something could go wrong in that part of the process. In any case, it was time to switch to native debugging to get more information.
One layer deeper
LLDB is the recommended debugger to use with the CLR on Linux, though GDB works fine in most scenarios.
On most Linux distributions, you can get LLDB using your package manager:
$ sudo apt install lldb
If not (for instance, if you’re using CentOS), good luck building it from source.
Even though I didn’t use it for this investigation, I also recommend installing dotnet-sos and running the dotnet-sos install command. This will update your LLDB profile to automatically load SOS, which is very convenient.
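If SOS doesn’t get loaded automatically (for instance, when running LLDB as a different user than the one that ran dotnet-sos install), you can load it manually from within LLDB. The path below is the default dotnet-sos install location; adjust it to your setup:

```
(lldb) plugin load ~/.dotnet/sos/libsosplugin.so
```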
Then you can load the memory dump in LLDB:
$ lldb -c core_20210205_190420
Added Microsoft public symbol server
(lldb) target create --core "core_20210205_190420"
Core file '/temp/core_20210205_190420' (x86_64) was loaded.
I used the thread backtrace command to display the callstack of the first thread:
There’s a lot of noise, but I could still deduce that code from Datadog.Trace.ClrProfiler.Native was waiting for… something in libcoreclr.
I then looked for the thread that we spotted in dotnet-dump (the one with the thread id c79 / 3193) and inspected its callstack:
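In case you need to do the same, this is roughly how to jump to a given thread in LLDB. Note that the index used by thread select is LLDB’s own numbering, not the OS thread id, so you have to match on the tid column first (the index 15 below is illustrative):

```
(lldb) thread list
(lldb) thread select 15
(lldb) thread backtrace
```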
This thread was also executing code in Datadog.Trace.ClrProfiler.Native, and was waiting on a global lock that we use to protect our data structures. So it looked like the first thread was holding the lock and calling some code in libcoreclr, while the other thread was waiting for this lock to be freed. The last missing piece was figuring out why the code in libcoreclr was taking so long.
.NET Core ships without the internal symbols, but they can be downloaded using the dotnet-symbol tool. It’s very simple to use: just indicate the location of the memory dump and the folder to write the symbols to:
$ dotnet-symbol /temp/core_20210205_190420 -o /symbols/
Downloading all the symbols will take a few minutes. You may get an error when downloading the symbols for libcoreclr.so (“Unable to read data from the transport connection: Operation canceled”). That’s because the tool has a hardcoded 4-minute timeout, and the symbol server is very slow when accessed from outside of the United States. There is already an issue tracking this; until it’s fixed, I suggest you clone the dotnet/symstore GitHub repository and change the hardcoded timeout value.
Then you can use the provided build.sh script to compile your changes and use the resulting fixed version of dotnet-symbol.
Then you can feed those symbols to LLDB:
$ lldb -O "settings set target.debug-file-search-paths /symbols/" -c core_20210205_190420
Or, if like me you’re too lazy to remember/find/type this settings set target.debug-file-search-paths thing, just copy all the symbols into the same folder as the memory dump and it’ll work just fine.
With the symbols, I could get more information about what the first thread was doing:
The thread was calling ProfToEEInterfaceImpl::RequestReJIT, which in turn was calling ThreadSuspend::SuspendRuntime. From there I had the full picture: the first thread was holding the global lock and indirectly calling ThreadSuspend::SuspendRuntime, a method that blocks until all the managed threads are suspended. At the same time, the other thread was waiting on the global lock in order to JIT a method, causing the deadlock.
The fix was simply to run our ReJIT operations on a dedicated native thread, outside of the global shared lock.
Interestingly, this behavior and the recommended fix are actually documented, so I guess we should just have spent more time reading the documentation.