This is the second part of an investigation where I tried to understand why an application was randomly crashing with an AccessViolationException.

If you haven’t read it, you can find part 1 of the investigation here.

As a reminder, here is what we uncovered so far:

  • The server runs Orchard, with the Datadog .NET tracer, and crashes about once or twice per day
  • The crash dump indicated an access violation in method clr!ObjectNative::IsLockHeld, itself called by Orchard.OutputCache.Filters.OutputCacheFilter.Dispose
  • In WinDbg, the !syncblk command failed with an error

Part 2 starts when, as I ran out of easy things to try, I decided to map the assembly code of the IsLockHeld method to the original C++ code to understand exactly where it crashed. But first I wanted to confirm with a new memory dump that the symptoms were identical, and fortunately a new crash occurred by the time I reached that point. …


This is a two parts article. Part two is available here.

Symptoms

To monitor the stability of the Datadog .NET tracer, we have a reliability environment where we continuously run mainstream applications such as Orchard. This story starts when, while preparing a release, we discovered that the latest version of our tracer was crashing the app with the message:

Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FFB62CB0A8D (00007FFB625B0000) with exit code 80131506.

A look at the reliability monitor showed that the app was crashing once or twice per…


In this series of article, we’re retracing how I debugged an InvalidProgramException, caused by a bug in the Datadog profiler, from a memory dump sent by a customer.

In the previous part, we’ve located the bad generated IL, stored in an internal CLR structure. The third part is going to be about understanding what’s wrong with the IL, and finding the root cause.

Identifying the error

There are a lot of things that can be wrong in some IL code, so I wasn’t short of stuff to try. …


In this series of article, we’re retracing how I debugged an InvalidProgramException, caused by a bug in the Datadog profiler, from a memory dump sent by a customer.

Let’s start with a quick reminder. The profiler works by rewriting the IL of interesting methods to inject instrumentation code. The InvalidProgramException is thrown by the JIT when trying to compile the IL emitted by the profiler, which must be somehow invalid. The first part was about identifying in what method the exception was thrown, and I ended up concluding that Npgsql.PostgresDatabaseInfo.LoadBackendTypes was the culprit. …


Datadog automated instrumentation for .NET works by rewriting the IL of interesting methods to emit traces that are then sent to the back-end. This is a complex piece of logic, written using the profiler API, and ridden with corner-cases. And as always with complex code, bugs are bound to happen, and those can be very difficult to diagnose.

As it turns out, we had customer reports of applications throwing InvalidProgramException when using our instrumentation. This exception is thrown when the JIT encounters invalid IL code, most likely emitted by our profiler. The symptoms were always the same: upon starting, the application had a random probability of ending up in a state where the exception would be thrown every time a method in particular was called. When that happened, restarting the application fixed the issue. The affected method would change from one time to the other. The issue was bad enough that the customers felt the need to report it, and rare enough that it couldn’t be reproduced at will. …


The Datadog .NET tracer uses the HTTP protocol to communicate with the agent (usually installed on the same machine) and send traces every second. We originally used HttpClient to handle the communication. However, we ran into some dependencies errors on .NET Framework (it’s a bit complicated and outside of the scope of the article, but it has something to do with us instrumenting the System.Net.HttpClient assembly while at the same time having a dependency on it), and so we decided to switch to HttpWebRequest. …


.NET core startup hooks is a feature I really like, and I had a lot of fun with it in the past. Still, I had yet to find a legitimate use for them, and the opportunity finally came a few days ago.

What are startup hooks?

Let’s start by a quick catch-up, for those who don’t know what startup hooks are. The feature was introduced with .net core 2.2, and allows to execute any arbitrary code in a .net process before the Main entry point has a chance to run. This is done by declaring a DOTNET_STARTUP_HOOKS environment variable, pointing to the assembly you want to inject. The target assembly must declare a StartupHook class outside of any namespace, with a static Initialize method. …


… and find an unexpected bug in the process

We recently run into performance issues, and many elements seemed to point towards the GC verbose events being enabled. To test this, I needed a way to check on live servers whether the events were activated.

Note: the whole article is about .net core on Linux. While the first part (implementation) can be transposed to any OS, I’m not sure the second part could be done on Windows without installing additional tools.

Checking the implementation

The first step is to understand how the runtime decides whether those events are enabled or not. …


As I recently had to compile a list of best practices in C# for Criteo, I figured it would be nice to share it publicly. The goal of this article is to provide a non-exhaustive list of code patterns to avoid, either because they’re risky or because they perform poorly. The list may seem a bit random because it’s out of context, but all the items have been spotted in our code at some point and have caused production issues. Hopefully this will serve as a warning and prevent you from making the same mistakes.

Also note that Criteo web services rely on high-performance code, hence the need to avoid inefficient code patterns. Some of those patterns wouldn’t make a noticeable difference in most applications. …


There has been plenty of talk about ThreadPool starvation in .NET:

What is it about? This is one of the numerous ways asynchronous code can break if you wait synchronously on a task.

Image for post
Image for post
Photo by Tim Gouw on Unsplash

To illustrate that, consider a web server that would execute this code:

You start an asynchronous operation (DoSomethingAsync) then block the current thread. At some point, the asynchronous operation will need a thread to finish executing, so it’ll ask the ThreadPool for a new one. You end up using two threads for an operation that could be done with just one: one waiting actively on the Wait() method call and another one performing the continuation. In most cases this is fine. …

About

Kevin Gosse

Software developer passionate about .NET, performance, and debugging

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store