Tracking a Ruby memory leak in 2021

By Ulysse BUONOMO

We are going to see how you can track a memory leak using recent, performant tools. This article’s goal is to be an up-to-date and as-simple-as-possible reference on the main steps towards tracking a memory leak. If you want to get the most out of it, I’ve added what I think are very useful links throughout the article.

If you still want to enjoy the read and do not have a leak, you can create one.

And if your app is still leaking, or if you found another way around your leak, comment below!

1. Use Jemalloc, or MALLOC_ARENA_MAX=2

DISCLAIMER: If you are not using Jemalloc, please use it. Or at least set MALLOC_ARENA_MAX=2. Come back to this article if your memory still keeps growing.

Another quick note about Jemalloc: sometimes LD_PRELOAD gets overridden, which prevents Jemalloc from being used. We had that exact problem using bin/qgtunnel. You can make sure that Jemalloc is running using MALLOC_CONF=stats_print:true ruby -e '1 + 1' | grep jemalloc.
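As a complementary check from inside a running process, here is a minimal sketch that looks for a jemalloc shared library in the process memory map. It assumes Linux (where /proc/self/maps exists) and that Jemalloc is loaded as a shared library; the method names are hypothetical helpers, not part of any gem.

```ruby
# Returns true when the given memory-map text mentions a jemalloc library.
def jemalloc_mapped?(maps_text)
  maps_text.each_line.any? { |line| line.include?('jemalloc') }
end

# /proc/self/maps only exists on Linux, so return nil when it is unavailable
# (assumption: jemalloc is loaded dynamically, e.g. through LD_PRELOAD).
def jemalloc_loaded_in_this_process?
  maps = '/proc/self/maps'
  return nil unless File.readable?(maps)

  jemalloc_mapped?(File.read(maps))
end

puts "jemalloc loaded: #{jemalloc_loaded_in_this_process?.inspect}"
```

Calling this from a Rails console or an initializer tells you whether the process that actually serves requests got Jemalloc, which is where an overridden LD_PRELOAD would show up.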

2. Is your app really leaking?

Ok, now let’s start talking about real memory leaks. Maybe. If you are reading this article, chances are that you are experiencing OOM issues, or that your app starts swapping at some point and you don’t understand why.

The first thing to do in that case is to ask yourself: what does that memory evolution mean? I know of 3 answers to that question:

  1. My app is too big for my available RAM
  2. I have a memory bloat
  3. I have a memory leak

Fortunately, those 3 answers can be distinguished fairly easily. First, increase your available RAM and then take a look at memory evolution.

NOTE: if you cannot or do not want to do that in production, you can use derailed exec perf:mem_over_time. See Using derailed benchmarks below.

If it keeps growing slowly, you can be fairly sure you have a leak. If it stabilizes at a plateau, your app was simply too big for your available RAM: the plateau gives you an indication of how much your app really consumes, and if it is too much, you will have to drop useless gems (and maybe more). If memory spikes suddenly at some point, you have a bloat.
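To make the three shapes concrete, here is a rough sketch that classifies a series of memory samples (in MB, taken at regular intervals). The method name and both thresholds are hypothetical values picked for illustration; it only means something with a reasonable number of samples, and reading the graph yourself remains the real test.

```ruby
# Classify memory samples (MB) as :stable, :bloat, :plateau or :leak.
# spike_ratio and flat_mb are arbitrary illustration values, not recommendations.
def classify_memory(samples, spike_ratio: 0.5, flat_mb: 5)
  growth = samples.last - samples.first
  return :stable if growth <= flat_mb

  deltas = samples.each_cons(2).map { |before, after| after - before }
  # A single step responsible for most of the growth looks like a bloat.
  return :bloat if deltas.max >= growth * spike_ratio

  # Look at the last third of the run: if it is flat, memory reached a plateau.
  tail = samples.last([samples.size / 3, 2].max)
  (tail.last - tail.first) <= flat_mb ? :plateau : :leak
end

puts classify_memory([100, 150, 190, 220, 240, 250, 252, 253, 253])
puts classify_memory([100, 110, 120, 130, 140, 150, 160, 170, 180])
puts classify_memory([100, 101, 102, 300, 301, 302])
```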

3. Let others find the leak for you

Run bundler-leak against your bundle. If its check shows something other than No leaks found, you are lucky: here’s the leak. Fix it, ship it and let’s go! See rubymem/bundler-leak for more information.

Unfortunately, this tool (and maybe others that I don’t know about) is not enough for two reasons:

  1. Not all leaks are referenced here (hence if you find one, please let @memruby know).
  2. Sorry to say this, but more often than not, the issue is in our codebase, not the lib we use.

4. Finding the leak by yourself

There are already some articles telling you how to do that. Hence I’ll focus on finding the leak locally (not on your production machine) and using existing tools rather than coding them ourselves.

So, once you are sure your issue is a leak, you can start searching for it. This task is harder than searching for a memory bloat, since memory grows slowly over time. You cannot just measure memory before and after a method call to see the evolution: you wouldn’t know whether an object was still there because no GC had occurred yet, or because it was leaking.

Fortunately, there is one really great Ruby tool: ObjectSpace.dump_all (and a bit more), which lets us analyze our application’s memory at a given time. We can take advantage of that to take a few snapshots and then compare them.
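As a minimal sketch of what taking one such snapshot looks like, the snippet below writes a heap dump to a file. ObjectSpace.dump_all and ObjectSpace.trace_object_allocations_start come from Ruby's objspace extension; dump_heap and the file names are hypothetical.

```ruby
require 'objspace'

# Record the file and line of every allocation made from now on,
# so that the dump can tell where each object was created.
ObjectSpace.trace_object_allocations_start

# Writes the current heap to `path`, one JSON document per line.
def dump_heap(path)
  GC.start # collect garbage first so most unreachable objects are left out
  File.open(path, 'w') { |io| ObjectSpace.dump_all(output: io) }
  path
end

# Usage (hypothetical file names), with a few requests between each call:
#   dump_heap('heap_a.json')
#   dump_heap('heap_b.json')
#   dump_heap('heap_c.json')
```

Tracing allocations slows the process down, so this is something to do locally rather than leave enabled everywhere.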

Using derailed benchmarks

Now is a good time to talk about derailed_benchmarks. If you don't know about it yet, it is a tool for benchmarking the performance of Rails (and Rack) applications, for either time or memory issues. You may dig through the README for more information; I'll stick with a simple and essential way here:

  1. gem install derailed_benchmarks
  2. RAILS_ENV=production RACK_ENV=production derailed exec <cmd>

Analyzing memory heaps

The idea here is to generate three consecutive dumps (A, B, and C for instance) separated by a few requests to your application, and then compare them to detect what is retained and what is not. I will not detail the content of those dumps here, just how to exploit it.

So what we are going to do is check which objects are in B and not in A, meaning objects that have been created during the requests received between those two memory snapshots.

Then we’ll keep only the objects that are in both B and C, meaning objects that were present at B and retained afterward.

For instance, if you have three dumps with the following objects in memory:

  A: 0x01
  B: 0x01, 0x02, 0x03
  C: 0x01, 0x03

We can say that our object at 0x01 is a long-term object, part of our usual memory growth at boot. 0x02 is ALLOCATED at B, so if it is big it could be the cause of a bloat. However, it cannot be the cause of our leak, since it has been garbage collected by C. Finally, 0x03 is RETAINED, since it appears after some time (B) and stays there (C): it is a likely cause of our leak.
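The comparison described above boils down to two set operations. Here is a small sketch using the three example addresses; the method names are hypothetical, and it treats an address as an object's identity, which is an approximation since Ruby can reuse a freed slot for a new object between two dumps.

```ruby
require 'set'

# Objects absent from A, present in B, and still present in C.
def retained(a, b, c)
  (b - a) & c
end

# Objects that appeared in B but were gone by C.
def allocated_then_freed(a, b, c)
  (b - a) - c
end

dump_a = Set['0x01']
dump_b = Set['0x01', '0x02', '0x03']
dump_c = Set['0x01', '0x03']

puts "retained: #{retained(dump_a, dump_b, dump_c).to_a.inspect}"
puts "freed:    #{allocated_then_freed(dump_a, dump_b, dump_c).to_a.inspect}"
```

With real dumps, each set would hold the addresses read from one dump file instead of hand-written values.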

To do that on your application, you just need to run derailed exec perf:heap_diff. This will save three dumps for you and run a three-way diff against them. You can then check which retained lines concern your application.

If you get a few lines with a huge size, bingo! It is now a matter of understanding why the code is leaking, and fixing that. Otherwise, consider testing another endpoint by changing PATH_TO_HIT. I suggest always starting with a simple endpoint, as it will tell you whether the leak lies in your whole stack or only in a single endpoint.

5. Wrapping up

It is now time to commit, run derailed exec perf:mem_over_time on the formerly leaking endpoint, and see that nice plateau after a few requests, indicating that you've solved the leak (congrats!).

References