Finding a Ruby bug with GDB

By Zachary Anker

While I was at Square, a coworker reported a bug to my team, and I’ve been meaning to write it up for a few months. Andrew Lazarus deserves credit as well, since we worked together on debugging this over a few days.

This issue still exists in Ruby 2.5.0, and can be found as far back as Ruby 2.0.0. If you just want a summary of the bug, see ruby-lang#13882.

Someone reported they were seeing old Puma (an HTTP web server) processes hanging around after their app was deployed. Easy enough to confirm: SSH into the server and run ps -fu foo

% 20:42 <zanker:~> ps -fu foo
33333 5105 1 0 Aug15 ? 00:00:05 puma: cluster worker 2: 61264 [foo_aaaaaa]
33333 26644 1 0 Jun29 ? 00:02:12 puma: cluster worker 0: 26639 [foo_bbbbbb]
33333 42796 1 0 Sep07 ? 00:05:43 puma: cluster worker 1: 53755 [foo_dddddd]
33333 59085 1 0 Aug29 ? 00:03:44 puma: cluster worker 2: 76064 [foo_eeeeee]

Truncated for brevity, but multiple Puma workers with different SHAs (the foo_##### part) should never happen, especially not with start times (the STIME column) that are weeks old.

Given the age of the processes, it wasn't feasible to track down exceptions or logs, and we weren't even sure what we wanted had been recorded properly. We started off by trying to replicate it, but deploying a few times yielded no results.

Since the parent process id (PPID column) is 1, we know these are orphaned processes and can be more aggressive in our debugging by attaching GDB and looking around.

Picking one of the processes at random, we run: cd foo_aaaaaa/ && gdb ruby/bin/ruby 5105. By default, GDB was looking for a Ruby binary at a path that no longer existed; passing the path explicitly gets us the symbol data in GDB. Symbols are important since they give us the file/line number in Ruby's C code to look at.

GDB treats our process as a very complicated C program, so to get access to the Ruby state we need to run:

define r_eval
  call (void)rb_p((unsigned long)rb_eval_string_protect("File.open('/tmp/caller.out', 'a') { |f| begin; f.write $arg0; f.write('\n'); rescue => ex; f.write ex.inspect; f.write('\n'); end }",(int*)0))
end

This creates a GDB function which calls into the Ruby VM to eval a string and dump the result to a file, even if we make a typo. Now we can run Ruby code with r_eval and look at /tmp/caller.out to see the end result.
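The Ruby embedded in that define is just an append-to-file wrapper with a rescue, so a typo in the evaluated code lands in the output file instead of blowing up the process. The pattern in plain Ruby looks like this sketch (the record helper and the file path are mine, for illustration only):

```ruby
require "tempfile"

# The same append-or-record-the-error pattern the GDB helper evals:
# write the expression's result to the file; if evaluating the
# expression raises (e.g. a typo), write the exception instead.
def record(path)
  File.open(path, "a") do |f|
    begin
      f.write(yield)
      f.write("\n")
    rescue => ex
      f.write(ex.inspect)
      f.write("\n")
    end
  end
end

out = Tempfile.new("caller").path
record(out) { Thread.list.inspect }  # a working expression
record(out) { Thrd.list }            # a typo: the NameError is recorded
puts File.read(out)
```

The rescue is what makes the helper safe to use against a production process: a bad expression just leaves a NameError in the file.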

Threads are a good place to start to get an idea of the current state of the process. We can run r_eval Thread.list.inspect:

#<Thread:0x007fd99d4a2260@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 sleep_forever>,
#<Thread:0x007fd99d4a2148@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 sleep_forever>

Weird! The process should have at least 5 threads running. We can look at what the thread is currently doing by running r_eval Thread.list.first.send(:caller).inspect:

"eval:1:in `open'", "eval:1:in `block (2 levels) in spawn_thread'",
"foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:106:in `sleep'",
"foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:106:in `wait'",
"foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:106:in `block (2 levels) in spawn_thread'",
"foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:90:in `synchronize'",
"foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:90:in `block in spawn_thread'"

(Rerun it a few times to see if the thread is doing something else.)
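Outside GDB, the same introspection works in plain Ruby: Thread.list shows the live threads, and Thread#backtrace (the public counterpart of the private caller we sent above) shows where each one is parked:

```ruby
# Spawn a thread that parks forever, like Puma's idle workers, then
# inspect it the same way the r_eval calls above do.
t = Thread.new { sleep }
Thread.pass until t.status == "sleep"

p Thread.list.size  # at least 2: the main thread plus t
p t.status          # "sleep"
p t.backtrace.first # the frame sitting in sleep
```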

Checking out the code for thread_pool.rb shows it's waiting on not_empty.wait(mutex), which is Puma sleeping until it gets an HTTP request to process.
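That wait/signal handshake can be sketched in a few lines with Ruby's Mutex and ConditionVariable (the variable names mirror Puma's, but this is a standalone sketch, not Puma's code):

```ruby
# Minimal version of the pool's not_empty.wait(mutex) handshake: a
# worker holds the mutex and sleeps on the ConditionVariable until
# a request is queued and signaled.
mutex     = Mutex.new
not_empty = ConditionVariable.new
todo      = []
results   = Queue.new

worker = Thread.new do
  mutex.synchronize do
    not_empty.wait(mutex) while todo.empty? # parked, like the stuck workers
    results << todo.shift
  end
end

sleep 0.05 # let the worker park in wait
mutex.synchronize do
  todo << :request
  not_empty.signal # wake the worker
end

value = results.pop
p value # => :request
```

Note that wait atomically releases the mutex while sleeping and reacquires it on wakeup, which is why the worker can hold the lock "forever" without blocking the producer.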

Since Puma manages the startup/shutdown lifecycle, we move on to looking at Puma's internal state with r_eval Thread.current[:puma_server].instance_variable_get(:@status).inspect, which tells us it's in the run state. From looking at the code, we see it should be stop if it were trying to shut down.

For thoroughness, we can dump Puma’s thread pool with r_eval Thread.current[:puma_server].instance_variable_get(:@thread_pool).instance_variable_get(:@workers).inspect which leads to:

#<Thread:0x007fd99d4a23a0@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 dead>,
#<Thread:0x007fd99d4a2260@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 sleep_forever>,
#<Thread:0x007fd99d4a2148@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 sleep_forever>,
#<Thread:0x007fd99d4a2008@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 dead>,
#<Thread:0x007fd99d4a1ec8@puma 005@foo_aaaaaa/bundle/ruby/2.4.0/gems/puma-3.9.1/lib/puma/thread_pool.rb:74 dead>

What. Puma has a reaper thread that is responsible for removing dead threads to make room for new ones… except that reaper thread is dead too, since it wasn't in Thread.list.
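This is also why the dead threads showed up in @workers but not in Thread.list: a dead thread vanishes from the live list but lingers anywhere you still hold a reference to it. A quick demonstration:

```ruby
# Dead threads vanish from Thread.list but linger anywhere you keep
# a reference to them, such as Puma's @workers array.
workers = Array.new(3) { Thread.new { :done } }
workers.each(&:join)

p workers.map(&:status)               # [false, false, false]: all dead
p Thread.list.include?(workers.first) # false: only live threads are listed

# A reaper's whole job is to drop them:
workers.reject! { |t| !t.alive? }
p workers # => []
```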

Given everything we've found, it seems likely that Puma was attempting to shut down, but was interrupted partway through. Perhaps by magic, maybe solar flares.

Back to GDB, we can run thread apply all backtrace which gives us the backtrace of all the running threads from the C perspective:

Thread 1 (Thread 0x6f187658c83e (LWP 279882)):
#0 0x00000033d940b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/
#1 0x00007f9cb9847959 in native_cond_wait (cond=<value optimized out>, mutex=<value optimized out>) at thread_pthread.c:343
#2 0x00007f9cb98515b8 in native_sleep (th=0xb18d521b551d, timeout_tv=0x0) at thread_pthread.c:1147
#3 0x00007f9cb9851e34 in rb_thread_terminate_all () at thread.c:494
#4 0x00007f9cb9742466 in ruby_cleanup (ex=6) at eval.c:186
#5 0x00007f9cb9742725 in ruby_run_node (n=0xd368e9e61fd1) at eval.c:300
#6 0x00000000004008eb in main (argc=2, argv=0xe1f78525fd22) at main.c:36

Looks promising! Pulling up Ruby's C code for rb_thread_terminate_all:

while (vm_living_thread_num(vm) > 1) {
    /*
     * Thread exiting routine in thread_start_func_2 notify
     * me when the last sub-thread exit.
     */
    sleeping = 1;
    native_sleep(th, 0);
    sleeping = 0;
}
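In Ruby terms, that loop behaves roughly like the sketch below (the method name is mine): the main thread waits until it's the last one alive, so a sub-thread that never exits keeps it waiting forever.

```ruby
# Rough Ruby equivalent of rb_thread_terminate_all's wait loop. The
# real code sleeps until an exiting thread wakes it; here we just
# poll. If any sub-thread never dies, this loop never returns.
def wait_for_threads_to_exit(threads)
  while threads.count(&:alive?) > 1
    sleep 0.01 # stand-in for native_sleep(th, 0)
  end
end

t = Thread.new { sleep 0.05 } # a well-behaved thread that does exit
wait_for_threads_to_exit([Thread.current, t])
p t.alive? # => false
```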

We now know that at least one thread fails to exit, which creates an infinite loop and prevents the process from exiting gracefully. That doesn’t really help us a lot, because we didn’t think that was possible! As a hail mary, we went back to looking at the Ruby process itself.

By default, app exceptions and logs are reported using a background thread. Since that thread is killed at exit, maybe something got stuck in the queue that can give us better insight.

We'll omit the relevant snippet since it's internal code, but the exception message was ThreadError: can't alloc thread, and the stack trace pointed to ruby-kafka's async_producer.rb file.

Since we’ve not run into that exception before, it’s back to the Ruby C code:

static VALUE
thread_s_new(int argc, VALUE *argv, VALUE klass)
{
    rb_thread_t *th;
    VALUE thread = rb_thread_alloc(klass);

    if (GET_VM()->main_thread->status == THREAD_KILLED)
        rb_raise(rb_eThreadError, "can't alloc thread");

The error only comes up if you call Thread.new while the main VM thread is dead/dying. Makes sense: we don't want to start a new thread during shutdown. Verifying against other orphaned processes shows they contain similar exceptions.

My personal guess was that the ThreadError exception was corrupting the Ruby VM's internal state for a thread, which prevents the thread from cleanly exiting. Since we were unable to replicate it, we had nothing to go on beyond debugging after the fact.

From further digging, we found Ruby provides a THREAD_DEBUG option that gives more low-level debug info on threads… unfortunately, you have to recompile Ruby to get it.

Before we tried deploying a debug version of Ruby, Andrew had the idea of looking into how ensure works. He figured out that the bug can be replicated with:

Thread.new do
  loop do
    begin
      puts "Loop start"
      begin
        sleep
      ensure
        raise "exception in ensure"
      end
    rescue => e
      p e
      sleep 1
    end
  end
end

After that, everything else was easy. We can recompile Ruby locally with THREAD_DEBUG to confirm that rb_threadptr_to_kill is being called. Ruby calls this to mark a thread as exiting, and it looks like:

static void
rb_threadptr_to_kill(rb_thread_t *th)
{
    th->status = THREAD_RUNNABLE;
    th->to_kill = 1;
    th->errinfo = INT2FIX(TAG_FATAL);

errinfo is a VALUE field that contains anything from a meta tag (such as TAG_FATAL) to the exception object if the thread had an exception. The majority of the state handling in threads is keyed off errinfo. We open up GDB on a stuck process, and run:

(gdb) p ruby_current_thread->to_kill
$2 = 1
(gdb) p ruby_current_thread->status
$3 = THREAD_RUNNABLE
(gdb) p ruby_current_thread->errinfo
$4 = 8

(0x08 is the value for nil in Ruby)

All of the stuck threads have the to_kill field set but no errinfo, which confirms the theory that a rescued exception caused the thread to lose its kill state.

Tying everything together, we now know the following has to happen:

  1. The thread needs to contain an infinite loop
  2. The loop needs to include a begin rescue block
  3. Within the begin block it needs another begin with an ensure
  4. An exception has to happen within the embedded ensure, which is then rescued by the top level rescue

The top level rescue causes the Ruby VM to think the thread is "safe" and doesn't need to be killed due to an exception. Unfortunately, this happens during shutdown, which means the Ruby VM forgets that it needs to shut down, the infinite loop continues, and the process is orphaned.

The short-term fix is to ensure the relevant code using ruby-kafka never raises an exception. The long-term fix requires fixing Ruby itself, which was reported as ruby-lang#13882; hopefully it'll be fixed soon.
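The short-term workaround amounts to a pattern like the sketch below (with_safe_cleanup and the lambda are illustrative names, not ruby-kafka's API): guard the body of any ensure that might run during shutdown so nothing can escape it and reach an outer rescue.

```ruby
# Illustrative hardening of an ensure block: anything raised during
# cleanup is logged and swallowed instead of escaping the ensure,
# so it can never reach an outer rescue and clobber the thread's
# kill state during shutdown.
def with_safe_cleanup(cleanup)
  yield
ensure
  begin
    cleanup.call
  rescue => e
    warn "cleanup failed: #{e.inspect}" # never re-raise from ensure
  end
end

result = with_safe_cleanup(-> { raise "boom in cleanup" }) { :work }
p result # => :work, despite the cleanup raising
```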

After the fact, Andrew figured out a better way of doing r_eval that sends output to the GDB process's stdout, avoiding the need to tail a file.

Instead, when you sudo, you run: sudo -iu <user> script -q /dev/null
Then you run GDB with: gdb -ex 'set $tty="'$(tty)'"' -p <pid>
Finally, you define r_eval as:

define r_eval
  eval "call (unsigned long)rb_eval_string_protect(\"File.open('%s', 'a') { |f| begin; f.puts $arg0; rescue => ex; f.puts ex.inspect; end }\",(int*)0)", $tty
end