I support multiple clients in the chip design industry. Most tools in this industry are Linux based, and Linux is an OS I love to support. Most problem reports we get are the same old stuff! Recently, however, a user reported a fairly interesting issue: he runs a scientific simulator app, and his VNC session crashes. That’s it! The simulator allocates around 30GB of RAM, then crashes! Let’s run that app to reproduce the issue.
Running the app takes some time; it allocated roughly 30 gigs of RAM, and sure enough, as reported, not only did it crash, it took VNC down with it!
This was pretty repeatable. My first thought was that the server was out of RAM. I suspected the OOM killer was on the loose, but nope! No logs contained any trace of running out of RAM, and the server still had plenty of free memory. Thinking it might be a TigerVNC bug, I ssh’ed into the server, switched to the user’s account, and ran the tool from the SSH session. Surprisingly, the tool crashed and took the “su” session with it!
The simulator crashing is OK in my book, but why would it kill “su”? Something was fishy! Every single process owned by that user (VNC et al.) also died. I was annoyed because I hadn’t seen this behavior before, and I didn’t have a convincing explanation to give my user. I had to dig deeper!
Since it wasn’t out of memory, and it wasn’t a VNC bug, could something else be killing those processes? I had a theory, and it was time to test it. Since the kernel on those boxes is fairly old (EL7), I only had SystemTap available, but that should be enough. I fired off the quick snippet below, compiled and loaded it into the kernel, and ran the tool again.
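The exact script I used isn’t reproduced here, but it was something along these lines: a sketch using SystemTap’s `signal.send` tapset probe to log every SIGKILL sent on the box, along with who sent it and who received it.

```systemtap
# trace-sigkill.stp -- log every SIGKILL sent anywhere on the system.
# Run with: stap trace-sigkill.stp
probe signal.send {
    if (sig_name == "SIGKILL")
        printf("%s (pid %d) sent %s to %s (pid %d)\n",
               execname(), pid(), sig_name, pid_name, sig_pid)
}
```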
I waited a bit for the simulation tool to do its initial work, and right when it crashed, I saw the output below.
Wow, that all but confirmed my suspicion. Those processes were not crashing; they were being deliberately killed! By none other than the simulator tool itself! That was pretty weird. I still had doubts and wanted to confirm my line of thought, so I ran the binary one more time, and right before it was due to crash, I attached trusty old strace to it.
And again this confirmed my doubts. The binary was calling kill() to send a SIGKILL, and it was sending it to pid “-1”, which, according to the man page, means “send this signal to every process the caller is allowed to kill”! That was pretty weird, but at least things were starting to make sense. We had a SIGKILL’er on the loose!
Trying to “contain” this killer, I thought of confining it to a separate Linux PID namespace. I performed the quick experiment seen below: after “unshare”-ing the PID namespace and forking a new csh inside it, I was indeed unable to kill other processes from within. So the simulator wouldn’t be able to kill VNC, yaay!
I performed the experiment and ran the simulator app, and indeed it was not able to kill VNC or any of the other processes owned by that user. But it still killed itself 😅, making this experiment much less practical to use!
My next idea for containing this was to use gdb to break whenever the app called kill(); indeed, “catch syscall kill” was very helpful here. However, the app makes a lot of legitimate kill() calls, and skipping over them would have been hard to automate. I also thought about preloading a shim over glibc to override kill(), or even patching the kernel’s syscall table. But I wasn’t motivated enough to pursue such solutions, especially since the situation shouldn’t be happening in the first place. This is most likely an application bug that should be fixed in a later update!
Well, that was a fun ride! If you would like me or my team @cloud9ers.com to support your Linux servers or DevOps process, please do click the link above and send a message, or leave a comment below. Cheers!