Part I: How your batch system watches your processes (and why it's so bad at it)

By Brian Bockelman

Series Preamble

Almost every cluster sysadmin has faced a case of "users gone wild"; for us, it's almost always due to users abusing the shared file system or user processes escaping the watchful eye of the batch system.  If I could prevent abuse of the shared file system while keeping it functional, I'd be a rich man.  I'm not a rich man, so I'm going to be talking about the latter issue.  This is a big topic, so I'm going to be splitting it up into a few posts:
  • Part I: How your batch system watches your processes (and why it's so bad at it).
  • Part II: Keeping a mindful eye on your users with ProcPolice.
  • Part III: Death of the fork-bomb: Ironclad process tracking in batch systems.
A few caveats up-front: I'm going to be talking about the platform I know (Linux-based OSes) and the batch systems we use (Condor, PBS, and a bit of SGE).  Apologies to the Windows/obscure-Unix-variant/LSF users out there.  So, onward and upward!

Strategy: Process Groups

Each process on the system belongs to a process group, and process groups are in turn collected into a session (as in, a login session).  Most batch systems, when starting a job, will start the job in a new session and a fresh process group.  Process groups are at their most useful when sending signals: the batch system can send a signal (such as SIGKILL to terminate processes) to a process group.  The kernel does the process tracking and appropriately signals all the processes in the group.

If this worked well, it would be a short blog series.  Unfortunately, any process can start a new process group, removing itself from its previous group.  To hide from the batch system, a process can simply daemonize and start a new process group.  Voilà!  You've now escaped.

Process groups are utilized by all batch systems; despite not working well, this is the most straightforward mechanism.
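To make the mechanism concrete, here is a minimal Python sketch of what "start the job in a new session and signal the group" amounts to.  It is not taken from any batch system's sources; the helper names run_in_new_group and kill_group are made up for illustration, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys

def run_in_new_group(argv):
    """Start argv in a fresh session, and therefore a fresh process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so the child becomes the leader of a new session and process group.
    return subprocess.Popen(argv, start_new_session=True)

def kill_group(job):
    """Send SIGKILL to every process still in the job's process group."""
    # The group ID equals the PID of the group leader we launched.
    os.killpg(job.pid, signal.SIGKILL)
    return job.wait()  # a negative return code means "killed by that signal"

if __name__ == "__main__":
    job = run_in_new_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("job pid:", job.pid, "pgid:", os.getpgid(job.pid))
    print("return code:", kill_group(job))
```

Note that killpg() only reaches processes that are still in the group: a descendant that has called setsid() or setpgid() on its own is simply not there any more, which is exactly the escape described above.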

Strategy: Process Trees

Another strategy is "process trees".  When a process is created, it has both a process ID and a parent process ID.  By reading from the /proc filesystem (side note: Mac OS X has a far superior way of getting a programmatic snapshot of all system processes.  Jealous.), the batch system can create a list of all processes on the system and their parents, and build an in-memory tree of the parent-child relationships.  By starting with the process the batch system launched for the job and walking through the tree, the batch system can determine all the processes associated with a job.

Creating and updating a process tree is poll-based: the operation is performed every X seconds (a typical value might be X=5).  As long as every process lives for X seconds or more, the batch system will see all processes.  Well-behaved batch jobs stay within the process tree started by the batch system; this way, they get monitored by the batch system and their resource usage gets accounted.

The issue here, again, is badly behaved processes.  When a process's parent exits, the child is assigned a new parent: init, or PID 1.  To escape the notice of a batch system that uses process trees, one needs to fork twice (creating a child and "grandchild"), then have the immediate child die.  The grandchild is re-parented to init, as if it were a system daemon.  If this is done quickly enough, between two polls, the grandchild has escaped the batch system.

The process tree strategy is used by Condor.
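The polling approach can be sketched in a few lines of Python.  This is an illustration under assumptions, not Condor's actual implementation: read_ppids and descendants are hypothetical helper names, and the parsing relies on the Linux /proc/PID/stat layout, where the parent PID is the second field after the parenthesised command name.

```python
import os

def read_ppids(proc_root="/proc"):
    """Return {pid: ppid} for every process visible under proc_root (Linux only)."""
    ppids = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        # The command name is wrapped in parentheses and may itself contain
        # spaces or ')', so split after the last ')': state, ppid, ...
        fields = stat[stat.rindex(")") + 2:].split()
        ppids[int(entry)] = int(fields[1])
    return ppids

def descendants(ppids, root):
    """Return root plus every PID reachable from it through parent links."""
    children = {}
    for pid, ppid in ppids.items():
        children.setdefault(ppid, []).append(pid)
    seen, stack = set(), [root]
    while stack:
        pid = stack.pop()
        if pid not in seen:
            seen.add(pid)
            stack.extend(children.get(pid, []))
    return seen

if __name__ == "__main__":
    # Snapshot taken while the job (100) still has its child and grandchild.
    before = {100: 1, 101: 100, 102: 101}
    # Snapshot after the child (101) exited: 102 now reports init as its parent.
    after = {100: 1, 102: 1}
    print(sorted(descendants(before, 100)), sorted(descendants(after, 100)))
```

The two hand-written snapshots show the double-fork escape: once the intermediate child is gone, nothing in the parent links ties PID 102 back to the job.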

Strategy: Environment Cookies

A process's Unix environment is automatically inherited by its children, and remains unchanged if the parent exits.  Condor currently takes advantage of these facts and inserts an extra environment variable into each batch system job.  If you dump the environment of your current job using "env", you might see something like this:
_CONDOR_ANCESTOR_17948=17952:1307975354:2631244213
_CONDOR_ANCESTOR_17952=18260:1307976308:2791283533


Each of these is an environment variable used by Condor to track the process's ancestry.  In this case, the condor_starter's PID is 18260 and the job's PID is 18263 (the other entries are from parents of the condor_starter process, the condor_startd and condor_master).  Any sub-process started by the job will retain the _CONDOR_ANCESTOR_18260 variable by default.

When Condor polls the /proc filesystem to build a process tree, it can also read out the environment variables and use this information to build the process tree.  As before, this relies on the user being friendly: if a process changes or removes these environment variables, it can again escape the batch system.
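As a rough illustration of the cookie lookup, the sketch below parses the NUL-separated contents of a /proc/PID/environ file and pulls out the PIDs embedded in the variable names.  The helper names parse_environ and ancestor_pids are hypothetical, and the colon-separated value is deliberately treated as an opaque string, since its exact layout is not described here.

```python
def parse_environ(raw):
    """Parse the bytes of /proc/PID/environ (NUL-separated KEY=VALUE items)."""
    env = {}
    for item in raw.split(b"\0"):
        key, sep, value = item.partition(b"=")
        if sep:  # skip empty or malformed items
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env

def ancestor_pids(env, prefix="_CONDOR_ANCESTOR_"):
    """Return the set of PIDs named in the cookie variables."""
    pids = set()
    for key in env:
        suffix = key[len(prefix):]
        if key.startswith(prefix) and suffix.isdigit():
            pids.add(int(suffix))
    return pids

if __name__ == "__main__":
    raw = (b"PATH=/bin\0"
           b"_CONDOR_ANCESTOR_17952=18260:1307976308:2791283533\0")
    print(ancestor_pids(parse_environ(raw)))
    # A job that scrubs its environment before exec leaves nothing to find.
    print(ancestor_pids(parse_environ(b"PATH=/bin\0")))
```

On a live system the input would come from open("/proc/%d/environ" % pid, "rb").read(), which is only readable for your own processes unless you are root.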

Strategy: Supplementary Group IDs

Notice that all strategies so far involve some property of the process which is automatically inherited by its children (the process group, the process ancestry, or the Unix environment variables), but can be changed by the user's job.  A property inherited by subprocesses that cannot be changed without special privilege is the set of group IDs.  Each process has a set of group IDs associated with it (if you look at the contents of /proc/self/status, you can see the groups associated with your terminal session); it requires administrator privileges to add or remove group IDs, which the batch system has but the user does not.

Condor and SGE can be assigned a range of group IDs to hand out, and assign one of the IDs to the job process they launch.  Assuming there is only one instance of the batch system on the node, any process with that group ID must have come from the batch job.  So, when it comes time to kill batch jobs or perform accounting, we can map any process back to the batch system job.
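Mapping a process back to a slot then comes down to reading the "Groups:" line of its status file.  Here is a minimal sketch, assuming the Linux /proc/PID/status format; the slot_by_gid table and the GID values in it are invented for the example, and neither function is from Condor or SGE.

```python
def supplementary_gids(status_text):
    """Extract the supplementary group IDs from the text of /proc/PID/status."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(gid) for gid in line.split()[1:]}
    return set()

def slot_for(status_text, slot_by_gid):
    """Return the batch slot whose tracking GID the process carries, if any."""
    gids = supplementary_gids(status_text)
    for gid, slot in slot_by_gid.items():
        if gid in gids:
            return slot
    return None

if __name__ == "__main__":
    # Hypothetical range of tracking GIDs handed to the batch system.
    slot_by_gid = {65001: "slot1", 65002: "slot2"}
    status = "Name:\tsleep\nUid:\t1000\t1000\t1000\t1000\nGroups:\t100 1000 65002 \n"
    print(slot_for(status, slot_by_gid))
```

Unlike the environment cookie, the job cannot remove the tracking GID from its own Groups line without privileges.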

While the user process cannot get rid of the ID, this setup is still possible to defeat (discussed below), and has a few drawbacks.  The user process now has a new GID and can create files using that GID; I have no clue how this might be useful, but it's a sign of misusing the GID concept.  Anything that caches the user-to-groups mapping may get the wrong set of GIDs (as unique per-process GIDs are rare, these caches may make assumptions that no longer hold).  Finally, it lays extra work on the sysadmin, who now must maintain a range of unused GIDs; the range must be large enough to provide a GID per batch slot.  Locally, we've run into the fact that the number of GIDs needed increases with the number of cores per node: what was a good setting last year is no longer sufficient.

Note that, with Condor, you can take this one step further and assign a unique user ID per batch slot, and run the job under that UID as opposed to the submitter's UID.  This is a nightmare in terms of NFS-based shared file systems, but the approach at least works on both Unix and Windows.

How to defeat your batch system (inadvertently, right?)

Despite the drawbacks, the supplementary GID mechanism seems pretty foolproof: the user can no longer launch processes that can't be tracked back to a batch slot.  However, this isn't sufficient to stop malicious users.

In order to kill all processes based on some attribute of the process (besides the process group), one must iterate through the contents of the /proc directory, read and parse each process's status file, and send a kill signal as appropriate.  Ultimately, all batch systems currently do some variation of this; if you want a simple source code example, go look up the sources of the venerable 'killall' utility.

The approach described above does have a fatal flaw: it is not atomic.  Between the batch system listing the contents of /proc and opening /proc/PID/status, a process could have already forked another child and exited.  Processes may also be spawned between the time the directory iteration begins and the time it ends, meaning they might never be seen.  Hence, a process may spawn more children in the time it takes the batch system to iterate through /proc and kill it; in fact, if the batch system is unlucky, the process may do this fast enough that the batch system never detects it exists in the first place!  In the latter case, regardless of the tracking mechanism, the process may escape the batch system.

Worse, because these short-lived processes can be invisible to the batch system, the batch system may not detect that it's being fooled; if the batch system could reliably detect the attack, it might be able to send an alert or turn off the worker node.

Ultimately, the batch system is defeated because it is trying to do process control from user-space.  We lack three things:
  1. Reliable tracking of processes without changing the semantics of the job's runtime environment.
  2. Atomic operations for determining and signaling a set of processes.
  3. Detection of when (1) or (2) has failed.
Luckily, with a little help from the Linux kernel, we can overcome all three of the above issues.  Item (2) takes a fairly modern kernel (2.6.24 or later), but items (1) and (3) can be accomplished with 2.6.0 or later.
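For reference, the user-space scan-and-signal loop discussed above looks roughly like the Python sketch below.  It is an illustration only, not code from killall or any batch system: kill_by_gid is a made-up name, and the proc_root and send parameters exist purely so the loop can be exercised against a fake directory without killing anything.

```python
import os
import signal

def kill_by_gid(tracking_gid, proc_root="/proc", send=os.kill):
    """Send SIGKILL to every process whose Groups: line contains tracking_gid."""
    killed = []
    # Step 1: list the PIDs.  Anything forked after this call is not seen.
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        pid = int(entry)
        # Step 2: read the status file.  The process may already have forked
        # a child and exited, in which case the open fails and we move on.
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                status = f.read()
        except OSError:
            continue
        gids = []
        for line in status.splitlines():
            if line.startswith("Groups:"):
                gids = [int(gid) for gid in line.split()[1:]]
                break
        # Step 3: signal.  The PID may have vanished since step 2.
        if tracking_gid in gids:
            try:
                send(pid, signal.SIGKILL)
                killed.append(pid)
            except ProcessLookupError:
                pass
    return killed
```

Each numbered step happens at a different moment, and nothing stops the job from forking in between; that gap is the lack of atomicity in item (2).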

As long as we have the ability to detect attacks as in (3), we can limp along until everyone gets onto a modern kernel: this is the topic of the next post.  Stay tuned.