Decoded: GNU coreutils


October 2018

coreutils brought to you by the GNU project

This is a long-term project to decode all of the GNU coreutils in version 8.3.

This resource is for novice programmers exploring the design of command-line utilities. It is best used as a companion that provides background while you read the source code of a utility you are interested in. This is not a user guide; please see the applicable man pages for instructions on using these utilities.

Status: Paused -- 52% complete  (3 complete, 95 partial)

  • Phase 1 [complete] - Each utility has a dedicated page discussing the namespace and execution overview.
  • Phase 2 - Line by line code walkthrough for each utility. Links from execution steps to v8.3 source. Expanded discussion about important ideas. Porting diagrams to something more collaborative. Resuming 2H 2019.

The GNU Core Utilities

I'll link the utility pages here at the top. Click the command name for the detailed page decoding that utility. The discussion, source code, and walkthroughs are available on each page. Enjoy!

Helpful background for code reading

The GNU coreutils has its foibles. Most of these utilities are over 30 years old with many revisions by many people over the years. Here are some things to keep in mind when reading the code:
  • Tiny programs - These utilities are small, (mostly) single-source file programs designed to do one thing and do it well. They are not designed for long life or to scale beyond their role. Consequently, we see designs often considered 'bad practice' such as:
    • Many globals
    • Liberal use of macros
    • goto statements
    • Long functions with nested switches/loops
  • Know POSIX - Start with the Utility Syntax Guidelines. In general, POSIX supports interoperability by defining appropriate inputs and outputs, but leaves the 'work' to the implementation. While the GNU coreutils may not strictly conform to POSIX, many ideas are entrenched: permission bits, uids/gids, environment variables, exit status, and about 3718 pages of more trivia.
  • Outside help - Portability is a complex problem and coreutils relies on extra help from a related project: gnulib. Almost every utility includes functions from gnulib which are specially designed for common problems used in many places across various systems - No need to reinvent the wheel.
  • Launched from a shell - The core utilities expect support from a shell such as bash, zsh, ksh, and others. The shell forks/clones into the utility, passes the arguments, sets up the environment, redirects I/O via pipes, and retains exit values.
  • Three families - GNU coreutils were originally three distinct packages for shell, text, and file utilities. Utilities within the same type share many of the same design patterns.

Basic design

Under the hood, most CLI utilities break down to this:

General CLI procedure

The key ideas:

  • A setup phase for flags, options, localization, etc
  • An argument parsing phase that reads input to set execution parameters
  • A processing/execution phase that prepares input for one or more syscalls
  • Many opportunities to check constraints and fail out of execution
    • Distinct exit statuses hint at where the problem occurred
    • EXIT_FAILURE is general and commonly used
  • Providing feedback after failed execution

This is the framework I'll use to organize the decoding of each utility. We'll see that each has a unique variant of this idea, ranging from a few lines to thousands of lines. I'd categorize the variants in three groups: trivial, wrappers, and full utilities.

Trivial utilities
Trivial utilities have a unique setup phase that defines a macro in a couple of lines. They then 'include' the source of another utility, in which the macro forces a specific control flow. Examples include: arch, dir, and vdir.

Wrapper utilities
Wrappers perform setup and parse command line options, which are passed directly as arguments to a syscall. The result of the syscall is the result of the utility. These utilities do little processing on their own. Examples include: link, whoami, hostid, logname, and more.
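To make the wrapper shape concrete, here is a minimal sketch loosely modeled on what a link-style utility does. The function name link_sketch() and its messages are hypothetical and are not taken from the coreutils source. The only real work is the call to link(2).

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper entry point (not the coreutils implementation).
   Rename to main() to build it as a standalone program.  */
int
link_sketch (int argc, char **argv)
{
  /* Parsing phase: exactly two operands are required.  */
  if (argc != 3)
    {
      fprintf (stderr, "usage: %s FILE1 FILE2\n", argv[0]);
      return EXIT_FAILURE;
    }

  /* Execution phase: the operands go straight to the syscall wrapper.  */
  if (link (argv[1], argv[2]) != 0)
    {
      fprintf (stderr, "%s: cannot create link '%s' to '%s': %s\n",
               argv[0], argv[2], argv[1], strerror (errno));
      return EXIT_FAILURE;
    }

  /* The result of the syscall is the result of the utility.  */
  return EXIT_SUCCESS;
}
```

The exit status mirrors the syscall result, which is the defining trait of this category.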

Full utilities
The diagram above shows the design for full utilities: a setup phase, an option/argument parsing phase, and execution. Execution means processing input data, and may invoke many syscalls along the way until all input is handled. Most utilities fall into this category.
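The three phases can be sketched in a few lines of C. This is an illustration under assumptions, not code from any coreutils utility: demo_main() and count_bytes() are hypothetical names, and the "work" is simply counting bytes in each file operand.

```c
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Execution helper: add the size of one file to *total.
   Returns false and reports the error if the file cannot be read.  */
static bool
count_bytes (const char *path, unsigned long *total)
{
  FILE *fp = fopen (path, "rb");
  if (fp == NULL)
    {
      perror (path);            /* feedback after failed execution */
      return false;
    }
  while (getc (fp) != EOF)
    ++*total;
  bool ok = !ferror (fp);
  fclose (fp);
  return ok;
}

/* Hypothetical entry point; rename to main() for a standalone build.  */
int
demo_main (int argc, char **argv)
{
  /* Setup phase.  */
  setlocale (LC_ALL, "");

  /* Parsing phase: this sketch accepts file operands only.  */
  if (argc < 2)
    {
      fprintf (stderr, "usage: %s FILE...\n", argv[0]);
      return EXIT_FAILURE;
    }

  /* Execution phase: keep going after a failure, but remember it.  */
  int status = EXIT_SUCCESS;
  unsigned long total = 0;
  for (int i = 1; i < argc; i++)
    if (!count_bytes (argv[i], &total))
      status = EXIT_FAILURE;

  printf ("%lu\n", total);
  return status;
}
```

Note how a failure on one operand does not stop the loop. Many utilities behave this way and only report the failure through the final exit status.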

Digging deeper

Let's go through the most common ideas shared across many of the utilities. Knowing these concepts beforehand should speed up code reading.

Utility Initialization

All utilities have a short initialization procedure near the beginning of main():

 initialize_main (&argc, &argv);
 set_program_name (argv[0]);
 setlocale (LC_ALL, "");
 bindtextdomain (PACKAGE, LOCALEDIR);
 textdomain (PACKAGE);
 atexit (close_stdout);

This short sequence solves a few administrative issues, the most important of which are internationalization and defining the exit action. I'll go through each of these lines below. In general, these lines don't apply to the specific action of a utility.

Parsing with Getopt

Ever wonder why command line utilities have had the same look and feel for the past 40 years? You can thank the Getopt toolset. The bare minimum you need to know to follow the coreutils is:

  • Command line options can be 'short' and 'long', prefixed with (-) and (--) respectively. Short options are defined as a string while long options use a struct.
  • In the short option string, 1) a letter alone means the option takes no argument, 2) a letter followed by a single colon (:) means the argument is mandatory, and 3) a letter followed by a double colon (::) means the argument is optional. For example, the short option string for kill is Lln:s:t, which says that L, l, and t take no arguments but n and s each need one.
  • Long options often have a short analogue
  • The getopt_long() function returns the next option and is used in almost all utilities
  • The optind index is a position within the argv[] array for the next argument.
  • The optarg char pointer points to the value of the option's argument.
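The points above fit together as in the following sketch. The option set (-v/--verbose and -n/--count) and the names demo_options and parse_demo_options() are made up for illustration. Only getopt_long(), struct option, optind, and optarg come from the real interface.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical settings filled in by the parser.  */
struct demo_options
{
  bool verbose;
  long count;
  int first_operand;            /* index in argv[] of the first non-option */
};

/* Long options are described by an array of structs; each entry here
   maps onto the short option letter given in the last field.  */
static struct option const long_options[] =
{
  {"verbose", no_argument, NULL, 'v'},
  {"count", required_argument, NULL, 'n'},
  {NULL, 0, NULL, 0}
};

/* Returns true on success, false on an unrecognized or malformed option.  */
bool
parse_demo_options (int argc, char **argv, struct demo_options *opts)
{
  int c;

  opts->verbose = false;
  opts->count = 1;
  optind = 1;                   /* restart scanning from argv[1] */

  /* "vn:" means: v takes no argument, n requires one.  */
  while ((c = getopt_long (argc, argv, "vn:", long_options, NULL)) != -1)
    {
      switch (c)
        {
        case 'v':
          opts->verbose = true;
          break;
        case 'n':
          opts->count = strtol (optarg, NULL, 10);  /* optarg -> "3" */
          break;
        default:                /* getopt_long already printed a message */
          return false;
        }
    }

  opts->first_operand = optind; /* remaining argv[] entries are operands */
  return true;
}
```

A real utility would validate the number in optarg more carefully. The sketch keeps strtol() bare to stay focused on the getopt flow.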

Traversing the file system with fts

Unix-like systems often support the fts library to easily manage walking through the file system. The basic hand-waved details are:

  • The tree is represented by an FTS structure built by calling fts_open() or xfts_open() on a path.
  • A node (file/directory) from the tree is a FTSENT structure.
  • Calling fts_read() on the FTS generates FTSENTs. This is walking the tree.
  • The FTSENT->fts_info field describes the entry. It is used often to decide how to handle the entry.
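A small sketch of that walk, using the fts interface as shipped with glibc and the BSDs. The function count_regular_files() is a hypothetical example, not something found in coreutils, and it uses plain fts_open() rather than the xfts_open() helper.

```c
#include <fts.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Count the regular files below ROOT.  Returns -1 if the tree
   cannot be opened.  */
long
count_regular_files (const char *root)
{
  /* fts_open() takes a NULL-terminated array of starting paths.  */
  char *paths[] = { (char *) root, NULL };
  FTS *tree = fts_open (paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
  if (tree == NULL)
    return -1;

  long count = 0;
  FTSENT *node;

  /* Each fts_read() call yields the next node of the walk.  */
  while ((node = fts_read (tree)) != NULL)
    {
      switch (node->fts_info)
        {
        case FTS_F:             /* regular file */
          count++;
          break;
        case FTS_DNR:           /* unreadable directory */
        case FTS_ERR:           /* generic error */
        case FTS_NS:            /* stat() failed */
          fprintf (stderr, "%s: cannot access\n", node->fts_path);
          break;
        default:                /* directories, symlinks, and so on */
          break;
        }
    }

  fts_close (tree);
  return count;
}
```

The switch on fts_info is the part to recognize when reading the utilities: most of the per-entry decisions hang off that one field.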

Syscall wrappers, and helpers

coreutils often invokes syscalls through wrappers and helpers beyond those provided by libc. Many are linked through the Gnulib project.

write

libc provides many text writing functions, such as fwrite() for buffered stream access and the write() syscall wrapper. Coreutils brings in non-standard functions such as full_write(), which keeps retrying writes unless there is a hard failure. It relies on safe_write() to retry the write() syscall across interrupts. Other write-related helpers are used only in a single utility, such as iwrite() in dd and cwrite() in split. I'll discuss those within the utilities themselves.
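The retry idea can be sketched as follows. These are simplified stand-ins written for this page, with hypothetical names (safe_write_sketch, full_write_sketch), and they do not reproduce the gnulib source. The assumption is only that an interrupted write() fails with EINTR and that a short write should be continued.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Retry write() for as long as it is interrupted by a signal.  */
static ssize_t
safe_write_sketch (int fd, const void *buf, size_t count)
{
  for (;;)
    {
      ssize_t n = write (fd, buf, count);
      if (n >= 0 || errno != EINTR)
        return n;
    }
}

/* Keep writing until all COUNT bytes are out or a hard failure occurs.
   Returns the number of bytes actually written.  */
size_t
full_write_sketch (int fd, const void *buf, size_t count)
{
  const char *p = buf;
  size_t total = 0;

  while (count > 0)
    {
      ssize_t n = safe_write_sketch (fd, p, count);
      if (n < 0)
        break;                  /* hard failure: errno says why */
      if (n == 0)
        {
          errno = ENOSPC;       /* no progress is treated as "no space" */
          break;
        }
      total += n;
      p += n;
      count -= n;
    }

  return total;
}
```

A caller detects failure by comparing the return value with the requested count, then inspects errno.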

Common functions

All utilities use at least three functions: main(), usage(), and _().

The usage() function displays help for the utility that includes a list of input parameters, their meaning, and appropriate syntax.

The _() function is really a macro defined in system.h that binds simple strings to the Native Language Support capability in GNU gettext.h. If a string is meant to be shown to the user, it's probably wrapped with this macro.
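As a rough illustration of how _() is used, the sketch below defines the macro directly on top of gettext() from <libintl.h>. The helper format_try_help() is hypothetical. With no message catalog installed, gettext() simply returns the original English string.

```c
#include <libintl.h>
#include <stdio.h>

/* Shorthand for looking up a translation of a user-visible string.  */
#define _(msgid) gettext (msgid)

/* Hypothetical helper: format a translated hint into BUF.
   Returns the snprintf() result.  */
int
format_try_help (char *buf, size_t size, const char *program)
{
  return snprintf (buf, size,
                   _("Try '%s --help' for more information.\n"), program);
}
```

Because the lookup happens at run time, the same binary can print the hint in whatever language the user's locale selects, provided a catalog exists.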

Common code lines

The following code lines occur in most non-trivial utilities:

#include "system.h"
This header defines system-dependent macros, variables, and functions. It provides 'translations' necessary to allow coreutils to build on as many architectures as possible. The file really is a patchwork of corner cases with little organization. I won't be taking this apart unless there is an overwhelming number of requests.

#define PROGRAM_NAME "cat"
Defines the official name for the utility. Used in the 'version' check.

#define AUTHORS proper_name ("Richard M. Stallman")
Defines the authors for the utility. Used in the 'version' check.

emit_try_help ()
Prints a suggestion to run the utility with --help after a usage error. This appears at the beginning of usage(), on the failure path.

emit_ancillary_info (PROGRAM_NAME)
Prints common extra help info after the command-specific output. Includes a link to the online documents. This appears close to the end of usage()

exit (status)
libc function that ends execution with the given status, running any handlers registered with atexit() first. This appears at the end of usage()

initialize_main(&argc, &argv)
Special handler for OS/2 forcing built-in wildcard expansion. This is defined away for most other operating systems

set_program_name(argv[0]);
Saves the program name from the first input argument for use in diagnostics. The gnulib version strips libtool build prefixes such as .libs/ and lt- from argv[0].

setlocale(LC_ALL, "");
Sets up internationalization options during execution. Provided by libc in <locale.h>

bindtextdomain (PACKAGE, LOCALEDIR);
Sets the directory that holds the message catalogs for the package, using GNU gettext.

textdomain (PACKAGE);
Sets the text domain to enable i18n.

atexit(close_stdout);
Registers the close_stdout function to be called when the program ends. This flushes the buffered stream in addition to closing it.

IF_LINT(something);
Suppresses GCC warnings if using a linter by including the code within the parens. Usually this is a no-op.

C idioms

There are a few idioms buried in the coreutils source that may be unfamiliar to beginners.

!!
The double exclamation point is exactly what you see, a double unary NOT operation. The purpose is to coerce a value into a boolean (0 or 1). It's often used to make a flag from a function return value.
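Two tiny examples of the idiom. Both function names are hypothetical and exist only to show the coercion.

```c
#include <string.h>

/* (mask & bit) may be any nonzero value; !! folds it to exactly 1.  */
int
flag_is_set (unsigned int mask, unsigned int bit)
{
  return !!(mask & bit);
}

/* strchr() returns a pointer or NULL; !! turns that into 1 or 0.  */
int
contains_char (const char *s, int ch)
{
  return !!strchr (s, ch);
}
```

For instance, flag_is_set (0x6, 0x4) yields 1 even though the raw expression 0x6 & 0x4 is 4.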

do { ... } while (0)
The non-loop often encloses a multi-statement macro so that it expands to a single statement after preprocessor substitution. The core use case is as a consequent:

 if (condition)
   MACRO;
 else
   something_else;
Note the lack of a semicolon after the while in the macro definition; it is supplied wherever the macro appears in the C code.
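Here is a compilable sketch of the pattern. SWAP_INT and sort_pair() are hypothetical names chosen for this example. Without the do/while wrapper, the semicolon after the macro call would end the if statement early and the else would no longer attach.

```c
/* A multi-statement macro wrapped so that it behaves as one statement.
   There is no semicolon after while (0); the caller supplies it.  */
#define SWAP_INT(a, b)  \
  do                    \
    {                   \
      int tmp_ = (a);   \
      (a) = (b);        \
      (b) = tmp_;       \
    }                   \
  while (0)

/* Put *lo and *hi in ascending order.  Returns 1 if a swap happened,
   0 if the pair was already ordered.  */
int
sort_pair (int *lo, int *hi)
{
  if (*lo > *hi)
    SWAP_INT (*lo, *hi);        /* expands to a single statement */
  else
    return 0;

  return 1;
}
```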

Fun facts

Shortest utility: false (2 lines - tied with arch, dir, and vdir)
Shortest standalone utility: true (80 lines)
Longest utility: ls (5308 lines)

  • Many utilities trace their origins to pre-System V UNIX (some as far back as Multics).
  • The distinct syntax of the dd utility is reminiscent of the OS/360 job control language (early 1960s).
  • The sort program is the only utility that takes advantage of multi-threading
  • The fmt utility demonstrates optimization of lines and paragraphs using feature costs
  • The deceptively simple yes utility has high-performance output using page-aligned memory buffers
  • The df utility is faster than du. The former uses device metadata while the latter checks all files
  • cksum includes two entry points, one for normal operation and one to generate the CRC-32 table
  • There is no failure condition for the echo utility
  • The design of the test and expr utilities departs significantly from the typical utility
  • expr is a standalone example of left-associative expression evaluation that doesn't rely on automated tools
  • My personal least used utilities are tsort and ptx - haven't touched either since the 1990s

FAQ

No questions yet