Replacing 90s C Linux Utilities With Python


Welcome to Infrastructure Week July 2018! New articles and tools every day this week.

Everybody knows the netstat tool, but do you know these fun facts too?

  • netstat is part of a package called net-tools which includes: arp hostname ifconfig ipmaddr iptunnel mii-tool nameif netstat plipconfig rarp route slattach statistics
    • what the heck is a plipconfig? Well, it optimizes the performance of your parallel port — I sure am glad my 2018 server with 288 cores and 8 TB RAM has plipconfig.
      • double fun fact: stackoverflow has zero questions about plipconfig — it must be a very intuitive and easy to use utility!
  • the net-tools codebase was written in the mid 90s
    • meaning: pre-C99, probably by people who still didn’t even trust or believe in C89 yet.
  • 25 years later, the entire codebase still looks like dirty old C
    • plus it’s included by default on millions of machines
  • and, surprise!, the entire project ended up mostly abandoned
  • people are still trying to correct “90s dirty C” idioms in the code to this day
  • maintenance of the package is now seemingly done ad hoc by OS package maintainers whenever they find a problem or a modern Linux incompatibility


and, as always, you can ignore all the hard work I put into this write up and just jump right to the code.

What If We Replaced 90s C netstat With Python?

We’re going to focus on one tool in the net-tools package: netstat.

It’s full of poorly formatted code you’d be (hopefully) fired for if you wrote today, but we’ll cover that towards the end so people don’t get scared or scarred up front.

netstat -nape

netstat -p is one of its most useful features: it shows you which pid and process name is listening on a port.

Example: netstat -nape |grep LISTEN

It gives us all listening IP:Port combinations along with their pid and process names (scroll to the right where the style falls off).

Unfortunately, we see some limitations:

  • look at the nginx process name: it’s nginx: master p
    • netstat has a fixed-length 20 character buffer for process names. Thanks, 90s!
    • also, since the output is so short, you don’t get full paths.
      • any process by any user in any directory could call itself “sshd” and you wouldn’t notice the difference based on the extremely truncated output netstat provides.
  • the output is wide. really wide. not very terminal friendly.
  • the output doesn’t appear to be ordered by anything useful? Not by pid, not by IP, not by port.
  • oh, and root.
    • netstat must be run as root to generate the IP:Port to pid/name mappings. That’s not cool.

Plus, the output has six columns we don’t care about!

My first attempt at making the output more useful was:

netstat --numeric-hosts --listening --program --tcp --inet --inet6 | awk '{if (NR > 2) {printf "%-4s %-20s served by %-20s\n", $1, $4, $7} }' | sort -k 5,5 -n

A little better! We are now sorted by pid and the bad columns are gone, but the process names are still truncated and netstat must still be run as root to generate them at all.

Still not good enough for our needs though.

It’s [Almost] Code Time

Replacing netstat -p requires figuring out how netstat matches IP:Ports to pids and why it requires root to show the mapping.

A quick look through the code tells us:

  • netstat reads the pid mappings from /proc/[pid]/fd/*, but each of those directories requires root permission to enter (unless you own the pid yourself).
    • why? it’s a security issue to let anybody directly access the open FDs/inodes of any random process
    • but why must those directories be consulted?
      • Linux only exposes which pids are using which inodes as a /proc/ symlink in those directories. There’s no other way to discover the mappings.
      • Those symlinks look like this (output of ls -latrh /proc/*/fd/*):
l-wx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/3 -> /dev/kmsg
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/0 -> /dev/null
lr-x------ 1 root root 64 Jul 11 18:33 /proc/1/fd/18 -> /run/utmp
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/39 -> socket:[14648]
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/38 -> socket:[878100]
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/37 -> anon_inode:bpf-prog
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/32 -> anon_inode:bpf-prog
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/31 -> anon_inode:bpf-prog
lrwx------ 1 root root 64 Jul 11 18:33 /proc/1/fd/30 -> anon_inode:bpf-map
lrwx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/9 -> /dev/kmsg
lrwx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/8 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/7 -> /dev/kmsg
lrwx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/6 -> socket:[14675]
lrwx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/5 -> socket:[14681]
lrwx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/4 -> socket:[14679]
lrwx------ 1 root root 64 Jul 11 18:33 /proc/408/fd/36 -> /var/log/journal/aa7c9b37491043cca93eed3d1d242ed6/user-1000.journal

Each of those is a symlink where the link target itself (for example socket:[14648]) tells you the actual inode used by the process.

What use is an inode?

Well, the inodes of each listening IP:Port socket are freely available to any user in /proc/net/{tcp,udp}{,6}. Here’s a sample of /proc/net/tcp:

Don’t you like how they stopped naming columns towards the end? What are those extra 7 fields? Don’t you worry your pretty little head about it. Just let those fields be fields and you be you.

Those IP addresses do look a bit off.

Let’s ask Erlang to convert those hex IP addresses back into individual bytes for us:

And, if you hadn’t guessed before, hex 0100007F is little endian 127.0.0.1.

If we tell Erlang about the byte orientation up front, it can fix it for us:

The port numbers are easy to decipher too (oddly, Linux prints the port numbers in network byte order but not the IP addresses. thanks, Linux!):

or if you are tired of Erlang:

or if you want some Python flavor:
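A minimal sketch of the same conversion in Python, using only the standard library. The helper names hex_to_ipv4 and hex_to_port are made up for illustration, and the byte reversal assumes a little-endian machine (which is what the 0100007F example implies).

```python
import ipaddress

# The raw bytes of the hex string, in the order /proc/net/tcp prints them.
raw = bytes.fromhex("0100007F")
print(list(raw))        # [1, 0, 0, 127]
print(list(raw[::-1]))  # [127, 0, 0, 1]


def hex_to_ipv4(hex_addr):
    """Turn '0100007F' into '127.0.0.1' (assumes a little-endian host)."""
    # Reverse the bytes to get network order, then let ipaddress format them.
    return str(ipaddress.IPv4Address(bytes.fromhex(hex_addr)[::-1]))


def hex_to_port(hex_port):
    """Ports need no byte swapping: '0016' is simply 0x0016."""
    return int(hex_port, 16)


print(hex_to_ipv4("0100007F"))  # 127.0.0.1
print(hex_to_port("0016"))      # 22
```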

whew, so now we know:

  • We can get every IP:Port active on the system, both incoming and outgoing, from:
    • /proc/net/tcp
    • /proc/net/udp
    • /proc/net/tcp6
    • /proc/net/udp6
  • We can pick listening addresses (using the st(ate) column) and save the inode column.
  • We can read every open fd in the system by walking /proc/[pid]/fd/*
    • If a symlink points to a socket:[INODE], then
      • parse the symlink, extract the inode number, compare to our previously retrieved inode list from /proc/net/{tcp,udp}{,6}.
  • then we can read /proc/[pid]/cmdline for each pid we matched to get the full command line (instead of being limited to a 20 character buffer like 90s netstat C code).
  • Finally, we can do any remaining formatting/sorting/filtering for presentation.

We can do all those steps in Python easily, right? It’s walking some directories, matching some files against other files, then printing the output we actually want.

Let’s do this.


Now It’s Python Coding Time

The netstat source walks the directories with plain C APIs:

  • opendir(3) of /proc to walk the pid directories
    • so, readdir(3) for each directory entry
    • if readdir returns a pid directory, run another opendir() to walk the pid/fd directory.
      • now readdir() again to walk the fd entries
        • then call readlink(2) on each entry, looking for socket:[INODE] targets.
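The same nested walk can be sketched in Python, with os.listdir() standing in for the opendir/readdir pairs. This is an illustration of the shape of the algorithm, not a translation of netstat.c; the function name walk_socket_links and the proc_root parameter are assumptions added so the sketch can be pointed at a test directory.

```python
import os


def walk_socket_links(proc_root="/proc"):
    """Yield (pid, inode) pairs by walking proc_root one directory at a time."""
    for entry in os.listdir(proc_root):        # opendir + readdir on /proc
        if not entry.isdigit():                # keep only pid directories
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)           # opendir + readdir on pid/fd
        except OSError:                        # not our process, or it exited
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:                    # fd was closed while we looked
                continue
            if target.startswith("socket:[") and target.endswith("]"):
                yield int(entry), int(target[len("socket:["):-1])
```

Calling list(walk_socket_links()) as a normal user returns only the sockets of processes that user owns, which is the permission problem described earlier.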

It’s a lot of system calls for file operations even though it’s running through procfs.

We could copy the netstat algorithm exactly using Python’s os.walk() API.

So, that’s what I did the first time through. I used os.walk() and it took 500ms to generate results (30x slower than old netstat, not cool).

But, this is the future and we have better APIs: if we replace 20 lines of looping os.walk() code with one line of glob.glob("/proc/*/fd/*"), our runtime drops from 500ms to 70ms.

Read Dem Files

Even though we can get a nice quick file listing with glob (oh my glob!), we still have to call os.readlink() on every filename it returns.

create map of inode->list of pids

Capturing every process’s socket inode->pid mapping becomes:
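A minimal sketch of that step, assuming a glob plus os.readlink() approach as described. The function name socket_inode_pids is hypothetical, and the pattern is narrowed to /proc/[0-9]*/fd/* (my choice, not the article's) so that /proc/self and friends don't show up as non-numeric "pids".

```python
import glob
import os
import re
from collections import defaultdict

SOCKET_LINK = re.compile(r"socket:\[(\d+)\]$")


def socket_inode_pids(pattern="/proc/[0-9]*/fd/*"):
    """Return {inode: [pid, ...]} for every socket fd we are allowed to read."""
    inode_pids = defaultdict(list)
    for path in glob.glob(pattern):
        try:
            target = os.readlink(path)
        except OSError:                      # fd vanished or permission denied
            continue
        match = SOCKET_LINK.match(target)
        if not match:                        # regular file, pipe, anon_inode...
            continue
        pid = int(path.split("/")[-3])       # path looks like .../<pid>/fd/<n>
        inode = int(match.group(1))
        if pid not in inode_pids[inode]:     # one process may hold dup'd fds
            inode_pids[inode].append(pid)    # append the pid to inode->[pid]
    return inode_pids
```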

Note at the end how we append the pid to our map of inodes->[pid].

netstat -p can’t show us every process listening on a socket. With forking servers, multiple processes listen on the same socket (and with SO_REUSEPORT, multiple sockets can even share one port), but you’d never realize that from reading netstat -p output.

We’re already better than netstat — we can report the truth of our system instead of having our output lie to us because unmaintained C code from 1993 can’t handle the modern world.

look up the command line for each pid

Now, with our list of pids, we can look up each command line:
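A sketch of the lookup, under a few assumptions: read_cmdline and attach_commands are hypothetical names, inode_pids is an inode->[pid] map like the one described above, and inodes is a dict keyed by listening socket inode.

```python
def read_cmdline(pid, proc_root="/proc"):
    """Return the full command line of pid, or '' if it can't be read."""
    try:
        with open(f"{proc_root}/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except OSError:                      # process exited or is not readable
        return ""
    # Arguments are NUL-separated (usually with a trailing NUL), so turn the
    # separators into spaces and trim whatever is left over at the ends.
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()


def attach_commands(inodes, inode_pids, proc_root="/proc"):
    """Map each listening inode to a sorted list of (pid, command line)."""
    return {
        inode: [(pid, read_cmdline(pid, proc_root))
                for pid in sorted(inode_pids.get(inode, []))]
        for inode in inodes
    }
```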

i… i… inodes!

Where did the inodes dict come from? We didn’t populate that yet!

inodes was the result of parsing /proc/net/{tcp,udp}{,6}, which is as simple as:
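Roughly like this, as a sketch rather than the original code. The function names are hypothetical and addresses are left in raw hex form. Two assumptions worth flagging: st value 0A marks a listening TCP socket, and unconnected UDP sockets are reported with st 07.

```python
TCP_LISTEN = "0A"        # st value for a listening TCP socket
UDP_UNCONNECTED = "07"   # assumption: st value shown for unconnected UDP


def parse_proc_net(lines, proto):
    """Return {inode: (proto, 'HEXADDR:HEXPORT')} for listening sockets."""
    want = TCP_LISTEN if proto.startswith("tcp") else UDP_UNCONNECTED
    found = {}
    for line in lines[1:]:                   # first line is the column header
        fields = line.split()
        if len(fields) < 10:
            continue
        # Whitespace-split layout: sl, local, remote, st, queues, timer,
        # retrnsmt, uid, timeout, inode, then the unnamed extras.
        local, state, inode = fields[1], fields[3], int(fields[9])
        if state == want and inode != 0:
            found[inode] = (proto, local)
    return found


def listening_inodes(proc_net="/proc/net"):
    """Merge listening sockets from tcp, udp, tcp6 and udp6 into one dict."""
    inodes = {}
    for proto in ("tcp", "udp", "tcp6", "udp6"):
        try:
            with open(f"{proc_net}/{proto}") as f:
                inodes.update(parse_proc_net(f.readlines(), proto))
        except OSError:                      # e.g. IPv6 disabled
            continue
    return inodes
```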

Simple enough? We also use functions ipv4() and ipv6() to parse the hex IPs from /proc/net to readable formats:
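The names ipv4() and ipv6() come from the text above, but the bodies below are my own sketch of what such helpers could look like. Both assume a little-endian host, and both take the whole ADDRESS:PORT field from /proc/net.

```python
import socket
import struct


def ipv4(hex_addr_port):
    """'0100007F:0035' -> ('127.0.0.1', 53)."""
    addr, port = hex_addr_port.split(":")
    # The address is printed as one host-order 32-bit integer.
    packed = struct.pack("<I", int(addr, 16))
    return socket.inet_ntop(socket.AF_INET, packed), int(port, 16)


def ipv6(hex_addr_port):
    """'00000000000000000000000001000000:0143' -> ('::1', 323)."""
    addr, port = hex_addr_port.split(":")
    # An IPv6 address is printed as four host-order 32-bit integers, so each
    # 8-digit group is byte-swapped on its own, not the address as a whole.
    words = [int(addr[i:i + 8], 16) for i in range(0, 32, 8)]
    packed = struct.pack("<4I", *words)
    return socket.inet_ntop(socket.AF_INET6, packed), int(port, 16)
```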

And we’re done! We now have a dict called inodes containing every listening IP:Port on our system.

All that remains is to draw the rest of the fscking owl (that is, format the output how we want), which gives us:

Proto Listening              PID   Process
udp   192.168.122.10:bootpc  441   /lib/systemd/systemd-networkd
tcp   127.0.0.53:domain      493   /lib/systemd/systemd-resolved
udp   127.0.0.53:domain      493   /lib/systemd/systemd-resolved
tcp   192.168.122.10:ssh     579   /usr/sbin/sshd -D
udp   127.0.0.1:323          581   /usr/sbin/chronyd
udp6  ::1:323                581   /usr/sbin/chronyd
tcp   158.69.158.251:http    620   nginx: master process /usr/sbin/nginx -g daemon on;
                             2893  nginx: worker process
                             2894  nginx: worker process
                             2895  nginx: worker process
                             2896  nginx: worker process
tcp   158.69.158.251:https   620   nginx: master process /usr/sbin/nginx -g daemon on;
                             2893  nginx: worker process
                             2894  nginx: worker process
                             2895  nginx: worker process
                             2896  nginx: worker process
tcp   0.0.0.0:smtp           908   /usr/lib/postfix/sbin/master -w
                             11580 smtpd -n smtp -t inet -u -c -o stress= -s 2
tcp6  :::smtp                908   /usr/lib/postfix/sbin/master -w
                             11580 smtpd -n smtp -t inet -u -c -o stress= -s 2
tcp   127.0.0.1:epmd         968   /opt/otp/17.5/erts-6.4/bin/epmd -daemon
tcp   127.0.0.1:7781         987   /opt/otp/17.5/erts-6.4/bin/beam.smp -- -root /opt/
tcp   127.0.0.1:40001        987   /opt/otp/17.5/erts-6.4/bin/beam.smp -- -root /opt/
tcp   127.0.0.1:8888         1029  /opt/otp/17.5/erts-6.4/bin/beam.smp -- -root /opt/
tcp   127.0.0.1:40002        1029  /opt/otp/17.5/erts-6.4/bin/beam.smp -- -root /opt/
tcp   127.0.0.1:7780         8445  /opt/otp/17.5/erts-6.4/bin/beam.smp -- -root /opt/
tcp   127.0.0.1:40000        8445  /opt/otp/17.5/erts-6.4/bin/beam.smp -- -root /opt/

oooooh so pretty! no unnecessary columns, sorted by primary pid number, reports multiple listeners controlling one socket…

Plus: colors! We’re using blue for private/local IPs and red for other. Double Plus: terminal-width aware printing so we never wrap lines!
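One way the coloring and width handling might look, as a sketch only. The helper names colorize and fit are hypothetical, and treating loopback and wildcard addresses as "private/local" is my guess at the rule, not something the original spells out.

```python
import ipaddress
import shutil

BLUE, RED, RESET = "\033[34m", "\033[31m", "\033[0m"


def colorize(address):
    """Wrap an IP string in blue if it is private/local, red otherwise."""
    ip = ipaddress.ip_address(address)
    local = ip.is_private or ip.is_loopback or ip.is_unspecified
    return f"{BLUE if local else RED}{address}{RESET}"


def fit(text, width=None):
    """Cut text to the terminal width so a row never wraps."""
    if width is None:
        width = shutil.get_terminal_size().columns
    return text[:width]


# Truncate the plain text first, then add color, because the escape codes
# would otherwise be counted as visible characters.
print(colorize("192.168.122.10"), fit("/usr/sbin/sshd -D"))
```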

But, sadly, because of Linux design choices, the only place we can discover the pid mappings is by running as root to read /proc/[pid]/fd/* entries. How can we get around such a restriction so everybody can run the netstat of the people? seizethemeansofnetstatting!

Let’s Make A Module

The problem with finding system-wide inode to pid mappings is simple: Linux never created an interface to discover them without opening O(N) directories and reading symlink targets of O(k) files. Oh, and those directories can only be opened by root or the process owner. Whoops.

But, just because Linux never created such an interface doesn’t mean we can’t create one for ourselves!

Let’s write a simple Linux kernel function to list every inode belonging to every pid:

The code prints a line of {pid} {short process name} {inodes*} for each task/pid on the system.

We run our function by turning it into a Linux kernel module using both the simplified seq_file and procfs APIs to:

  • create an entry in /proc
  • tell Linux how to run our function when anybody reads /proc/pid_inode_map

Load Module, See Results!

After loading the module, cat /proc/pid_inode_map shows lines like:

With our custom mapping of pids to their inodes, we can adjust the netstat replacement to take advantage of one-stop-shopping-mapping.

Replacing glob with Custom Proc Parsing

Let’s read /proc/pid_inode_map once instead of parsing O(N*k) symlinks:
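A sketch of that parser, based only on the {pid} {short process name} {inodes*} line format described above. The function name is hypothetical, and the code assumes the short process name contains no whitespace, which a real implementation would need to handle.

```python
from collections import defaultdict


def parse_pid_inode_map(lines):
    """Return ({inode: [pid, ...]}, {pid: short_name}) from map lines."""
    inode_pids = defaultdict(list)
    names = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue                          # blank or malformed line
        pid, name = int(fields[0]), fields[1]
        names[pid] = name
        for token in fields[2:]:              # remaining tokens are inodes
            if token.isdigit():
                inode_pids[int(token)].append(pid)
    return inode_pids, names
```

With the module loaded, usage would be along the lines of opening /proc/pid_inode_map, passing the file object to parse_pid_inode_map(), and feeding the resulting inode->[pid] map to the rest of the script in place of the glob results.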

We save thousands of system operations by generating one file any user can read instead of needing root to go through every fd of every process.

Our approach of parsing a pre-generated file at /proc/pid_inode_map is 40% faster than iterating all the pids and fd symlinks every time we want a network status.

Hey Linux, give us a built-in pid to inode mapping by default!

Code for all the Python scripts plus the Linux kernel module is at mattsta/netmatt.

C Thy Shame

Jumping back a bit, let’s look at some netstat.c code.

Code has not been modified to protect the guilty. It actually is formatted like this.

This is in netstat.c still shipping in your Linux net-tools package in 2018:

Did you notice the excerpt has an unterminated while loop? Do you see it?

If you want to follow along at home, you can get the source with apt-get source net-tools.

What if I spend 0.03 seconds to run it through modern automated formatting tools?

90s C code has plenty of weird properties, but the strangest is an absolute refusal to use proper indentation combined with a massive lack of visual whitespace.

Though, even in 2018 backwards people still argue “brevity” is the highest form of coding. Never write in 4 lines what you can technically manipulate your compiler into accepting as 1 line, even if you just remove all the whitespace and brackets and indentation. We call these people CDs (C Dolts) and they should be monitored carefully to minimize ongoing damage to the time stream.

Making 90s C code is easy:

  • cram everything close together
  • align most everything to the left with no indentation
  • never — never! — use brackets if your if, for, or while only has one result statement
    • as a bonus, lie using indentation about what your if does, like the inexcusably bad if (lnamelen == -1) statement below.
      • look, it has a ‘continue’ but then everything below is still indented like it applies to the if statement! Gotta love 25 year old abandoned code running on millions of machines around the world.

Check out this source excerpt still shipping in 2018 too:

If your eyes haven’t exploded from code stress yet, count the unterminated flow control statements. Do you see?

Let’s clean this up again using 0.03 seconds of automated tooling:

In the original code section, did you notice if (!cmdlp) { was unterminated? No, you didn’t notice, because they refused to use indentation in 1993 and nobody has fixed it in the subsequent 25 years.

90s C is basically the pinnacle of the Write Once, Read Never Again coding movement and must be ridiculed at all costs. Riddikulus!

Conclusion

What did we learn today?

  • netstat is part of net-tools
  • net-tools is a mostly abandoned set of Linux utilities from the mid 90s
  • Linux doesn’t let non-root users discover pid to [inodes] metadata for processes they don’t own
  • netstat actually under-reports which pids own which sockets
    • netstat only lists one pid even though sockets can be owned by multiple pids
  • But we can write a Linux kernel module to generate the mapping anyway! seizethemeansofnetstatting!
  • The 90s Linux utility C code is awful and needs to be either adopted and completely re-formatted, re-reviewed, and brought up to modern standards, or outright abandoned.
  • We can write much safer system utilities in Python
    • they are fast enough
    • they are safe enough
    • they are readable enough
    • and doggone it, people like me.

-Matt@mattsta☁mattsta
