Poke-a-hole and friends


Reducing the memory footprint of a binary is important for improving performance. Poke-a-hole (pahole) and the other object-file analysis programs developed by Arnaldo Carvalho de Melo help find inefficiencies in object files, such as holes in data structures or functions that were declared inline but ended up out of line in the object code.

Poke-a-hole

Poke-a-hole (pahole) is an object-file analysis tool that reports the size of data structures and the holes the compiler creates when it aligns data elements to the word size of the CPU. Consider a simple data structure:

 struct sample {
         char a[2];
         long l;
         int i;
         void *p;
         short s;
 };

Adding the size of individual elements of the structure, the expected size of the sample data structure is:

 2*1 (char) + 4 (long) + 4 (int) + 4 (pointer) + 2 (short) = 16 bytes

Compiling this on a 32-bit architecture (ILP32, or Int-Long-Pointer 32 bits) reveals that the size is actually 20 bytes. The additional bytes are inserted by the compiler to align the data elements to the word size of the CPU. In this case, two bytes of padding are added after char a[2], and another two bytes are added after short s. Compiling the same program on a 64-bit machine (LP64, or Long-Pointer 64 bits) results in struct sample occupying 40 bytes. In this case, six bytes are added after char a[2], four bytes after int i, and six bytes after short s.

Pahole was developed to track down such holes created by word-size alignment. To analyze the object files, the source must be compiled with the debugging flag "-g". In the kernel, this is activated by CONFIG_DEBUG_INFO, or "Kernel Hacking > Compile the kernel with debug info".

Analyzing the object file generated by the program with struct sample on an i386 machine results in:

 i386$ pahole sizes.o
 struct sample {
         char             c[2];       /*     0     2 */

         /* XXX 2 bytes hole, try to pack */

         long int         l;          /*     4     4 */
         int              i;          /*     8     4 */
         void *           p;          /*    12     4 */
         short int        s;          /*    16     2 */

         /* size: 20, cachelines: 1, members: 5 */
         /* sum members: 16, holes: 1, sum holes: 2 */
         /* padding: 2 */
         /* last cacheline: 20 bytes */
 };

Each data element of the structure has two numbers listed in C-style comments. The first number represents the offset of the data element from the start of the structure and the second number represents the size in bytes. At the end of the structure, pahole summarizes the details of the size and the holes in the structure.

Similarly, analyzing the object file generated by the program with struct sample on an x86_64 machine results in:

 x86_64$ pahole sizes.o
 struct sample {
         char             c[2];       /*     0     2 */

         /* XXX 6 bytes hole, try to pack */

         long int         l;          /*     8     8 */
         int              i;          /*    16     4 */

         /* XXX 4 bytes hole, try to pack */

         void *           p;          /*    24     8 */
         short int        s;          /*    32     2 */

         /* size: 40, cachelines: 1, members: 5 */
         /* sum members: 24, holes: 2, sum holes: 10 */
         /* padding: 6 */
         /* last cacheline: 40 bytes */
 };

Notice that there is a new hole after int i, which was not present in the object compiled for the 32-bit machine. Source code developed on i386 but compiled on x86_64 may waste more space because of such alignment problems: long and pointer have grown to eight bytes, while int remains four bytes wide. Neglecting to restructure data structures is a common mistake developers make when porting applications from i386 to x86_64. The result is a larger memory footprint than expected. A larger data structure also leads to more cacheline reads than necessary, and hence to lower performance.

Pahole can suggest a more compact layout by reorganizing the data elements of a structure when given the --reorganize option. It also accepts an optional --show_reorg_steps flag to show the steps taken to compress the data structure.

 x86_64$ pahole --show_reorg_steps --reorganize -C sample sizes.o
 /* Moving 'i' from after 'l' to after 'c' */
 struct sample {
         char             c[2];       /*     0     2 */

         /* XXX 2 bytes hole, try to pack */

         int              i;          /*     4     4 */
         long int         l;          /*     8     8 */
         void *           p;          /*    16     8 */
         short int        s;          /*    24     2 */

         /* size: 32, cachelines: 1, members: 5 */
         /* sum members: 24, holes: 1, sum holes: 2 */
         /* padding: 6 */
         /* last cacheline: 32 bytes */
 }

 /* Moving 's' from after 'p' to after 'c' */
 struct sample {
         char             c[2];       /*     0     2 */
         short int        s;          /*     2     2 */
         int              i;          /*     4     4 */
         long int         l;          /*     8     8 */
         void *           p;          /*    16     8 */

         /* size: 24, cachelines: 1, members: 5 */
         /* last cacheline: 24 bytes */
 }

 /* Final reorganized struct: */
 struct sample {
         char             c[2];       /*     0     2 */
         short int        s;          /*     2     2 */
         int              i;          /*     4     4 */
         long int         l;          /*     8     8 */
         void *           p;          /*    16     8 */

         /* size: 24, cachelines: 1, members: 5 */
         /* last cacheline: 24 bytes */
 };   /* saved 16 bytes! */

The --reorganize algorithm tries to compact the structure by moving data elements from the end of the struct to fill holes. It also makes an attempt to move the padding at the end of the struct. Pahole demotes bit fields to a smaller basic type when the type being used has more bits than required by the element in the bit field. For example, int flag:1 will be demoted to char.

Being over-zealous in compacting a data structure can sometimes reduce performance. A write to one data element invalidates the whole cacheline, including any other data elements in that cacheline that are only being read. So, some structure members are marked with ____cacheline_aligned in order to force them to start at the beginning of a fresh cacheline. An example of a structure that uses ____cacheline_aligned, from drivers/net/e100.c, is:

 struct nic {
         /* Begin: frequently used values: keep adjacent for cache
          * effect */
         u32 msg_enable ____cacheline_aligned;
         struct net_device *netdev;
         struct pci_dev *pdev;

         struct rx *rxs ____cacheline_aligned;
         struct rx *rx_to_use;
         struct rx *rx_to_clean;
         struct rfd blank_rfd;
         enum ru_state ru_running;

         spinlock_t cb_lock ____cacheline_aligned;
         spinlock_t cmd_lock;
 <output snipped>

Analyzing the nic structure with pahole shows holes just before each cacheline boundary, after the data elements that precede rxs and cb_lock.

 x86_64$ pahole -C nic /space/kernels/linux-2.6/drivers/net/e100.o
 struct nic {
         u32                  msg_enable;      /*     0     4 */

         /* XXX 4 bytes hole, try to pack */

         struct net_device *  netdev;          /*     8     8 */
         struct pci_dev *     pdev;            /*    16     8 */

         /* XXX 40 bytes hole, try to pack */

         /* --- cacheline 1 boundary (64 bytes) --- */
         struct rx *          rxs;             /*    64     8 */
         struct rx *          rx_to_use;       /*    72     8 */
         struct rx *          rx_to_clean;     /*    80     8 */
         struct rfd           blank_rfd;       /*    88    16 */
         enum ru_state        ru_running;      /*   104     4 */

         /* XXX 20 bytes hole, try to pack */

         /* --- cacheline 2 boundary (128 bytes) --- */
         spinlock_t           cb_lock;         /*   128     4 */
         spinlock_t           cmd_lock;        /*   132     4 */
 <output snipped>

Besides finding holes, pahole can be used to find the data field sitting at a particular offset from the start of a data structure. Pahole can also list the sizes of all the data structures:

 x86_64$ pahole --sizes linux-2.6/vmlinux | sort -k3 -nr | head -5
 tty_struct      1328    10
 vc_data         432     9
 request_queue   2272    8
 net_device      1536    8
 mddev_s         792     8

The first field is the name of the data structure, the second is its current size, and the final field is the number of holes present in the structure.

Similarly, to get a summary of the data structures that can be packed to reduce their size:

 x86_64$ pahole --packable sizes.o
 sample  40      24      16

The first field represents the data structure, the second represents the current size, the third represents the packed size and the fourth field represents the total number of bytes saved by packing the holes.

Pfunct

The pfunct tool shows aspects of the functions in the object code. It can show the number of goto labels used, the number of parameters to each function, the size of the functions, and so on. Its most popular use, however, is finding functions that were declared inline but not inlined, or functions that were not declared inline but were inlined anyway. The compiler optimizes functions by inlining or uninlining them, depending on factors such as their size. The --cc_inlined option lists functions that the compiler chose to inline on its own:

 x86_64$ pfunct --cc_inlined linux-2.6/vmlinux | tail -5
 run_init_process
 do_initcalls
 zap_identity_mappings
 clear_bss
 copy_bootdata

The compiler may also choose to uninline functions which have been specifically declared inline. There can be several reasons for this, such as recursive functions, for which inlining would cause infinite regress. pfunct --cc_uninlined shows functions which are declared inline but have been uninlined by the compiler. Such functions are good candidates for a second look, or for removing the inline declaration altogether. Fortunately, pfunct --cc_uninlined on vmlinux (only) did not list any functions.

Debug Info

The utilities rely on the debug_info section of the object file, which is generated when the source code is compiled with the debug option. They understand the DWARF standard and the Compact C-Type Format (CTF), both of which are common debugging file formats. Gcc uses the DWARF format.

The debugging data is organized under the debug_info section of the ELF (Executable and Linkable Format) file, in the form of tags with values, representing things such as variables and function parameters, arranged in a nested hierarchy. To read the raw information, you may use readelf, provided by binutils, or eu-readelf, provided by elfutils. Most distributions do not compile their packages with debugging information because it tends to make the binaries quite large. Instead, they ship this information in separate debuginfo packages, which can be analyzed with these tools or with gdb.

The utilities discussed in this article were initially developed to analyze kernel object files. However, they are not limited to kernel object files and can be used with any user-space program built with debugging information. The source code of the pahole utilities is maintained at git://git.kernel.org/pub/scm/linux/kernel/git/acme/pahole.git. More information about pahole and the other utilities for analyzing debug object files can be found in the PDF about 7 dwarves.