Why is My Perfectly Good Shellcode Not Working?: Cache Coherency on MIPS and ARM

Picture: gdb showing nonsensical crashes
To set the scene: You found a stack buffer overflow, wrote your shellcode to an executable heap or stack, and used your overflow to direct the instruction pointer to the address of your shellcode. Yet your shellcode behaves inconsistently and crashes frequently, and core dumps show the processor jumped to an address halfway through your shellcode, seemingly without executing the first half. The symptoms haven’t helped diagnose the problem; they’ve left you more confused.

You’ve tried everything: changing the size of the buffer, page-aligning your code, even waiting extra cycles. Your code is still broken. When you turn on debug mode for the target process, or step through with a debugger, it works perfectly, but that isn’t good enough. Your code doesn’t self-modify, so you shouldn’t have to worry about cache coherency, right?

Picture: We accessed a root console via UART
That’s what happened to us on MIPS when we exploited a TP-Link router. In order to save time, we added a series of NOPs from the beginning of the shellcode buffer to where the processor often “jumped,” and put the issue in the queue to explore later. We encountered a similar problem on ARM when we exploited Devil’s Ivy on an ARM chip. We circumvented the problem by not using self-modifying shellcode, and logged the issue so we could follow up later.

Since we finished exploring lateral attacks, the research team has taken some time to dig into the shellcoding oddities that puzzled us earlier, and we’d like to share what we've learned.

Picture: Overview of MIPS caches
Our MIPS shellcode did not self-modify, but it ran afoul of cache coherency anyway. MIPS maintains two caches, a data cache and an instruction cache. These caches are designed to increase the speed of memory access by conducting reads and writes to main memory asynchronously. The caches are completely separate: MIPS writes data to the data cache and instructions to the instruction cache. To save time, the running process pulls instructions and data from the caches rather than from main memory. When a value is not available from the cache, the processor syncs the cache with main memory before the process tries again.

When the TP-Link’s MIPS processor wrote our shellcode to the executable heap, it only wrote the shellcode to the data cache, not to main memory. Modified areas in the data cache are marked for later syncing with main memory. However, although the heap was marked executable, the processor didn’t automatically recognize our bytes as code and never updated the instruction cache with our new values. What’s more, even if the instruction cache synced with main memory before our code ran, it still wouldn’t have received our values because they had not yet been written from the data cache to main memory. Before our shellcode could run, it needed to move from the data cache to the instruction cache, by way of main memory, and that wasn't happening.

This explained the strange crashes. After our stack buffer overflow overwrote the stored return address with our shellcode address, the processor directed execution to the correct location because the return address was data. However, it executed the old instructions that still occupied the instruction cache, rather than the ones we had recently written to the data cache. The buffer had previously been filled mostly by zeros, which MIPS interprets as NOPs. Core dumps showed an apparent “jump” to the middle of our shellcode because the processor loaded our values just before, or during, generating the core dump. The processor hadn't synced because it assumed that the instructions that had been at that location would still be at that location, a reasonable assumption given that code does not usually change mid-execution. There are legitimate reasons for modifying code (most importantly, every time a new process loads), so chip manufacturers generally provide ways to flush the data and instruction cache.
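The stale-fetch behavior described above can be sketched with a toy model. This is a minimal illustration, not a model of any real MIPS core: it assumes a single word of main memory, a one-line write-back data cache, and a one-line instruction cache. The names `ToyCpu`, `toy_store`, `toy_fetch`, `toy_dcache_writeback`, and `toy_icache_invalidate` are hypothetical, and `0xDEADBEEF` is a placeholder standing in for a shellcode word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one word of main memory shadowed by two separate caches. */
typedef struct {
    uint32_t ram;      /* main memory contents             */
    uint32_t dcache;   /* data-cache copy                  */
    bool     d_valid;  /* data-cache line holds a value    */
    bool     d_dirty;  /* value not yet written to ram     */
    uint32_t icache;   /* instruction-cache copy           */
    bool     i_valid;  /* instruction-cache line is filled */
} ToyCpu;

/* A store lands in the data cache only (write-back policy). */
void toy_store(ToyCpu *c, uint32_t value) {
    c->dcache  = value;
    c->d_valid = true;
    c->d_dirty = true;
}

/* An instruction fetch consults only the instruction cache. On a miss it
 * refills from main memory, never from the data cache. */
uint32_t toy_fetch(ToyCpu *c) {
    if (!c->i_valid) {
        c->icache  = c->ram;
        c->i_valid = true;
    }
    return c->icache;
}

/* Push a dirty data-cache line out to main memory. */
void toy_dcache_writeback(ToyCpu *c) {
    if (c->d_valid && c->d_dirty) {
        c->ram     = c->dcache;
        c->d_dirty = false;
    }
}

/* Drop the instruction-cache line so the next fetch refills from memory. */
void toy_icache_invalidate(ToyCpu *c) {
    c->i_valid = false;
}
```

Starting from a zeroed `ToyCpu`, a fetch caches the old zero word (a MIPS NOP). After `toy_store(&cpu, 0xDEADBEEF)`, `toy_fetch` still returns zero, and invalidating the instruction cache alone does not help because main memory is still stale. Only a write-back followed by an invalidate makes the new word visible to the fetch path.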

One easy way to cause a data cache write to main memory is to call sleep(), a well-known strategy that suspends the calling process for a specified period of time. Originally our ROP chain consisted of only two addresses: one to calculate the address of the shellcode buffer from two registers we controlled on the stack, and the next to jump to the calculated address.

To call sleep() we inserted two addresses before the original ROP chain. The first code snippet set $a0 to 1. $a0 holds the first argument to sleep() and tells it how many seconds to sleep. This code also loaded the registers $ra and $s0 from the stack, returning to the value we placed on the stack for $ra.


Setting up call to sleep()

The next code snippet called sleep(). Since sleep() returned to the return address passed into the function, we needed the return address to be something we controlled. We found a location that loaded the return address from the stack and then jumped to a register. We were pleased to find the code snippet below, which transfers the value in $s1, which we had set to the address of sleep(), into $t9 and then calls $t9 after loading $ra from the stack.


Calling sleep()

From there, we executed the rest of the ROP chain and finally achieved consistent execution of our exploit.

Read on for more details about syncing the MIPS cache and why calling sleep() works, or scroll down for a discussion of ARM cache coherency problems.

Most of the time when we talk about syncing data, we're trying to avoid race conditions between two entities sharing a data buffer. That is, at a high level, the problem we encountered: essentially a race condition between syncing our shellcode and executing it. If syncing won, the code would work; if execution won, it would fail. Because syncing is time-consuming, the caches do not sync frequently, and we almost always lost this race. According to the MIPS Software Training materials (PDF) on caches, whenever we write instructions that the OS would normally write, we need to make the data cache and main memory coherent and then mark the area containing the old instructions in the instruction cache invalid, which is what the OS does every time it loads a new process into memory.

The data and instruction caches store between 8 and 64KB of values, depending on the MIPS processor. The instruction cache will sync with main memory if the processor encounters a syncing instruction, if execution is directed to a location outside the bounds of what is stored in the instruction cache, or after cache initialization. With a jump to the heap from a library more than a page away, we can be fairly certain that the values there will not be in the instruction cache, but we still need to write the data cache to main memory.

We learned from devttys0 that sleep() would sync the caches. We tried it out and our shellcode worked! We also learned about another option from emaze: calling cacheflush() from libc will more precisely flush the area of memory that you require. However, it requires the address, number of bytes, and cache to be flushed, which is difficult from ROP. Because calling sleep(), with its single argument, was far easier, we dug a little deeper to find out why it's so effective.
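For comparison with the ROP approach, this is roughly what the cacheflush() call looks like from ordinary C, where its three arguments are easy to supply. It is a sketch under stated assumptions: the wrapper name `sync_code_range` is hypothetical, the MIPS branch relies on the Linux `cacheflush()` declaration and the `BCACHE` constant from `<sys/cachectl.h>`, and on other targets it falls back to the GCC/Clang builtin `__builtin___clear_cache` so the file still builds.

```c
#include <stddef.h>

#if defined(__mips__) && defined(__linux__)
#include <sys/cachectl.h>   /* cacheflush(), ICACHE, DCACHE, BCACHE */
#endif

/* Make freshly written bytes in [start, start + len) visible to the
 * instruction fetch path. Returns 0 on success, -1 on failure. */
int sync_code_range(void *start, size_t len) {
#if defined(__mips__) && defined(__linux__)
    /* BCACHE asks for both the data and the instruction cache. */
    return cacheflush(start, (int)len, BCACHE);
#else
    /* Compiler builtin; it expands to whatever the target needs,
     * which is nothing at all on x86. */
    __builtin___clear_cache((char *)start, (char *)start + len);
    return 0;
#endif
}
```

The three arguments (address, byte count, cache selector) are exactly the values that are awkward to place in registers from a ROP chain, which is why a single-argument call such as sleep() is attractive there.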

During sleep, a process or thread gives up its allotted time and yields execution to the next scheduled process. However, a context switch on MIPS does not necessitate a cache flush. On older chips it may, but on modern MIPS instruction cache architectures, cached addresses are tagged with an ID corresponding to the process they belong to, resulting in those addresses staying in cache rather than slowing down the context switch process any further. Without these IDs, the processor would have to sync the caches during every context switch, which would make context switching even more expensive. So how did sleep() trigger a data cache write back to main memory?

The two ways data caches are designed to write to main memory are write-back and write-through. Write-through means every memory modification triggers a write out to main memory and the appropriate cache. This ensures data from the cache will not be lost, but greatly slows down processing speed. The other method is write-back, where data is written only to the copy in the cache, and the subsequent write to main memory is postponed for an optimal time. MIPS uses the write-back method (if it didn’t, we wouldn’t have these problems) so we need to wait until the blocks of memory in the cache containing the modified values are written to main memory. This can be triggered a few different ways.

One trigger is any Direct Memory Access (DMA). Because the processor needs to ensure that the correct bytes are in memory before access occurs, it syncs the data cache with main memory to complete any pending writes to the selected memory. Another trigger is when the data cache requires the cache blocks containing modified values for new memory. As noted before, the data cache size is at least 8KB, large enough that this should rarely happen. However, during a context switch, if the data cache requires enough new memory that it needs in-use blocks, it will trigger a write-back of modified data, moving our shellcode from the data cache to main memory.

As before, when the sleeping process woke, it caused an instruction cache miss when directing execution to our shellcode, because the address of the shellcode was far from where the processor expected to execute next. This time, our shellcode was in main memory, ready to be loaded into the instruction cache and executed.
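The eviction trigger described above can be illustrated with a small direct-mapped, write-back cache. This is a toy sketch with made-up sizes (`RAM_WORDS`, `CACHE_LINES`) and hypothetical names (`WbCache`, `wb_store`, `wb_load`), and it assumes one word per cache line. Real data caches are larger and usually set-associative, but the mechanism is the same: a dirty line reaches main memory only when something else claims its slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RAM_WORDS   64u  /* toy main memory size, in words        */
#define CACHE_LINES 4u   /* toy direct-mapped cache, 1 word/line  */

typedef struct {
    uint32_t ram[RAM_WORDS];
    struct {
        uint32_t tag;
        uint32_t data;
        bool     valid;
        bool     dirty;
    } line[CACHE_LINES];
    unsigned writebacks;  /* number of dirty lines written to ram */
} WbCache;

void wb_init(WbCache *c) {
    memset(c, 0, sizeof *c);
}

/* Make sure addr owns its cache line. A different occupant is evicted,
 * and written back to main memory first if it was dirty. */
static unsigned wb_claim(WbCache *c, unsigned addr) {
    assert(addr < RAM_WORDS);
    unsigned idx = addr % CACHE_LINES;
    unsigned tag = addr / CACHE_LINES;
    if (c->line[idx].valid && c->line[idx].tag != tag) {
        if (c->line[idx].dirty) {
            c->ram[c->line[idx].tag * CACHE_LINES + idx] = c->line[idx].data;
            c->writebacks++;
        }
        c->line[idx].valid = false;
    }
    if (!c->line[idx].valid) {
        c->line[idx].tag   = tag;
        c->line[idx].data  = c->ram[addr];
        c->line[idx].valid = true;
        c->line[idx].dirty = false;
    }
    return idx;
}

/* Store: modifies the cached copy only and marks it dirty. */
void wb_store(WbCache *c, unsigned addr, uint32_t value) {
    unsigned idx = wb_claim(c, addr);
    c->line[idx].data  = value;
    c->line[idx].dirty = true;
}

/* Load: may evict another address that maps to the same line. */
uint32_t wb_load(WbCache *c, unsigned addr) {
    return c->line[wb_claim(c, addr)].data;
}
```

In this model, storing a word at address 5 leaves `ram[5]` untouched. A later load of address 9, which maps to the same line, evicts the dirty line and only then does `ram[5]` receive the stored word. That later access plays the role of the other process that runs while ours sleeps.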

Cache coherency is a problem on ARM too. ARM maintains separate data and instruction caches, just as MIPS does. The difference is we’re far less likely to find executable heaps and stacks (which was the default on MIPS toolchains until recently). The lack of executable space ready for shellcode forces us to allocate a new buffer, copy our shellcode to it, mark it executable, and then jump to it. Using mprotect to mark a buffer executable triggers a cache flush, according to the Android Hacker’s Handbook. The section also includes an important and very helpful note.


Excerpt from Chapter 9, Separate Code and Instruction Cache, "Android Hackers Handbook"
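The allocate, copy, and mark-executable steps look roughly like this in C. It is a minimal sketch assuming a Linux-style `mmap`/`mprotect`; the function name `stage_code` is hypothetical, and the sketch deliberately stops before the final jump, since the bytes to execute are architecture specific.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy len bytes of code into a fresh anonymous mapping and flip the
 * mapping from read/write to read/execute. Returns the mapping, or NULL
 * on failure. The caller releases it with munmap(). */
void *stage_code(const unsigned char *code, size_t len) {
    if (len == 0)
        return NULL;

    /* 1. Allocate a new, writable buffer. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* 2. Copy the code in. These writes go through the data cache. */
    memcpy(buf, code, len);

    /* 3. Mark the buffer executable. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

Code that does not want to depend on mprotect() flushing the caches can follow step 3 with an explicit call such as the GCC/Clang builtin `__builtin___clear_cache(begin, end)` before jumping to the buffer.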

However, there are still times we need to sync the instruction cache on ARM, as in the case of exploiting Devil’s Ivy. We put together a ROP chain that gave us code execution and wrote self-modifying shellcode that decoded itself in place because incoming data was heavily filtered. Although we included code that we thought would sync the instruction cache, the code crashed in the strangest ways. Again, the symptoms were not even close to what we expected. We saw the processor raise a segfault while executing a perfectly good piece of shellcode, a missed register write that caused an incomprehensible crash ten lines of code later, and a socket that connected but would not transmit data. Worse yet, when we attached gdb and went through the code step by step, it worked perfectly. There was no behavior that pointed to an instruction cache issue, and nothing easy to search for help on, other than “Why isn’t my perfectly good shellcode working!?” By now you can guess what the problem was, and we did too.

If you are on ARMv7 or newer and running into odd problems, one solution is to execute data barrier and instruction cache sync instructions after you write but before you execute your new bytes, as shown below.


ARMv7+ cache syncing instructions

On ARMv6, instead of DSB and ISB, ARM provided MCR instructions to manipulate the cache. The following instructions have the same effect as DSB and ISB above, though prior to ARMv6 they were privileged and so won't work on older chips.


ARMv6 cache syncing instructions
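Expressed as C with inline assembly, the two barrier sequences might look like the sketch below. It is not a reproduction of the snippets pictured in the original post: the function name `sync_after_code_write` is hypothetical, the CP15 operations used for ARMv6 (`c7, c10, 4` for the data synchronization barrier and `c7, c5, 4` for the prefetch flush) are the ones the ARMv6 documentation lists as the predecessors of DSB and ISB, and the non-ARM fallback exists only so the file compiles elsewhere.

```c
/* Issue the barrier sequence for the architecture being compiled for and
 * return a label naming the path that was built, which is handy when
 * checking a cross-compiled binary. */
const char *sync_after_code_write(void) {
#if defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7
    /* ARMv7 and later: dedicated barrier instructions. */
    __asm__ volatile("dsb\n\t"
                     "isb"
                     ::: "memory");
    return "armv7 dsb/isb";
#elif defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH == 6
    /* ARMv6: the same barriers issued as CP15 operations (ARM state). */
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 4\n\t"  /* data sync barrier */
                     "mcr p15, 0, %0, c7, c5, 4"       /* prefetch flush    */
                     :: "r"(0) : "memory");
    return "armv6 cp15";
#else
    /* Any other target: a plain compiler-provided full memory barrier. */
    __sync_synchronize();
    return "generic barrier";
#endif
}
```

Call it after the last write to the code buffer and before the first jump into it. In ordinary (non-exploit) code on Linux, the GCC/Clang builtin `__builtin___clear_cache` is the more portable choice because it also asks for the cache maintenance itself rather than only the barriers.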

Picture: Shellcode to call sleep()
If you are too restricted by a filter to execute these instructions, as we were, neither of these solutions will work. While there are rumors about using SWI 0x9F0002 and overwriting the call number because the system interprets it as data, this method did not work for us and so we can’t recommend it (but feel free to let us know if you tried it and it worked for you).

One thing we could do is call mprotect() from libc on the modified shellcode, but an even easier thing is to call sleep() just like we did on MIPS. We ran a series of experiments and determined that calling sleep() caused the caches to sync on ARMv6.

Our shellcode was limited by a filter, so, although we were executing shellcode at this point, we took advantage of functions in libc. We found the address of sleep, but its lower byte was below the threshold of the filter. We added 0x20 (the lowest byte value allowed) to the address so it would pass through the filter, then subtracted it again in our shellcode, as shown to the right.
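The add-then-subtract trick is simple arithmetic, sketched below. The address `0x40a2b310` used in the usage note is made up for illustration, the helper names are hypothetical, and the filter is assumed to reject only bytes below 0x20; the real filter may have had other rules.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILTER_MIN 0x20u  /* assumed lowest byte value the filter accepts */

/* True if every byte of the 32-bit word is at or above the threshold. */
bool passes_filter(uint32_t word) {
    for (int i = 0; i < 4; i++) {
        if (((word >> (8 * i)) & 0xffu) < FILTER_MIN)
            return false;
    }
    return true;
}

/* What the exploit sends: the real address plus the offset. */
uint32_t encode_addr(uint32_t addr) {
    return addr + FILTER_MIN;
}

/* What the shellcode computes at run time: the real address again. */
uint32_t decode_addr(uint32_t encoded) {
    return encoded - FILTER_MIN;
}
```

With the placeholder address `0x40a2b310`, the low byte 0x10 fails the check. The encoded form `0x40a2b330` passes, and `decode_addr` restores the original value. This only works when adding the offset does not push another byte out of the allowed range, so the encoded word should be checked with `passes_filter` before use.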

Although context switches don't directly cause cache invalidation, we suspect that the next process to execute often uses enough of the instruction cache that it requires blocks belonging to the sleeping process. The technique worked well on this processor and platform, but if it doesn’t work for you, we recommend using mprotect() for higher certainty.

The way systems work in theory is not necessarily what happens in the real world. While chips have been designed to prevent additional overhead during context switches, no system runs in precisely the way it was intended.

We had fun digging into these issues. Diagnosing computer problems reminds us how difficult it can be to diagnose health conditions. Symptoms show up in a different location than their cause, like pain referred from one part of the leg to another, and simply observing the problem can change its behavior. Embedded devices were designed to be black boxes, telling us nothing and quietly going about the one task they were designed to do. With more insight into their behavior, we can begin to solve the security problems that confound us.

Just getting started in security? Check out the recent video series on the fundamentals of device security. Old hand? Try our team's research on lateral attacks, the vulnerability our ARM work was based on, and the MIPS-based router vulnerability.