Structuring your program to consider memory can improve performance. Wesley Maness and Richard Reich demonstrate this with a producer–consumer queue.
In this paper, we explore cache-line friendly data structures, in particular queues built with atomics for use in multi-threaded environments. We will illustrate our topic with a real-world use case that is not cache-line aware, measure it, incorporate our suggestions, measure again, and finally review our results. Before we get into the nuts and bolts of our data structures, we need to define a few terms, followed by some examples. For those readers who are not familiar with the topics of NUMA and CPU cache, we highly recommend reviewing them (at [Wikipedia-1] and [Wikipedia-2]).
Jitter can be defined as the variance in timing of operations that can arguably be expected to take constant time.
A few examples to better illustrate are:
- The wall clock on a system, in which each ‘tick’ of the highest-precision unit has some variance. Back in 2012, we measured the wall clock on the best server we had at the time. It was accurate down to 1 microsecond, yet it had jitter of ±1.5 microseconds. Over longer periods the clock remains accurate, but each individual tick jitters.
- When we are linearly accessing memory in a system, each access takes roughly constant time until you hit a page fault. That page fault introduces delay. In this case, a nominally constant-time operation ends up having predictable jitter.
- Sparsely accessing an array can produce many cache misses as well as cache hits. Depending on the location and layout of the memory, what should be a constant-time access will vary depending on:
- Is the memory in the same NUMA node? [Tsang17]
- Is the memory located in the cache?
- Which cache is the memory located in? (L3, L2, L1)
- Is the memory in the current cache line?
- Is the memory contended with another CPU socket/core?
Many of the above are just examples and are not investigated within the scope of this paper; many can also be mitigated by prefetching.
The cache line is the smallest unit of RAM the CPU can load to perform operations. On current Intel CPUs, this is 64 bytes, or 8 pointers on a 64-bit operating system. Many cores reading from the same cache line causes no performance issues, provided nothing writes to that line. If at least one core is writing, however, cache coherency forces the dirty line to be synchronized between the cores on every write.
Cache awareness really comes down to structuring the memory layout (the memory model) of your program and its data structures. Careful consideration of what memory goes where can significantly improve the performance of the resulting machine code, though any improvement must be verified by measuring.
For example, if we look at the Intel Core i7 [Levinthal09], we can see its access latencies: L1 is 4 CPU cycles, L2 is 11 CPU cycles and L3 is 30-40 CPU cycles. Main memory can range from 200-350 CPU cycles on a 3GHz system. Crossing NUMA nodes incurs even larger penalties. Code that sparsely accesses memory, incurring many cache misses, can spend 95% of its time doing nothing!
In previous sections, we hinted at why we would want to be cache aware; here we will explain in a bit more detail, and then consider likely future hardware progression and its impact.
From a multi-threading perspective, we need to be sure that data independently accessed by different threads are not shared over a cache line. Sharing will cause all reading threads to stall while the dirty cache line is synchronized across the cores. This is compounded if one or more of the threads exist on a different NUMA node.
The most direct benefit of fully independent data between threads not sharing cache lines is linear scalability. Some of the most obvious costs are:
- Increased code complexity due to increased complexity of the memory model.
- If each thread is accessing small amounts of data, some of the space in each cache line may not be utilized.
- Because of the above, we may need more memory to force alignment. If memory is limited, this could require special consideration.
- Perhaps there are multiple copies of the same data in various locations resulting in less efficient usage of memory.
Each socket is being packed with more and more cores, and FPGA integration into Intel CPUs is on the horizon (as this is written) [Intel] [Patrizio17]. This makes it ever more important to model memory so that independently accessed data is separated at the cache-line level (avoiding false sharing [Bolosky93]), and the impact will only grow as cores and specialized hardware compete for resources.
Next, we will demonstrate the necessity of cache-line awareness with a modern-day use case that would benefit nicely from such consideration.
Common use case
There exist situations in technology where we find ourselves having to set up and use a shared environment with many agents performing various coordinating tasks. In this use case, we are given a machine with a single multi-producer multi-consumer (MPMC) queue instance shared amongst a set of producers and consumers. The producers, we can assume, are clients running some computations and generating work, or partial work; each client runs a different set of computations and publishes its resulting work items to the MPMC queue for later consumption. Each consumer is responsible for taking work off the queue, perhaps performing some post-checks or validation of an item, then dispatching that work to some client which will process it or make some determination. Each consumer and producer runs in its own thread and could potentially be pinned to any CPU. Depending on what is required, it is quite reasonable that the number of producers and consumers would need to scale up or down while clients and/or workloads are running (some can go on and off line, driven by various external events). The most critical measurement to consider here is the time it takes to push a work item onto the queue and for that work item to be removed from the queue. We consider other types of measurement later.
Running the benchmark
To control whether or not the data is cache-line aligned, we use the alignas specifier for our data. The struct that we will be using is shown in Listing 1. We will store the number of cycles in the struct along with some other data fields. How we store these is shown in Listing 2 (not shown is the GetTravelStore method, which is the same as GetStore). We used boost::lockfree::queue for testing the aligned case, and we modified its implementation to disable alignment (not shown) for our baseline numbers.
In establishing our baseline numbers, we will consider various scenarios, i.e. ratios of producers to consumers; currently we only display results for the 2:2 ratio. Each producer and each consumer spins in a busy loop to minimize the cycles required to publish and/or consume data. The code for the producer and consumer is shown in Listing 3. The total number of items produced in each scenario is scaled to the number of consumers. In these examples, we use standard C++ atomics (as part of the MPMC queue) and spin on their values to identify when data is available. The producers and consumers are constructed in the run method shown in Listing 4.
Finally, for our benchmark without cache-line awareness, we put all of this together:
run<Alignment<Benchmark, alignof(Benchmark)>, boost::lockfree::bad_queue>
We will pin each thread created to its very own CPU to reduce overall variance in measurements. This is done in the method setAffinity, mentioned in the code examples but not shown. You can find more information about thread affinity in the Linux man pages [Kerrisk]. We will measure push, pop, and total queue traversal time (pop end time – push start time) along with various other metrics. These results are shown in Figure 1. All measurements are in cycles (using the RDTSCP and CPUID instructions), following the guidelines recommended by Intel [Paoloni10]. These measurements are computed inside the getcc_e functions (not shown).
Applying cache-line awareness to our example
Now that we have identified how we are going to take advantage of cache-line awareness in our queue, we will re-run our previous tests with the changes we just applied. We will now enable cache-line awareness by using the alignas specifier for our data. The code used here is:
run<Alignment<Benchmark, 64>, boost::lockfree::queue>(producers, consumers);
We will run through the same scenarios as we saw in Figure 1. These results are shown in Figure 2.
Note: If the measurements are not clear in Figure 2, the max measurements for Push, Pop, and Traversal are observed as 22977, 20221, and 18102 respectively.
Comparing where we started with where we ended up, several points are worth noting. The first is that the distribution of the percentiles in Figure 1 clearly shows fat-tail properties, most noticeably in the traversal measurements. Another striking point is that the 90th and 99th percentiles dropped considerably for all operations. Push went from 518 cycles to 431 and from 715 cycles to 575 for the 90th and 99th percentiles. Pop, not quite as impressive as Push, went from 414 cycles to 370 and from 623 cycles to 500 respectively. Traversal was even more impressive, going from 810236 cycles to 12307 and from 8497191 cycles to 15571 for the 90th and 99th percentiles respectively.
Conclusion and future direction
There are many items we didn’t address in this paper. For example, we did not discuss data throughput through the queue as a function of the cache-line awareness changes. We could also have measured different scenarios and ratios of producers to consumers. It is important to note that the problem addressed in this paper is just one instance of a group of problems collectively known as flow control problems. These problems exist in many domains; they are quite common in some aspects of financial technology, but have relevance in many others. We mention these points to illustrate ways in which this problem can be expanded upon and where further analysis might continue.
Finally, as we were writing this paper and running measurements of various scenarios, the results were in many ways quite interesting and often initially appeared counter-intuitive; after carefully examining the assembly and measuring the performance more closely, the results made more sense. We can’t stress enough how important it is to measure both your applications and the tools used to measure those applications.
Boost 1.63.0 and GCC 5.4.0 were used. For a complete package of the code used in this article, including that which is referenced but not shown, please contact the authors directly. All measurements are the average of 1000 runs of 1e6 iterations with 2 consumer and 2 producer threads. An Intel i7-3610QM CPU at 2.30GHz (4 cores per socket, 8 logical CPUs in total) was used to produce all measurements discussed in the paper. The operating system was Linux 4.12.5-gentoo.
[Bolosky93] William Bolosky and Michael Scott (1993) ‘False Sharing and its Effect on Shared Memory Performance’, originally published in Proceedings of the USENIX SEDMS IV Conference Sept 22–23 1993, available at http://static.usenix.org/publications/library/proceedings/sedms4/full_papers/bolosky.txt
[Intel] ‘The Power of FPGAs’ available at https://www.intel.com/content/www/us/en/fpga/solutions.html
[Kerrisk] Michael Kerrisk (maintainer), Linux man pages online, http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html
[Levinthal09] David Levinthal PhD (2009) ‘Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors’ at https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[Paoloni10] Gabriele Paoloni (2010) How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures, published by Intel Corporation, https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
[Patrizio17] Andy Patrizio (2017) ‘Intel plans hybrid CPU-FPGA chips’, posted 5 October 2017 at https://www.networkworld.com/article/3230929/data-center/intel-unveils-hybrid-cpu-fpga-plans.html
[Tsang17] Stanley Tsang (2017) ‘Real-Time NUMA Node Performance Analysis Using Intel Performance Monitor’ at http://www.acceleware.com/blog/real-time-NUMA-node-performance-analysis-using-intel-performance-counter-monitor
[Wikipedia-1] ‘CPU cache’ at https://en.wikipedia.org/wiki/CPU_cache
[Wikipedia-2] ‘Non-uniform memory access’ at https://en.wikipedia.org/wiki/Non-uniform_memory_access
Stuck on a problem? Frances Buontempo considers where to turn to for inspiration.
I started to clear my desk and found a cue card with the words “Lead by example” written on it, which distracted me from writing an editorial. This card is from Seb Rose’s closing keynote ‘Learning to Walk again’ at this year’s ACCU conference [Rose18]. The idea was to write down something you had learnt from the conference on a card and give the card, along with contact details, to someone so they could later remind you what you wrote. I failed to comply with the instructions precisely, since I still have my own card, but it did jog my memory, though not enough to recall what had taught me this. It could have been Arne Mertz’s ‘Code review’ session [Mertz18]. You cannot expect to get away with telling others how they can improve their code if you don’t follow your own suggestions. As a mentor, I encouraged new programmers to add unit tests to their code and put it all in version control, but frequently noticed I had strategic scripts with no tests at all and some not even in version control. For shame! Lead by example. Of course, it’s not just unit tests. I complain when others don’t keep a diary, but don’t always write appointments in my own diary. Far too much ‘Do as I say, not as I do.’
OK, what does lead by example mean? In one sense, leading by example contrasts with trying to bully people into doing things your way. How do you persuade people to adopt your approach? I was recently asked this at an interview. If faced with a task that might take a day, and two possible implementations, it’s surely worth knocking together both alternatives rather than spending three days arguing over which is best? You do not always need to make everyone go along with one idea. Sometimes it matters, sometimes it doesn’t. Tempers can get frayed if people don’t see things your way, and I value working software over proving my idea is the best. A recent blog post [DestroyAllSoftware] broke down a response from Linus Torvalds on union aliasing. Torvalds’ language was inflammatory and unkind. The blog post showed an alternative way of making the same points without being so hateful. You can disagree with people without resorting to bullying by telling them they are brain-dead or worse. I can recall many times when I’ve categorically told someone they were stupid and didn’t know what they were doing. I believe I have stopped doing this now. Pull me up if you notice me being a bully. Taking the lead by bullying others into submission is not a good idea. What are the alternatives?
To encourage the adoption of a new approach to a problem, instead of arguing or laying down the law, you may be able to knock together a prototype showing the alternative works. Sometimes the better tech wins, so giving people alternatives allowing them to try out your new ideas can be more persuasive than banning older tool-chains or similar. If you want to change an API, try adding new functions and gradually deprecating the older ones as people stop using them. The strangler vine pattern, or strangler application [Fowler04], describes ways to ensure a new approach strangles or kills off the old way, drawing an analogy with strangler vines, which grow over other plants. It contrasts with a complete re-write, allowing new approaches to live side by side with old approaches, at least for a while. This probably doesn’t count as leadership per se, but does give alternative paths. It is more ‘suggest by alternatives’ than ‘lead by example’, though it incorporates the nub of the idea: have a demonstration or example to make your point.
Any prototype is certainly an example, though whether this counts as leading is another matter. In the sense of pioneering, or going out in front, it surely does? Sometimes attempting to trail-blaze leaves you more a lone lunatic in no-man’s land than a leader. You need a level of self-confidence to be able to demonstrate an idea or working example without wanting to hide under a blanket or behind a sofa. I hope we manage to encourage Overload writers, even if we don’t always agree with everything that gets written. That’s ok. Thank you for sharing your ideas with us. An article is often an example, sometimes a novel idea even, which can lead readers to try new things, learn or even write in to disagree. That’s ok too. However, an article as a way of leading by example is probably not what my cue card meant.
What did I mean? I am not sure. It sounds like sensible advice, but I am unclear what it really means, as is often the case with slightly trite phrases. Furthermore, it begs the question: should I lead by example? Perhaps this is straying near Betteridge’s law of headlines: ‘Any headline that ends in a question mark can be answered by the word no’. [Wikipedia-1]. When used in a headline, a question is often sensationalist, in order to drum up an audience, something I suspect my title is unlikely to achieve. One website [BetteridgeLaw] has examples filling over two thousand pages. I can’t vouch for the veracity of any of those, though many are hilarious. ‘Does Bill Gates still know what computer users want?’ and so on. I’ll leave you to explore.
You may not aspire to be a leader, but can find yourself out in front from time to time. I sometimes walk quicker than others, finding myself ahead when walking with friends or family to an unfamiliar location, having just said, ‘I’ll follow you.’ It is difficult to follow if you are in front. Perhaps you find yourself reluctantly in front, being the first person to try to make a new technology work, or resurrect some old code no one knows how to build. Bad luck. In that situation, you are unlikely to have examples to follow. You can make sure you put the code in version control, add some kind of tests or at least ways of spotting regressions. You can add a make file, or other build script, once you have figured out how to build the code. A short readme file is a good idea too, even if it says little more than ‘type run_tests’. In a sense, this is leading by example because you have improved the situation, just quietly in the background, without needing long meetings to decide what approaches to take. You might need these too, but at least the fundamental parts are in place and you have recorded what you spent time discovering in a simple and clear format. Perhaps that’s all I meant. Instead of moaning about the state of the world, or the project, or codebase, step up and make the changes needed. This might need to be in a non-invasive way, so people can still email themselves files, write word documents and have meetings if they want to. Meanwhile the code builds and runs. And crashes. But that’s another story.
This nuance is leading in the sense of forging ahead and getting stuff done. It’s not leadership in the sense of an authoritarian head of state or someone guiding or conducting a process, team or project. A leader can also be a front page news splash or similar. Something at the front, in your face, trying to stir up discussion. An editorial of sorts. The etymology of the ending -ship might trace to the Dutch for cut or hack [Etymology], I presume along the lines of the essence of something rather than thrown together or a newspaper hack. Such a pen for hire is not to be confused with relatively recent phone hacking scandals by News International [Wikipedia-2]. In a sense, the hack makes a path through something or in a direction. Any example gives a hint of how to do something or the direction to take. An entrepreneur or pioneer may take a lead, one dealing with the enterprisey requirements to form a business, the other possibly being more like a lone ranger going off in front, perhaps alone. Do such people lead by example? They lead. The best team leads I have ever worked with led from behind. They were happy to take a back seat and enable the developers. The Harvard Business Review attributes this to Nelson Mandela [Hill04], equating a leader with a shepherd who “stays behind the flock, letting the most nimble go out ahead, whereupon the others follow, not realizing that all along they are being directed from behind.”
Are there leaders, whether team leads at work or from other realms, you admire? I suspect each of us can think of at least one person who seems to have a knack of getting things done in an effective manner. I have a few people I bring to mind when I get stuck on various problems. For mathematics, I often wonder what my Dad would have done. For some coding problems, I wonder what specific coders I know would do. I won’t name and embarrass anyone, but I’ve met many such people through the ACCU. The meme, ‘What would [insert name here] do?’ has run for a long time. As with many memes or inspirational sayings, the question is a cliché. Commonplace sayings become clichés because they capture the heart of a common idea or experience, which rings true for many people. They can give an accurate encapsulation of an idea, or an example to put out in front. I have caught myself a few times thinking a project is going badly wrong and, rather than asking ‘What would XXX do?’, I start asking myself what I want to do. What would I do, if I were leading this project? What would I do if this were a personal project? That’s why I have managed to sneak in a make file and a way to run some regression tests on my current project. I’m not suggesting you use me as a fine example of how to solve any problems. I am asking if you have some self-belief. If you’re suffering from a confidence nose-dive, be kind to yourself. Remind yourself what you are good at, or at least enjoy. Spend time listening to people you admire, or reading what they’ve written: articles, code or blogs. Or stories. At least get to a point where you can reflect on the bigger picture and get what you think clear. That might be no more than deciding you are stuck and haven’t got a clue what to do.
If you’re not sure how to tackle a problem, do consider asking yourself what XXX would do. Not necessarily Vin Diesel.1 Find an example to lead you through your problem. I think I have now devolved into giving myself advice or suggestions using a few suspiciously platitudinous phrases. Thanks for listening. I may have intended to lead by example, but now wonder if perhaps I should re-word my cue card to say “Hack by cliché.” Whichever you think is most appropriate, take a moment to ask yourself, what would Linus do? What would Bjarne do? What should I do? Be the person you want to be. Don’t be mean. Lead by example, or hack by cliché.
[Etymology] -ship in the Online Etymology Dictionary at https://www.etymonline.com/word/-ship?ref=etymonline_crossreferen
[Fowler04] Martin Fowler (2004) ‘Strangler Application’, at: https://www.martinfowler.com/bliki/StranglerApplication.html
[Hill04] Linda Hill (2010) ‘Leading from Behind’ in the Harvard Business Review, https://hbr.org/2010/05/leading-from-behind
[Mertz18] Arne Mertz (2018) ‘Code Reviews – Why, what and how’ https://isocpp.org/blog/2018/01/code-reviews-why-what-and-how-arne-mertz (not recorded at the ACCU conference)
[Rose18] Seb Rose (2018) ‘Software development – learning to walk again’ from the ACCU18 conference, available at https://www.youtube.com/watch?v=iFwm-_04rLg
[Wikipedia-1] ‘Betteridge's law of headlines’, https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines
[Wikipedia-2] ‘News International phone hacking scandal’, at https://en.wikipedia.org/wiki/News_International_phone_hacking_scandal