User-space RCU


The user-space read-copy update (URCU) library was established in February of 2009, and is now used by several projects and available from several distributions. URCU is similar to its Linux-kernel counterpart, providing a replacement for reader-writer locking, among other uses. This similarity continues with readers not synchronizing directly with RCU updaters, thus making RCU read-side code paths exceedingly fast, while furthermore permitting RCU readers to make useful forward progress even when running concurrently with RCU updaters—and vice versa.

Although people who have worked with the Linux kernel's RCU implementation find much to be familiar with in URCU, there are some important differences, including support for several concurrent data structures. To that end, this article summarizes the similarities and discusses the differences as follows:

  1. RCU review
  2. URCU overview
  3. URCU flavors
  4. URCU library API
  5. Using URCU
  6. Summary

And then there are, of course, the inescapable answers to the quick quizzes.

Companion articles cover RCU-protected hash tables and stacks and queues.

RCU review

RCU is most frequently used to provide concurrency control for read-mostly linked data structures, so that read-side overhead is minimal. The word “minimal” is no euphemism: In some common cases, RCU's read-side overhead both in the kernel and in user space approaches zero.

Of course, if readers are to have zero overhead, then there logically cannot be any way for updaters to delay them or otherwise prevent them from running. Which in turn means that RCU updaters must handle readers traversing the data structure at all times before, during, and after the update. RCU updaters accomplish this feat by maintaining multiple versions of the data structure, waiting to free up a given old version until such time as all readers that might be referencing that version have moved on.

The following examples show how this works, first for insertion of a new data element into an RCU-protected structure, and then for deletion from an RCU-protected list. As we will see later, RCU can be applied to a great many other data structures as well.

The insertion process is shown in the following diagram, with time progressing through four states from left to right.

[RCU insertion diagram]

The first state shows a global pointer named gptr with value NULL. The first green arrow shows the updater using malloc() to arrive at the second state, in which gptr is still NULL, but where the updater's local variable tmp now references an uninitialized block of memory. The second green arrow shows the updater initializing the newly allocated block of memory, resulting in the third state, where the new structure's fields ->a, ->b, and ->c have the values 1, 2, and 3, respectively. Finally, the red arrow shows the updater assigning a pointer to the new structure to the global variable gptr using rcu_assign_pointer(). The structure is now colored red, indicating that readers might now reference it. This in turn means that any further updates to the new structure must be done carefully, taking the possibility of concurrent readers into account.
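
In code, this insertion sequence might look something like the following sketch, in which struct foo, its fields, and the insert_foo() function are assumed purely for illustration (the flavor selected by RCU_MEMBARRIER is likewise an arbitrary choice):

  #define RCU_MEMBARRIER              /* Flavor chosen arbitrarily for this sketch. */
  #include <urcu.h>
  #include <stdlib.h>

  struct foo {
      int a;
      int b;
      int c;
  };

  struct foo *gptr = NULL;            /* RCU-protected global pointer. */

  void insert_foo(void)
  {
      struct foo *tmp;

      tmp = malloc(sizeof(*tmp));     /* First transition: allocate. */
      if (tmp == NULL)
          return;                     /* Nothing was published, so nothing to undo. */
      tmp->a = 1;                     /* Second transition: initialize. */
      tmp->b = 2;
      tmp->c = 3;
      rcu_assign_pointer(gptr, tmp);  /* Final transition: publish to readers. */
  }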

Quick Quiz 1: Why use rcu_assign_pointer() rather than the simple gptr=tmp assignment statement that the C language provided for this straightforward job???

Answer

The process of deleting an element from an RCU-protected linked list is as follows, again with time progressing through four states from left to right.

[RCU deletion diagram]

The first state shows a linked list with elements A, B, and C. The first (red) arrow shows the updater removing element B from the list using cds_list_del_rcu().

Quick Quiz 2: Why cds_list_del_rcu() instead of simply list_del_rcu() like the Linux kernel does?

Answer

In the second state, old readers might still be accessing element B, but new readers have no path to it, hence the yellow color. This means that the list has two versions: old readers will see the list containing A, B, and C, while at the same time new readers will see the list containing A and C. Eventually, all the old readers still accessing element B will move to element C, but until then, element B cannot be freed. We clearly need some way to wait until all of these old readers have finished, and synchronize_rcu() does just that, waiting for any reader that was already in progress at the time that synchronize_rcu() started executing.

Quick Quiz 3: And just how is synchronize_rcu() supposed to know when all the currently executing RCU readers started? Doesn't that mean that all the RCU readers must do heavyweight operations to get the current time and store it somewhere? And doesn't this also require perfect time synchronization?

Answer

After synchronize_rcu() completes, the list will be as shown in the third column. Element B is green to indicate that readers can no longer have a reference to it, so that all readers see the list as containing only elements A and C. It is thus safe to free() element B, resulting in the final state shown on the right.
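
Putting these steps together in code, the deletion sequence might look something like the following sketch; struct foo, its list field, and the assumption that the caller already holds an update-side lock are purely illustrative:

  #define RCU_MEMBARRIER              /* Flavor chosen arbitrarily for this sketch. */
  #include <urcu.h>
  #include <urcu/rculist.h>
  #include <stdlib.h>

  struct foo {
      struct cds_list_head list;
      int key;
  };

  CDS_LIST_HEAD(foo_head);            /* Hypothetical RCU-protected list. */

  /* Remove and free one element; the caller is assumed to hold the update-side lock. */
  void delete_foo(struct foo *p)
  {
      cds_list_del_rcu(&p->list);     /* New readers can no longer find p. */
      synchronize_rcu();              /* Wait for pre-existing readers to finish. */
      free(p);                        /* Safe: no reader can still hold a reference. */
  }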

Of course, for synchronize_rcu() to do its job, there must be some way of determining where RCU readers begin and end. This job is handled by rcu_read_lock(), which marks the beginning of an RCU read-side critical section and rcu_read_unlock(), which marks the end. RCU read-side critical sections may be nested, and cannot contain synchronize_rcu() (otherwise deadlock would result).

Any line of code not in an RCU read-side critical section is termed a quiescent state, and any period of time during which each reader thread resides in at least one quiescent state is called a grace period.

Quick Quiz 4: Why the “reader” above? Doesn't every thread need to reside in a quiescent state in order for a grace period to have elapsed?

Answer

The code to traverse an RCU-protected linked list might look as follows:

  rcu_read_lock();
  cds_list_for_each_entry_rcu(p, head, list) {
      do_something_with(p->a, p->b, p->c);
  }
  rcu_read_unlock();

So what? There are any number of linked-list traversal schemes, and this probably does not look all that special. The special thing about RCU-protected list traversal is that the above code often compiles to exactly the same assembly language as does a simple single-threaded linked-list traversal but nevertheless provides full protection against concurrent modification.

Quick Quiz 5: When does RCU-protected list traversal compile to different assembly language than does simple single-threaded linked-list traversal?

Answer

More detail on RCU is available:

Given this general RCU overview, we are ready to look at URCU in particular.

URCU overview

URCU is a library that is available in a number of Linux distributions as liburcu2 (liburcu1 and liburcu0 for older versions), or that can be downloaded here. An additional liburcu-dev package is required if you wish to develop programs that use RCU. It runs on a number of architectures, including x86, ARM, PowerPC, Alpha, Itanium, Mainframe, and SPARC, and also on several operating systems, including Linux, FreeBSD, and Cygwin. It is also expected to run on Android, NetBSD, OpenBSD, and Darwin, but more testing is required before full support can be claimed.

Along with a number of concurrent data structures, URCU provides five different flavors of RCU for different types of applications:

  1. QSBR (quiescent-state-based RCU).
  2. Memory-barrier-based RCU.
  3. “Bullet-proof” RCU.
  4. Signal-based RCU.
  5. Signal-based RCU using an out-of-tree sys_membarrier() system call.

QSBR closely resembles the “classic” RCU implementations (TREE_RCU and TINY_RCU) in the Linux kernel in having zero-overhead implementations of rcu_read_lock() and rcu_read_unlock(). However, each thread must periodically invoke rcu_quiescent_state(), just as in the kernel, where schedule() must be invoked periodically. Each thread that is to execute RCU read-side critical sections must also invoke rcu_register_thread() after thread creation and rcu_unregister_thread() before thread exit. These requirements clearly put stringent constraints on the overall application design that, for example, prohibit the use of QSBR RCU in most library code, but in return, QSBR provides unmatched performance.

Memory-barrier-based RCU resembles the earliest preemptible RCU implementation accepted into the -rt patchset in that rcu_read_lock() and rcu_read_unlock() contain memory barriers (though not heavyweight atomic instructions!). This situation results in high read-side overhead by RCU standards but, in return, removes QSBR's requirement that all threads periodically invoke anything, thus making memory-barrier-based RCU usable from within library functions, at least in cases where the application registers its threads with RCU via rcu_register_thread().

Bullet-proof RCU is like memory-barrier-based RCU but, in addition, automatically invokes rcu_register_thread() when needed from the RCU read-side primitives. This further increases these primitives' overheads, but in return allows RCU to be used by libraries that are used by applications that create their own threads such that RCU cannot hook into thread creation and destruction.

Signal-based RCU removes the memory barriers from rcu_read_lock() and rcu_read_unlock(), while still allowing these primitives to be used by library functions. This reduces these primitives' overheads to near those of the QSBR RCU implementation, but requires that the user application give up a POSIX signal to be used by synchronize_rcu() in place of the read-side memory barriers. However, this approach requires that all threads, including those created by library functions, invoke rcu_register_thread() before entering their first RCU read-side critical section.

The final flavor, RCU_MEMBARRIER, uses an experimental sys_membarrier() system call in place of the POSIX signals used by the preceding approach. This system call uses inter-processor interrupts, which are lighter weight than POSIX signals. In addition, in contrast to RCU_SIGNAL, which sends a POSIX signal to every thread in the process that has been registered as an RCU reader, RCU_MEMBARRIER interrupts only those threads in the process that are currently running. This optimization works because any blocked threads already executed memory barriers in the process of blocking and will execute more when they wake up.

Each of these flavors of RCU is described in the next section.

URCU flavors

There are five flavors of RCU available from liburcu, each of which can be selected on a per-compilation-unit basis as described below.

The first flavor is QSBR, which is accessed as follows:

 #include <urcu-qsbr.h>

It is also necessary to include “-lurcu-qsbr” on the linker command line.

QSBR provides the highest performance, but requires that all reader threads periodically indicate quiescent states via either rcu_quiescent_state() for a momentary quiescent state or using rcu_thread_offline()/rcu_thread_online() pairs for extended quiescent states.

Signal-based RCU is slightly slower than QSBR, but does not require explicitly indicating quiescent states. Instead, the application must allow RCU to have dedicated use of SIGUSR1. Users who are building their own liburcu library can override this default by using -DSIGRCU. Signal-based RCU is accessed as follows:

 #define RCU_SIGNAL
 #include <urcu.h>

It is also necessary to include “-lurcu-signal” on the linker command line.

Memory-barrier-based RCU is somewhat slower than signal-based RCU, but does not require a signal to be dedicated to RCU's use. Instead, this flavor of RCU's read-side primitives contains explicit memory barriers. Memory-barrier-based RCU is accessed as follows:

 #define RCU_MB
 #include <urcu.h>

It is also necessary to include “-lurcu-mb” on the linker command line.

Some kernels have a sys_membarrier() system call that causes every currently running thread of the current process to execute a memory barrier. Systems with such a system call can gain the performance benefits of signal-based RCU without the need to dedicate a signal to RCU. However, until sys_membarrier() is in mainline, those using it will need to carefully ensure that the kernel and user-space RCU agree on the system-call number. This variant of RCU is accessed as follows:

 #define RCU_MEMBARRIER
 #include <urcu.h>

It is also necessary to include “-lurcu” on the linker command line.

The four previous flavors of RCU require that reader threads be initialized using rcu_register_thread() before the first use of rcu_read_lock(), and further that they be cleaned up using rcu_unregister_thread() after the last use of rcu_read_unlock(). These requirements can be problematic in some types of libraries, and can therefore be dispensed with (with an additional slight degradation in performance compared to RCU_MB) by accessing bullet-proof RCU as follows:

 #include <urcu-bp.h>

It is also necessary to include “-lurcu-bp” on the linker command line.
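
For the three flavors that require thread registration, a reader thread might therefore be structured something like the following sketch, in which do_lookup() stands in for the application's actual read-side work; with bullet-proof RCU, the two registration calls may simply be omitted:

  #define RCU_MEMBARRIER              /* Or RCU_SIGNAL or RCU_MB, as described above. */
  #include <urcu.h>

  extern void do_lookup(void);        /* Assumed application read-side work. */

  void *reader_thread(void *arg)
  {
      (void)arg;                      /* Unused in this sketch. */

      rcu_register_thread();          /* Before the first rcu_read_lock(). */

      rcu_read_lock();
      do_lookup();
      rcu_read_unlock();

      rcu_unregister_thread();        /* After the last rcu_read_unlock(). */
      return NULL;
  }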

For all flavors, defining either of the C preprocessor symbols _LGPL_SOURCE or URCU_INLINE_SMALL_FUNCTIONS before the first #include enables static linking, which gives the compiler more freedom to carry out optimizations, inlining in particular. These optimizations can be particularly important for the lightweight read-side primitives such as rcu_read_lock() and rcu_read_unlock(). In short, static linking provides increased performance for those willing to abide by its licensing and technical constraints. These constraints include the need to distribute “.o” object files with any binary of your application, and the need to provide a separate build of your application for each version of URCU's ABI.
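
For example, a compilation unit opting into static linking might begin as follows (the choice of RCU_MB here is arbitrary):

  #define _LGPL_SOURCE                /* Or URCU_INLINE_SMALL_FUNCTIONS. */
  #define RCU_MB                      /* Flavor chosen arbitrarily for this sketch. */
  #include <urcu.h>                   /* The #defines must precede the first #include. */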

Quick Quiz 6: Is it OK to use URCU_INLINE_SMALL_FUNCTIONS even in proprietary code?

Answer

The tradeoffs between the five flavors are shown in the following table, where blank cells indicate fewer restrictions:

                          QSBR            RCU_MEMBARRIER  RCU_SIGNAL      RCU_MB          Bullet-proof
Header                    <urcu-qsbr.h>   <urcu.h>        <urcu.h>        <urcu.h>        <urcu-bp.h>
Define                                    RCU_MEMBARRIER  RCU_SIGNAL      RCU_MB
Libraries                 -lurcu-qsbr     -lurcu          -lurcu-signal   -lurcu-mb       -lurcu-bp
Read-Side Cost            Free!           ++              ++, test        ++, smp_mb()    ++, smp_mb(), test
Explicit QS Required      Yes
Thread Registry Required  Yes             Yes             Yes             Yes
Kernel Patch                              Yes
POSIX Signal Required                                     Yes
OK From Library           Not in general
OK From Library with
  unannounced application
  threads                 Not in general  Not in general  Not in general  Not in general

The first three rows show the compiler directives and linker libraries needed to access the corresponding RCU flavor. The “Read-Side” row roughly summarizes the read-side cost of each flavor, where “++” indicates a local increment/decrement pair, “test” indicates an if test and branch, and smp_mb() indicates a pair of memory barriers. Thus, although read-side cost does vary, all flavors have extremely low overhead when compared to synchronization primitives based on mutual exclusion.

The “Explicit QS Required” row shows which flavors require explicit calls to rcu_quiescent_state(), the “Thread Registry Required” row shows which flavors require explicit rcu_register_thread() and rcu_unregister_thread() calls from threads using RCU's read-side primitives, the “Kernel Patch” row shows which flavors require the sys_membarrier() kernel patch, and the “POSIX Signal Required” row shows which flavors require that the application give up a POSIX signal for RCU's use.

The “OK From Library” row shows which flavors can be used with little or no restriction from library functions intended to be linked with arbitrary applications, and finally, the “OK From Library with Unannounced Application Threads” row shows which flavors can be used with little or no restriction from such library functions, but where the linking application creates threads that cannot be forced to invoke the rcu_register_thread() and rcu_unregister_thread() primitives.

In short, no one of the five RCU flavors is suitable in all situations. The following rough rules of thumb can help select the flavor that is best for you:

  • If you are building a standalone application, and if speed is of the essence, then QSBR is probably your best choice.

  • If you are building a library or a library function, and you have absolutely no control over thread creation and destruction, then bullet-proof RCU is your only viable choice.

  • If you are building a library or a library function, but you are able to hook into thread creation and destruction, then you will need to use one of RCU_MEMBARRIER, RCU_SIGNAL, or RCU_MB as follows:

    • If you are running Linux and can patch your kernel, then build your kernel with Mathieu Desnoyers's sys_membarrier() patch and use RCU_MEMBARRIER. (You will also need to inform your application of sys_membarrier()'s syscall number, for example, by patching the relevant include files.) If you build with RCU_MEMBARRIER and run on a system lacking sys_membarrier(), URCU will fall back to RCU_MB. Assuming, that is, that the kernel does not contain another custom system call with a number that happens to collide with the choice of sys_membarrier() on the build machine.

    • Otherwise, if you can reserve a signal for your library's use (perhaps in conjunction with any other libraries or library functions wishing to use RCU), then use RCU_SIGNAL.

    • Otherwise, use RCU_MB.

Regardless of the URCU flavor chosen, the API is the same; for example, you will use rcu_read_lock() regardless of the URCU flavor. This is very convenient because it allows common source code to be shared among different programs or libraries that might select different URCU flavors. However, it also means that special care is required when sharing RCU-protected data structures among different libraries that are using different flavors of RCU. In some cases, it is necessary to include pointers to the relevant RCU functions in order to avoid RCU-flavor confusion. That said, in the typical case of a standalone application using RCU, this problem does not arise.

URCU library API

The URCU library's API is surprisingly large, and will therefore be discussed in the following pieces:

Using URCU

The following sections give some advice on the use of RCU:

When is URCU useful?

RCU is a specialized synchronization mechanism, and its performance, scalability, and real-time response advantages are due to this specialization. It is always important to use the right tool for the job, and RCU is no exception to this rule. The following diagram gives some rules of thumb for where RCU is best used:

[RCU applicability diagram]

As shown in the blue box, RCU works best if you have read-mostly data where stale and inconsistent data is permissible. The canonical example of this case in the Linux kernel is routing tables. Because it may have taken many seconds or even minutes for the routing updates to propagate across the Internet, the system has been sending packets the wrong way for quite some time. Having some small probability of continuing to send some of them the wrong way for a few more milliseconds is almost never a problem. User-level classification and routing data structures are likely to be just as amenable to RCU as those in the Linux kernel.

If you have a read-mostly workload where consistent data is required, RCU works well, as shown by the green box. Examples of this situation are data structures that map from a key to a structure, but where operations on each such structure must be fully synchronized, especially including deletion of the structure. The trick here is to use RCU to protect the mapping data structure and to use a per-structure lock to protect the individual structures being mapped to. In addition, each structure has a flag indicating whether or not it has been deleted. When an RCU-protected search finds a data structure, it acquires that data structure's lock and checks the deletion flag. If the flag is set, the search pretends not to have found the structure; otherwise, the desired operation on the structure is carried out under the protection of the lock.

In this case, RCU provides high performance and scalability for the mapping operation, and also guarantees that the structure will remain in existence: Updaters must wait for a grace period after removing a given structure before they are permitted to free that structure.
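
A sketch of this pattern follows; struct mapped, its fields, and the find_structure() and do_operation_on() helpers are assumptions made purely for illustration:

  #define RCU_MEMBARRIER              /* Flavor chosen arbitrarily for this sketch. */
  #include <urcu.h>
  #include <pthread.h>

  struct mapped {
      pthread_mutex_t lock;           /* Per-structure lock. */
      int deleted;                    /* Set under ->lock before the structure is removed. */
      /* ... application data ... */
  };

  extern struct mapped *find_structure(int key);  /* Assumed RCU-protected lookup. */
  extern void do_operation_on(struct mapped *p);  /* Assumed application operation. */

  int update_mapped(int key)
  {
      struct mapped *p;
      int ret = -1;

      rcu_read_lock();                /* p cannot be freed while this critical section runs. */
      p = find_structure(key);
      if (p != NULL) {
          pthread_mutex_lock(&p->lock);
          if (!p->deleted) {          /* Pretend not to see already-deleted structures. */
              do_operation_on(p);
              ret = 0;
          }
          pthread_mutex_unlock(&p->lock);
      }
      rcu_read_unlock();
      return ret;
  }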

As indicated by the yellow box, RCU can also be useful for read-write workloads where consistent data is required, although usually in conjunction with a number of other synchronization primitives, including per-data-structure locks (as above), per-thread locks, sequence locks, and atomic operations.

Finally, as indicated by the red box, update-mostly workloads requiring consistent data are rarely good places to use RCU. As noted in the figure, RCU's existence guarantees can be useful even in update-heavy situations because this allows simpler per-structure locking designs to be constructed, as is in fact the case for lfstack and rculfqueue—even though these data structures typically only have operations that update the structure itself. In effect, RCU is providing type-safe memory for algorithms using non-blocking synchronization. In addition, the low latency of RCU readers can be important to real-time applications in cases where the real-time threads do only reads, and non-real-time threads do updates.

So, in cases where RCU is the right tool for the job, what benefits does it provide?

  1. Excellent performance and scalability for RCU readers.

  2. Wait-free RCU readers, resulting in determinism, which can be quite useful for real-time applications.

  3. Read-side immunity to lock-based deadlocks.

  4. Existence guarantees that can simplify design of algorithms that use per-data-structure locks.

  5. RCU read-side critical sections can be freely nested, enabling high levels of composability.

  6. Because RCU read-side critical sections are not subject to rollback and retry, they can contain irrevocable operations such as I/O operations (in contrast to many transactional-memory implementations).

  7. RCU readers and writers can make concurrent forward progress, even in the presence of conflicting operations. In contrast, many other synchronization operations impose some flavor of mutual exclusion, so that conflicts result in spinning, blocking, rollbacks, or retries.

  8. RCU is independent of choice of memory allocator because it does not need to distinguish between pointers to the beginning of a given object and pointers internal to that object (in contrast with hazard pointers).

In short, within its area of applicability, RCU is a useful and powerful tool.

Quick Quiz 7: You say “read-side immunity to lock-based deadlocks”. So what non-lock-based deadlocks can involve RCU readers?

Answer

Quick Quiz 8: But given that most of the URCU implementations count the read-side nesting level, composability cannot be perfect, can it?

Answer

Application quiescent states

Kernels can use a context switch as a quiescent state, and many applications have similar natural quiescent states. However, just as RCU has hooks into the Linux scheduler in order to determine when context switches happen, applications using the QSBR flavor of URCU must tell RCU where those naturally occurring quiescent states are located using the rcu_quiescent_state(), rcu_thread_offline(), and rcu_thread_online() functions. For example, consider an application with a work-processing loop, such as the one shown below:

[Work loop diagram]

Examples of such applications include those using work-stealing task frameworks. Each thread of this application processes one work item on each pass through its loop, then fetches the next work item. Because RCU read-side critical sections will not span units of work, a natural quiescent state occurs when each work item completes execution, as shown on the right-hand side of the diagram. The code represented by the green boxes can contain RCU read-side critical sections.
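
Such a work loop might look something like the following sketch, in which struct work, get_work(), and process() stand in for the application's own code:

  #include <urcu-qsbr.h>

  struct work;                                  /* Assumed application work-item type. */
  extern struct work *get_work(void);           /* Assumed to return quickly, NULL at exit. */
  extern void process(struct work *w);          /* May contain RCU read-side critical sections. */

  void *worker(void *arg)
  {
      struct work *w;

      (void)arg;                                /* Unused in this sketch. */
      rcu_register_thread();
      while ((w = get_work()) != NULL) {
          process(w);
          rcu_quiescent_state();                /* Natural quiescent state between work items. */
      }
      rcu_unregister_thread();
      return NULL;
  }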

Quick Quiz 9: Can the code in the white rcu_quiescent_state() box contain RCU read-side critical sections?

Answer

In the typical work-stealing task framework, there is almost always work immediately available until the application terminates. In contrast, event-based systems might wait an arbitrary length of time for the next work item to arrive. URCU clearly needs to understand that a thread waiting on work to arrive is in a quiescent state, and the rcu_thread_offline() and rcu_thread_online() functions allow the application to tell URCU just that, as shown below:

[Event-based work loop diagram]

The application executes rcu_thread_offline() after executing each work item, and then executes rcu_thread_online() immediately before starting execution of the next work item. Therefore, RCU read-side critical sections are permitted when executing a work item, but prohibited while waiting for the next work item.
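
In code, this event-based loop might look something like the following sketch, where struct work, wait_for_work(), and process() again stand in for application-specific code:

  #include <urcu-qsbr.h>

  struct work;                                  /* Assumed application work-item type. */
  extern struct work *wait_for_work(void);      /* Assumed to block indefinitely; NULL at exit. */
  extern void process(struct work *w);          /* May contain RCU read-side critical sections. */

  void *event_worker(void *arg)
  {
      struct work *w;

      (void)arg;                                /* Unused in this sketch. */
      rcu_register_thread();
      for (;;) {
          rcu_thread_offline();                 /* Extended quiescent state: no RCU readers here. */
          w = wait_for_work();
          rcu_thread_online();                  /* RCU read-side critical sections permitted again. */
          if (w == NULL)
              break;
          process(w);
      }
      rcu_unregister_thread();
      return NULL;
  }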

In this way, the rcu_quiescent_state(), rcu_thread_offline(), and rcu_thread_online() functions allow the application to communicate the pertinent details of that application's structure to RCU.

Quick Quiz 10: But what if some portions of the “Wait For Next Work Item” code needed to use RCU read-side critical sections?

Answer

However, please note that these three functions are necessary only for the QSBR flavor of RCU. In the other four flavors, RCU is instead explicitly notified of the beginning and end of each RCU read-side critical section, so that the rcu_quiescent_state(), rcu_thread_offline(), and rcu_thread_online() functions are no-ops. Therefore, if your application does not have convenient quiescent states, you should use one of the other four RCU flavors.

Mixing and matching URCU flavors

Different compilation units can be using different flavors of RCU concurrently. This works extremely well when different modules or libraries are using RCU to access internal data structures: Each such module or library selects its flavor of RCU and uses it, and the multiple flavors operate independently as desired. One use case for this behavior is a tracer library for user-space applications and libraries (LTTng-UST), which in turn must use RCU-bp because it can neither control the application's threads nor even hook into their creation and destruction. This use case is supported quite naturally.

That said, it is necessary to take some care when passing RCU-protected data structures from one compilation unit to another. In some cases, it may be necessary to wrap the RCU functions in order to ensure that they are compiled in the correct compilation unit. In other cases, including the hash table described here, it is necessary to maintain function pointers to the selected RCU flavor's API members in order to allow the data structure to be safely manipulated by other compilation units that neither know nor care which RCU flavor the data structure is using.
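
One illustrative (and purely hypothetical) way to package such function pointers is sketched below; the struct rcu_flavor_ops type, its fields, and the wrapper functions are assumptions made for this example and are not part of the liburcu API:

  #define RCU_MB                      /* Whatever flavor this compilation unit selected. */
  #include <urcu.h>

  /* Hypothetical flavor-operations table to be stored alongside an
   * RCU-protected data structure. */
  struct rcu_flavor_ops {
      void (*read_lock)(void);
      void (*read_unlock)(void);
      void (*synchronize)(void);
  };

  /* Wrapping the calls in small functions ensures that they are compiled
   * in the unit that selected the flavor. */
  static void flavor_read_lock(void)   { rcu_read_lock(); }
  static void flavor_read_unlock(void) { rcu_read_unlock(); }
  static void flavor_synchronize(void) { synchronize_rcu(); }

  const struct rcu_flavor_ops my_flavor_ops = {
      .read_lock   = flavor_read_lock,
      .read_unlock = flavor_read_unlock,
      .synchronize = flavor_synchronize,
  };

A compilation unit that receives a pointer to the data structure along with such an operations table can then bracket its accesses with the table's read_lock and read_unlock members without knowing which flavor is in use.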

Building URCU

A number of Linux distributions provide the URCU library, including the Debian family (e.g., Ubuntu), Fedora, openSUSE, SUSE Enterprise RT Linux, Linaro, and all Yocto-based RT distros (e.g., Wind River Linux, Mentor Embedded Linux, Montavista, and ElinOS). However, if you want features and data structures from the latest version of the URCU library, you will need to download and build it.

The URCU library may be downloaded here. It can also be cloned or viewed using:

 git://git.liburcu.org/urcu.git
 https://github.com/urcu/userspace-rcu

Once downloaded, follow the build instructions in the README.

Summary

The user-space RCU library has been in existence for almost four years, and has grown from being just RCU to also including RCU-protected data structures, including a resizable hash table, a stack, and a couple of queues. We expect that more data structures will make their appearance over time, with possibilities including Judy arrays [PDF], bonsai trees [PDF], and many more besides. Other hoped-for enhancements include improved IPC (user space wait_event() and wake_up(), anyone?) and, of course, a locking validator similar to the Linux kernel's lockdep.

This library has been used by a number of projects, including LTTng-UST (User-Space Tracer) (for which it was first constructed), Knot DNS, the gdnsd geographic DNS server, the netsniff-ng toolkit, the Sheepdog Project, and an academic project on crowd simulation, as well as a number of commercial applications.

Acknowledgments

We are grateful to Jérémie Galarneau, Josh Triplett, and Frédéric Weisbecker for their comments on a draft of this article. We owe thanks to Jim Wasko for his support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: Why use rcu_assign_pointer() rather than the simple gptr=tmp assignment statement that the C language provided for this straightforward job???

Answer: Unfortunately, this job is not so straightforward. Remember that readers might be accessing gptr at any time: There is no way to stop them. We therefore need to prevent both the compiler and the CPU from mischievously reordering the memory references, which they are both otherwise fully within their rights to do. To see why this is important, suppose that either the compiler or the CPU decided to optimize the code by moving the gptr=tmp assignment statement so that it preceded the structure-initialization assignment statements. Then the concurrent readers might well load from gptr and get a pointer to a structure full of pre-initialization garbage. Not good!!!

Therefore, we use rcu_assign_pointer(gptr, tmp), which acts as a mischief-preventing assignment statement.

Back to Quick Quiz 1.

Quick Quiz 2: Why cds_list_del_rcu() instead of simply list_del_rcu() like the Linux kernel does?

Answer: We could not take the Linux kernel list primitives directly due to license issues, namely the Linux kernel's GPLv2 license as opposed to the user-space RCU library's LGPLv2.1-or-later license. Instead, the GNU C library's list primitives were pressed into service. There are a few differences from the kernel, and there will be more as the kernel's API evolves, so the cds_ prefix serves as a helpful reminder. In addition, names like list_add() and list_del() are already used by quite a few user applications, and the cds_ prefix helps avoid symbol-name clashes.

Back to Quick Quiz 2.

Quick Quiz 3: And just how is synchronize_rcu() supposed to know when all the currently executing RCU readers started? Doesn't that mean that all the RCU readers must do heavyweight operations to get the current time and store it somewhere? And doesn't this also require perfect time synchronization?

Answer: There are a number of ways for synchronize_rcu() to do its job, for example using the “Classic RCU” algorithm described here. Heavyweight read-side operations and perfect time synchronization are not required, although in-kernel RCU implementations have been known to complain when CPUs' notions of the current time differ by more than one minute.

Back to Quick Quiz 3.

Quick Quiz 4: Why the “reader” above? Doesn't every thread need to reside in a quiescent state in order for a grace period to have elapsed?

Answer: The URCU Library API section introduces the rcu_register_thread() and rcu_unregister_thread() APIs. A reader thread is a thread that has invoked rcu_register_thread() and not yet invoked rcu_unregister_thread(). All other threads are forbidden from executing RCU read-side critical sections and are ignored by RCU.

Back to Quick Quiz 4.

Quick Quiz 5: When does RCU-protected list traversal compile to different assembly language than does simple single-threaded linked-list traversal?

Answer: In some flavors of user-space RCU, rcu_read_lock() and rcu_read_unlock() produce some lightweight code (more on this later). In all flavors of user-space RCU, cds_list_for_each_entry_rcu() uses the CMM_ACCESS_ONCE() macro (similar to ACCESS_ONCE() in the Linux kernel), which can disable some code-motion compiler optimizations. Finally, when running on DEC Alpha, cds_list_for_each_entry_rcu() contains a memory barrier.

Back to Quick Quiz 5.

Quick Quiz 6: Is it OK to use URCU_INLINE_SMALL_FUNCTIONS even in proprietary code?

Answer: Yes, it is. Use of URCU_INLINE_SMALL_FUNCTIONS inlines only small functions and macros that fit within LGPL's 10-line inlining limit for non-LGPL code.

See paragraph 4 of clause 5 of LGPL v2.1, which reads:

If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.)

Back to Quick Quiz 6.

Quick Quiz 7: You say “read-side immunity to lock-based deadlocks”. So what non-lock-based deadlocks can involve RCU readers?

Answer: Any sequence of events that can result in an RCU reader waiting for an RCU grace period results in deadlock in all RCU flavors except for QSBR. In QSBR, this bug results in splitting of the RCU read-side critical section containing the operation that waits on a grace period. Note that acquiring a lock that is held across a grace period can indirectly result in waiting for a grace period. There are of course a number of more elaborate ways of arriving at this sort of deadlock.

Back to Quick Quiz 7.

Quick Quiz 8: But given that most of the URCU implementations count the read-side nesting level, composability cannot be perfect, can it?

Answer: Indeed it cannot. This is an imperfect world, beset by finite-sized counters. And this problem is not restricted to RCU: Other composable synchronization operations have implementation-defined limits to nesting depth, and thus limits on composability.

Back to Quick Quiz 8.

Quick Quiz 9: Can the code in the white rcu_quiescent_state() box contain RCU read-side critical sections?

Answer: That is a matter that is internal to the RCU implementation. That said, any RCU read-side critical sections must be outside the area of code where RCU is ignoring this thread.

Back to Quick Quiz 9.

Quick Quiz 10: But what if some portions of the “Wait For Next Work Item” code needed to use RCU read-side critical sections?

Answer: Then the calls to rcu_thread_offline() and rcu_thread_online() would need to be moved into the “Wait For Next Work Item” code.

Back to Quick Quiz 10.