| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" | 
 |         "http://www.w3.org/TR/html4/loose.dtd"> | 
 |         <html> | 
 |         <head><title>A Tour Through RCU's Requirements [LWN.net]</title> | 
 |         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> | 
 |  | 
 | <h1>A Tour Through RCU's Requirements</h1> | 
 |  | 
 | <p>Copyright IBM Corporation, 2015</p> | 
 | <p>Author: Paul E. McKenney</p> | 
 | <p><i>The initial version of this document appeared in the | 
 | <a href="https://lwn.net/">LWN</a> articles | 
 | <a href="https://lwn.net/Articles/652156/">here</a>, | 
 | <a href="https://lwn.net/Articles/652677/">here</a>, and | 
 | <a href="https://lwn.net/Articles/653326/">here</a>.</i></p> | 
 |  | 
 | <h2>Introduction</h2> | 
 |  | 
 | <p> | 
 | Read-copy update (RCU) is a synchronization mechanism that is often | 
 | used as a replacement for reader-writer locking. | 
 | RCU is unusual in that updaters do not block readers, | 
 | which means that RCU's read-side primitives can be exceedingly fast | 
 | and scalable. | 
 | In addition, updaters can make useful forward progress concurrently | 
 | with readers. | 
 | However, all this concurrency between RCU readers and updaters does raise | 
 | the question of exactly what RCU readers are doing, which in turn | 
 | raises the question of exactly what RCU's requirements are. | 
 |  | 
 | <p> | 
 | This document therefore summarizes RCU's requirements, and can be thought | 
 | of as an informal, high-level specification for RCU. | 
 | It is important to understand that RCU's specification is primarily | 
 | empirical in nature; | 
 | in fact, I learned about many of these requirements the hard way. | 
This situation might cause some consternation; however, not only
has this learning process been a lot of fun, but it has also been
a great privilege to work with so many people willing to apply
technologies in interesting new ways.
 |  | 
 | <p> | 
 | All that aside, here are the categories of currently known RCU requirements: | 
 | </p> | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Fundamental Requirements"> | 
 | 	Fundamental Requirements</a> | 
 | <li>	<a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a> | 
 | <li>	<a href="#Parallelism Facts of Life"> | 
 | 	Parallelism Facts of Life</a> | 
 | <li>	<a href="#Quality-of-Implementation Requirements"> | 
 | 	Quality-of-Implementation Requirements</a> | 
 | <li>	<a href="#Linux Kernel Complications"> | 
 | 	Linux Kernel Complications</a> | 
 | <li>	<a href="#Software-Engineering Requirements"> | 
 | 	Software-Engineering Requirements</a> | 
 | <li>	<a href="#Other RCU Flavors"> | 
 | 	Other RCU Flavors</a> | 
 | <li>	<a href="#Possible Future Changes"> | 
 | 	Possible Future Changes</a> | 
 | </ol> | 
 |  | 
 | <p> | 
This is followed by a <a href="#Summary">summary</a>.
Note that the answer to each quick quiz immediately follows the quiz itself.
Select the big white space with your mouse to see the answer.
 |  | 
 | <h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2> | 
 |  | 
 | <p> | 
 | RCU's fundamental requirements are the closest thing RCU has to hard | 
 | mathematical requirements. | 
 | These are: | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Grace-Period Guarantee"> | 
 | 	Grace-Period Guarantee</a> | 
 | <li>	<a href="#Publish-Subscribe Guarantee"> | 
 | 	Publish-Subscribe Guarantee</a> | 
 | <li>	<a href="#Memory-Barrier Guarantees"> | 
 | 	Memory-Barrier Guarantees</a> | 
 | <li>	<a href="#RCU Primitives Guaranteed to Execute Unconditionally"> | 
 | 	RCU Primitives Guaranteed to Execute Unconditionally</a> | 
 | <li>	<a href="#Guaranteed Read-to-Write Upgrade"> | 
 | 	Guaranteed Read-to-Write Upgrade</a> | 
 | </ol> | 
 |  | 
 | <h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3> | 
 |  | 
 | <p> | 
 | RCU's grace-period guarantee is unusual in being premeditated: | 
 | Jack Slingwine and I had this guarantee firmly in mind when we started | 
 | work on RCU (then called “rclock”) in the early 1990s. | 
 | That said, the past two decades of experience with RCU have produced | 
 | a much more detailed understanding of this guarantee. | 
 |  | 
 | <p> | 
 | RCU's grace-period guarantee allows updaters to wait for the completion | 
 | of all pre-existing RCU read-side critical sections. | 
 | An RCU read-side critical section | 
 | begins with the marker <tt>rcu_read_lock()</tt> and ends with | 
 | the marker <tt>rcu_read_unlock()</tt>. | 
 | These markers may be nested, and RCU treats a nested set as one | 
 | big RCU read-side critical section. | 
 | Production-quality implementations of <tt>rcu_read_lock()</tt> and | 
 | <tt>rcu_read_unlock()</tt> are extremely lightweight, and in | 
 | fact have exactly zero overhead in Linux kernels built for production | 
 | use with <tt>CONFIG_PREEMPT=n</tt>. | 
 |  | 
 | <p> | 
 | This guarantee allows ordering to be enforced with extremely low | 
 | overhead to readers, for example: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 int x, y; | 
 |  2 | 
 |  3 void thread0(void) | 
 |  4 { | 
 |  5   rcu_read_lock(); | 
 |  6   r1 = READ_ONCE(x); | 
 |  7   r2 = READ_ONCE(y); | 
 |  8   rcu_read_unlock(); | 
 |  9 } | 
 | 10 | 
 | 11 void thread1(void) | 
 | 12 { | 
 | 13   WRITE_ONCE(x, 1); | 
 | 14   synchronize_rcu(); | 
 | 15   WRITE_ONCE(y, 1); | 
 | 16 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | Because the <tt>synchronize_rcu()</tt> on line 14 waits for | 
 | all pre-existing readers, any instance of <tt>thread0()</tt> that | 
 | loads a value of zero from <tt>x</tt> must complete before | 
 | <tt>thread1()</tt> stores to <tt>y</tt>, so that instance must | 
 | also load a value of zero from <tt>y</tt>. | 
 | Similarly, any instance of <tt>thread0()</tt> that loads a value of | 
 | one from <tt>y</tt> must have started after the | 
 | <tt>synchronize_rcu()</tt> started, and must therefore also load | 
 | a value of one from <tt>x</tt>. | 
 | Therefore, the outcome: | 
 | <blockquote> | 
 | <pre> | 
 | (r1 == 0 && r2 == 1) | 
 | </pre> | 
 | </blockquote> | 
 | cannot happen. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Wait a minute! | 
 | 	You said that updaters can make useful forward progress concurrently | 
 | 	with readers, but pre-existing readers will block | 
 | 	<tt>synchronize_rcu()</tt>!!! | 
 | 	Just who are you trying to fool??? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	First, if updaters do not wish to be blocked by readers, they can use | 
 | 	<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will | 
 | 	be discussed later. | 
 | 	Second, even when using <tt>synchronize_rcu()</tt>, the other | 
 | 	update-side code does run concurrently with readers, whether | 
 | 	pre-existing or not. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | This scenario resembles one of the first uses of RCU in | 
 | <a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>, | 
 | which managed a distributed lock manager's transition into | 
 | a state suitable for handling recovery from node failure, | 
 | more or less as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 #define STATE_NORMAL        0 | 
 |  2 #define STATE_WANT_RECOVERY 1 | 
 |  3 #define STATE_RECOVERING    2 | 
 |  4 #define STATE_WANT_NORMAL   3 | 
 |  5 | 
 |  6 int state = STATE_NORMAL; | 
 |  7 | 
 |  8 void do_something_dlm(void) | 
 |  9 { | 
 | 10   int state_snap; | 
 | 11 | 
 | 12   rcu_read_lock(); | 
 | 13   state_snap = READ_ONCE(state); | 
 | 14   if (state_snap == STATE_NORMAL) | 
 | 15     do_something(); | 
 | 16   else | 
 | 17     do_something_carefully(); | 
 | 18   rcu_read_unlock(); | 
 | 19 } | 
 | 20 | 
 | 21 void start_recovery(void) | 
 | 22 { | 
 | 23   WRITE_ONCE(state, STATE_WANT_RECOVERY); | 
 | 24   synchronize_rcu(); | 
 | 25   WRITE_ONCE(state, STATE_RECOVERING); | 
 | 26   recovery(); | 
 | 27   WRITE_ONCE(state, STATE_WANT_NORMAL); | 
 | 28   synchronize_rcu(); | 
 | 29   WRITE_ONCE(state, STATE_NORMAL); | 
 | 30 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | The RCU read-side critical section in <tt>do_something_dlm()</tt> | 
 | works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt> | 
 | to guarantee that <tt>do_something()</tt> never runs concurrently | 
 | with <tt>recovery()</tt>, but with little or no synchronization | 
 | overhead in <tt>do_something_dlm()</tt>. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Why is the <tt>synchronize_rcu()</tt> on line 28 needed? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Without that extra grace period, memory reordering could result in | 
 | 	<tt>do_something_dlm()</tt> executing <tt>do_something()</tt> | 
 | 	concurrently with the last bits of <tt>recovery()</tt>. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | In order to avoid fatal problems such as deadlocks, | 
 | an RCU read-side critical section must not contain calls to | 
 | <tt>synchronize_rcu()</tt>. | 
 | Similarly, an RCU read-side critical section must not | 
 | contain anything that waits, directly or indirectly, on completion of | 
 | an invocation of <tt>synchronize_rcu()</tt>. | 
 |  | 
 | <p> | 
 | Although RCU's grace-period guarantee is useful in and of itself, with | 
 | <a href="https://lwn.net/Articles/573497/">quite a few use cases</a>, | 
 | it would be good to be able to use RCU to coordinate read-side | 
 | access to linked data structures. | 
 | For this, the grace-period guarantee is not sufficient, as can | 
 | be seen in function <tt>add_gp_buggy()</tt> below. | 
 | We will look at the reader's code later, but in the meantime, just think of | 
 | the reader as locklessly picking up the <tt>gp</tt> pointer, | 
 | and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the | 
 | <tt>->a</tt> and <tt>->b</tt> fields. | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool add_gp_buggy(int a, int b) | 
 |  2 { | 
 |  3   p = kmalloc(sizeof(*p), GFP_KERNEL); | 
 |  4   if (!p) | 
 5     return false;
 |  6   spin_lock(&gp_lock); | 
 |  7   if (rcu_access_pointer(gp)) { | 
 |  8     spin_unlock(&gp_lock); | 
 |  9     return false; | 
 | 10   } | 
 | 11   p->a = a; | 
12   p->b = b;
 | 13   gp = p; /* ORDERING BUG */ | 
 | 14   spin_unlock(&gp_lock); | 
 | 15   return true; | 
 | 16 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | The problem is that both the compiler and weakly ordered CPUs are within | 
 | their rights to reorder this code as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool add_gp_buggy_optimized(int a, int b) | 
 |  2 { | 
 |  3   p = kmalloc(sizeof(*p), GFP_KERNEL); | 
 |  4   if (!p) | 
 5     return false;
 |  6   spin_lock(&gp_lock); | 
 |  7   if (rcu_access_pointer(gp)) { | 
 |  8     spin_unlock(&gp_lock); | 
 |  9     return false; | 
 | 10   } | 
 | <b>11   gp = p; /* ORDERING BUG */ | 
 | 12   p->a = a; | 
13   p->b = b;</b>
 | 14   spin_unlock(&gp_lock); | 
 | 15   return true; | 
 | 16 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | If an RCU reader fetches <tt>gp</tt> just after | 
 | <tt>add_gp_buggy_optimized</tt> executes line 11, | 
 | it will see garbage in the <tt>->a</tt> and <tt>->b</tt> | 
 | fields. | 
 | And this is but one of many ways in which compiler and hardware optimizations | 
 | could cause trouble. | 
 | Therefore, we clearly need some way to prevent the compiler and the CPU from | 
 | reordering in this manner, which brings us to the publish-subscribe | 
 | guarantee discussed in the next section. | 
 |  | 
<h3><a name="Publish-Subscribe Guarantee">Publish-Subscribe Guarantee</a></h3>
 |  | 
 | <p> | 
 | RCU's publish-subscribe guarantee allows data to be inserted | 
 | into a linked data structure without disrupting RCU readers. | 
 | The updater uses <tt>rcu_assign_pointer()</tt> to insert the | 
 | new data, and readers use <tt>rcu_dereference()</tt> to | 
 | access data, whether new or old. | 
 | The following shows an example of insertion: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool add_gp(int a, int b) | 
 |  2 { | 
 |  3   p = kmalloc(sizeof(*p), GFP_KERNEL); | 
 |  4   if (!p) | 
 5     return false;
 |  6   spin_lock(&gp_lock); | 
 |  7   if (rcu_access_pointer(gp)) { | 
 |  8     spin_unlock(&gp_lock); | 
 |  9     return false; | 
 | 10   } | 
 | 11   p->a = a; | 
12   p->b = b;
 | 13   rcu_assign_pointer(gp, p); | 
 | 14   spin_unlock(&gp_lock); | 
 | 15   return true; | 
 | 16 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually | 
 | equivalent to a simple assignment statement, but also guarantees | 
 | that its assignment will | 
 | happen after the two assignments in lines 11 and 12, | 
 | similar to the C11 <tt>memory_order_release</tt> store operation. | 
 | It also prevents any number of “interesting” compiler | 
 | optimizations, for example, the use of <tt>gp</tt> as a scratch | 
 | location immediately preceding the assignment. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	But <tt>rcu_assign_pointer()</tt> does nothing to prevent the | 
 | 	two assignments to <tt>p->a</tt> and <tt>p->b</tt> | 
 | 	from being reordered. | 
 | 	Can't that also cause problems? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	No, it cannot. | 
 | 	The readers cannot see either of these two fields until | 
 | 	the assignment to <tt>gp</tt>, by which time both fields are | 
 | 	fully initialized. | 
 | 	So reordering the assignments | 
 | 	to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly | 
 | 	cause any problems. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | It is tempting to assume that the reader need not do anything special | 
 | to control its accesses to the RCU-protected data, | 
 | as shown in <tt>do_something_gp_buggy()</tt> below: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool do_something_gp_buggy(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   p = gp;  /* OPTIMIZATIONS GALORE!!! */ | 
 |  5   if (p) { | 
 |  6     do_something(p->a, p->b); | 
 |  7     rcu_read_unlock(); | 
 |  8     return true; | 
 |  9   } | 
 | 10   rcu_read_unlock(); | 
 | 11   return false; | 
 | 12 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | However, this temptation must be resisted because there are a | 
 | surprisingly large number of ways that the compiler | 
 | (to say nothing of | 
 | <a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) | 
 | can trip this code up. | 
 | For but one example, if the compiler were short of registers, it | 
 | might choose to refetch from <tt>gp</tt> rather than keeping | 
 | a separate copy in <tt>p</tt> as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool do_something_gp_buggy_optimized(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   if (gp) { /* OPTIMIZATIONS GALORE!!! */ | 
 | <b> 5     do_something(gp->a, gp->b);</b> | 
 |  6     rcu_read_unlock(); | 
 |  7     return true; | 
 |  8   } | 
 |  9   rcu_read_unlock(); | 
 | 10   return false; | 
 | 11 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | If this function ran concurrently with a series of updates that | 
 | replaced the current structure with a new one, | 
 | the fetches of <tt>gp->a</tt> | 
 | and <tt>gp->b</tt> might well come from two different structures, | 
 | which could cause serious confusion. | 
 | To prevent this (and much else besides), <tt>do_something_gp()</tt> uses | 
 | <tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool do_something_gp(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   p = rcu_dereference(gp); | 
 |  5   if (p) { | 
 |  6     do_something(p->a, p->b); | 
 |  7     rcu_read_unlock(); | 
 |  8     return true; | 
 |  9   } | 
 | 10   rcu_read_unlock(); | 
 | 11   return false; | 
 | 12 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) | 
 | memory barriers in the Linux kernel. | 
 | Should a | 
 | <a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> | 
 | ever appear, then <tt>rcu_dereference()</tt> could be implemented | 
 | as a <tt>memory_order_consume</tt> load. | 
 | Regardless of the exact implementation, a pointer fetched by | 
 | <tt>rcu_dereference()</tt> may not be used outside of the | 
 | outermost RCU read-side critical section containing that | 
 | <tt>rcu_dereference()</tt>, unless protection of | 
 | the corresponding data element has been passed from RCU to some | 
 | other synchronization mechanism, most commonly locking or | 
 | <a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. | 
 |  | 
 | <p> | 
 | In short, updaters use <tt>rcu_assign_pointer()</tt> and readers | 
 | use <tt>rcu_dereference()</tt>, and these two RCU API elements | 
 | work together to ensure that readers have a consistent view of | 
 | newly added data elements. | 
 |  | 
 | <p> | 
 | Of course, it is also necessary to remove elements from RCU-protected | 
 | data structures, for example, using the following process: | 
 |  | 
 | <ol> | 
 | <li>	Remove the data element from the enclosing structure. | 
 | <li>	Wait for all pre-existing RCU read-side critical sections | 
 | 	to complete (because only pre-existing readers can possibly have | 
 | 	a reference to the newly removed data element). | 
 | <li>	At this point, only the updater has a reference to the | 
 | 	newly removed data element, so it can safely reclaim | 
 | 	the data element, for example, by passing it to <tt>kfree()</tt>. | 
 | </ol> | 
 |  | 
<p>
This process is implemented by <tt>remove_gp_synchronous()</tt>:
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool remove_gp_synchronous(void) | 
 |  2 { | 
 |  3   struct foo *p; | 
 |  4 | 
 |  5   spin_lock(&gp_lock); | 
 |  6   p = rcu_access_pointer(gp); | 
 |  7   if (!p) { | 
 |  8     spin_unlock(&gp_lock); | 
 |  9     return false; | 
 | 10   } | 
 | 11   rcu_assign_pointer(gp, NULL); | 
 | 12   spin_unlock(&gp_lock); | 
 | 13   synchronize_rcu(); | 
 | 14   kfree(p); | 
 | 15   return true; | 
 | 16 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | This function is straightforward, with line 13 waiting for a grace | 
 | period before line 14 frees the old data element. | 
 | This waiting ensures that readers will reach line 7 of | 
 | <tt>do_something_gp()</tt> before the data element referenced by | 
 | <tt>p</tt> is freed. | 
 | The <tt>rcu_access_pointer()</tt> on line 6 is similar to | 
 | <tt>rcu_dereference()</tt>, except that: | 
 |  | 
 | <ol> | 
 | <li>	The value returned by <tt>rcu_access_pointer()</tt> | 
 | 	cannot be dereferenced. | 
 | 	If you want to access the value pointed to as well as | 
 | 	the pointer itself, use <tt>rcu_dereference()</tt> | 
 | 	instead of <tt>rcu_access_pointer()</tt>. | 
 | <li>	The call to <tt>rcu_access_pointer()</tt> need not be | 
 | 	protected. | 
 | 	In contrast, <tt>rcu_dereference()</tt> must either be | 
 | 	within an RCU read-side critical section or in a code | 
 | 	segment where the pointer cannot change, for example, in | 
 | 	code protected by the corresponding update-side lock. | 
 | </ol> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Without the <tt>rcu_dereference()</tt> or the | 
 | 	<tt>rcu_access_pointer()</tt>, what destructive optimizations | 
 | 	might the compiler make use of? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Let's start with what happens to <tt>do_something_gp()</tt> | 
 | 	if it fails to use <tt>rcu_dereference()</tt>. | 
 | 	It could reuse a value formerly fetched from this same pointer. | 
 | 	It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time | 
	manner, resulting in <i>load tearing</i>, in turn resulting in a bytewise
 | 	mash-up of two distinct pointer values. | 
 | 	It might even use value-speculation optimizations, where it makes | 
 | 	a wrong guess, but by the time it gets around to checking the | 
 | 	value, an update has changed the pointer to match the wrong guess. | 
 | 	Too bad about any dereferences that returned pre-initialization garbage | 
 | 	in the meantime! | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	For <tt>remove_gp_synchronous()</tt>, as long as all modifications | 
 | 	to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, | 
 | 	the above optimizations are harmless. | 
 | 	However, <tt>sparse</tt> will complain if you | 
 | 	define <tt>gp</tt> with <tt>__rcu</tt> and then | 
 | 	access it without using | 
 | 	either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | In short, RCU's publish-subscribe guarantee is provided by the combination | 
 | of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. | 
 | This guarantee allows data elements to be safely added to RCU-protected | 
 | linked data structures without disrupting RCU readers. | 
 | This guarantee can be used in combination with the grace-period | 
 | guarantee to also allow data elements to be removed from RCU-protected | 
 | linked data structures, again without disrupting RCU readers. | 
 |  | 
 | <p> | 
 | This guarantee was only partially premeditated. | 
 | DYNIX/ptx used an explicit memory barrier for publication, but had nothing | 
 | resembling <tt>rcu_dereference()</tt> for subscription, nor did it | 
 | have anything resembling the <tt>smp_read_barrier_depends()</tt> | 
 | that was later subsumed into <tt>rcu_dereference()</tt>. | 
 | The need for these operations made itself known quite suddenly at a | 
 | late-1990s meeting with the DEC Alpha architects, back in the days when | 
 | DEC was still a free-standing company. | 
 | It took the Alpha architects a good hour to convince me that any sort | 
 | of barrier would ever be needed, and it then took me a good <i>two</i> hours | 
 | to convince them that their documentation did not make this point clear. | 
More recent work with the C and C++ standards committees has provided
much education on tricks and traps from the compiler.
 | In short, compilers were much less tricky in the early 1990s, but in | 
 | 2015, don't even think about omitting <tt>rcu_dereference()</tt>! | 
 |  | 
 | <h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> | 
 |  | 
 | <p> | 
 | The previous section's simple linked-data-structure scenario clearly | 
 | demonstrates the need for RCU's stringent memory-ordering guarantees on | 
 | systems with more than one CPU: | 
 |  | 
 | <ol> | 
 | <li>	Each CPU that has an RCU read-side critical section that | 
 | 	begins before <tt>synchronize_rcu()</tt> starts is | 
 | 	guaranteed to execute a full memory barrier between the time | 
 | 	that the RCU read-side critical section ends and the time that | 
 | 	<tt>synchronize_rcu()</tt> returns. | 
 | 	Without this guarantee, a pre-existing RCU read-side critical section | 
 | 	might hold a reference to the newly removed <tt>struct foo</tt> | 
 | 	after the <tt>kfree()</tt> on line 14 of | 
 | 	<tt>remove_gp_synchronous()</tt>. | 
 | <li>	Each CPU that has an RCU read-side critical section that ends | 
 | 	after <tt>synchronize_rcu()</tt> returns is guaranteed | 
 | 	to execute a full memory barrier between the time that | 
 | 	<tt>synchronize_rcu()</tt> begins and the time that the RCU | 
 | 	read-side critical section begins. | 
 | 	Without this guarantee, a later RCU read-side critical section | 
 | 	running after the <tt>kfree()</tt> on line 14 of | 
 | 	<tt>remove_gp_synchronous()</tt> might | 
 | 	later run <tt>do_something_gp()</tt> and find the | 
 | 	newly deleted <tt>struct foo</tt>. | 
 | <li>	If the task invoking <tt>synchronize_rcu()</tt> remains | 
 | 	on a given CPU, then that CPU is guaranteed to execute a full | 
 | 	memory barrier sometime during the execution of | 
 | 	<tt>synchronize_rcu()</tt>. | 
 | 	This guarantee ensures that the <tt>kfree()</tt> on | 
 | 	line 14 of <tt>remove_gp_synchronous()</tt> really does | 
 | 	execute after the removal on line 11. | 
 | <li>	If the task invoking <tt>synchronize_rcu()</tt> migrates | 
 | 	among a group of CPUs during that invocation, then each of the | 
 | 	CPUs in that group is guaranteed to execute a full memory barrier | 
 | 	sometime during the execution of <tt>synchronize_rcu()</tt>. | 
	This guarantee also ensures that the <tt>kfree()</tt> on
	line 14 of <tt>remove_gp_synchronous()</tt> really does
	execute after the removal on line 11, even in the case
	where the thread executing the <tt>synchronize_rcu()</tt>
	migrates in the meantime.
 | </ol> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Given that multiple CPUs can start RCU read-side critical sections | 
 | 	at any time without any ordering whatsoever, how can RCU possibly | 
 | 	tell whether or not a given RCU read-side critical section starts | 
 | 	before a given instance of <tt>synchronize_rcu()</tt>? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	If RCU cannot tell whether or not a given | 
 | 	RCU read-side critical section starts before a | 
 | 	given instance of <tt>synchronize_rcu()</tt>, | 
 | 	then it must assume that the RCU read-side critical section | 
 | 	started first. | 
 | 	In other words, a given instance of <tt>synchronize_rcu()</tt> | 
 | 	can avoid waiting on a given RCU read-side critical section only | 
 | 	if it can prove that <tt>synchronize_rcu()</tt> started first. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	A related question is “When <tt>rcu_read_lock()</tt> | 
 | 	doesn't generate any code, why does it matter how it relates | 
 | 	to a grace period?” | 
 | 	The answer is that it is not the relationship of | 
 | 	<tt>rcu_read_lock()</tt> itself that is important, but rather | 
 | 	the relationship of the code within the enclosed RCU read-side | 
 | 	critical section to the code preceding and following the | 
 | 	grace period. | 
 | 	If we take this viewpoint, then a given RCU read-side critical | 
 | 	section begins before a given grace period when some access | 
 | 	preceding the grace period observes the effect of some access | 
 | 	within the critical section, in which case none of the accesses | 
 | 	within the critical section may observe the effects of any | 
 | 	access following the grace period. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	As of late 2016, mathematical models of RCU take this | 
 | 	viewpoint, for example, see slides 62 and 63 | 
 | 	of the | 
 | 	<a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> | 
 | 	presentation. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	The first and second guarantees require unbelievably strict ordering! | 
	Are all these memory barriers <i>really</i> required?
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Yes, they really are required. | 
 | 	To see why the first guarantee is required, consider the following | 
 | 	sequence of events: | 
 | 	</font> | 
 |  | 
 | 	<ol> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 1: <tt>rcu_read_lock()</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 1: <tt>q = rcu_dereference(gp); | 
 | 		/* Very likely to return p. */</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 0: <tt>list_del_rcu(p);</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 0: <tt>synchronize_rcu()</tt> starts. | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 1: <tt>do_something_with(q->a); | 
 | 		/* No smp_mb(), so might happen after kfree(). */</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 1: <tt>rcu_read_unlock()</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 0: <tt>synchronize_rcu()</tt> returns. | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 0: <tt>kfree(p);</tt> | 
 | 		</font> | 
 | 	</ol> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	Therefore, there absolutely must be a full memory barrier between the | 
 | 	end of the RCU read-side critical section and the end of the | 
 | 	grace period. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	The sequence of events demonstrating the necessity of the second rule | 
 | 	is roughly similar: | 
 | 	</font> | 
 |  | 
 | 	<ol> | 
 | 	<li>	<font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. | 
 | 		</font> | 
 | 	<li>	<font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); | 
 | 		/* Might return p if no memory barrier. */</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. | 
 | 		</font> | 
 | 	<li>	<font color="ffffff">CPU 0: <tt>kfree(p);</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff"> | 
 | 		CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> | 
 | 		</font> | 
 | 	<li>	<font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> | 
 | 		</font> | 
 | 	</ol> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	And similarly, without a memory barrier between the beginning of the | 
 | 	grace period and the beginning of the RCU read-side critical section, | 
 | 	CPU 1 might end up accessing the freelist. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	The “as if” rule of course applies, so that any | 
 | 	implementation that acts as if the appropriate memory barriers | 
 | 	were in place is a correct implementation. | 
 | 	That said, it is much easier to fool yourself into believing | 
 | 	that you have adhered to the as-if rule than it is to actually | 
 | 	adhere to it! | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> | 
 | 	generate absolutely no code in some kernel builds. | 
 | 	This means that the compiler might arbitrarily rearrange consecutive | 
 | 	RCU read-side critical sections. | 
 | 	Given such rearrangement, if a given RCU read-side critical section | 
 | 	is done, how can you be sure that all prior RCU read-side critical | 
 | 	sections are done? | 
 | 	Won't the compiler rearrangements make that impossible to determine? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> | 
 | 	generate absolutely no code, RCU infers quiescent states only at | 
 | 	special locations, for example, within the scheduler. | 
 | 	Because calls to <tt>schedule()</tt> had better prevent calling-code | 
 | 	accesses to shared variables from being rearranged across the call to | 
 | 	<tt>schedule()</tt>, if RCU detects the end of a given RCU read-side | 
 | 	critical section, it will necessarily detect the end of all prior | 
 | 	RCU read-side critical sections, no matter how aggressively the | 
 | 	compiler scrambles the code. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	Again, this all assumes that the compiler cannot scramble code across | 
 | 	calls to the scheduler, out of interrupt handlers, into the idle loop, | 
 | 	into user-mode code, and so on. | 
 | 	But if your kernel build allows that sort of scrambling, you have broken | 
 | 	far more than just RCU! | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | Note that these memory-barrier requirements do not replace the fundamental | 
 | RCU requirement that a grace period wait for all pre-existing readers. | 
 | On the contrary, the memory barriers called out in this section must operate in | 
 | such a way as to <i>enforce</i> this fundamental requirement. | 
 | Of course, different implementations enforce this requirement in different | 
 | ways, but enforce it they must. | 
 |  | 
 | <h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> | 
 |  | 
 | <p> | 
 | The common-case RCU primitives are unconditional. | 
 | They are invoked, they do their job, and they return, with no possibility | 
 | of error, and no need to retry. | 
 | This is a key RCU design philosophy. | 
 |  | 
 | <p> | 
 | However, this philosophy is pragmatic rather than pigheaded. | 
 | If someone comes up with a good justification for a particular conditional | 
 | RCU primitive, it might well be implemented and added. | 
 | After all, this guarantee was reverse-engineered, not premeditated. | 
The unconditional nature of the RCU primitives was initially an
accident of implementation, and later experience with conditional
synchronization primitives caused me to elevate this accident to
a guarantee.
 | Therefore, the justification for adding a conditional primitive to | 
 | RCU would need to be based on detailed and compelling use cases. | 
 |  | 
 | <h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> | 
 |  | 
 | <p> | 
 | As far as RCU is concerned, it is always possible to carry out an | 
 | update within an RCU read-side critical section. | 
 | For example, that RCU read-side critical section might search for | 
 | a given data element, and then might acquire the update-side | 
 | spinlock in order to update that element, all while remaining | 
 | in that RCU read-side critical section. | 
Of course, it is necessary to exit the RCU read-side critical section
before invoking <tt>synchronize_rcu()</tt>; however, this
 | inconvenience can be avoided through use of the | 
 | <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members | 
 | described later in this document. | 
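<p>
This upgrade pattern can be sketched as follows.  This is a minimal
userspace model rather than kernel code: a pthread mutex stands in for
the update-side spinlock, the RCU markers are stubbed out as the no-ops
they are in non-preemptible kernel builds, and <tt>table</tt>,
<tt>upgrade_and_update()</tt>, and the element layout are hypothetical:

```c
#include <assert.h>
#include <pthread.h>

/* Stubs: in CONFIG_PREEMPT=n kernels, these markers generate no code. */
#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

struct foo {
	int key;
	int data;
};

/* Update-side lock, modeled here by a pthread mutex. */
static pthread_mutex_t my_lock = PTHREAD_MUTEX_INITIALIZER;
static struct foo table[] = { { 1, 10 }, { 2, 20 } };

/* Search under RCU protection, then upgrade to a writer by acquiring
 * the update-side lock, all without leaving the RCU read-side
 * critical section.  Returns 0 on success, -1 if key is not found. */
static int upgrade_and_update(int key, int newdata)
{
	unsigned int i;
	int ret = -1;

	rcu_read_lock();
	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].key == key) {
			pthread_mutex_lock(&my_lock); /* read-to-write upgrade */
			table[i].data = newdata;
			pthread_mutex_unlock(&my_lock);
			ret = 0;
			break;
		}
	}
	rcu_read_unlock();
	return ret;
}
```

Note that the kernel version of this pattern never sleeps while in the
read-side critical section: the update-side lock there is a spinlock,
not a mutex.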
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	But how does the upgrade-to-write operation exclude other readers? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	It doesn't, just like normal RCU updates, which also do not exclude | 
 | 	RCU readers. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | This guarantee allows lookup code to be shared between read-side | 
 | and update-side code, and was premeditated, appearing in the earliest | 
 | DYNIX/ptx RCU documentation. | 
 |  | 
 | <h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> | 
 |  | 
 | <p> | 
 | RCU provides extremely lightweight readers, and its read-side guarantees, | 
 | though quite useful, are correspondingly lightweight. | 
 | It is therefore all too easy to assume that RCU is guaranteeing more | 
 | than it really is. | 
Of course, the list of things that RCU does not guarantee is infinitely
long; however, the following sections list a few non-guarantees that
 | have caused confusion. | 
 | Except where otherwise noted, these non-guarantees were premeditated. | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Readers Impose Minimal Ordering"> | 
 | 	Readers Impose Minimal Ordering</a> | 
 | <li>	<a href="#Readers Do Not Exclude Updaters"> | 
 | 	Readers Do Not Exclude Updaters</a> | 
 | <li>	<a href="#Updaters Only Wait For Old Readers"> | 
 | 	Updaters Only Wait For Old Readers</a> | 
 | <li>	<a href="#Grace Periods Don't Partition Read-Side Critical Sections"> | 
 | 	Grace Periods Don't Partition Read-Side Critical Sections</a> | 
 | <li>	<a href="#Read-Side Critical Sections Don't Partition Grace Periods"> | 
 | 	Read-Side Critical Sections Don't Partition Grace Periods</a> | 
 | <li>	<a href="#Disabling Preemption Does Not Block Grace Periods"> | 
 | 	Disabling Preemption Does Not Block Grace Periods</a> | 
 | </ol> | 
 |  | 
 | <h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3> | 
 |  | 
 | <p> | 
 | Reader-side markers such as <tt>rcu_read_lock()</tt> and | 
 | <tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees | 
 | except through their interaction with the grace-period APIs such as | 
 | <tt>synchronize_rcu()</tt>. | 
 | To see this, consider the following pair of threads: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 void thread0(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   WRITE_ONCE(x, 1); | 
 |  5   rcu_read_unlock(); | 
 |  6   rcu_read_lock(); | 
 |  7   WRITE_ONCE(y, 1); | 
 |  8   rcu_read_unlock(); | 
 |  9 } | 
 | 10 | 
 | 11 void thread1(void) | 
 | 12 { | 
 | 13   rcu_read_lock(); | 
 | 14   r1 = READ_ONCE(y); | 
 | 15   rcu_read_unlock(); | 
 | 16   rcu_read_lock(); | 
 | 17   r2 = READ_ONCE(x); | 
 | 18   rcu_read_unlock(); | 
 | 19 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | After <tt>thread0()</tt> and <tt>thread1()</tt> execute | 
 | concurrently, it is quite possible to have | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 | (r1 == 1 && r2 == 0) | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | (that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>), | 
 | which would not be possible if <tt>rcu_read_lock()</tt> and | 
 | <tt>rcu_read_unlock()</tt> had much in the way of ordering | 
 | properties. | 
 | But they do not, so the CPU is within its rights | 
 | to do significant reordering. | 
 | This is by design:  Any significant ordering constraints would slow down | 
 | these fast-path APIs. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Can't the compiler also reorder this code? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	No, the volatile casts in <tt>READ_ONCE()</tt> and | 
 | 	<tt>WRITE_ONCE()</tt> prevent the compiler from reordering in | 
 | 	this particular case. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3> | 
 |  | 
 | <p> | 
 | Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt> | 
 | exclude updates. | 
All they do is prevent grace periods from ending.
 | The following example illustrates this: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 void thread0(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   r1 = READ_ONCE(y); | 
 |  5   if (r1) { | 
 |  6     do_something_with_nonzero_x(); | 
 |  7     r2 = READ_ONCE(x); | 
 |  8     WARN_ON(!r2); /* BUG!!! */ | 
 |  9   } | 
 | 10   rcu_read_unlock(); | 
 | 11 } | 
 | 12 | 
 | 13 void thread1(void) | 
 | 14 { | 
 | 15   spin_lock(&my_lock); | 
 | 16   WRITE_ONCE(x, 1); | 
 | 17   WRITE_ONCE(y, 1); | 
 | 18   spin_unlock(&my_lock); | 
 | 19 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt> | 
 | excluded the <tt>thread1()</tt> function's update, | 
 | the <tt>WARN_ON()</tt> could never fire. | 
 | But the fact is that <tt>rcu_read_lock()</tt> does not exclude | 
 | much of anything aside from subsequent grace periods, of which | 
 | <tt>thread1()</tt> has none, so the | 
 | <tt>WARN_ON()</tt> can and does fire. | 
 |  | 
 | <h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3> | 
 |  | 
 | <p> | 
 | It might be tempting to assume that after <tt>synchronize_rcu()</tt> | 
 | completes, there are no readers executing. | 
 | This temptation must be avoided because | 
 | new readers can start immediately after <tt>synchronize_rcu()</tt> | 
 | starts, and <tt>synchronize_rcu()</tt> is under no | 
 | obligation to wait for these new readers. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
	Suppose that <tt>synchronize_rcu()</tt> did wait until <i>all</i>
 | 	readers had completed instead of waiting only on | 
 | 	pre-existing readers. | 
 | 	For how long would the updater be able to rely on there | 
 | 	being no readers? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	For no time at all. | 
 | 	Even if <tt>synchronize_rcu()</tt> were to wait until | 
 | 	all readers had completed, a new reader might start immediately after | 
 | 	<tt>synchronize_rcu()</tt> completed. | 
 | 	Therefore, the code following | 
 | 	<tt>synchronize_rcu()</tt> can <i>never</i> rely on there being | 
 | 	no readers. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <h3><a name="Grace Periods Don't Partition Read-Side Critical Sections"> | 
 | Grace Periods Don't Partition Read-Side Critical Sections</a></h3> | 
 |  | 
 | <p> | 
 | It is tempting to assume that if any part of one RCU read-side critical | 
 | section precedes a given grace period, and if any part of another RCU | 
 | read-side critical section follows that same grace period, then all of | 
 | the first RCU read-side critical section must precede all of the second. | 
 | However, this just isn't the case: A single grace period does not | 
 | partition the set of RCU read-side critical sections. | 
An example of this situation can be illustrated as follows, where
<tt>a</tt>, <tt>b</tt>, and <tt>c</tt> are initially all zero:
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 void thread0(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   WRITE_ONCE(a, 1); | 
 |  5   WRITE_ONCE(b, 1); | 
 |  6   rcu_read_unlock(); | 
 |  7 } | 
 |  8 | 
 |  9 void thread1(void) | 
 | 10 { | 
 | 11   r1 = READ_ONCE(a); | 
 | 12   synchronize_rcu(); | 
 | 13   WRITE_ONCE(c, 1); | 
 | 14 } | 
 | 15 | 
 | 16 void thread2(void) | 
 | 17 { | 
 | 18   rcu_read_lock(); | 
 | 19   r2 = READ_ONCE(b); | 
 | 20   r3 = READ_ONCE(c); | 
 | 21   rcu_read_unlock(); | 
 | 22 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | It turns out that the outcome: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 | (r1 == 1 && r2 == 0 && r3 == 1) | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | is entirely possible. | 
The following figure shows how this can happen, with each circled
 | <tt>QS</tt> indicating the point at which RCU recorded a | 
 | <i>quiescent state</i> for each thread, that is, a state in which | 
 | RCU knows that the thread cannot be in the midst of an RCU read-side | 
 | critical section that started before the current grace period: | 
 |  | 
 | <p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p> | 
 |  | 
 | <p> | 
 | If it is necessary to partition RCU read-side critical sections in this | 
 | manner, it is necessary to use two grace periods, where the first | 
 | grace period is known to end before the second grace period starts: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 void thread0(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   WRITE_ONCE(a, 1); | 
 |  5   WRITE_ONCE(b, 1); | 
 |  6   rcu_read_unlock(); | 
 |  7 } | 
 |  8 | 
 |  9 void thread1(void) | 
 | 10 { | 
 | 11   r1 = READ_ONCE(a); | 
 | 12   synchronize_rcu(); | 
 | 13   WRITE_ONCE(c, 1); | 
 | 14 } | 
 | 15 | 
 | 16 void thread2(void) | 
 | 17 { | 
 | 18   r2 = READ_ONCE(c); | 
 | 19   synchronize_rcu(); | 
 | 20   WRITE_ONCE(d, 1); | 
 | 21 } | 
 | 22 | 
 | 23 void thread3(void) | 
 | 24 { | 
 | 25   rcu_read_lock(); | 
 | 26   r3 = READ_ONCE(b); | 
 | 27   r4 = READ_ONCE(d); | 
 | 28   rcu_read_unlock(); | 
 | 29 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | Here, if <tt>(r1 == 1)</tt>, then | 
 | <tt>thread0()</tt>'s write to <tt>b</tt> must happen | 
 | before the end of <tt>thread1()</tt>'s grace period. | 
 | If in addition <tt>(r4 == 1)</tt>, then | 
 | <tt>thread3()</tt>'s read from <tt>b</tt> must happen | 
 | after the beginning of <tt>thread2()</tt>'s grace period. | 
 | If it is also the case that <tt>(r2 == 1)</tt>, then the | 
 | end of <tt>thread1()</tt>'s grace period must precede the | 
 | beginning of <tt>thread2()</tt>'s grace period. | 
This means that the two RCU read-side critical sections cannot overlap,
 | guaranteeing that <tt>(r3 == 1)</tt>. | 
 | As a result, the outcome: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 | (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | cannot happen. | 
 |  | 
 | <p> | 
 | This non-requirement was also non-premeditated, but became apparent | 
 | when studying RCU's interaction with memory ordering. | 
 |  | 
 | <h3><a name="Read-Side Critical Sections Don't Partition Grace Periods"> | 
 | Read-Side Critical Sections Don't Partition Grace Periods</a></h3> | 
 |  | 
 | <p> | 
 | It is also tempting to assume that if an RCU read-side critical section | 
 | happens between a pair of grace periods, then those grace periods cannot | 
 | overlap. | 
 | However, this temptation leads nowhere good, as can be illustrated by | 
 | the following, with all variables initially zero: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 void thread0(void) | 
 |  2 { | 
 |  3   rcu_read_lock(); | 
 |  4   WRITE_ONCE(a, 1); | 
 |  5   WRITE_ONCE(b, 1); | 
 |  6   rcu_read_unlock(); | 
 |  7 } | 
 |  8 | 
 |  9 void thread1(void) | 
 | 10 { | 
 | 11   r1 = READ_ONCE(a); | 
 | 12   synchronize_rcu(); | 
 | 13   WRITE_ONCE(c, 1); | 
 | 14 } | 
 | 15 | 
 | 16 void thread2(void) | 
 | 17 { | 
 | 18   rcu_read_lock(); | 
 | 19   WRITE_ONCE(d, 1); | 
 | 20   r2 = READ_ONCE(c); | 
 | 21   rcu_read_unlock(); | 
 | 22 } | 
 | 23 | 
 | 24 void thread3(void) | 
 | 25 { | 
 | 26   r3 = READ_ONCE(d); | 
 | 27   synchronize_rcu(); | 
 | 28   WRITE_ONCE(e, 1); | 
 | 29 } | 
 | 30 | 
 | 31 void thread4(void) | 
 | 32 { | 
 | 33   rcu_read_lock(); | 
 | 34   r4 = READ_ONCE(b); | 
 | 35   r5 = READ_ONCE(e); | 
 | 36   rcu_read_unlock(); | 
 | 37 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | In this case, the outcome: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 | (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | is entirely possible, as illustrated below: | 
 |  | 
 | <p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p> | 
 |  | 
 | <p> | 
 | Again, an RCU read-side critical section can overlap almost all of a | 
 | given grace period, just so long as it does not overlap the entire | 
 | grace period. | 
 | As a result, an RCU read-side critical section cannot partition a pair | 
 | of RCU grace periods. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	How long a sequence of grace periods, each separated by an RCU | 
 | 	read-side critical section, would be required to partition the RCU | 
 | 	read-side critical sections at the beginning and end of the chain? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	In theory, an infinite number. | 
 | 	In practice, an unknown number that is sensitive to both implementation | 
 | 	details and timing considerations. | 
 | 	Therefore, even in practice, RCU users must abide by the | 
 | 	theoretical rather than the practical answer. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <h3><a name="Disabling Preemption Does Not Block Grace Periods"> | 
 | Disabling Preemption Does Not Block Grace Periods</a></h3> | 
 |  | 
 | <p> | 
 | There was a time when disabling preemption on any given CPU would block | 
 | subsequent grace periods. | 
 | However, this was an accident of implementation and is not a requirement. | 
 | And in the current Linux-kernel implementation, disabling preemption | 
 | on a given CPU in fact does not block grace periods, as Oleg Nesterov | 
 | <a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>. | 
 |  | 
 | <p> | 
 | If you need a preempt-disable region to block grace periods, you need to add | 
 | <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example | 
 | as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 preempt_disable(); | 
 |  2 rcu_read_lock(); | 
 |  3 do_something(); | 
 |  4 rcu_read_unlock(); | 
 |  5 preempt_enable(); | 
 |  6 | 
 |  7 /* Spinlocks implicitly disable preemption. */ | 
 |  8 spin_lock(&mylock); | 
 |  9 rcu_read_lock(); | 
 | 10 do_something(); | 
 | 11 rcu_read_unlock(); | 
 | 12 spin_unlock(&mylock); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | In theory, you could enter the RCU read-side critical section first, | 
 | but it is more efficient to keep the entire RCU read-side critical | 
 | section contained in the preempt-disable region as shown above. | 
 | Of course, RCU read-side critical sections that extend outside of | 
 | preempt-disable regions will work correctly, but such critical sections | 
 | can be preempted, which forces <tt>rcu_read_unlock()</tt> to do | 
 | more work. | 
 | And no, this is <i>not</i> an invitation to enclose all of your RCU | 
 | read-side critical sections within preempt-disable regions, because | 
 | doing so would degrade real-time response. | 
 |  | 
 | <p> | 
 | This non-requirement appeared with preemptible RCU. | 
 | If you need a grace period that waits on non-preemptible code regions, use | 
 | <a href="#Sched Flavor">RCU-sched</a>. | 
 |  | 
 | <h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> | 
 |  | 
 | <p> | 
 | These parallelism facts of life are by no means specific to RCU, but | 
 | the RCU implementation must abide by them. | 
 | They therefore bear repeating: | 
 |  | 
 | <ol> | 
 | <li>	Any CPU or task may be delayed at any time, | 
 | 	and any attempts to avoid these delays by disabling | 
 | 	preemption, interrupts, or whatever are completely futile. | 
 | 	This is most obvious in preemptible user-level | 
 | 	environments and in virtualized environments (where | 
 | 	a given guest OS's VCPUs can be preempted at any time by | 
 | 	the underlying hypervisor), but can also happen in bare-metal | 
 | 	environments due to ECC errors, NMIs, and other hardware | 
 | 	events. | 
 | 	Although a delay of more than about 20 seconds can result | 
 | 	in splats, the RCU implementation is obligated to use | 
 | 	algorithms that can tolerate extremely long delays, but where | 
 | 	“extremely long” is not long enough to allow | 
 | 	wrap-around when incrementing a 64-bit counter. | 
 | <li>	Both the compiler and the CPU can reorder memory accesses. | 
 | 	Where it matters, RCU must use compiler directives and | 
 | 	memory-barrier instructions to preserve ordering. | 
 | <li>	Conflicting writes to memory locations in any given cache line | 
 | 	will result in expensive cache misses. | 
 | 	Greater numbers of concurrent writes and more-frequent | 
 | 	concurrent writes will result in more dramatic slowdowns. | 
 | 	RCU is therefore obligated to use algorithms that have | 
 | 	sufficient locality to avoid significant performance and | 
 | 	scalability problems. | 
 | <li>	As a rough rule of thumb, only one CPU's worth of processing | 
 | 	may be carried out under the protection of any given exclusive | 
 | 	lock. | 
 | 	RCU must therefore use scalable locking designs. | 
 | <li>	Counters are finite, especially on 32-bit systems. | 
 | 	RCU's use of counters must therefore tolerate counter wrap, | 
 | 	or be designed such that counter wrap would take way more | 
 | 	time than a single system is likely to run. | 
 | 	An uptime of ten years is quite possible, a runtime | 
 | 	of a century much less so. | 
 | 	As an example of the latter, RCU's dyntick-idle nesting counter | 
 | 	allows 54 bits for interrupt nesting level (this counter | 
 | 	is 64 bits even on a 32-bit system). | 
 | 	Overflowing this counter requires 2<sup>54</sup> | 
 | 	half-interrupts on a given CPU without that CPU ever going idle. | 
 | 	If a half-interrupt happened every microsecond, it would take | 
 | 	570 years of runtime to overflow this counter, which is currently | 
 | 	believed to be an acceptably long time. | 
 | <li>	Linux systems can have thousands of CPUs running a single | 
 | 	Linux kernel in a single shared-memory environment. | 
 | 	RCU must therefore pay close attention to high-end scalability. | 
 | </ol> | 
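<p>
The counter-wrap item above can be made concrete.
The Linux kernel handles free-running counters by comparing them with
modular arithmetic, along the lines of the following sketch (the helper
mirrors the kernel's <tt>ULONG_CMP_GE()</tt> macro; the demo function
name is hypothetical):

```c
#include <assert.h>
#include <limits.h>

/* Wrap-tolerant "a is at or after b" comparison for free-running
 * unsigned counters, along the lines of the kernel's ULONG_CMP_GE().
 * The unsigned subtraction is modular, so the result remains correct
 * across counter wrap as long as the two values are within
 * ULONG_MAX/2 of each other. */
static int counter_at_or_after(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}
```

A naive <tt>&gt;=</tt> comparison would be fooled the moment the counter
wrapped past zero; the modular version keeps working.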
 |  | 
 | <p> | 
 | This last parallelism fact of life means that RCU must pay special | 
 | attention to the preceding facts of life. | 
The idea that Linux might scale to systems with thousands of CPUs would
have been met with some skepticism in the 1990s, but these requirements
would otherwise have been unsurprising, even in the early 1990s.
 |  | 
 | <h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2> | 
 |  | 
 | <p> | 
 | These sections list quality-of-implementation requirements. | 
 | Although an RCU implementation that ignores these requirements could | 
 | still be used, it would likely be subject to limitations that would | 
 | make it inappropriate for industrial-strength production use. | 
 | Classes of quality-of-implementation requirements are as follows: | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Specialization">Specialization</a> | 
 | <li>	<a href="#Performance and Scalability">Performance and Scalability</a> | 
 | <li>	<a href="#Composability">Composability</a> | 
 | <li>	<a href="#Corner Cases">Corner Cases</a> | 
 | </ol> | 
 |  | 
 | <p> | 
These classes are covered in the following sections.
 |  | 
 | <h3><a name="Specialization">Specialization</a></h3> | 
 |  | 
 | <p> | 
 | RCU is and always has been intended primarily for read-mostly situations, | 
 | which means that RCU's read-side primitives are optimized, often at the | 
 | expense of its update-side primitives. | 
 | Experience thus far is captured by the following list of situations: | 
 |  | 
 | <ol> | 
 | <li>	Read-mostly data, where stale and inconsistent data is not | 
 | 	a problem:   RCU works great! | 
 | <li>	Read-mostly data, where data must be consistent: | 
 | 	RCU works well. | 
 | <li>	Read-write data, where data must be consistent: | 
 | 	RCU <i>might</i> work OK. | 
 | 	Or not. | 
 | <li>	Write-mostly data, where data must be consistent: | 
 | 	RCU is very unlikely to be the right tool for the job, | 
 | 	with the following exceptions, where RCU can provide: | 
 | 	<ol type=a> | 
 | 	<li>	Existence guarantees for update-friendly mechanisms. | 
 | 	<li>	Wait-free read-side primitives for real-time use. | 
 | 	</ol> | 
 | </ol> | 
 |  | 
 | <p> | 
 | This focus on read-mostly situations means that RCU must interoperate | 
 | with other synchronization primitives. | 
 | For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt> | 
 | examples discussed earlier use RCU to protect readers and locking to | 
 | coordinate updaters. | 
 | However, the need extends much farther, requiring that a variety of | 
 | synchronization primitives be legal within RCU read-side critical sections, | 
 | including spinlocks, sequence locks, atomic operations, reference | 
 | counters, and memory barriers. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	What about sleeping locks? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	These are forbidden within Linux-kernel RCU read-side critical | 
 | 	sections because it is not legal to place a quiescent state | 
 | 	(in this case, voluntary context switch) within an RCU read-side | 
 | 	critical section. | 
 | 	However, sleeping locks may be used within userspace RCU read-side | 
 | 	critical sections, and also within Linux-kernel sleepable RCU | 
 | 	<a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a> | 
 | 	read-side critical sections. | 
	In addition, the -rt patchset turns spinlocks into
	sleeping locks so that the corresponding critical sections
	can be preempted, which also means that these sleeplockified
	spinlocks (but not other sleeping locks!) may be acquired within
	-rt-Linux-kernel RCU read-side critical sections.
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
	Note that it <i>is</i> legal for a normal RCU read-side
	critical section to conditionally acquire a sleeping lock
	(as in <tt>mutex_trylock()</tt>), but only as long as it does
	not loop indefinitely attempting to conditionally acquire that
	sleeping lock.
 | 	The key point is that things like <tt>mutex_trylock()</tt> | 
 | 	either return with the mutex held, or return an error indication if | 
 | 	the mutex was not immediately available. | 
 | 	Either way, <tt>mutex_trylock()</tt> returns immediately without | 
 | 	sleeping. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
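<p>
The conditional-acquisition pattern described in the preceding quick
quiz's answer looks something like the following.  Again a userspace
sketch: <tt>pthread_mutex_trylock()</tt> stands in for the kernel's
<tt>mutex_trylock()</tt>, the RCU markers are stubbed out as no-ops,
and <tt>try_update()</tt> is hypothetical:

```c
#include <assert.h>
#include <pthread.h>

/* Stubs: in CONFIG_PREEMPT=n kernels, these markers generate no code. */
#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

static pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Conditionally acquire a sleeping lock inside an RCU read-side
 * critical section.  Like the kernel's mutex_trylock(),
 * pthread_mutex_trylock() returns immediately whether or not it got
 * the lock, so no quiescent state can occur within the critical
 * section.  Returns 1 if the update happened, 0 otherwise. */
static int try_update(int *val, int newval)
{
	int ret = 0;

	rcu_read_lock();
	if (pthread_mutex_trylock(&my_mutex) == 0) {
		*val = newval;                  /* got the lock: do the update */
		pthread_mutex_unlock(&my_mutex);
		ret = 1;
	}                                       /* else: do NOT loop retrying here */
	rcu_read_unlock();
	return ret;
}
```

The key design point is the absence of a retry loop: looping on the
trylock inside the critical section would amount to sleeping in
slow motion.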
 |  | 
 | <p> | 
It often comes as a surprise, but many algorithms do not require a
consistent view of data, with network routing being the poster child.
 | Internet routing algorithms take significant time to propagate | 
 | updates, so that by the time an update arrives at a given system, | 
 | that system has been sending network traffic the wrong way for | 
 | a considerable length of time. | 
 | Having a few threads continue to send traffic the wrong way for a | 
 | few more milliseconds is clearly not a problem:  In the worst case, | 
 | TCP retransmissions will eventually get the data where it needs to go. | 
 | In general, when tracking the state of the universe outside of the | 
 | computer, some level of inconsistency must be tolerated due to | 
 | speed-of-light delays if nothing else. | 
 |  | 
 | <p> | 
 | Furthermore, uncertainty about external state is inherent in many cases. | 
 | For example, a pair of veterinarians might use heartbeat to determine | 
 | whether or not a given cat was alive. | 
 | But how long should they wait after the last heartbeat to decide that | 
 | the cat is in fact dead? | 
 | Waiting less than 400 milliseconds makes no sense because this would | 
 | mean that a relaxed cat would be considered to cycle between death | 
 | and life more than 100 times per minute. | 
 | Moreover, just as with human beings, a cat's heart might stop for | 
 | some period of time, so the exact wait period is a judgment call. | 
 | One of our pair of veterinarians might wait 30 seconds before pronouncing | 
 | the cat dead, while the other might insist on waiting a full minute. | 
 | The two veterinarians would then disagree on the state of the cat during | 
 | the final 30 seconds of the minute following the last heartbeat. | 
 |  | 
 | <p> | 
 | Interestingly enough, this same situation applies to hardware. | 
 | When push comes to shove, how do we tell whether or not some | 
 | external server has failed? | 
 | We send messages to it periodically, and declare it failed if we | 
 | don't receive a response within a given period of time. | 
 | Policy decisions can usually tolerate short | 
 | periods of inconsistency. | 
 | The policy was decided some time ago, and is only now being put into | 
 | effect, so a few milliseconds of delay is normally inconsequential. | 
 |  | 
 | <p> | 
 | However, there are algorithms that absolutely must see consistent data. | 
For example, the translation from a user-level SystemV semaphore
ID to the corresponding in-kernel data structure is protected by RCU,
 | but it is absolutely forbidden to update a semaphore that has just been | 
 | removed. | 
 | In the Linux kernel, this need for consistency is accommodated by acquiring | 
 | spinlocks located in the in-kernel data structure from within | 
 | the RCU read-side critical section, and this is indicated by the | 
 | green box in the figure above. | 
 | Many other techniques may be used, and are in fact used within the | 
 | Linux kernel. | 
 |  | 
 | <p> | 
 | In short, RCU is not required to maintain consistency, and other | 
 | mechanisms may be used in concert with RCU when consistency is required. | 
 | RCU's specialization allows it to do its job extremely well, and its | 
 | ability to interoperate with other synchronization mechanisms allows | 
 | the right mix of synchronization tools to be used for a given job. | 
 |  | 
 | <h3><a name="Performance and Scalability">Performance and Scalability</a></h3> | 
 |  | 
 | <p> | 
 | Energy efficiency is a critical component of performance today, | 
 | and Linux-kernel RCU implementations must therefore avoid unnecessarily | 
 | awakening idle CPUs. | 
 | I cannot claim that this requirement was premeditated. | 
 | In fact, I learned of it during a telephone conversation in which I | 
 | was given “frank and open” feedback on the importance | 
 | of energy efficiency in battery-powered systems and on specific | 
 | energy-efficiency shortcomings of the Linux-kernel RCU implementation. | 
 | In my experience, the battery-powered embedded community will consider | 
 | any unnecessary wakeups to be extremely unfriendly acts. | 
 | So much so that mere Linux-kernel-mailing-list posts are | 
 | insufficient to vent their ire. | 
 |  | 
 | <p> | 
Memory consumption is not particularly important in most
situations, and has become decreasingly
so as memory sizes have expanded and memory
costs have plummeted.
 | However, as I learned from Matt Mackall's | 
 | <a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> | 
 | efforts, memory footprint is critically important on single-CPU systems with | 
 | non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus | 
 | <a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> | 
 | was born. | 
 | Josh Triplett has since taken over the small-memory banner with his | 
 | <a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> | 
 | project, which resulted in | 
 | <a href="#Sleepable RCU">SRCU</a> | 
 | becoming optional for those kernels not needing it. | 
 |  | 
 | <p> | 
 | The remaining performance requirements are, for the most part, | 
 | unsurprising. | 
 | For example, in keeping with RCU's read-side specialization, | 
 | <tt>rcu_dereference()</tt> should have negligible overhead (for | 
 | example, suppression of a few minor compiler optimizations). | 
 | Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and | 
 | <tt>rcu_read_unlock()</tt> should have exactly zero overhead. | 
 |  | 
 | <p> | 
 | In preemptible environments, in the case where the RCU read-side | 
 | critical section was not preempted (as will be the case for the | 
 | highest-priority real-time process), <tt>rcu_read_lock()</tt> and | 
 | <tt>rcu_read_unlock()</tt> should have minimal overhead. | 
 | In particular, they should not contain atomic read-modify-write | 
 | operations, memory-barrier instructions, preemption disabling, | 
 | interrupt disabling, or backwards branches. | 
 | However, in the case where the RCU read-side critical section was preempted, | 
 | <tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. | 
 | This is why it is better to nest an RCU read-side critical section | 
 | within a preempt-disable region than vice versa, at least in cases | 
 | where that critical section is short enough to avoid unduly degrading | 
 | real-time latencies. | 
 |  | 
 | <p> | 
 | The <tt>synchronize_rcu()</tt> grace-period-wait primitive is | 
 | optimized for throughput. | 
 | It may therefore incur several milliseconds of latency in addition to | 
 | the duration of the longest RCU read-side critical section. | 
 | On the other hand, multiple concurrent invocations of | 
 | <tt>synchronize_rcu()</tt> are required to use batching optimizations | 
 | so that they can be satisfied by a single underlying grace-period-wait | 
 | operation. | 
 | For example, in the Linux kernel, it is not unusual for a single | 
 | grace-period-wait operation to serve more than | 
 | <a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> | 
 | of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation | 
 | overhead down to nearly zero. | 
 | However, the grace-period optimization is also required to avoid | 
 | measurable degradation of real-time scheduling and interrupt latencies. | 
 |  | 
 | <p> | 
 | In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> | 
 | latencies are unacceptable. | 
 | In these cases, <tt>synchronize_rcu_expedited()</tt> may be used | 
 | instead, reducing the grace-period latency down to a few tens of | 
 | microseconds on small systems, at least in cases where the RCU read-side | 
 | critical sections are short. | 
 | There are currently no special latency requirements for | 
 | <tt>synchronize_rcu_expedited()</tt> on large systems, but, | 
 | consistent with the empirical nature of the RCU specification, | 
 | that is subject to change. | 
 | However, there most definitely are scalability requirements: | 
 | A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 | 
 | CPUs should at least make reasonable forward progress. | 
 | In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> | 
 | is permitted to impose modest degradation of real-time latency | 
 | on non-idle online CPUs. | 
 | Here, “modest” means roughly the same latency | 
 | degradation as a scheduling-clock interrupt. | 
 |  | 
 | <p> | 
 | There are a number of situations where even | 
 | <tt>synchronize_rcu_expedited()</tt>'s reduced grace-period | 
 | latency is unacceptable. | 
 | In these situations, the asynchronous <tt>call_rcu()</tt> can be | 
 | used in place of <tt>synchronize_rcu()</tt> as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 struct foo { | 
 |  2   int a; | 
 |  3   int b; | 
 |  4   struct rcu_head rh; | 
 |  5 }; | 
 |  6 | 
 |  7 static void remove_gp_cb(struct rcu_head *rhp) | 
 |  8 { | 
 |  9   struct foo *p = container_of(rhp, struct foo, rh); | 
 | 10 | 
 | 11   kfree(p); | 
 | 12 } | 
 | 13 | 
 | 14 bool remove_gp_asynchronous(void) | 
 | 15 { | 
 | 16   struct foo *p; | 
 | 17 | 
 | 18   spin_lock(&gp_lock); | 
 | 19   p = rcu_access_pointer(gp); | 
 | 20   if (!p) { | 
 | 21     spin_unlock(&gp_lock); | 
 | 22     return false; | 
 | 23   } | 
 | 24   rcu_assign_pointer(gp, NULL); | 
 | 25   call_rcu(&p->rh, remove_gp_cb); | 
 | 26   spin_unlock(&gp_lock); | 
 | 27   return true; | 
 | 28 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | A definition of <tt>struct foo</tt> is finally needed, and appears | 
 | on lines 1-5. | 
 | The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> | 
 | on line 25, and will be invoked after the end of a subsequent | 
 | grace period. | 
 | This gets the same effect as <tt>remove_gp_synchronous()</tt>, | 
 | but without forcing the updater to wait for a grace period to elapse. | 
 | The <tt>call_rcu()</tt> function may be used in a number of | 
 | situations where neither <tt>synchronize_rcu()</tt> nor | 
 | <tt>synchronize_rcu_expedited()</tt> would be legal, | 
 | including within preempt-disable code, <tt>local_bh_disable()</tt> code, | 
 | interrupt-disable code, and interrupt handlers. | 
 | However, even <tt>call_rcu()</tt> is illegal within NMI handlers | 
 | and from idle and offline CPUs. | 
 | The callback function (<tt>remove_gp_cb()</tt> in this case) will be | 
 | executed within a softirq (software interrupt) environment within the | 
 | Linux kernel, | 
 | either within a real softirq handler or under the protection | 
 | of <tt>local_bh_disable()</tt>. | 
 | In both the Linux kernel and in userspace, it is bad practice to | 
 | write an RCU callback function that takes too long. | 
 | Long-running operations should be relegated to separate threads or | 
 | (in the Linux kernel) workqueues. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Why does line 19 use <tt>rcu_access_pointer()</tt>? | 
 | 	After all, <tt>call_rcu()</tt> on line 25 stores into the | 
 | 	structure, which would interact badly with concurrent insertions. | 
 | 	Doesn't this mean that <tt>rcu_dereference()</tt> is required? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes | 
 | 	any changes, including any insertions that <tt>rcu_dereference()</tt> | 
 | 	would protect against. | 
 | 	Therefore, any insertions will be delayed until after | 
 | 	<tt>->gp_lock</tt> | 
 | 	is released on line 26, which in turn means that | 
 | 	<tt>rcu_access_pointer()</tt> suffices. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | However, all that <tt>remove_gp_cb()</tt> is doing is | 
 | invoking <tt>kfree()</tt> on the data element. | 
 | This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, | 
 | which allows “fire and forget” operation as shown below: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 struct foo { | 
 |  2   int a; | 
 |  3   int b; | 
 |  4   struct rcu_head rh; | 
 |  5 }; | 
 |  6 | 
 |  7 bool remove_gp_faf(void) | 
 |  8 { | 
 |  9   struct foo *p; | 
 | 10 | 
 | 11   spin_lock(&gp_lock); | 
 | 12   p = rcu_access_pointer(gp); | 
 | 13   if (!p) { | 
 | 14     spin_unlock(&gp_lock); | 
 | 15     return false; | 
 | 16   } | 
 | 17   rcu_assign_pointer(gp, NULL); | 
 | 18   kfree_rcu(p, rh); | 
 | 19   spin_unlock(&gp_lock); | 
 | 20   return true; | 
 | 21 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | Note that <tt>remove_gp_faf()</tt> simply invokes | 
 | <tt>kfree_rcu()</tt> and proceeds, without any need to pay any | 
 | further attention to the subsequent grace period and <tt>kfree()</tt>. | 
 | It is permissible to invoke <tt>kfree_rcu()</tt> from the same | 
 | environments as for <tt>call_rcu()</tt>. | 
 | Interestingly enough, DYNIX/ptx had the equivalents of | 
 | <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not | 
 | <tt>synchronize_rcu()</tt>. | 
 | This was due to the fact that RCU was not heavily used within DYNIX/ptx, | 
 | so the very few places that needed something like | 
 | <tt>synchronize_rcu()</tt> simply open-coded it. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Earlier it was claimed that <tt>call_rcu()</tt> and | 
 | 	<tt>kfree_rcu()</tt> allowed updaters to avoid being blocked | 
 | 	by readers. | 
 | 	But how can that be correct, given that the invocation of the callback | 
 | 	and the freeing of the memory (respectively) must still wait for | 
 | 	a grace period to elapse? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	We could define things this way, but keep in mind that this sort of | 
 | 	definition would say that updates in garbage-collected languages | 
 | 	cannot complete until the next time the garbage collector runs, | 
 | 	which does not seem at all reasonable. | 
 | 	The key point is that in most cases, an updater using either | 
 | 	<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the | 
 | 	next update as soon as it has invoked <tt>call_rcu()</tt> or | 
 | 	<tt>kfree_rcu()</tt>, without having to wait for a subsequent | 
 | 	grace period. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
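<p>
As noted above, DYNIX/ptx open-coded its grace-period waits in terms of
its <tt>call_rcu()</tt> equivalent. The following userspace toy model is a
minimal sketch of that pattern under a simulated, single-threaded
&ldquo;grace period&rdquo;; all <tt>toy_</tt>-prefixed names are
hypothetical rather than Linux-kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of open-coding a synchronous grace-period wait in terms
 * of an asynchronous callback.  Everything toy_-prefixed here is
 * hypothetical scaffolding, not the Linux-kernel implementation. */

struct toy_rcu_head {
	void (*func)(struct toy_rcu_head *rhp);
	struct toy_rcu_head *next;
};

static struct toy_rcu_head *toy_cblist;

/* Queue a callback for invocation after the next "grace period". */
static void toy_call_rcu(struct toy_rcu_head *rhp,
			 void (*func)(struct toy_rcu_head *rhp))
{
	rhp->func = func;
	rhp->next = toy_cblist;
	toy_cblist = rhp;
}

/* Stand-in for grace-period completion: invoke all queued callbacks. */
static void toy_run_grace_period(void)
{
	struct toy_rcu_head *rhp = toy_cblist;

	toy_cblist = NULL;
	while (rhp) {
		struct toy_rcu_head *next = rhp->next;

		rhp->func(rhp);
		rhp = next;
	}
}

/* Open-coded synchronous wait: queue a callback that sets a flag,
 * then wait for that flag.  Returns the flag's final value. */
struct toy_rcu_synchronize {
	struct toy_rcu_head head;	/* must be the first member */
	int done;
};

static void toy_wakeme_after_rcu(struct toy_rcu_head *rhp)
{
	((struct toy_rcu_synchronize *)rhp)->done = 1;
}

static int toy_synchronize_rcu(void)
{
	struct toy_rcu_synchronize rs = { .done = 0 };

	toy_call_rcu(&rs.head, toy_wakeme_after_rcu);
	toy_run_grace_period();	/* a kernel would block here instead */
	return rs.done;
}
```

<p>
The Linux kernel's <tt>synchronize_rcu()</tt> is built along broadly
similar lines, queuing a callback that posts a completion on which the
caller blocks.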
 |  | 
 | <p> | 
 | But what if the updater must wait for the completion of code to be | 
 | executed after the end of the grace period, but has other tasks | 
 | that can be carried out in the meantime? | 
 | The polling-style <tt>get_state_synchronize_rcu()</tt> and | 
 | <tt>cond_synchronize_rcu()</tt> functions may be used for this | 
 | purpose, as shown below: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 bool remove_gp_poll(void) | 
 |  2 { | 
 |  3   struct foo *p; | 
 |  4   unsigned long s; | 
 |  5 | 
 |  6   spin_lock(&gp_lock); | 
 |  7   p = rcu_access_pointer(gp); | 
 |  8   if (!p) { | 
 |  9     spin_unlock(&gp_lock); | 
 | 10     return false; | 
 | 11   } | 
 | 12   rcu_assign_pointer(gp, NULL); | 
 | 13   spin_unlock(&gp_lock); | 
 | 14   s = get_state_synchronize_rcu(); | 
 | 15   do_something_while_waiting(); | 
 | 16   cond_synchronize_rcu(s); | 
 | 17   kfree(p); | 
 | 18   return true; | 
 | 19 } | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a | 
 | “cookie” from RCU, | 
 | then line 15 carries out other tasks, | 
 | and finally, line 16 returns immediately if a grace period has | 
 | elapsed in the meantime, but otherwise waits as required. | 
 | The need for <tt>get_state_synchronize_rcu()</tt> and | 
 | <tt>cond_synchronize_rcu()</tt> has appeared quite recently, | 
 | so it is too early to tell whether they will stand the test of time. | 
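<p>
The cookie handshake above can be modeled with a single counter. This
sketch captures only the semantics of the cookie, assuming a simulated
grace-period counter; the <tt>toy_</tt>-prefixed names are hypothetical:

```c
#include <assert.h>

/* Toy model of the polling grace-period interface.  The real
 * get_state_synchronize_rcu() returns an opaque cookie derived from
 * RCU's internal grace-period state; this sketch models only the
 * cookie semantics. */

static unsigned long toy_gp_completed;	/* grace periods completed */

/* Cookie: "done once this many grace periods have completed". */
static unsigned long toy_get_state(void)
{
	return toy_gp_completed + 1;
}

/* Stand-in for the completion of one grace period. */
static void toy_complete_gp(void)
{
	toy_gp_completed++;
}

/* Nonzero if cond_synchronize_rcu(cookie) could return immediately;
 * zero if it would still need to block for a grace period. */
static int toy_gp_done(unsigned long cookie)
{
	return toy_gp_completed >= cookie;
}
```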
 |  | 
 | <p> | 
 | RCU thus provides a range of tools to allow updaters to strike the | 
 | required tradeoff between latency, flexibility, and CPU overhead. | 
 |  | 
 | <h3><a name="Composability">Composability</a></h3> | 
 |  | 
 | <p> | 
 | Composability has received much attention in recent years, perhaps in part | 
 | due to the collision of multicore hardware with object-oriented techniques | 
 | designed in single-threaded environments for single-threaded use. | 
 | And in theory, RCU read-side critical sections may be composed, and in | 
 | fact may be nested arbitrarily deeply. | 
 | In practice, as with all real-world implementations of composable | 
 | constructs, there are limitations. | 
 |  | 
 | <p> | 
 | Implementations of RCU for which <tt>rcu_read_lock()</tt> | 
 | and <tt>rcu_read_unlock()</tt> generate no code, such as | 
 | Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be | 
 | nested arbitrarily deeply. | 
 | After all, there is no overhead. | 
 | Except that if all these instances of <tt>rcu_read_lock()</tt> | 
 | and <tt>rcu_read_unlock()</tt> are visible to the compiler, | 
 | compilation will eventually fail due to exhausting memory, | 
 | mass storage, or user patience, whichever comes first. | 
 | If the nesting is not visible to the compiler, as is the case with | 
 | mutually recursive functions each in its own translation unit, | 
 | stack overflow will result. | 
 | If the nesting takes the form of loops, perhaps in the guise of tail | 
 | recursion, either the control variable | 
 | will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. | 
 | Nevertheless, this class of RCU implementations is one | 
 | of the most composable constructs in existence. | 
 |  | 
 | <p> | 
 | RCU implementations that explicitly track nesting depth | 
 | are limited by the nesting-depth counter. | 
 | For example, the Linux kernel's preemptible RCU limits nesting to | 
 | <tt>INT_MAX</tt>. | 
 | This should suffice for almost all practical purposes. | 
 | That said, a consecutive pair of RCU read-side critical sections | 
 | between which there is an operation that waits for a grace period | 
 | cannot be enclosed in another RCU read-side critical section. | 
 | This is because it is not legal to wait for a grace period within | 
 | an RCU read-side critical section:  To do so would result either | 
 | in deadlock or | 
 | in RCU implicitly splitting the enclosing RCU read-side critical | 
 | section, neither of which is conducive to a long-lived and prosperous | 
 | kernel. | 
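<p>
The nesting-depth limitation can be illustrated with a toy counter
modeled loosely on preemptible RCU's explicit nesting tracking. This is a
sketch of the rule only (a grace-period wait is legal solely at nesting
depth zero), with hypothetical <tt>toy_</tt>-prefixed names:

```c
#include <assert.h>

/* Toy model of an explicit read-side nesting counter.  The point is
 * that a grace-period wait is legal only outside all readers, which
 * is why two critical sections separated by such a wait cannot be
 * enclosed in a third. */

static int toy_nesting;		/* this task's read-side nesting depth */

static void toy_read_lock(void)
{
	toy_nesting++;		/* bounded by INT_MAX in the real kernel */
}

static void toy_read_unlock(void)
{
	toy_nesting--;
}

/* A grace-period wait may proceed only at nesting depth zero. */
static int toy_gp_wait_legal(void)
{
	return toy_nesting == 0;
}
```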
 |  | 
 | <p> | 
 | It is worth noting that RCU is not alone in limiting composability. | 
 | For example, many transactional-memory implementations prohibit | 
 | composing a pair of transactions separated by an irrevocable | 
 | operation (for example, a network receive operation). | 
 | For another example, lock-based critical sections can be composed | 
 | surprisingly freely, but only if deadlock is avoided. | 
 |  | 
 | <p> | 
 | In short, although RCU read-side critical sections are highly composable, | 
 | care is required in some situations, just as is the case for any other | 
 | composable synchronization mechanism. | 
 |  | 
 | <h3><a name="Corner Cases">Corner Cases</a></h3> | 
 |  | 
 | <p> | 
 | A given RCU workload might have an endless and intense stream of | 
 | RCU read-side critical sections, perhaps even so intense that there | 
 | was never a point in time during which there was not at least one | 
 | RCU read-side critical section in flight. | 
 | RCU cannot allow this situation to block grace periods:  As long as | 
 | all the RCU read-side critical sections are finite, grace periods | 
 | must also be finite. | 
 |  | 
 | <p> | 
 | That said, preemptible RCU implementations could potentially result | 
 | in RCU read-side critical sections being preempted for long durations, | 
 | which has the effect of creating a long-duration RCU read-side | 
 | critical section. | 
 | This situation can arise only in heavily loaded systems, but systems using | 
 | real-time priorities are of course more vulnerable. | 
 | Therefore, RCU priority boosting is provided to help deal with this | 
 | case. | 
 | That said, the exact requirements on RCU priority boosting will likely | 
 | evolve as more experience accumulates. | 
 |  | 
 | <p> | 
 | Other workloads might have very high update rates. | 
 | Although one can argue that such workloads should instead use | 
 | something other than RCU, the fact remains that RCU must | 
 | handle such workloads gracefully. | 
 | This requirement is another factor driving batching of grace periods, | 
 | but it is also the driving force behind the checks for large numbers | 
 | of queued RCU callbacks in the <tt>call_rcu()</tt> code path. | 
 | Finally, high update rates should not delay RCU read-side critical | 
 | sections, although some small read-side delays can occur when using | 
 | <tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use | 
 | of <tt>smp_call_function_single()</tt>. | 
 |  | 
 | <p> | 
 | Although all three of these corner cases were understood in the early | 
 | 1990s, a simple user-level test consisting of <tt>close(open(path))</tt> | 
 | in a tight loop | 
 | in the early 2000s suddenly provided a much deeper appreciation of the | 
 | high-update-rate corner case. | 
 | This test also motivated addition of some RCU code to react to high update | 
 | rates, for example, if a given CPU finds itself with more than 10,000 | 
 | RCU callbacks queued, it will cause RCU to take evasive action by | 
 | more aggressively starting grace periods and more aggressively forcing | 
 | completion of grace-period processing. | 
 | This evasive action causes the grace period to complete more quickly, | 
 | but at the cost of restricting RCU's batching optimizations, thus | 
 | increasing the CPU overhead incurred by that grace period. | 
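<p>
The check for large numbers of queued callbacks can be sketched as a
simple high-water mark, using the 10,000-callback threshold mentioned
above; the struct and function names here are hypothetical:

```c
#include <assert.h>

/* Toy model of call_rcu()'s reaction to high update rates: past a
 * high-water mark of queued callbacks on a CPU, RCU takes evasive
 * action, starting and forcing grace periods more aggressively. */

#define TOY_QHIMARK 10000L

struct toy_cpu_cbs {
	long qlen;	/* callbacks queued on this CPU */
	int evasive;	/* aggressively force grace periods? */
};

static void toy_enqueue_callback(struct toy_cpu_cbs *cp)
{
	cp->qlen++;
	if (cp->qlen > TOY_QHIMARK)
		cp->evasive = 1;  /* restrict batching, speed up the GP */
}
```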
 |  | 
 | <h2><a name="Software-Engineering Requirements"> | 
 | Software-Engineering Requirements</a></h2> | 
 |  | 
 | <p> | 
 | Between Murphy's Law and “To err is human”, it is necessary to | 
 | guard against mishaps and misuse: | 
 |  | 
 | <ol> | 
 | <li>	It is all too easy to forget to use <tt>rcu_read_lock()</tt> | 
 | 	everywhere that it is needed, so kernels built with | 
 | 	<tt>CONFIG_PROVE_RCU=y</tt> will splat if | 
 | 	<tt>rcu_dereference()</tt> is used outside of an | 
 | 	RCU read-side critical section. | 
 | 	Update-side code can use <tt>rcu_dereference_protected()</tt>, | 
 | 	which takes a | 
 | 	<a href="https://lwn.net/Articles/371986/">lockdep expression</a> | 
 | 	to indicate what is providing the protection. | 
 | 	If the indicated protection is not provided, a lockdep splat | 
 | 	is emitted. | 
 |  | 
 | 	<p> | 
 | 	Code shared between readers and updaters can use | 
 | 	<tt>rcu_dereference_check()</tt>, which also takes a | 
 | 	lockdep expression, and emits a lockdep splat if neither | 
 | 	<tt>rcu_read_lock()</tt> nor the indicated protection | 
 | 	is in place. | 
 | 	In addition, <tt>rcu_dereference_raw()</tt> is used in those | 
 | 	(hopefully rare) cases where the required protection cannot | 
 | 	be easily described. | 
 | 	Finally, <tt>rcu_read_lock_held()</tt> is provided to | 
 | 	allow a function to verify that it has been invoked within | 
 | 	an RCU read-side critical section. | 
 | 	I was made aware of this set of requirements shortly after Thomas | 
 | 	Gleixner audited a number of RCU uses. | 
 | <li>	A given function might wish to check for RCU-related preconditions | 
 | 	upon entry, before using any other RCU API. | 
 | 	The <tt>rcu_lockdep_assert()</tt> macro does this job, | 
 | 	asserting the expression in kernels having lockdep enabled | 
 | 	and doing nothing otherwise. | 
 | <li>	It is also easy to forget to use <tt>rcu_assign_pointer()</tt> | 
 | 	and <tt>rcu_dereference()</tt>, perhaps (incorrectly) | 
 | 	substituting a simple assignment. | 
 | 	To catch this sort of error, a given RCU-protected pointer may be | 
 | 	tagged with <tt>__rcu</tt>, after which sparse | 
 | 	will complain about simple-assignment accesses to that pointer. | 
 | 	Arnd Bergmann made me aware of this requirement, and also | 
 | 	supplied the needed | 
 | 	<a href="https://lwn.net/Articles/376011/">patch series</a>. | 
 | <li>	Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> | 
 | 	will splat if a data element is passed to <tt>call_rcu()</tt> | 
 | 	twice in a row, without a grace period in between. | 
 | 	(This error is similar to a double free.) | 
 | 	The corresponding <tt>rcu_head</tt> structures that are | 
 | 	dynamically allocated are automatically tracked, but | 
 | 	<tt>rcu_head</tt> structures allocated on the stack | 
 | 	must be initialized with <tt>init_rcu_head_on_stack()</tt> | 
 | 	and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. | 
 | 	Similarly, statically allocated non-stack <tt>rcu_head</tt> | 
 | 	structures must be initialized with <tt>init_rcu_head()</tt> | 
 | 	and cleaned up with <tt>destroy_rcu_head()</tt>. | 
 | 	Mathieu Desnoyers made me aware of this requirement, and also | 
 | 	supplied the needed | 
 | 	<a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. | 
 | <li>	An infinite loop in an RCU read-side critical section will | 
 | 	eventually trigger an RCU CPU stall warning splat, with | 
 | 	the duration of “eventually” being controlled by the | 
 | 	<tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, | 
 | 	alternatively, by the | 
 | 	<tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs | 
 | 	parameter. | 
 | 	However, RCU is not obligated to produce this splat | 
 | 	unless there is a grace period waiting on that particular | 
 | 	RCU read-side critical section. | 
 | 	<p> | 
 | 	Some extreme workloads might intentionally delay | 
 | 	RCU grace periods, and systems running those workloads can | 
 | 	be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> | 
 | 	to suppress the splats. | 
 | 	This kernel parameter may also be set via <tt>sysfs</tt>. | 
 | 	Furthermore, RCU CPU stall warnings are counter-productive | 
 | 	during sysrq dumps and during panics. | 
 | 	RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and | 
 | 	<tt>rcu_sysrq_end()</tt> API members to be called before | 
 | 	and after long sysrq dumps. | 
 | 	RCU also supplies the <tt>rcu_panic()</tt> notifier that is | 
 | 	automatically invoked at the beginning of a panic to suppress | 
 | 	further RCU CPU stall warnings. | 
 |  | 
 | 	<p> | 
 | 	This requirement made itself known in the early 1990s, pretty | 
 | 	much the first time that it was necessary to debug a CPU stall. | 
 | 	That said, the initial implementation in DYNIX/ptx was quite | 
 | 	generic in comparison with that of Linux. | 
 | <li>	Although it would be very good to detect pointers leaking out | 
 | 	of RCU read-side critical sections, there is currently no | 
 | 	good way of doing this. | 
 | 	One complication is the need to distinguish between pointers | 
 | 	leaking and pointers that have been handed off from RCU to | 
 | 	some other synchronization mechanism, for example, reference | 
 | 	counting. | 
 | <li>	In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related | 
 | 	information is provided via event tracing. | 
 | <li>	Open-coded use of <tt>rcu_assign_pointer()</tt> and | 
 | 	<tt>rcu_dereference()</tt> to create typical linked | 
 | 	data structures can be surprisingly error-prone. | 
 | 	Therefore, RCU-protected | 
 | 	<a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> | 
 | 	and, more recently, RCU-protected | 
 | 	<a href="https://lwn.net/Articles/612100/">hash tables</a> | 
 | 	are available. | 
 | 	Many other special-purpose RCU-protected data structures are | 
 | 	available in the Linux kernel and the userspace RCU library. | 
 | <li>	Some linked structures are created at compile time, but still | 
 | 	require <tt>__rcu</tt> checking. | 
 | 	The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this | 
 | 	purpose. | 
 | <li>	It is not necessary to use <tt>rcu_assign_pointer()</tt> | 
 | 	when creating linked structures that are to be published via | 
 | 	a single external pointer. | 
 | 	The <tt>RCU_INIT_POINTER()</tt> macro is provided for | 
 | 	this task and also for assigning <tt>NULL</tt> pointers | 
 | 	at runtime. | 
 | </ol> | 
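<p>
The lockdep-based checks in the list above share a common shape: the
caller supplies an expression describing the protection in force, and a
splat is emitted if that expression is false. Below is a toy userspace
model of that shape only, with hypothetical <tt>toy_</tt>-prefixed names
standing in for the lockdep machinery:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the debug checking performed by
 * rcu_dereference_protected() and rcu_dereference_check(): evaluate
 * a caller-supplied "protection held?" expression, and count a splat
 * if it is false.  The kernel instead evaluates a lockdep expression
 * and calls into lockdep to emit the splat. */

static int toy_splat_count;

static void *toy_deref_check(void *p, int protection_held)
{
	if (!protection_held)
		toy_splat_count++;  /* the kernel would emit a splat */
	return p;	/* the access proceeds either way */
}
```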
 |  | 
 | <p> | 
 | This is not a hard-and-fast list:  RCU's diagnostic capabilities will | 
 | continue to be guided by the number and type of usage bugs found | 
 | in real-world RCU usage. | 
 |  | 
 | <h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> | 
 |  | 
 | <p> | 
 | The Linux kernel provides an interesting environment for all kinds of | 
 | software, including RCU. | 
 | Some of the relevant points of interest are as follows: | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Configuration">Configuration</a>. | 
 | <li>	<a href="#Firmware Interface">Firmware Interface</a>. | 
 | <li>	<a href="#Early Boot">Early Boot</a>. | 
 | <li>	<a href="#Interrupts and NMIs"> | 
 | 	Interrupts and non-maskable interrupts (NMIs)</a>. | 
 | <li>	<a href="#Loadable Modules">Loadable Modules</a>. | 
 | <li>	<a href="#Hotplug CPU">Hotplug CPU</a>. | 
 | <li>	<a href="#Scheduler and RCU">Scheduler and RCU</a>. | 
 | <li>	<a href="#Tracing and RCU">Tracing and RCU</a>. | 
 | <li>	<a href="#Energy Efficiency">Energy Efficiency</a>. | 
 | <li>	<a href="#Scheduling-Clock Interrupts and RCU"> | 
 | 	Scheduling-Clock Interrupts and RCU</a>. | 
 | <li>	<a href="#Memory Efficiency">Memory Efficiency</a>. | 
 | <li>	<a href="#Performance, Scalability, Response Time, and Reliability"> | 
 | 	Performance, Scalability, Response Time, and Reliability</a>. | 
 | </ol> | 
 |  | 
 | <p> | 
 | This list is probably incomplete, but it does give a feel for the | 
 | most notable Linux-kernel complications. | 
 | Each of the following sections covers one of the above topics. | 
 |  | 
 | <h3><a name="Configuration">Configuration</a></h3> | 
 |  | 
 | <p> | 
 | RCU's goal is automatic configuration, so that almost nobody | 
 | needs to worry about RCU's <tt>Kconfig</tt> options. | 
 | And for almost all users, RCU does in fact work well | 
 | “out of the box.” | 
 |  | 
 | <p> | 
 | However, there are specialized use cases that are handled by | 
 | kernel boot parameters and <tt>Kconfig</tt> options. | 
 | Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users | 
 | about new <tt>Kconfig</tt> options, which requires that almost all of them | 
 | be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. | 
 |  | 
 | <p> | 
 | This all should be quite obvious, but the fact remains that | 
 | Linus Torvalds recently had to | 
 | <a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> | 
 | me of this requirement. | 
 |  | 
 | <h3><a name="Firmware Interface">Firmware Interface</a></h3> | 
 |  | 
 | <p> | 
 | In many cases, the kernel obtains information about the system from the | 
 | firmware, and sometimes things are lost in translation. | 
 | Or the translation is accurate, but the original message is bogus. | 
 |  | 
 | <p> | 
 | For example, some systems' firmware overreports the number of CPUs, | 
 | sometimes by a large factor. | 
 | If RCU naively believed the firmware, as it used to do, | 
 | it would create too many per-CPU kthreads. | 
 | Although the resulting system will still run correctly, the extra | 
 | kthreads needlessly consume memory and can cause confusion | 
 | when they show up in <tt>ps</tt> listings. | 
 |  | 
 | <p> | 
 | RCU must therefore wait for a given CPU to actually come online before | 
 | it can allow itself to believe that the CPU actually exists. | 
 | The resulting “ghost CPUs” (which are never going to | 
 | come online) cause a number of | 
 | <a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. | 
 |  | 
 | <h3><a name="Early Boot">Early Boot</a></h3> | 
 |  | 
 | <p> | 
 | The Linux kernel's boot sequence is an interesting process, | 
 | and RCU is used early, even before <tt>rcu_init()</tt> | 
 | is invoked. | 
 | In fact, a number of RCU's primitives can be used as soon as the | 
 | initial task's <tt>task_struct</tt> is available and the | 
 | boot CPU's per-CPU variables are set up. | 
 | The read-side primitives (<tt>rcu_read_lock()</tt>, | 
 | <tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, | 
 | and <tt>rcu_access_pointer()</tt>) will operate normally very early on, | 
 | as will <tt>rcu_assign_pointer()</tt>. | 
 |  | 
 | <p> | 
 | Although <tt>call_rcu()</tt> may be invoked at any | 
 | time during boot, callbacks are not guaranteed to be invoked until after | 
 | all of RCU's kthreads have been spawned, which occurs at | 
 | <tt>early_initcall()</tt> time. | 
 | This delay in callback invocation is due to the fact that RCU does not | 
 | invoke callbacks until it is fully initialized, and this full initialization | 
 | cannot occur until after the scheduler has initialized itself to the | 
 | point where RCU can spawn and run its kthreads. | 
 | In theory, it would be possible to invoke callbacks earlier, | 
 | however, this is not a panacea because there would be severe restrictions | 
 | on what operations those callbacks could invoke. | 
 |  | 
 | <p> | 
 | Perhaps surprisingly, <tt>synchronize_rcu()</tt>, | 
 | <a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a> | 
 | (<a href="#Bottom-Half Flavor">discussed below</a>), | 
 | <a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>, | 
 | <tt>synchronize_rcu_expedited()</tt>, | 
 | <tt>synchronize_rcu_bh_expedited()</tt>, and | 
 | <tt>synchronize_sched_expedited()</tt> | 
 | will all operate normally | 
 | during very early boot, the reason being that there is only one CPU | 
 | and preemption is disabled. | 
 | This means that a call to <tt>synchronize_rcu()</tt> (or friends) | 
 | is itself a quiescent | 
 | state and thus a grace period, so the early-boot implementation can | 
 | be a no-op. | 
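<p>
The early-boot shortcut can be sketched as follows; the
<tt>toy_</tt>-prefixed names are hypothetical, and the flag stands in for
the actual single-CPU, preemption-disabled early-boot state:

```c
#include <assert.h>

/* Toy model of the early-boot grace-period shortcut: with one CPU
 * and preemption disabled, merely being able to invoke a synchronous
 * grace-period primitive is itself a quiescent state, so that
 * primitive may simply return. */

static int toy_early_boot = 1;	/* one CPU, preemption disabled */
static int toy_real_gps;	/* grace periods actually driven */

static void toy_early_synchronize_rcu(void)
{
	if (toy_early_boot)
		return;		/* the call itself was the grace period */
	toy_real_gps++;		/* stand-in for real GP machinery */
}
```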
 |  | 
 | <p> | 
 | However, once the scheduler has spawned its first kthread, this early | 
 | boot trick fails for <tt>synchronize_rcu()</tt> (as well as for | 
 | <tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt> | 
 | kernels. | 
 | The reason is that an RCU read-side critical section might be preempted, | 
 | which means that a subsequent <tt>synchronize_rcu()</tt> really does have | 
 | to wait for something, as opposed to simply returning immediately. | 
 | Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of | 
 | its kthreads are spawned, which doesn't happen until some time during | 
 | <tt>early_initcall()</tt> time. | 
 | But this is no excuse:  RCU is nevertheless required to correctly handle | 
 | synchronous grace periods during this time period. | 
 | Once all of its kthreads are up and running, RCU starts running | 
 | normally. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	How can RCU possibly handle grace periods before all of its | 
 | 	kthreads have been spawned??? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Very carefully! | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	During the “dead zone” between the time that the | 
 | 	scheduler spawns the first task and the time that all of RCU's | 
 | 	kthreads have been spawned, all synchronous grace periods are | 
 | 	handled by the expedited grace-period mechanism. | 
 | 	At runtime, this expedited mechanism relies on workqueues, but | 
 | 	during the dead zone the requesting task itself drives the | 
 | 	desired expedited grace period. | 
 | 	Because dead-zone execution takes place within task context, | 
 | 	everything works. | 
 | 	Once the dead zone ends, expedited grace periods go back to | 
 | 	using workqueues, as is required to avoid problems that would | 
 | 	otherwise occur when a user task received a POSIX signal while | 
 | 	driving an expedited grace period. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	And yes, this does mean that it is unhelpful to send POSIX | 
 | 	signals to random tasks between the time that the scheduler | 
 | 	spawns its first kthread and the time that RCU's kthreads | 
 | 	have all been spawned. | 
 | 	If there ever turns out to be a good reason for sending POSIX | 
 | 	signals during that time, appropriate adjustments will be made. | 
 | 	(If it turns out that POSIX signals are sent during this time for | 
 | 	no good reason, other adjustments will be made, appropriate | 
 | 	or otherwise.) | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | I learned of these boot-time requirements as a result of a series of | 
 | system hangs. | 
 |  | 
 | <h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> | 
 |  | 
 | <p> | 
 | The Linux kernel has interrupts, and RCU read-side critical sections are | 
 | legal within interrupt handlers and within interrupt-disabled regions | 
 | of code, as are invocations of <tt>call_rcu()</tt>. | 
 |  | 
 | <p> | 
 | Some Linux-kernel architectures can enter an interrupt handler from | 
 | non-idle process context, and then just never leave it, instead stealthily | 
 | transitioning back to process context. | 
 | This trick is sometimes used to invoke system calls from inside the kernel. | 
 | These “half-interrupts” mean that RCU has to be very careful | 
 | about how it counts interrupt nesting levels. | 
 | I learned of this requirement the hard way during a rewrite | 
 | of RCU's dyntick-idle code. | 
 |  | 
 | <p> | 
 | The Linux kernel has non-maskable interrupts (NMIs), and | 
 | RCU read-side critical sections are legal within NMI handlers. | 
 | Thankfully, RCU update-side primitives, including | 
 | <tt>call_rcu()</tt>, are prohibited within NMI handlers. | 
 |  | 
 | <p> | 
 | The name notwithstanding, some Linux-kernel architectures | 
 | can have nested NMIs, which RCU must handle correctly. | 
 | Andy Lutomirski | 
 | <a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> | 
 | with this requirement; | 
 | he also kindly surprised me with | 
 | <a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> | 
 | that meets this requirement. | 
 |  | 
 | <h3><a name="Loadable Modules">Loadable Modules</a></h3> | 
 |  | 
 | <p> | 
 | The Linux kernel has loadable modules, and these modules can | 
 | also be unloaded. | 
 | After a given module has been unloaded, any attempt to call | 
 | one of its functions results in a segmentation fault. | 
 | The module-unload functions must therefore cancel any | 
 | delayed calls to loadable-module functions, for example, | 
 | any outstanding <tt>mod_timer()</tt> must be dealt with | 
 | via <tt>del_timer_sync()</tt> or similar. | 
 |  | 
 | <p> | 
 | Unfortunately, there is no way to cancel an RCU callback; | 
 | once you invoke <tt>call_rcu()</tt>, the callback function is | 
 | going to eventually be invoked, unless the system goes down first. | 
 | Because it is normally considered socially irresponsible to crash the system | 
 | in response to a module unload request, we need some other way | 
 | to deal with in-flight RCU callbacks. | 
 |  | 
 | <p> | 
 | RCU therefore provides | 
 | <tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, | 
 | which waits until all in-flight RCU callbacks have been invoked. | 
 | If a module uses <tt>call_rcu()</tt>, its exit function should therefore | 
 | prevent any future invocation of <tt>call_rcu()</tt>, then invoke | 
 | <tt>rcu_barrier()</tt>. | 
 | In theory, the underlying module-unload code could invoke | 
 | <tt>rcu_barrier()</tt> unconditionally, but in practice this would | 
 | incur unacceptable latencies. | 
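 | <p> | 
 | For illustration, a module exit path might look like the following | 
 | sketch; the <tt>foo_*</tt> names are hypothetical placeholders, | 
 | not from any actual module: | 

```c
/* Hypothetical module cleanup: all foo_* names are illustrative. */
static void __exit foo_exit(void)
{
	/* First, prevent any future call_rcu() invocations, for
	 * example, by unregistering the hooks that post callbacks. */
	unregister_foo_hooks();

	/* Cancel delayed calls into this module's code. */
	del_timer_sync(&foo_timer);

	/* Wait for all already-posted RCU callbacks to be invoked. */
	rcu_barrier();

	/* Only now is it safe for this module's text and data to vanish. */
}
module_exit(foo_exit);
```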
 |  | 
 | <p> | 
 | Nikita Danilov noted this requirement for an analogous filesystem-unmount | 
 | situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. | 
 | The need for <tt>rcu_barrier()</tt> for module unloading became | 
 | apparent later. | 
 |  | 
 | <p> | 
 | <b>Important note</b>: The <tt>rcu_barrier()</tt> function is not, | 
 | repeat, <i>not</i>, obligated to wait for a grace period. | 
 | It is instead only required to wait for RCU callbacks that have | 
 | already been posted. | 
 | Therefore, if there are no RCU callbacks posted anywhere in the system, | 
 | <tt>rcu_barrier()</tt> is within its rights to return immediately. | 
 | Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not | 
 | necessarily need to wait for a grace period. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Wait a minute! | 
 | 	Each RCU callback must wait for a grace period to complete, | 
 | 	and <tt>rcu_barrier()</tt> must wait for each pre-existing | 
 | 	callback to be invoked. | 
 | 	Doesn't <tt>rcu_barrier()</tt> therefore need to wait for | 
 | 	a full grace period if there is even one callback posted anywhere | 
 | 	in the system? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Absolutely not!!! | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	Yes, each RCU callback must wait for a grace period to complete, | 
 | 	but it might well be partly (or even completely) finished waiting | 
 | 	by the time <tt>rcu_barrier()</tt> is invoked. | 
 | 	In that case, <tt>rcu_barrier()</tt> need only wait for the | 
 | 	remaining portion of the grace period to elapse. | 
 | 	So even if there are quite a few callbacks posted, | 
 | 	<tt>rcu_barrier()</tt> might well return quite quickly. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff"> | 
 | 	So if you need to wait for a grace period as well as for all | 
 | 	pre-existing callbacks, you will need to invoke both | 
 | 	<tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>. | 
 | 	If latency is a concern, you can always use workqueues | 
 | 	to invoke them concurrently. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
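 | <p> | 
 | Putting the answer above into code, waiting both for a full grace | 
 | period and for all pre-existing callbacks is simply a matter of | 
 | invoking both primitives, and the order does not matter for | 
 | correctness: | 

```c
/* Wait for a full grace period... */
synchronize_rcu();
/* ...and for all previously posted callbacks to be invoked. */
rcu_barrier();
```

 | If the combined latency is a concern, the two calls can instead be | 
 | issued from separate workqueue items so that their waits overlap. | 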
 |  | 
 | <h3><a name="Hotplug CPU">Hotplug CPU</a></h3> | 
 |  | 
 | <p> | 
 | The Linux kernel supports CPU hotplug, which means that CPUs | 
 | can come and go. | 
 | It is of course illegal to use any RCU API member from an offline CPU, | 
 | with the exception of <a href="#Sleepable RCU">SRCU</a> read-side | 
 | critical sections. | 
 | This requirement was present from day one in DYNIX/ptx, but | 
 | on the other hand, the Linux kernel's CPU-hotplug implementation | 
 | is “interesting.” | 
 |  | 
 | <p> | 
 | The Linux-kernel CPU-hotplug implementation has notifiers that | 
 | are used to allow the various kernel subsystems (including RCU) | 
 | to respond appropriately to a given CPU-hotplug operation. | 
 | Most RCU operations may be invoked from CPU-hotplug notifiers, | 
 | including even synchronous grace-period operations such as | 
 | <tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>. | 
 |  | 
 | <p> | 
 | However, all-callback-wait operations such as | 
 | <tt>rcu_barrier()</tt> are not supported, because | 
 | there are phases of CPU-hotplug operations in which | 
 | the outgoing CPU's callbacks will not be invoked until after | 
 | the CPU-hotplug operation ends, which could result in deadlock. | 
 | Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations | 
 | during its execution, which results in another type of deadlock | 
 | when invoked from a CPU-hotplug notifier. | 
 |  | 
 | <h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> | 
 |  | 
 | <p> | 
 | RCU depends on the scheduler, and the scheduler uses RCU to | 
 | protect some of its data structures. | 
 | This means the scheduler is forbidden from acquiring | 
 | the runqueue locks and the priority-inheritance locks | 
 | in the middle of an outermost RCU read-side critical section unless either | 
 | (1) it releases them before exiting that same | 
 | RCU read-side critical section, or | 
 | (2) interrupts are disabled across | 
 | that entire RCU read-side critical section. | 
 | This same prohibition also applies (recursively!) to any lock that is acquired | 
 | while holding any lock to which this prohibition applies. | 
 | Adhering to this rule prevents preemptible RCU from invoking | 
 | <tt>rcu_read_unlock_special()</tt> while either runqueue or | 
 | priority-inheritance locks are held, thus avoiding deadlock. | 
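 | <p> | 
 | As a sketch of option (1) above, any scheduler lock acquired within | 
 | an RCU read-side critical section must be released before that | 
 | critical section ends; the <tt>do_something_with()</tt> helper is a | 
 | hypothetical placeholder: | 

```c
rcu_read_lock();
p = rcu_dereference(gp);
raw_spin_lock(&rq->lock);	/* Acquired inside the critical section... */
do_something_with(p);
raw_spin_unlock(&rq->lock);	/* ...so it must be released before... */
rcu_read_unlock();		/* ...the matching rcu_read_unlock(). */
```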
 |  | 
 | <p> | 
 | Prior to v4.4, it was only necessary to disable preemption across | 
 | RCU read-side critical sections that acquired scheduler locks. | 
 | In v4.4, expedited grace periods started using IPIs, and these | 
 | IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath. | 
 | Therefore, this expedited-grace-period change required disabling of | 
 | interrupts, not just preemption. | 
 |  | 
 | <p> | 
 | For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt> | 
 | implementation must be written carefully to avoid similar deadlocks. | 
 | In particular, <tt>rcu_read_unlock()</tt> must tolerate an | 
 | interrupt where the interrupt handler invokes both | 
 | <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. | 
 | This possibility requires <tt>rcu_read_unlock()</tt> to use | 
 | negative nesting levels to avoid destructive recursion via | 
 | the interrupt handler's use of RCU. | 
 |  | 
 | <p> | 
 | This pair of mutual scheduler-RCU requirements came as a | 
 | <a href="https://lwn.net/Articles/453002/">complete surprise</a>. | 
 |  | 
 | <p> | 
 | As noted above, RCU makes use of kthreads, and it is necessary to | 
 | avoid excessive CPU-time accumulation by these kthreads. | 
 | This requirement was no surprise, but RCU's violation of it | 
 | when running context-switch-heavy workloads when built with | 
 | <tt>CONFIG_NO_HZ_FULL=y</tt> | 
 | <a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. | 
 | RCU has made good progress towards meeting this requirement, even | 
 | for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, | 
 | but there is room for further improvement. | 
 |  | 
 | <h3><a name="Tracing and RCU">Tracing and RCU</a></h3> | 
 |  | 
 | <p> | 
 | It is possible to use tracing on RCU code, but tracing itself | 
 | uses RCU. | 
 | For this reason, <tt>rcu_dereference_raw_notrace()</tt> | 
 | is provided for use by tracing, which avoids the destructive | 
 | recursion that could otherwise ensue. | 
 | This API is also used by virtualization in some architectures, | 
 | where RCU readers execute in environments in which tracing | 
 | cannot be used. | 
 | The tracing folks both located the requirement and provided the | 
 | needed fix, so this surprise requirement was relatively painless. | 
 |  | 
 | <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> | 
 |  | 
 | <p> | 
 | Interrupting idle CPUs is considered socially unacceptable, | 
 | especially by people with battery-powered embedded systems. | 
 | RCU therefore conserves energy by detecting which CPUs are | 
 | idle, including tracking CPUs that have been interrupted from idle. | 
 | This is a large part of the energy-efficiency requirement, | 
 | and I learned of it via an irate phone call. | 
 |  | 
 | <p> | 
 | Because RCU avoids interrupting idle CPUs, it is illegal to | 
 | execute an RCU read-side critical section on an idle CPU. | 
 | (Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat | 
 | if you try it.) | 
 | The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt> | 
 | event tracing are provided to work around this restriction. | 
 | In addition, <tt>rcu_is_watching()</tt> may be used to | 
 | test whether or not it is currently legal to run RCU read-side | 
 | critical sections on this CPU. | 
 | I learned of the need for diagnostics on the one hand | 
 | and <tt>RCU_NONIDLE()</tt> on the other while inspecting | 
 | idle-loop code. | 
 | Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, | 
 | which is used quite heavily in the idle loop. | 
 | However, there are some restrictions on the code placed within | 
 | <tt>RCU_NONIDLE()</tt>: | 
 |  | 
 | <ol> | 
 | <li>	Blocking is prohibited. | 
 | 	In practice, this is not a serious restriction given that idle | 
 | 	tasks are prohibited from blocking to begin with. | 
 | <li>	Although <tt>RCU_NONIDLE()</tt> invocations may nest, they cannot | 
 | 	nest indefinitely deeply. | 
 | 	However, given that they can be nested on the order of a million | 
 | 	deep, even on 32-bit systems, this should not be a serious | 
 | 	restriction. | 
 | 	This nesting limit would probably be reached long after the | 
 | 	compiler OOMed or the stack overflowed. | 
 | <li>	Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence | 
 | 	out of that same <tt>RCU_NONIDLE()</tt>. | 
 | 	For example, the following is grossly illegal: | 
 |  | 
 | 	<blockquote> | 
 | 	<pre> | 
 |  1     RCU_NONIDLE({ | 
 |  2       do_something(); | 
 |  3       goto bad_idea;  /* BUG!!! */ | 
 |  4       do_something_else();}); | 
 |  5   bad_idea: | 
 | 	</pre> | 
 | 	</blockquote> | 
 |  | 
 | 	<p> | 
 | 	It is just as illegal to transfer control into the middle of | 
 | 	<tt>RCU_NONIDLE()</tt>'s argument. | 
 | 	Yes, in theory, you could transfer in as long as you also | 
 | 	transferred out, but in practice you could also expect to get sharply | 
 | 	worded review comments. | 
 | </ol> | 
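 | <p> | 
 | For contrast with the grossly illegal example above, a legal use of | 
 | <tt>RCU_NONIDLE()</tt> simply wraps straight-line code that enters | 
 | and exits normally: | 

```c
/* Legal: control flow enters and leaves RCU_NONIDLE() normally. */
RCU_NONIDLE({
	do_something();		/* May contain RCU readers, despite the */
	do_something_else();	/* CPU otherwise appearing idle to RCU. */
});
```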
 |  | 
 | <p> | 
 | It is similarly socially unacceptable to interrupt an | 
 | <tt>nohz_full</tt> CPU running in userspace. | 
 | RCU must therefore track <tt>nohz_full</tt> userspace | 
 | execution. | 
 | In particular, RCU must be able to sample state at two points in | 
 | time, and be able to determine whether or not some other CPU spent | 
 | any time idle and/or executing in userspace. | 
 |  | 
 | <p> | 
 | These energy-efficiency requirements have proven quite difficult to | 
 | understand and to meet; for example, there have been more than five | 
 | clean-sheet rewrites of RCU's energy-efficiency code, the last of | 
 | which was finally able to demonstrate | 
 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. | 
 | As noted earlier, | 
 | I learned of many of these requirements via angry phone calls: | 
 | flaming me on the Linux-kernel mailing list was apparently not | 
 | sufficient for the callers to fully vent their ire at RCU's | 
 | energy-efficiency bugs! | 
 |  | 
 | <h3><a name="Scheduling-Clock Interrupts and RCU"> | 
 | Scheduling-Clock Interrupts and RCU</a></h3> | 
 |  | 
 | <p> | 
 | The kernel transitions between in-kernel non-idle execution, userspace | 
 | execution, and the idle loop. | 
 | Depending on kernel configuration, RCU handles these states differently: | 
 |  | 
 | <table border=3> | 
 | <tr><th><tt>HZ</tt> Kconfig</th> | 
 | 	<th>In-Kernel</th> | 
 | 		<th>Usermode</th> | 
 | 			<th>Idle</th></tr> | 
 | <tr><th align="left"><tt>HZ_PERIODIC</tt></th> | 
 | 	<td>Can rely on scheduling-clock interrupt.</td> | 
 | 		<td>Can rely on scheduling-clock interrupt and its | 
 | 		    detection of interrupt from usermode.</td> | 
 | 			<td>Can rely on RCU's dyntick-idle detection.</td></tr> | 
 | <tr><th align="left"><tt>NO_HZ_IDLE</tt></th> | 
 | 	<td>Can rely on scheduling-clock interrupt.</td> | 
 | 		<td>Can rely on scheduling-clock interrupt and its | 
 | 		    detection of interrupt from usermode.</td> | 
 | 			<td>Can rely on RCU's dyntick-idle detection.</td></tr> | 
 | <tr><th align="left"><tt>NO_HZ_FULL</tt></th> | 
 | 	<td>Can only sometimes rely on scheduling-clock interrupt. | 
 | 	    In other cases, it is necessary to bound kernel execution | 
 | 	    times and/or use IPIs.</td> | 
 | 		<td>Can rely on RCU's dyntick-idle detection.</td> | 
 | 			<td>Can rely on RCU's dyntick-idle detection.</td></tr> | 
 | </table> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the | 
 | 	scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> | 
 | 	and <tt>NO_HZ_IDLE</tt> do? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Because, as a performance optimization, <tt>NO_HZ_FULL</tt> | 
 | 	does not necessarily re-enable the scheduling-clock interrupt | 
 | 	on entry to each and every system call. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | However, RCU must be reliably informed as to whether any given | 
 | CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, | 
 | also whether that CPU is executing in usermode, as discussed | 
 | <a href="#Energy Efficiency">earlier</a>. | 
 | It also requires that the scheduling-clock interrupt be enabled when | 
 | RCU needs it to be: | 
 |  | 
 | <ol> | 
 | <li>	If a CPU is either idle or executing in usermode, and RCU believes | 
 | 	it is non-idle, the scheduling-clock tick had better be running. | 
 | 	Otherwise, you will get RCU CPU stall warnings, or, at best, | 
 | 	very long (11-second) grace periods, with a pointless IPI waking | 
 | 	the CPU from time to time. | 
 | <li>	If a CPU is in a portion of the kernel that executes RCU read-side | 
 | 	critical sections, and RCU believes this CPU to be idle, you will get | 
 | 	random memory corruption.  <b>DON'T DO THIS!!!</b> | 
 |  | 
 | 	<br>This is one reason to test with lockdep, which will complain | 
 | 	about this sort of thing. | 
 | <li>	If a CPU is in a portion of the kernel that is absolutely | 
 | 	positively no-joking guaranteed to never execute any RCU read-side | 
 | 	critical sections, and RCU believes this CPU to be idle, | 
 | 	no problem.  This sort of thing is used by some architectures | 
 | 	for light-weight exception handlers, which can then avoid the | 
 | 	overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> | 
 | 	at exception entry and exit, respectively. | 
 | 	Some go further and avoid the entireties of <tt>irq_enter()</tt> | 
 | 	and <tt>irq_exit()</tt>. | 
 |  | 
 | 	<br>Just make very sure you are running some of your tests with | 
 | 	<tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths | 
 | 	was in fact joking about not doing RCU read-side critical sections. | 
 | <li>	If a CPU is executing in the kernel with the scheduling-clock | 
 | 	interrupt disabled and RCU believes this CPU to be non-idle, | 
 | 	and if the CPU goes idle (from an RCU perspective) every few | 
 | 	jiffies, no problem.  It is usually OK for there to be the | 
 | 	occasional gap between idle periods of up to a second or so. | 
 |  | 
 | 	<br>If the gap grows too long, you get RCU CPU stall warnings. | 
 | <li>	If a CPU is either idle or executing in usermode, and RCU believes | 
 | 	it to be idle, of course no problem. | 
 | <li>	If a CPU is executing in the kernel, the kernel code | 
 | 	path is passing through quiescent states at a reasonable | 
 | 	frequency (preferably about once per few jiffies, but the | 
 | 	occasional excursion to a second or so is usually OK) and the | 
 | 	scheduling-clock interrupt is enabled, of course no problem. | 
 |  | 
 | 	<br>If the gap between a successive pair of quiescent states grows | 
 | 	too long, you get RCU CPU stall warnings. | 
 | </ol> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	But what if my driver has a hardware interrupt handler | 
 | 	that can run for many seconds? | 
 | 	I cannot invoke <tt>schedule()</tt> from a hardware | 
 | 	interrupt handler, after all! | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> | 
 | 	every so often. | 
 | 	But given that long-running interrupt handlers can cause | 
 | 	other problems, not least for response time, shouldn't you | 
 | 	work to keep your interrupt handler's runtime within reasonable | 
 | 	bounds? | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
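 | <p> | 
 | The quiz answer's suggestion might be sketched as follows, although | 
 | bounding the handler's runtime remains the better fix; the | 
 | <tt>foo_*</tt> helpers are hypothetical: | 

```c
/* Hypothetical long-running interrupt handler. */
static irqreturn_t foo_irq_handler(int irq, void *dev)
{
	while (foo_more_work(dev)) {
		foo_do_bounded_chunk(dev);
		rcu_irq_exit();		/* Momentarily appear idle to RCU... */
		rcu_irq_enter();	/* ...providing a quiescent state. */
	}
	return IRQ_HANDLED;
}
```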
 |  | 
 | <p> | 
 | But as long as RCU is properly informed of kernel state transitions between | 
 | in-kernel execution, usermode execution, and idle, and as long as the | 
 | scheduling-clock interrupt is enabled when RCU needs it to be, you | 
 | can rest assured that the bugs you encounter will be in some other | 
 | part of RCU or some other part of the kernel! | 
 |  | 
 | <h3><a name="Memory Efficiency">Memory Efficiency</a></h3> | 
 |  | 
 | <p> | 
 | Although small-memory non-realtime systems can simply use Tiny RCU, | 
 | code size is only one aspect of memory efficiency. | 
 | Another aspect is the size of the <tt>rcu_head</tt> structure | 
 | used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. | 
 | Although this structure contains nothing more than a pair of pointers, | 
 | it does appear in many RCU-protected data structures, including | 
 | some that are size critical. | 
 | The <tt>page</tt> structure is a case in point, as evidenced by | 
 | the many occurrences of the <tt>union</tt> keyword within that structure. | 
 |  | 
 | <p> | 
 | This need for memory efficiency is one reason that RCU uses hand-crafted | 
 | singly linked lists to track the <tt>rcu_head</tt> structures that | 
 | are waiting for a grace period to elapse. | 
 | It is also the reason why <tt>rcu_head</tt> structures do not contain | 
 | debug information, such as fields tracking the file and line of the | 
 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. | 
 | Although this information might appear in debug-only kernel builds at some | 
 | point, in the meantime, the <tt>->func</tt> field will often provide | 
 | the needed debug information. | 
 |  | 
 | <p> | 
 | However, in some cases, the need for memory efficiency leads to even | 
 | more extreme measures. | 
 | Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field | 
 | shares storage with a great many other structures that are used at | 
 | various points in the corresponding page's lifetime. | 
 | In order to correctly resolve certain | 
 | <a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, | 
 | the Linux kernel's memory-management subsystem needs a particular bit | 
 | to remain zero during all phases of grace-period processing, | 
 | and that bit happens to map to the bottom bit of the | 
 | <tt>rcu_head</tt> structure's <tt>->next</tt> field. | 
 | RCU makes this guarantee as long as <tt>call_rcu()</tt> | 
 | is used to post the callback, as opposed to <tt>kfree_rcu()</tt> | 
 | or some future “lazy” | 
 | variant of <tt>call_rcu()</tt> that might one day be created for | 
 | energy-efficiency purposes. | 
 |  | 
 | <p> | 
 | That said, there are limits. | 
 | RCU requires that the <tt>rcu_head</tt> structure be aligned to a | 
 | two-byte boundary, and passing a misaligned <tt>rcu_head</tt> | 
 | structure to one of the <tt>call_rcu()</tt> family of functions | 
 | will result in a splat. | 
 | It is therefore necessary to exercise caution when packing | 
 | structures containing fields of type <tt>rcu_head</tt>. | 
 | Why not a four-byte or even eight-byte alignment requirement? | 
 | Because the m68k architecture provides only two-byte alignment, | 
 | and thus acts as alignment's least common denominator. | 
 |  | 
 | <p> | 
 | The reason for reserving the bottom bit of pointers to | 
 | <tt>rcu_head</tt> structures is to leave the door open to | 
 | “lazy” callbacks whose invocations can safely be deferred. | 
 | Deferring invocation could potentially have energy-efficiency | 
 | benefits, but only if the rate of non-lazy callbacks decreases | 
 | significantly for some important workload. | 
 | In the meantime, reserving the bottom bit keeps this option open | 
 | in case it one day becomes useful. | 
 |  | 
 | <h3><a name="Performance, Scalability, Response Time, and Reliability"> | 
 | Performance, Scalability, Response Time, and Reliability</a></h3> | 
 |  | 
 | <p> | 
 | Expanding on the | 
 | <a href="#Performance and Scalability">earlier discussion</a>, | 
 | RCU is used heavily on hot code paths in performance-critical | 
 | portions of the Linux kernel's networking, security, virtualization, | 
 | and scheduling subsystems. | 
 | RCU must therefore use efficient implementations, especially in its | 
 | read-side primitives. | 
 | To that end, it would be good if preemptible RCU's implementation | 
 | of <tt>rcu_read_lock()</tt> could be inlined; however, doing | 
 | so requires resolving <tt>#include</tt> issues with the | 
 | <tt>task_struct</tt> structure. | 
 |  | 
 | <p> | 
 | The Linux kernel supports hardware configurations with up to | 
 | 4096 CPUs, which means that RCU must be extremely scalable. | 
 | Algorithms that involve frequent acquisitions of global locks or | 
 | frequent atomic operations on global variables simply cannot be | 
 | tolerated within the RCU implementation. | 
 | RCU therefore makes heavy use of a combining tree based on the | 
 | <tt>rcu_node</tt> structure. | 
 | RCU is required to tolerate all CPUs continuously invoking any | 
 | combination of RCU's runtime primitives with minimal per-operation | 
 | overhead. | 
 | In fact, in many cases, increasing load must <i>decrease</i> the | 
 | per-operation overhead; witness the batching optimizations for | 
 | <tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, | 
 | <tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. | 
 | As a general rule, RCU must cheerfully accept whatever the | 
 | rest of the Linux kernel decides to throw at it. | 
 |  | 
 | <p> | 
 | The Linux kernel is used for real-time workloads, especially | 
 | in conjunction with the | 
 | <a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. | 
 | The real-time-latency response requirements are such that the | 
 | traditional approach of disabling preemption across RCU | 
 | read-side critical sections is inappropriate. | 
 | Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore | 
 | use an RCU implementation that allows RCU read-side critical | 
 | sections to be preempted. | 
 | This requirement made its presence known after users made it | 
 | clear that an earlier | 
 | <a href="https://lwn.net/Articles/107930/">real-time patch</a> | 
 | did not meet their needs, in conjunction with some | 
 | <a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> | 
 | encountered by a very early version of the -rt patchset. | 
 |  | 
 | <p> | 
 | In addition, RCU must make do with a sub-100-microsecond real-time latency | 
 | budget. | 
 | In fact, on smaller systems with the -rt patchset, the Linux kernel | 
 | provides sub-20-microsecond real-time latencies for the whole kernel, | 
 | including RCU. | 
 | RCU's scalability and latency must therefore be sufficient for | 
 | these sorts of configurations. | 
 | To my surprise, the sub-100-microsecond real-time latency budget | 
 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> | 
 | applies to even the largest systems [PDF]</a>, | 
 | up to and including systems with 4096 CPUs. | 
 | This real-time requirement motivated the grace-period kthread, which | 
 | also simplified handling of a number of race conditions. | 
 |  | 
 | <p> | 
 | RCU must avoid degrading real-time response for CPU-bound threads, whether | 
 | executing in usermode (which is one use case for | 
 | <tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel. | 
 | That said, CPU-bound loops in the kernel must execute | 
 | <tt>cond_resched_rcu_qs()</tt> at least once per few tens of milliseconds | 
 | in order to avoid receiving an IPI from RCU. | 
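 | <p> | 
 | A CPU-bound loop in the kernel would therefore be structured along | 
 | these lines; the <tt>foo_*</tt> helpers are hypothetical: | 

```c
/* Hypothetical CPU-bound kernel loop. */
while (foo_more_work()) {
	foo_process_one_item();
	cond_resched_rcu_qs();	/* Report a quiescent state (and possibly
				 * reschedule), heading off RCU's IPI. */
}
```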
 |  | 
 | <p> | 
 | Finally, RCU's status as a synchronization primitive means that | 
 | any RCU failure can result in arbitrary memory corruption that can be | 
 | extremely difficult to debug. | 
 | This means that RCU must be extremely reliable, which in | 
 | practice also means that RCU must have an aggressive stress-test | 
 | suite. | 
 | This stress-test suite is called <tt>rcutorture</tt>. | 
 |  | 
 | <p> | 
 | Although the need for <tt>rcutorture</tt> was no surprise, | 
 | the current immense popularity of the Linux kernel is posing | 
 | interesting—and perhaps unprecedented—validation | 
 | challenges. | 
 | To see this, keep in mind that there are well over one billion | 
 | instances of the Linux kernel running today, given Android | 
 | smartphones, Linux-powered televisions, and servers. | 
 | This number can be expected to increase sharply with the advent of | 
 | the celebrated Internet of Things. | 
 |  | 
 | <p> | 
 | Suppose that RCU contains a race condition that manifests on average | 
 | once per million years of runtime. | 
 | This bug will be occurring about three times per <i>day</i> across | 
 | the installed base. | 
 | RCU could simply hide behind hardware error rates, given that no one | 
 | should really expect their smartphone to last for a million years. | 
 | However, anyone taking too much comfort from this thought should | 
 | consider the fact that in most jurisdictions, a successful multi-year | 
 | test of a given mechanism, which might include a Linux kernel, | 
 | suffices for a number of types of safety-critical certifications. | 
 | In fact, rumor has it that the Linux kernel is already being used | 
 | in production for safety-critical applications. | 
 | I don't know about you, but I would feel quite bad if a bug in RCU | 
 | killed someone. | 
 | Which might explain my recent focus on validation and verification. | 
 |  | 
 | <h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> | 
 |  | 
 | <p> | 
 | One of the more surprising things about RCU is that there are now | 
 | no fewer than five <i>flavors</i>, or API families. | 
 | In addition, the primary flavor that has been the sole focus up to | 
 | this point has two different implementations, non-preemptible and | 
 | preemptible. | 
 | The other four flavors are listed below, with requirements for each | 
 | described in a separate section. | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Bottom-Half Flavor">Bottom-Half Flavor</a> | 
 | <li>	<a href="#Sched Flavor">Sched Flavor</a> | 
 | <li>	<a href="#Sleepable RCU">Sleepable RCU</a> | 
 | <li>	<a href="#Tasks RCU">Tasks RCU</a> | 
 | <li>	<a href="#Waiting for Multiple Grace Periods"> | 
 | 	Waiting for Multiple Grace Periods</a> | 
 | </ol> | 
 |  | 
 | <h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3> | 
 |  | 
 | <p> | 
 | The softirq-disable (AKA “bottom-half”, | 
 | hence the “_bh” abbreviations) | 
 | flavor of RCU, or <i>RCU-bh</i>, was developed by | 
 | Dipankar Sarma to provide a flavor of RCU that could withstand the | 
 | network-based denial-of-service attacks researched by Robert | 
 | Olsson. | 
 | These attacks placed so much networking load on the system | 
 | that some of the CPUs never exited softirq execution, | 
 | which in turn prevented those CPUs from ever executing a context switch, | 
 | which, in the RCU implementation of that time, prevented grace periods | 
 | from ever ending. | 
 | The result was an out-of-memory condition and a system hang. | 
 |  | 
 | <p> | 
 | The solution was the creation of RCU-bh, which does | 
 | <tt>local_bh_disable()</tt> | 
 | across its read-side critical sections, and which uses the transition | 
 | from one type of softirq processing to another as a quiescent state | 
 | in addition to context switch, idle, user mode, and offline. | 
 | This means that RCU-bh grace periods can complete even when some of | 
 | the CPUs execute in softirq indefinitely, thus allowing algorithms | 
 | based on RCU-bh to withstand network-based denial-of-service attacks. | 
 |  | 
 | <p> | 
 | Because | 
 | <tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> | 
 | disable and re-enable softirq handlers, any attempt to start a softirq | 
 | handler during an | 
 | RCU-bh read-side critical section will be deferred. | 
 | In this case, <tt>rcu_read_unlock_bh()</tt> | 
 | will invoke softirq processing, which can take considerable time. | 
 | One can of course argue that this softirq overhead should be associated | 
 | with the code following the RCU-bh read-side critical section rather | 
 | than <tt>rcu_read_unlock_bh()</tt>, but the fact | 
 | is that most profiling tools cannot be expected to make this sort | 
 | of fine distinction. | 
 | For example, suppose that a three-millisecond-long RCU-bh read-side | 
 | critical section executes during a time of heavy networking load. | 
 | There will very likely be an attempt to invoke at least one softirq | 
 | handler during that three milliseconds, but any such invocation will | 
 | be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. | 
 | This can of course make it appear at first glance as if | 
 | <tt>rcu_read_unlock_bh()</tt> was executing very slowly. | 
 |  | 
 | <p> | 
 | The | 
 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> | 
 | includes | 
 | <tt>rcu_read_lock_bh()</tt>, | 
 | <tt>rcu_read_unlock_bh()</tt>, | 
 | <tt>rcu_dereference_bh()</tt>, | 
 | <tt>rcu_dereference_bh_check()</tt>, | 
 | <tt>synchronize_rcu_bh()</tt>, | 
 | <tt>synchronize_rcu_bh_expedited()</tt>, | 
 | <tt>call_rcu_bh()</tt>, | 
 | <tt>rcu_barrier_bh()</tt>, and | 
 | <tt>rcu_read_lock_bh_held()</tt>. | 
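 |  | 
 | <p> | 
 | For example, a reader traversing a linked list from softirq context | 
 | might look as follows (the list head and the | 
 | <tt>do_something_with()</tt> function are of course hypothetical): | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 rcu_read_lock_bh(); | 
 |  2 list_for_each_entry_rcu(p, &head, list) | 
 |  3   do_something_with(p); | 
 |  4 rcu_read_unlock_bh(); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | A matching updater would remove elements with | 
 | <tt>list_del_rcu()</tt> and defer freeing via | 
 | <tt>call_rcu_bh()</tt> or <tt>synchronize_rcu_bh()</tt>. | 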
 |  | 
 | <h3><a name="Sched Flavor">Sched Flavor</a></h3> | 
 |  | 
 | <p> | 
 | Before preemptible RCU, waiting for an RCU grace period had the | 
 | side effect of also waiting for all pre-existing interrupt | 
 | and NMI handlers. | 
 | However, there are legitimate preemptible-RCU implementations that | 
 | do not have this property, given that any point in the code outside | 
 | of an RCU read-side critical section can be a quiescent state. | 
 | Therefore, <i>RCU-sched</i> was created, which follows “classic” | 
 | RCU in that an RCU-sched grace period waits for pre-existing | 
 | interrupt and NMI handlers. | 
 | In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched | 
 | APIs have identical implementations, while kernels built with | 
 | <tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. | 
 |  | 
 | <p> | 
 | Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, | 
 | <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> | 
 | disable and re-enable preemption, respectively. | 
 | This means that if there was a preemption attempt during the | 
 | RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> | 
 | will enter the scheduler, with all the latency and overhead entailed. | 
 | Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look | 
 | as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. | 
 | However, the highest-priority task won't be preempted, so that task | 
 | will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. | 
 |  | 
 | <p> | 
 | The | 
 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> | 
 | includes | 
 | <tt>rcu_read_lock_sched()</tt>, | 
 | <tt>rcu_read_unlock_sched()</tt>, | 
 | <tt>rcu_read_lock_sched_notrace()</tt>, | 
 | <tt>rcu_read_unlock_sched_notrace()</tt>, | 
 | <tt>rcu_dereference_sched()</tt>, | 
 | <tt>rcu_dereference_sched_check()</tt>, | 
 | <tt>synchronize_sched()</tt>, | 
 | <tt>synchronize_rcu_sched_expedited()</tt>, | 
 | <tt>call_rcu_sched()</tt>, | 
 | <tt>rcu_barrier_sched()</tt>, and | 
 | <tt>rcu_read_lock_sched_held()</tt>. | 
 | However, anything that disables preemption also marks an RCU-sched | 
 | read-side critical section, including | 
 | <tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, | 
 | <tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, | 
 | and so on. | 
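 |  | 
 | <p> | 
 | As a sketch, an RCU-sched reader can therefore be as simple as a | 
 | preemption-disabled region (here <tt>gp</tt> and | 
 | <tt>do_something_with()</tt> are hypothetical): | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 preempt_disable(); | 
 |  2 p = rcu_dereference_sched(gp); | 
 |  3 do_something_with(p); | 
 |  4 preempt_enable(); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | The corresponding updater would invoke <tt>synchronize_sched()</tt> | 
 | (or <tt>call_rcu_sched()</tt>) before freeing the structure formerly | 
 | referenced by <tt>gp</tt>. | 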
 |  | 
 | <h3><a name="Sleepable RCU">Sleepable RCU</a></h3> | 
 |  | 
 | <p> | 
 | For well over a decade, someone saying “I need to block within | 
 | an RCU read-side critical section” was a reliable indication | 
 | that this someone did not understand RCU. | 
 | After all, if you are always blocking in an RCU read-side critical | 
 | section, you can probably afford to use a higher-overhead synchronization | 
 | mechanism. | 
 | However, that changed with the advent of the Linux kernel's notifiers, | 
 | whose RCU read-side critical | 
 | sections almost never sleep, but sometimes need to. | 
 | This resulted in the introduction of | 
 | <a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, | 
 | or <i>SRCU</i>. | 
 |  | 
 | <p> | 
 | SRCU allows different domains to be defined, with each such domain | 
 | defined by an instance of an <tt>srcu_struct</tt> structure. | 
 | A pointer to this structure must be passed in to each SRCU function, | 
 | for example, <tt>synchronize_srcu(&ss)</tt>, where | 
 | <tt>ss</tt> is the <tt>srcu_struct</tt> structure. | 
 | The key benefit of these domains is that a slow SRCU reader in one | 
 | domain does not delay an SRCU grace period in some other domain. | 
 | That said, one consequence of these domains is that read-side code | 
 | must pass a “cookie” from <tt>srcu_read_lock()</tt> | 
 | to <tt>srcu_read_unlock()</tt>, for example, as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 int idx; | 
 |  2 | 
 |  3 idx = srcu_read_lock(&ss); | 
 |  4 do_something(); | 
 |  5 srcu_read_unlock(&ss, idx); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | As noted above, it is legal to block within SRCU read-side critical sections; | 
 | however, with great power comes great responsibility. | 
 | If you block forever in one of a given domain's SRCU read-side critical | 
 | sections, then that domain's grace periods will also be blocked forever. | 
 | Of course, one good way to block forever is to deadlock, which can | 
 | happen if any operation in a given domain's SRCU read-side critical | 
 | section can block waiting, either directly or indirectly, for that domain's | 
 | grace period to elapse. | 
 | For example, this results in a self-deadlock: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 int idx; | 
 |  2 | 
 |  3 idx = srcu_read_lock(&ss); | 
 |  4 do_something(); | 
 |  5 synchronize_srcu(&ss); | 
 |  6 srcu_read_unlock(&ss, idx); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | However, if line 5 acquired a mutex that was held across | 
 | a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, | 
 | deadlock would still be possible. | 
 | Furthermore, if line 5 acquired a mutex that was held across | 
 | a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, | 
 | and if an <tt>ss1</tt>-domain SRCU read-side critical section | 
 | acquired another mutex that was held across an <tt>ss</tt>-domain | 
 | <tt>synchronize_srcu()</tt>, | 
 | deadlock would again be possible. | 
 | Such a deadlock cycle could extend across an arbitrarily large number | 
 | of different SRCU domains. | 
 | Again, with great power comes great responsibility. | 
 |  | 
 | <p> | 
 | Unlike the other RCU flavors, SRCU read-side critical sections can | 
 | run on idle and even offline CPUs. | 
 | This ability requires that <tt>srcu_read_lock()</tt> and | 
 | <tt>srcu_read_unlock()</tt> contain memory barriers, which means | 
 | that SRCU readers will run a bit slower than would RCU readers. | 
 | It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> | 
 | API, which, in combination with <tt>srcu_read_unlock()</tt>, | 
 | guarantees a full memory barrier. | 
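 |  | 
 | <p> | 
 | For example, code that needs full ordering against accesses following | 
 | its critical section can place this API immediately after | 
 | <tt>srcu_read_unlock()</tt>, rather than paying for a redundant | 
 | <tt>smp_mb()</tt> (the <tt>do_something*()</tt> functions are | 
 | hypothetical): | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 idx = srcu_read_lock(&ss); | 
 |  2 do_something(); | 
 |  3 srcu_read_unlock(&ss, idx); | 
 |  4 smp_mb__after_srcu_read_unlock(); /* Full memory barrier. */ | 
 |  5 do_something_else(); | 
 | </pre> | 
 | </blockquote> | 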
 |  | 
 | <p> | 
 | Also unlike other RCU flavors, SRCU's callbacks-wait function | 
 | <tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers, | 
 | though this is not necessarily a good idea. | 
 | The reason that this is possible is that SRCU is insensitive | 
 | to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt> | 
 | need not exclude CPU-hotplug operations. | 
 |  | 
 | <p> | 
 | SRCU also differs from other RCU flavors in that SRCU's expedited and | 
 | non-expedited grace periods are implemented by the same mechanism. | 
 | This means that in the current SRCU implementation, expediting a | 
 | future grace period has the side effect of expediting all prior | 
 | grace periods that have not yet completed. | 
 | (But please note that this is a property of the current implementation, | 
 | not necessarily of future implementations.) | 
 | In addition, if SRCU has been idle for longer than the interval | 
 | specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter | 
 | (25 microseconds by default), | 
 | and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, | 
 | that invocation will be automatically expedited. | 
 |  | 
 | <p> | 
 | As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating | 
 | a locking bottleneck present in prior kernel versions. | 
 | Although this will allow users to put much heavier stress on | 
 | <tt>call_srcu()</tt>, it is important to note that SRCU does not | 
 | yet take any special steps to deal with callback flooding. | 
 | So if you are posting (say) 10,000 SRCU callbacks per second per CPU, | 
 | you are probably totally OK, but if you intend to post (say) 1,000,000 | 
 | SRCU callbacks per second per CPU, please run some tests first. | 
 | SRCU just might need a few adjustments to deal with that sort of load. | 
 | Of course, your mileage may vary based on the speed of your CPUs and | 
 | the size of your memory. | 
 |  | 
 | <p> | 
 | The | 
 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> | 
 | includes | 
 | <tt>srcu_read_lock()</tt>, | 
 | <tt>srcu_read_unlock()</tt>, | 
 | <tt>srcu_dereference()</tt>, | 
 | <tt>srcu_dereference_check()</tt>, | 
 | <tt>synchronize_srcu()</tt>, | 
 | <tt>synchronize_srcu_expedited()</tt>, | 
 | <tt>call_srcu()</tt>, | 
 | <tt>srcu_barrier()</tt>, and | 
 | <tt>srcu_read_lock_held()</tt>. | 
 | It also includes | 
 | <tt>DEFINE_SRCU()</tt>, | 
 | <tt>DEFINE_STATIC_SRCU()</tt>, and | 
 | <tt>init_srcu_struct()</tt> | 
 | APIs for defining and initializing <tt>srcu_struct</tt> structures. | 
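 |  | 
 | <p> | 
 | Putting these pieces together, a sketch of an SRCU-protected removal | 
 | might define its domain and post a callback as follows | 
 | (<tt>my_srcu</tt>, <tt>struct foo</tt>, and the callback are | 
 | hypothetical): | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 DEFINE_SRCU(my_srcu); | 
 |  2 | 
 |  3 static void free_foo_cb(struct rcu_head *rhp) | 
 |  4 { | 
 |  5   kfree(container_of(rhp, struct foo, rh)); | 
 |  6 } | 
 |  7 | 
 |  8 void remove_foo(struct foo *p) | 
 |  9 { | 
 | 10   list_del_rcu(&p->list); | 
 | 11   call_srcu(&my_srcu, &p->rh, free_foo_cb); | 
 | 12 } | 
 | </pre> | 
 | </blockquote> | 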
 |  | 
 | <h3><a name="Tasks RCU">Tasks RCU</a></h3> | 
 |  | 
 | <p> | 
 | Some forms of tracing use “trampolines” to handle the | 
 | binary rewriting required to install different types of probes. | 
 | It would be good to be able to free old trampolines, which sounds | 
 | like a job for some form of RCU. | 
 | However, because it is necessary to be able to install a trace | 
 | anywhere in the code, it is not possible to use read-side markers | 
 | such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. | 
 | In addition, it does not work to have these markers in the trampoline | 
 | itself, because there would need to be instructions following | 
 | <tt>rcu_read_unlock()</tt>. | 
 | Although <tt>synchronize_rcu()</tt> would guarantee that execution | 
 | reached the <tt>rcu_read_unlock()</tt>, it would not be able to | 
 | guarantee that execution had completely left the trampoline. | 
 |  | 
 | <p> | 
 | The solution, in the form of | 
 | <a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, | 
 | is to have implicit | 
 | read-side critical sections that are delimited by voluntary context | 
 | switches, that is, calls to <tt>schedule()</tt>, | 
 | <tt>cond_resched_rcu_qs()</tt>, and | 
 | <tt>synchronize_rcu_tasks()</tt>. | 
 | In addition, transitions to and from userspace execution also delimit | 
 | tasks-RCU read-side critical sections. | 
 |  | 
 | <p> | 
 | The tasks-RCU API is quite compact, consisting only of | 
 | <tt>call_rcu_tasks()</tt>, | 
 | <tt>synchronize_rcu_tasks()</tt>, and | 
 | <tt>rcu_barrier_tasks()</tt>. | 
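 |  | 
 | <p> | 
 | For example, a tracing subsystem might free an old trampoline as | 
 | follows (the unregister and free functions are hypothetical | 
 | stand-ins): | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 unregister_trampoline(tramp); | 
 |  2 synchronize_rcu_tasks(); /* Wait for all tasks to leave tramp. */ | 
 |  3 free_trampoline(tramp); | 
 | </pre> | 
 | </blockquote> | 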
 |  | 
 | <h3><a name="Waiting for Multiple Grace Periods"> | 
 | Waiting for Multiple Grace Periods</a></h3> | 
 |  | 
 | <p> | 
 | Perhaps you have an RCU-protected data structure that is accessed from | 
 | RCU read-side critical sections, from softirq handlers, and from | 
 | hardware interrupt handlers. | 
 | That is three flavors of RCU: the normal flavor, the bottom-half | 
 | flavor, and the sched flavor. | 
 | How to wait for a compound grace period? | 
 |  | 
 | <p> | 
 | The best approach is usually to “just say no!” and | 
 | insert <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> | 
 | around each RCU read-side critical section, regardless of what | 
 | environment it happens to be in. | 
 | But suppose that some of the RCU read-side critical sections are | 
 | on extremely hot code paths, and that use of <tt>CONFIG_PREEMPT=n</tt> | 
 | is not a viable option, so that <tt>rcu_read_lock()</tt> and | 
 | <tt>rcu_read_unlock()</tt> are not free. | 
 | What then? | 
 |  | 
 | <p> | 
 | You <i>could</i> wait on all three grace periods in succession, as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 synchronize_rcu(); | 
 |  2 synchronize_rcu_bh(); | 
 |  3 synchronize_sched(); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | This works, but triples the update-side latency penalty. | 
 | In cases where this is not acceptable, <tt>synchronize_rcu_mult()</tt> | 
 | may be used to wait on all three flavors of grace period concurrently: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | But what if it is necessary to also wait on SRCU? | 
 | This can be done as follows: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 |  1 static void call_my_srcu(struct rcu_head *head, | 
 |  2        void (*func)(struct rcu_head *head)) | 
 |  3 { | 
 |  4   call_srcu(&my_srcu, head, func); | 
 |  5 } | 
 |  6 | 
 |  7 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched, call_my_srcu); | 
 | </pre> | 
 | </blockquote> | 
 |  | 
 | <p> | 
 | If you needed to wait on multiple different flavors of SRCU | 
 | (but why???), you would need to create a wrapper function resembling | 
 | <tt>call_my_srcu()</tt> for each SRCU flavor. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	But what if I need to wait for multiple RCU flavors, but I also need | 
 | 	the grace periods to be expedited? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	If you are using expedited grace periods, there should be less penalty | 
 | 	for waiting on them in succession. | 
 | 	But if that is nevertheless a problem, you can use workqueues | 
 | 	or multiple kthreads to wait on the various expedited grace | 
 | 	periods concurrently. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p> | 
 | Again, it is usually better to adjust the RCU read-side critical sections | 
 | to use a single flavor of RCU, but when this is not feasible, you can use | 
 | <tt>synchronize_rcu_mult()</tt>. | 
 |  | 
 | <h2><a name="Possible Future Changes">Possible Future Changes</a></h2> | 
 |  | 
 | <p> | 
 | One of the tricks that RCU uses to attain update-side scalability is | 
 | to increase grace-period latency with increasing numbers of CPUs. | 
 | If this becomes a serious problem, it will be necessary to rework the | 
 | grace-period state machine so as to avoid the need for the additional | 
 | latency. | 
 |  | 
 | <p> | 
 | Expedited grace periods scan the CPUs, so their latency and overhead | 
 | increase with increasing numbers of CPUs. | 
 | If this becomes a serious problem on large systems, it will be necessary | 
 | to do some redesign to avoid this scalability problem. | 
 |  | 
 | <p> | 
 | RCU disables CPU hotplug in a few places, perhaps most notably in the | 
 | <tt>rcu_barrier()</tt> operations. | 
 | If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug | 
 | notifiers, it will be necessary to avoid disabling CPU hotplug. | 
 | This would introduce some complexity, so there had better be a <i>very</i> | 
 | good reason. | 
 |  | 
 | <p> | 
 | The tradeoff between grace-period latency on the one hand and interruptions | 
 | of other CPUs on the other hand may need to be re-examined. | 
 | The desire is of course for zero grace-period latency as well as zero | 
 | interprocessor interrupts undertaken during an expedited grace period | 
 | operation. | 
 | While this ideal is unlikely to be achievable, it is quite possible that | 
 | further improvements can be made. | 
 |  | 
 | <p> | 
 | The multiprocessor implementations of RCU use a combining tree that | 
 | groups CPUs so as to reduce lock contention and increase cache locality. | 
 | However, this combining tree does not spread its memory across NUMA | 
 | nodes nor does it align the CPU groups with hardware features such | 
 | as sockets or cores. | 
 | Such spreading and alignment is currently believed to be unnecessary | 
 | because the hotpath read-side primitives do not access the combining | 
 | tree, nor does <tt>call_rcu()</tt> in the common case. | 
 | If you believe that your architecture needs such spreading and alignment, | 
 | then your architecture should also benefit from the | 
 | <tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set | 
 | to the number of CPUs in a socket, NUMA node, or whatever. | 
 | If the number of CPUs is too large, use a fraction of the number of | 
 | CPUs. | 
 | If the number of CPUs is a large prime number, well, that certainly | 
 | is an “interesting” architectural choice! | 
 | More flexible arrangements might be considered, but only if | 
 | <tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only | 
 | if the inadequacy has been demonstrated by a carefully run and | 
 | realistic system-level workload. | 
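 |  | 
 | <p> | 
 | For example, on a hypothetical system with 16-CPU sockets, the kernel | 
 | might be booted with: | 
 |  | 
 | <blockquote> | 
 | <pre> | 
 | rcutree.rcu_fanout_leaf=16 | 
 | </pre> | 
 | </blockquote> | 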
 |  | 
 | <p> | 
 | Please note that arrangements that require RCU to remap CPU numbers will | 
 | require extremely good demonstration of need and full exploration of | 
 | alternatives. | 
 |  | 
 | <p> | 
 | There is an embarrassingly large number of flavors of RCU, and this | 
 | number has been increasing over time. | 
 | Perhaps it will be possible to combine some at some future date. | 
 |  | 
 | <p> | 
 | RCU's various kthreads are reasonably recent additions. | 
 | It is quite likely that adjustments will be required to more gracefully | 
 | handle extreme loads. | 
 | It might also be necessary to be able to relate CPU utilization by | 
 | RCU's kthreads and softirq handlers to the code that instigated this | 
 | CPU utilization. | 
 | For example, RCU callback overhead might be charged back to the | 
 | originating <tt>call_rcu()</tt> instance, though probably not | 
 | in production kernels. | 
 |  | 
 | <h2><a name="Summary">Summary</a></h2> | 
 |  | 
 | <p> | 
 | This document has presented more than two decades' worth of RCU | 
 | requirements. | 
 | Given that the requirements keep changing, this will not be the last | 
 | word on this subject, but at least it serves to get an important | 
 | subset of the requirements set forth. | 
 |  | 
 | <h2><a name="Acknowledgments">Acknowledgments</a></h2> | 
 |  | 
 | I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, | 
 | Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and | 
 | Andy Lutomirski for their help in rendering | 
 | this article human readable, and to Michelle Rankin for her support | 
 | of this effort. | 
 | Other contributions are acknowledged in the Linux kernel's git archive. | 
 |  | 
 | </body></html> |