| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" |
| "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title> |
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</head><body>
| |
| <h2>Introduction</h2> |
| |
| This document describes RCU's expedited grace periods. |
| Unlike RCU's normal grace periods, which accept long latencies to attain |
| high efficiency and minimal disturbance, expedited grace periods accept |
| lower efficiency and significant disturbance to attain shorter latencies. |
| |
| <p> |
| There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched), |
| but only two flavors of expedited grace periods because the RCU-bh |
| expedited grace period maps onto the RCU-sched expedited grace period. |
| Each of the remaining two implementations is covered in its own section. |
| |
| <ol> |
| <li> <a href="#Expedited Grace Period Design"> |
| Expedited Grace Period Design</a> |
| <li> <a href="#RCU-preempt Expedited Grace Periods"> |
| RCU-preempt Expedited Grace Periods</a> |
| <li> <a href="#RCU-sched Expedited Grace Periods"> |
| RCU-sched Expedited Grace Periods</a> |
| <li> <a href="#Expedited Grace Period and CPU Hotplug"> |
| Expedited Grace Period and CPU Hotplug</a> |
| <li> <a href="#Expedited Grace Period Refinements"> |
| Expedited Grace Period Refinements</a> |
| </ol> |
| |
| <h2><a name="Expedited Grace Period Design"> |
| Expedited Grace Period Design</a></h2> |
| |
| <p> |
| The expedited RCU grace periods cannot be accused of being subtle, |
| given that they for all intents and purposes hammer every CPU that |
| has not yet provided a quiescent state for the current expedited |
| grace period. |
| The one saving grace is that the hammer has grown a bit smaller |
| over time: The old call to <tt>try_stop_cpus()</tt> has been |
| replaced with a set of calls to <tt>smp_call_function_single()</tt>, |
| each of which results in an IPI to the target CPU. |
| The corresponding handler function checks the CPU's state, motivating |
| a faster quiescent state where possible, and triggering a report |
| of that quiescent state. |
| As always for RCU, once everything has spent some time in a quiescent |
| state, the expedited grace period has completed. |
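
<p>
The following greatly simplified sketch illustrates this design.
It is not the kernel's actual code: the <tt>exp_ipi_unreported_cpus()</tt>
function and the flat <tt>exp_mask</tt> bitmap are hypothetical stand-ins
for the real <tt>rcu_node</tt>-based bookkeeping, and error handling
is omitted.

<blockquote><pre>
/* Hypothetical sketch, not the kernel's actual implementation. */
static void exp_ipi_unreported_cpus(unsigned long *exp_mask,
                                    smp_call_func_t handler)
{
        int cpu;

        for_each_online_cpu(cpu) {
                if (!test_bit(cpu, exp_mask))
                        continue;  /* Already in a quiescent state. */
                /* Final argument of zero: do not wait for the handler. */
                smp_call_function_single(cpu, handler, NULL, 0);
        }
}
</pre></blockquote>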
| |
| <p> |
| The details of the <tt>smp_call_function_single()</tt> handler's |
| operation depend on the RCU flavor, as described in the following |
| sections. |
| |
| <h2><a name="RCU-preempt Expedited Grace Periods"> |
| RCU-preempt Expedited Grace Periods</a></h2> |
| |
| <p> |
| The overall flow of the handling of a given CPU by an RCU-preempt |
| expedited grace period is shown in the following diagram: |
| |
| <p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%"> |
| |
| <p> |
| The solid arrows denote direct action, for example, a function call. |
| The dotted arrows denote indirect action, for example, an IPI |
| or a state that is reached after some time. |
| |
| <p> |
| If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt> |
| will ignore it because idle and offline CPUs are already residing |
| in quiescent states. |
| Otherwise, the expedited grace period will use |
| <tt>smp_call_function_single()</tt> to send the CPU an IPI, which |
| is handled by <tt>sync_rcu_exp_handler()</tt>. |
| |
| <p> |
| However, because this is preemptible RCU, <tt>sync_rcu_exp_handler()</tt> |
| can check to see if the CPU is currently running in an RCU read-side |
| critical section. |
| If not, the handler can immediately report a quiescent state. |
| Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt> |
| invocation will provide the needed quiescent-state report. |
| This flag-setting avoids the previous forced preemption of all |
| CPUs that might have RCU read-side critical sections. |
| In addition, this flag-setting is done so as to avoid increasing |
| the overhead of the common-case fastpath through the scheduler. |
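
<p>
A minimal sketch of this decision follows.
This is not the kernel's actual <tt>sync_rcu_exp_handler()</tt>;
the helpers suffixed <tt>_sketch</tt> are hypothetical stand-ins for
the real per-task and per-CPU state manipulation.

<blockquote><pre>
/* Hypothetical sketch, not the kernel's sync_rcu_exp_handler(). */
static void sync_rcu_exp_handler_sketch(void *unused)
{
        if (!in_rcu_read_side_cs_sketch()) {
                /* No RCU read-side critical section: report right away. */
                report_exp_qs_sketch(smp_processor_id());
        } else {
                /*
                 * Otherwise, set state so that the outermost
                 * rcu_read_unlock() reports the quiescent state,
                 * leaving the scheduler fastpath untouched.
                 */
                defer_qs_to_rcu_read_unlock_sketch();
        }
}
</pre></blockquote>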
| |
| <p> |
| Again because this is preemptible RCU, an RCU read-side critical section |
| can be preempted. |
When that happens, RCU will enqueue the task, which will then continue to
| block the current expedited grace period until it resumes and finds its |
| outermost <tt>rcu_read_unlock()</tt>. |
| The CPU will report a quiescent state just after enqueuing the task because |
| the CPU is no longer blocking the grace period. |
| It is instead the preempted task doing the blocking. |
| The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>, |
| which is called from <tt>rcu_preempt_note_context_switch()</tt>, which |
| in turn is called from <tt>rcu_note_context_switch()</tt>, which in |
| turn is called from the scheduler. |
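
<p>
The effect of this call chain can be summarized by the following
sketch, which is not the kernel's actual code; the helpers suffixed
<tt>_sketch</tt> are hypothetical stand-ins.

<blockquote><pre>
/* Hypothetical sketch of the context-switch path described above. */
static void rcu_preempt_ctx_switch_sketch(struct task_struct *t,
                                          struct rcu_node *rnp)
{
        if (task_in_rcu_read_side_cs_sketch(t)) {
                /* The preempted task now blocks the expedited grace period... */
                enqueue_blocked_task_sketch(rnp, t);
                /* ...so this CPU no longer does, and says as much. */
                report_exp_cpu_qs_sketch(rnp, smp_processor_id());
        }
}
</pre></blockquote>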
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| Why not just have the expedited grace period check the |
| state of all the CPUs? |
| After all, that would avoid all those real-time-unfriendly IPIs. |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| Because we want the RCU read-side critical sections to run fast, |
| which means no memory barriers. |
| Therefore, it is not possible to safely check the state from some |
| other CPU. |
| And even if it was possible to safely check the state, it would |
| still be necessary to IPI the CPU to safely interact with the |
| upcoming <tt>rcu_read_unlock()</tt> invocation, which means that |
| the remote state testing would not help the worst-case |
| latency that real-time applications care about. |
| |
| <p><font color="ffffff">One way to prevent your real-time |
| application from getting hit with these IPIs is to |
| build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>. |
| RCU would then perceive the CPU running your application |
| as being idle, and it would be able to safely detect that |
| state without needing to IPI the CPU. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <p> |
| Please note that this is just the overall flow: |
| Additional complications can arise due to races with CPUs going idle |
| or offline, among other things. |
| |
| <h2><a name="RCU-sched Expedited Grace Periods"> |
| RCU-sched Expedited Grace Periods</a></h2> |
| |
| <p> |
| The overall flow of the handling of a given CPU by an RCU-sched |
| expedited grace period is shown in the following diagram: |
| |
| <p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%"> |
| |
| <p> |
| As with RCU-preempt's <tt>synchronize_rcu_expedited()</tt>, |
| <tt>synchronize_sched_expedited()</tt> ignores offline and |
| idle CPUs, again because they are in remotely detectable |
| quiescent states. |
However, the <tt>synchronize_sched_expedited()</tt> handler
is <tt>sync_sched_exp_handler()</tt>, and because the
| <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> |
| leave no trace of their invocation, in general it is not possible to tell |
| whether or not the current CPU is in an RCU read-side critical section. |
| The best that <tt>sync_sched_exp_handler()</tt> can do is to check |
| for idle, on the off-chance that the CPU went idle while the IPI |
| was in flight. |
If the CPU is idle, then <tt>sync_sched_exp_handler()</tt> reports
| the quiescent state. |
| |
| <p> |
| Otherwise, the handler invokes <tt>resched_cpu()</tt>, which forces |
| a future context switch. |
| At the time of the context switch, the CPU reports the quiescent state. |
| Should the CPU go offline first, it will report the quiescent state |
| at that time. |
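
<p>
A minimal sketch of this handler's logic follows.
This is not the kernel's actual <tt>sync_sched_exp_handler()</tt>;
<tt>cpu_is_idle_sketch()</tt> and <tt>report_exp_qs_sketch()</tt>
are hypothetical stand-ins.

<blockquote><pre>
/* Hypothetical sketch, not the kernel's sync_sched_exp_handler(). */
static void sync_sched_exp_handler_sketch(void *unused)
{
        if (cpu_is_idle_sketch()) {
                /* The IPI caught the CPU idle: report immediately. */
                report_exp_qs_sketch(smp_processor_id());
                return;
        }
        /*
         * Otherwise, force a future context switch, at which point
         * this CPU will report its quiescent state.
         */
        resched_cpu(smp_processor_id());
}
</pre></blockquote>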
| |
| <h2><a name="Expedited Grace Period and CPU Hotplug"> |
| Expedited Grace Period and CPU Hotplug</a></h2> |
| |
| <p> |
The expedited nature of expedited grace periods requires a much tighter
| interaction with CPU hotplug operations than is required for normal |
| grace periods. |
| In addition, attempting to IPI offline CPUs will result in splats, but |
| failing to IPI online CPUs can result in too-short grace periods. |
| Neither option is acceptable in production kernels. |
| |
| <p> |
| The interaction between expedited grace periods and CPU hotplug operations |
| is carried out at several levels: |
| |
| <ol> |
| <li> The number of CPUs that have ever been online is tracked |
| by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt> |
| field. |
| The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt> |
| field tracks the number of CPUs that have ever been online |
| at the beginning of an RCU expedited grace period. |
| Note that this number never decreases, at least in the absence |
| of a time machine. |
<li>	The identities of the CPUs that have ever been online are
| tracked by the <tt>rcu_node</tt> structure's |
| <tt>->expmaskinitnext</tt> field. |
| The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> |
| field tracks the identities of the CPUs that were online |
| at least once at the beginning of the most recent RCU |
| expedited grace period. |
| The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and |
| <tt>->ncpus_snap</tt> fields are used to detect when |
| new CPUs have come online for the first time, that is, |
| when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt> |
| field has changed since the beginning of the last RCU |
| expedited grace period, which triggers an update of each |
| <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> |
	field from its <tt>->expmaskinitnext</tt> field, as shown
	in the sketch following this list.
| <li> Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> |
| field is used to initialize that structure's |
| <tt>->expmask</tt> at the beginning of each RCU |
| expedited grace period. |
| This means that only those CPUs that have been online at least |
| once will be considered for a given grace period. |
| <li> Any CPU that goes offline will clear its bit in its leaf |
| <tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt> |
| field, so any CPU with that bit clear can safely be ignored. |
| However, it is possible for a CPU coming online or going offline |
| to have this bit set for some time while <tt>cpu_online</tt> |
| returns <tt>false</tt>. |
| <li> For each non-idle CPU that RCU believes is currently online, the grace |
| period invokes <tt>smp_call_function_single()</tt>. |
| If this succeeds, the CPU was fully online. |
| Failure indicates that the CPU is in the process of coming online |
| or going offline, in which case it is necessary to wait for a |
| short time period and try again. |
| The purpose of this wait (or series of waits, as the case may be) |
| is to permit a concurrent CPU-hotplug operation to complete. |
| <li> In the case of RCU-sched, one of the last acts of an outgoing CPU |
| is to invoke <tt>rcu_report_dead()</tt>, which |
| reports a quiescent state for that CPU. |
| However, this is likely paranoia-induced redundancy. <!-- @@@ --> |
| </ol> |
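
<p>
The first three steps above are summarized by the following sketch.
This is not the kernel's actual code: locking, memory ordering, and
propagation of the masks up the <tt>rcu_node</tt> tree are omitted,
and the tree-walking macros are shown only loosely.

<blockquote><pre>
/* Hypothetical sketch of expedited grace-period mask initialization. */
static void sync_exp_reset_tree_sketch(struct rcu_state *rsp)
{
        struct rcu_node *rnp;

        /* Fold in any CPUs that came online since the last snapshot. */
        if (READ_ONCE(rsp->ncpus) != rsp->ncpus_snap) {
                rsp->ncpus_snap = READ_ONCE(rsp->ncpus);
                rcu_for_each_leaf_node(rsp, rnp)
                        rnp->expmaskinit = rnp->expmaskinitnext;
        }

        /* Only CPUs that have been online at least once are waited on. */
        rcu_for_each_node_breadth_first(rsp, rnp)
                rnp->expmask = rnp->expmaskinit;
}
</pre></blockquote>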
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| Why all the dancing around with multiple counters and masks |
| tracking CPUs that were once online? |
| Why not just have a single set of masks tracking the currently |
| online CPUs and be done with it? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
Maintaining a single set of masks tracking the online CPUs <i>sounds</i>
| easier, at least until you try working out all the race conditions |
| between grace-period initialization and CPU-hotplug operations. |
| For example, suppose initialization is progressing down the |
| tree while a CPU-offline operation is progressing up the tree. |
| This situation can result in bits set at the top of the tree |
| that have no counterparts at the bottom of the tree. |
| Those bits will never be cleared, which will result in |
| grace-period hangs. |
| In short, that way lies madness, to say nothing of a great many |
| bugs, hangs, and deadlocks. |
| |
| <p><font color="ffffff"> |
| In contrast, the current multi-mask multi-counter scheme ensures |
| that grace-period initialization will always see consistent masks |
| up and down the tree, which brings significant simplifications |
| over the single-mask method. |
| |
| <p><font color="ffffff"> |
| This is an instance of |
| <a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff"> |
| deferring work in order to avoid synchronization</a>. |
| Lazily recording CPU-hotplug events at the beginning of the next |
| grace period greatly simplifies maintenance of the CPU-tracking |
| bitmasks in the <tt>rcu_node</tt> tree. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <h2><a name="Expedited Grace Period Refinements"> |
| Expedited Grace Period Refinements</a></h2> |
| |
| <ol> |
| <li> <a href="#Idle-CPU Checks">Idle-CPU checks</a>. |
| <li> <a href="#Batching via Sequence Counter"> |
| Batching via sequence counter</a>. |
| <li> <a href="#Funnel Locking and Wait/Wakeup"> |
| Funnel locking and wait/wakeup</a>. |
| <li> <a href="#Use of Workqueues">Use of Workqueues</a>. |
| <li> <a href="#Stall Warnings">Stall warnings</a>. |
| <li> <a href="#Mid-Boot Operation">Mid-boot operation</a>. |
| </ol> |
| |
| <h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3> |
| |
| <p> |
| Each expedited grace period checks for idle CPUs when initially forming |
| the mask of CPUs to be IPIed and again just before IPIing a CPU |
| (both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>). |
| If the CPU is idle at any time between those two times, the CPU will |
| not be IPIed. |
| Instead, the task pushing the grace period forward will include the |
| idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>. |
| |
| <p> |
| For RCU-sched, there is an additional check for idle in the IPI |
| handler, <tt>sync_sched_exp_handler()</tt>. |
| If the IPI has interrupted the idle loop, then |
| <tt>sync_sched_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> |
| to report the corresponding quiescent state. |
| |
| <p> |
| For RCU-preempt, there is no specific check for idle in the |
| IPI handler (<tt>sync_rcu_exp_handler()</tt>), but because |
| RCU read-side critical sections are not permitted within the |
| idle loop, if <tt>sync_rcu_exp_handler()</tt> sees that the CPU is within |
an RCU read-side critical section, the CPU cannot possibly be idle.
| Otherwise, <tt>sync_rcu_exp_handler()</tt> invokes |
| <tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent |
| state, regardless of whether or not that quiescent state was due to |
| the CPU being idle. |
| |
| <p> |
| In summary, RCU expedited grace periods check for idle when building |
| the bitmask of CPUs that must be IPIed, just before sending each IPI, |
| and (either explicitly or implicitly) within the IPI handler. |
| |
| <h3><a name="Batching via Sequence Counter"> |
| Batching via Sequence Counter</a></h3> |
| |
| <p> |
| If each grace-period request was carried out separately, expedited |
| grace periods would have abysmal scalability and |
| problematic high-load characteristics. |
| Because each grace-period operation can serve an unlimited number of |
| updates, it is important to <i>batch</i> requests, so that a single |
| expedited grace-period operation will cover all requests in the |
| corresponding batch. |
| |
| <p> |
| This batching is controlled by a sequence counter named |
| <tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure. |
| This counter has an odd value when there is an expedited grace period |
| in progress and an even value otherwise, so that dividing the counter |
| value by two gives the number of completed grace periods. |
| During any given update request, the counter must transition from |
| even to odd and then back to even, thus indicating that a grace |
| period has elapsed. |
| Therefore, if the initial value of the counter is <tt>s</tt>, |
| the updater must wait until the counter reaches at least the |
| value <tt>(s+3)&~0x1</tt>. |
| This counter is managed by the following access functions: |
| |
| <ol> |
| <li> <tt>rcu_exp_gp_seq_start()</tt>, which marks the start of |
| an expedited grace period. |
| <li> <tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an |
| expedited grace period. |
| <li> <tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of |
| the counter. |
| <li> <tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt> |
| if a full expedited grace period has elapsed since the |
| corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>. |
| </ol> |
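
<p>
The snapshot and completion arithmetic described above is illustrated
by the following minimal sketch, which is not the kernel's actual
implementation of these access functions.

<blockquote><pre>
/* Hypothetical sketch of the sequence-counter arithmetic. */
static unsigned long exp_seq_snap_sketch(unsigned long seq)
{
        /*
         * Wait target: the end of a full grace period starting after
         * this point, that is, (seq + 3) with the bottom
         * grace-period-in-progress bit cleared.
         */
        return (seq + 3) & ~0x1UL;
}

static bool exp_seq_done_sketch(unsigned long seq, unsigned long s)
{
        /* True once the counter has reached the snapshotted target. */
        return ULONG_CMP_GE(seq, s);
}
</pre></blockquote>

<p>
For example, if the counter is initially three (an expedited grace
period in flight), the snapshot is six, which is not reached until
the in-flight grace period ends (four) and one full subsequent grace
period starts and ends (five, then six).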
| |
| <p> |
| Again, only one request in a given batch need actually carry out |
| a grace-period operation, which means there must be an efficient |
way to identify which of many concurrent requests will initiate
| the grace period, and that there be an efficient way for the |
| remaining requests to wait for that grace period to complete. |
| However, that is the topic of the next section. |
| |
| <h3><a name="Funnel Locking and Wait/Wakeup"> |
| Funnel Locking and Wait/Wakeup</a></h3> |
| |
| <p> |
| The natural way to sort out which of a batch of updaters will initiate |
| the expedited grace period is to use the <tt>rcu_node</tt> combining |
| tree, as implemented by the <tt>exp_funnel_lock()</tt> function. |
| The first updater corresponding to a given grace period arriving |
| at a given <tt>rcu_node</tt> structure records its desired grace-period |
| sequence number in the <tt>->exp_seq_rq</tt> field and moves up |
| to the next level in the tree. |
| Otherwise, if the <tt>->exp_seq_rq</tt> field already contains |
| the sequence number for the desired grace period or some later one, |
| the updater blocks on one of four wait queues in the |
| <tt>->exp_wq[]</tt> array, using the second-from-bottom |
and third-from-bottom bits as an index.
| An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure |
| synchronizes access to these fields. |
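
<p>
A minimal sketch of one level of this funnel is shown below.
This is not the kernel's actual <tt>exp_funnel_lock()</tt>, and the
<tt>expedited_gp_done_sketch()</tt> wait condition is a hypothetical
stand-in for the sequence-counter check.

<blockquote><pre>
/* Hypothetical sketch of one level of the funnel lock. */
static bool exp_funnel_claim_or_wait_sketch(struct rcu_node *rnp,
                                            unsigned long s)
{
        bool wait = false;

        spin_lock(&rnp->exp_lock);
        if (ULONG_CMP_GE(rnp->exp_seq_rq, s))
                wait = true;         /* This or a later GP already requested. */
        else
                rnp->exp_seq_rq = s; /* Record the request, then move up. */
        spin_unlock(&rnp->exp_lock);

        /* Waitqueue index: second- and third-from-bottom bits of s. */
        if (wait)
                wait_event(rnp->exp_wq[(s >> 1) & 0x3],
                           expedited_gp_done_sketch(s));
        return wait;
}
</pre></blockquote>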
| |
| <p> |
| An empty <tt>rcu_node</tt> tree is shown in the following diagram, |
| with the white cells representing the <tt>->exp_seq_rq</tt> field |
| and the red cells representing the elements of the |
| <tt>->exp_wq[]</tt> array. |
| |
| <p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%"> |
| |
| <p> |
| The next diagram shows the situation after the arrival of Task A |
| and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt> |
| structures, respectively. |
| The current value of the <tt>rcu_state</tt> structure's |
| <tt>->expedited_sequence</tt> field is zero, so adding three and |
| clearing the bottom bit results in the value two, which both tasks |
| record in the <tt>->exp_seq_rq</tt> field of their respective |
| <tt>rcu_node</tt> structures: |
| |
| <p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%"> |
| |
| <p> |
| Each of Tasks A and B will move up to the root |
| <tt>rcu_node</tt> structure. |
| Suppose that Task A wins, recording its desired grace-period sequence |
| number and resulting in the state shown below: |
| |
| <p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%"> |
| |
| <p> |
| Task A now advances to initiate a new grace period, while Task B |
| moves up to the root <tt>rcu_node</tt> structure, and, seeing that |
| its desired sequence number is already recorded, blocks on |
| <tt>->exp_wq[1]</tt>. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| Why <tt>->exp_wq[1]</tt>? |
	Given that the value of these tasks' desired sequence number is
	two, shouldn't they instead block on <tt>->exp_wq[2]</tt>?
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| No. |
| |
| <p><font color="ffffff"> |
| Recall that the bottom bit of the desired sequence number indicates |
| whether or not a grace period is currently in progress. |
| It is therefore necessary to shift the sequence number right one |
| bit position to obtain the number of the grace period. |
| This results in <tt>->exp_wq[1]</tt>. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <p> |
| If Tasks C and D also arrive at this point, they will compute the |
| same desired grace-period sequence number, and see that both leaf |
| <tt>rcu_node</tt> structures already have that value recorded. |
| They will therefore block on their respective <tt>rcu_node</tt> |
| structures' <tt>->exp_wq[1]</tt> fields, as shown below: |
| |
| <p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%"> |
| |
| <p> |
| Task A now acquires the <tt>rcu_state</tt> structure's |
| <tt>->exp_mutex</tt> and initiates the grace period, which |
| increments <tt>->expedited_sequence</tt>. |
| Therefore, if Tasks E and F arrive, they will compute |
| a desired sequence number of 4 and will record this value as |
| shown below: |
| |
| <p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%"> |
| |
| <p> |
| Tasks E and F will propagate up the <tt>rcu_node</tt> |
| combining tree, with Task F blocking on the root <tt>rcu_node</tt> |
structure and Task E waiting for Task A to finish so that
| it can start the next grace period. |
| The resulting state is as shown below: |
| |
| <p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%"> |
| |
| <p> |
| Once the grace period completes, Task A |
| starts waking up the tasks waiting for this grace period to complete, |
| increments the <tt>->expedited_sequence</tt>, |
| acquires the <tt>->exp_wake_mutex</tt> and then releases the |
| <tt>->exp_mutex</tt>. |
| This results in the following state: |
| |
| <p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%"> |
| |
| <p> |
| Task E can then acquire <tt>->exp_mutex</tt> and increment |
| <tt>->expedited_sequence</tt> to the value three. |
If new tasks G and H arrive and move up the combining tree at the
| same time, the state will be as follows: |
| |
| <p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%"> |
| |
| <p> |
| Note that three of the root <tt>rcu_node</tt> structure's |
| waitqueues are now occupied. |
| However, at some point, Task A will wake up the |
| tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting |
| in the following state: |
| |
| <p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%"> |
| |
| <p> |
| Execution will continue with Tasks E and H completing |
| their grace periods and carrying out their wakeups. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| What happens if Task A takes so long to do its wakeups |
| that Task E's grace period completes? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| Then Task E will block on the <tt>->exp_wake_mutex</tt>, |
| which will also prevent it from releasing <tt>->exp_mutex</tt>, |
| which in turn will prevent the next grace period from starting. |
| This last is important in preventing overflow of the |
| <tt>->exp_wq[]</tt> array. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <h3><a name="Use of Workqueues">Use of Workqueues</a></h3> |
| |
| <p> |
| In earlier implementations, the task requesting the expedited |
| grace period also drove it to completion. |
| This straightforward approach had the disadvantage of needing to |
| account for POSIX signals sent to user tasks, |
so more recent implementations use the Linux kernel's
| <a href="https://www.kernel.org/doc/Documentation/core-api/workqueue.rst">workqueues</a>. |
| |
| <p> |
| The requesting task still does counter snapshotting and funnel-lock |
| processing, but the task reaching the top of the funnel lock |
does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>)
so that a workqueue kthread does the actual grace-period processing.
| Because workqueue kthreads do not accept POSIX signals, grace-period-wait |
| processing need not allow for POSIX signals. |
| |
<p>
In addition, this approach allows wakeups for the previous expedited
| grace period to be overlapped with processing for the next expedited |
| grace period. |
| Because there are only four sets of waitqueues, it is necessary to |
| ensure that the previous grace period's wakeups complete before the |
| next grace period's wakeups start. |
| This is handled by having the <tt>->exp_mutex</tt> |
| guard expedited grace-period processing and the |
| <tt>->exp_wake_mutex</tt> guard wakeups. |
| The key point is that the <tt>->exp_mutex</tt> is not released |
| until the first wakeup is complete, which means that the |
| <tt>->exp_wake_mutex</tt> has already been acquired at that point. |
| This approach ensures that the previous grace period's wakeups can |
| be carried out while the current grace period is in process, but |
| that these wakeups will complete before the next grace period starts. |
| This means that only three waitqueues are required, guaranteeing that |
| the four that are provided are sufficient. |
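
<p>
The following sketch shows the wakeup-side ordering described above.
It is not the kernel's actual code: the requesting task (not shown)
holds <tt>->exp_mutex</tt> across the grace period and does not release
it until it has itself been awakened, and the helpers suffixed
<tt>_sketch</tt> are hypothetical stand-ins.

<blockquote><pre>
/* Hypothetical sketch of the grace-period/wakeup mutex handoff. */
static void exp_wait_wake_sketch(struct rcu_state *rsp, unsigned long s)
{
        /* All required quiescent states have now been reported. */
        wait_for_exp_gp_end_sketch(rsp);
        rcu_exp_gp_seq_end(rsp);          /* Counter is even once again. */

        mutex_lock(&rsp->exp_wake_mutex); /* Serialize wakeup phases. */
        /*
         * Wake everyone blocked on this grace period, including the
         * requester, which only then releases ->exp_mutex and thereby
         * allows the next expedited grace period to start.
         */
        wake_all_exp_waiters_sketch(rsp, s);
        mutex_unlock(&rsp->exp_wake_mutex);
}
</pre></blockquote>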
| |
| <h3><a name="Stall Warnings">Stall Warnings</a></h3> |
| |
| <p> |
| Expediting grace periods does nothing to speed things up when RCU |
| readers take too long, and therefore expedited grace periods check |
| for stalls just as normal grace periods do. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| But why not just let the normal grace-period machinery |
| detect the stalls, given that a given reader must block |
| both normal and expedited grace periods? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| Because it is quite possible that at a given time there |
| is no normal grace period in progress, in which case the |
| normal grace period cannot emit a stall warning. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
<p>
The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting
| for the expedited grace period to end, but with a timeout set to the |
| current RCU CPU stall-warning time. |
| If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures |
| blocking the current grace period are printed. |
| Each stall warning results in another pass through the loop, but the |
| second and subsequent passes use longer stall times. |
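
<p>
A minimal sketch of this loop follows.
It is not the kernel's actual
<tt>synchronize_sched_expedited_wait()</tt>; the waitqueue argument
and the helpers suffixed <tt>_sketch</tt> are hypothetical stand-ins.

<blockquote><pre>
/* Hypothetical sketch of the expedited stall-warning loop. */
static void exp_stall_wait_sketch(struct rcu_state *rsp,
                                  wait_queue_head_t *wq)
{
        unsigned long timeout = exp_stall_timeout_jiffies_sketch();

        for (;;) {
                if (wait_event_timeout(*wq, exp_gp_done_sketch(rsp),
                                       timeout))
                        return;  /* Grace period completed in time. */
                /* Print the CPUs and rcu_node structures still blocking. */
                print_exp_stall_info_sketch(rsp);
                /* Second and subsequent passes use longer stall times. */
                timeout = exp_next_stall_timeout_sketch(timeout);
        }
}
</pre></blockquote>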
| |
| <h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3> |
| |
| <p> |
| The use of workqueues has the advantage that the expedited |
| grace-period code need not worry about POSIX signals. |
| Unfortunately, it has the |
| corresponding disadvantage that workqueues cannot be used until |
| they are initialized, which does not happen until some time after |
| the scheduler spawns the first task. |
| Given that there are parts of the kernel that really do want to |
| execute grace periods during this mid-boot “dead zone”, |
expedited grace periods must do something else during this time.
| |
| <p> |
| What they do is to fall back to the old practice of requiring that the |
| requesting task drive the expedited grace period, as was the case |
| before the use of workqueues. |
| However, the requesting task is only required to drive the grace period |
| during the mid-boot dead zone. |
| Before mid-boot, a synchronous grace period is a no-op. |
| Some time after mid-boot, workqueues are used. |
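
<p>
A minimal sketch of this decision is shown below.
It is not the kernel's actual code; the three-valued
<tt>boot_phase_sketch</tt> variable and the helpers suffixed
<tt>_sketch</tt> are hypothetical stand-ins for the kernel's
boot-progress tracking.

<blockquote><pre>
/* Hypothetical sketch of expedited grace periods across boot phases. */
static void synchronize_expedited_sketch(void)
{
        switch (boot_phase_sketch) {
        case BOOT_SINGLE_TASK_SKETCH:
                /* Only one task exists, so a grace period is a no-op. */
                return;
        case BOOT_MID_DEAD_ZONE_SKETCH:
                /* Workqueues not yet usable: the requester drives the GP. */
                drive_expedited_gp_sketch();
                return;
        default:
                /* Normal operation: a workqueue kthread drives the GP. */
                queue_expedited_gp_work_sketch();
                wait_for_expedited_gp_sketch();
                return;
        }
}
</pre></blockquote>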
| |
| <p> |
| Non-expedited non-SRCU synchronous grace periods must also operate |
| normally during mid-boot. |
| This is handled by causing non-expedited grace periods to take the |
| expedited code path during mid-boot. |
| |
| <p> |
| The current code assumes that there are no POSIX signals during |
| the mid-boot dead zone. |
| However, if an overwhelming need for POSIX signals somehow arises, |
| appropriate adjustments can be made to the expedited stall-warning code. |
| One such adjustment would reinstate the pre-workqueue stall-warning |
| checks, but only during the mid-boot dead zone. |
| |
| <p> |
| With this refinement, synchronous grace periods can now be used from |
| task context pretty much any time during the life of the kernel. |
| |
| <h3><a name="Summary"> |
| Summary</a></h3> |
| |
| <p> |
| Expedited grace periods use a sequence-number approach to promote |
| batching, so that a single grace-period operation can serve numerous |
| requests. |
| A funnel lock is used to efficiently identify the one task out of |
| a concurrent group that will request the grace period. |
| All members of the group will block on waitqueues provided in |
| the <tt>rcu_node</tt> structure. |
| The actual grace-period processing is carried out by a workqueue. |
| |
| <p> |
| CPU-hotplug operations are noted lazily in order to prevent the need |
| for tight synchronization between expedited grace periods and |
| CPU-hotplug operations. |
| The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, |
| at least in the common case. |
| RCU-preempt and RCU-sched use different IPI handlers and different |
| code to respond to the state changes carried out by those handlers, |
| but otherwise use common code. |
| |
| <p> |
| Quiescent states are tracked using the <tt>rcu_node</tt> tree, |
| and once all necessary quiescent states have been reported, |
| all tasks waiting on this expedited grace period are awakened. |
| A pair of mutexes are used to allow one grace period's wakeups |
| to proceed concurrently with the next grace period's processing. |
| |
| <p> |
| This combination of mechanisms allows expedited grace periods to |
| run reasonably efficiently. |
| However, for non-time-critical tasks, normal grace periods should be |
| used instead because their longer duration permits much higher |
| degrees of batching, and thus much lower per-request overheads. |
| |
| </body></html> |