Commit 58d30c36 authored by Ingo Molnar's avatar Ingo Molnar
Browse files

Merge branch 'for-mingo' of...

Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

 into core/rcu

Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).
Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
parents 94836ecf f2094107
......@@ -17,7 +17,7 @@ rcu_dereference.txt
rcubarrier.txt
- RCU and Unloadable Modules
rculist_nulls.txt
- RCU list primitives for use with SLAB_DESTROY_BY_RCU
- RCU list primitives for use with SLAB_TYPESAFE_BY_RCU
rcuref.txt
- Reference-count design for elements of lists/arrays protected by RCU
rcu.txt
......
......@@ -19,6 +19,8 @@ to each other.
The <tt>rcu_state</tt> Structure</a>
<li> <a href="#The rcu_node Structure">
The <tt>rcu_node</tt> Structure</a>
<li> <a href="#The rcu_segcblist Structure">
The <tt>rcu_segcblist</tt> Structure</a>
<li> <a href="#The rcu_data Structure">
The <tt>rcu_data</tt> Structure</a>
<li> <a href="#The rcu_dynticks Structure">
......@@ -841,6 +843,134 @@ for lockdep lock-class names.
Finally, lines&nbsp;64-66 produce an error if the maximum number of
CPUs is too large for the specified fanout.
<h3><a name="The rcu_segcblist Structure">
The <tt>rcu_segcblist</tt> Structure</a></h3>
The <tt>rcu_segcblist</tt> structure maintains a segmented list of
callbacks as follows:
<pre>
1 #define RCU_DONE_TAIL 0
2 #define RCU_WAIT_TAIL 1
3 #define RCU_NEXT_READY_TAIL 2
4 #define RCU_NEXT_TAIL 3
5 #define RCU_CBLIST_NSEGS 4
6
7 struct rcu_segcblist {
8 struct rcu_head *head;
9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
11 long len;
12 long len_lazy;
13 };
</pre>
<p>
The segments are as follows:
<ol>
<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
These callbacks are ready to be invoked.
<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
current grace period.
Note that different CPUs can have different ideas about which
grace period is current, hence the <tt>-&gt;gp_seq</tt> field.
<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next
grace period to start.
<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been
associated with a grace period.
</ol>
<p>
The <tt>-&gt;head</tt> pointer references the first callback or
is <tt>NULL</tt> if the list contains no callbacks (which is
<i>not</i> the same as being empty).
Each element of the <tt>-&gt;tails[]</tt> array references the
<tt>-&gt;next</tt> pointer of the last callback in the corresponding
segment of the list, or the list's <tt>-&gt;head</tt> pointer if
that segment and all previous segments are empty.
If the corresponding segment is empty but some previous segment is
not empty, then the array element is identical to its predecessor.
Older callbacks are closer to the head of the list, and new callbacks
are added at the tail.
This relationship between the <tt>-&gt;head</tt> pointer, the
<tt>-&gt;tails[]</tt> array, and the callbacks is shown in this
diagram:
</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
</p><p>In this figure, the <tt>-&gt;head</tt> pointer references the
first
RCU callback in the list.
The <tt>-&gt;tails[RCU_DONE_TAIL]</tt> array element references
the <tt>-&gt;head</tt> pointer itself, indicating that none
of the callbacks is ready to invoke.
The <tt>-&gt;tails[RCU_WAIT_TAIL]</tt> array element references callback
CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period,
give or take possible disagreements about exactly which grace period
is the current one.
The <tt>-&gt;tails[RCU_NEXT_READY_TAIL]</tt> array element
references the same RCU callback that <tt>-&gt;tails[RCU_WAIT_TAIL]</tt>
does, which indicates that there are no callbacks waiting on the next
RCU grace period.
The <tt>-&gt;tails[RCU_NEXT_TAIL]</tt> array element references
CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
remaining RCU callbacks have not yet been assigned to an RCU grace
period.
Note that the <tt>-&gt;tails[RCU_NEXT_TAIL]</tt> array element
always references the last RCU callback's <tt>-&gt;next</tt> pointer
unless the callback list is empty, in which case it references
the <tt>-&gt;head</tt> pointer.
<p>
There is one additional important special case for the
<tt>-&gt;tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
when this list is <i>disabled</i>.
Lists are disabled when the corresponding CPU is offline or when
the corresponding CPU's callbacks are offloaded to a kthread,
both of which are described elsewhere.
</p><p>CPUs advance their callbacks from the
<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
as grace periods advance.
</p><p>The <tt>-&gt;gp_seq[]</tt> array records grace-period
numbers corresponding to the list segments.
This is what allows different CPUs to have different ideas as to
which is the current grace period while still avoiding premature
invocation of their callbacks.
In particular, this allows CPUs that go idle for extended periods
to determine which of their callbacks are ready to be invoked after
reawakening.
</p><p>The <tt>-&gt;len</tt> counter contains the number of
callbacks in <tt>-&gt;head</tt>, and the
<tt>-&gt;len_lazy</tt> contains the number of those callbacks that
are known to only free memory, and whose invocation can therefore
be safely deferred.
<p><b>Important note</b>: It is the <tt>-&gt;len</tt> field that
determines whether or not there are callbacks associated with
this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>-&gt;head</tt>
pointer.
The reason for this is that all the ready-to-invoke callbacks
(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
all at once at callback-invocation time.
If callback invocation must be postponed, for example, because a
high-priority process just woke up on this CPU, then the remaining
callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment.
Either way, the <tt>-&gt;len</tt> and <tt>-&gt;len_lazy</tt> counts
are adjusted after the corresponding callbacks have been invoked, and so
again it is the <tt>-&gt;len</tt> count that accurately reflects whether
or not there are callbacks associated with this <tt>rcu_segcblist</tt>
structure.
Of course, off-CPU sampling of the <tt>-&gt;len</tt> count requires
the use of appropriate synchronization, for example, memory barriers.
This synchronization can be a bit subtle, particularly in the case
of <tt>rcu_barrier()</tt>.
<h3><a name="The rcu_data Structure">
The <tt>rcu_data</tt> Structure</a></h3>
......@@ -983,62 +1113,18 @@ choice.
as follows:
<pre>
1 struct rcu_head *nxtlist;
2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
4 long qlen_lazy;
5 long qlen;
6 long qlen_last_fqs_check;
1 struct rcu_segcblist cblist;
2 long qlen_last_fqs_check;
3 unsigned long n_cbs_invoked;
4 unsigned long n_nocbs_invoked;
5 unsigned long n_cbs_orphaned;
6 unsigned long n_cbs_adopted;
7 unsigned long n_force_qs_snap;
8 unsigned long n_cbs_invoked;
9 unsigned long n_cbs_orphaned;
10 unsigned long n_cbs_adopted;
11 long blimit;
8 long blimit;
</pre>
<p>The <tt>-&gt;nxtlist</tt> pointer and the
<tt>-&gt;nxttail[]</tt> array form a four-segment list with
older callbacks near the head and newer ones near the tail.
Each segment contains callbacks with the corresponding relationship
to the current grace period.
The pointer out of the end of each of the four segments is referenced
by the element of the <tt>-&gt;nxttail[]</tt> array indexed by
<tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
<tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
<tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
grace period), and
<tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
with a specific grace period)
respectively, as shown in the following figure.
</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
</p><p>In this figure, the <tt>-&gt;nxtlist</tt> pointer references the
first
RCU callback in the list.
The <tt>-&gt;nxttail[RCU_DONE_TAIL]</tt> array element references
the <tt>-&gt;nxtlist</tt> pointer itself, indicating that none
of the callbacks is ready to invoke.
The <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt> array element references callback
CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period.
The <tt>-&gt;nxttail[RCU_NEXT_READY_TAIL]</tt> array element
references the same RCU callback that <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt>
does, which indicates that there are no callbacks waiting on the next
RCU grace period.
The <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element references
CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
remaining RCU callbacks have not yet been assigned to an RCU grace
period.
Note that the <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element
always references the last RCU callback's <tt>-&gt;next</tt> pointer
unless the callback list is empty, in which case it references
the <tt>-&gt;nxtlist</tt> pointer.
</p><p>CPUs advance their callbacks from the
<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
as grace periods advance.
<p>The <tt>-&gt;cblist</tt> structure is the segmented callback list
described earlier.
The CPU advances the callbacks in its <tt>rcu_data</tt> structure
whenever it notices that another RCU grace period has completed.
The CPU detects the completion of an RCU grace period by noticing
......@@ -1049,16 +1135,7 @@ Recall that each <tt>rcu_node</tt> structure's
<tt>-&gt;completed</tt> field is updated at the end of each
grace period.
</p><p>The <tt>-&gt;nxtcompleted[]</tt> array records grace-period
numbers corresponding to the list segments.
This allows CPUs that go idle for extended periods to determine
which of their callbacks are ready to be invoked after reawakening.
</p><p>The <tt>-&gt;qlen</tt> counter contains the number of
callbacks in <tt>-&gt;nxtlist</tt>, and the
<tt>-&gt;qlen_lazy</tt> contains the number of those callbacks that
are known to only free memory, and whose invocation can therefore
be safely deferred.
<p>
The <tt>-&gt;qlen_last_fqs_check</tt> and
<tt>-&gt;n_force_qs_snap</tt> coordinate the forcing of quiescent
states from <tt>call_rcu()</tt> and friends when callback
......@@ -1069,6 +1146,10 @@ lists grow excessively long.
fields count the number of callbacks invoked,
sent to other CPUs when this CPU goes offline,
and received from other CPUs when those other CPUs go offline.
The <tt>-&gt;n_nocbs_invoked</tt> is used when the CPU's callbacks
are offloaded to a kthread.
<p>
Finally, the <tt>-&gt;blimit</tt> counter is the maximum number of
RCU callbacks that may be invoked at a given time.
......@@ -1104,6 +1185,9 @@ Its fields are as follows:
1 int dynticks_nesting;
2 int dynticks_nmi_nesting;
3 atomic_t dynticks;
4 bool rcu_need_heavy_qs;
5 unsigned long rcu_qs_ctr;
6 bool rcu_urgent_qs;
</pre>
<p>The <tt>-&gt;dynticks_nesting</tt> field counts the
......@@ -1117,11 +1201,32 @@ NMIs are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
field, except that NMIs that interrupt non-dyntick-idle execution
are not counted.
</p><p>Finally, the <tt>-&gt;dynticks</tt> field counts the corresponding
</p><p>The <tt>-&gt;dynticks</tt> field counts the corresponding
CPU's transitions to and from dyntick-idle mode, so that this counter
has an even value when the CPU is in dyntick-idle mode and an odd
value otherwise.
</p><p>The <tt>-&gt;rcu_need_heavy_qs</tt> field is used
to record the fact that the RCU core code would really like to
see a quiescent state from the corresponding CPU, so much so that
it is willing to call for heavy-weight dyntick-counter operations.
This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
code, which provide a momentary idle sojourn in response.
</p><p>The <tt>-&gt;rcu_qs_ctr</tt> field is used to record
quiescent states from <tt>cond_resched()</tt>.
Because <tt>cond_resched()</tt> can execute quite frequently, this
must be quite lightweight, as in a non-atomic increment of this
per-CPU field.
</p><p>Finally, the <tt>-&gt;rcu_urgent_qs</tt> field is used to record
the fact that the RCU core code would really like to see a quiescent
state from the corresponding CPU, with the various other fields indicating
just how badly RCU wants this quiescent state.
This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
code, which, if nothing else, non-atomically increment <tt>-&gt;rcu_qs_ctr</tt>
in response.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
......
......@@ -19,7 +19,7 @@
id="svg2"
version="1.1"
inkscape:version="0.48.4 r9939"
sodipodi:docname="nxtlist.fig">
sodipodi:docname="segcblist.svg">
<metadata
id="metadata94">
<rdf:RDF>
......@@ -28,7 +28,7 @@
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
<dc:title />
</cc:Work>
</rdf:RDF>
</metadata>
......@@ -241,61 +241,51 @@
xml:space="preserve"
x="225"
y="675"
fill="#000000"
font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
text-anchor="start"
id="text64">nxtlist</text>
id="text64"
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;head</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="1800"
fill="#000000"
font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
text-anchor="start"
id="text66">nxttail[RCU_DONE_TAIL]</text>
id="text66"
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_DONE_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="2925"
fill="#000000"
font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
text-anchor="start"
id="text68">nxttail[RCU_WAIT_TAIL]</text>
id="text68"
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_WAIT_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="4050"
fill="#000000"
font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
text-anchor="start"
id="text70">nxttail[RCU_NEXT_READY_TAIL]</text>
id="text70"
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_NEXT_READY_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="5175"
fill="#000000"
font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
text-anchor="start"
id="text72">nxttail[RCU_NEXT_TAIL]</text>
id="text72"
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_NEXT_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
......
......@@ -284,6 +284,7 @@ Expedited Grace Period Refinements</a></h2>
Funnel locking and wait/wakeup</a>.
<li> <a href="#Use of Workqueues">Use of Workqueues</a>.
<li> <a href="#Stall Warnings">Stall warnings</a>.
<li> <a href="#Mid-Boot Operation">Mid-boot operation</a>.
</ol>
<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3>
......@@ -524,7 +525,7 @@ their grace periods and carrying out their wakeups.
In earlier implementations, the task requesting the expedited
grace period also drove it to completion.
This straightforward approach had the disadvantage of needing to
account for signals sent to user tasks,
account for POSIX signals sent to user tasks,
so more recent implemementations use the Linux kernel's
<a href="https://www.kernel.org/doc/Documentation/workqueue.txt">workqueues</a>.
......@@ -533,8 +534,8 @@ The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock
does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>
so that a workqueue kthread does the actual grace-period processing.
Because workqueue kthreads do not accept signals, grace-period-wait
processing need not allow for signals.
Because workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals.
In addition, this approach allows wakeups for the previous expedited
grace period to be overlapped with processing for the next expedited
......@@ -586,6 +587,46 @@ blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.
<h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3>
<p>
The use of workqueues has the advantage that the expedited
grace-period code need not worry about POSIX signals.
Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until
they are initialized, which does not happen until some time after
the scheduler spawns the first task.
Given that there are parts of the kernel that really do want to
execute grace periods during this mid-boot &ldquo;dead zone&rdquo;,
expedited grace periods must do something else during thie time.
<p>
What they do is to fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case
before the use of workqueues.
However, the requesting task is only required to drive the grace period
during the mid-boot dead zone.
Before mid-boot, a synchronous grace period is a no-op.
Some time after mid-boot, workqueues are used.
<p>
Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot.
This is handled by causing non-expedited grace periods to take the
expedited code path during mid-boot.
<p>
The current code assumes that there are no POSIX signals during
the mid-boot dead zone.
However, if an overwhelming need for POSIX signals somehow arises,
appropriate adjustments can be made to the expedited stall-warning code.
One such adjustment would reinstate the pre-workqueue stall-warning
checks, but only during the mid-boot dead zone.
<p>
With this refinement, synchronous grace periods can now be used from
task context pretty much any time during the life of the kernel.
<h3><a name="Summary">
Summary</a></h3>
......
......@@ -659,8 +659,9 @@ systems with more than one CPU:
In other words, a given instance of <tt>synchronize_rcu()</tt>
can avoid waiting on a given RCU read-side critical section only
if it can prove that <tt>synchronize_rcu()</tt> started first.
</font>
<p>
<p><font color="ffffff">
A related question is &ldquo;When <tt>rcu_read_lock()</tt>
doesn't generate any code, why does it matter how it relates
to a grace period?&rdquo;
......@@ -675,8 +676,9 @@ systems with more than one CPU:
within the critical section, in which case none of the accesses
within the critical section may observe the effects of any
access following the grace period.
</font>
<p>
<p><font color="ffffff">
As of late 2016, mathematical models of RCU take this
viewpoint, for example, see slides&nbsp;62 and&nbsp;63
of the
......@@ -1616,8 +1618,8 @@ CPUs should at least make reasonable forward progress.
In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
is permitted to impose modest degradation of real-time latency
on non-idle online CPUs.
That said, it will likely be necessary to take further steps to reduce this
degradation, hopefully to roughly that of a scheduling-clock interrupt.
Here, &ldquo;modest&rdquo; means roughly the same latency
degradation as a scheduling-clock interrupt.
<p>
There are a number of situations where even
......@@ -1913,12 +1915,9 @@ This requirement is another factor driving batching of grace periods,
but it is also the driving force behind the checks for large numbers
of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
Finally, high update rates should not delay RCU read-side critical
sections, although some read-side delays can occur when using
sections, although some small read-side delays can occur when using
<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
of <tt>try_stop_cpus()</tt>.
(In the future, <tt>synchronize_rcu_expedited()</tt> will be
converted to use lighter-weight inter-processor interrupts (IPIs),
but this will still disturb readers, though to a much smaller degree.)
of <tt>smp_call_function_single()</tt>.
<p>
Although all three of these corner cases were understood in the early
......@@ -2154,7 +2153,8 @@ as will <tt>rcu_assign_pointer()</tt>.
<p>
Although <tt>call_rcu()</tt> may be invoked at any
time during boot, callbacks are not guaranteed to be invoked until after
the scheduler is fully up and running.
all of RCU's kthreads have been spawned, which occurs at
<tt>early_initcall()</tt> time.
This delay in callback invocation is due to the fact that RCU does not
invoke callbacks until it is fully initialized, and this full initialization
cannot occur until after the scheduler has initialized itself to the
......@@ -2167,8 +2167,10 @@ on what operations those callbacks could invoke.
Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
(<a href="#Bottom-Half Flavor">discussed below</a>),
and
<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>
<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>,
<tt>synchronize_rcu_expedited()</tt>,
<tt>synchronize_rcu_bh_expedited()</tt>, and
<tt>synchronize_sched_expedited()</tt>
will all operate normally
during very early boot, the reason being that there is only one CPU
and preemption is disabled.
......@@ -2178,45 +2180,59 @@ state and thus a grace period, so the early-boot implementation can
be a no-op.
<p>
Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt>
continue to operate normally through the remainder of boot, courtesy
of the fact that preemption is disabled across their RCU read-side
critical sections and also courtesy of the fact that there is still
only one CPU.
However, once the scheduler starts initializing, preemption is enabled.
There is still only a single CPU, but the fact that preemption is enabled
means that the no-op implementation of <tt>synchronize_rcu()</tt> no
longer works in <tt>CONFIG_PREEMPT=y</tt> kernels.
Therefore, as soon as the scheduler starts initializing, the early-boot
fastpath is disabled.
This means that <tt>synchronize_rcu()</tt> switches to its runtime
mode of operation where it posts callbacks, which in turn means that
any call to <tt>synchronize_rcu()</tt> will block until the corresponding
callback is invoked.
Unfortunately, the callback cannot be invoked until RCU's runtime
grace-period machinery is up and running, which cannot happen until
the scheduler has initialized itself sufficiently to allow RCU's
kthreads to be spawned.
Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler
initialization can result in deadlock.
However, once the scheduler has spawned its first kthread, this early
boot trick fails for <tt>synchronize_rcu()</tt> (as well as for
<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt>
kernels.
The reason is that an RCU read-side critical section might be preempted,
which means that a subsequent <tt>synchronize_rcu()</tt> really does have
to wait for something, as opposed to simply returning immediately.
Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of
its kthreads are spawned, which doesn't happen until some time during
<tt>early_initcalls()</tt> time.
But this is no excuse: RCU is nevertheless required to correctly handle
synchronous grace periods during this time period.
Once all of its kthreads are up and running, RCU starts running
normally.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
So what happens with <tt>synchronize_rcu()</tt> during
scheduler initialization for <tt>CONFIG_PREEMPT=n</tt>
kernels?
How can RCU possibly handle grace periods before all of its
kthreads have been spawned???
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
In <tt>CONFIG_PREEMPT=n</tt> kernel, <tt>synchronize_rcu()</tt>
maps directly to <tt>synchronize_sched()</tt>.
Therefore, <tt>synchronize_rcu()</tt> works normally throughout
boot in <tt>CONFIG_PREEMPT=n</tt> kernels.
However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels,
so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt>
during scheduler initialization.
Very carefully!
</font>
<p><font color="ffffff">
During the &ldquo;dead zone&rdquo; between the time that the
scheduler spawns the first task and the time that all of RCU's
kthreads have been spawned, all synchronous grace periods are
handled by the expedited grace-period mechanism.
At runtime, this expedited mechanism relies on workqueues, but
during the dead zone the requesting task itself drives the
desired expedited grace period.
Because dead-zone execution takes place within task context,
everything works.
Once the dead zone ends, expedited grace periods go back to
using workqueues, as is required to avoid problems that would
otherwise occur when a user task received a POSIX signal while
driving an expedited grace period.
</font>
<p><font color="ffffff">
And yes, this does mean that it is unhelpful to send POSIX
signals to random tasks between the time that the scheduler
spawns its first kthread and the time that RCU's kthreads
have all been spawned.
If there ever turns out to be a good reason for sending POSIX
signals during that time, appropriate adjustments will be made.
(If it turns out that POSIX signals are sent during this time for
no good reason, other adjustments will be made, appropriate
or otherwise.)
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
......@@ -2295,12 +2311,61 @@ situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
The need for <tt>rcu_barrier()</tt> for module unloading became
apparent later.
<p>
<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not,
repeat, <i>not</i>, obligated to wait for a grace period.
It is instead only required to wait for RCU callbacks that have
already been posted.
Therefore, if there are no RCU callbacks posted anywhere in the system,
<tt>rcu_barrier()</tt> is within its rights to return immediately.
Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not
necessarily need to wait for a grace period.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
Wait a minute!
Each RCU callbacks must wait for a grace period to complete,
and <tt>rcu_barrier()</tt> must wait for each pre-existing
callback to be invoked.
Doesn't <tt>rcu_barrier()</tt> therefore need to wait for
a full grace period if there is even one callback posted anywhere
in the system?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
Absolutely not!!!
</font>
<p><font color="ffffff">
Yes, each RCU callbacks must wait for a grace period to complete,
but it might well be partly (or even completely) finished waiting
by the time <tt>rcu_barrier()</tt> is invoked.
In that case, <tt>rcu_barrier()</tt> need only wait for the
remaining portion of the grace period to elapse.
So even if there are quite a few callbacks posted,
<tt>rcu_barrier()</tt> might well return quite quickly.
</font>
<p><font color="ffffff">
So if you need to wait for a grace period as well as for all
pre-existing callbacks, you will need to invoke both
<tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>.
If latency is a concern, you can always use workqueues
to invoke them concurrently.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
<p>
The Linux kernel supports CPU hotplug, which means that CPUs
can come and go.
It is of course illegal to use any RCU API member from an offline CPU.
It is of course illegal to use any RCU API member from an offline CPU,
with the exception of <a href="#Sleepable RCU">SRCU</a> read-side
critical sections.
This requirement was present from day one in DYNIX/ptx, but
on the other hand, the Linux kernel's CPU-hotplug implementation
is &ldquo;interesting.&rdquo;
......@@ -2310,19 +2375,18 @@ The Linux-kernel CPU-hotplug implementation has notifiers that
are used to allow the various kernel subsystems (including RCU)
to respond appropriately to a given CPU-hotplug operation.
Most RCU operations may be invoked from CPU-hotplug notifiers,
including even normal synchronous grace-period operations
such as <tt>synchronize_rcu()</tt>.
However, expedited grace-period operations such as
<tt>synchronize_rcu_expedited()</tt> are not supported,
due to the fact that current implementations block CPU-hotplug
operations, which could result in deadlock.
including even synchronous grace-period operations such as
<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>.
<p>
In addition, all-callback-wait operations such as
However, all-callback-wait operations such as
<tt>rcu_barrier()</tt> are also not supported, due to the
fact that there are phases of CPU-hotplug operations where
the outgoing CPU's callbacks will not be invoked until after
the CPU-hotplug operation ends, which could also result in deadlock.
Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations
during its execution, which results in another type of deadlock
when invoked from a CPU-hotplug notifier.
<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
......@@ -2863,6 +2927,27 @@ It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt>
API, which, in combination with <tt>srcu_read_unlock()</tt>,
guarantees a full memory barrier.
<p>
Also unlike other RCU flavors, SRCU's callbacks-wait function
<tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers,
though this is not necessarily a good idea.
The reason that this is possible is that SRCU is insensitive
to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt>
need not exclude CPU-hotplug operations.
<p>
As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
a locking bottleneck present in prior kernel versions.
Although this will allow users to put much heavier stress on
<tt>call_srcu()</tt>, it is important to note that SRCU does not
yet take any special steps to deal with callback flooding.
So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
you are probably totally OK, but if you intend to post (say) 1,000,000
SRCU callbacks per second per CPU, please run some tests first.
SRCU just might need a few adjustment to deal with that sort of load.
Of course, your mileage may vary based on the speed of your CPUs and
the size of your memory.
<p>
The
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
......@@ -3021,8 +3106,8 @@ to do some redesign to avoid this scalability problem.
<p>
RCU disables CPU hotplug in a few places, perhaps most notably in the
expedited grace-period and <tt>rcu_barrier()</tt> operations.
If there is a strong reason to use expedited grace periods in CPU-hotplug
<tt>rcu_barrier()</tt> operations.
If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug
notifiers, it will be necessary to avoid disabling CPU hotplug.
This would introduce some complexity, so there had better be a <i>very</i>
good reason.
......@@ -3096,9 +3181,5 @@ Andy Lutomirski for their help in rendering
this article human readable, and to Michelle Rankin for her support
of this effort.
Other contributions are acknowledged in the Linux kernel's git archive.
The cartoon is copyright (c) 2013 by Melissa Broussard,
and is provided
under the terms of the Creative Commons Attribution-Share Alike 3.0
United States license.
</body></html>
......@@ -138,6 +138,15 @@ o Be very careful about comparing pointers obtained from
This sort of comparison occurs frequently when scanning
RCU-protected circular linked lists.
Note that if checks for being within an RCU read-side
critical section are not required and the pointer is never
dereferenced, rcu_access_pointer() should be used in place
of rcu_dereference(). The rcu_access_pointer() primitive
does not require an enclosing read-side critical section,
and also omits the smp_read_barrier_depends() included in
rcu_dereference(), which in turn should provide a small
performance gain in some CPUs (e.g., the DEC Alpha).
o The comparison is against a pointer that references memory
that was initialized "a long time ago." The reason
this is safe is that even if misordering occurs, the
......
Using hlist_nulls to protect read-mostly linked lists and
objects using SLAB_DESTROY_BY_RCU allocations.
objects using SLAB_TYPESAFE_BY_RCU allocations.
Please read the basics in Documentation/RCU/listRCU.txt
......@@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way
to solve following problem :
A typical RCU linked list managing objects which are
allocated with SLAB_DESTROY_BY_RCU kmem_cache can
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :
1) Lookup algo
......@@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock()
3) Remove algo
--------------
Nothing special here, we can use a standard RCU hlist deletion.
But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)
if (put_last_reference_on(obj) {
......
Using RCU's CPU Stall Detector
The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
detector, which detects conditions that unduly delay RCU grace periods.
This module parameter enables CPU stall detection by default, but
may be overridden via boot-time parameter or at runtime via sysfs.
This document first discusses what sorts of issues RCU's CPU stall
detector can locate, and then discusses kernel parameters and Kconfig
options that can be used to fine-tune the detector's operation. Finally,
this document explains the stall detector's "splat" format.
What Causes RCU CPU Stall Warnings?
So your kernel printed an RCU CPU stall warning. The next question is
"What caused it?" The following problems can result in RCU CPU stall
warnings:
o A CPU looping in an RCU read-side critical section.
o A CPU looping with interrupts disabled.
o A CPU looping with preemption disabled. This condition can
result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
stalls.
o A CPU looping with bottom halves disabled. This condition can
result in RCU-sched and RCU-bh stalls.
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
kernel without invoking schedule(). Note that cond_resched()
does not necessarily prevent RCU CPU stall warnings. Therefore,
if the looping in the kernel is really expected and desirable
behavior, you might need to replace some of the cond_resched()
calls with calls to cond_resched_rcu_qs().
o Booting Linux using a console connection that is too slow to
keep up with the boot-time console-message rate. For example,
a 115Kbaud serial console can be -way- too slow to keep up
with boot-time message rates, and will frequently result in
RCU CPU stall warning messages. Especially if you have added
debug printk()s.
o Anything that prevents RCU's grace-period kthreads from running.
This can result in the "All QSes seen" console-log message.
This message will include information on when the kthread last
ran and how often it should be expected to run.
o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
happen to preempt a low-priority task in the middle of an RCU
read-side critical section. This is especially damaging if
that low-priority task is not permitted to run on any other CPU,
in which case the next RCU grace period can never complete, which
will eventually cause the system to run out of memory and hang.
While the system is in the process of running itself out of
memory, you might see stall-warning messages.
o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
is running at a higher priority than the RCU softirq threads.
This will prevent RCU callbacks from ever being invoked,
and in a CONFIG_PREEMPT_RCU kernel will further prevent
RCU grace periods from ever completing. Either way, the
system will eventually run out of memory and hang. In the
CONFIG_PREEMPT_RCU case, you might see stall-warning
messages.
o A hardware or software issue shuts off the scheduler-clock
interrupt on a CPU that is not in dyntick-idle mode. This
problem really has happened, and seems to be most likely to
result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
o A bug in the RCU implementation.
o A hardware failure. This is quite unlikely, but has occurred
at least once in real life. A CPU failed in a running system,
becoming unresponsive, but not causing an immediate crash.
This resulted in a series of RCU CPU stall warnings, eventually
leading the realization that the CPU had failed.
The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
warning. Note that SRCU does -not- have CPU stall warnings. Please note
that RCU only detects CPU stalls when there is a grace period in progress.
No grace period, no CPU stall warnings.
To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.
If you have a series of stall warnings from a single extended stall,
comparing the stack traces can often help determine where the stall
is occurring, which will usually be in the function nearest the top of
that portion of the stack which remains the same from trace to trace.
If you can reliably trigger the stall, ftrace can be quite helpful.
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
and with RCU's event tracing. For information on RCU's event tracing,
see include/trace/events/rcu.h.
Fine-Tuning the RCU CPU Stall Detector
The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's
CPU stall detector, which detects conditions that unduly delay RCU grace
periods. This module parameter enables CPU stall detection by default,
but may be overridden via boot-time parameter or at runtime via sysfs.
The stall detector's idea of what constitutes "unduly delayed" is
controlled by a set of kernel configuration variables and cpp macros:
......@@ -56,6 +149,9 @@ rcupdate.rcu_task_stall_timeout
And continues with the output of sched_show_task() for each
task stalling the current RCU-tasks grace period.
Interpreting RCU's CPU Stall-Detector "Splats"
For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
it will print a message similar to the following:
......@@ -178,89 +274,3 @@ grace period is in flight.
It is entirely possible to see stall warnings from normal and from
expedited grace periods at about the same time from the same run.
What Causes RCU CPU Stall Warnings?
So your kernel printed an RCU CPU stall warning. The next question is
"What caused it?" The following problems can result in RCU CPU stall
warnings:
o A CPU looping in an RCU read-side critical section.
o A CPU looping with interrupts disabled. This condition can
result in RCU-sched and RCU-bh stalls.
o A CPU looping with preemption disabled. This condition can
result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
stalls.
o A CPU looping with bottom halves disabled. This condition can
result in RCU-sched and RCU-bh stalls.
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
kernel without invoking schedule(). Note that cond_resched()
does not necessarily prevent RCU CPU stall warnings. Therefore,
if the looping in the kernel is really expected and desirable
behavior, you might need to replace some of the cond_resched()
calls with calls to cond_resched_rcu_qs().
o Booting Linux using a console connection that is too slow to
keep up with the boot-time console-message rate. For example,
a 115Kbaud serial console can be -way- too slow to keep up
with boot-time message rates, and will frequently result in
RCU CPU stall warning messages. Especially if you have added
debug printk()s.
o Anything that prevents RCU's grace-period kthreads from running.
This can result in the "All QSes seen" console-log message.
This message will include information on when the kthread last
ran and how often it should be expected to run.
o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
happen to preempt a low-priority task in the middle of an RCU
read-side critical section. This is especially damaging if
that low-priority task is not permitted to run on any other CPU,
in which case the next RCU grace period can never complete, which
will eventually cause the system to run out of memory and hang.
While the system is in the process of running itself out of
memory, you might see stall-warning messages.
o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
is running at a higher priority than the RCU softirq threads.
This will prevent RCU callbacks from ever being invoked,
and in a CONFIG_PREEMPT_RCU kernel will further prevent
RCU grace periods from ever completing. Either way, the
system will eventually run out of memory and hang. In the
CONFIG_PREEMPT_RCU case, you might see stall-warning
messages.
o A hardware or software issue shuts off the scheduler-clock
interrupt on a CPU that is not in dyntick-idle mode. This
problem really has happened, and seems to be most likely to
result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
o A bug in the RCU implementation.
o A hardware failure. This is quite unlikely, but has occurred
at least once in real life. A CPU failed in a running system,
becoming unresponsive, but not causing an immediate crash.
This resulted in a series of RCU CPU stall warnings, eventually
leading the realization that the CPU had failed.
The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
warning. Note that SRCU does -not- have CPU stall warnings. Please note
that RCU only detects CPU stalls when there is a grace period in progress.
No grace period, no CPU stall warnings.
To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.
If you have a series of stall warnings from a single extended stall,
comparing the stack traces can often help determine where the stall
is occurring, which will usually be in the function nearest the top of
that portion of the stack which remains the same from trace to trace.
If you can reliably trigger the stall, ftrace can be quite helpful.
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
and with RCU's event tracing. For information on RCU's event tracing,
see include/trace/events/rcu.h.
......@@ -562,7 +562,9 @@ This section presents a "toy" RCU implementation that is based on
familiar locking primitives. Its overhead makes it a non-starter for
real-life use, as does its lack of scalability. It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
one read-side critical section to another.
one read-side critical section to another. It also assumes recursive
reader-writer locks: If you try this with non-recursive locks, and
you allow nested rcu_read_lock() calls, you can deadlock.
However, it is probably the easiest implementation to relate to, so is
a good starting point.
......@@ -587,20 +589,21 @@ It is extremely simple:
write_unlock(&rcu_gp_mutex);
}
[You can ignore rcu_assign_pointer() and rcu_dereference() without
missing much. But here they are anyway. And whatever you do, don't
forget about them when submitting patches making use of RCU!]
[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
much. But here are simplified versions anyway. And whatever you do,
don't forget about them when submitting patches making use of RCU!]
#define rcu_assign_pointer(p, v) ({ \
smp_wmb(); \
(p) = (v); \
})
#define rcu_assign_pointer(p, v) \
({ \
smp_store_release(&(p), (v)); \
})
#define rcu_dereference(p) ({ \
typeof(p) _________p1 = p; \
smp_read_barrier_depends(); \
(_________p1); \
})
#define rcu_dereference(p) \
({ \
typeof(p) _________p1 = p; \
smp_read_barrier_depends(); \
(_________p1); \
})
The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
......@@ -925,7 +928,8 @@ d. Do you need RCU grace periods to complete even in the face
e. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
named SLAB_DESTROY_BY_RCU). But please be careful!
f. Do you need read-side critical sections that are respected
even though they are in the middle of the idle loop, during
......
......@@ -768,7 +768,7 @@ equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
q = READ_ONCE(a);
WRITE_ONCE(b, 1);
WRITE_ONCE(b, 2);
do_something_else();
Given this transformation, the CPU is not required to respect the ordering
......
......@@ -320,6 +320,9 @@ config HAVE_CMPXCHG_LOCAL
config HAVE_CMPXCHG_DOUBLE
bool
config ARCH_WEAK_RELEASE_ACQUIRE
bool
config ARCH_WANT_IPC_PARSE_VERSION
bool
......
......@@ -99,6 +99,7 @@ config PPC
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_EXTABLE_SORT
select CLONE_BACKWARDS
......
......@@ -4665,7 +4665,7 @@ i915_gem_load_init(struct drm_i915_private *dev_priv)
dev_priv->requests = KMEM_CACHE(drm_i915_gem_request,
SLAB_HWCACHE_ALIGN |
SLAB_RECLAIM_ACCOUNT |
SLAB_DESTROY_BY_RCU);
SLAB_TYPESAFE_BY_RCU);
if (!dev_priv->requests)
goto err_vmas;
......
......@@ -493,7 +493,7 @@ static inline struct drm_i915_gem_request *
__i915_gem_active_get_rcu(const struct i915_gem_active *active)
{
/* Performing a lockless retrieval of the active request is super
* tricky. SLAB_DESTROY_BY_RCU merely guarantees that the backing
* tricky. SLAB_TYPESAFE_BY_RCU merely guarantees that the backing
* slab of request objects will not be freed whilst we hold the
* RCU read lock. It does not guarantee that the request itself
* will not be freed and then *reused*. Viz,
......
......@@ -1071,7 +1071,7 @@ int ldlm_init(void)
ldlm_lock_slab = kmem_cache_create("ldlm_locks",
sizeof(struct ldlm_lock), 0,
SLAB_HWCACHE_ALIGN |
SLAB_DESTROY_BY_RCU, NULL);
SLAB_TYPESAFE_BY_RCU, NULL);
if (!ldlm_lock_slab) {
kmem_cache_destroy(ldlm_resource_slab);
return -ENOMEM;
......
......@@ -2340,7 +2340,7 @@ static int jbd2_journal_init_journal_head_cache(void)
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0, /* offset */
SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
SLAB_TEMPORARY | SLAB_TYPESAFE_BY_RCU,
NULL); /* ctor */
retval = 0;
if (!jbd2_journal_head_cache) {
......
......@@ -38,7 +38,7 @@ void signalfd_cleanup(struct sighand_struct *sighand)
/*
* The lockless check can race with remove_wait_queue() in progress,
* but in this case its caller should run under rcu_read_lock() and
* sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
* sighand_cachep is SLAB_TYPESAFE_BY_RCU, we can safely return.
*/
if (likely(!waitqueue_active(wqh)))
return;
......
......@@ -229,7 +229,7 @@ static inline struct dma_fence *dma_fence_get_rcu(struct dma_fence *fence)
*
* Function returns NULL if no refcount could be obtained, or the fence.
* This function handles acquiring a reference to a fence that may be
* reallocated within the RCU grace period (such as with SLAB_DESTROY_BY_RCU),
* reallocated within the RCU grace period (such as with SLAB_TYPESAFE_BY_RCU),
* so long as the caller is using RCU on the pointer to the fence.
*
* An alternative mechanism is to employ a seqlock to protect a bunch of
......@@ -257,7 +257,7 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep)
* have successfully acquire a reference to it. If it no
* longer matches, we are holding a reference to some other
* reallocated pointer. This is possible if the allocator
* is using a freelist like SLAB_DESTROY_BY_RCU where the
* is using a freelist like SLAB_TYPESAFE_BY_RCU where the
* fence remains valid for the RCU grace period, but it
* may be reallocated. When using such allocators, we are
* responsible for ensuring the reference we get is to
......
......@@ -375,8 +375,6 @@ struct kvm {
struct mutex slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots *memslots[KVM_ADDRESS_SPACE_NUM];
struct srcu_struct srcu;
struct srcu_struct irq_srcu;
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
/*
......@@ -429,6 +427,8 @@ struct kvm {
struct list_head devices;
struct dentry *debugfs_dentry;
struct kvm_stat_data **debugfs_stat_data;
struct srcu_struct srcu;
struct srcu_struct irq_srcu;
};
#define kvm_err(fmt, ...) \
......
/*
* RCU node combining tree definitions. These are used to compute
* global attributes while avoiding common-case global contention. A key
* property that these computations rely on is a tournament-style approach
* where only one of the tasks contending a lower level in the tree need
* advance to the next higher level. If properly configured, this allows
* unlimited scalability while maintaining a constant level of contention
* on the root node.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2017
*
* Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*/
#ifndef __LINUX_RCU_NODE_TREE_H
#define __LINUX_RCU_NODE_TREE_H
/*
* Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
* CONFIG_RCU_FANOUT_LEAF.
* In theory, it should be possible to add more levels straightforwardly.
* In practice, this did work well going from three levels to four.
* Of course, your mileage may vary.
*/
#ifdef CONFIG_RCU_FANOUT
#define RCU_FANOUT CONFIG_RCU_FANOUT
#else /* #ifdef CONFIG_RCU_FANOUT */
# ifdef CONFIG_64BIT
# define RCU_FANOUT 64
# else
# define RCU_FANOUT 32
# endif
#endif /* #else #ifdef CONFIG_RCU_FANOUT */
#ifdef CONFIG_RCU_FANOUT_LEAF
#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
#else /* #ifdef CONFIG_RCU_FANOUT_LEAF */
#define RCU_FANOUT_LEAF 16
#endif /* #else #ifdef CONFIG_RCU_FANOUT_LEAF */
#define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
#define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
#define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
#define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
#if NR_CPUS <= RCU_FANOUT_1
# define RCU_NUM_LVLS 1
# define NUM_RCU_LVL_0 1
# define NUM_RCU_NODES NUM_RCU_LVL_0
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
# define RCU_NODE_NAME_INIT { "rcu_node_0" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
#elif NR_CPUS <= RCU_FANOUT_2
# define RCU_NUM_LVLS 2
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
#elif NR_CPUS <= RCU_FANOUT_3
# define RCU_NUM_LVLS 3
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
#elif NR_CPUS <= RCU_FANOUT_4
# define RCU_NUM_LVLS 4
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
# define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT_1 */
#endif /* __LINUX_RCU_NODE_TREE_H */
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment