rcu_init_levelspread()
(If you just want the answer without the story of how I found and fixed the bug, scroll down to the bottom of this page.)
To recap, I recently refactored my rcutorture tests, getting that
item off of my todo list after “only” three or four years.
To my surprise, one (but only one) of the new rcutorture test scenarios failed, namely the one creatively named TREE08:
CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_FAST_NO_HZ=n
CONFIG_RCU_TRACE=n
CONFIG_HOTPLUG_CPU=n
CONFIG_SUSPEND=n
CONFIG_HIBERNATION=n
CONFIG_RCU_FANOUT=3
CONFIG_RCU_FANOUT_EXACT=y
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_PROVE_RCU_DELAY=n
CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_CPU_STALL_VERBOSE=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_PRINTK_TIME=y
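(For those wanting to reproduce this, rcutorture scenarios are normally run via the scripting in the kernel source tree. Something along the following lines runs TREE08, though the exact flags vary somewhat from one kernel version to the next:)

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 16 --duration 30 --configs TREE08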
One reason for my surprise was that I had been running similar
rcutorture scenarios for some years.
In sharp (and happy) contrast to
previous efforts
featuring but a handful of failures in a given ten-hour run,
the effect was not subtle.
In fact, TREE08
often failed more
than 20 times per minute.
My first thought was that I had blown a recent commit, but testing earlier kernel versions gave similar failure rates.
Additional testing and analysis showed that the bug was quite sensitive
to the exact values of the Kconfig parameters.
For example, setting CONFIG_RCU_FANOUT=2
or
CONFIG_RCU_FANOUT=4
resulted in rock-solid behavior,
despite CONFIG_RCU_FANOUT=3
failing miserably.
Similarly, setting CONFIG_RCU_FANOUT_EXACT=n
(which is thankfully the default) also resulted in rock-solid
behavior.
This fortunately provides not only a workaround for the
bug, but, even better, a workaround that is enabled by default.
This enabled-by-default workaround did much to explain the total lack
of bug reports.
This sensitivity to a few Kconfig parameters implicated the
boot-time code that sets up the rcu_node
structures.
I therefore ran this boot-time setup code in usermode scaffolding,
resulting in the following output:
NR_CPUS = 16, CONFIG_RCU_FANOUT = 3, CONFIG_RCU_FANOUT_LEAF = 2, MAX_RCU_LVLS = 4
rcu_fanout_leaf = 16, nr_cpu_ids = 3
NR_CPUS: 16, RCU_NUM_LVLS: 3, rcu_num_lvls: 3, rcu_num_nodes: 12
num_rcu_lvl[0]: 1
num_rcu_lvl[1]: 3
num_rcu_lvl[2]: 8
num_rcu_lvl[3]: 16
num_rcu_lvl[4]: 0
rsp->levelspread[0]: 2
rsp->levelspread[1]: 3
rsp->levelspread[2]: 3
CPU 0: rdp=0x804c420 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 1: rdp=0x804c4cc rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 2: rdp=0x804c578 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 3: rdp=0x804c624 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 4: rdp=0x804c6d0 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 5: rdp=0x804c77c rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 6: rdp=0x804c828 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 7: rdp=0x804c8d4 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 8: rdp=0x804c980 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 9: rdp=0x804ca2c rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 10: rdp=0x804cad8 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 11: rdp=0x804cb84 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 12: rdp=0x804cc30 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 13: rdp=0x804ccdc rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 14: rdp=0x804cd88 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 15: rdp=0x804ce34 rnp=0x804b2e8 (15-15) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
The rdp= number gives each CPU's rcu_data structure address, and the sequence of rnp= numbers shows the rcu_node structures encountered when traversing upwards from the corresponding CPU to the root of the rcu_node tree. The numbers in parentheses show which CPUs are handled by each rcu_node structure. The initial lines simply show the values of a few key boot-time macros, variables, and arrays.
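As an aside, the per-level node counts in the num_rcu_lvl[] output follow from simple ceiling divisions, working upwards from the CPUs. The following usermode sketch (my own throwaway illustration, not kernel code) reproduces the progression for TREE08's parameters:

#include <stdio.h>

int main(void)
{
	int nr_cpus = 16;	/* TREE08's CONFIG_NR_CPUS */
	int fanout_leaf = 2;	/* TREE08's CONFIG_RCU_FANOUT_LEAF */
	int fanout = 3;		/* TREE08's CONFIG_RCU_FANOUT */
	int n = nr_cpus;
	int div = fanout_leaf;	/* the leaf level divides by the leaf fanout */

	while (n > 1) {
		n = (n + div - 1) / div;	/* ceiling division */
		printf("%d nodes at this level\n", n);
		div = fanout;	/* higher levels divide by CONFIG_RCU_FANOUT */
	}
	return 0;
}

This prints 8, 3, and 1, which is num_rcu_lvl[] read from the leaves upwards, the 16 CPUs themselves appearing as num_rcu_lvl[3].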
Had I been paying attention, I would have noticed that this output shows two very odd things, one of them called out twice. However, some of my assumptions blinded me to these warning signs.
I therefore enabled RCU event tracing, with the usual false start where I failed to build the kernel with CONFIG_RCU_TRACE=y. This resulted in a 20MB trace, roughly 15MB of which was required to show the entire grace-period dance, starting from the callback being registered and ending with rcutorture detecting an RCU bug.
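(For reference, once the kernel is built with CONFIG_RCU_TRACE=y, the RCU events can be enabled through debugfs along the following lines, though rcutorture runs can instead pass a trace_event= kernel boot parameter, for example via kvm.sh's --bootargs argument:)

echo 1 > /sys/kernel/debug/tracing/events/rcu/enable
cat /sys/kernel/debug/tracing/trace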
The following entries caught my eye:
rcu_pree-7    0d..2  5852674733: rcu_grace_period_init: rcu_preempt 18446744073709551553 0 0 15 3
rcu_pree-7    0d..2  5852674946: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 0 8 7
rcu_pree-7    0d..2  5852675138: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 9 15 7
rcu_pree-7    0d..2  5852675313: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 18 15 0
rcu_pree-7    0d..2  5852675631: rcu_grace_period: rcu_preempt 18446744073709551553 cpustart
rcu_pree-7    0d..2  5852675802: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 0 2 7
rcu_pree-7    0d..2  5852675974: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 3 5 7
rcu_pree-7    0d..2  5852676148: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 6 8 7
rcu_pree-7    0d..2  5852676321: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 9 11 7
rcu_pree-7    0d..2  5852676519: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 12 14 7
rcu_pree-7    0d..2  5852676691: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 15 15 1
rcu_pree-7    0d..2  5852676862: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 18 15 0
rcu_pree-7    0d..2  5852677034: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 21 15 0
The smoking gun is the fourth line, the one containing the “18 15”. In these events, the numbers following the grace-period number are the rcu_node structure's level, its lowest-numbered CPU (grplo), its highest-numbered CPU (grphi), and its qsmask. This line therefore indicates that the rcu_node structure corresponding to this trace event covers all CPUs whose numbers (as listed by smp_processor_id()) are greater than or equal to 18 and also less than or equal to 15. There are of course no such CPUs. The last two lines show similar brokenness.
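These impossible ranges are a geometric clue: with eight leaf-level rcu_node structures allocated (num_rcu_lvl[2] above) but a leaf-level spread of three (rsp->levelspread[2] above), the later leaves land entirely past the last CPU. A quick reconstruction (my own arithmetic, not kernel code) shows how:

#include <stdio.h>

int main(void)
{
	int nleaves = 8;	/* num_rcu_lvl[2] from the scaffolding output */
	int spread = 3;		/* the erroneous leaf-level levelspread */
	int nr_cpu_ids = 16;
	int i;

	for (i = 0; i < nleaves; i++) {
		int grplo = i * spread;		/* first CPU covered */
		int grphi = grplo + spread - 1;	/* last CPU covered... */

		if (grphi >= nr_cpu_ids)
			grphi = nr_cpu_ids - 1;	/* ...clipped to the last CPU */
		printf("leaf %d: grplo=%d grphi=%d\n", i, grplo, grphi);
	}
	return 0;
}

Leaves 6 and 7 come out as 18-15 and 21-15, exactly matching the bogus trace events above.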
These trace events narrowed the bug down to the following code fragment in kernel/rcu/tree.c (kernel/rcutree.c in older kernels):
 1 #ifdef CONFIG_RCU_FANOUT_EXACT
 2 static void __init rcu_init_levelspread(struct rcu_state *rsp)
 3 {
 4 	int i;
 5 
 6 	for (i = rcu_num_lvls - 1; i > 0; i--)
 7 		rsp->levelspread[i] = CONFIG_RCU_FANOUT;
 8 	rsp->levelspread[0] = rcu_fanout_leaf;
 9 }
10 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
11 static void __init rcu_init_levelspread(struct rcu_state *rsp)
12 {
13 	int ccur;
14 	int cprv;
15 	int i;
16 
17 	cprv = nr_cpu_ids;
18 	for (i = rcu_num_lvls - 1; i >= 0; i--) {
19 		ccur = rsp->levelcnt[i];
20 		rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
21 		cprv = ccur;
22 	}
23 }
24 #endif /* #else #ifdef CONFIG_RCU_FANOUT_EXACT */
The problem is in lines 6-8, which incorrectly set the zeroth element of the levelspread[] array from rcu_fanout_leaf and the rest of the array from CONFIG_RCU_FANOUT. Because the zeroth element corresponds to the root of the rcu_node tree and the last element to its leaves, it is instead the last element of the array that should be set from rcu_fanout_leaf, with the rest of the array set from CONFIG_RCU_FANOUT.
The required patch is as follows:
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e0a58eca0092..13d1a1a0d60a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3246,9 +3246,9 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp)
 {
 	int i;
 
-	for (i = rcu_num_lvls - 1; i > 0; i--)
+	rsp->levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
+	for (i = rcu_num_lvls - 2; i >= 0; i--)
 		rsp->levelspread[i] = CONFIG_RCU_FANOUT;
-	rsp->levelspread[0] = rcu_fanout_leaf;
 }
 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
 static void __init rcu_init_levelspread(struct rcu_state *rsp)
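As a quick sanity check, the following usermode sketch (with stub values standing in for TREE08's configuration) contrasts the levelspread[] arrays produced before and after the fix:

#include <stdio.h>

#define CONFIG_RCU_FANOUT 3	/* TREE08's value */

static int rcu_num_lvls = 3;	/* three-level tree for this configuration */
static int rcu_fanout_leaf = 2;	/* TREE08's CONFIG_RCU_FANOUT_LEAF */

/* Buggy variant: the leaf fanout lands at the root. */
static void levelspread_buggy(int *levelspread)
{
	int i;

	for (i = rcu_num_lvls - 1; i > 0; i--)
		levelspread[i] = CONFIG_RCU_FANOUT;
	levelspread[0] = rcu_fanout_leaf;
}

/* Fixed variant: the leaf fanout lands at the leaf level. */
static void levelspread_fixed(int *levelspread)
{
	int i;

	levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
	for (i = rcu_num_lvls - 2; i >= 0; i--)
		levelspread[i] = CONFIG_RCU_FANOUT;
}

int main(void)
{
	int buggy[3], fixed[3], i;

	levelspread_buggy(buggy);
	levelspread_fixed(fixed);
	for (i = 0; i < rcu_num_lvls; i++)
		printf("level %d: buggy %d fixed %d\n", i, buggy[i], fixed[i]);
	return 0;	/* buggy: 2 3 3; fixed: 3 3 2 */
}

Note that the buggy 2 3 3 matches the rsp->levelspread[] lines in the scaffolding output above, which with hindsight was one of the warning signs I failed to notice.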
One could argue that this bug is so obscure that it can be safely ignored. However, my experience is that amazing depths of obscurity can vanish quite suddenly, with no advance warning.
So one moral of this story is that it is really important to tend your test cases with just as much care as you tend your test code, which in turn needs at least as much attention as does the code making up the actual implementation.
Another moral of this story is that, Star Wars
and the benefits of a well-trained intuition notwithstanding,
you should not trust your own feelings all that much.
In fact, I had been considering dropping the more-complex CONFIG_RCU_FANOUT_EXACT=n case entirely, which would have made the simpler CONFIG_RCU_FANOUT_EXACT=y case not just the default, but the only possible choice.
Needless to say, before this bug fix, that would have been a really
bad decision!
In short, you should verify your code early, often, and as intensively as you possibly can. This is not simply a case of “trust but verify,” but rather a case of “verify first and trust later, if at all.”
Perhaps this should be termed the Indiana Jones philosophy of validation.