From: Dinakar G. <di...@in...> - 2005-05-11 19:05:37
Attachments:
cpusets-hot-preempt.patch
|
cpusets+hotplug+preempt hangs up the machine if both cpuset and hotplug operations are going on simultaneously. I also get this oops first:

scheduling while atomic: cpuoff.sh/0x00000001/25331
 [<c040195d>] schedule+0xa4d/0xa60
 [<c0401b25>] wait_for_completion+0xc5/0xf0
 [<c011a200>] default_wake_function+0x0/0x20
 [<c0400cd3>] __down+0x83/0x110
 [<c011a200>] default_wake_function+0x0/0x20
 [<c0400eab>] __down_failed+0x7/0xc
 [<c0141fe7>] .text.lock.cpuset+0x105/0x15e
 [<c011b860>] move_task_off_dead_cpu+0x130/0x1f0
 [<c011ba7c>] migrate_live_tasks+0x8c/0x90
 [<c011be25>] migration_call+0x75/0x2c0
 [<c01423f2>] __stop_machine_run+0x92/0xb0
 [<c012e0dd>] notifier_call_chain+0x2d/0x50
 [<c013a6fb>] cpu_down+0x16b/0x2a0
 [<c027ddfb>] store_online+0x5b/0x80
 [<c027ae15>] sysdev_store+0x35/0x40
 [<c019f43e>] flush_write_buffer+0x3e/0x50
 [<c019f4a8>] sysfs_write_file+0x58/0x80
 [<c0163d26>] vfs_write+0xc6/0x180
 [<c0163eb1>] sys_write+0x51/0x80
 [<c01034e5>] syscall_call+0x7/0xb

It looks like hotplug operations need to take cpuset_sem as well.

Here is a patch against 2.6.12-rc2-mm3. It should apply against rc3-mm3 as well. This has been tested and seems to fix the problem.

-Dinakar

Signed-off-by: Dinakar Guniguntala <di...@in...> |
From: Paul J. <pj...@sg...> - 2005-05-11 19:26:02
|
Dinakar wrote:
> It looks like hotplug operations need to take cpuset_sem as well.

Could you explain why this is -- what is the deadlock? I don't doubt you found a deadlock, but since I don't understand it yet, it is difficult for me to judge whether this is the best fix.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Paul J. <pj...@sg...> - 2005-05-11 19:55:21
|
pj wrote:
> Could you explain why this is -- what is the deadlock?

On further reading, I see it. You're right.

Deep in the bowels of the hotplug code, when removing a dead cpu, while holding the runqueue lock (task_rq_lock), the hotplug code might need to walk up the cpuset hierarchy, to find the nearest enclosing cpuset that still has online cpus, as part of figuring out where to run a task that is being kicked off the dead cpu.

The runqueue lock is atomic, but acquiring cpuset_sem (needed to walk up the cpuset hierarchy) can sleep. So you need to take cpuset_sem before calling task_rq_lock, to avoid the "scheduling while atomic" oops that you reported.

Therefore the hotplug code, and anyone else calling cpuset_cpus_allowed() -- which means sched_setaffinity() -- needs to grab cpuset_sem first, so that any atomic locks are taken after getting cpuset_sem, not before.

Acked-by: Paul Jackson <pj...@sg...>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Nathan L. <nt...@po...> - 2005-05-11 19:52:12
|
On Thu, May 12, 2005 at 12:46:54AM +0530, Dinakar Guniguntala wrote:
> 
> cpusets+hotplug+preempt hangs up the machine, if both
> cpuset and hotplug operations are going on simultaneously
> 
> I also get this oops first
> 
> scheduling while atomic: cpuoff.sh/0x00000001/25331
>  [<c040195d>] schedule+0xa4d/0xa60
>  [<c0401b25>] wait_for_completion+0xc5/0xf0
>  [<c011a200>] default_wake_function+0x0/0x20
>  [<c0400cd3>] __down+0x83/0x110
>  [<c011a200>] default_wake_function+0x0/0x20
>  [<c0400eab>] __down_failed+0x7/0xc
>  [<c0141fe7>] .text.lock.cpuset+0x105/0x15e
>  [<c011b860>] move_task_off_dead_cpu+0x130/0x1f0
>  [<c011ba7c>] migrate_live_tasks+0x8c/0x90
>  [<c011be25>] migration_call+0x75/0x2c0
>  [<c01423f2>] __stop_machine_run+0x92/0xb0
>  [<c012e0dd>] notifier_call_chain+0x2d/0x50
>  [<c013a6fb>] cpu_down+0x16b/0x2a0
>  [<c027ddfb>] store_online+0x5b/0x80
>  [<c027ae15>] sysdev_store+0x35/0x40
>  [<c019f43e>] flush_write_buffer+0x3e/0x50
>  [<c019f4a8>] sysfs_write_file+0x58/0x80
>  [<c0163d26>] vfs_write+0xc6/0x180
>  [<c0163eb1>] sys_write+0x51/0x80
>  [<c01034e5>] syscall_call+0x7/0xb

This trace is what should be fixed -- we're trying to schedule while the machine is "stopped" (all cpus except for one spinning with interrupts off). I'm not too familiar with the cpusets code, but I would like to stay away from nesting these semaphores if at all possible.

Will you share your testcase, please?

Nathan |
From: Paul J. <pj...@sg...> - 2005-05-11 20:42:49
|
Nathan wrote:
> I'm not too familiar with the cpusets code but I
> would like to stay away from nesting these semaphores if at all
> possible.

I share your preference for not nesting these semaphores.

The other choice I am aware of would be for the hotplug code to be less cpuset-friendly. In the move_task_off_dead_cpu() code, at the point it says "No more Mr. Nice Guy", instead of looking for the nearest enclosing cpuset that has something online, which is what cpuset_cpus_allowed() does, we could just take any damn cpu that was online.

Something along the lines of the following fix:

--- pj/kernel.old/sched.c	2005-05-11 13:00:17.000000000 -0700
+++ pj/kernel.new/sched.c	2005-05-11 13:02:24.000000000 -0700
@@ -4229,7 +4229,7 @@ static void move_task_off_dead_cpu(int d
 
 	/* No more Mr. Nice Guy. */
 	if (dest_cpu == NR_CPUS) {
-		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+		tsk->cpus_allowed = cpu_online_map;
 		dest_cpu = any_online_cpu(tsk->cpus_allowed);
 
 		/*

We've already decided here that we had to violate the cpuset container - as apparently someone hot unplugged every cpu in the current tasks cpuset.

Hmmm ... that's not quite right ... we've decided that we had to violate the current tasks cpus_allowed, as apparently someone hot unplugged every cpu allowed there. This might be a proper subset of the cpus the tasks cpuset allows, perhaps due to a sched_setaffinity() call restricting the task to just one or a few of the cpus that are allowed in its cpuset.

So what we'd really like to do would be to first fall back to all the cpus allowed in the specified tasks cpuset (no walking the cpuset hierarchy), and see if any of those cpus are still online to receive this orphan task. Unless someone has botched the system configuration, and taken offline all the cpus in a cpuset, this should yield up a cpu that is still both allowed and online. If that fails, then to heck with honoring cpuset placement - just take the first online cpu we can find.

This is doable without holding cpuset_sem. We can look at a current tasks cpuset without cpuset_sem, just with the task lock. And this should almost always work (yield up an online cpu that the cpuset allows). And when it doesn't work, we can reasonably blame the system administrator for forcing us to blow out the cpuset confinement.

The following untested, uncompiled patch claims to do this:

--- 2.6.12-rc1-mm4/include/linux/cpuset.h	2005-04-02 15:43:28.000000000 -0800
+++ 2.6.12-rc1-mm4.new/include/linux/cpuset.h	2005-05-11 13:26:10.000000000 -0700
@@ -19,6 +19,7 @@ extern void cpuset_init_smp(void);
 extern void cpuset_fork(struct task_struct *p);
 extern void cpuset_exit(struct task_struct *p);
 extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
+extern const cpumask_t cpuset_task_cpus_allowed(const struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_current_mems_allowed(void);
 void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
@@ -38,6 +39,10 @@ static inline cpumask_t cpuset_cpus_allo
 {
 	return cpu_possible_map;
 }
+static inline cpumask_t cpuset_task_cpus_allowed(struct task_struct *p)
+{
+	return cpu_possible_map;
+}
 
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_current_mems_allowed(void) {}
--- 2.6.12-rc1-mm4/kernel/cpuset.c	2005-04-22 19:35:34.000000000 -0700
+++ 2.6.12-rc1-mm4.new/kernel/cpuset.c	2005-05-11 13:40:05.000000000 -0700
@@ -1570,6 +1570,27 @@ const cpumask_t cpuset_cpus_allowed(cons
 	return mask;
 }
 
+/**
+ * cpuset_task_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
+ *
+ * Description: Returns the cpumask_t cpus_allowed of the cpuset
+ * attached to the specified @tsk.  Unlike cpuset_cpus_allowed(),
+ * is not guaranteed to return a non-empty subset of cpu_online_map.
+ * Does not walk up the cpuset hierarchy, and does not attempt to
+ * acquire the cpuset_sem.  If called on a task about to exit,
+ * where tsk->cpuset is already NULL, return cpu_online_map.
+ *
+ * Call with task locked.
+ **/
+
+const cpumask_t cpuset_task_cpus_allowed(const struct task_struct *tsk)
+{
+	if (!tsk->cpuset)
+		return cpu_online_map;
+	return tsk->cpuset->cpus_allowed;
+}
+
 void cpuset_init_current_mems_allowed(void)
 {
 	current->mems_allowed = NODE_MASK_ALL;
--- 2.6.12-rc1-mm4/kernel/sched.c	2005-04-22 19:51:44.000000000 -0700
+++ 2.6.12-rc1-mm4.new/kernel/sched.c	2005-05-11 13:33:20.000000000 -0700
@@ -4303,7 +4303,7 @@ static void move_task_off_dead_cpu(int d
 
 	/* No more Mr. Nice Guy. */
 	if (dest_cpu == NR_CPUS) {
-		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+		tsk->cpus_allowed = cpuset_task_cpus_allowed(tsk);
 		dest_cpu = any_online_cpu(tsk->cpus_allowed);
 
 		/*

I retract my Ack of Dinakar's patch, in favor of further consideration of this last patch, above.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Paul J. <pj...@sg...> - 2005-05-11 20:58:58
|
Well, that last patch obviously doesn't do all that I said it did - it doesn't do the final fallback to cpu_online_map, if no cpu in tsk->cpuset->cpus_allowed is online. Another variation will be forthcoming soon.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Paul J. <pj...@sg...> - 2005-05-14 02:23:47
|
Two days ago, I wrote:
> Another variation will be forthcoming soon.

Don't apply the following yet, Andrew. It is untested, and we've not yet obtained agreement. I'll squawk again, if this patch survives long enough to warrant inclusion.

===

Ah, to heck with it. This subtle distinction over what level of cpuset we fall back to, when a hot unplug leaves a task with no online cpu in its current allowed set, is not worth it. Every variation I consider is either sufficiently complicated that I can't be sure it's right, or sufficiently simple that it's obviously broken.

Revert the move_task_off_dead_cpu() code to its previous behavior, before cpusets were added. If none of the remaining allowed cpus are online, then let the task run on any cpu, no limit. This is a legal fallback, and indeed one of the possible outcomes of the previous code. It's just not so Nice.

If a system administrator doesn't like a task being allowed to run anywhere as a result of this, then they should clear out a cpuset of the tasks running in it before they take the last cpu in that cpuset offline, and they should use taskset (sched_setaffinity) or other means to ensure that tasks aren't pinned to a cpu that is about to be taken offline.

Unless and until someone can make a good case to the contrary, it is not worth nesting hotplug and cpuset semaphores to attempt to provide a more subtle fallback, which few people would understand anyway.

At least do one thing right - attach the task to the top_cpuset if we have to force its cpus_allowed there. That keeps the tasks apparent cpuset in sync with its cpus_allowed (any online cpu or CPU_MASK_ALL, which are roughly equivalent in this context).

Signed-off-by: Paul Jackson <pj...@sg...>

diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
--- 2.6.12-rc1-mm4.orig/kernel/sched.c	2005-05-13 18:39:54.000000000 -0700
+++ 2.6.12-rc1-mm4/kernel/sched.c	2005-05-13 19:02:49.000000000 -0700
@@ -4301,7 +4301,8 @@ static void move_task_off_dead_cpu(int d
 
 	/* No more Mr. Nice Guy. */
 	if (dest_cpu == NR_CPUS) {
-		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+		cpus_setall(tsk->cpus_allowed);
+		tsk->cpuset = &top_cpuset;
 		dest_cpu = any_online_cpu(tsk->cpus_allowed);
 
 		/*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Nathan L. <nt...@po...> - 2005-05-14 12:14:47
|
Paul Jackson wrote:
> 
> Ah to heck with it.  This subtle distinction over what level of cpuset
> we fall back to when a hot unplug leaves a task with no online cpuset in
> its current allowed set is not worth it.
> 
> Every variation I consider is either sufficiently complicated that I
> can't be sure it's right, or sufficiently simple that it's obviously
> broken.

Heh, I feel your pain.

> diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
> --- 2.6.12-rc1-mm4.orig/kernel/sched.c	2005-05-13 18:39:54.000000000 -0700
> +++ 2.6.12-rc1-mm4/kernel/sched.c	2005-05-13 19:02:49.000000000 -0700
> @@ -4301,7 +4301,8 @@ static void move_task_off_dead_cpu(int d
> 
>  	/* No more Mr. Nice Guy. */
>  	if (dest_cpu == NR_CPUS) {
> -		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
> +		cpus_setall(tsk->cpus_allowed);
> +		tsk->cpuset = &top_cpuset;

Hmm, tsk->cpuset will break the build if !CONFIG_CPUSETS, no? Plus, the task's original cpuset->count will be unbalanced and it will never be released, methinks.

As a short-term (i.e. 2.6.12) solution, would it be terrible to set tsk->cpus_allowed to the default value without messing with the cpuset? That is, is it Bad for a task's cpus_allowed to be a superset of its cpuset's cpus_allowed? I ran Dinakar's test on 2.6.12-rc4 + this proposed "fix" on ppc64 without any crashes or warnings, but I would need you to confirm that this doesn't violate some fundamental cpuset constraint.

However, I think making a best effort to honor the task's cpuset is a reasonable goal in this context. But it will require some nontrivial changes to the code for migrating tasks off the dead cpu, as well as some changes to the cpuset code. Not 2.6.12 material. I'll have a patch soon.

Nathan |
From: Paul J. <pj...@sg...> - 2005-05-14 17:04:38
|
Nathan wrote:
> Hmm, tsk->cpuset will break the build if !CONFIG_CPUSETS, no? Plus,
> the task's original cpuset->count will be unbalanced and it will never
> be released, methinks.

Ah - yup. Well, at least I am keeping my mistakes simple enough that others can spot them quickly ;).

Fixing the tasks cpuset by moving it to top_cpuset, to match its newly extended tsk->cpus_allowed setting, probably requires holding cpuset_sem, which is what we were trying to avoid. Drat.

> As a short-term (i.e. 2.6.12) solution, would it be terrible to set
> tsk->cpus_allowed to the default value without messing with the
> cpuset?

No - not terrible at all. The code that was there before, using cpuset_cpus_allowed() and guarantee_online_cpus(), could do just that: set tsk->cpus_allowed to a superset of tsk->cpuset->cpus_allowed, possibly even to all bits. No one has noticed any terrible disasters from this yet. Granted, no one has pushed this code path very hard yet.

> However, I think making a best effort to honor the task's cpuset is a
> reasonable goal in this context.  But it will require some nontrivial
> changes to the code for migrating tasks off the dead cpu, as well as
> some changes to the cpuset code.

If these nontrivial changes have other merits, such as making some code simpler or clearer or more elegant, then that's goodness in any case. But I am reluctant to complicate either the cpuset or the hotplug code on this account. In my view, the kernel reserves the right to blow out a cpuset, if it is convenient for the kernel to do so in order to stay sane, which includes:

 * ensuring every task has a cpu it is allowed to run on,
 * ensuring every task has a node it is allowed to allocate on, and
 * no deadlocks, hangs, oopses, panics or other disasters.

The mm/page_alloc.c __alloc_pages() code makes similar tradeoffs, allowing GFP_ATOMIC requests to ignore cpusets under memory pressure, for the "convenience of the kernel."

A key property of any solution to this is that it is so simple and robust that people who don't understand this stuff (including, obviously, myself after a month's vacation) won't break it in the future.

The following untested patch is probably what you meant by your "short-term solution", above. It _just_ reverts the "No more Mr. Nice Guy" code back to setting all bits in tsk->cpus_allowed. I see on another post in my inbox that Srivatsa has just heartily endorsed this approach as well.

Signed-off-by: Paul Jackson <pj...@sg...>

diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
--- 2.6.12-rc1-mm4.orig/kernel/sched.c	2005-05-13 18:39:54.000000000 -0700
+++ 2.6.12-rc1-mm4/kernel/sched.c	2005-05-14 09:06:29.000000000 -0700
@@ -4301,7 +4301,7 @@ static void move_task_off_dead_cpu(int d
 
 	/* No more Mr. Nice Guy. */
 	if (dest_cpu == NR_CPUS) {
-		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+		cpus_setall(tsk->cpus_allowed);
 		dest_cpu = any_online_cpu(tsk->cpus_allowed);
 
 		/*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Srivatsa V. <va...@in...> - 2005-05-14 16:28:46
|
On Fri, May 13, 2005 at 07:23:31PM -0700, Paul Jackson wrote:
> Revert the move_task_off_dead_cpu() code to its previous code, before
> cpusets were added.  If none of the remaining allowed cpus are online,
> then let the task run on any cpu, no limit.

I would agree to that!

--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs, Bangalore, INDIA - 560017 |
From: Dinakar G. <di...@in...> - 2005-05-12 15:00:02
Attachments:
dyn-sd-v0.6-0.patch
test-scripts.tar.gz
|
On Wed, May 11, 2005 at 01:42:35PM -0700, Paul Jackson wrote:
> So what we'd really like to do would be to first fallback to all the
> cpus allowed in the specified tasks cpuset (no walking the cpuset
> hierarchy), and see if any of those cpus are still online to receive
> this orphan task.  Unless someone has botched the system configuration,
> and taken offline all the cpus in a cpuset, this should yield up a cpu
> that is still both allowed and online.  If that fails, then to heck with
> honoring cpuset placement - just take the first online cpu we can find.

I tried your suggestion, but hit another oops. It looks like this one has more to do with hotplug+preempt:

Code: fc 89 ec 5d e9 8b 0e 00 00 c7 04 24 5c 49 41 c0 e8 7f 39 00 00 e8 da 87 fe ff eb c1 0f 0b 36 11 32 96 41 c0 eb 92 83 f8
<6>note: cpuoff.sh[25439] exited with preempt_count 1
scheduling while atomic: cpuoff.sh/0x10000001/25439
 [<c04010d2>] schedule+0x4b2/0x4c0
 [<c0152da0>] unmap_page_range+0xc0/0xe0
 [<c0401988>] cond_resched+0x28/0x40
 [<c0152f56>] unmap_vmas+0x196/0x290
 [<c0157c93>] exit_mmap+0xa3/0x1a0
 [<c011ce53>] mmput+0x43/0x100
 [<c01222f0>] do_exit+0xf0/0x3b0
 [<c01047da>] die+0x18a/0x190
 [<c0104bb0>] do_invalid_op+0x0/0xd0
 [<c0104c5e>] do_invalid_op+0xae/0xd0
 [<c011bbe7>] migrate_dead+0xa7/0xc0
 [<c0400f14>] schedule+0x2f4/0x4c0
 [<c040112b>] preempt_schedule+0x4b/0x70
 [<c040112b>] preempt_schedule+0x4b/0x70
 [<c0103fd7>] error_code+0x4f/0x54
 [<c040007b>] svc_proc_unregister+0x1b/0x20
 [<c011007b>] __acpi_map_table+0x7b/0xc0
 [<c011bbe7>] migrate_dead+0xa7/0xc0
 [<c011fef9>] profile_handoff_task+0x39/0x50
 [<c011bc62>] migrate_dead_tasks+0x62/0x90
 [<c011bda6>] migration_call+0x116/0x2c0
 [<c0142102>] __stop_machine_run+0x92/0xb0
 [<c012ddad>] notifier_call_chain+0x2d/0x50
 [<c013a3cb>] cpu_down+0x16b/0x2a0
 [<c027db0b>] store_online+0x5b/0x80
 [<c027ab25>] sysdev_store+0x35/0x40
 [<c019f14e>] flush_write_buffer+0x3e/0x50
 [<c019f1b8>] sysfs_write_file+0x58/0x80
 [<c0163a36>] vfs_write+0xc6/0x180
 [<c0163bc1>] sys_write+0x51/0x80
 [<c01034e5>] syscall_call+0x7/0xb

On 2.6.12-rc4-mm1 it is even worse, it panics:

Booting processor 2/1 eip 3000
Calibrating delay using timer specific routine.. 1800.08 BogoMIPS (lpj=900043)
CPU2: Intel Pentium III (Cascades) stepping 04
Unable to handle kernel NULL pointer dereference<7>CPU0 attaching sched-domain:
 at virtual address 00000001
 printing eip:
00000001
*pde = 000001e3
*pte = 00000001
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in:
CPU:    2
EIP:    0060:[<00000001>]    Not tainted VLI
EFLAGS: 00010006   (2.6.12-rc4-mm1)
EIP is at 0x1
eax: c18c0000   ebx: c04758a0   ecx: 00000001   edx: e8c13eb4
esi: ffffffff   edi: c185c030   ebp: c18c1f80   esp: c18c1f2c
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c18c0000 task=c185c030)
Stack: c0114e0d e8c13eb4 c18c0000 c0103ecc c18c0000 00000008 00000002 ffffffff
       c185c030 c18c1f80 00000001 0000007b 0000007b fffffffb c04048a0 00000060
       00000206 c040511c 00000003 00000000 00000000 c18c0000 c01033ee 00000003
Call Trace:
 [<c0114e0d>] smp_call_function_interrupt+0x3d/0x60
 [<c0103ecc>] call_function_interrupt+0x1c/0x24
 [<c04048a0>] schedule+0x0/0x7c0
 [<c040511c>] preempt_schedule_irq+0x4c/0x80
 [<c01033ee>] need_resched+0x1f/0x21
 [<c011516f>] start_secondary+0x10f/0x1a0
Code: Bad EIP value.
 <0>Kernel panic - not syncing: Fatal exception in interrupt
Booting processor 3/2 eip 3000
Initializing CPU#3

I really haven't had a chance to investigate what's going on; I should be able to do that tomorrow. Here is the patch I tried, my .config and scripts.

-Dinakar |
From: Dinakar G. <di...@in...> - 2005-05-13 12:04:28
|
On Thu, May 12, 2005 at 08:40:48PM +0530, Dinakar Guniguntala wrote:
> 
> I really havent had a chance to investigate whats going on, should be
> able to do that tomorrow.  Here is the patch I tried, my .config and
> scripts

Ok, I confirmed that this is a hotplug bug. The problem is hit with cpusets turned off and only hotplug and preempt on. It happens in the latest mm as well (2.6.12-rc4-mm1).

I will not be looking into this right now; I can provide details to anyone interested.

-Dinakar |
From: Nathan L. <nt...@po...> - 2005-05-13 00:34:53
|
Paul Jackson wrote:
> 
> I share you preference for not nesting these semaphores.
> 
> The other choice I am aware of would be for the hotplug code to be less
> cpuset-friendly.  In the move_task_off_dead_cpu() code, at the point it
> says "No more Mr. Nice Guy", instead of looking for the nearest
> enclosing cpuset that has something online, which is what the
> cpuset_cpus_allowed() does, instead we could just take any damn cpu that
> was online.
> 
> Something along the lines of the following fix:
> 
> --- pj/kernel.old/sched.c	2005-05-11 13:00:17.000000000 -0700
> +++ pj/kernel.new/sched.c	2005-05-11 13:02:24.000000000 -0700
> @@ -4229,7 +4229,7 @@ static void move_task_off_dead_cpu(int d
> 
>  	/* No more Mr. Nice Guy. */
>  	if (dest_cpu == NR_CPUS) {
> -		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
> +		tsk->cpus_allowed = cpu_online_map;
>  		dest_cpu = any_online_cpu(tsk->cpus_allowed);

Well, CPU_MASK_ALL rather than cpu_online_map, I would think. That is what the behavior was before the cpuset merge, anyway. It might be the best short-term solution; more below...

> So what we'd really like to do would be to first fallback to all the
> cpus allowed in the specified tasks cpuset (no walking the cpuset
> hierarchy), and see if any of those cpus are still online to receive
> this orphan task.  Unless someone has botched the system configuration,
> and taken offline all the cpus in a cpuset, this should yield up a cpu
> that is still both allowed and online.  If that fails, then to heck with
> honoring cpuset placement - just take the first online cpu we can find.
> 
> This is doable without holding cpuset_sem.  We can look at a current
> tasks cpuset without cpuset_sem, just with the task lock.

Yes, but your patch doesn't lock the task itself (unless I'm misreading patches again). However, have a look at the comments above task_lock in sched.h:

/*
 * Protects ->fs, ->files, ->mm, ->ptrace, ->group_info, ->comm, keyring
 * subscriptions and synchronises with wait4().  Also used in procfs.
 *
 * Nests both inside and outside of read_lock(&tasklist_lock).
 * It must not be nested with write_lock_irq(&tasklist_lock),
 * neither inside nor outside.
 */
static inline void task_lock(struct task_struct *p)
{
	spin_lock(&p->alloc_lock);
}

Unfortunately, move_task_off_dead_cpu is called from migrate_live_tasks while the latter has a write_lock_irq on tasklist_lock. So we can't use task_lock in this context, assuming the comments are valid. Right?

Nathan |
From: Dinakar G. <di...@in...> - 2005-05-13 12:21:23
|
On Wed, May 11, 2005 at 02:51:56PM -0500, Nathan Lynch wrote:
> This trace is what should be fixed -- we're trying to schedule while
> the machine is "stopped" (all cpus except for one spinning with
> interrupts off).  I'm not too familiar with the cpusets code but I
> would like to stay away from nesting these semaphores if at all
> possible.

Vatsa pointed out another scenario where cpusets+hotplug is currently broken. attach_task in cpuset.c is called without holding the hotplug lock, and it is possible to call set_cpus_allowed for a task with no online cpus.

Given this, I think the patch I sent first is the most appropriate one. In addition, we also need to take the hotplug lock in the cpusets code whenever we are modifying the cpus_allowed of a task. IOW, make cpusets and hotplug operations completely exclusive to each other. The same applies to the memory hotplug code once it gets in.

However, on the downside this would mean:

1. A lot of nested locks (mostly in cpuset_common_file_write)
2. Taking hotplug (cpu now, and later memory) locks for operations that may just be updating a flag

-Dinakar |
From: Srivatsa V. <va...@in...> - 2005-05-13 17:25:56
|
On Fri, May 13, 2005 at 06:02:17PM +0530, Dinakar Guniguntala wrote:
> attach_task in cpuset.c is called without holding the hotplug
> lock and it is possible to call set_cpus_allowed for a task with no
> online cpus.

This in fact was the reason that we added lock_cpu_hotplug in sched_setaffinity. Also, guarantee_online_cpus seems to be accessing cpu_online_map with preemption enabled (and no hotplug lock taken). This is highly not recommended.

> Given this I think the patch I sent first is the most appropriate
> patch.

I agree that taking the hotplug lock seems reasonable here.

> In addition we also need to take hotplug lock in the cpusets
> code whenever we are modifying cpus_allowed of a task. IOW make cpusets
> and hotplug operations completly exclusive to each other. The same
> applies to memory hotplug code once it gets in.
> 
> However on the downside this would mean
> 1. A lot of nested locks (mostly in cpuset_common_file_write)
> 2. Taking of hotplug (cpu now and later memory) locks for operations
>    that may just be updating a flag

Given the fact that CPU/Memory hotplug and cpuset operations may be infrequent events, this will probably not be a concern.

--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs, Bangalore, INDIA - 560017 |
From: Paul J. <pj...@sg...> - 2005-05-13 20:00:06
|
Srivatsa, replying to Dinakar:
> This in fact was the reason that we added lock_cpu_hotplug
> in sched_setaffinity.

Why just in sched_setaffinity()? What about the other 60+ calls to set_cpus_allowed()?

Shouldn't most of those calls be checking that the passed-in cpus are online (holding lock_cpu_hotplug while doing all this)? Either that, or at least handling the error from set_cpus_allowed() if the requested cpus end up not being online? I see only 2 set_cpus_allowed() calls that make any pretense of examining the return value.

> I agree that taking the hotplug lock seems reasonable here.

I take it that you are recommending changing cpuset_cpus_allowed() in the way sched_setaffinity() was changed: grab lock_cpu_hotplug around the small piece of code that examines cpu_online_map and calls set_cpus_allowed(). This sounds good to me. I wonder why it doesn't need to be considered in the other 60+ calls to set_cpus_allowed().

> Given the fact that CPU/Memory hotplug and cpuset operation may
> be infrequent events, this will probably be not a concern.

I agree that performance is not the key issue here. Little if any of this is on any hot code path.

We do need to be clear about how these locks work: their semantics, what they require and what they ensure, and their various interactions. This is not easy stuff to get right.

I notice that the documentation for lock_cpu_hotplug() is a tad on the skimpy side:

	/* Stop CPUs going up and down. */

That's it, so far as I can see. Interaction of hotplug locking with the rest of the kernel has been, is now, and will continue to be, a difficult area. More care than this needs to be put into explaining the use, semantics and interactions of any locking involved.

In particular, in my view, locks should guard data. What data does lock_cpu_hotplug() guard? I propose that it guards cpu_online_map.

I recommend considering a different name for this lock. Perhaps 'cpu_online_sem', instead of 'cpucontrol'? And perhaps lock_cpu_hotplug() should be renamed, to say, 'lock_cpu_online'?

Then a survey of all uses of cpu_online_map, and by extension, of all calls to set_cpus_allowed(), should be made, to see in which cases this semaphore should be held.

For extra credit, make sure that every caller of set_cpus_allowed() makes an effort to only pass in cpus that are currently online. The caller should do this, since only the caller can know what alternative cpus to use, if their first choice is offline.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401 |
From: Dipankar S. <dip...@in...> - 2005-05-13 20:22:46
|
On Fri, May 13, 2005 at 12:59:53PM -0700, Paul Jackson wrote:
> Srivatsa, replying to Dinakar:
> > This in fact was the reason that we added lock_cpu_hotplug
> > in sched_setaffinity.
> 
> We do need to be clear about how these locks work, their semantics, what
> they require and what they insure, and their various interactions.
> 
> This is not easy stuff to get right.
> 
> I notice that the documentation for lock_cpu_hotplug() is a tad on
> the skimpy side:
> 
> 	/* Stop CPUs going up and down. */
> 
> That's it, so far as I can see.  Interaction of hotplug locking with
> the rest of the kernel has been, is now, and will continue to be, a
> difficult area.  More care than this needs to be put into explaining
> the use, semantics and interactions of any locking involved.
> 
> In particular, in my view, locks should guard data.  What data does
> lock_cpu_hotplug() guard?  I propose that it guards cpu_online_map.
> 
> I recommend considering a different name for this lock.  Perhaps
> 'cpu_online_sem', instead of 'cpucontrol'?  And perhaps the
> lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?

No. CPU hotplug uses two different locking mechanisms -- see both lock_cpu_hotplug() and __stop_machine_run(). Anyone reading cpu_online_map with preemption disabled is safe from cpu hotplug even without taking any lock.

Thanks
Dipankar |
From: Nathan L. <nt...@po...> - 2005-05-13 20:47:09
|
> > In particular, in my view, locks should guard data.  What data does
> > lock_cpu_hotplug() guard?  I propose that it guards cpu_online_map.
> > 
> > I recommend considering a different name for this lock.  Perhaps
> > 'cpu_online_sem', instead of 'cpucontrol'?  And perhaps the
> > lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?
> 
> No. CPU hotplug uses two different locking - see both lock_cpu_hotplug()
> and __stop_machine_run(). Anyone reading cpu_online_map with
> preemption disabled is safe from cpu hotplug even without taking
> any lock.

More precisely (I think), reading cpu_online_map with preemption disabled guarantees that none of the cpus in the map will go offline -- it does not prevent an online operation in progress (though most code only cares about the former case). Also note that __stop_machine_run is used only to offline a cpu.

The cpucontrol semaphore does not only protect cpu_online_map and cpu_present_map; it also serializes cpu hotplug operations, so that only one may be in progress at a time.

I've been mulling over submitting a Documentation/cpuhotplug.txt; sounds like there's sufficient demand...

Nathan |
From: Paul J. <pj...@sg...> - 2005-05-13 21:05:28
|
Nathan wrote:
> sounds like there's sufficient demand...

That's a safe bet ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
|
From: Dipankar S. <dip...@in...> - 2005-05-13 21:08:32
|
On Fri, May 13, 2005 at 03:46:52PM -0500, Nathan Lynch wrote:
> > > In particular, in my view, locks should guard data.  What data does
> > > lock_cpu_hotplug() guard?  I propose that it guards cpu_online_map.
> > >
> > > I recommend considering a different name for this lock.  Perhaps
> > > 'cpu_online_sem', instead of 'cpucontrol'?  And perhaps the
> > > lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?
> >
> > No. CPU hotplug uses two different locking mechanisms - see both
> > lock_cpu_hotplug() and __stop_machine_run(). Anyone reading
> > cpu_online_map with preemption disabled is safe from cpu hotplug even
> > without taking any lock.
>
> More precisely (I think), reading cpu_online_map with preemption
> disabled guarantees that none of the cpus in the map will go offline
> -- it does not prevent an online operation in progress (but most code
> only cares about the former case).  Also note that __stop_machine_run
> is used only to offline a cpu.
>
> The cpucontrol semaphore does not only protect cpu_online_map and
> cpu_present_map, but also serializes cpu hotplug operations, so that
> only one may be in progress at a time.

Exactly. So, naming it lock_cpu_online() would make no sense.

> I've been mulling over submitting a Documentation/cpuhotplug.txt,
> sounds like there's sufficient demand...

That would be very good, considering that CPU hotplug gets broken often.

Another nice thing to see would be i386 cpu hotplug support. That would
allow a lot more developers to test their code with cpu hotplug. Not
sure where that one is now. Perhaps Zwane knows.

Thanks
Dipankar
|
From: Paul J. <pj...@sg...> - 2005-05-13 20:53:08
|
> No.

What part of what I wrote are you saying "No" to?

And what does that have to do with the two different
locking mechanisms for hotplug?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
|
From: Dipankar S. <dip...@in...> - 2005-05-13 21:04:29
|
On Fri, May 13, 2005 at 01:52:33PM -0700, Paul Jackson wrote:
> > No.
>
> What part of what I wrote are you saying "No" to?

The question right above "No" :)

> And what does that have to do with the two different
> locking mechanisms for hotplug?

Because lock_cpu_hotplug() isn't a lock just for cpu_online_map.

I think we need better in-tree documentation of cpu hotplug locking.
For now, most of the descriptions are in conversations between Rusty
and Vatsa on the linux hotplug cpu support mailing list while they
were designing it.

Thanks
Dipankar
|
From: Paul J. <pj...@sg...> - 2005-05-14 02:59:03
|
Dipankar, replying to pj:
> > What part of what I wrote are you saying "No" to?
>
> The question right above "No" :)

Well ... that was less than obvious.  You quoted too much, and
responded with information about other semaphores, not about why
other duties of _this_ semaphore made such a rename wrong.

Fortunately, Nathan clarified matters.

So how would you, or Srivatsa or Nathan, respond to my more substantive
point, to repeat:

    Srivatsa, replying to Dinakar:
    > This in fact was the reason that we added lock_cpu_hotplug
    > in sched_setaffinity.

    Why just in sched_setaffinity()?  What about the other 60+ calls
    to set_cpus_allowed()?  Shouldn't most of those calls be checking
    that the passed in cpus are online (holding lock_cpu_hotplug while
    doing all this)?  Either that, or at least handling the error from
    set_cpus_allowed() if the requested cpus end up not being online?
    I see only 2 set_cpus_allowed() calls that make any pretense of
    examining the return value.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
|
From: Nathan L. <nt...@po...> - 2005-05-14 16:30:27
|
On Fri, May 13, 2005 at 07:58:51PM -0700, Paul Jackson wrote:
>
> So how would you, or Srivatsa or Nathan, respond to my more substantive
> point, to repeat:
>
>     Srivatsa, replying to Dinakar:
>     > This in fact was the reason that we added lock_cpu_hotplug
>     > in sched_setaffinity.
>
> Why just in sched_setaffinity()?

I suspect that the lock_cpu_hotplug is no longer necessary in
sched_setaffinity.  I found the original changeset which introduced
it, and it's short enough that I'll just duplicate it here:

diff -Naru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2005-05-14 07:21:39 -07:00
+++ b/kernel/sched.c	2005-05-14 07:21:39 -07:00
@@ -1012,6 +1012,7 @@
 	unsigned long flags;
 	cpumask_t old_mask, new_mask = cpumask_of_cpu(dest_cpu);
 
+	lock_cpu_hotplug();
 	rq = task_rq_lock(p, &flags);
 	old_mask = p->cpus_allowed;
 	if (!cpu_isset(dest_cpu, old_mask) || !cpu_online(dest_cpu))
@@ -1035,6 +1036,7 @@
 	}
 out:
 	task_rq_unlock(rq, &flags);
+	unlock_cpu_hotplug();
 }
 
 /*
@@ -2309,11 +2311,13 @@
 	if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
 		return -EFAULT;
 
+	lock_cpu_hotplug();
 	read_lock(&tasklist_lock);
 	p = find_process_by_pid(pid);
 	if (!p) {
 		read_unlock(&tasklist_lock);
+		unlock_cpu_hotplug();
 		return -ESRCH;
 	}
@@ -2334,6 +2338,7 @@
 
 out_unlock:
 	put_task_struct(p);
+	unlock_cpu_hotplug();
 	return retval;
 }
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/03/19 08:02:56-08:00 ru...@ru...
#   [PATCH] Hotplug CPUs: Take cpu Lock Around Migration
#
#   Grab cpu lock around sched_migrate_task() and
#   sys_sched_setaffinity().  This is a noop without CONFIG_HOTPLUG_CPU.
#
#   The sched_migrate_task may have a performance penalty on NUMA if
#   lots of exec rebalancing is happening, however this only applies to
#   CONFIG_NUMA and CONFIG_HOTPLUG_CPU, which noone does at the moment
#   anyway.
#
#   Also, the scheduler in -mm solves the race another way, so this
#   will vanish then.
#
# kernel/sched.c
#   2004/03/16 18:10:10-08:00 ru...@ru... +5 -0
#   Hotplug CPUs: Take cpu Lock Around Migration

The lock/unlock_cpu_hotplug is no longer there in sched_migrate_task.
The changelog leads me to believe that it was intended that the same
change should have been made to sched_setaffinity by now.  I think
it's safe to remove it; I can't see why it would be necessary any
more.

Nathan
|
From: Srivatsa V. <va...@in...> - 2005-05-14 17:23:21
|
On Sat, May 14, 2005 at 11:30:12AM -0500, Nathan Lynch wrote:
> I suspect that the lock_cpu_hotplug is no longer necessary in
> sched_setaffinity.

Digging harder into my memory, I think the primary reason why
lock_cpu_hotplug was added in sched_setaffinity was this: some code
wants to temporarily override a (user-space) task's cpu mask.  Before
stop_machine came along, module code was doing that (again, if my
memory serves me right).  With stop_machine, temporarily changing
user-space (rmmod's) cpus_allowed is not required.  However I still
see other code (like acpi_processor_set_performance) which does
something like this:

	saved_mask = current->cpus_allowed;
	set_cpus_allowed(current, new_mask);

	/* Do something ... could block */

	set_cpus_allowed(current, saved_mask);

How do you ensure that saved_mask has not changed since the time we
took a snapshot of it (because of sched_setaffinity having changed
it)?  lock_cpu_hotplug could serialize such pieces of code.

> I found the original changeset which introduced
> it, and it's short enough that I'll just duplicate it here:

[snip]

> The lock/unlock_cpu_hotplug is no longer there in sched_migrate_task.

Essentially, if I remember correctly, sched_migrate_task was also
doing something like acpi_processor_set_performance to migrate a task
to a new cpu.  That's why we discussed that sched_migrate_task should
take lock_cpu_hotplug, but I don't think it ever went in.  There were
concerns about cache-line bounces on NUMA machines because of
lock_cpu_hotplug in frequently executed code like sched_migrate_task.

Moreover, sched_migrate_task no longer behaves like
acpi_processor_set_performance when it wants to migrate a user-space
task to a different CPU.  It now seems to take the help of the
migration thread (which is much neater compared to what was there
earlier, I guess).

> The changelog leads me to believe that it was intended that the same
> change should have been made to sched_setaffinity by now.  I think
> it's safe to remove it; I can't see why it would be necessary any
> more.

I recommend that we keep the lock_cpu_hotplug in sched_setaffinity
unless we figure out a different way of solving the race I highlighted
above, or we ban code like that in acpi_processor_set_performance :)

-- 
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
|