From: richard -r. w. <ric...@gm...> - 2010-10-01 22:03:11
Attachments:
config-2.6.36-rc6-00105-gd7d8ecb.gz
Hi,

I'm often seeing messages like this one on my UML:
...
INFO: rcu_sched_state detected stall on CPU 0 (t=7348 jiffies)
...

The system often seems to be frozen.
When I press a key, it wakes up immediately and continues to run.

Any ideas?
Config is attached.

--
Cheers,
//richard
From: Chris F. <cd...@fo...> - 2010-10-04 19:26:40
On Sat, Oct 02, 2010 at 12:03:04AM +0200, richard -rw- weinberger wrote:
> Hi,
>
> I'm seeing often messages like this one on my UML:
> ...
> INFO: rcu_sched_state detected stall on CPU 0 (t=7348 jiffies)
> ...
>
> The system often seems to be frozen.
> When I press a button it wakes up immediately and continues to run.
>
> Any ideas?
> Config is attached.

I saw this too, but I thought it was because I was overloading my machine.
I only seem to see it when the guest is starved for CPU due to I/O on the host.

- Chris
From: richard -r. w. <ric...@gm...> - 2010-10-14 18:28:02
Hi Arjan!

This commit causes some problems on UML.
The kernel freezes after a few seconds until it gets some input.
E.g. when I run top, it stops refreshing the process list until I press a key.

Messages like this appear:
INFO: rcu_sched_state detected stall on CPU 0 (t=7348 jiffies)

After reverting it, UML works fine again.

commit 78b435368fcd615e695a06012cd963a556284e00
Author: Arjan van de Ven <ar...@li...>
Date:   Mon Jul 19 10:59:42 2010 -0700

    slab: use deferable timers for its periodic housekeeping

    slab has a "once every 2 second" timer for its housekeeping.
    As the number of logical processors is growing, its more and more
    common that this 2 second timer becomes the primary wakeup source.

    This patch turns this housekeeping timer into a deferable timer,
    which means that the timer does not interrupt idle, but just runs
    at the next event that wakes the cpu up.

    The impact is that the timer likely runs a bit later, but during the
    delay no code is running so there's not all that much reason for
    a difference in housekeeping to occur because of this delay.

    Signed-off-by: Arjan van de Ven <ar...@li...>
    Signed-off-by: Pekka Enberg <pe...@cs...>

diff --git a/mm/slab.c b/mm/slab.c
index e49f8f4..29aad44 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -861,7 +861,7 @@ static void __cpuinit start_cpu_timer(int cpu)
 	 */
 	if (keventd_up() && reap_work->work.func == NULL) {
 		init_reap_node(cpu);
-		INIT_DELAYED_WORK(reap_work, cache_reap);
+		INIT_DELAYED_WORK_DEFERRABLE(reap_work, cache_reap);
 		schedule_delayed_work_on(cpu, reap_work,
 					__round_jiffies_relative(HZ, cpu));
 	}

--
Thanks,
//richard
From: richard -r. w. <ric...@gm...> - 2010-10-14 23:44:14
On Thu, Oct 14, 2010 at 9:50 PM, Arjan van de Ven <ar...@li...> wrote:
> On 10/14/2010 11:27 AM, richard -rw- weinberger wrote:
>>
>> Hi Arjan!
>>
>> This commit causes some problems on UML.
>>
> that is extremely weird.
>>
>> The kernel freezes after a few seconds until it gets some input.
>> e.g: When I run top it stops refreshing the process list until i press a
>> button.
>
> a slab timer change (to not be as critical) causing global timer issues....
> that's very obviously not a problem with this patch.
> has this been seem anywhere except UML ?

A small update:
It seems that CONFIG_NO_HZ is broken on UML. :-(

CONFIG_NO_HZ + CONFIG_SLAB: works
CONFIG_NO_HZ + CONFIG_SLAB + your patch: broken
CONFIG_NO_HZ + CONFIG_SLUB: broken

CONFIG_SLAB + your patch: works
CONFIG_SLAB: works
CONFIG_SLUB: works

>> Messages like this appear:
>> INFO: rcu_sched_state detected stall on CPU 0 (t=7348 jiffies)
>>
>> After reverting UML works fine again.
>>
>> commit 78b435368fcd615e695a06012cd963a556284e00
>> [...]

--
Thanks,
//richard
From: Pekka E. <pe...@ke...> - 2010-10-15 07:03:03
On Fri, Oct 15, 2010 at 2:44 AM, richard -rw- weinberger <ric...@gm...> wrote:
> A small update:
> It seems that CONFIG_NO_HZ is broken on UML. :-(
>
> CONFIG_NO_HZ + CONFIG_SLAB: works
> CONFIG_NO_HZ + CONFIG_SLAB + your patch: broken
> CONFIG_NO_HZ + CONFIG_SLUB: broken
>
> CONFIG_SLAB + your patch: works
> CONFIG_SLAB: works
> CONFIG_SLUB: works

Thanks for testing!

Thomas, Ingo, Peter, I'm not sure who maintains CONFIG_NO_HZ, so I CC'd
you. The problem here is that Arjan's deferrable timers patch in SLAB
triggered something that looks like a latent bug in UML with NOHZ.

Pekka
From: Peter Z. <pe...@in...> - 2010-10-15 08:23:17
On Fri, 2010-10-15 at 10:02 +0300, Pekka Enberg wrote:
> Thanks for testing! Thomas, Ingo, Peter, I'm not sure who maintains
> CONFIG_NO_HZ so I CC'd you. The problem here is that Arjan's
> deferrable timers patch in SLAB triggered something that looks like a
> latent bug with UML and NOHZ.

Thomas does mostly, but if it's UML specific, I guess it's Jeff Dike
you'll be wanting to talk to, since he's the arch maintainer.
From: richard -r. w. <ric...@gm...> - 2010-10-15 09:24:15
On Fri, Oct 15, 2010 at 9:48 AM, Peter Zijlstra <pe...@in...> wrote:
> On Fri, 2010-10-15 at 10:02 +0300, Pekka Enberg wrote:
>> [...]
>
> Thomas does mostly, but if its UML specific, I guess its Jeff Dike
> you'll be wanting to talk to, since he's the arch maintainer.

Jeff is the UML maintainer only in theory.
He seems to be very busy and hasn't touched UML since 2008.
Very sad...

--
Thanks,
//richard
From: richard -r. w. <ric...@gm...> - 2010-10-16 15:27:48
On Fri, Oct 15, 2010 at 9:48 AM, Peter Zijlstra <pe...@in...> wrote:
> On Fri, 2010-10-15 at 10:02 +0300, Pekka Enberg wrote:
>> [...]
>
> Thomas does mostly, but if its UML specific, I guess its Jeff Dike
> you'll be wanting to talk to, since he's the arch maintainer.

After reviewing the code for hours, I've found the bug.
It's an int/long long issue within arch/um/os-Linux/time.c.
A patch is on the way!

--
Thanks,
//richard
From: Arjan v. de V. <ar...@li...> - 2010-10-14 20:24:55
On 10/14/2010 11:27 AM, richard -rw- weinberger wrote:
> Hi Arjan!
>
> This commit causes some problems on UML.
>
that is extremely weird.

> The kernel freezes after a few seconds until it gets some input.
> e.g: When I run top it stops refreshing the process list until i press a button.

a slab timer change (to not be as critical) causing global timer issues...
that's very obviously not a problem with this patch.
has this been seen anywhere except UML?

> Messages like this appear:
> INFO: rcu_sched_state detected stall on CPU 0 (t=7348 jiffies)
>
> After reverting UML works fine again.
>
> commit 78b435368fcd615e695a06012cd963a556284e00
> Author: Arjan van de Ven <ar...@li...>
> Date: Mon Jul 19 10:59:42 2010 -0700
>
>     slab: use deferable timers for its periodic housekeeping
> [...]
From: richard -r. w. <ric...@gm...> - 2010-10-14 20:06:48
On Thu, Oct 14, 2010 at 9:50 PM, Arjan van de Ven <ar...@li...> wrote:
> On 10/14/2010 11:27 AM, richard -rw- weinberger wrote:
>> [...]
>
> a slab timer change (to not be as critical) causing global timer issues....
> that's very obviously not a problem with this patch.
> has this been seem anywhere except UML ?
>

So far I've seen this problem only on UML.
Chris saw it too:
http://marc.info/?l=linux-kernel&m=128622041625323&w=2

Maybe your patch triggers a general timer problem within UML?

>> Messages like this appear:
>> INFO: rcu_sched_state detected stall on CPU 0 (t=7348 jiffies)
>>
>> After reverting UML works fine again.
>>
>> commit 78b435368fcd615e695a06012cd963a556284e00
>> [...]

--
Thanks,
//richard