Thread: [uml-devel] umls unresponsive & consuming 100% cpu time

[uml-devel] umls unresponsive & consuming 100% cpu time

From: Bram M. (Syzop) <sy...@vu...> - 2008-05-07 14:01:12

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

I'm experiencing the following problem:
I upgraded from 2.6.20.1 to 2.6.25 on both the host and the UMLs.
Now, after some time (I'm not sure how long), the UMLs appear to hang.
They don't seem to be completely frozen, though, just very, very slow
(or rather, 99% unresponsive).

top:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5434 virt      20   0  128m  89m  89m R   99  4.5 269:40.81 linux
..so consuming nearly 100% cpu.

When I type a letter at the console (I run the UMLs in a screen session), it
is slow to appear, sometimes taking up to a minute or so, so I can hardly log
in (it does process/buffer my line, but by the time the username is entered
and it prompts for the password, the 60s login timeout has been exceeded).
Also, no errors (such as kernel warnings) are displayed on the console.

When pinging I get:
PING slave (192.168.22.11) 56(84) bytes of data.
From slave (192.168.22.1) icmp_seq=2 Destination Host Unreachable
From slave (192.168.22.1) icmp_seq=3 Destination Host Unreachable
[...more of the same...]
64 bytes from slave (192.168.22.11): icmp_seq=26 ttl=64 time=111 ms
[yes, just one reply, then a delay, and then:]
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
[...more...]

When I do a version request using uml_mconsole, I get a response after a
delay of about 25 seconds. Quick subsequent requests work too, then they stop
working for about 34 seconds, then a reply arrives, and so on.

I'm not sure if this is actually correct (I know the 'linux' image
corresponds to the running slave kernel but I'm unsure about the backtrace
it shows), but here's some gdb stuff:
- -gdb-
srv1:/home/virt# gdb linux 5434
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db
library "/lib/tls/i686/cmov/libthread_db.so.1".

Attaching to program: /home/virt/linux, process 5434
0x0809647a in update_xtime_cache ()
(gdb) bt
#0  0x0809647a in update_xtime_cache ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x08096477 in update_xtime_cache ()
(gdb) bt
#0  0x08096477 in update_xtime_cache ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0809645a in update_xtime_cache ()
(gdb) bt
#0  0x0809645a in update_xtime_cache ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0809645d in update_xtime_cache ()
(gdb) bt
#0  0x0809645d in update_xtime_cache ()
(gdb)
..so each time I ctrl+c after a few secs to see where it's at, it's in there..

Any ideas what this could be?

Or any help on how to get additional / useful info?

TIA,

	Bram.

- --
Bram Matthys
Software developer/IT consultant        sy...@vu...
PGP key:                       www.vulnscan.org/pubkey.asc
PGP fp: 8DD4 437E 9BA8 09AA 0A8D  1811 E1C3 D65F E6ED 2AA2
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (MingW32)

iD8DBQFIIbX846ioc5305a8RAtiMAJ9/MJaGN/6k+711lFVxoX9sUgt5vACgxXgr
qyheRL6nPMJmat49fDS828k=
=0Ra0
-----END PGP SIGNATURE-----

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Nix <ni...@es...> - 2008-05-07 20:24:13

On 7 May 2008, Bram Matthys said:
> Or any help on how to get additional / useful info?

Set

CONFIG_DEBUG_INFO=y
CONFIG_FRAME_POINTER=y

in your kernel, and recompile. `bt' will then show heaps more info.

-- 
`If you are having a "ua luea luea le ua le" kind of day, I can only
 assume that you are doing no work due [to] incapacitating nausea caused 
 by numerous lazy demons.' --- Frossie

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Bram M. (Syzop) <sy...@vu...> - 2008-05-08 10:03:08

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Nix wrote:
| On 7 May 2008, Bram Matthys said:
|> Or any help on how to get additional / useful info?
|
| Set
|
| CONFIG_DEBUG_INFO=y
| CONFIG_FRAME_POINTER=y
|
| in your kernel, and recompile. `bt' will then show heaps more info.

Thanks, I'll get back with the results once it hangs again.

I just noticed something odd on one of the UMLs that didn't hang. That UML
is still running the .25 kernel (without the debugging info). It reports a
login timeout all the time, and this might be why:
When I type 'date' every second I get this:
root@vsrv:~# date
Fri Sep  5 15:49:31 UTC 2008
root@vsrv:~# date
Thu Sep  4 03:51:46 UTC 2008
root@vsrv:~# date
Tue Sep  9 01:27:15 UTC 2008
root@vsrv:~# date
Fri Sep  5 02:27:58 UTC 2008
root@vsrv:~# date
Tue Sep  2 06:22:48 UTC 2008
root@vsrv:~# date
Tue Sep  9 03:21:44 UTC 2008
root@vsrv:~# date
Tue Sep  2 22:51:05 UTC 2008
root@vsrv:~# date
Sun Sep  7 09:18:25 UTC 2008
root@vsrv:~# date
Thu Sep 11 03:24:52 UTC 2008
root@vsrv:~# date
Thu Sep 11 20:24:11 UTC 2008

So the clock jumps both forward and backward, by days at a time.

There's no ntp stuff running on the UML btw.

Date/Time on the main server is correct.

I did do this on the main server a few days ago:
/etc/init.d/ntp stop
hwclock --systohc
/etc/init.d/ntp start
due to these kernel messages (on the main):
'set_rtc_mmss: can't update from 90 to 21'
..which went away after that.
but could that really be related ?
(the hw clock was off by 1 hour or so, but linux time was ok)

	Bram.

- --
Bram Matthys
Software developer/IT consultant        sy...@vu...
PGP key:                       www.vulnscan.org/pubkey.asc
PGP fp: 8DD4 437E 9BA8 09AA 0A8D  1811 E1C3 D65F E6ED 2AA2
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (MingW32)

iD8DBQFIIs+/46ioc5305a8RAttsAKCwIWDpGjbH7PEaX37e7BE/sIfX8gCgkaC8
frzs6nPC35fMoJ+p78AA4h0=
=3e1T
-----END PGP SIGNATURE-----

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Jeff D. <jd...@ad...> - 2008-05-09 15:46:40

On Wed, May 07, 2008 at 04:00:28PM +0200, Bram Matthys (Syzop) wrote:
> I'm experiencing the following problem:
> I upgraded from 2.6.20.1 to 2.6.25 on both the host and the UMLs.
> Now, after some time (I'm not sure how long), the UMLs appear to hang.
> They don't seem to be completely frozen, though, just very, very slow
> (or rather, 99% unresponsive).
> 
> top:
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5434 virt      20   0  128m  89m  89m R   99  4.5 269:40.81 linux
> ..so consuming nearly 100% cpu.

Are you using CONFIG_NOHZ?

There have been some recent time-related fixes.  Can you try the two
patches below and see if they help?

	      	      Jeff

-- 
Work email - jdike at linux dot intel dot com


Index: linux-2.6.22/arch/um/os-Linux/time.c
===================================================================
--- linux-2.6.22.orig/arch/um/os-Linux/time.c	2008-03-18 12:32:19.000000000 -0400
+++ linux-2.6.22/arch/um/os-Linux/time.c	2008-03-24 12:46:26.000000000 -0400
@@ -11,6 +11,7 @@
 #include "kern_constants.h"
 #include "os.h"
 #include "user.h"
+#include "kern_util.h"
 
 int set_interval(void)
 {
@@ -58,12 +59,17 @@ static inline long long timeval_to_ns(co
 long long disable_timer(void)
 {
 	struct itimerval time = ((struct itimerval) { { 0, 0 }, { 0, 0 } });
+	int remain, max = UM_NSEC_PER_SEC / UM_HZ;
 
 	if (setitimer(ITIMER_VIRTUAL, &time, &time) < 0)
 		printk(UM_KERN_ERR "disable_timer - setitimer failed, "
 		       "errno = %d\n", errno);
 
-	return timeval_to_ns(&time.it_value);
+	remain = timeval_to_ns(&time.it_value);
+	if (remain > max)
+		remain = max;
+
+	return remain;
 }
 
 long long os_nsecs(void)
@@ -74,12 +80,51 @@ long long os_nsecs(void)
 	return timeval_to_ns(&tv);
 }
 
+extern void alarm_handler(int sig, struct sigcontext *sc);
+
 #ifdef UML_CONFIG_NO_HZ
 static int after_sleep_interval(struct timespec *ts)
 {
 	return 0;
 }
+
+static void deliver_alarm(void)
+{
+	alarm_handler(SIGVTALRM, NULL);
+}
+
+static unsigned long long sleep_time(unsigned long long nsecs)
+{
+	return nsecs;
+}
+
 #else
+unsigned long long last_tick;
+unsigned long long skew;
+
+static void deliver_alarm(void)
+{
+	unsigned long long this_tick = os_nsecs();
+	int one_tick = UM_NSEC_PER_SEC / UM_HZ;
+
+	if (last_tick == 0)
+		last_tick = this_tick - one_tick;
+
+	skew += this_tick - last_tick;
+
+	while (skew >= one_tick) {
+		alarm_handler(SIGVTALRM, NULL);
+		skew -= one_tick;
+	}
+
+	last_tick = this_tick;
+}
+
+static unsigned long long sleep_time(unsigned long long nsecs)
+{
+	return nsecs > skew ? nsecs - skew : 0;
+}
+
 static inline long long timespec_to_us(const struct timespec *ts)
 {
 	return ((long long) ts->tv_sec * UM_USEC_PER_SEC) +
@@ -102,6 +147,8 @@ static int after_sleep_interval(struct t
 	 */
 	if (start_usecs > usec)
 		start_usecs = usec;
+
+	start_usecs -= skew / UM_NSEC_PER_USEC;
 	tv = ((struct timeval) { .tv_sec  = start_usecs / UM_USEC_PER_SEC,
 				 .tv_usec = start_usecs % UM_USEC_PER_SEC });
 	interval = ((struct itimerval) { { 0, usec }, tv });
@@ -113,8 +160,6 @@ static int after_sleep_interval(struct t
 }
 #endif
 
-extern void alarm_handler(int sig, struct sigcontext *sc);
-
 void idle_sleep(unsigned long long nsecs)
 {
 	struct timespec ts;
@@ -126,10 +171,12 @@ void idle_sleep(unsigned long long nsecs
 	 */
 	if (nsecs == 0)
 		nsecs = UM_NSEC_PER_SEC / UM_HZ;
+
+	nsecs = sleep_time(nsecs);
 	ts = ((struct timespec) { .tv_sec	= nsecs / UM_NSEC_PER_SEC,
 				  .tv_nsec	= nsecs % UM_NSEC_PER_SEC });
 
 	if (nanosleep(&ts, &ts) == 0)
-		alarm_handler(SIGVTALRM, NULL);
+		deliver_alarm();
 	after_sleep_interval(&ts);
 }
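
The catch-up logic that the non-NO_HZ deliver_alarm() above adds can be
exercised outside the kernel. The following is a minimal sketch, not the
kernel code: struct tick_state and ticks_due() are hypothetical names, and
instead of calling alarm_handler() the function only counts how many ticks
would be delivered.

```c
/* Hypothetical userspace model of the patch's tick catch-up. */
struct tick_state {
	unsigned long long last_tick;	/* time of the previous call, in ns */
	unsigned long long skew;	/* elapsed time not yet turned into ticks */
};

/* Returns how many timer ticks are owed at time `now` (in ns). */
int ticks_due(struct tick_state *st, unsigned long long now,
	      unsigned long long one_tick)
{
	int ticks = 0;

	/* First call: pretend exactly one tick has elapsed. */
	if (st->last_tick == 0)
		st->last_tick = now - one_tick;

	/* Accumulate the time that passed since the previous call. */
	st->skew += now - st->last_tick;

	/* Owe one tick per full tick period; keep the remainder as skew. */
	while (st->skew >= one_tick) {
		ticks++;
		st->skew -= one_tick;
	}

	st->last_tick = now;
	return ticks;
}
```

With HZ=100, one_tick is 10,000,000 ns, so a wake-up that arrives 35 ms after
the previous one yields 3 ticks and carries 5 ms of skew into the next call.
Both fields are unsigned, so a `now` that is smaller than last_tick makes the
subtraction wrap around to a very large value.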



Index: linux-2.6.22/arch/um/kernel/time.c
===================================================================
--- linux-2.6.22.orig/arch/um/kernel/time.c	2008-04-10 12:53:32.000000000 -0400
+++ linux-2.6.22/arch/um/kernel/time.c	2008-04-14 10:30:00.000000000 -0400
@@ -75,7 +75,7 @@ static irqreturn_t um_timer(int irq, voi
 
 static cycle_t itimer_read(void)
 {
-	return os_nsecs();
+	return os_nsecs() / 1000;
 }
 
 static struct clocksource itimer_clocksource = {
@@ -83,7 +83,7 @@ static struct clocksource itimer_clockso
 	.rating		= 300,
 	.read		= itimer_read,
 	.mask		= CLOCKSOURCE_MASK(64),
-	.mult		= 1,
+	.mult		= 1000,
 	.shift		= 0,
 	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
 };
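
For context on the second patch: the generic timekeeping code converts a
clocksource reading to nanoseconds as (cycles * mult) >> shift. A minimal
sketch of that arithmetic follows; cyc_to_ns() is a stand-in written for this
note, not the kernel's own helper.

```c
#include <stdint.h>

/* Convert a clocksource reading to nanoseconds: (cycles * mult) >> shift. */
uint64_t cyc_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

Before the patch the reading is in nanoseconds with mult = 1, so 1234567 stays
1234567. After the patch the reading is in microseconds with mult = 1000, so
the same instant reads as 1234 and converts back to 1234000 ns.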

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Bram M. (Syzop) <sy...@vu...> - 2008-05-10 08:42:23

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sorry, this was supposed to go to the list...

UPDATE: after 12+ hours still hung.

Jeff Dike wrote:
| On Wed, May 07, 2008 at 04:00:28PM +0200, Bram Matthys (Syzop) wrote:
|> I'm experiencing the following problem:
|> I upgraded from 2.6.20.1 to 2.6.25 on both the host and the UMLs.
|> Now, after some time (I'm not sure how long), the UMLs appear to hang.
|> They don't seem to be completely frozen, though, just very, very slow
|> (or rather, 99% unresponsive).
|>
|> top:
|>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
|>  5434 virt      20   0  128m  89m  89m R   99  4.5 269:40.81 linux
|> ..so consuming nearly 100% cpu.
|
| Are you using CONFIG_NOHZ?
|
| There have been some recent time-related fixes.  Can you try the two
| patches below and see if they help?

Thanks for your reply.

$ grep HZ .config
CONFIG_HZ=100
# CONFIG_NO_HZ is not set

I've applied your patch against my 2.6.25 (vanilla)...
patching file arch/um/os-Linux/time.c
patching file arch/um/kernel/time.c
Hunk #1 succeeded at 74 (offset -1 lines).
Hunk #2 succeeded at 82 (offset -1 lines).
and recompiled etc..

I saw Vincent's issue, and when I set the time about 5 seconds back, the UML
freezes, uses 100% CPU, and doesn't respond at all. This is not entirely
the same as what I had, though, because mine was still somewhat responsive...

Anyway, I applied your patches, recompiled, and booted; it hangs again when I
set the time 5s back.
I also tested with 1s backwards, with the same result.
This was a quick test; I don't know whether it becomes responsive again after
several hours...

Attaching to program: /home/virt/linux, process 25109
0x080978bf in update_wall_time () at kernel/time/timekeeping.c:475
475                     clock->error -= clock->xtime_interval << (TICK_LENGTH_SHIFT - clock->shift);
(gdb) bt
#0  0x080978bf in update_wall_time () at kernel/time/timekeeping.c:475
#1  0x08086bb5 in do_timer (ticks=1) at kernel/timer.c:929
#2  0x08099793 in tick_periodic (cpu=0) at kernel/time/tick-common.c:66
#3  0x080997b8 in tick_handle_periodic (dev=0x8355420) at kernel/time/tick-common.c:82
#4  0x0805c143 in um_timer (irq=0, dev=0x0) at arch/um/kernel/time.c:70
#5  0x0809fac0 in handle_IRQ_event (irq=0, action=0x11449460) at kernel/irq/handle.c:140
#6  0x0809fb6a in __do_IRQ (irq=0) at kernel/irq/handle.c:236
#7  0x08059d65 in do_IRQ (irq=0, regs=0x834fe98) at arch/um/kernel/irq.c:335
#8  0x0805c0c6 in timer_handler (sig=26, regs=0x834fe98) at arch/um/kernel/time.c:28
#9  0x0806c3d9 in real_alarm_handler (sc=0x0) at arch/um/os-Linux/signal.c:93
#10 0x0806c410 in alarm_handler (sig=26, sc=0x0) at arch/um/os-Linux/signal.c:108
#11 0x0806cee4 in deliver_alarm () at arch/um/os-Linux/time.c:116
#12 0x0806d0f1 in idle_sleep (nsecs=<value optimized out>) at arch/um/os-Linux/time.c:180
#13 0x0805ab13 in default_idle () at arch/um/kernel/process.c:248
#14 0x0805ab56 in cpu_idle () at arch/um/kernel/process.c:256
#15 0x082b379a in rest_init () at init/main.c:453
#16 0x0804879a in start_kernel () at init/main.c:650
#17 0x0804a12c in start_kernel_proc (unused=0x0) at arch/um/kernel/skas/process.c:46
#18 0x0806b671 in run_kernel_thread (fn=0x804a100 <start_kernel_proc>, arg=0x0, jmp_ptr=0x83551e0)
    at arch/um/os-Linux/process.c:267
#19 0x0805a892 in new_thread_handler () at arch/um/kernel/process.c:151
#20 0x00000000 in ?? ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0809785c in update_wall_time () at kernel/time/timekeeping.c:464
464                     clock->cycle_last += clock->cycle_interval;
(gdb) bt
#0  0x0809785c in update_wall_time () at kernel/time/timekeeping.c:464
#1  0x08086bb5 in do_timer (ticks=1) at kernel/timer.c:929
#2  0x08099793 in tick_periodic (cpu=0) at kernel/time/tick-common.c:66
#3  0x080997b8 in tick_handle_periodic (dev=0x8355420) at kernel/time/tick-common.c:82
#4  0x0805c143 in um_timer (irq=0, dev=0x0) at arch/um/kernel/time.c:70
#5  0x0809fac0 in handle_IRQ_event (irq=0, action=0x11449460) at kernel/irq/handle.c:140
#6  0x0809fb6a in __do_IRQ (irq=0) at kernel/irq/handle.c:236
#7  0x08059d65 in do_IRQ (irq=0, regs=0x834fe98) at arch/um/kernel/irq.c:335
#8  0x0805c0c6 in timer_handler (sig=26, regs=0x834fe98) at arch/um/kernel/time.c:28
#9  0x0806c3d9 in real_alarm_handler (sc=0x0) at arch/um/os-Linux/signal.c:93
#10 0x0806c410 in alarm_handler (sig=26, sc=0x0) at arch/um/os-Linux/signal.c:108
#11 0x0806cee4 in deliver_alarm () at arch/um/os-Linux/time.c:116
#12 0x0806d0f1 in idle_sleep (nsecs=<value optimized out>) at arch/um/os-Linux/time.c:180
#13 0x0805ab13 in default_idle () at arch/um/kernel/process.c:248
#14 0x0805ab56 in cpu_idle () at arch/um/kernel/process.c:256
#15 0x082b379a in rest_init () at init/main.c:453
#16 0x0804879a in start_kernel () at init/main.c:650
#17 0x0804a12c in start_kernel_proc (unused=0x0) at arch/um/kernel/skas/process.c:46
#18 0x0806b671 in run_kernel_thread (fn=0x804a100 <start_kernel_proc>, arg=0x0, jmp_ptr=0x83551e0)
    at arch/um/os-Linux/process.c:267
#19 0x0805a892 in new_thread_handler () at arch/um/kernel/process.c:151
#20 0x00000000 in ?? ()
(gdb)

I also saw this on my console (which does not react either, btw). I'm not sure
when it appeared: at, or very shortly before or after, the time change:
Stub registers -
        0 - 621a
        1 - 13
        2 - 621a
        3 - 6215
        4 - 8
        5 - bfae182c
        6 - 0
        7 - 7b
        8 - 7b
        9 - 0
        10 - 0
        11 - ffffffff
        12 - 1000be
        13 - 73
        14 - 200246
        15 - bfae1810
        16 - 7b
wait_stub_done : failed to wait for SIGTRAP, pid = 26141, n = 26141, errno = 0, status = 0x1c7f

The old 2.6.20.1 uml's react fine when setting time backwards, btw (well..
within reasonable limits)

	Bram.

- --
Bram Matthys
Software developer/IT consultant        sy...@vu...
PGP key:                       www.vulnscan.org/pubkey.asc
PGP fp: 8DD4 437E 9BA8 09AA 0A8D  1811 E1C3 D65F E6ED 2AA2
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (MingW32)

iD8DBQFIJV/E46ioc5305a8RAsiwAJ4wjzYWngQWdfQ+EdGuJgXFyu5PYQCeIPbe
tqYB/w+brTtcjK0dLpoe/yY=
=P50g
-----END PGP SIGNATURE-----
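
Both backtraces above stop inside update_wall_time() (timekeeping.c lines 464
and 475). One way such a loop can spin for a very long time is unsigned
wraparound: if the clocksource reading falls behind cycle_last, the difference
wraps to a huge value. The sketch below assumes the loop has the shape
`while (offset >= cycle_interval)` with 64-bit unsigned cycle counts, which
the source lines printed by gdb suggest but do not prove.
catch_up_iterations() is a hypothetical helper that only counts iterations.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of passes a
 * "while (offset >= interval) offset -= interval;" loop would make.
 */
uint64_t catch_up_iterations(uint64_t now, uint64_t cycle_last,
			     uint64_t interval)
{
	/* Unsigned subtraction wraps modulo 2^64 when now < cycle_last. */
	uint64_t offset = now - cycle_last;

	return offset / interval;
}
```

With forward-moving time the count is small: 30 cycles at an interval of 10
gives 3 passes. With a reading one second (1,000,000 us) behind cycle_last and
an interval of 10,000 us, the count is on the order of 10^15 passes, which
would look like a hang at 100% CPU.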

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Jeff D. <jd...@ad...> - 2008-05-13 15:40:55

On Sat, May 10, 2008 at 10:41:40AM +0200, Bram Matthys (Syzop) wrote:
> I also saw this on my console (which does not react either, btw). I'm not sure
> when it appeared: at, or very shortly before or after, the time change:
> Stub registers -
>         0 - 621a
>         1 - 13
>         2 - 621a
>         3 - 6215
>         4 - 8
>         5 - bfae182c
>         6 - 0
>         7 - 7b
>         8 - 7b
>         9 - 0
>         10 - 0
>         11 - ffffffff
>         12 - 1000be
>         13 - 73
>         14 - 200246
>         15 - bfae1810
>         16 - 7b
> wait_stub_done : failed to wait for SIGTRAP, pid = 26141, n = 26141, errno = 0, status = 0x1c7f

For this one, try this patch:

Index: linux-2.6.22/arch/um/os-Linux/skas/process.c
===================================================================
--- linux-2.6.22.orig/arch/um/os-Linux/skas/process.c	2008-04-14 10:44:33.000000000 -0400
+++ linux-2.6.22/arch/um/os-Linux/skas/process.c	2008-05-13 11:37:35.000000000 -0400
@@ -55,7 +55,7 @@ static int ptrace_dump_regs(int pid)
  * Signals that are OK to receive in the stub - we'll just continue it.
  * SIGWINCH will happen when UML is inside a detached screen.
  */
-#define STUB_SIG_MASK (1 << SIGVTALRM)
+#define STUB_SIG_MASK ((1 << SIGVTALRM) | (1 << SIGWINCH))
 
 /* Signals that the stub will finish with - anything else is an error */
 #define STUB_DONE_MASK (1 << SIGTRAP)

I doubt it will fix the time problem.  I'm going to chase Vincent's
problem on the assumption that you're seeing the same thing.  When I
figure that out, we'll see how true that is.


> The old 2.6.20.1 uml's react fine when setting time backwards, btw (well..
> within reasonable limits)

UML got its timekeeping redone as part of the tickless work and I'm
still shaking out bugs...

       	    	       Jeff


-- 
Work email - jdike at linux dot intel dot com
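
The status value in the quoted wait_stub_done message can be decoded with the
standard wait macros. This is a minimal sketch, assuming a Linux host where
SIGWINCH is 28 (as on x86); stub_stop_signal() and stub_sig_ok() are
hypothetical names, and the mask mirrors the patched STUB_SIG_MASK.

```c
#include <signal.h>
#include <sys/wait.h>

/* Same definition as in the patch above. */
#define STUB_SIG_MASK ((1 << SIGVTALRM) | (1 << SIGWINCH))

/* Signal that stopped the traced child, or -1 if it was not stopped. */
int stub_stop_signal(int status)
{
	return WIFSTOPPED(status) ? WSTOPSIG(status) : -1;
}

/* Nonzero if the stub may simply be continued after this signal. */
int stub_sig_ok(int sig)
{
	return ((1 << sig) & STUB_SIG_MASK) != 0;
}
```

For status = 0x1c7f the low byte 0x7f marks a stopped child and the next byte,
0x1c, is signal 28. On x86 Linux that is SIGWINCH, which the unpatched mask
did not accept.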

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Sakari A. <sak...@sa...> - 2008-05-14 08:41:29

Bram Matthys (Syzop) wrote:
> Thanks for your reply.
> 
> $ grep HZ .config
> CONFIG_HZ=100
> # CONFIG_NO_HZ is not set
> 
> I've applied your patch against my 2.6.25 (vanilla)...
> patching file arch/um/os-Linux/time.c
> patching file arch/um/kernel/time.c
> Hunk #1 succeeded at 74 (offset -1 lines).
> Hunk #2 succeeded at 82 (offset -1 lines).
> and recompiled etc..
> 
> I saw Vincent's issue, and when I set the time about 5 seconds back, the UML
> freezes, uses 100% CPU, and doesn't respond at all. This is not entirely
> the same as what I had, though, because mine was still somewhat responsive...

Hi,

I think I have experienced the same problem. I don't have time now to 
investigate it further, but I have some info which may or may not be 
useful in debugging. So this is mainly just FYI.

I have three UML instances running on a host. First, they all were 
unresponsive simultaneously using all CPU time they could get. After a 
while they became responsive again. I could log in through SSH. The 
funny thing is that the date command showed correct date and time (as 
far as I remember, can't test it now as they are hung again) while the 
time in bash prompt was constant showing the time around the initial 
hang, which is the same on all three instances.

I think there was some NTP related activity on the host while this 
happened. The time the UMLs were showing was 23:xx:xx, don't know 
exactly. :(

---
May 13 22:58:56 retiisi ntpd[20630]: synchronized to 192.26.119.7, stratum 2
May 13 22:58:56 retiisi ntpd[20630]: time reset -5.151310 s
May 13 22:58:56 retiisi ntpd[20630]: kernel time sync enabled 0001
May 13 22:58:52 retiisi kernel: set_rtc_mmss: can't update from 1 to 58
May 13 22:58:56 retiisi last message repeated 4 times
May 13 22:59:27 retiisi kernel: set_rtc_mmss: can't update from 1 to 59
May 13 22:59:40 retiisi last message repeated 13 times
May 13 22:59:41 retiisi kernel: set_rtc_mmss: can't update from 2 to 59
May 13 22:59:59 retiisi last message repeated 18 times
May 13 23:02:29 retiisi ntpd[20630]: synchronized to 192.26.119.7, stratum 2
---

The host is 2.6.24 (skas4) and the clients are vanilla 2.6.24. Oh dear, 
I seem to have CONFIG_NO_HZ enabled...

-- 
Sakari Ailus
sak...@sa...

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Jeff D. <jd...@ad...> - 2008-05-19 16:16:52

On Wed, May 14, 2008 at 11:41:18AM +0300, Sakari Ailus wrote:
> I have three UML instances running on a host. First, they all were 
> unresponsive simultaneously using all CPU time they could get. After a 
> while they became responsive again. I could log in through SSH. The 
> funny thing is that the date command showed correct date and time (as 
> far as I remember, can't test it now as they are hung again) while the 
> time in bash prompt was constant showing the time around the initial 
> hang, which is the same on all three instances.

I reproduced and debugged a similar problem, resulting in the patch
below.  See if it makes any difference for you...

	       	  	Jeff

-- 
Work email - jdike at linux dot intel dot com

Index: 2.6/stable/arch/um/os-Linux/time.c
===================================================================
--- 2.6.orig/stable/arch/um/os-Linux/time.c	2008-05-14 14:55:56.000000000 -0400
+++ 2.6/stable/arch/um/os-Linux/time.c	2008-05-14 15:30:48.000000000 -0400
@@ -66,12 +66,21 @@ long long disable_timer(void)
 	return timeval_to_ns(&time.it_value);
 }
 
+static long long last_time;
+
 long long os_nsecs(void)
 {
 	struct timeval tv;
+	long long ret;
 
 	gettimeofday(&tv, NULL);
-	return timeval_to_ns(&tv);
+	ret = timeval_to_ns(&tv);
+
+	if((last_time != 0) && (last_time > ret))
+		ret = last_time;
+
+	last_time = ret;
+	return ret;
 }
 
 #ifdef UML_CONFIG_NO_HZ
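
The effect of this patch is that os_nsecs() never returns a smaller value than
it did on the previous call, even if the host's wall clock is stepped back.
Below is a minimal userspace sketch of the same clamp. clamp_monotonic() is a
hypothetical name, and the raw reading is passed in rather than taken from
gettimeofday() so that the behaviour can be checked deterministically.

```c
/*
 * Return `raw`, unless it is earlier than the last value handed out,
 * in which case repeat the last value. `*last_time` starts at 0.
 */
long long clamp_monotonic(long long *last_time, long long raw)
{
	long long ret = raw;

	/* Never go backwards relative to the previous return value. */
	if (*last_time != 0 && *last_time > ret)
		ret = *last_time;

	*last_time = ret;
	return ret;
}
```

Feeding it 100, 95, 120 returns 100, 100, 120. While the host clock is behind
the last value returned, the caller sees time stand still rather than run
backwards.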

Re: [uml-devel] umls unresponsive & consuming 100% cpu time

From: Bram M. (Syzop) <sy...@vu...> - 2008-05-31 09:17:11

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jeff,

Sorry for my late reply.

Just tried, and I think your patch fixed it (at least a test with 'date'
setting the time 2 seconds back on the host no longer causes it to hang).
If it causes any trouble in the next few days I'll let you know.

Regards,

	Bram.

Jeff Dike wrote:
| On Wed, May 14, 2008 at 11:41:18AM +0300, Sakari Ailus wrote:
|> I have three UML instances running on a host. First, they all were
|> unresponsive simultaneously using all CPU time they could get. After a
|> while they became responsive again. I could log in through SSH. The
|> funny thing is that the date command showed correct date and time (as
|> far as I remember, can't test it now as they are hung again) while the
|> time in bash prompt was constant showing the time around the initial
|> hang, which is the same on all three instances.
|
| I reproduced and debugged a similar problem, resulting in the patch
| below.  See if it makes any difference for you...
|
| 	       	  	Jeff
|

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (MingW32)

iD8DBQFIQReI46ioc5305a8RAqQpAKC0j9HcBGFxRgaT2yvXPC9E3W1FXACgg6q6
LAYiR6Z0MMuwIZfTax6ojhc=
=tSsa
-----END PGP SIGNATURE-----