From: Nix <ni...@es...> - 2008-04-25 19:59:56
|
I suspect this can go wrong anywhere, but it happens to have been a CBQ-triggered gettimeofday() while sending an arp that did it. (My ADSL router pretty much bombs the poor damn machine with ARP packets all the time.) #0 getnstimeofday (ts=0x8217d10) at include/linux/time.h:182 #1 0x080824b4 in ktime_get_ts (ts=0x8217d10) at kernel/hrtimer.c:122 #2 0x0808244b in ktime_get () at kernel/hrtimer.c:58 #3 0x08148d93 in cbq_dequeue (sch=0xdde3800) at include/net/pkt_sched.h:53 #4 0x08143355 in __qdisc_run (dev=0xded1800) at net/sched/sch_generic.c:85 #5 0x08137991 in dev_queue_xmit (skb=0xdc89380) at include/net/pkt_sched.h:89 #6 0x08185537 in arp_xmit (skb=0xdc89380) at net/ipv4/arp.c:668 #7 0x08185574 in arp_send (type=1, ptype=2054, dest_ip=3457507285, dev=0xded1800, src_ip=2668538048, dest_hw=0xdd65244 "", src_hw=0xded192c "\002`\227yâÁ", target_hw=0x0) at net/ipv4/arp.c:694 #8 0x08184fb2 in arp_solicit (neigh=0xdd65220, skb=0x0) at net/ipv4/arp.c:379 #9 0x0813c898 in neigh_timer_handler (arg=232149536) at net/core/neighbour.c:866 #10 0x08077afd in run_timer_softirq (h=0x823a3ec) at kernel/timer.c:664 #11 0x080740a4 in __do_softirq () at kernel/softirq.c:234 #12 0x0807412e in do_softirq () at kernel/softirq.c:271 #13 0x08074186 in irq_exit () at kernel/softirq.c:310 #14 0x080576fd in do_IRQ (irq=0, regs=0x8217ec4) at arch/um/kernel/irq.c:336 #15 0x080590a0 in timer_handler (sig=26, regs=0x8217ec4) at arch/um/kernel/time.c:28 #16 0x08064c84 in real_alarm_handler (sc=0x0) at arch/um/os-Linux/signal.c:93 #17 0x08064cac in alarm_handler (sig=26, sc=0x0) at arch/um/os-Linux/signal.c:108 #18 0x080654a0 in deliver_alarm () at arch/um/os-Linux/time.c:117 #19 0x08065640 in idle_sleep (nsecs=<value optimized out>) at arch/um/os-Linux/time.c:181 #20 0x080580d4 in default_idle () at arch/um/kernel/process.c:248 #21 0x080580fe in cpu_idle () at arch/um/kernel/process.c:256 #22 0x081afdee in rest_init () at init/main.c:453 #23 0x0804953b in start_kernel () at init/main.c:650 #24 0x0804a7e1 in start_kernel_proc (unused=0x0) at arch/um/kernel/skas/process.c:46 #25 0x08064261 in run_kernel_thread (fn=0x804a7b5 <start_kernel_proc>, arg=0x0, jmp_ptr=0x821d184) at arch/um/os-Linux/process.c:267 #26 0x08057ee0 in new_thread_handler () at arch/um/kernel/process.c:151 #27 0x00000000 in ?? () (gdb) frame 1 #1 0x080824b4 in ktime_get_ts (ts=0x8217d10) at kernel/hrtimer.c:122 122 getnstimeofday(ts); (gdb) print *ts $3 = {tv_sec = -1963377075, tv_nsec = 320917000} It's stuck inside timespec_add_ns(), never leaving: this has been inlined, so spying on its locals or parameters with GDB is pretty much a no-hoper :/ Still, it's obvious enough to me that __get_nsec_offset() is returning a negative value in getnstimeofday(), so timespec_add_ns() is looping, well, not *forever*, but near enough, trying to push that u64 back down to zero in units of NSEC_PER_SEC... I hope this gives you a bit more of a clue. |
From: Nix <ni...@es...> - 2008-04-25 22:06:32
|
On 25 Apr 2008, ni...@es... told this: > I suspect this can go wrong anywhere, but it happens to have been a > CBQ-triggered gettimeofday() while sending an arp that did it. (My ADSL > router pretty much bombs the poor damn machine with ARP packets all the > time.) Woo, it's happening a lot tonight. The CBQ/ARP problem recurred, and also this: (gdb) bt #0 0xb7f87410 in __kernel_vsyscall () #1 0x45f27b36 in gettimeofday () from /lib/libc.so.6 #2 0x08065417 in os_nsecs () at arch/um/os-Linux/time.c:79 #3 0x080590fc in itimer_read () at arch/um/kernel/time.c:77 #4 0x08084a5f in update_wall_time () at include/linux/clocksource.h:167 #5 0x08077ba1 in do_timer (ticks=1) at kernel/timer.c:929 #6 0x08086350 in tick_periodic (cpu=0) at kernel/time/tick-common.c:66 #7 0x08086382 in tick_handle_periodic (dev=0x821d384) at kernel/time/tick-common.c:82 #8 0x080590ed in um_timer (irq=0, dev=0x0) at arch/um/kernel/time.c:70 #9 0x0808d42f in handle_IRQ_event (irq=0, action=0xdc49480) at kernel/irq/handle.c:140 #10 0x0808d4ad in __do_IRQ (irq=0) at kernel/irq/handle.c:236 #11 0x080576f8 in do_IRQ (irq=0, regs=0x8217ec4) at arch/um/kernel/irq.c:335 #12 0x080590a0 in timer_handler (sig=26, regs=0x8217ec4) at arch/um/kernel/time.c:28 #13 0x08064c84 in real_alarm_handler (sc=0x0) at arch/um/os-Linux/signal.c:93 #14 0x08064cac in alarm_handler (sig=26, sc=0x0) at arch/um/os-Linux/signal.c:108 #15 0x080654a0 in deliver_alarm () at arch/um/os-Linux/time.c:117 #16 0x08065640 in idle_sleep (nsecs=<value optimized out>) at arch/um/os-Linux/time.c:181 #17 0x080580d4 in default_idle () at arch/um/kernel/process.c:248 #18 0x080580fe in cpu_idle () at arch/um/kernel/process.c:256 #19 0x081afdee in rest_init () at init/main.c:453 #20 0x0804953b in start_kernel () at init/main.c:650 #21 0x0804a7e1 in start_kernel_proc (unused=0x0) at arch/um/kernel/skas/process.c:46 #22 0x08064261 in run_kernel_thread (fn=0x804a7b5 <start_kernel_proc>, arg=0x0, jmp_ptr=0x821d184) at arch/um/os-Linux/process.c:267 #23 0x08057ee0 in new_thread_handler () at arch/um/kernel/process.c:151 #24 0x00000000 in ?? () It looks to me like os_nsecs()/gettimeofday()/et al are the only really important parts of this backtrace. -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Nix <ni...@es...> - 2008-04-26 18:34:00
|
On 25 Apr 2008, ni...@es... uttered the following: > On 25 Apr 2008, ni...@es... told this: > >> I suspect this can go wrong anywhere, but it happens to have been a >> CBQ-triggered gettimeofday() while sending an arp that did it. (My ADSL >> router pretty much bombs the poor damn machine with ARP packets all the >> time.) > > Woo, it's happening a lot tonight. The cause of this is almost certainly time-skewing on the *host* via adjtimex(). I stopped ntpd and there were no problems for half a day: I restarted it, and as soon as ntpd had synched and begun slewing the time (within a second of slewing beginning, probably less than that), *wham*: (gdb) bt #0 getnstimeofday (ts=0x8217d10) at include/linux/time.h:176 #1 0x080824b4 in ktime_get_ts (ts=0x8217d10) at kernel/hrtimer.c:122 #2 0x0808244b in ktime_get () at kernel/hrtimer.c:58 #3 0x08148d93 in cbq_dequeue (sch=0xddfe000) at include/net/pkt_sched.h:53 #4 0x08143355 in __qdisc_run (dev=0xded1800) at net/sched/sch_generic.c:85 #5 0x08137991 in dev_queue_xmit (skb=0xdf1b240) at include/net/pkt_sched.h:89 #6 0x08185537 in arp_xmit (skb=0xdf1b240) at net/ipv4/arp.c:668 #7 0x08185574 in arp_send (type=1, ptype=2054, dest_ip=3536983368, dev=0xded1800, src_ip=2668538048, dest_hw=0xddf2e44 "", src_hw=0xded192c "\002`\227yâÁ", target_hw=0x0) at net/ipv4/arp.c:694 #8 0x08184fb2 in arp_solicit (neigh=0xddf2e20, skb=0x0) at net/ipv4/arp.c:379 #9 0x0813c898 in neigh_timer_handler (arg=232730144) at net/core/neighbour.c:866 #10 0x08077afd in run_timer_softirq (h=0x823a3ec) at kernel/timer.c:664 #11 0x080740a4 in __do_softirq () at kernel/softirq.c:234 #12 0x0807412e in do_softirq () at kernel/softirq.c:271 #13 0x08074186 in irq_exit () at kernel/softirq.c:310 #14 0x080576fd in do_IRQ (irq=0, regs=0x8217ec4) at arch/um/kernel/irq.c:336 #15 0x080590a0 in timer_handler (sig=26, regs=0x8217ec4) at arch/um/kernel/time.c:28 #16 0x08064c84 in real_alarm_handler (sc=0x0) at arch/um/os-Linux/signal.c:93 #17 0x08064cac in alarm_handler (sig=26, sc=0x0) at arch/um/os-Linux/signal.c:108 #18 0x080654a0 in deliver_alarm () at arch/um/os-Linux/time.c:117 #19 0x08065640 in idle_sleep (nsecs=<value optimized out>) at arch/um/os-Linux/time.c:181 #20 0x080580d4 in default_idle () at arch/um/kernel/process.c:248 #21 0x080580fe in cpu_idle () at arch/um/kernel/process.c:256 #22 0x081afdee in rest_init () at init/main.c:453 #23 0x0804953b in start_kernel () at init/main.c:650 #24 0x0804a7e1 in start_kernel_proc (unused=0x0) at arch/um/kernel/skas/process.c:46 #25 0x08064261 in run_kernel_thread (fn=0x804a7b5 <start_kernel_proc>, arg=0x0, jmp_ptr=0x821d184) at arch/um/os-Linux/process.c:267 #26 0x08057ee0 in new_thread_handler () at arch/um/kernel/process.c:151 #27 0x00000000 in ?? () Obviously, given that it's now busy decrementing that 64-bit variable, stopping ntpd didn't help: UML was looping and would continue looping until we were fossilized or I killed it. UML seems to be vulnerable to microsecond-level time slews on the host. This is not ideal :/ -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Nix <ni...@es...> - 2008-04-27 14:25:08
|
On 26 Apr 2008, ni...@es... uttered the following: > On 25 Apr 2008, ni...@es... uttered the following: > >> On 25 Apr 2008, ni...@es... told this: >> >>> I suspect this can go wrong anywhere, but it happens to have been a >>> CBQ-triggered gettimeofday() while sending an arp that did it. (My ADSL >>> router pretty much bombs the poor damn machine with ARP packets all the >>> time.) >> >> Woo, it's happening a lot tonight. > > The cause of this is almost certainly time-skewing on the *host* via > adjtimex(). I stopped ntpd and there were no problems for half a day: I > restarted it, and as soon as ntpd had synched and begun slewing the time > (within a second of slewing beginning, probably less than that), *wham*: Time-stepping on the host is enough to cause trouble. I switched to using ntpdate -b to keep time in synch, and since then my UML instances have gone into mad loops only on the hour, when the ntpdate time-step should take place. -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Nix <ni...@es...> - 2008-05-02 23:21:29
|
On 2 May 2008, Jeff Dike stated: > On Fri, May 02, 2008 at 07:55:11PM +0100, Nix wrote: >> I'm trying something else now, arranging for os_nsecs() itself to do the >> never-backwards stuff on the assumption that something depends on >> monotonic timers not skipping backwards which presently they might >> (there are callers of os_nsecs() outside of os-Linux/time.c, notably in >> kernel/time.c, and currently they see an unadjusted time, jumping >> backwards and forwards and whatever). > > This sounds like a reasonable idea. With this patch (migrating most of the work into os_nsecs(), with a non-NO_HZ version doing skew computations too, atop your first patch, and making a couple of variables static for good measure), I still had no luck: Index: linux/arch/um/os-Linux/time.c =================================================================== --- linux.orig/arch/um/os-Linux/time.c 2008-05-02 16:06:14.000000000 +0100 +++ linux/arch/um/os-Linux/time.c 2008-05-02 16:28:29.000000000 +0100 @@ -73,15 +73,29 @@ return remain; } +static unsigned long long last_tick; + +#ifdef UML_CONFIG_NO_HZ long long os_nsecs(void) { struct timeval tv; + long long this_tick; gettimeofday(&tv, NULL); - return timeval_to_ns(&tv); + this_tick = timeval_to_ns(&tv); + + /* Ensure that host time going backwards is treated as if it had stood + * still. + */ + + if ((last_tick != 0) && (this_tick < last_tick)) + this_tick = last_tick; + + last_tick = this_tick; + + return this_tick; } -#ifdef UML_CONFIG_NO_HZ static int after_sleep_interval(struct timespec *ts) { return 0; @@ -98,15 +112,22 @@ } #else -unsigned long long last_tick; -unsigned long long skew; +static unsigned long long skew; +static int one_tick = UM_NSEC_PER_SEC / UM_HZ; -static void deliver_alarm(void) +long long os_nsecs(void) { - unsigned long long this_tick = os_nsecs(); - int one_tick = UM_NSEC_PER_SEC / UM_HZ; + struct timeval tv; + long long this_tick; + + gettimeofday(&tv, NULL); + this_tick = timeval_to_ns(&tv); + + /* Keep a running computation of the number of ticks lost due to host + * load, and ensure that host time going backwards is treated as if it + * had stood still. + */ - /* Protection against the host's time going backwards */ if ((last_tick != 0) && (this_tick < last_tick)) this_tick = last_tick; @@ -114,13 +135,20 @@ last_tick = this_tick - one_tick; skew += this_tick - last_tick; + last_tick = this_tick; + + return this_tick; +} +static void deliver_alarm(void) +{ + /* Recompute the skew values and wait until skew is near-zero again. */ + os_nsecs(); while (skew >= one_tick) { alarm_handler(SIGVTALRM, NULL); skew -= one_tick; } - last_tick = this_tick; } static unsigned long long sleep_time(unsigned long long nsecs) > Another thing you might try is the following: > get two extra windows > in one, get a "strace -p uml-pid -o strace.out -tt" ready > in two, get a "date ; sudo date -s "now-5 sec" ; date" ready > hit return in one, then in two, then in your UML window > when the UML gives you another prompt (or you've decided it's > hung), ^C the strace *slap* why didn't I think of that? ... unfortunately it's not terribly informative. Here, with an earlier gettimeofday() included for context: 00:15:10.606808 gettimeofday({1209770110, 606883}, NULL) = 0 [...] 00:15:10.804151 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == 133}], WSTOPPED|__WALL) = 9963 00:15:10.804271 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.804391 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.804499 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:10.804671 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:10.804754 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == 133}], WSTOPPED|__WALL) = 9963 00:15:10.804876 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.804994 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.805103 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:10.805249 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:10.805332 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == 133}], WSTOPPED|__WALL) = 9963 00:15:10.805453 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.805572 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.805680 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:10.805925 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:10.806020 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:10.806141 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.806257 ptrace(0x34 /* PTRACE_??? */, 9963, 0, 0xde42a6c) = 0 00:15:10.806393 write(36, "6\0\0\0\0000\255*\0\20\0\0\5\0\0\0\21\0\0\0\3\0\0\0\0000\244\0", 28) = 28 00:15:10.806613 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:10.806732 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:10.806867 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:10.806949 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:05.811501 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.811674 ptrace(0x34 /* PTRACE_??? */, 9963, 0, 0xde42a6c) = 0 00:15:05.812544 write(36, "6\0\0\0\0\0\255*\0\20\0\0\5\0\0\0\21\0\0\0\3\0\0\0\0\0\244\0", 28) = 28 00:15:05.812853 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.812983 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:05.813467 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:05.813640 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:05.813797 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.813918 ptrace(0x34 /* PTRACE_??? */, 9963, 0, 0xde42a6c) = 0 00:15:05.814055 write(36, "6\0\0\0\0\220\257*\0\20\0\0\5\0\0\0\21\0\0\0\3\0\0\0\0\220\243\0", 28) = 28 00:15:05.814276 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.814400 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:05.814559 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:05.814645 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:05.814764 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.814879 ptrace(0x34 /* PTRACE_??? */, 9963, 0, 0xde42a6c) = 0 00:15:05.815003 write(36, "6\0\0\0\0\20\255*\0\20\0\0\5\0\0\0\21\0\0\0\3\0\0\0\0 \244\0", 28) = 28 00:15:05.815198 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.815316 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:05.815453 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:05.815535 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:05.815654 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.815768 ptrace(0x34 /* PTRACE_??? */, 9963, 0, 0xde42a6c) = 0 00:15:05.815890 write(36, "6\0\0\0\0 \255*\0\20\0\0\5\0\0\0\21\0\0\0\3\0\0\0\0@\244\0", 28) = 28 00:15:05.816081 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.816234 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:05.816381 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:05.816463 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:05.816582 ptrace(PTRACE_GETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.816695 ptrace(0x34 /* PTRACE_??? */, 9963, 0, 0xde42a6c) = 0 00:15:05.816819 write(36, "6\0\0\0\0\300\260*\0\20\0\0\5\0\0\0\21\0\0\0\3\0\0\0\0p\241\0", 28) = 28 00:15:05.817013 ptrace(PTRACE_SETREGS, 9963, 0, 0xde42a28) = 0 00:15:05.817132 ptrace(0x1f /* PTRACE_??? */, 9963, 0, 0) = 0 00:15:05.817274 --- SIGCHLD (Child exited) @ 0 (0) --- 00:15:05.817357 waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], WSTOPPED|__WALL) = 9963 00:15:05.817574 --- SIGVTALRM (Virtual timer expired) @ 0 (0) --- 00:15:05.817842 gettimeofday({1209770105, 817925}, NULL) = 0 So there's basically nothing unusual here. The first gettimeofday() call after the jump, and *whoompf* we're off into the magic land of looping with no extra syscalls to speak of at all. Next attempt, tomorrow: run it under gdb with an LD_PRELOADed wrapper around gettimeofday() that raises some unusual signal (SIGILL maybe) when it spots time going backwards, then run it under gdb and trap that, and trace forwards from there... -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Jeff D. <jd...@ad...> - 2008-05-03 00:56:46
|
On Sat, May 03, 2008 at 12:21:15AM +0100, Nix wrote: > With this patch (migrating most of the work into os_nsecs(), with a > non-NO_HZ version doing skew computations too, atop your first patch, > and making a couple of variables static for good measure), I still had > no luck: >From a quick look, this seems right. And this would be the gold standard of preventing UML from seeing time going backwards. > *slap* why didn't I think of that? .... > ... unfortunately it's not terribly informative. Here, with an earlier > gettimeofday() included for context: > > So there's basically nothing unusual here. The first gettimeofday() call > after the jump, and *whoompf* we're off into the magic land of looping > with no extra syscalls to speak of at all. This was with a solid hang? The thing is still handling page faults from something at the end of that trace. I would try again, making sure it's a solid hang, and including enough of the trace so that you stop seeing page faults: waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}]) and system calls (WSTOPSIG(s) == 133) and see what's happening at that point. And if you never stop seeing page faults or system calls, then that's interesting too. Jeff -- Work email - jdike at linux dot intel dot com |
From: Jeff D. <jd...@ad...> - 2008-04-28 16:45:29
|
On Sat, Apr 26, 2008 at 07:31:44PM +0100, Nix wrote: > The cause of this is almost certainly time-skewing on the *host* via > adjtimex(). I stopped ntpd and there were no problems for half a day: I > restarted it, and as soon as ntpd had synched and begun slewing the time > (within a second of slewing beginning, probably less than that), *wham*: Can you try the patch below? Time should definitely not go backwards - it's a bug if it happens. If you want more a more definite indication about whether this is having an effect, stick a printk in that test and see if it shows up. Jeff -- Work email - jdike at linux dot intel dot com Index: linux-2.6.22/arch/um/os-Linux/time.c =================================================================== --- linux-2.6.22.orig/arch/um/os-Linux/time.c 2008-04-24 13:21:28.000000000 -0400 +++ linux-2.6.22/arch/um/os-Linux/time.c 2008-04-28 12:41:37.000000000 -0400 @@ -107,6 +107,10 @@ static void deliver_alarm(void) unsigned long long this_tick = os_nsecs(); int one_tick = UM_NSEC_PER_SEC / UM_HZ; + /* Protection against the host's time going backwards */ + if ((last_tick != 0) && (this_tick < last_tick)) + this_tick = last_tick; + if (last_tick == 0) last_tick = this_tick - one_tick; |
From: Nix <ni...@es...> - 2008-05-03 10:08:49
|
On 3 May 2008, Jeff Dike told this: > On Sat, May 03, 2008 at 12:21:15AM +0100, Nix wrote: >> With this patch (migrating most of the work into os_nsecs(), with a >> non-NO_HZ version doing skew computations too, atop your first patch, >> and making a couple of variables static for good measure), I still had >> no luck: > >>From a quick look, this seems right. And this would be the gold > standard of preventing UML from seeing time going backwards. Oh. And it still doesn't work. Damn. >> So there's basically nothing unusual here. The first gettimeofday() call >> after the jump, and *whoompf* we're off into the magic land of looping >> with no extra syscalls to speak of at all. > > This was with a solid hang? The thing is still handling page faults > from something at the end of that trace. I would try again, making > sure it's a solid hang, and including enough of the trace so that you > stop seeing page faults: > waitpid(9963, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}]) > and system calls (WSTOPSIG(s) == 133) and see what's happening at that > point. I gave it thirty seconds after the time change. After that gettimeofday() there is *nothing* (which is unsurprising; the kernel thread is stuck in a tight loop trying to decrement 2^64 down to 0 by repeated subtraction... there's not much time for syscalls there.) > And if you never stop seeing page faults or system calls, then that's > interesting too. Howzabout if we never see any after that gettimeofday()? :/ -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Jeff D. <jd...@ad...> - 2008-05-09 15:16:30
|
On Sat, May 03, 2008 at 11:08:24AM +0100, Nix wrote: > >>From a quick look, this seems right. And this would be the gold > > standard of preventing UML from seeing time going backwards. > > Oh. And it still doesn't work. Damn. Is it possible for me to get my hands on a UML which is doing this? Jeff -- Work email - jdike at linux dot intel dot com |
From: Nix <ni...@es...> - 2008-04-30 22:07:21
|
On 28 Apr 2008, Jeff Dike told this: > On Sat, Apr 26, 2008 at 07:31:44PM +0100, Nix wrote: >> The cause of this is almost certainly time-skewing on the *host* via >> adjtimex(). I stopped ntpd and there were no problems for half a day: I >> restarted it, and as soon as ntpd had synched and begun slewing the time >> (within a second of slewing beginning, probably less than that), *wham*: > > Can you try the patch below? Time should definitely not go backwards > - it's a bug if it happens. I suspect that just a slowing of the rate of passing of time had nasty negative consequences, because when ntpd is running resident all the time, it should slew, not skew: yet we were getting loops on a regular basis... > If you want more a more definite > indication about whether this is having an effect, stick a printk in > that test and see if it shows up. Done, and tested by stepping the time (five seconds --- five seconds per day! I have pendulum clocks that keep better time than that!), and, oops, instant loop as before, with this debugging patch applied, which did not fire... Index: i686-esperi/arch/um/os-Linux/time.c =================================================================== --- i686-esperi.orig/arch/um/os-Linux/time.c 2008-04-29 19:34:07.000000000 +0100 +++ i686-esperi/arch/um/os-Linux/time.c 2008-04-29 20:08:50.000000000 +0100 @@ -108,7 +108,11 @@ /* Protection against the host's time going backwards */ if ((last_tick != 0) && (this_tick < last_tick)) + { + printk(UM_KERN_ERR "Time went backwards: %llu -> %llu", + last_tick, this_tick); this_tick = last_tick; + } if (last_tick == 0) last_tick = this_tick - one_tick; getnstimeofday (ts=0x821ba2c) at include/linux/time.h:176 176 while(unlikely(ns >= NSEC_PER_SEC)) { (gdb) bt #0 getnstimeofday (ts=0x821ba2c) at include/linux/time.h:176 #1 0x080824e4 in ktime_get_ts (ts=0x821ba2c) at kernel/hrtimer.c:122 #2 0x0808247b in ktime_get () at kernel/hrtimer.c:58 #3 0x0814757c in tcf_act_police (skb=0xdc89800, a=0xded0e40, res=0x821bb68) at include/net/pkt_sched.h:53 #4 0x08146123 in tcf_action_exec (skb=0xdc89800, act=0xded0e40, res=0x821bb68) at net/sched/act_api.c:370 #5 0x08153779 in u32_classify (skb=0xdc89800, tp=0xde83f60, res=0x821bb68) at include/net/pkt_cls.h:127 #6 0x08144fed in tc_classify_compat (skb=0xdc89800, tp=0xde83f60, res=0x821bb68) at net/sched/sch_api.c:1203 #7 0x08145032 in tc_classify (skb=0xdc89800, tp=0xde83f60, res=0x821bb68) at net/sched/sch_api.c:1228 #8 0x08150a1a in ingress_enqueue (skb=0xdc89800, sch=0xdedc760) at net/sched/sch_ingress.c:77 #9 0x08137c31 in ing_filter (skb=0xfffffffe) at net/core/dev.c:1973 #10 0x08137db0 in netif_receive_skb (skb=0xdc89800) at net/core/dev.c:1994 #11 0x08137ecc in process_backlog (napi=0x8240650, quota=64) at net/core/dev.c:2132 #12 0x08137fc3 in net_rx_action (h=0x823a3fc) at net/core/dev.c:2202 #13 0x080740d4 in __do_softirq () at kernel/softirq.c:234 #14 0x0807415e in do_softirq () at kernel/softirq.c:271 #15 0x080741b6 in irq_exit () at kernel/softirq.c:310 #16 0x080576fd in do_IRQ (irq=5, regs=0x821bc6c) at arch/um/kernel/irq.c:336 #17 0x08057367 in sigio_handler (sig=29, regs=0x821bc6c) at arch/um/kernel/irq.c:95 #18 0x08064c07 in sig_handler_common (sig=29, sc=0x821bd24) at arch/um/os-Linux/signal.c:48 #19 0x08064c44 in sig_handler (sig=29, sc=0x821bd24) at arch/um/os-Linux/signal.c:80 #20 0x08064d67 in handle_signal (sig=-2, sc=0x821bd24) at arch/um/os-Linux/signal.c:157 #21 0x08066263 in hard_handler (sig=29) at arch/um/os-Linux/sys-i386/signal.c:12 #22 <signal handler called> #23 getnstimeofday (ts=0xde3feec) at include/linux/time.h:182 #24 0x08084306 in do_gettimeofday (tv=0xde3ff04) at kernel/time/timekeeping.c:118 #25 0x0807396c in sys_gettimeofday (tv=0xbfdd7008, tz=0x0) at kernel/time.c:103 #26 0x0805a55c in handle_syscall (r=0xde78628) at arch/um/kernel/skas/syscall.c:35 #27 0x08066cdf in handle_trap (pid=26156, regs=0xde78628, local_using_sysemu=2) at arch/um/os-Linux/skas/process.c:202 #28 0x0806716b in userspace (regs=0xde78628) at arch/um/os-Linux/skas/process.c:418 #29 0x08057f5a in fork_handler () at arch/um/kernel/process.c:179 #30 0x00000000 in ?? () -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Jeff D. <jd...@ad...> - 2008-05-01 15:14:09
|
On Wed, Apr 30, 2008 at 10:49:27PM +0100, Nix wrote: > Done, and tested by stepping the time (five seconds --- five seconds per > day! I have pendulum clocks that keep better time than that!), and, > oops, instant loop as before, with this debugging patch applied, which > did not fire... Hmmm. I stepped the time (date -set "now+5sec") and nothing bad happened. Is that more or less what you did? Can you send me your config? Jeff -- Work email - jdike at linux dot intel dot com |
From: vincent-perrier <vin...@cl...> - 2008-05-09 17:15:28
|
For the "date --set" back 5 sec on the host that freezes an uml indefinitely (in kernel/time/timekeeping.c, update_wall_time,) then download plug_and_play_clownix_network_03 at http://clownix.net On my host, it does get stuck, but it may depend also on host and host kernel, I have: uml_clownix_net$ uname -a Linux localhost 2.6.22.15.tex1 #1 SMP Sat Dec 15 13:15:05 CST 2007 i686 Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz GNU/Linux The download is long, the untar too, but the freeze is garanteed! to start after download: "./start_clownix_net virtual_platform_configs/single_machine" Hope you try it! On Fri, 2008-05-09 at 11:16 -0400, Jeff Dike wrote: > On Sat, May 03, 2008 at 11:08:24AM +0100, Nix wrote: > > >>From a quick look, this seems right. And this would be the gold > > > standard of preventing UML from seeing time going backwards. > > > > Oh. And it still doesn't work. Damn. > > Is it possible for me to get my hands on a UML which is doing this? > > Jeff > |
From: Nix <ni...@es...> - 2008-05-09 20:33:07
|
On 9 May 2008, vin...@cl... verbalised: > The download is long, the untar too, but the freeze is garanteed! If this doesn't work I can give you an account on, hm, the box which freezes uses a UML for its network link so if you flip the time on it you'd get cut off... and on the other machine it doesn't freeze at all, even if I slam the time backwards and forwards in 5s increments constantly. I wonder if I have to be pushing network packets around or something before it goes wrong? (I can certainly send you the binary that's going wrong for me, and induce a core dump when it's gone wrong and send you that: it'll be 90Mb+ though. I'm not sure how useful that is.) -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Jeff D. <jd...@ad...> - 2008-05-14 17:24:09
|
On Fri, May 09, 2008 at 07:13:22PM +0200, vincent-perrier wrote: > The download is long, the untar too, but the freeze is garanteed! > > to start after download: > > "./start_clownix_net virtual_platform_configs/single_machine" > > Hope you try it! Why is it mucking with root's ssh keys: RSA_PUB=/root/.ssh/id_rsa.pub RSA_PRIV=/root/.ssh/id_rsa Jeff -- Work email - jdike at linux dot intel dot com |
From: vincent-perrier <vin...@cl...> - 2008-05-14 18:04:04
|
Hello, I know that I should not work as root, but I am too used to it, so my example is done with the root user. I am not sure that the network simulator works with another user for the version you have, but I have checked that for next one, a normal user will do as long as the daemon is owned by root and is suid root. The daemon has to create taps, open a socket to ping, and configure an ip address for tap, so it will always have to be suid root even for a normal user. The RSA files are copied into the machines at startup so as to be able to scp or ssh the machine without any password. I can garantee that the software does nothing nasty with those files. On Wed, 2008-05-14 at 13:23 -0400, Jeff Dike wrote: > On Fri, May 09, 2008 at 07:13:22PM +0200, vincent-perrier wrote: > > The download is long, the untar too, but the freeze is garanteed! > > > > to start after download: > > > > "./start_clownix_net virtual_platform_configs/single_machine" > > > > Hope you try it! > > Why is it mucking with root's ssh keys: > > RSA_PUB=/root/.ssh/id_rsa.pub > RSA_PRIV=/root/.ssh/id_rsa > > Jeff > |
From: vincent-perrier <vin...@cl...> - 2008-05-14 18:18:17
|
I forgot to tell you, if you launch the daemon as a normal user, the RSA files of this normal user will be used, and not the root ones. And also, the daemon has to mount a file when it creates the ubdb config file, one more reason to have the root sticky bit. This tool is made for network study and tests, a standalone PC is good enough to use it and observe routing daemons at work. On Wed, 2008-05-14 at 13:23 -0400, Jeff Dike wrote: > On Fri, May 09, 2008 at 07:13:22PM +0200, vincent-perrier wrote: > > The download is long, the untar too, but the freeze is garanteed! > > > > to start after download: > > > > "./start_clownix_net virtual_platform_configs/single_machine" > > > > Hope you try it! > > Why is it mucking with root's ssh keys: > > RSA_PUB=/root/.ssh/id_rsa.pub > RSA_PRIV=/root/.ssh/id_rsa > > Jeff > |
From: vincent-perrier <vin...@cl...> - 2008-05-14 18:29:28
|
Sorry if you receive this message twice, but I sent 2 messages following each other by a few minutes, and received only the second one, this was the first one: Hello, I know that I should not work as root, but I am too used to it, so my example is done with the root user. I am not sure that the network simulator works with another user for the version you have, but I have checked that for next one, a normal user will do as long as the daemon is owned by root and is suid root. The daemon has to create taps, open a socket to ping, and configure an ip address for tap, so it will always have to be suid root even for a normal user. The RSA files are copied into the machines at startup so as to be able to scp or ssh the machine without any password. Those files are given to the machines through an ubdb config file at uml start. I can garantee that the software does nothing nasty with those files. On Wed, 2008-05-14 at 13:23 -0400, Jeff Dike wrote: > On Fri, May 09, 2008 at 07:13:22PM +0200, vincent-perrier wrote: > > The download is long, the untar too, but the freeze is garanteed! > > > > to start after download: > > > > "./start_clownix_net virtual_platform_configs/single_machine" > > > > Hope you try it! > > Why is it mucking with root's ssh keys: > > RSA_PUB=/root/.ssh/id_rsa.pub > RSA_PRIV=/root/.ssh/id_rsa > > Jeff > |
From: vincent-perrier <vin...@cl...> - 2008-05-19 18:54:56
|
You are right, the private key is useless for my purposes, the clownix daemon just has to copy the public key to the ssh authorized keys file inside the UML guest machines, then the access to the UML guests can be done without password. In my next version of the clownix network, I will not "muck" with the private rsa key. > > > > Why is it mucking with root's ssh keys: > > > > RSA_PUB=/root/.ssh/id_rsa.pub > > RSA_PRIV=/root/.ssh/id_rsa > > > > Jeff > > |
From: Nix <ni...@es...> - 2008-05-01 23:35:18
|
On 1 May 2008, Jeff Dike outgrape: > On Wed, Apr 30, 2008 at 10:49:27PM +0100, Nix wrote: >> Done, and tested by stepping the time (five seconds --- five seconds per >> day! I have pendulum clocks that keep better time than that!), and, >> oops, instant loop as before, with this debugging patch applied, which >> did not fire... > > Hmmm. I stepped the time (date -set "now+5sec") and nothing bad > happened. Is that more or less what you did? loki:/tmp# /usr/bin/ntpdate -bv hades 30 Apr 22:45:52 ntpdate[8833]: ntpdate 4.2.4p4@1.1520-o Fri Feb 22 18:37:11 UTC 2008 (1) 30 Apr 22:45:52 ntpdate[8833]: step time server 192.168.14.18 offset -5.728539 sec so in effect yes :) Hm. A five-second skip forwards via date --set works fine. A five-second skip *backwards*, and instant boom. (I suspect from earlier failures that sometimes time-slewing can cause this too, because normally I run ntpd all the time, and that should only slew the time under normal circumstances, never skip it: yet UML 2.6.24+ visited hang city every few days.) I've been busy birthdaying today, but tomorrow I'll instrument things and figure out just where a backward skip hangs and why (assuming you haven't already done so: it's my own damn fault for being so laggardly, I should have done this weeks ago). > Can you send me your config? Here: CONFIG_DEFCONFIG_LIST="arch/$ARCH/defconfig" CONFIG_GENERIC_HARDIRQS=y CONFIG_UML=y CONFIG_MMU=y CONFIG_NO_IOMEM=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_IRQ_RELEASE_METHOD=y CONFIG_HZ=100 CONFIG_MPENTIUMIII=y CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=5 CONFIG_X86_XADD=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_P6_NOP=y CONFIG_X86_TSC=y CONFIG_X86_CMOV=y CONFIG_X86_MINIMUM_CPU_FAMILY=6 CONFIG_X86_DEBUGCTLMSR=y CONFIG_UML_X86=y CONFIG_X86_32=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_ARCH_HAS_SC_SIGNALS=y CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_ZONE_DMA_FLAG=0 CONFIG_VIRT_TO_BUS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_LD_SCRIPT_DYN=y CONFIG_BINFMT_ELF=y CONFIG_MCONSOLE=y CONFIG_KERNEL_STACK_ORDER=2 CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=128 CONFIG_LOCALVERSION="" CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_LOG_BUF_SHIFT=14 CONFIG_CGROUPS=y CONFIG_GROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_RT_GROUP_SCHED=y CONFIG_USER_SCHED=y CONFIG_RELAY=y CONFIG_NAMESPACES=y CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLAB=y CONFIG_PROC_PAGE_MONITOR=y CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y CONFIG_BASE_SMALL=0 CONFIG_BLOCK=y CONFIG_BLK_DEV_IO_TRACE=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_DEADLINE=y CONFIG_DEFAULT_DEADLINE=y CONFIG_DEFAULT_IOSCHED="deadline" CONFIG_CLASSIC_RCU=y CONFIG_BLK_DEV=y CONFIG_BLK_DEV_UBD=y CONFIG_BLK_DEV_COW_COMMON=y CONFIG_STDERR_CONSOLE=y CONFIG_STDIO_CONSOLE=y CONFIG_SSL=y CONFIG_NULL_CHAN=y CONFIG_PORT_CHAN=y CONFIG_PTY_CHAN=y CONFIG_TTY_CHAN=y CONFIG_XTERM_CHAN=y CONFIG_CON_ZERO_CHAN="fd:0,fd:1" CONFIG_CON_CHAN="tty" CONFIG_SSL_CHAN="pty" CONFIG_UNIX98_PTYS=y CONFIG_UML_RANDOM=y CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug" CONFIG_STANDALONE=y CONFIG_PREVENT_FIRMWARE_BUILD=y CONFIG_NET=y CONFIG_PACKET=y CONFIG_PACKET_MMAP=y CONFIG_UNIX=y CONFIG_INET=y CONFIG_IP_MULTICAST=y CONFIG_IP_ADVANCED_ROUTER=y CONFIG_ASK_IP_FIB_HASH=y CONFIG_IP_FIB_HASH=y CONFIG_IP_MULTIPLE_TABLES=y CONFIG_SYN_COOKIES=y CONFIG_INET_DIAG=y CONFIG_INET_TCP_DIAG=y CONFIG_TCP_CONG_ADVANCED=y CONFIG_TCP_CONG_CUBIC=y CONFIG_DEFAULT_CUBIC=y CONFIG_DEFAULT_TCP_CONG="cubic" CONFIG_NETFILTER=y CONFIG_NETFILTER_ADVANCED=y CONFIG_NF_CONNTRACK=y CONFIG_NF_CT_ACCT=y CONFIG_NF_CONNTRACK_MARK=y CONFIG_NF_CT_PROTO_UDPLITE=y CONFIG_NF_CONNTRACK_FTP=y CONFIG_NF_CONNTRACK_IRC=y CONFIG_NF_CONNTRACK_SIP=y CONFIG_NETFILTER_XTABLES=y CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y CONFIG_NETFILTER_XT_TARGET_MARK=y CONFIG_NETFILTER_XT_TARGET_NFQUEUE=y CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=y CONFIG_NETFILTER_XT_MATCH_CONNMARK=y CONFIG_NETFILTER_XT_MATCH_HELPER=y CONFIG_NETFILTER_XT_MATCH_IPRANGE=y CONFIG_NETFILTER_XT_MATCH_LIMIT=y CONFIG_NETFILTER_XT_MATCH_MARK=y CONFIG_NETFILTER_XT_MATCH_OWNER=y CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y CONFIG_NETFILTER_XT_MATCH_QUOTA=y CONFIG_NETFILTER_XT_MATCH_STATE=y CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=y CONFIG_NF_CONNTRACK_IPV4=y CONFIG_NF_CONNTRACK_PROC_COMPAT=y CONFIG_IP_NF_IPTABLES=y CONFIG_IP_NF_MATCH_RECENT=y CONFIG_IP_NF_MATCH_ADDRTYPE=y CONFIG_IP_NF_FILTER=y CONFIG_IP_NF_TARGET_REJECT=y CONFIG_IP_NF_TARGET_LOG=y CONFIG_NF_NAT=y CONFIG_NF_NAT_NEEDED=y CONFIG_IP_NF_TARGET_REDIRECT=y CONFIG_NF_NAT_FTP=y CONFIG_NF_NAT_IRC=y CONFIG_NF_NAT_SIP=y CONFIG_IP_NF_MANGLE=y CONFIG_IP_NF_TARGET_ECN=y CONFIG_NET_SCHED=y CONFIG_NET_SCH_CBQ=y CONFIG_NET_SCH_HTB=y CONFIG_NET_SCH_HFSC=y CONFIG_NET_SCH_PRIO=y CONFIG_NET_SCH_RED=y CONFIG_NET_SCH_SFQ=y CONFIG_NET_SCH_TEQL=y CONFIG_NET_SCH_TBF=y CONFIG_NET_SCH_GRED=y CONFIG_NET_SCH_DSMARK=y CONFIG_NET_SCH_INGRESS=y CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=y CONFIG_NET_CLS_TCINDEX=y CONFIG_NET_CLS_ROUTE4=y CONFIG_NET_CLS_ROUTE=y CONFIG_NET_CLS_FW=y CONFIG_NET_CLS_U32=y CONFIG_CLS_U32_PERF=y CONFIG_CLS_U32_MARK=y CONFIG_NET_CLS_RSVP=y CONFIG_NET_EMATCH=y CONFIG_NET_EMATCH_STACK=32 CONFIG_NET_EMATCH_CMP=y CONFIG_NET_EMATCH_NBYTE=y CONFIG_NET_EMATCH_U32=y CONFIG_NET_EMATCH_META=y CONFIG_NET_CLS_ACT=y CONFIG_NET_ACT_POLICE=y CONFIG_NET_ACT_GACT=y CONFIG_NET_ACT_PEDIT=y CONFIG_NET_SCH_FIFO=y CONFIG_FIB_RULES=y CONFIG_UML_NET=y CONFIG_UML_NET_TUNTAP=y CONFIG_NETDEVICES=y CONFIG_DUMMY=y CONFIG_TUN=y CONFIG_EXT2_FS=y CONFIG_EXT2_FS_XATTR=y CONFIG_EXT2_FS_POSIX_ACL=y CONFIG_EXT2_FS_SECURITY=y CONFIG_FS_MBCACHE=y CONFIG_FS_POSIX_ACL=y CONFIG_DNOTIFY=y CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y CONFIG_QUOTA=y CONFIG_PRINT_QUOTA_WARNING=y CONFIG_QUOTACTL=y CONFIG_GENERIC_ACL=y CONFIG_PROC_FS=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_TMPFS_POSIX_ACL=y CONFIG_CONFIGFS_FS=y CONFIG_NETWORK_FILESYSTEMS=y CONFIG_NFS_FS=y CONFIG_NFS_V3=y CONFIG_NFS_V3_ACL=y CONFIG_LOCKD=y CONFIG_LOCKD_V4=y CONFIG_NFS_ACL_SUPPORT=y CONFIG_NFS_COMMON=y CONFIG_SUNRPC=y CONFIG_MSDOS_PARTITION=y CONFIG_KEYS=y CONFIG_SECURITY=y CONFIG_SECURITY_CAPABILITIES=y CONFIG_SECURITY_FILE_CAPABILITIES=y CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=16384 CONFIG_CRYPTO=y CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_SHA1=y CONFIG_BITREVERSE=y CONFIG_CRC32=y CONFIG_PLIST=y CONFIG_HAS_DMA=y CONFIG_ENABLE_WARN_DEPRECATED=y CONFIG_ENABLE_MUST_CHECK=y CONFIG_DEBUG_FS=y CONFIG_DEBUG_KERNEL=y CONFIG_DEBUG_BUGVERBOSE=y CONFIG_DEBUG_INFO=y CONFIG_FRAME_POINTER=y |
From: Jeff D. <jd...@ad...> - 2008-05-02 16:30:38
|
On Fri, May 02, 2008 at 12:34:54AM +0100, Nix wrote: > loki:/tmp# /usr/bin/ntpdate -bv hades > 30 Apr 22:45:52 ntpdate[8833]: ntpdate 4.2.4p4@1.1520-o Fri Feb 22 18:37:11 UTC 2008 (1) > 30 Apr 22:45:52 ntpdate[8833]: step time server 192.168.14.18 offset -5.728539 sec > > so in effect yes :) > > Hm. A five-second skip forwards via date --set works fine. A five-second > skip *backwards*, and instant boom. I tried both directions for the hell of it. With your config, I'm seeing a hang until the system time catches up to what UML thought it should have been in the first place. But it's only a few seconds, not forever. However, stracing it did reveal a bogus interval trying to be set, which the patch below fixes. It doesn't cause any behavior change here, so YMMV. This includes the previous patch, which I think is a good idea anyway, so back that out and drop this in its place. Jeff Index: linux-2.6.22/arch/um/os-Linux/time.c =================================================================== --- linux-2.6.22.orig/arch/um/os-Linux/time.c 2008-04-24 13:21:28.000000000 -0400 +++ linux-2.6.22/arch/um/os-Linux/time.c 2008-05-02 12:21:01.000000000 -0400 @@ -107,6 +107,10 @@ static void deliver_alarm(void) unsigned long long this_tick = os_nsecs(); int one_tick = UM_NSEC_PER_SEC / UM_HZ; + /* Protection against the host's time going backwards */ + if ((last_tick != 0) && (this_tick < last_tick)) + this_tick = last_tick; + if (last_tick == 0) last_tick = this_tick - one_tick; @@ -149,6 +153,9 @@ static int after_sleep_interval(struct t start_usecs = usec; start_usecs -= skew / UM_NSEC_PER_USEC; + if (start_usecs < 0) + start_usecs = 0; + tv = ((struct timeval) { .tv_sec = start_usecs / UM_USEC_PER_SEC, .tv_usec = start_usecs % UM_USEC_PER_SEC }); interval = ((struct itimerval) { { 0, usec }, tv }); -- Work email - jdike at linux dot intel dot com |
From: Jeff D. <jd...@ad...> - 2008-05-14 19:39:34
|
I finally reproduced this using 2.6.25.1 (2.6.25-mm1 was no good) with your config. The patch below fixes it for me. You'll notice a certain similarity between this and a previous patch that you posted. It's not clear to me why yours didn't work. Jeff -- Work email - jdike at linux dot intel dot com Index: 2.6/stable/arch/um/os-Linux/time.c =================================================================== --- 2.6.orig/stable/arch/um/os-Linux/time.c 2008-05-14 14:55:56.000000000 -0400 +++ 2.6/stable/arch/um/os-Linux/time.c 2008-05-14 15:30:48.000000000 -0400 @@ -66,12 +66,21 @@ long long disable_timer(void) return timeval_to_ns(&time.it_value); } +static long long last_time; + long long os_nsecs(void) { struct timeval tv; + long long ret; gettimeofday(&tv, NULL); - return timeval_to_ns(&tv); + ret = timeval_to_ns(&tv); + + if((last_time != 0) && (last_time > ret)) + ret = last_time; + + last_time = ret; + return ret; } #ifdef UML_CONFIG_NO_HZ |
From: Nix <ni...@es...> - 2008-05-02 18:55:24
|
On 2 May 2008, Jeff Dike verbalised: > With your config, I'm seeing a hang until the system time catches up > to what UML thought it should have been in the first place. But it's > only a few seconds, not forever. This is true sometimes, but not always: I just tried twice and got a rapid recovery the first time, but not the second. Trace the second time: 0x08084b89 in update_wall_time () at kernel/time/timekeeping.c:475 475 clock->error -= clock->xtime_interval << (TICK_LENGTH_SHIFT - clock->shift); (gdb) bt #0 0x08084b89 in update_wall_time () at kernel/time/timekeeping.c:475 #1 0x08077bd1 in do_timer (ticks=1) at kernel/timer.c:929 #2 0x08086314 in tick_periodic (cpu=0) at kernel/time/tick-common.c:66 #3 0x08086346 in tick_handle_periodic (dev=0x821d384) at kernel/time/tick-common.c:82 #4 0x080590ed in um_timer (irq=0, dev=0x0) at arch/um/kernel/time.c:70 #5 0x0808d3f3 in handle_IRQ_event (irq=0, action=0xdc49480) at kernel/irq/handle.c:140 #6 0x0808d471 in __do_IRQ (irq=0) at kernel/irq/handle.c:236 #7 0x080576f8 in do_IRQ (irq=0, regs=0x821bc80) at arch/um/kernel/irq.c:335 #8 0x080590a0 in timer_handler (sig=26, regs=0x821bc80) at arch/um/kernel/time.c:28 #9 0x08064c84 in real_alarm_handler (sc=0x821bd24) at arch/um/os-Linux/signal.c:93 #10 0x08064cac in alarm_handler (sig=26, sc=0x821bd24) at arch/um/os-Linux/signal.c:108 #11 0x08064d67 in handle_signal (sig=32, sc=0x821bd24) at arch/um/os-Linux/signal.c:157 #12 0x08066263 in hard_handler (sig=26) at arch/um/os-Linux/sys-i386/signal.c:12 #13 <signal handler called> #14 getnstimeofday (ts=0xde3beec) at include/linux/time.h:182 #15 0x08084306 in do_gettimeofday (tv=0xde3bf04) at kernel/time/timekeeping.c:118 #16 0x0807396c in sys_gettimeofday (tv=0xbf8a4ad8, tz=0x0) at kernel/time.c:103 #17 0x0805a55c in handle_syscall (r=0xde66e28) at arch/um/kernel/skas/syscall.c:35 #18 0x08066cdf in handle_trap (pid=13310, regs=0xde66e28, local_using_sysemu=2) at arch/um/os-Linux/skas/process.c:202 #19 0x0806716b in userspace (regs=0xde66e28) at arch/um/os-Linux/skas/process.c:418 #20 0x08057f5a in fork_handler () at arch/um/kernel/process.c:179 #21 0x00000000 in ?? () (gdb) quit > However, stracing it did reveal a bogus interval trying to be set, > which the patch below fixes. It doesn't cause any behavior change > here, so YMMV. > > This includes the previous patch, which I think is a good idea anyway, > so back that out and drop this in its place. No behavioural change :( I'm trying something else now, arranging for os_nsecs() itself to do the never-backwards stuff on the assumption that something depends on monotonic timers not skipping backwards which presently they might (there are callers of os_nsecs() outside of os-Linux/time.c, notably in kernel/time.c, and currently they see an unadjusted time, jumping backwards and forwards and whatever). -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |
From: Jeff D. <jd...@ad...> - 2008-05-02 19:57:29
|
On Fri, May 02, 2008 at 07:55:11PM +0100, Nix wrote: > No behavioural change :( Oh well. > I'm trying something else now, arranging for os_nsecs() itself to do the > never-backwards stuff on the assumption that something depends on > monotonic timers not skipping backwards which presently they might > (there are callers of os_nsecs() outside of os-Linux/time.c, notably in > kernel/time.c, and currently they see an unadjusted time, jumping > backwards and forwards and whatever). This sounds like a reasonable idea. Another thing you might try is the following: get two extra windows in one, get a "strace -p uml-pid -o strace.out -tt" ready in two, get a "date ; sudo date -s "now-5 sec" ; date" ready hit return in one, then in two, then in your UML window when the UML gives you another prompt (or you've decided it's hung), ^C the strace Look at the times reported by the dates, find the time change in the strace log, and send the section following that (and a bit before) to me. Jeff -- Work email - jdike at linux dot intel dot com |
From: Nix <ni...@es...> - 2008-05-14 20:12:33
|
On 14 May 2008, Jeff Dike verbalised: > I finally reproduced this using 2.6.25.1 (2.6.25-mm1 was no good) with > your config. YAY! (I wonder why this was so .config-dependent? You'd think it would trigger on anything, but I couldn't even make it happen on all my hosts...) > The patch below fixes it for me. You'll notice a certain similarity > between this and a previous patch that you posted. It's not clear to > me why yours didn't work. As far as I can tell it's identical in behaviour in the !NO_HZ case. Annoyingly, now I've upgraded the host to 2.6.25 (hence sans skas3), timings or something have changed such that I now can't make the freeze happen at all. I'll reboot the host into 2.6.24.x shortly to get skas back and try again. Oh, and, it's pedantic of me, I know, but what does this do if time goes backwards in the NO_HZ case? (Or is handling that a 2.6.26 thing?) |
From: Nix <ni...@es...> - 2008-05-14 20:31:43
|
[vincent-perrier <vin...@cl...> removed from Cc;, his MTA says `Client host rejected: AP0002 Please use your ISP mailserver' only I don't *have* an ISP mailserver.] On 14 May 2008, ni...@es... verbalised: > Annoyingly, now I've upgraded the host to 2.6.25 (hence sans skas3), > timings or something have changed such that I now can't make the freeze > happen at all. I'll reboot the host into 2.6.24.x shortly to get skas > back and try again. It seems to fix it with a 2.6.24 host too, but I still can't see why it works when my previous fix didn't. (Actually, while I'm trying it with lots of differently mangled variations on a host .config theme, is there a skas3 or skas4 against 2.6.25? I may as well add that into the mix and see if things still work.) I'll keep ntpd running here for a day or so as well, and see if it stalls out. (The host loses nearly half a minute a day without ntpd, it's awful.) -- `If you are having a "ua luea luea le ua le" kind of day, I can only assume that you are doing no work due [to] incapacitating nausea caused by numerous lazy demons.' --- Frossie |