From: Marcus B. <mbl...@gm...> - 2006-02-26 18:58:11
Attachments:
console.o.uml-t0
Hi all,

After a maintenance update of my application server's kernel, my guests panic with signal 7. The problem is reproducible and seems related to I/O traffic: it happens, for example, when running "apt-get update" in a guest.

New host kernel:
  2.6.15.4 + squashfs

Different guest kernels all suffer from the problem. Versions in detail:
  2.6.7 + various patches (used to work fine before)
  2.6.15.4 + squashfs
  2.6.15.4 + squashfs + 2.6.15-bs2

Old host kernel:
  2.6.8.1 + various patches + skas3

Short excerpt from the crash:

[42949509.890000] Kernel panic - not syncing: Kernel mode signal 7
[42949509.890000]
[42949509.890000] EIP: 0073:[<402237ee>] CPU: 0 Not tainted ESP: 007b:bf8038e4 EFLAGS: 00000283
[42949509.890000] Not tainted
[42949509.890000] EAX: ffffffda EBX: 00000005 ECX: bf8039bc EDX: bf80393c
[42949509.890000] ESI: 00000000 EDI: bf803934 EBP: bf803a3c DS: 007b ES: 007b
[42949509.890000] 08eb7770: [<0807ce4c>] notifier_call_chain+0x1c/0x40
[42949509.890000] 08eb778c: [<0806ef6b>] panic+0x4b/0xf0
[42949509.890000] 08eb77a0: [<08059e3b>] relay_signal+0x7b/0x80
[42949509.890000] 08eb77bc: [<08059e8f>] bus_handler+0x4f/0x60
[42949509.890000] 08eb77d0: [<0805ce28>] sig_handler_common_skas+0x78/0xd0
[42949509.890000] 08eb77f0: [<0806770f>] sig_handler+0xf/0x20
[42949509.890000] 08eb77fc: [<081a8c08>] __restore+0x0/0x8
[42949509.890000] 08eb783c: [<080b313e>] __pollwait+0x3e/0xb0
[42949509.890000] 08eb789c: [<08057eb4>] change_signals+0x34/0x60
[42949509.890000] 08eb7948: [<08092a57>] __pagevec_lru_add+0x97/0xd0
[42949509.890000] 08eb79a0: [<08057f32>] enable_mask+0x32/0x40
[42949509.890000] 08eb79ac: [<08057fdd>] set_signals+0x5d/0xe0
[42949509.890000] 08eb7a3c: [<0808d61b>] prep_new_page+0x6b/0x80
[42949509.890000] 08eb7a4c: [<0808d9a7>] buffered_rmqueue+0xf7/0x200
[42949509.890000] 08eb7a70: [<0808dbf2>] get_page_from_freelist+0x82/0xc0
[42949509.890000] 08eb7a74: [<0808dc08>] get_page_from_freelist+0x98/0xc0
[42949509.890000] 08eb7a90: [<0808dc7e>] __alloc_pages+0x4e/0x2d0
[42949509.890000] 08eb7ad8: [<0808df20>] __get_free_pages+0x20/0x60
[42949509.890000] 08eb7adc: [<080b3135>] __pollwait+0x35/0xb0
[42949509.890000] 08eb7af8: [<080ad134>] pipe_poll+0x24/0x90
[42949509.890000] 08eb7b14: [<080b3540>] do_select+0x290/0x310
[42949509.890000] 08eb7b6c: [<0805d1f3>] copy_from_user_skas+0x73/0x90
[42949509.890000] 08eb7b74: [<080b3100>] __pollwait+0x0/0xb0
[42949509.890000] 08eb7b94: [<080b38d8>] sys_select+0x2e8/0x530
[42949509.890000] 08eb7bc4: [<080a1a6b>] vfs_write+0xbb/0x130
[42949509.890000] 08eb7bc8: [<080617ed>] mconsole_config+0xad/0xc0
[42949509.890000] 08eb7bcc: [<080a1a7f>] vfs_write+0xcf/0x130
[42949509.890000] 08eb7c1c: [<0805cb99>] handle_syscall+0xb9/0xc0
[42949509.890000] 08eb7c78: [<08068650>] move_registers+0x30/0x50
[42949509.890000] 08eb7c8c: [<0805b934>] handle_trap+0x24/0xe0
[42949509.890000] 08eb7ca8: [<0805bdf0>] userspace+0x170/0x1b0
[42949509.890000] 08eb7ce0: [<0805cd9a>] force_flush_all_skas+0x2a/0x40
[42949509.890000] 08eb7cfc: [<0805c7cf>] fork_handler+0xaf/0xc0
[42949509.890000] 08eb7d1c: [<081a8c08>] __restore+0x0/0x8
[42949509.890000] 08eb7d5c: [<081a8cc1>] __kill+0x11/0x20
[42949509.890000]
[42949509.890000] deactivate_all_fds failed, errno = 9

Appended is a full console dump of the guest crashing. I'd like to help track this down, so if you have ideas, please share them.

Best regards,
Marcus
From: Jeff D. <jd...@ad...> - 2006-02-26 21:11:00
On Sun, Feb 26, 2006 at 07:57:55PM +0100, Marcus Blomenkamp wrote:
> After having done a maintainance update of my application server kernel i
> experience guests to panic with signal 7. The problem is reproducible and
> seems related to IO traffic as it happens when doing "apt-get update" in
> guest for example.

This is normally a tmpfs mount on the host filling up, i.e. it's too small to hold the physical memory files of the UMLs that are using it.

Jeff
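One way to test the tmpfs explanation is to look at how full the mount behind each guest's TMPDIR is while the guest runs. The following is a minimal sketch, not part of the original thread: the function name `tmpfs_headroom` and the 16 MiB guest size in the demo are illustrative assumptions, and the code only uses Python's standard `os.statvfs`.

```python
import os
import tempfile


def tmpfs_headroom(path, guest_mem_bytes):
    """Report how full the filesystem at `path` is.

    `guest_mem_bytes` is the size of the guest's physical memory file
    (a hypothetical parameter chosen for this sketch).
    """
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize   # size of the filesystem
    free = st.f_bavail * st.f_frsize    # space available to this user
    used = total - st.f_bfree * st.f_frsize
    used_pct = 100.0 * used / total if total else 0.0
    return {
        "total": total,
        "free": free,
        "used_pct": used_pct,
        "fits": free >= guest_mem_bytes,
    }


if __name__ == "__main__":
    # Assumption: the guest was started with TMPDIR pointing at its tmpfs.
    tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())
    print(tmpdir, tmpfs_headroom(tmpdir, 16 * 1024 * 1024))
```

Run on the host shortly before or during the guest workload that triggers the panic; a `used_pct` close to 100 would support the tmpfs explanation, while plenty of free space would point elsewhere.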
From: Marcus B. <mbl...@gm...> - 2006-02-27 08:51:50
On Sunday, 26 February 2006 22:12, Jeff Dike wrote:
>
> This is normally a tmpfs mount on the host filling up, i.e. it's too small
> to hold the physical memory files of the UMLs that are using it.

Well, this is what I concluded after googling too, so I had already worked around it. Each machine has its own tmpfs and a respective TMPDIR variable. As a first attempt I made each tmpfs the guest memory size plus 1M, but that did not help. Increasing it to 2*guest + 1M did not help either. Besides, I can see the virtual machines using their TMPDIR; with the latter configuration, about 2/3 of each tmpfs stays free.

Could it be a short peak in memory usage, or some other shared resource that I have missed?

For the statistics: The host machine has 320M. The guest virtual machines amount to 7*16M plus 1*24M. The sum of the virtual machine tmp dirs for the second configuration is 273M.

BTW: The host machine has no swap and is configured with overcommit mode 2.

Best regards,
Marcus
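The figures in this message can be put side by side with a little arithmetic. The sketch below is not from the thread. It assumes the documented Linux rule for overcommit mode 2, where the commit limit is roughly swap plus `overcommit_ratio` percent of RAM, and it assumes the kernel default ratio of 50, since the host's actual setting is not stated. Whether and how much the guests' memory files count toward that limit is not established here; the numbers only show how little slack such a host has.

```python
MIB = 1024 * 1024


def commit_limit(ram_bytes, swap_bytes, overcommit_ratio):
    """Approximate commit limit in overcommit mode 2 (strict accounting)."""
    return swap_bytes + ram_bytes * overcommit_ratio // 100


# Figures stated in the message: seven 16M guests and one 24M guest.
guest_sizes_mib = [16] * 7 + [24]
guest_total_mib = sum(guest_sizes_mib)

# The stated tmpfs total matches twice the summed guest memory plus 1M.
tmpfs_total_mib = 2 * guest_total_mib + 1

# Assumption: overcommit_ratio is at the kernel default of 50.
limit_mib = commit_limit(320 * MIB, 0, 50) // MIB

print("guest memory total:", guest_total_mib, "MiB")
print("tmpfs total       :", tmpfs_total_mib, "MiB")
print("approx. commit limit:", limit_mib, "MiB")
```

Under these assumptions the guests alone account for 136 MiB against a limit of about 160 MiB, which leaves little room for everything else on the host. The real limit can be read from the `CommitLimit` line in `/proc/meminfo`.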
From: Marcus B. <mbl...@gm...> - 2006-02-27 14:33:59
Attachments:
console.o.uml-t0.b-PREEMPT
console.o.uml-t0.c-NOPREEMPT
On Monday, 27 February 2006 09:51, Marcus Blomenkamp wrote:
> ...

As a pure guess I disabled host kernel preemption, which seems to cure the "signal 7" problem. Nevertheless I still get various other kernel panics, which I can neither attribute to preemption nor rule it out as their cause. Appended to this mail are some sample panics.

While the above panics all happened in skas0 mode, I also did a test in tt mode. In this case the machine does not even boot properly. Below is a console excerpt, as it did not result in a full panic.

[42949374.090000] device eth1 entered promiscuous mode
+ ip link set dev eth1 up
[42949374.090000] br0: port 2(eth1) entering learning state
+ ip addr add dev br0 192.168.1.70/24 brd +
+ ip route add default via 192.168.1.10
INIT: PANIC: segmentation violation at 0x44a7! sleeping for 30 seconds.

Best regards,
Marcus
From: Jeff D. <jd...@ad...> - 2006-02-27 17:15:32
If the tmpfs theory were correct, you'd see the tmpfs mount being 100% full at the point of the panic.

On Mon, Feb 27, 2006 at 03:33:41PM +0100, Marcus Blomenkamp wrote:
> [42949441.660000] userspace - child stopped with signal 18
> [42949446.640000] userspace - child stopped with signal 18
> [42949446.680000] userspace - child stopped with signal 18
> [42949449.880000] userspace - child stopped with signal 18
> [42950121.590000] Kernel panic - not syncing: do_syscall_stub : failed to wait for SIGUSR1/SIGTRAP, pid = 28347, n = 28347, errno = 25, status = 0x127f

Something is sending the UML SIGCONT (that's the signal 18 and the status 0x12). What's the host? Can you try a different version of the host kernel, especially if this all happened after a host upgrade?

> [42949401.520000] <0>Kernel panic - not syncing: Kernel mode fault at addr 0x1e0, ip 0x0
> [42949409.990000]
> [42949409.990000] EIP: 0000:[<00000000>] CPU: 0 Not tainted EFLAGS: 00000000
> [42949409.990000] Not tainted
> [42949409.990000] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> [42949409.990000] ESI: 00000000 EDI: 00000000 EBP: 00000000 DS: 0000 ES: 0000
> [42949409.990000] 0822b260: [<0807cf5c>] notifier_call_chain+0x1c/0x40
> [42949409.990000] 0822b27c: [<0806f07b>] panic+0x4b/0xf0
> [42949409.990000] 0822b290: [<08059d24>] segv+0x264/0x290
> [42949409.990000] 0822b354: [<08059fc0>] segv_handler+0x50/0x60
> [42949409.990000] 0822b370: [<0805ce78>] sig_handler_common_skas+0x78/0xd0
> [42949409.990000] 0822b390: [<080677cf>] sig_handler+0xf/0x20
> [42949409.990000] 0822b39c: [<081a8d68>] __restore+0x0/0x8
> [42949409.990000] 0822b3dc: [<08196d63>] fn_hash_lookup+0x23/0xb0
> [42949409.990000] 0822b438: [<08057f32>] enable_mask+0x32/0x40
> [42949409.990000] 0822b44c: [<08057f5d>] get_signals+0x1d/0x40
> [42949409.990000] 0822b484: [<08057f32>] enable_mask+0x32/0x40
> [42949409.990000] 0822b498: [<08057f5d>] get_signals+0x1d/0x40
> [42949409.990000] 0822b4b4: [<0805fe78>] uml_net_start_xmit+0x68/0x110

This one looks more real, but given the first panic, I'd like to see if the host kernel is causing this one too.

Jeff
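The reading of `status = 0x127f` as "stopped by signal 18" follows from how Linux packs a wait status: a low byte of 0x7f marks a stopped child, and the next byte holds the signal number. The sketch below decodes the value by hand. The function name and the small signal table are illustrative assumptions, and the table uses Linux i386 numbering only (other platforms number some signals differently).

```python
# Signal numbers as used on Linux/i386; not portable to other platforms.
LINUX_X86_SIGNALS = {
    5: "SIGTRAP",
    7: "SIGBUS",
    10: "SIGUSR1",
    11: "SIGSEGV",
    18: "SIGCONT",
}


def decode_wait_status(status):
    """Decode a Linux wait status into a (kind, number) pair."""
    if status == 0xFFFF:
        return ("continued", 0)
    if (status & 0xFF) == 0x7F:
        # Stopped child: the stop signal sits in bits 8..15.
        return ("stopped", (status >> 8) & 0xFF)
    if (status & 0x7F) == 0:
        # Normal exit: the exit code sits in bits 8..15.
        return ("exited", (status >> 8) & 0xFF)
    # Killed by a signal: the signal number is in the low 7 bits.
    return ("signaled", status & 0x7F)


if __name__ == "__main__":
    kind, number = decode_wait_status(0x127F)
    print(kind, number, LINUX_X86_SIGNALS.get(number, "unknown"))
```

The same table also explains the subject of the thread: signal 7 on this platform is SIGBUS, which is what a process receives when it touches a mapped file page that the underlying filesystem cannot back.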
From: Marcus B. <mbl...@gm...> - 2006-02-27 20:52:24
On Monday, 27 February 2006 18:16, Jeff Dike wrote:
>
> Something is sending the UML SIGCONT (that's the signal 18 and the
> status 0x12). What's the host? Can you try a different version of
> the host kernel, especially if this all happened after a host upgrade?

I suppose this was caused by my experimenting with the terminal program "screen", especially its attach and detach function. Could that be the case?

> This one looks more real, but given the first panic, I'd like to see
> if the host kernel is causing this one too.

Well, in the meantime I tried different overcommit mode settings, with the yet-to-be-confirmed result that 2.6.15.4 is stricter than 2.6.8.1 in that respect. I will have to run a series of tests under _really_ relaxed memory settings to see what is what. After that I will put preemption back into place, as the clients seemed to panic much more often with it; whether this impression holds true will be seen then.

More results to come in a few hours, as I need a proper service window first.

Best regards,
Marcus