From: Arturo R. <ja...@gm...> - 2011-01-25 17:46:26
Attachments:
strace.log
upstart.log
|
I hope it's OK that I'm cross-posting this. I sent the message to debian-users, but I think I might have better luck here. Per the title, I'd love to get your input on how to debug/fix this particular issue. A description of my setup: Asus UL30A-X5 laptop; 1.3GHz Intel SU7300 Core 2 Duo; 4GB of DDR3 RAM; 500GB SATA; Intel GMA 4500MHD. Running Debian sid on coLinux 0.7.8 (uname -a: "Linux colinux 2.6.33.5-co-0.7.8 #1 PREEMPT Wed Sep 1 22:49:51 UTC 2010 i686 GNU/Linux") inside of Windows XP Pro SP3. The error is reproducible 100% of the time. When the machine goes into standby, either automatically or manually, init (or something else? see below) crashes and takes the system down with it. I've read that gdb can't attach to init by design, so I tried strace. Output is attached as strace.log. Now, since I assumed the problem was with init, I switched to upstart, but that's not working either. See upstart.log, attached. I've also ruled out coLinux (and with it, its kernel) by trying one of the filesystem images they provide. When using that, there is no problem bringing the machine in and out of standby repeatedly. Does anyone have any idea of how I could further narrow down where the problem lies, or point me in the direction of the proper mailing list to direct my question? My apologies if I've left out any important detail. Please let me know if you have any questions. -- Arturo R. |
From: Henry N. <hen...@ar...> - 2011-01-27 00:15:05
|
Hello Arturo, On 25.01.2011 18:46, Arturo R. wrote: > Running Debian sid on a coLinux 0.7.8 (uname -a: "Linux colinux > 2.6.33.5-co-0.7.8 #1 PREEMPT Wed Sep 1 22:49:51 UTC 2010 i686 > GNU/Linux") inside of Windows XP Pro SP3. The error is reproducible > 100% of the time. When the machine goes into standby, either > automatically or manually, init (or something else? see below), > crashes and takes the system down with it. > > I've read that gdb can't attach to init by design, so I tried strace. > Output is attached as strace.log > > Now, since I assumed the problem was with init, I switched to upstart, > but that's not working either. See upstart.log, attached. > > I've also ruled out coLinux (and with it, its kernel) by trying one of > the filesystem images they provide. When using that, there is no > problem bringing the machine in and out of standby repeatedly. > > Does anyone have any idea of how I could further narrow down where the > problem lies You should have debug symbols for init, or an init built with debug symbols. Then you can load init into gdb and locate the code at address 0xb766d417, or you can run "objdump -Dr /sbin/init >dump.txt" and look up the address manually. addr2line should also work for this. -- Henry N. |
From: Arturo R. <ja...@gm...> - 2011-01-27 04:06:20
Attachments:
gdb.log
|
Hi Henry. On Wed, Jan 26, 2011 at 4:14 PM, Henry Nestler <hen...@ar...> wrote: > > You should have debug symbols for init, or an init with debug symbols. Than > you can load init into gdb and locate the code snip for address 0xb766d417, > or you can run "objdump -Dr /sbin/init >dump.txt" and find out the address > manually. The addr2line should also work for this. > Thank you very much for the reply. I guess I misspoke. When I said that the crash is reproducible all the time, I meant that putting the laptop into standby always causes init to crash, but the address given in the error message changes. I've rebuilt a sysvinit package with debug symbols and I can attach to it with gdb, but when the process crashes gdb becomes unresponsive too. I was able to get a coredump and I think I have a little more information about the crash this time, but I'm not sure how to interpret it. I've attached the output of gdb in case it's helpful to you or anybody else. Thanks again. -- Arturo R. |
From: Henry N. <hen...@ar...> - 2011-01-27 20:08:24
|
On 27.01.2011 05:06, Arturo R. wrote: > #4<signal handler called> > #5 0xb75d5417 in ?? () from /lib/i686/cmov/libc.so.6 > #6 0x08049f87 in print (s=0x804eb45 "\rINIT: ") at init.c:820 > #7 0x0804a0ef in initlog (loglevel=1, > s=0x804e854 "Id \"%s\" respawning too fast: disabled for %d minutes") > at init.c:858 You should see the text 'INIT: Id "foo" respawning too fast: disabled for 5 minutes'. But print itself segfaults inside libc. Please look at lines 820 to 858 of init.c. Maybe the variable for the first %s has no value or a wrong pointer. To see the called function in libc, can you build "init" with debug symbols and all libraries linked statically? -- Henry N. |
From: Arturo R. <ja...@gm...> - 2011-01-29 16:20:44
Attachments:
gdb.log
|
Henry: On Thu, Jan 27, 2011 at 12:08 PM, Henry Nestler <hen...@ar...> wrote: > > On 27.01.2011 05:06, Arturo R. wrote: >> >> #4<signal handler called> >> #5 0xb75d5417 in ?? () from /lib/i686/cmov/libc.so.6 >> #6 0x08049f87 in print (s=0x804eb45 "\rINIT: ") at init.c:820 >> #7 0x0804a0ef in initlog (loglevel=1, >> s=0x804e854 "Id \"%s\" respawning too fast: disabled for %d minutes") >> at init.c:858 > > You should see the text 'INIT: Id "foo" respawning too fast: disabled for 5 > minutes'. > But the print self creates a segfault inside libc. > > Please locate the lines 820 to 858. Maybe the variable for the first %s has > no value or a wrong pointer. > > To see the called function in libc, can you build "init" with debug and all > libraries as static? Attached is a backtrace with all the libraries init uses built with debugging symbols, which gives the output I think you're looking for. I'm stumped by another thing though. I commented out the offending code from init.c and the system still hung, it just didn't print that message. Does that make sense? Thanks again for your help and patience. I'm learning a lot here that will help me debug these kinds of things in the future. -- Arturo R. |
From: Arturo R. <ja...@gm...> - 2011-01-30 07:17:07
Attachments:
gdb-bash.log
|
New development. After I changed the way the kernel names the coredumps, I realized that bash was also crashing on the exact same function (__strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:75). Armed with this information, I decided to remove the libc6-i686 package and now the system no longer crashes when resuming from standby. I would still love to help out in figuring out a proper fix, if you Henry, or anyone else, would like to work with me. Thanks again. -- Arturo R. |
From: Henry N. <hen...@ar...> - 2011-01-30 21:14:35
|
Hello Arturo, nice, you have found a bug inside libc or with SSE2. Google for the text "segfault in multiarch string function (__strlen_sse2)" and you will find many of these bugs. Most are unsolved or could not be reproduced later. http://en.wikipedia.org/wiki/SSE2 Please check that your CPU supports SSE2. You can do that under native Linux, or with Knoppix, by checking the flags in "/proc/cpuinfo". I think your Intel has it. Maybe we have a problem with the FPU save/restore code for SSE2 instructions inside coLinux? For that you need to find a test case that produces code like "pxor %xmm0, %xmm0". Run it under coLinux to check. Boot coLinux with the kernel option "nofxsr". This should disable all MMX and SSE/SSE2 instructions. Henry On 30.01.2011 08:16, Arturo R. wrote: > New development. After I changed the way the kernel names the > coredumps, I realized that bash was also crashing on the exact same > function (__strlen_sse2 () at > ../sysdeps/i386/i686/multiarch/strlen.S:75). > > Armed with this information, I decided to remove the libc6-i686 > package and now the system no longer crashes when resuming from > standby. > > I would still love to help out in figuring out a proper fix, if you > Henry, or anyone else, would like to work with me. > > Thanks again. > > gdb-bash.log > > > root@colinux:~/src# gdb --quiet /bin/bash coredump.bash.1296369722 > Reading symbols from /bin/bash...done. > [New Thread 1592] > > warning: Can't read pathname for load map: Input/output error. > Reading symbols from /lib/libncurses.so.5...(no debugging symbols found)...done. > Loaded symbols for /lib/libncurses.so.5 > Reading symbols from /lib/i686/cmov/libdl.so.2...done. > Loaded symbols for /lib/i686/cmov/libdl.so.2 > Reading symbols from /lib/i686/cmov/libc.so.6...done. > Loaded symbols for /lib/i686/cmov/libc.so.6 > Reading symbols from /lib/ld-linux.so.2...done. > Loaded symbols for /lib/ld-linux.so.2 > Reading symbols from /lib/i686/cmov/libnss_compat.so.2...done. 
> Loaded symbols for /lib/i686/cmov/libnss_compat.so.2 > Reading symbols from /lib/i686/cmov/libnsl.so.1...done. > Loaded symbols for /lib/i686/cmov/libnsl.so.1 > Reading symbols from /lib/i686/cmov/libnss_nis.so.2...done. > Loaded symbols for /lib/i686/cmov/libnss_nis.so.2 > Reading symbols from /lib/i686/cmov/libnss_files.so.2...done. > Loaded symbols for /lib/i686/cmov/libnss_files.so.2 > Core was generated by `-bash'. > Program terminated with signal 11, Segmentation fault. > #0 __strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:75 > 75 pxor %xmm0, %xmm0 /* 16 null chars */ |
From: Paolo M. <pao...@gm...> - 2011-01-31 12:33:39
|
> Maybe we have a problem with FPU save/restore code for SSE2 instructions > inside coLinux? > Here you need to find a testcase, that produce code like "pxor %xmm0, > %xmm0". Run this under coLinux to check it. > > Boot coLinux with kernel option "nofxsr". This should disable all MMX and > SSE/SSE2 instructions. > > Henry Hi Henry and Arturo, If I remember correctly, FXSAVE and FXRSTOR save and restore all FPU/MMX/SSE2 state. This problem seems very interesting ... Paolo |
From: Arturo R. <ja...@gm...> - 2011-01-31 04:59:56
|
Henry: On Sun, Jan 30, 2011 at 1:14 PM, Henry Nestler <hen...@ar...> wrote: > nice, you have found a bug inside libc or with SSE2. Google for this text > "segfault in multiarch string function (__strlen_sse2)" and you will find > many of these bugs. Mostly not solved or not reproduce later. This one on Ubuntu's Launchpad looked especially attractive, since it includes a test case. Alas, I don't know how to compile/use the test case. There is a foo.cc file, but when I try to compile it, the compiler complains about a missing foo.h, which isn't in the tar archive. https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/544109 > Please check, that your CPU supports SSE2. You can do it under native Linux, > or with knoppix by checking the flags from "/proc/cpuinfo". I think your > Intel has this. Yeah, looks like my CPU supports it (cpuinfo.log attached). > Maybe we have a problem with FPU save/restore code for SSE2 instructions > inside coLinux? > Here you need to find a testcase, that produce code like "pxor %xmm0, > %xmm0". Run this under coLinux to check it. Can you point me in the right direction of how to do this? A simple .c program that runs strlen on a string doesn't seem to be calling the assembly-optimized code, and if it is, it's not causing a crash. > Boot coLinux with kernel option "nofxsr". This should disable all MMX and > SSE/SSE2 instructions. I tried it, but coLinux just crashes (coLinux .log and .conf attached). Should I try with a development snapshot? Do you think it makes sense to file a bug report for the Debian package at this point? Thank you. -- Arturo R. |
From: Henry N. <hen...@ar...> - 2011-01-31 23:17:58
|
On 31.01.2011 05:59, Arturo R. wrote: > On Sun, Jan 30, 2011 at 1:14 PM, Henry Nestler wrote: >> nice, you have found a bug inside libc or with SSE2. Google for this text >> "segfault in multiarch string function (__strlen_sse2)" and you will find >> many of these bugs. Mostly not solved or not reproduce later. > This one on Ubuntu's Launchpad looked specially attractive, since it > includes a test case. Alas, I don't know how to compile/use the test > case. There is a foo.cc file, but when I try to compile it, it > complains about missing foo.h, which isn't in the tar archive. > > https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/544109 That test case is for "cpp", which hit the bug while compiling that code snippet. It is not source code you can build a test from. >> Maybe we have a problem with FPU save/restore code for SSE2 instructions >> inside coLinux? >> Here you need to find a testcase, that produce code like "pxor %xmm0, >> %xmm0". Run this under coLinux to check it. > Can you point me in the right direction of how to do this? A simple .c > program that runs strlen on a string doesn't seem to be calling the > assembly optimized code, and if it is, it's not causing a crash. No, sorry, I don't have one, and I have not found any usable code either. >> Boot coLinux with kernel option "nofxsr". This should disable all MMX and >> SSE/SSE2 instructions. > I tried it, but coLinux just crashes (coLinux .log and .conf > attached). I have tested "nofxsr" on my machine and it has no effect. It works normally, with no crash. Maybe another user with the same Intel U7300 can check "nofxsr" under coLinux. > Should I try with a development snapshot? That would make no difference here. > Do you think it makes sense to file a bug report for the Debian > package at this point? Only if you can reproduce this under native Linux, for example with a Debian boot CD-ROM and the kernel parameter "nofxsr". -- Henry N. |
From: Arturo R. <ja...@gm...> - 2011-01-31 05:21:50
Attachments:
colinux.log
|
> I tried it, but coLinux just crashes (coLinux .log and .conf > attached). Should I try with a development snapshot? Sorry, meant to attach this colinux.log -- Arturo R. |
From: Henry N. <hen...@ar...> - 2011-01-31 21:47:26
|
On 31.01.2011 06:21, Arturo R. wrote: > > C:\coLinux>colinux-daemon.exe @debian-dev.conf > Cooperative Linux Daemon, 0.7.8 > Daemon compiled on Wed Sep 1 22:59:30 2010 > > PID: 372 > colinux: booting > Linux version 2.6.33.5-co-0.7.8 (hn@hn-dt) (gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) ) #1 PREEMPT Wed Sep 1 22:49:51 UTC 2010 > > [...snip...] > > Kernel command line: root=/dev/cobd0 ro debug nofxsr > PID hash table entries: 2048 (order: 1, 8192 bytes) > Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) > Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) > Initializing CPU#0 > xsave/xrstor: enabled xstate_bv 0x3, cntxt size 0x240 > > [...snip...] > > CPU: Genuine Intel(R) CPU U7300 @ 1.30GHz stepping 0a > > [...snip...] > > VFS: Mounted root (ext3 filesystem) readonly on device 117:0. > Freeing unused kernel memory: 140k freed > kjournald starting. Commit interval 5 seconds > Kernel panic - not syncing: Attempted to kill init! > Pid: 1, comm: init Not tainted 2.6.33.5-co-0.7.8 #1 > Call Trace: > [<c122f6af>] ? printk+0x18/0x21 > [<c122f681>] panic+0x4e/0x64 > colinux: Linux VM terminated > colinux: Kernel panic: Attempted to kill init! Ok. You booted the kernel without support for SSE2, and then init does not start or kills itself right at the start. I cannot rule out that coLinux has an error here. But I feel it is a bug with libc and SSE2 detection. -- Henry N. |
From: Arturo R. <ja...@gm...> - 2011-01-31 06:24:49
|
> I tried it, but coLinux just crashes (coLinux .log and .conf > attached). Should I try with a development snapshot? Tried downloading devel-coLinux-20110125.exe, but I'm getting a truncated executable. <http://sourceforge.net/projects/colinux/files/Snapshots/devel-20110125-Snapshot/devel-coLinux-20110125.exe/download> -- Arturo R. |
From: Henry N. <hen...@ar...> - 2011-01-31 21:29:17
|
On 31.01.2011 07:24, Arturo R. wrote: >> I tried it, but coLinux just crashes (coLinux .log and .conf >> attached). Should I try with a development snapshot? No, no need. There are no changes to floating point. > > Tried downloading devel-coLinux-20110125.exe, but I'm getting a > truncated executable. > > <http://sourceforge.net/projects/colinux/files/Snapshots/devel-20110125-Snapshot/devel-coLinux-20110125.exe/download> Oh yes, I see it too. The file on the server is OK, but the server closes the connection after some bytes have been transferred. It is a problem at SF. They have many problems currently. It began with an attack on Jan 27 last week. Read more ... https://sourceforge.net/apps/wordpress/sourceforge/ In the meantime, the snapshot is available at: http://www.henrynestler.com/colinux/testing/devel-0.7.9/20110125-Snapshot/ -- Henry N. |