From: Allen C. <al...@us...> - 2004-07-28 07:14:39
|
Attached below are some stack traces for the different "linux" threads after the guest kernel hangs. I tried to follow the instructions and examples from the Kernel Debugging section of the User Mode Linux HOWTO, but am unsure of how to get the stack trace from the "last call to schedule()". I hope the stack traces I have attached helps. If not, please point me in the right direction to getting the stack trace that you are looking for. Thanks. > This resembles the GHz bug, except that one is fixed. If someone > could get a stack trace from the last call to schedule(), that would > be a help. > There is some sort of a deadlock on a semaphore, and I'd like to > know which one it is. ---------- start of debug session ------------ (906)% gdb linux GNU gdb 6.1-debian Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1". (gdb) break schedule Breakpoint 1 at 0xa0011c43: file sched.c, line 557. (gdb) handle SIGSEGV pass nostop noprint Signal Stop Print Pass to program Description SIGSEGV No No Yes Segmentation fault (gdb) handle SIGUSR1 pass nostop noprint Signal Stop Print Pass to program Description SIGUSR1 No No Yes User defined signal 1 (gdb) r umid=tinyuml mem=16M ubd0=umlfs eth0=daemon Starting program: /usr/bin/linux umid=tinyuml mem=16M ubd0=umlfs eth0=daemon Detaching after fork from child process 23295. Checking for the skas3 patch in the host...found Checking for /proc/mm...found Detaching after fork from child process 23296. Detaching after fork from child process 23297. Detaching after fork from child process 23298. Detaching after fork from child process 23299. Checking for /dev/anon on the host...Not available (open failed with errno 2) Linux version 2.4.26-2um (root@swale) (gcc version 3.3.4 (Debian 1:3.3.4-4)) #2 Tue Jul 27 13:34:29 EDT 2004 On node 0 totalpages: 4096 zone(0): 4096 pages. zone(1): 0 pages. zone(2): 0 pages. Kernel command line: mem=16M ubd0=umlfs eth0=daemon root=/dev/ubd0 Calibrating delay loop... 1648.23 BogoMIPS Memory: 12112k available Dentry cache hash table entries: 2048 (order: 2, 16384 bytes) Inode cache hash table entries: 1024 (order: 1, 8192 bytes) Mount cache hash table entries: 512 (order: 0, 4096 bytes) Buffer cache hash table entries: 1024 (order: 0, 4096 bytes) Page-cache hash table entries: 4096 (order: 2, 16384 bytes) Checking for host processor cmov support...Yes Checking for host processor xmm support...No Checking that ptrace can change system call numbers...Detaching after fork from child process 23300. OK Checking that host ptys support output SIGIO...Yes Checking that host ptys support SIGIO on close...No, enabling workaround POSIX conformance testing by UNIFIX Linux NET4.0 for Linux 2.4 Based upon Swansea University Computer Society NET3.039 Initializing RT netlink socket Starting kswapd VFS: Disk quotas vdquot_6.5.1 Journalled Block Device driver loaded devfs: v1.12c (20020818) Richard Gooch (rg...@at...) devfs: boot_options: 0x0 SGI XFS with ACLs, no debug enabled SGI XFS Quota Management subsystem Detaching after fork from child process 23301. pty: 256 Unix98 ptys configured RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize loop: loaded (max 8 devices) Initializing Cryptographic API Initializing software serial port version 1 Netdevice 0 : daemon backend (uml_switch version 3) - unix:/var/run/uml-utilities/uml_switch.ctl mconsole (version 2) initialized on /home/alydar/.uml/tinyuml/mconsole Partition check: ubda: unknown partition table Detaching after fork from child process 23302. Initializing stdio console driver Detaching after fork from child process 23303. NET4: Linux TCP/IP 1.0 for NET4.0 Program received signal SIGINT, Interrupt. 0xa02177da in nanosleep () at proc_fs.h:155 155 res->data=data; (gdb) bt #0 0xa02177da in nanosleep () at proc_fs.h:155 #1 0xa0153354 in idle_sleep (secs=-516) at time.c:132 #2 0xa014f091 in cpu_idle () at process_kern.c:212 #3 0xa000f635 in rest_init () at init/main.c:364 #4 0xa00025e6 in start_kernel () at init/main.c:458 #5 0xa015b617 in start_kernel_proc (unused=0x0) at process_kern.c:156 #6 0xa014ead9 in run_kernel_thread (fn=0xa015b5f0 <start_kernel_proc>, arg=0x0, jmp_ptr=0xfffffdfc) at process.c:227 #7 0xa015b360 in new_thread_handler (sig=10) at process_kern.c:70 #8 <signal handler called> #9 0xa0219c9d in syscall () at proc_fs.h:155 #10 0xbffff4c8 in ?? () #11 0xa020128c in siglongjmp () at proc_fs.h:155 #12 0xa015ae39 in start_idle_thread (stack=0xa02a2000, switch_buf_ptr=0xa02a057c, fork_buf_ptr=0x0) at process.c:303 #13 0xa015b678 in start_uml_skas () at process_kern.c:170 #14 0xa0154a3d in linux_main (argc=-1610612736, argv=0xbffff924) at um_arch.c:397 #15 0xa000f3ee in main (argc=6, argv=0xbffff924, envp=0xbffff940) at arch/um/main.c:148 (gdb) det Detaching from program: /usr/bin/linux, process 23291 (gdb) att 23296 Attaching to program: /usr/bin/linux, process 23296 ptrace: Operation not permitted. (gdb) att 23301 Attaching to program: /usr/bin/linux, process 23301 0xa0219c9d in syscall () at proc_fs.h:155 155 res->data=data; (gdb) bt #0 0xa0219c9d in syscall () at proc_fs.h:155 #1 0xa015e945 in io_getevents (ctx_id=4294967292, min_nr=-4, nr=-4, events=0xfffffffc, timeout=0xfffffffc) at aio.c:58 #2 0xa015eb09 in aio_thread (arg=0x0) at aio.c:128 #3 0xa0219dda in clone () at proc_fs.h:155 (gdb) det Detaching from program: /usr/bin/linux, process 23301 (gdb) att 23302 Attaching to program: /usr/bin/linux, process 23302 0xa0218aad in read () at proc_fs.h:155 155 res->data=data; (gdb) bt #0 0xa0218aad in read () at proc_fs.h:155 #1 0xa08d8f84 in ?? () #2 0xa015fca4 in file_io (fd=12, buf=0xa08d8f84, len=76, io_proc=0xa0218a90 <read>, copy_user_proc=0xa014f440 <copy_from_user_proc>) at file.c:489 #3 0xa015fd41 in os_read_file (fd=-512, buf=0xfffffe00, len=-512) at file.c:505 #4 0xa018497a in io_thread (arg=0x0) at ubd_user.c:317 #5 0xa0219dda in clone () at proc_fs.h:155 (gdb) det Detaching from program: /usr/bin/linux, process 23302 (gdb) att 23303 Attaching to program: /usr/bin/linux, process 23303 0xa0219196 in poll () at proc_fs.h:155 155 res->data=data; (gdb) bt #0 0xa0219196 in poll () at proc_fs.h:155 #1 0xa01500a9 in write_sigio_thread (unused=0x0) at sigio_user.c:190 #2 0xa0219dda in clone () at proc_fs.h:155 (gdb) ---------- end of debug session ------------ ---------- ps output for "linux" threads ----------- 23290 pts/68 S+ 0:01 gdb linux 23291 pts/68 T 0:00 /usr/bin/linux 23296 pts/68 T 0:00 /usr/bin/linux 23301 pts/68 T 0:00 /usr/bin/linux 23302 pts/68 T 0:00 /usr/bin/linux 23303 pts/68 T 0:00 /usr/bin/linux -------------- end ps output ---------------- |
From: Jeff D. <jd...@ad...> - 2004-07-28 16:24:02
|
al...@us... said: > #2 0xa014f091 in cpu_idle () at process_kern.c:212 > #3 0xa000f635 in rest_init () at init/main.c:364 > #4 0xa00025e6 in start_kernel () at init/main.c:458 This is the problem stack. One of the init functions is tripping over a semaphore, and no one has been able to tell me which init function it is or which semaphore. Although the down isn't showing in this stack, which is a bit confusing. Jeff |
From: Allen C. <al...@us...> - 2004-07-29 18:30:27
|
I had no luck trying to get a better backtrace. However, since this problem only showed up between 2.4.26-1um and 2.4.26-2um, I systematically backed out the CVS changes until the problem went away. It turns out that the following changes caused the problem to show up: arch/um/drivers/chan_kern.c 1.39 Fri 07-16-2004 02:27 PM EDT jdike Exp +65 -0 Moved some non-userspace code from chan_user.c to chan_kern.c. arch/um/drivers/chan_user.c 1.33 Fri 07-16-2004 02:27 PM EDT jdike Exp +0 -64 Moved some non-userspace code from chan_user.c to chan_kern.c. I cannot explain why simply moving a few functions from one file to another will have any real effect on the resulting kernel, but my 2.4.26-2um guest kernel is now running fine without these changes. On Wednesday 28 July 2004 01:22 pm, Jeff Dike wrote: > al...@us... said: > > #2 0xa014f091 in cpu_idle () at process_kern.c:212 > > #3 0xa000f635 in rest_init () at init/main.c:364 > > #4 0xa00025e6 in start_kernel () at init/main.c:458 > > This is the problem stack. One of the init functions is tripping > over a semaphore, and no one has been able to tell me which init > function it is or which semaphore. Although the down isn't showing > in this stack, which is a bit confusing. > > Jeff > |
From: Allen C. <al...@us...> - 2004-07-30 02:34:07
|
More details on the problem after some more investigation and testing... the specific function that was moved from chan_user.c to chan_kern.c that caused the problem is generic_console_write(). Notice the isatty() calls. They probably explain the earlier observation that piping or redirecting the stdout gets around the problem. int generic_console_write(int fd, const char *buf, int n, void *unused) { struct termios save, new; int err; if(isatty(fd)){ tcgetattr(fd, &save); new = save; new.c_oflag |= OPOST; tcsetattr(fd, TCSAFLUSH, &new); } err = generic_write(fd, buf, n, NULL); if(isatty(fd)) tcsetattr(fd, TCSAFLUSH, &save); return(err); } Note also the output for chan_kern.c while the kernel is compiling: gcc -D__KERNEL__ -I/home/alydar/play/kernel/linux-2.4.26/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -U__i386__ -Ui386 -DUM_FASTCALL -g -D__arch_um__ -DSUBARCH="i386" -D_LARGEFILE64_SOURCE -I/home/alydar/play/kernel/linux-2.4.26/arch/um/include -Derrno=kernel_errno -I/home/alydar/play/kernel/linux-2.4.26/arch/um/kernel/tt/include -I/home/alydar/play/kernel/linux-2.4.26/arch/um/kernel/skas/include -nostdinc -iwithprefix include -DKBUILD_BASENAME=chan_kern -c -o chan_kern.o chan_kern.c chan_kern.c: In function `generic_console_write': chan_kern.c:121: warning: implicit declaration of function `isatty' chan_kern.c:122: warning: implicit declaration of function `tcgetattr' chan_kern.c:125: warning: implicit declaration of function `tcsetattr' On Thursday 29 July 2004 02:30 pm, Allen Chan wrote: > > It turns out that the following changes caused the problem to show > up: > > arch/um/drivers/chan_kern.c 1.39 Fri 07-16-2004 02:27 PM EDT jdike > Exp +65 -0 > Moved some non-userspace code from chan_user.c to chan_kern.c. > arch/um/drivers/chan_user.c 1.33 Fri 07-16-2004 02:27 PM EDT jdike > Exp +0 -64 > Moved some non-userspace code from chan_user.c to chan_kern.c. > |
From: Jeff D. <jd...@ad...> - 2004-07-30 16:25:10
|
al...@us... said: > More details on the problem after some more investigation and > testing... the specific function that was moved from chan_user.c to > chan_kern.c that caused the problem is generic_console_write(). > Notice the isatty() calls. They probably explain the earlier > observation that piping or redirecting the stdout gets around the > problem. The move_console_write patch at http://user-mode-linux.sourceforge.net/patches.html backs out this part of that change. Can anyone who's seeing this check that it fixes the hang? Jeff |
From: Matt C. <mat...@he...> - 2004-07-30 17:20:55
|
Jeff Dike wrote: > > The move_console_write patch at http://user-mode-linux.sourceforge.net/patches.html > backs out this part of that change. > > Can anyone who's seeing this check that it fixes the hang? Works for me. What still bothers me is that turning on debugging symbols was "fixing" it for me as well. - Matt |
From: Allen C. <al...@us...> - 2004-07-30 17:43:16
|
Works for me. Thanks. On Friday 30 July 2004 01:19 pm, Matt Clay wrote: > Jeff Dike wrote: > > The move_console_write patch at > > http://user-mode-linux.sourceforge.net/patches.html backs out > > this part of that change. > > > > Can anyone who's seeing this check that it fixes the hang? > > Works for me. What still bothers me is that turning on debugging > symbols was "fixing" it for me as well. > > - Matt |
From: Jeff D. <jd...@ad...> - 2004-07-30 17:51:19
|
mat...@he... said: > Works for me. What still bothers me is that turning on debugging > symbols was "fixing" it for me as well. Yeah, this definitely needs more investigation. Jeff |
From: Allen C. <al...@us...> - 2004-07-30 21:44:06
|
FYI... The same problem occurs in the 2.6.7-1um kernel after applying the remove-user-code patch from the "Current patches" site. Except this time, instead of hanging, the guest kernel simply exits without warning. The last output lines before exiting were: ...... [snipped] ...... Initializing stdio console driver kjournald starting. Commit interval 5 seconds While compiling chan_kern.c, the following warnings were issued: CC arch/um/drivers/chan_kern.o arch/um/drivers/chan_kern.c: In function `generic_console_write': arch/um/drivers/chan_kern.c:121: warning: implicit declaration of function `isatty' arch/um/drivers/chan_kern.c:122: warning: implicit declaration of function `tcgetattr' arch/um/drivers/chan_kern.c:125: warning: implicit declaration of function `tcsetattr' arch/um/drivers/chan_kern.c: At top level: arch/um/drivers/chan_kern.c:79: warning: `not_configged_ops' defined but not used While isatty() is in <unistd.h> and tcgetattr() and tcsetattr() are in <termios.h> in /usr/include, there does not appear to exist equivalent kernel header files that will make those warnings go away for chan_kern.c. Do these functions have to be implemented differently if placed at the kernel level? On Friday 30 July 2004 02:49 pm, Jeff Dike wrote: > Yeah, this definitely needs more investigation. |
From: Jeff D. <jd...@ad...> - 2004-07-31 04:09:22
|
On Fri, Jul 30, 2004 at 05:44:00PM -0400, Allen Chan wrote: > While isatty() is in <unistd.h> and tcgetattr() and tcsetattr() are in > <termios.h> in /usr/include, there does not appear to exist > equivalent kernel header files that will make those warnings go away > for chan_kern.c. Do these functions have to be implemented > differently if placed at the kernel level? No, because there really is no difference. There are "user" files and "kernel" files only because user files include libc headers and kernel files include kernel headers, and you can't mix kernel and libc headers in the same file. Possibly, in the libc headers, isatty et al, are macros which expand to calling something else. When you use kernel headers without those definitions, you might get some functions with those names that do the wrong thing somehow. It would be interesting to see a cpp expansion of that function in both chan_user.c and chan_kern.c Jeff |
From: Matt C. <mat...@he...> - 2004-07-31 08:00:20
|
Jeff Dike wrote: > >>Works for me. What still bothers me is that turning on debugging >>symbols was "fixing" it for me as well. > > Yeah, this definitely needs more investigation. Looking at the code for generic_console_write I noticed the return codes for tcgetattr and tcsetattr are not being checked. After I added checks for the return codes, I found out that both calls to tcsetattr always return a non-zero status, which, according to the man pages, indicates a failure. I have no idea if that is important or not. This happens with or without the patch you supplied. BTW: I also noticed that commenting out the second call to tcsetattr makes the problem go away as well. - Matt |
From: Matt C. <mat...@he...> - 2004-08-01 03:30:34
|
Jeff Dike wrote: > >>Works for me. What still bothers me is that turning on debugging >>symbols was "fixing" it for me as well. > > Yeah, this definitely needs more investigation. This patch seems to fix the problem by adding error handling to the tcsetattr call in generic_console_write. - Matt --- um.orig/arch/um/drivers/chan_kern.c 2004-07-26 11:43:52.000000000 -0700 +++ um/arch/um/drivers/chan_kern.c 2004-07-31 19:38:51.000000000 -0700 @@ -117,15 +117,15 @@ { struct termios save, new; int err; + int changed = 0; - if(isatty(fd)){ - tcgetattr(fd, &save); + if(isatty(fd) && tcgetattr(fd, &save) == 0) { new = save; new.c_oflag |= OPOST; - tcsetattr(fd, TCSAFLUSH, &new); + if(tcsetattr(fd, TCSAFLUSH, &new) == 0) changed = 1; } err = generic_write(fd, buf, n, NULL); - if(isatty(fd)) tcsetattr(fd, TCSAFLUSH, &save); + if(changed) tcsetattr(fd, TCSAFLUSH, &save); return(err); } |
From: Matt C. <mat...@he...> - 2004-08-01 03:48:47
Attachments:
tcsetattr_check
|
Matt Clay wrote: > > This patch seems to fix the problem by adding error handling to > the tcsetattr call in generic_console_write. Looks like pasting the patch inline wasn't a good idea. I've attached the patch this time. - Matt |
From: Jeff D. <jd...@ad...> - 2004-08-01 14:53:43
|
On Sat, Jul 31, 2004 at 08:28:39PM -0700, Matt Clay wrote: > This patch seems to fix the problem by adding error handling to > the tcsetattr call in generic_console_write. What are the errnos from the failures? Jeff |
From: Matt C. <mat...@he...> - 2004-08-01 20:02:13
|
Jeff Dike wrote: > >>This patch seems to fix the problem by adding error handling to >>the tcsetattr call in generic_console_write. > > What are the errnos from the failures? When it does fail (it is not consistent), errno retains whatever value it had before tcsetattr was called. After tcsetattr fails, calling tcgetattr shows c_oflag is unchanged. - Matt |
From: Allen C. <al...@ho...> - 2004-08-02 05:15:38
|
After a little poking around, it appears that chan_kern.c and chan_user.c have very different interpretations of what struct termios should be . On my debian unstable system, In chan_kern.c: sizeof(struct termios) == 36 In chan_user.c: sizeof(struct termios) == 60 It appears that the tcgetattr() and tcsetattr() calls are expecting pointers to structures of size 60, but are getting pointers to structures of size 36 instead when placed inside chan_kern.c. On a tcgetattr() call, the next 24 bytes on the stack after the local structure become open game for memory corruption. On a tcsetattr() call, actions may be taken based on 24 bytes worth of data which is likely to be garbage. |
From: Matt C. <mat...@he...> - 2004-08-02 07:33:52
|
Allen Chan wrote: > After a little poking around, it appears that chan_kern.c and > chan_user.c have very different interpretations of what struct > termios should be . On my debian unstable system, Now that you point it out, I see the same thing on my Slackware 10 system. Here is the definition in my /usr/include/bits/termios.h typedef unsigned char cc_t; typedef unsigned int speed_t; typedef unsigned int tcflag_t; #define NCCS 32 struct termios { tcflag_t c_iflag; /* input mode flags */ tcflag_t c_oflag; /* output mode flags */ tcflag_t c_cflag; /* control mode flags */ tcflag_t c_lflag; /* local mode flags */ cc_t c_line; /* line discipline */ cc_t c_cc[NCCS]; /* control characters */ speed_t c_ispeed; /* input speed */ speed_t c_ospeed; /* output speed */ }; And the one from um/include/asm/arch/termbits.h typedef unsigned char cc_t; typedef unsigned int speed_t; typedef unsigned int tcflag_t; #define NCCS 19 struct termios { tcflag_t c_iflag; /* input mode flags */ tcflag_t c_oflag; /* output mode flags */ tcflag_t c_cflag; /* control mode flags */ tcflag_t c_lflag; /* local mode flags */ cc_t c_line; /* line discipline */ cc_t c_cc[NCCS]; /* control characters */ }; This would certainly explain the random behavior I've been seeing with calls to tcsetattr failing sometimes. - Matt |
From: Matt C. <mat...@he...> - 2004-08-02 08:15:17
|
Allen Chan wrote: > After a little poking around, it appears that chan_kern.c and > chan_user.c have very different interpretations of what struct > termios should be . On my debian unstable system, I updated my um/include/asm/arch/termbits.h to match the standard 60 byte termios struct, and this fixed the kernel hang during boot. However, it did not make the random failures on calls to tcsetattr go away. The result is that I sometimes get the following during boot: <snip> NET4: Linux TCP/IP 1.0 for NET4.0 IP Protocols: ICMP, UDP, TCP IP: routing cache hash table of 512 buckets, 4Kbytes <snip> Instead of the correct output: <snip> NET4: Linux TCP/IP 1.0 for NET4.0 IP Protocols: ICMP, UDP, TCP IP: routing cache hash table of 512 buckets, 4Kbytes <snip> Notice the indented "IP Protocols"? That's where tcsetattr failed. - Matt |
From: Jeff D. <jd...@ad...> - 2004-08-02 18:21:17
|
al...@ho... said: > It appears that the tcgetattr() and tcsetattr() calls are expecting > pointers to structures of size 60, but are getting pointers to > structures of size 36 instead when placed inside chan_kern.c. On a > tcgetattr() call, the next 24 bytes on the stack after the local > structure become open game for memory corruption. On a tcsetattr() > call, actions may be taken based on 24 bytes worth of data which is > likely to be garbage. Right, this looks like a real explanation. Thanks for chasing it down. Jeff |
From: Matt C. <mat...@he...> - 2004-08-03 05:58:12
|
Jeff Dike wrote: > >>This patch seems to fix the problem by adding error handling to >>the tcsetattr call in generic_console_write. > > What are the errnos from the failures? Ignore what I said before about errno not being set, I was wrong. The answer is that tcsetattr is failing with EINTR rather frequently. It never seems to happen when running UML on a real console, but quite frequently from inside xterm, konsole or gnome-terminal. - Matt |
From: Daniel H. <dan...@gm...> - 2004-07-30 23:51:39
|
On Friday 30 July 2004 19:23, Jeff Dike wrote: > The move_console_write patch at > http://user-mode-linux.sourceforge.net/patches.html backs out this part > of that change. > > Can anyone who's seeing this check that it fixes the hang? > > Jeff Fixes the hang here, too. Regards Daniel -- "Those who would give up essential liberty, to purchase a little temporary safety, deserve neither liberty nor safety." - Benjamin Franklin |
From: Nicholas L. <nj...@pl...> - 2004-07-31 00:30:15
|
On Fri, Jul 30, 2004 at 01:23:40PM -0400, Jeff Dike wrote: > Can anyone who's seeing this check that it fixes the hang? Yep. Nicholas |