From: Chris F. <cfr...@no...> - 2004-03-01 22:20:18
|
I'm having issues with UML. I've got a rootfs that used to work with an older UML. With 2.4.24, I try and boot with the following arguments: ./linux-2.4.24 mem=256M con0=fd:0,fd:1 con=pts ubd0=test,rootfs debug msp_view=build_neptune_2.0_int11 The msp_view thing causes additional mounts to be created at init time. However, I get a bunch of errors in my console window when I try to log in: ed Hat Linux release 8.0 (Psyche) Plankton/CELP 2.5.0 - Kernel 2.4.24-1um on an i686 (none) login: Registering fd 47 twice Irqs : 3, 3 Ids : 0xa022c3f8, 0xa022c3f8 Registering fd 47 twice Irqs : 3, 3 Ids : 0xa022c3f8, 0xa022c3f8 If I then keep going, I end up with a whole bunch of lines like: Failed to forward -1359446016 to pid 16452, err = 9 As part of my efforts to try and debug this, I wanted to dump out the names of the processes that were calling os_set_owner(). Accordingly in um/os/file.c I included <linux/sched.c> and wrote some code with appropriate locking around the tasklist_lock so that I could match the pid against the process name. When I compiled, it grabbed the sched.h from /usr/include rather than from the kernel build, and everything blew up. So I guess I'm asking for help on two fronts. 1) How do I fix the build process to pick up the kernel includes (and why doesn't it do that by default?). 2) Any ideas why I'm getting the descriptor thing? Thanks, Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: Jeff D. <jd...@ad...> - 2004-03-02 00:42:33
|
cfr...@no... said: > 1) How do I fix the build process to pick up the kernel includes (and > why doesn't it do that by default?). You don't. The host's libc headers and the UML pool's kernel headers are fundamentally incompatible. You can't mix them, as you're trying to do by including sched.h into a userspace file. > 2) Any ideas why I'm getting > the descriptor thing? No, but get a stack trace or two from that printk, and that should help figure it out. Jeff |
From: Chris F. <cfr...@no...> - 2004-03-02 15:21:30
|
Jeff Dike wrote: > cfr...@no... said: > >>1) How do I fix the build process to pick up the kernel includes (and >>why doesn't it do that by default?). >> > > You don't. The host's libc headers and the UML pool's kernel headers are > fundamentally incompatible. You can't mix them, as you're trying to do > by including sched.h into a userspace file. Maybe I'm just confused about how the various things get compiled. In a normal kernel build, including <linux/sched.h> gets you the kernel headers, not the libc headers. I had assumed that I would be able to get at the UML kernel's sched.h from within the arch/um code. Is this a false assumption? Or am I just doing it wrong? >>2) Any ideas why I'm getting >>the descriptor thing? >> > > No, but get a stack trace or two from that printk, and that should help figure > it out. Here's a backtrace from just where the code would print the registering twice message. Am I right in assuming that since we're hitting this it seems to indicate some messed-up reference counting somewhere? #0 activate_fd (irq=3, fd=47, type=1, dev_id=0xa02309f8) at irq_user.c:157 #1 0xa00cf9bd in um_request_irq (irq=3, fd=47, type=1, handler=0xa010e230 <line_write_interrupt>, irqflags=872415232, devname=0xa01e7e85 "console-write", dev_id=0xa02309f8) at irq.c:430 #2 0xa010e326 in line_setup_irq (fd=47, input=1, output=1, data=0xa02309f8) at line.c:211 #3 0xa010d0b9 in enable_chan (chans=0xa0230a00, data=0xa02309f8) at chan_kern.c:164 #4 0xa010e458 in line_open (lines=0xa02309a0, tty=0xaf1f7000, opts=0xa0230920) at line.c:258 #5 0xa010cb99 in open_console (tty=0xaf1f7000) at stdio_console.c:135 #6 0xa00e391f in tty_open (inode=0xafc13c00, filp=0xafe7ef20) at tty_io.c:1433 #7 0xa0084d94 in devfs_open (inode=0xafc13c00, file=0xafe7ef20) at base.c:2799 #8 0xa0035e18 in dentry_open (dentry=0xafcd15a0, mnt=0xa032f320, flags=2) at open.c:698 #9 0xa0035d0b in filp_open (filename=0xaff49000 "/dev/ttys/1", flags=2, mode=0) at open.c:656 #10 0xa003607a in sys_open (filename=0x9ffffd20 "/dev/ttys/1", flags=2, mode=0) at open.c:798 #11 0xa00d97ec in execute_syscall_tt (r=0xaefd8274) at syscall_kern.c:120 #12 0xa00d98a9 in syscall_handler_tt (sig=12, regs=0xaefd8274) at syscall_user.c:37 #13 0xa00da9bb in sig_handler_common_tt (sig=12, sc_ptr=0xaefdbd28) ---Type <return> to continue, or q <return> to quit--- at trap_user.c:37 #14 0xa00d70dd in sig_handler (sig=12, sc= {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, __dsh = 0, edi = 2684353860, esi = 2684353824, ebp = 2684354024, esp = 2684353800, ebx = 2684353824, edx = 0, ecx = 2, eax = 4294967258, trapno = 14, err = 4, eip = 1074733544, cs = 35, __csh = 0, eflags = 582, esp_at_signal = 2684353800, ss = 43, __ssh = 0, fpstate = 0xaefdbd80, oldmask = 134283264, cr2 = 1074737568}) at trap_user.c:101 #15 <signal handler called> #16 0x400f21e8 in ?? () #17 0x08049799 in ?? () #18 0x4002f9a7 in ?? () -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: BlaisorBlade <bla...@ya...> - 2004-03-02 19:08:11
|
Alle 16:08, marted=EC 2 marzo 2004, Chris Friesen ha scritto: > Jeff Dike wrote: > > Maybe I'm just confused about how the various things get compiled. In a > normal kernel build, including <linux/sched.h> gets you the kernel > headers, not the libc headers. > I had assumed that I would be able to get > at the UML kernel's sched.h from within the arch/um code. Is this a > false assumption? Or am I just doing it wrong? Inside arch/um, there are some files which can include *only* the kernel=20 includes, and some files which can include *only* the libc includes;=20 os/file.c is of the second type. In the directory Makefile, the files liste= d=20 in USER_OBJS are of the second type: USER_OBJS :=3D $(foreach file,file.o process.o tty.o,$(obj)/$(file)) i.e. file.o is listed. Note this idiom: USER_OBJS :=3D $(filter %_user.o,$(obj-y) $(obj-m)) fd.o ... the $(filter ) part means a "grep" for *_user.o: i.e. every file ending in= =20 _user.c is compiled against libc headers. =2D-=20 Paolo Giarrusso, aka Blaisorblade Linux registered user n. 292729 |
From: Jeff D. <jd...@ad...> - 2004-03-02 20:37:06
|
cfr...@no... said: > Am I right in assuming that since we're hitting this it seems to > indicate some messed-up reference counting somewhere? It would appear so. This is inside an if line->count == 0 > #4 0xa010e458 in line_open (lines=0xa02309a0, tty=0xaf1f7000, > opts=0xa0230920) at line.c:258 and the place that decrements line->count does this: if(line->count == 0) line_disable(line, -1); which should ensure that the interrupt is disabled. Could you get stack traces from activate_fd and deactivate_fd for this irq? We need to see how the count is becoming 0 without the interrupt being turned off. Jeff |
From: Chris F. <cfr...@no...> - 2004-03-02 22:39:36
|
I've found something interesting. I'm hitting a code path in activate_fd(), where the call to um_kmalloc() returns a chunk of memory that is still pointed to in the active_fds linked list! Hence, it updates the fd and type to the ones passed in, then scans the list and of course it finds a conflict and starts complaining about duplicate activations. Somehow memory is being freed without it being removed from the active_fds list. I put a check in kfree() to scan the arg against the active_fds list. Here's the backtrace of the first call that tried to free up memory still on the active_fds list. #0 kfree (objp=0xa032c820) at slab.c:1622 #1 0xa00d03d6 in activate_fd (irq=3, fd=47, type=1, dev_id=0xa02309f8) at irq_user.c:230 #2 0xa00cf9ed in um_request_irq (irq=3, fd=47, type=1, handler=0xa010e340 <line_write_interrupt>, irqflags=872415232, devname=0xa01e7fa5 "console-write", dev_id=0xa02309f8) at irq.c:430 #3 0xa010e436 in line_setup_irq (fd=47, input=1, output=1, data=0xa02309f8) at line.c:211 #4 0xa010d1c9 in enable_chan (chans=0xa0230a00, data=0xa02309f8) at chan_kern.c:164 #5 0xa010e568 in line_open (lines=0xa02309a0, tty=0xaf0ec000, opts=0xa0230920) at line.c:258 #6 0xa010cca9 in open_console (tty=0xaf0ec000) at stdio_console.c:135 #7 0xa00e3a2f in tty_open (inode=0xafc27c00, filp=0xaf2d9a60) at tty_io.c:1433 #8 0xa0084dc4 in devfs_open (inode=0xafc27c00, file=0xaf2d9a60) at base.c:2799 #9 0xa0035e48 in dentry_open (dentry=0xafc5d5a0, mnt=0xa032f320, flags=2) at open.c:698 #10 0xa0035d3b in filp_open (filename=0xaff49000 "/dev/ttys/1", flags=2, mode=0) at open.c:656 #11 0xa00360aa in sys_open (filename=0x9ffffd20 "/dev/ttys/1", flags=2, mode=0) at open.c:798 #12 0xa00d98fc in execute_syscall_tt (r=0xaefd8274) at syscall_kern.c:120 #13 0xa00d99b9 in syscall_handler_tt (sig=12, regs=0xaefd8274) ---Type <return> to continue, or q <return> to quit--- at syscall_user.c:37 #14 0xa00daacb in sig_handler_common_tt (sig=12, sc_ptr=0xaefdbd28) at trap_user.c:37 #15 0xa00d71ed in sig_handler (sig=12, sc= {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, __dsh = 0, edi = 2684353860, esi = 2684353824 , ebp = 2684354024, esp = 2684353800, ebx = 2684353824, edx = 0, ecx = 2, eax = 4294967258, trapno = 14, err = 4, eip = 1074733544, cs = 35, __csh = 0, eflags = 582, esp_at_signal = 2684353800, ss = 43, __ssh = 0, fpstate = 0xaefdbd80, ol dmask = 134283264, cr2 = 1074737568}) at trap_user.c:101 #16 <signal handler called> #17 0x400f21e8 in ?? () #18 0x08049799 in ?? () #19 0x4002f9a7 in ?? () Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: Jeff D. <jd...@ad...> - 2004-03-02 23:04:22
|
cfr...@no... said: > #1 0xa00d03d6 in activate_fd (irq=3, fd=47, type=1, > dev_id=0xa02309f8) at irq_user.c:230 Why don't your line numbers match mine? The kfree you seem to be talking about is this: 211 out_unlock: 212 irq_unlock(flags); 213 kfree(new_fd); 214 out: And I totally don't see a problem there. new_fd isn't on any lists when it's freed there. Jeff |
From: Chris F. <cfr...@no...> - 2004-03-02 23:31:29
|
Jeff Dike wrote: > cfr...@no... said: > >>#1 0xa00d03d6 in activate_fd (irq=3, fd=47, type=1, >>dev_id=0xa02309f8) at irq_user.c:230 >> > > Why don't your line numbers match mine? I've added debug statements. The kfree you seem to be talking > about is this: > > 211 out_unlock: > 212 irq_unlock(flags); > 213 kfree(new_fd); > 214 out: > > And I totally don't see a problem there. new_fd isn't on any lists when it's > freed there. As mentioned in my other message, I'm seeing um_kmalloc() return a new_fd address that is *already* on the active_fds list. Hence it ends up looking like a dupe, hence we end up calling kfree() on an address that is still on the active_fds list. I made the following changes within activate_fd() to try and see what's going on. If I set a breakpoint at the "redo=1" line and then step through the code, I end up in a continuous loop getting the same value for the return code of um_kmalloc(). There's something fishy going on here.... get_mem: new_fd = um_kmalloc(sizeof(*new_fd)); err = -ENOMEM; if(new_fd == NULL) goto out; redo = 0; flags = irq_lock(); for(irq_fd = active_fds; irq_fd != NULL; irq_fd = irq_fd->next){ if (irq_fd == new_fd) { redo = 1; } } irq_unlock(flags); if (redo) goto get_mem; Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: Chris F. <cfr...@no...> - 2004-03-02 23:50:33
|
I turned on memory debugging, and got the following. Line 1439 in slab.c is /* Either write before start, or a double free. */ Here's the backtrace: Breakpoint 1, panic (fmt=0xa01e2d8a "kernel BUG at %s:%d!\n") at panic.c:60 60 bust_spinlocks(1); (gdb) bt #0 panic (fmt=0xa01e2d8a "kernel BUG at %s:%d!\n") at panic.c:60 #1 0xa002dc8a in kfree (objp=0xa032c89c) at slab.c:1439 #2 0xa0103ac8 in hostfs_read_super (sb=0xaf332400, data=0xaf036000, silent=0) at hostfs_kern.c:932 #3 0xa003cae9 in get_sb_nodev (fs_type=0xa022d62c, flags=0, dev_name=0xaf037000 "none", data=0xaf036000) at super.c:756 #4 0xa003cc82 in do_kern_mount (fstype=0xaf0d8000 "hostfs", flags=0, name=0xaf037000 "none", data=0xaf036000) at super.c:808 #5 0xa004fd62 in do_add_mount (nd=0xaefcbb94, type=0xaf0d8000 "hostfs", flags=0, mnt_flags=0, name=0xaf037000 "none", data=0xaf036000) at namespace.c:626 #6 0xa0050049 in do_mount (dev_name=0xaf037000 "none", dir_name=0xaff46000 "/vob/neptune", type_page=0xaf0d8000 "hostfs", flags=3236757504, data_page=0xaf036000) at namespace.c:740 #7 0xa0050416 in sys_mount (dev_name=0x8059e20 "none", dir_name=0x8059e30 "/vob/neptune", type=0x8059e48 "hostfs", flags=3236757504, data=0x8059eb8) at namespace.c:853 #8 0xa00da41c in execute_syscall_tt (r=0xaefc8274) at syscall_kern.c:120 #9 0xa00da4d9 in syscall_handler_tt (sig=12, regs=0xaefc8274) at syscall_user.c:37 #10 0xa00db5eb in sig_handler_common_tt (sig=12, sc_ptr=0xaefcbd28) at trap_user.c:37 #11 0xa00d7d0d in sig_handler (sig=12, sc= ---Type <return> to continue, or q <return> to quit--- {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, __dsh = 0, edi = 134586040, esi = 3236757504, ebp = 2684353448, esp = 2684353408, ebx = 134585888, edx = 134585928, ecx = 134585904, eax = 4294967258, trapno = 14, err = 4, eip = 1074791518, cs = 35, __csh = 0, eflags = 582, esp_at_signal = 2684353408, ss = 43, __ssh = 0, fpstate = 0xaefcbd80, oldmask = 134283264, cr2 = 1074791488}) at trap_user.c:101 #12 <signal handler called> #13 0x4010045e in ?? () #14 0x0804a8b8 in ?? () #15 0x0804b4d1 in ?? () #16 0x0804c125 in ?? () #17 0x0804cf1d in ?? () #18 0x4002f9a7 in ?? () -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: Jeff D. <jd...@ad...> - 2004-03-03 01:15:40
|
cfr...@no... said: > I turned on memory debugging, and got the following. Line 1439 in > slab.c is Excellent. Can you add the following to hostfs_read_super (not a real patch - add that line by hand)? out_dput: dput(sb->s_root); + return(NULL); out_free: kfree(root_dir); The dput is doing some cleanup for me that I wasn't aware of, including freeing stuff. Jeff |
From: Chris F. <cfr...@no...> - 2004-03-03 17:03:24
|
Jeff Dike wrote: > cfr...@no... said: > >>I turned on memory debugging, and got the following. Line 1439 in >>slab.c is > > > Excellent. Can you add the following to hostfs_read_super (not a real patch - > add that line by hand)? > > out_dput: > dput(sb->s_root); > + return(NULL); > out_free: > kfree(root_dir); > > The dput is doing some cleanup for me that I wasn't aware of, including freeing > stuff. Looks like that cleaned it up for the minimal setup that was triggering the fault. I'll test it with the full configuration and see how that goes... Thanks, Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: Jeff D. <jd...@ad...> - 2004-03-08 21:18:49
|
cfr...@no... said: > I'll test it with the full configuration and see how that goes... I guess things are working OK since you haven't been complaining, but I thought I make sure. That fix does make the problem totally go away? Jeff |
From: Chris F. <cfr...@no...> - 2004-03-08 21:25:10
|
Jeff Dike wrote: > cfr...@no... said: > >>I'll test it with the full configuration and see how that goes... >> > > I guess things are working OK since you haven't been complaining, but I > thought I make sure. That fix does make the problem totally go away? Heh...I meant to get back to you and let you know, but its been kind of busy. Yep, the problems seem to be gone. I'm not one of the main users of the UML environment, but I haven't heard of any more problems from the guys that are. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |
From: Chris F. <cfr...@no...> - 2004-03-02 23:16:29
|
Further update on the very strange behaviour I'm seeing. I instrumented kfree() as mentioned before, and also instrumented activate_fd() to test after um_kmalloc() whether the result was already in the active_fds list. I had assumed that I would see it being freed while still on the list, then being returned from malloc the next time around, causing a dupe. However, what I actually saw was new fd is same address as one already in active_fds deleting address currently in active_fds In other words, the um_kmalloc() was returning an address already in the active_fds list, *before* an address still in the list had been kfree()'d. The kfree() that I see with an address in the list is the one from the out_unlock jump target in activeate_fd(). So somehow um_kmalloc() is returning duplicate addresses without an intervening call to kfree(). Any ideas? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfr...@no... |