From: Marcin P. <wa...@op...> - 2004-04-13 21:23:32
|
Hello,

My UML fails to reboot, saying:

#v+
F_SETLK failed, file already locked by pid 32589
Failed to lock 'fs', err = 11
#v-

or, when a .cow file is used:

#v+
F_SETLK failed, file already locked by pid 7530
Failed to lock 'root_fs.cow', err = 11
unable to open root_fs.cow for validation
Initializing stdio console driver
#v-

It happens when I execute the "shutdown -r now" command on the guest system or "cad" in uml_mconsole. I'm using tt mode and the problem is present in every version I checked: user-mode-linux 2.4.24-1um-2 from Debian, and manually compiled 2.4.24 and 2.6.4. The behavior also doesn't depend on the host kernel version; I tried 2.4.18, 2.4.24 and 2.6.5. The problem was reported in Debian as bug #220679 (and forwarded here by Matt Zimmerman) but I think it wasn't solved.

On a 2.6.5 host kernel it also triggers a host kernel BUG in locks_remove_flock from fs/locks.c: fl->flags is FL_POSIX and the kernel expects FL_FLOCK or FL_LEASE. On 2.4.* kernels the bug doesn't show because the BUG() line was added in 2.5.

I think UML starts its threads with CLONE_FILES and the main process is restarted with execvp, which also preserves the lock. In skas mode the problem is not present because the process that survives the reboot is also the one holding the lock. In tt mode the lock is not placed by the tracing thread, so UML cannot place it again.

I tried closing all the files (from the ubd_dev array) in the kill_io_thread function and it helps, but I have no idea what happens to the host kernel when UML is not modified. Usually it removes the lock when the main process dies, but sometimes the lock is left. Killing all the other processes doesn't help here either.

I don't know how to reproduce this without UML. Probably a special combination of clone flags, and maybe the ptrace settings used by UML, is needed.

Regards,
--
Marcin Pawlik |
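For readers unfamiliar with the error above, here is a minimal sketch of how a non-blocking fcntl() lock attempt can fail with err = 11 (EAGAIN on Linux) and how F_GETLK reports the pid of the holder. It is not UML's os_lock_file(); the helper names try_fcntl_lock and demo_lock_conflict and the scratch file path are made up for illustration, and it only shows the ordinary two-process conflict, not the stale-lock behavior described in the message.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not UML code): try to take a non-blocking
 * whole-file fcntl() lock. Returns 0 on success or -errno on failure.
 * On a conflict, *holder receives the pid reported by F_GETLK, or -1
 * if it could not be determined. */
int try_fcntl_lock(int fd, int excl, pid_t *holder)
{
    struct flock lock = {
        .l_type = excl ? F_WRLCK : F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,             /* 0 means "up to end of file" */
    };
    int err;

    *holder = -1;
    if (fcntl(fd, F_SETLK, &lock) == 0)
        return 0;

    err = errno;                /* EAGAIN or EACCES on a conflict */
    if (fcntl(fd, F_GETLK, &lock) == 0 && lock.l_type != F_UNLCK)
        *holder = lock.l_pid;
    return -err;
}

/* Demo: the parent locks a scratch file, then a forked child fails to
 * lock it. Returns 1 if the child saw a conflict naming the parent. */
int demo_lock_conflict(void)
{
    char path[] = "/tmp/uml-lock-demo-XXXXXX";
    int fd = mkstemp(path);
    pid_t holder, child;
    int rc, status;

    assert(fd >= 0);
    rc = try_fcntl_lock(fd, 1, &holder);
    assert(rc == 0);

    child = fork();
    assert(child >= 0);
    if (child == 0) {
        /* fcntl() locks are not inherited across fork(), so the
         * child competes with its parent for the lock. */
        int err = try_fcntl_lock(fd, 1, &holder);
        int conflict = (err == -EAGAIN || err == -EACCES);

        _exit(conflict && holder == getppid() ? 0 : 1);
    }
    if (waitpid(child, &status, 0) < 0)
        status = -1;
    close(fd);
    unlink(path);
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling demo_lock_conflict() should return 1 on a Linux host with a local /tmp; the pid seen by the child plays the role of "pid 32589" in the log above.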
From: Henrik N. <um...@he...> - 2004-04-14 01:13:21
Attachments:
uml-flock.patch
|
On Tue, 13 Apr 2004, Marcin Pawlik wrote:

> #v+
> F_SETLK failed, file already locked by pid 32589
> Failed to lock 'fs', err = 11
> #v-

I have been plagued by this quite a lot. I tried to narrow it down the other day, but the conclusion was that the host fcntl locking implementation is buggy and stale locks easily get left behind, even after the application has closed the file or even terminated. This is probably related to the use of clone(), which somewhat messes up the host's view of which process owns the lock.

After this I gave up and rewrote this part to use flock instead of fcntl for locking. It seems to work much better, except that the locks are only local and do not protect against multiple stations accessing the same NFS-mounted image.

Patch attached.

Regards
Henrik |
From: Marcin P. <wa...@op...> - 2004-04-14 13:32:58
Attachments:
uml-kill-io-thread.patch
|
On Wed, Apr 14 at 03:13, Henrik Nordstrom wrote:

> On Tue, 13 Apr 2004, Marcin Pawlik wrote:
>
>> #v+
>> F_SETLK failed, file already locked by pid 32589
>> Failed to lock 'fs', err = 11
>> #v-
>
> Have been plagued by this quite a lot.. tried to narrow it down the
> other day but the conclusion was that the host fcntl locking
> implementation is buggy and stale locks easily get left behind even
> after application has closed the file

Do you know where and which thread closes the files? I tried to add file closing to kill_io_thread() (patch attached) and it helps, but I think the files should also be closed without my code.

> or even terminated. Probably related to the use of clone() which
> somewhat messes up the hosts view of which process owning the lock..

If it works as I suspect, clone is used with CLONE_FILES. The lock is released if any of the file-sharing threads closes the file or when all of them have finished. The tracing thread never finishes, so if the file is not explicitly closed the host kernel shouldn't release the lock. This is correct (the files should simply be closed by UML before reboot).

The problem with the host kernel is that it sometimes doesn't release the lock even after all threads have finished, and on 2.6.5 it always hits a BUG() line in locks_remove_flock. I don't see how this could be exploited, but it should be corrected anyway. On 2.6.5 it leaves the filesystem in an inconsistent state, with the kernel unable to umount it.

I thought it would be nice to reproduce this with something simpler than UML before reporting. Unfortunately I don't know enough about UML internals to mimic its thread creation, ptracing, file locking and reboot, which should lead to the same behavior.

> After this I gave up and rewrote this part to use flock instead of
> fcntl for locking. Seems to work much better except that locks are
> only local and does not protect from multiple stations accessing the
> same NFS mounted image..
>
> Patch attached.
>
>
> Index: arch/um/os-Linux/file.c
> ===================================================================
> RCS file: /cvsroot/user-mode-linux/linux/arch/um/os-Linux/file.c,v
> retrieving revision 1.29
> diff -u -r1.29 file.c
> --- arch/um/os-Linux/file.c 7 Apr 2004 20:44:49 -0000 1.29
> +++ arch/um/os-Linux/file.c 14 Apr 2004 00:41:22 -0000
> @@ -688,6 +688,7 @@
>
>  int os_lock_file(int fd, int excl)
>  {
> +#if USE_FCNTL_LOCK
>      int type = excl ? F_WRLCK : F_RDLCK;
>      struct flock lock = ((struct flock) { .l_type = type,
>                                            .l_whence = SEEK_SET,
> @@ -710,6 +711,21 @@
>      err = save;
>   out:
>      return(err);
> +#else
> +    int type = excl ? LOCK_EX : LOCK_SH;

I don't understand this. IMO excl should be F_RDLCK or F_WRLCK. F_RDLCK is 0, F_WRLCK is 1 and LOCK_EX is 2, so you will always use LOCK_SH.

Anyway, I tried the patch on 2.4.24 with uml-patch-2.4.24-2 and it breaks UML. It is unable to halt or restart, and some of its processes are left behind. I don't know why; maybe because of mixed flock/fcntl calls.

Regards,
--
Marcin Pawlik |
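The rule mentioned in this message, that an fcntl() lock goes away as soon as its owner closes any descriptor for the file, can be checked in isolation. The sketch below is only an illustration on a scratch file under /tmp: the function names are hypothetical, it uses a single process with two descriptors rather than CLONE_FILES threads, and it does not try to reproduce the stale-lock problem.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask from a separate process whether a write lock on path would
 * conflict. Returns 1 if the file is locked, 0 if free, 2 on error. */
static int locked_for_others(const char *path)
{
    int status;
    pid_t child = fork();

    assert(child >= 0);
    if (child == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &probe) < 0)
            _exit(2);
        /* F_GETLK sets l_type to F_UNLCK when nothing conflicts */
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    return WEXITSTATUS(status);
}

/* Lock a scratch file through fd1, then close a second, unrelated
 * descriptor (fd2) for the same file. *before and *after record
 * whether other processes see the lock before and after that close. */
void demo_close_drops_lock(int *before, int *after)
{
    char path[] = "/tmp/uml-close-demo-XXXXXX";
    struct flock lock = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd1 = mkstemp(path);
    int fd2, rc;

    assert(fd1 >= 0);
    rc = fcntl(fd1, F_SETLK, &lock);
    assert(rc == 0);
    fd2 = open(path, O_RDONLY);
    assert(fd2 >= 0);

    *before = locked_for_others(path);
    close(fd2);                 /* the lock taken via fd1 is dropped here */
    *after = locked_for_others(path);

    close(fd1);
    unlink(path);
}
```

On Linux this should set *before to 1 and *after to 0, even though fd1, the descriptor the lock was taken through, is still open at that point.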
From: Marcin P. <wa...@op...> - 2004-04-14 14:08:05
|
On Wed, Apr 14 at 15:32, Marcin Pawlik wrote:

> Do you know where and which thread closes the files? I tried to add
> file closing to kill_io_thread() (patch attached) and it helps
[...]
> diff -urN kernel-source-2.4.24/arch/um/drivers/ubd_kern.c kernel-source-2.4.24.mp/arch/um/drivers/ubd_kern.c
> --- kernel-source-2.4.24/arch/um/drivers/ubd_kern.c 2004-04-14 14:38:21.000000000 +0200
> +++ kernel-source-2.4.24.mp/arch/um/drivers/ubd_kern.c 2004-04-14 14:42:55.000000000 +0200
> @@ -495,6 +495,16 @@
>
>  void kill_io_thread(void)
>  {
> +    int i;
> +    struct ubd * ubd_devp = ubd_dev;
> +
> +    for(i = 0; i < MAX_DEV; i++, ubd_devp++) {
> +        if(ubd_devp) {
> +            os_close_file(ubd_devp->fd);
> +            close(ubd_devp->cow.fd);

To be consistent I should of course change the line above to "os_close_file(ubd_devp->cow.fd);", sorry.

Regards,
--
Marcin Pawlik |
From: Henrik N. <um...@he...> - 2004-04-14 14:44:52
|
On Wed, 14 Apr 2004, Marcin Pawlik wrote:

> Do you know where and which thread closes the files? I tried to add file
> closing to kill_io_thread() (patch attached) and it helps but I think it
> should also be performed without my code.

No, I do not remember, but the thread which originally opened and locked the file is apparently not around after the UML has booted.

> The problem with host kernel is that it sometimes doesn't release the
> lock even after all threads are finished and on 2.6.5 always hits a
> BUG() line in locks_remove_flock. I don't see how this could be
> exploited but it should be corrected anyway. On 2.6.5 it leaves
> filesystem in inconsistent state with kernel unable to umount it.

> I thought it would be nice to reproduce this with something simpler than
> UML before reporting. Unfortunately I don't have sufficient UML
> internals knowledge to mimic its threads creation, ptracing, file
> locking and reboot which should lead to the same behavior.

Indeed. I am not of much help here, however.

> > int os_lock_file(int fd, int excl)
> > {
> > +#if USE_FCNTL_LOCK
> >     int type = excl ? F_WRLCK : F_RDLCK;
> >     struct flock lock = ((struct flock) { .l_type = type,
> >                                           .l_whence = SEEK_SET,
> > @@ -710,6 +711,21 @@
> >     err = save;
> >  out:
> >     return(err);
> > +#else
> > +    int type = excl ? LOCK_EX : LOCK_SH;
>
> I don't understand this. IMO excl should be F_RDLCK or F_WRLCK. F_RDLCK
> is 0, F_WRLCK is 1 and LOCK_EX is 2 so you will always use LOCK_SH.

???

excl is a boolean: true if the lock should be exclusive (write access), false if it is a shared lock (read-only). This is how the UML function os_lock_file is defined. The function does not expect fcntl lock names as its argument. In addition, the flock API does not use the F_XXX flags. None of the code mentioning F_XXX flags is relevant to the flock implementation, which is below, after the #else.

What the patch does is completely replace os_lock_file with another implementation that uses flock instead of fcntl, with the old implementation kept under #if USE_FCNTL_LOCK (which is not defined).

> Anyway I tried the patch on 2.4.24 with uml-patch-2.4.24-2 and it breaks
> UML. It is unable to halt or restart with some of its processes left.
> I don't know why, maybe because of mixed flock/fcntl calls.

It seems to work here; there are no other uses of F_SETLK in my uml tree. I am using this successfully on RedHat-8 (2.4.20-something host kernel, no SKAS) and Fedora Core 2 test 1 + SKAS (2.6.something + SKAS host kernel).

Regards
Henrik |
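Since the body of the flock branch is cut off in the quoted diff, the following is only a guess at its general shape, not the attached patch. It sketches a lock routine that takes a boolean excl and maps it to LOCK_EX or LOCK_SH, as described above. The names lock_file_flock and demo_flock, the use of LOCK_NB and the -errno return convention are all assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical stand-in for a flock()-based lock routine (not the
 * patch from this thread). excl is a boolean: non-zero asks for an
 * exclusive lock, zero for a shared one. Returns 0 or -errno. */
int lock_file_flock(int fd, int excl)
{
    int op = (excl ? LOCK_EX : LOCK_SH) | LOCK_NB;

    if (flock(fd, op) < 0)
        return -errno;
    return 0;
}

/* Demo on a scratch file: two independent open()s of the same file
 * conflict under flock(), and closing the first one frees the lock. */
void demo_flock(int *first, int *second, int *after_close)
{
    char path[] = "/tmp/uml-flock-demo-XXXXXX";
    int fd1 = mkstemp(path);
    int fd2;

    assert(fd1 >= 0);
    fd2 = open(path, O_RDWR);
    assert(fd2 >= 0);

    *first = lock_file_flock(fd1, 1);       /* takes the lock */
    *second = lock_file_flock(fd2, 1);      /* conflicts with fd1 */
    close(fd1);                             /* drops fd1's lock */
    *after_close = lock_file_flock(fd2, 1); /* now succeeds */

    close(fd2);
    unlink(path);
}
```

On Linux the conflicting attempt returns -EWOULDBLOCK, which has the same value (11) as the EAGAIN seen in the original fcntl error. Unlike fcntl() locks, a flock() lock belongs to the open file description, so the owning pid plays no part in deciding when it is released.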
From: Marcin P. <wa...@op...> - 2004-04-14 18:49:13
|
On Wed, Apr 14 at 16:44, Henrik Nordstrom wrote:

> On Wed, 14 Apr 2004, Marcin Pawlik wrote:
>
>> Do you know where and which thread closes the files? I tried to add
>> file closing to kill_io_thread() (patch attached) and it helps but I
>> think it should also be performed without my code.
>
> No, I do not remember, but the thread which originally opened and
> locked the file is apparently not around after the UML has booted.

Yes, but this is not necessarily a problem. Any thread sharing the file descriptor table can close (and therefore unlock) the file.

[...]
>>> int os_lock_file(int fd, int excl)
>>> {
>>> +#if USE_FCNTL_LOCK
>>>     int type = excl ? F_WRLCK : F_RDLCK;
>>>     struct flock lock = ((struct flock) { .l_type = type,
>>>                                           .l_whence = SEEK_SET,
>>> @@ -710,6 +711,21 @@
>>>     err = save;
>>>  out:
>>>     return(err);
>>> +#else
>>> +    int type = excl ? LOCK_EX : LOCK_SH;
>>
>> I don't understand this. IMO excl should be F_RDLCK or F_WRLCK.
>> F_RDLCK is 0, F_WRLCK is 1 and LOCK_EX is 2 so you will always use
>> LOCK_SH.
>
> ???
>
> excl is a boolean, true if the lock should be exclusive (write

Oops. Yes, you are absolutely right. I thought... Well, I don't know. I'm sorry. I should probably get some sleep :/

[...]
> Seems to work here.. there are no other uses of F_SETLK in my uml
> tree. Using this successfully on RedHat-8 (2.4.20 something host
> kernel, no SKAS) and Fedora Core 2 test 1 + SKAS (2.6.something +
> SKAS host kernel)..

I tried it on Debian testing/unstable with different host kernels (2.4.25, 2.4.25 with skas, 2.6.5, 2.4.18-1-k7 from Debian) and ran the same binary on RHEL 3.0 with some 2.4.21 kernel. It doesn't work for me. After "cad" or "halt" in uml_mconsole I have sleeping and traced processes left.

I placed my test UML binary and the filesystem (infinite loop in /sbin/init) at http://www.pwr.wroc.pl/~marcinp/uml/uml.tar.gz (~1.2 MB).

Maybe it depends on the UML configuration or the compiler used. Could you send me your --showconfig?

Regards,
--
Marcin Pawlik |