Thread: Re: [SSI-devel] Orphaned write file lock
From: Roger T. <rog...@gm...> - 2006-08-11 01:51:37
hmm... interesting

# ps -fp 200093
UID PID NODE PPID C STIME TTY TIME CMD
# cat /proc/locks | grep 200093
34: FLOCK ADVISORY WRITE 200093 93:00:1276714 0 EOF

Roger

On 8/10/06, John Steinman <joh...@gm...> wrote:
>
> I have been experiencing an orphaned write file lock problem with a Perl
> script exiting. It appears that a write file lock is held by a process that
> has exited and no longer exists.
>
> I can reproduce the problem using two programs, prog1 and bad_prog1.
> prog1 opens "data.file", does some file locking, releases the lock, closes
> the file, and loops back to repeat this sequence. bad_prog1 opens
> "data.file" and does some file locking, but does not release the lock or
> close the file on exit. I start a script "testit" on one node of the cluster
> that starts at least 9 or 10 sessions of prog1, which continue to run in a
> for loop, and then starts bad_prog1 with a while command:
>
> while true; do ./bad_prog1; done
>
> (See attached tar file for test source code)
>
> After about an hour or less I get the orphaned lock from bad_prog1.
>
> PID 155816 requesting lock
> PID 155816 has lock on byte starting at 0 for 1 bytes of data.file
> PID 155816 requesting lock
> PID 155816 has lock on byte starting at 0 for 1000 bytes of data.file
> PID 155817 requesting lock
> PID 155817 has lock on byte starting at 0 for 1 bytes of data.file
> PID 155817 requesting lock
> PID 155817 has lock on byte starting at 0 for 1000 bytes of data.file
> PID 155818 requesting lock
> PID 155818 has lock on byte starting at 0 for 1 bytes of data.file
> PID 155818 requesting lock
> PID 155818 has lock on byte starting at 0 for 1000 bytes of data.file
> PID 155819 requesting lock
> PID 155819 has lock on byte starting at 0 for 1 bytes of data.file
> PID 155819 requesting lock
> PID 155819 has lock on byte starting at 0 for 1000 bytes of data.file
> PID 155820 requesting lock
> PID 155820 has lock on byte starting at 0 for 1 bytes of data.file
> PID 155820 requesting lock
> PID 155820 has lock on byte starting at 0 for 1000 bytes of data.file
> PID 155821 requesting lock
>
> # cat /proc/locks
> 1: POSIX ADVISORY WRITE 155820 fe:0a:241525 0 999
> 1: -> POSIX ADVISORY WRITE 136763 fe:0a:241525 0 0
> 1: -> POSIX ADVISORY WRITE 136010 fe:0a:241525 0 0
> 1: -> POSIX ADVISORY WRITE 136706 fe:0a:241525 0 0
> 1: -> POSIX ADVISORY WRITE 155821 fe:0a:241525 0 0
> 1: -> POSIX ADVISORY WRITE 136423 fe:0a:241525 0 0
> 2: POSIX ADVISORY WRITE 141420 fe:0a:692309 0 EOF
> 3: FLOCK ADVISORY WRITE 78025 fe:0a:647506 0 EOF
> 4: POSIX ADVISORY WRITE 76092 fe:0a:647482 0 EOF
> 5: POSIX ADVISORY WRITE 76092 fe:0a:647482 0 EOF
>
> # ps -ef | grep 155820
> root 86989 86664 0 11:26 pts/10 00:00:00 grep 155820
>
> # ls -il data.file
> 241525 -rw-r--r-- 1 root root 0 Aug 4 10:24 data.file
>
> Under "kdb" I was able to check the inode for this file, and the "i_flock"
> pointer was NULL: no file locks. There appears to be a race condition that
> makes "/proc/locks" and other processes believe the file has blocking
> locks.
>
> Has anyone else experienced this problem on their cluster?
>
> --
> John F. Steinman
>
> _______________________________________________
> ssic-linux-devel mailing list
> ssi...@li...
> https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
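
For readers without the attached tar file, the two behaviours described in the quoted report might look roughly like the sketch below. It is not the original test source, which is not shown in this thread: the function names (`take_lock`, `lock_cycle`, `lock_and_leave`) are hypothetical, Python's `fcntl.lockf` is assumed as the way to take POSIX byte-range locks, and the 1-byte and 1000-byte ranges are taken from the log output above.

```python
import fcntl
import os


def take_lock(fd, length, name):
    """Block until an exclusive POSIX lock on bytes [0, length) is granted."""
    pid = os.getpid()
    msgs = [f"PID {pid} requesting lock"]
    fcntl.lockf(fd, fcntl.LOCK_EX, length, 0, os.SEEK_SET)
    msgs.append(
        f"PID {pid} has lock on byte starting at 0 for {length} bytes of {name}"
    )
    return msgs


def lock_cycle(path):
    """prog1-like pass (assumed shape): lock, lock again, unlock, close."""
    name = os.path.basename(path)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        msgs = take_lock(fd, 1, name) + take_lock(fd, 1000, name)
        # A length of 0 means "to end of file", so this drops both ranges.
        fcntl.lockf(fd, fcntl.LOCK_UN, 0, 0, os.SEEK_SET)
    finally:
        os.close(fd)
    return msgs


def lock_and_leave(path):
    """bad_prog1-like pass (assumed shape): same locking, no unlock, no close.

    The descriptor is returned still open and still locked; a real
    bad_prog1 would simply exit at this point.
    """
    name = os.path.basename(path)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    msgs = take_lock(fd, 1, name) + take_lock(fd, 1000, name)
    return fd, msgs
```

On a stock Linux kernel, POSIX locks left behind this way are released when the process terminates, which is why a lock that outlives its holder points at a bug rather than at the test programs themselves.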
From: Roger T. <rog...@gm...> - 2006-08-11 22:47:02
Actually I just realized I didn't run into a stale lock.

# localview lsof | grep cron
<snip>
crond 200094 root 3u REG 147,0 7 1276714 /cluster/node3/var/run/crond.pid
crond 200094 root 4u unix 0xf4c61dc0 50335322 socket

# service crond stop
(node 3) Stopping crond: [ OK ]
# localview lsof | grep 1276714
#

Roger

On 8/10/06, Roger Tsang <rog...@gm...> wrote:
> hmm... interesting
>
> # ps -fp 200093
> UID PID NODE PPID C STIME TTY TIME CMD
> # cat /proc/locks | grep 200093
> 34: FLOCK ADVISORY WRITE 200093 93:00:1276714 0 EOF
>
> Roger
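
The manual check used in this thread (read `/proc/locks`, then look for the holder's PID) can be sketched as a small script. This is only an illustration: `parse_locks`, `pid_alive`, and `possibly_orphaned` are hypothetical names, the field layout is assumed from the `/proc/locks` output quoted above, and the liveness test only sees processes visible from the local node, which may not hold on an SSI cluster.

```python
import os


def parse_locks(text):
    """Parse /proc/locks-style text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Waiters are shown with a "->" marker after the lock number.
        blocked = len(fields) >= 2 and fields[1] == "->"
        if blocked:
            del fields[1]
        if len(fields) != 8:
            continue  # skip lines that do not match the assumed layout
        _, kind, mode, access, pid, dev_inode, start, end = fields
        entries.append({
            "kind": kind,            # POSIX or FLOCK
            "mode": mode,            # ADVISORY or MANDATORY
            "access": access,        # READ or WRITE
            "pid": int(pid),
            "inode": int(dev_inode.rsplit(":", 1)[1]),
            "range": (start, end),
            "blocked": blocked,
        })
    return entries


def pid_alive(pid):
    """Return True if a process with this PID is visible locally."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # exists, but owned by another user
    return True


def possibly_orphaned(text, alive=pid_alive):
    """List lock holders (not waiters) whose recorded PID is not alive."""
    return [e for e in parse_locks(text)
            if not e["blocked"] and not alive(e["pid"])]
```

A hit from `possibly_orphaned` is only a candidate. As the crond case above shows, the PID recorded for a FLOCK entry (200093) can be gone while another process (200094) still holds the file open, so it is worth checking the inode with `lsof` before calling the lock stale.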