Hi,

I've received a few responses to this problem, but most of the test systems have been single-processor (UP) systems and were unable to reproduce it. I'm running OpenSSI on a 2.6.12.2 kernel on a dual-core, 4-CPU system. Is anyone running OpenSSI on an SMP platform with at least 4 CPUs who could try to reproduce this problem?

Thanks in advance,

John Steinman

On 8/10/06, John Steinman <john.steinman46@gmail.com> wrote:

I have been experiencing an orphaned write file lock problem when a Perl script exits. It appears that a write file lock is still held by a process that has exited and no longer exists.

I can reproduce the problem using two programs, prog1 and bad_prog1. prog1 opens "data.file", does some file locking, releases the lock, closes the file, and loops back to repeat this sequence. bad_prog1 opens "data.file" and does some file locking, but does not release the lock or close the file on exit. On one node of the cluster I start a script, "testit", that starts at least 9 or 10 sessions of prog1, which continue to run in a for loop, and then starts bad_prog1 in a while loop:

while true; do ./bad_prog1; done

(See attached tar file for test source code)
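
For readers without the attachment, the locking pattern described above can be sketched roughly as follows. This is not the attached source: the function name take_locks and the use of Python's fcntl.lockf are assumptions made for illustration, and the lock ranges (1 byte, then 1000 bytes, both from offset 0) are taken from the log output further down.

```python
import fcntl
import os


def take_locks(path, release):
    """Write-lock byte 0, then bytes 0-999 of `path` (hypothetical helper).

    release=True mimics prog1: unlock and close before returning.
    release=False mimics bad_prog1: return the open descriptor so the
    lock is only dropped when the process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    for length in (1, 1000):
        print(f"PID {os.getpid()} requesting lock")
        # Blocking exclusive POSIX lock on `length` bytes from offset 0.
        fcntl.lockf(fd, fcntl.LOCK_EX, length, 0, os.SEEK_SET)
        print(f"PID {os.getpid()} has lock on byte starting at 0 "
              f"for {length} bytes of {path}")
    if release:
        fcntl.lockf(fd, fcntl.LOCK_UN, 1000, 0, os.SEEK_SET)
        os.close(fd)
        return None
    return fd  # caller exits without unlocking or closing
```

A prog1-style loop would call take_locks("data.file", release=True) repeatedly, while a bad_prog1-style run would call take_locks("data.file", release=False) once and exit. On a healthy kernel the exit alone should drop the lock, which is why the leftover entry in /proc/locks looks like a bug.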

After an hour or less, I get the orphaned lock from bad_prog1.


PID 155816 requesting lock
PID 155816 has lock on byte starting at 0 for 1 bytes of data.file
PID 155816 requesting lock
PID 155816 has lock on byte starting at 0 for 1000 bytes of data.file
PID 155817 requesting lock
PID 155817 has lock on byte starting at 0 for 1 bytes of data.file
PID 155817 requesting lock
PID 155817 has lock on byte starting at 0 for 1000 bytes of data.file
PID 155818 requesting lock
PID 155818 has lock on byte starting at 0 for 1 bytes of data.file
PID 155818 requesting lock
PID 155818 has lock on byte starting at 0 for 1000 bytes of data.file
PID 155819 requesting lock
PID 155819 has lock on byte starting at 0 for 1 bytes of data.file
PID 155819 requesting lock
PID 155819 has lock on byte starting at 0 for 1000 bytes of data.file
PID 155820 requesting lock
PID 155820 has lock on byte starting at 0 for 1 bytes of data.file
PID 155820 requesting lock
PID 155820 has lock on byte starting at 0 for 1000 bytes of data.file
PID 155821 requesting lock

# cat /proc/locks
1: POSIX  ADVISORY  WRITE 155820 fe:0a:241525 0 999
1: -> POSIX  ADVISORY  WRITE 136763 fe:0a:241525 0 0
1: -> POSIX  ADVISORY  WRITE 136010 fe:0a:241525 0 0
1: -> POSIX  ADVISORY  WRITE 136706 fe:0a:241525 0 0
1: -> POSIX  ADVISORY  WRITE 155821 fe:0a:241525 0 0
1: -> POSIX  ADVISORY  WRITE 136423 fe:0a:241525 0 0
2: POSIX  ADVISORY  WRITE 141420 fe:0a:692309 0 EOF
3: FLOCK  ADVISORY  WRITE 78025 fe:0a:647506 0 EOF
4: POSIX  ADVISORY  WRITE 76092 fe:0a:647482 0 EOF
5: POSIX  ADVISORY  WRITE 76092 fe:0a:647482 0 EOF

# ps -ef | grep 155820
root          86989      86664  0 11:26 pts/10   00:00:00 grep 155820

# ls -il data.file
241525 -rw-r--r--  1 root root 0 Aug  4 10:24 data.file

Under "kdb" I was able to check the inode for this file, and its "i_flock" pointer was NULL, i.e. the inode has no file locks. There appears to be a race condition that makes "/proc/locks" and other processes believe the file still has blocking locks.
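
The manual check above (find the holder in /proc/locks, then look for its PID with ps) could be automated along these lines. This is only a sketch: parse_locks, orphaned_holders and pid_alive are hypothetical helper names, the field layout is assumed from the /proc/locks output shown above, and testing for /proc/<pid> is an assumption that may not cover processes on other nodes of an OpenSSI cluster.

```python
import os


def parse_locks(text):
    """Return (pid, "maj:min:inode", blocked) tuples from /proc/locks text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        blocked = "->" in fields  # "->" marks a waiter behind the holder
        if blocked:
            fields.remove("->")
        # Expected: "N:", type, mode, READ/WRITE, pid, maj:min:inode, start, end
        if len(fields) < 8:
            continue
        entries.append((int(fields[4]), fields[5], blocked))
    return entries


def pid_alive(pid):
    """Assumption: a live process has a /proc/<pid> directory on this node."""
    return os.path.exists(f"/proc/{pid}")


def orphaned_holders(entries, alive=pid_alive):
    """Lock holders (not waiters) whose process no longer exists."""
    return [(pid, ident) for pid, ident, blocked in entries
            if not blocked and not alive(pid)]


if __name__ == "__main__":
    with open("/proc/locks") as f:
        for pid, ident in orphaned_holders(parse_locks(f.read())):
            print(f"lock on {ident} held by missing PID {pid}")
```

Run against the output above, this would flag PID 155820 on fe:0a:241525, provided that PID really is gone cluster-wide.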

Has anyone else experienced this problem on their cluster?

--
John F. Steinman



