Update: Vladimir has been working on this file lock problem and has determined that the inode is being stepped on in the SMP/CFS environment. He wrapped the CFS file locking code in cfs_lock and cfsd_proc_setlock_0 in the i_sem lock, releasing i_sem after the lock operation has completed or after an error has been handled. This seems to resolve the problem. However, we have some concerns about using i_sem. A deadlock could occur if i_sem is also taken elsewhere in the code, and this fix does not appear to be suitable on the client.
What placing the i_sem wrapper around the CFS lock code on the master does prove is that we have a race on the inode, and that the race can be resolved by providing a lock on the inode.
Note: posix_lock_file is called twice, once from cfsd_proc_setlock_0 and again in cfs_lock after calling cfsd_proc_setlock_0 or rcfscall on the client. Wrapping these with the i_sem seems to prevent the race only on the master.
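To make the wrapping pattern concrete, here is a minimal userspace model of it. This is not the CFS or kernel code: a pthread mutex stands in for i_sem, and the names model_inode, model_lock and model_setlock are hypothetical. It only illustrates taking the per-inode lock around the lock-list update and releasing it on both the success and the error path.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a struct file_lock entry. */
struct model_lock {
    long pid;
    struct model_lock *next;
};

/* Hypothetical stand-in for the parts of the inode involved in the race. */
struct model_inode {
    pthread_mutex_t i_sem;       /* models the inode semaphore */
    struct model_lock *i_flock;  /* head of the per-inode lock list */
};

/*
 * Attach fl to the inode's lock list while holding i_sem.
 * Returns 0 on success, -1 on a bad argument. The mutex is
 * released on every path out of the critical section.
 */
int model_setlock(struct model_inode *inode, struct model_lock *fl)
{
    int err = 0;

    if (inode == NULL)
        return -1;

    pthread_mutex_lock(&inode->i_sem);

    if (fl == NULL) {
        err = -1;            /* error path still falls through to unlock */
        goto out;
    }

    fl->next = inode->i_flock;
    inode->i_flock = fl;

out:
    pthread_mutex_unlock(&inode->i_sem);
    return err;
}
```

The single exit label is what keeps the release paired with the acquire. The deadlock concern raised above corresponds to some other path already holding the same mutex when model_setlock is entered.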
I'd like some input from the forum before we make this kind of change. John Byrne, could you provide some guidance here? Anyone else feel free to comment.
What protects the inode while acquiring the file locks besides lock_kernel?
Can lock_kernel be released by schedule() if the process's time is up?
Should we be using a lock (mutex lock) on the inode to protect it if the lock_kernel is released?
On 9/8/06, John Steinman <firstname.lastname@example.org> wrote:
Hi,
The troubleshooting continues on this one. I modified the test program to open a file, take a write lock on the first 1000 bytes, and exit. This removes sys_close from the picture and should always leave a file_lock attached to the inode on exit. I ran about 10 while loops on this program and got some interesting results from debug code I added to the kernel in fcntl_setlk, close_files and filp_close:
This message came from fcntl_setlk after requesting the write lock. We should have had a file_lock attached to the inode, but we didn't, and no error was returned:
Sep 8 09:29:54 hawk5_node1 kernel: fcntl_setlk: error 0 PID 81002 no i_flock
It was followed by these two messages, indicating that in do_exit we no longer had the lock for this process, so we now have an orphaned write lock:
Sep 8 09:29:54 hawk5_node1 kernel: close_files: PID 81002 removing locks on inode 18 with no locks.
Sep 8 09:29:54 hawk5_node1 kernel: filp_close: removing locks on inode 18 with no locks.
On the next run I got my write lock from fcntl_setlk, but by the time I got to do_exit the write lock had been stripped away from the process, again leaving an orphaned write lock:
Sep 8 12:21:21 hawk5_node1 kernel: close_files: PID 80552 removing locks on inode 18 with no locks.
Sep 8 12:21:21 hawk5_node1 kernel: filp_close: removing locks on inode 18 with no locks.
1: POSIX ADVISORY WRITE 80552 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80553 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80554 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80556 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80559 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80551 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80557 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80558 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80560 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80555 fe:0a:18 0 999
1: -> POSIX ADVISORY WRITE 80561 fe:0a:18 0 999
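Each /proc/locks line above gives the lock number, the lock class (POSIX), ADVISORY or MANDATORY, the access type, the owning PID, the file as major:minor:inode, and the start and end byte offsets. A "->" marks a process blocked waiting on that lock. A small parser sketch follows, assuming the line layout shown above. The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one /proc/locks line. */
struct lock_entry {
    int id;               /* lock number before the colon */
    int blocked;          /* 1 if the line carries the "->" waiter marker */
    char kind[16];        /* e.g. POSIX */
    char mode[16];        /* e.g. ADVISORY */
    char access[16];      /* e.g. WRITE */
    long pid;             /* PID holding or waiting for the lock */
    unsigned int major;   /* device major, printed in hex */
    unsigned int minor;   /* device minor, printed in hex */
    unsigned long inode;  /* inode number, printed in decimal */
    long long start;      /* first byte of the locked range */
    char end[24];         /* last byte, kept as text since it may be "EOF" */
};

/* Parse one line into *e. Returns 0 on success, -1 if the line does not match. */
int parse_lock_line(const char *line, struct lock_entry *e)
{
    int consumed = 0;
    const char *p;

    if (sscanf(line, "%d: %n", &e->id, &consumed) != 1 || consumed == 0)
        return -1;

    p = line + consumed;
    e->blocked = (strncmp(p, "->", 2) == 0);
    if (e->blocked)
        p += 2;

    if (sscanf(p, " %15s %15s %15s %ld %x:%x:%lu %lld %23s",
               e->kind, e->mode, e->access, &e->pid,
               &e->major, &e->minor, &e->inode,
               &e->start, e->end) != 9)
        return -1;

    return 0;
}
```

Read this way, the listing shows PID 80552 as the holder of the write lock on inode 18, bytes 0 through 999, with every other PID blocked behind it.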
I also tried this same test on an ext2 FS on the same cluster, mounted non-CFS, and it works just fine. This suggests that the base code is working and that the problem is in the CFS/OpenSSI code, or is at least caused by the inclusion of the CFS code path.
This is starting to look more like a kernel synchronization issue with CFS.
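For reference, the modified test described in this message (open a file, write-lock the first 1000 bytes, exit without unlocking) might look roughly like the sketch below. The actual test source is not included in this thread, so the helper name lock_first_bytes and the use of a blocking F_SETLKW request are assumptions.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path and take a blocking POSIX write lock
 * on bytes [0, len). Returns the open descriptor, or -1 on error.
 * The caller is expected to exit without unlocking or closing, so that
 * the lock is only removed by the kernel's exit path.
 */
int lock_first_bytes(const char *path, off_t len)
{
    struct flock fl = {0};
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;

    fl.l_type = F_WRLCK;     /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;  /* l_start is relative to start of file */
    fl.l_start = 0;
    fl.l_len = len;          /* 1000 bytes gives the range 0..999 */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* sleeps until the lock is granted */
        close(fd);
        return -1;
    }
    return fd;
}
```

A main() that calls lock_first_bytes() on a file in the CFS mount with a length of 1000 and then returns immediately gives the exit-with-lock-held case. Running several copies in shell while loops supplies the contention.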
Still working to isolate this,
On 9/5/06, John Steinman <email@example.com> wrote:
Roger,
Why does one program set work but adding the other cause the problem?
I have followed the code path getting to "locks_remove_posix" that the test programs exercise.
sys_close() ---> filp_close()
Both do_exit() and sys_close() have a common call path through filp_close().
With this information I modified "prog1" to close the file without releasing its locks first. This appears to run without a problem. I also ran a number of copies of the other version of prog1, which releases the lock before closing, alongside the ones that do not release their locks. Together these ran without a problem, so the "sys_close" path appears to be OK. These are programs that run in a loop and do not exit.
As soon as I add any programs that exit with locks held, I get the orphaned lock for the process that exited, and it doesn't matter whether I close the file before exiting or not. As I pointed out before from KDB, the inode has a NULL i_flock pointer, which suggests that no file locks are on the file when the problem occurs, but "/proc/locks" clearly shows that processes are sleeping on the PID that last held the lock.
This puts a new spin on this problem. The testing would suggest that all the code below filp_close works for the "sys_close" but doesn't for the "do_exit". This doesn't make sense when "locks_remove_posix" for both paths removes file locks for the full range of the file.
Why are sleepers still waiting, and why aren't they being woken up?
Is there something in the cluster teardown of a process on exit that could prevent the wakeup?
The only difference between the "do_exit" path and the "sys_close" path is that "do_exit" is tearing down the process and thread structures and freeing resources.
Still looking to isolate this one further,
As a side performance issue, I noticed that CPU utilization was taking a hit during these tests from spinning processes. It looks like we take a spin lock (lock_kernel) on entry into "__posix_lock_file" and release it on exit (unlock_kernel). This appears to be the cause of the hit: spinning processes hog the CPUs while another process is in the file lock code. We could be in the file locking code for a long time, and having processes in a spin wait seems like a waste of CPU cycles to me.
John F. Steinman