Why does one program set work, but adding the other causes the problem?
I have followed the code path getting to "locks_remove_posix" that the test programs exercise.
sys_close() ---> filp_close() ---> locks_remove_posix()
do_exit() ---> exit_files() ---> close_files() ---> filp_close() ---> locks_remove_posix()
Both do_exit() and sys_close() have a common call path through filp_close().
With this information I modified "prog1" so that it neither releases its locks nor closes the file. This version appears to run without a problem. I also ran a number of copies of the other version of prog1, which releases the lock before closing, alongside the ones that do not release their locks. Together these also ran without a problem, so the "sys_close" path appears to be OK. These are programs that run in a loop and do not exit.

As soon as I add any program that exits with locks held, I get an orphaned lock for the process that exited, and it doesn't matter whether I close the file before exiting or not. As I pointed out before, KDB shows the inode has a NULL i_flock pointer, which suggests that no file locks are on the file when the problem occurs, yet "/proc/locks" clearly shows processes sleeping on the PID that last held the lock.
This puts a new spin on the problem. The testing suggests that all the code below filp_close works for the "sys_close" path but not for "do_exit". That doesn't make sense, since "locks_remove_posix" removes file locks over the full range of the file for both paths.
Why are sleepers still waiting, and why aren't they being woken up?
Is there something in the cluster teardown of an exiting process that could prevent the wake-up?
The only difference between the "do_exit" path and the "sys_close" path is that "do_exit" is tearing down the process and thread structures and freeing resources.
Still looking to isolate this one further,
As a side performance issue, I noticed that CPU utilization was taking a hit during these tests with spinning processes. It looks like we are using a spin lock (lock_kernel) on entry into "__posix_lock_file" and releasing it on exit (unlock_kernel). This appears to be the cause of the hit: spinning processes hog the CPUs while in the file lock code. We can be in the file locking code a long time, and having processes spin-wait seems like a waste of CPU cycles to me.
I think I have isolated the problem to the CFS code. The test program I attached has been run on 2.6.10, 2.6.12 and 2.6.17 kernels on the following configurations, with results as follows:
SMP 4 CPU:
This would suggest that the base code is OK.
The configuration that we are doing development on, porting OpenSSI 1.9.2 to SUSE 9.2 with a 188.8.131.52 kernel. Test results under this configuration:
Cluster SMP 4 CPU: failed with an orphaned write file lock
Cluster SMP 1 CPU: currently running for 5 hours 25 minutes
This would suggest that "cfs_lock" has a possible Cluster (CFS) SMP race condition when called from locks_remove_posix while file locks are still held by an exiting process.
Has anyone tried this test on a "RedHat" OpenSSI SMP Cluster?
- John Steinman

On 8/23/06, John Steinman <firstname.lastname@example.org> wrote:

Roger,
I plugged in the changes that I could. This didn't seem to help. Also, I found an SMP system in-house running SUSE 9.2 on the same hardware as my OpenSSI system and ran the test there. No failures. So it doesn't look like a base OS issue; I am back to it being OpenSSI related. I guess my next step is to look at the possibility that I have a race condition on in-flight ops in CFS.
- John Steinman

On 8/21/06, John Steinman <email@example.com> wrote:

Roger,
I took your suggestion to check whether my problem could have been fixed in the base. I did some more research on this problem over the weekend. It looks like between 184.108.40.206 and 220.127.116.11 there were some interesting changes made to fs/locks.c to correct some file locking problems. I'm not sure that they correct the problem I am seeing, but I am in the process of porting these changes back to my 18.104.22.168 OpenSSI kernel so I can run my test against them.
One fix I think needs to be made in locks_remove_posix is to use F_SETLKW; currently we use F_SETLK and we don't check the return status.
More to follow,
John F. Steinman