Something froze our server last night and now it's been restarted; but I'm having trouble getting my project to make things. I'm looking into whether it's waiting on a stale lock from before the restart.
The reboot was at about 19:20 UTC, far as I can reconstruct.
In the log messages I've been dumping, I have
[16-Feb-2013 19:34:13 UTC] 4511: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by 4408
which is from after the reboot, and the last few entries from PID 4408 are
which is from before the reboot. There is no process 4408, currently. So why are we waiting on it, and will it time out at some point? I could have process 4511 check whether 4408 exists, but then what? Delete and recreate the lockfile? The lock should be released automatically when the process is gone.
Is it because we're using dio_fcntl instead of regular fcntl, to do file locking? But this is necessary because the working directories are NFS mounted on lalashan, and regular fcntl doesn't work on lalashan.
further investigation: the test-flock script I wrote a long time ago still seems good, and here's what it reports:
It's supposed to have the owning PID stored in the lock file...
I'm going to retry the preview operation that's waiting... Here's what I get:
[16-Feb-2013 20:12:31 UTC] 3715: /usr/local/workingwiki/working-directories/preview/.1951447361:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by 4511
test-flock and cat confirm this.
I'm sure I could free up the whole system by deleting the git project's lockfile, but I want to fix the cause.
Last edit: Lee Worden 2013-02-16
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm uncertain, and I'm going to experiment a little, even though it may make the problem unsolvable by making it go away.
First I'm going to kill 4511 and see if its lock (Cascades_Paper.lock) gets released. Then I'll see whether the next process that's waiting (maybe 3715) gets stuck on the git project.
Answer:
[16-Feb-2013 20:22:22 UTC] 3715: Locked /usr/local/workingwiki/working-directories/preview/.1951447361:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock after 590787.20593452 microseconds [16-Feb-2013 20:22:22 UTC] 3715: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock [16-Feb-2013 20:22:22 UTC] 3715: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by
Yes, 3715 gets Cascade_Paper.lock, and now it's next in line for the git project's lock. Which is apparently not about to get released anytime soon.
I don't know why the PID is no longer in the lock file.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I've been given to understand that locks don't always play nice with NFS when crashes happen, and that's just the way it is. So at least consider the possibility that there's no cause to pinpoint?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Is there a way to ask the OS what process has a file locked? I don't know of one.
Anyway, it looks to me like the NFS server and web server are not working together correctly to release locks on NFS files when the web server crashes and reboots.
I'm going to remove the .pe-git:git@github.com:worden-lee_bikh-on-network.git.lock file by hand, and see if my project starts working again.
Results:
1. the waiting projects don't spring into action. This is presumably because the lock file still exists, and has just been renamed to .nfsSOMETHING.
2. But when I try again, it will try to lock the file by name, and succeed? No, because it'll be waiting on Cascades_Paper. I'll have to kill the process that's holding it. So I kill 3715, and immediately 3379 gets the lock on Cascades_Paper, so 4508 is still waiting in line. When I kill 3379, 4509 (!) gets Cascades_Paper.lock, goes on to lock bikh_on_network.git, and gets stuck on the next one in line:
[16-Feb-2013 20:39:08 UTC] 4509: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock is locked by 4408
3. So I'll remove that one as well and kill the waiting process. Upon which 4508 gets those, but gets stuck on the next one:
[16-Feb-2013 20:41:32 UTC] 4508: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock is locked by 4408
4. Remove that one and kill 4508. Nothing happens, because nothing is waiting. Except that my browser tab finally stops spinning and reports that the connection was reset.
5. So now I reload the tab, re-submitting the request to the wiki. Maybe I did that wrong, but it reset the edit field to its original value and destroyed hours of work that I'd been trying to preview when the system froze last night. Fortunately I'm using the Lazarus Form Recovery firefox addon, which is great for recovering things like this, so that's okay. Now when I hit preview, it seems to get the locks it wants, and "top" reports that the server is running python, which is what my job should be doing. Hopefully it won't crash the server again!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
JD: I'm pretty ignorant myself. It does look iffy. For instance:
fcntl() locks are supposedly supported by NFSv3 and higher, but different kernels do different random things with it. Some kernels just lock the file locally, and don't notify the server. Some notify the server, but do it wrong. So you probably can't trust it.
That's even worse than the current problem. But it does work in general on linux, I tested it when we started using it. I'm willing to not worry too hard about other kernels.
I feel like it would be appropriate to ask RHPCS what's up and whether they can fix the problem on lalashan.
But if this crash-recovery issue won't work on other people's systems, maybe I should make WW more robust regardless of whether it gets fixed on ours.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
This comment will probably be a repeat of what's above, but I'm recording my steps anyway in case.
I'm trying to create a background job (to run a demanding script on the cluster in hope of not crashing the head server yet again). It stops at
[17-Feb-2013 23:08:09 UTC] 5806: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
(I added the hostname to the pid, since the point of using flock and NFS is that there could be multiple machines locking things.)
Searching backwards for 3385:
[17-Feb-2013 23:04:27 UTC] 3430: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
[17-Feb-2013 23:03:19 UTC] 5810: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
[17-Feb-2013 06:55:16 UTC] 3385: Locking /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock [17-Feb-2013 06:55:16 UTC] 3385: Locked /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock without waiting [17-Feb-2013 06:55:16 UTC] 3385: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock [17-Feb-2013 06:55:16 UTC] 3385: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by yushan:4513
Searching on 4513:
[17-Feb-2013 21:48:34 UTC] 3434: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 21:42:25 UTC] 10068: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 21:35:41 UTC] 3433: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 19:52:30 UTC] 6846: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 19:45:57 UTC] 10044: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 19:39:50 UTC] 13669: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 06:55:16 UTC] 3385: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by yushan:4513
Juding by the long gap in the log between 17-Feb-2013 06:56:13 UTC and 17-Feb-2013 14:58:28 UTC (and noticing that the pid was reused later) I judge that process 4513 went down in the system crash (and it may have caused it, but that's another question).
So I think I'll just kill all these processes (the ones created since the restart).
wonder@yushan:~$ sudo -u apache kill 5806 3385 3430 5810 3434 10068 3433 6846 10044 13669 [sudo] password for wonder:
kill 3385: No such process
wonder@yushan:~$
Right, I guess 3385 was from before the crash.
Anyway, after that, retrying the "background make" link DOES NOT work any better, because the lock file is still locked by 3385:
[17-Feb-2013 23:25:18 UTC] 3431: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
I need to blow away all lock files that were stranded by the crash.
Now background make goes through without hanging up on a lock.
I'll abandon the preview session and retry that after the background job is successfully merged, because I don't want to risk restarting the dangerous python script.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
We are still getting some of these cases of processes waiting on a lock owned by a long-gone process. I don't know why these locks are persisting, or what to do about it. I created a script to summarize the logfile info on who's waiting on what processes, which makes it a lot easier to manually destroy them, but doesn't shed light on the cause.
We also have a new problem with sleeping mysql connections accumulating and making the database stop working. Some of them are owned by these lock-waiting processes, but I don't know how many.
I have worked out that it's possible to get the client process's hostname and port number from MySQL's "show processlist", and go to the client's host and use sudo netstat -p to find the client's pid, which makes it possible to find out whether it's waiting on a lock or what.
JD suggested having the locking process time out after 5 minutes, which I might agree with.
One issue in reconstructing the sequence of events from the logfiles is that PIDs keep getting reused. I'm wondering whether there's something in /proc or somewhere that I can use to disambiguate different processes with the same number - date/time of creation, maybe.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
So all this is nice, but basically fcntl + NFS is buggy.
There are other ways to do locking on NFS - mv and symlink operations are atomic, and you could also do it by putting lock data into memcached or even in a mysql table - but they require your process to unlock nicely, and we can't count on that. We need a system that removes the lock when a process dies holding it, and we just don't have one.
So maybe I need subsequent lock-seeking processes to catch when their predecessors have died. It would be a challenge to do that across hosts, but I could check when I happen to be on the same one.
There's an entry in /proc/*/stat that records each processes start time ("the time in jiffies the process started after system boot"). So I could record my pid and starttime in the lockfile, and successors could check whether that pid exists and still has the same starttime, if they're on the same host.
This is independent of Jonathan's suggestion about timing out when waiting.
But it also occurred to me that usually the locking process has a natural expiration time: make processes are expected to timeout after 3 minutes, so I could also put the expiration time in the lockfile and successors would be able to tell that I should be done by now. But maybe the pid/starttime thing is enough.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I just cleared all the locks, and now new incidents are accumulating. My script says:
/usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock:
yushan-n01:27254 is waiting for yushan-n05:2049 (defunct) since 10-May-2013 04:12:38 UTC
yushan-n06:14086 is waiting for yushan-n05:2049 (defunct) since 10-May-2013 04:12:38 UTC
While in yushan-n05's log is
[09-May-2013 20:59:06 UTC] 2049 begin: /theobio/ClawedFrogs/index.php/Special:GetProjectFile?project=Eight_genes_model_dup-del_parameter_estimation_models&make=false&display=raw&filename=bd_dirDinvB-anc1.Rout.png
[09-May-2013 20:59:06 UTC] 2049: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock
[09-May-2013 20:59:06 UTC] 2049: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock is locked by
[09-May-2013 20:59:06 UTC] 2049: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock after 1.317024230957 microseconds
[09-May-2013 20:59:06 UTC] 2049: Unlock /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock
[09-May-2013 20:59:06 UTC] 2049 end
and after that are several more processes that successfully lock that lockfile, none of them with pid 2049.
So why would later processes on other nodes say they are waiting for that one? Maybe the locks don't propagate as well as they should? Or maybe the locks propagate, but the file contents have a lag - which would mean they get the wrong pid from inside the lockfile... maybe they're actually waiting for something else...
... and how about that? They're not waiting any more.
OK, that's an issue.
I see two issues here that weren't previously logged: this one, where the locks propagate maybe correctly but the file contents, which record the pid of the owning process, have a lag from one NFS client to another. and the one that shows up in the logfile entries above, where the file is empty and so the lock-seeker reports that it is locked by '' rather than an actual pid. That could be a symptom of the same lag issue, though.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The reboot was at about 19:20 UTC, far as I can reconstruct.
In the log messages I've been dumping, I have
[16-Feb-2013 19:34:13 UTC] 4511: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by 4408
which is from after the reboot, and the last few entries from PID 4408 are
[16-Feb-2013 06:29:05 UTC] 4408: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock without waiting
[16-Feb-2013 06:29:05 UTC] 4408: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock
[16-Feb-2013 06:29:05 UTC] 4408: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock without waiting
[16-Feb-2013 06:29:05 UTC] 4408: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock
[16-Feb-2013 06:29:05 UTC] 4408: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock without waiting
which is from before the reboot. There is no process 4408, currently. So why are we waiting on it, and will it time out at some point? I could have process 4511 check whether 4408 exists, but then what? Delete and recreate the lockfile? The lock should be released automatically when the process is gone.
Is it because we're using dio_fcntl instead of regular fcntl, to do file locking? But this is necessary because the working directories are NFS mounted on lalashan, and regular fcntl doesn't work on lalashan.
further investigation: the test-flock script I wrote a long time ago still seems good, and here's what it reports:
wonder@yushan:/usr/local/workingwiki/WorkingWiki$ sudo -u apache php test-flock.php --test /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
/usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is write locked.
wonder@yushan:/usr/local/workingwiki/WorkingWiki$ sudo -u apache php test-flock.php --fcntl --test /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
/usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is write locked.
But:
wonder@yushan:/usr/local/workingwiki/WorkingWiki$ sudo -u apache cat /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
wonder@yushan:/usr/local/workingwiki/WorkingWiki$
It's supposed to have the owning PID stored in the lock file...
I'm going to retry the preview operation that's waiting... Here's what I get:
[16-Feb-2013 20:12:31 UTC] 3715: /usr/local/workingwiki/working-directories/preview/.1951447361:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by 4511
test-flock and cat confirm this.
I'm sure I could free up the whole system by deleting the git project's lockfile, but I want to fix the cause.
Last edit: Lee Worden 2013-02-16
I'm uncertain, and I'm going to experiment a little, even though it may make the problem unsolvable by making it go away.
First I'm going to kill 4511 and see if its lock (Cascades_Paper.lock) gets released. Then I'll see whether the next process that's waiting (maybe 3715) gets stuck on the git project.
Answer:
[16-Feb-2013 20:22:22 UTC] 3715: Locked /usr/local/workingwiki/working-directories/preview/.1951447361:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock after 590787.20593452 microseconds
[16-Feb-2013 20:22:22 UTC] 3715: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
[16-Feb-2013 20:22:22 UTC] 3715: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by
Yes, 3715 gets Cascade_Paper.lock, and now it's next in line for the git project's lock. Which is apparently not about to get released anytime soon.
I don't know why the PID is no longer in the lock file.
Commenting from a position of deep ignorance.
I've been given to understand that locks don't always play nice with NFS when crashes happen, and that's just the way it is. So at least consider the possibility that there's no cause to pinpoint?
Is there a way to ask the OS what process has a file locked? I don't know of one.
Anyway, it looks to me like the NFS server and web server are not working together correctly to release locks on NFS files when the web server crashes and reboots.
I'm going to remove the .pe-git:git@github.com:worden-lee_bikh-on-network.git.lock file by hand, and see if my project starts working again.
Results:
1
. the waiting projects don't spring into action. This is presumably because the lock file still exists, and has just been renamed to .nfsSOMETHING.2
. But when I try again, it will try to lock the file by name, and succeed? No, because it'll be waiting on Cascades_Paper. I'll have to kill the process that's holding it. So I kill 3715, and immediately 3379 gets the lock on Cascades_Paper, so 4508 is still waiting in line. When I kill 3379, 4509 (!) gets Cascades_Paper.lock, goes on to lock bikh_on_network.git, and gets stuck on the next one in line:[16-Feb-2013 20:39:08 UTC] 4509: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock is locked by 4408
3
. So I'll remove that one as well and kill the waiting process. Upon which 4508 gets those, but gets stuck on the next one:[16-Feb-2013 20:41:32 UTC] 4508: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock is locked by 4408
4
. Remove that one and kill 4508. Nothing happens, because nothing is waiting. Except that my browser tab finally stops spinning and reports that the connection was reset.5
. So now I reload the tab, re-submitting the request to the wiki. Maybe I did that wrong, but it reset the edit field to its original value and destroyed hours of work that I'd been trying to preview when the system froze last night. Fortunately I'm using the Lazarus Form Recovery firefox addon, which is great for recovering things like this, so that's okay. Now when I hit preview, it seems to get the locks it wants, and "top" reports that the server is running python, which is what my job should be doing. Hopefully it won't crash the server again!Preview succeeded.
JD: I'm pretty ignorant myself. It does look iffy. For instance:
That's even worse than the current problem. But it does work in general on linux, I tested it when we started using it. I'm willing to not worry too hard about other kernels.
I feel like it would be appropriate to ask RHPCS what's up and whether they can fix the problem on lalashan.
But if this crash-recovery issue won't work on other people's systems, maybe I should make WW more robust regardless of whether it gets fixed on ours.
Crashed again, and now is locked up again.
This comment will probably be a repeat of what's above, but I'm recording my steps anyway in case.
I'm trying to create a background job (to run a demanding script on the cluster in hope of not crashing the head server yet again). It stops at
[17-Feb-2013 23:08:09 UTC] 5806: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
(I added the hostname to the pid, since the point of using flock and NFS is that there could be multiple machines locking things.)
Searching backwards for 3385:
[17-Feb-2013 23:04:27 UTC] 3430: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
[17-Feb-2013 23:03:19 UTC] 5810: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
[17-Feb-2013 06:55:16 UTC] 3385: Locking /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock
[17-Feb-2013 06:55:16 UTC] 3385: Locked /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock without waiting
[17-Feb-2013 06:55:16 UTC] 3385: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
[17-Feb-2013 06:55:16 UTC] 3385: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by yushan:4513
Searching on 4513:
[17-Feb-2013 21:48:34 UTC] 3434: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 21:42:25 UTC] 10068: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 21:35:41 UTC] 3433: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 19:52:30 UTC] 6846: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 19:45:57 UTC] 10044: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[17-Feb-2013 19:39:50 UTC] 13669: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513
[16-Feb-2013 19:28:30 UTC] 4513 begin: /theobio/3SS/index.php?title=Main_Page&oldid=369
[16-Feb-2013 19:28:31 UTC] 4513 end
[17-Feb-2013 06:55:16 UTC] 3385: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by yushan:4513
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock without waiting
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock without waiting
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock without waiting
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock without waiting
Juding by the long gap in the log between 17-Feb-2013 06:56:13 UTC and 17-Feb-2013 14:58:28 UTC (and noticing that the pid was reused later) I judge that process 4513 went down in the system crash (and it may have caused it, but that's another question).
So I think I'll just kill all these processes (the ones created since the restart).
wonder@yushan:~$ sudo -u apache kill 5806 3385 3430 5810 3434 10068 3433 6846 10044 13669
[sudo] password for wonder:
kill 3385: No such process
wonder@yushan:~$
Right, I guess 3385 was from before the crash.
Anyway, after that, retrying the "background make" link DOES NOT work any better, because the lock file is still locked by 3385:
[17-Feb-2013 23:25:18 UTC] 3431: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385
I need to blow away all lock files that were stranded by the crash.
wonder@yushan:~$ sudo -u apache rm /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock
Now background make goes through without hanging up on a lock.
I'll abandon the preview session and retry that after the background job is successfully merged, because I don't want to risk restarting the dangerous python script.
We are still getting some of these cases of processes waiting on a lock owned by a long-gone process. I don't know why these locks are persisting, or what to do about it. I created a script to summarize the logfile info on who's waiting on what processes, which makes it a lot easier to manually destroy them, but doesn't shed light on the cause.
We also have a new problem with sleeping mysql connections accumulating and making the database stop working. Some of them are owned by these lock-waiting processes, but I don't know how many.
I have worked out that it's possible to get the client process's hostname and port number from MySQL's "show processlist", and go to the client's host and use sudo netstat -p to find the client's pid, which makes it possible to find out whether it's waiting on a lock or what.
JD suggested having the locking process time out after 5 minutes, which I might agree with.
One issue in reconstructing the sequence of events from the logfiles is that PIDs keep getting reused. I'm wondering whether there's something in /proc or somewhere that I can use to disambiguate different processes with the same number - date/time of creation, maybe.
So all this is nice, but basically fcntl + NFS is buggy.
There are other ways to do locking on NFS - mv and symlink operations are atomic, and you could also do it by putting lock data into memcached or even in a mysql table - but they require your process to unlock nicely, and we can't count on that. We need a system that removes the lock when a process dies holding it, and we just don't have one.
So maybe I need subsequent lock-seeking processes to catch when their predecessors have died. It would be a challenge to do that across hosts, but I could check when I happen to be on the same one.
There's an entry in /proc/*/stat that records each processes start time ("the time in jiffies the process started after system boot"). So I could record my pid and starttime in the lockfile, and successors could check whether that pid exists and still has the same starttime, if they're on the same host.
This is independent of Jonathan's suggestion about timing out when waiting.
But it also occurred to me that usually the locking process has a natural expiration time: make processes are expected to timeout after 3 minutes, so I could also put the expiration time in the lockfile and successors would be able to tell that I should be done by now. But maybe the pid/starttime thing is enough.
I just cleared all the locks, and now new incidents are accumulating. My script says:
While in yushan-n05's log is
and after that are several more processes that successfully lock that lockfile, none of them with pid 2049.
So why would later processes on other nodes say they are waiting for that one? Maybe the locks don't propagate as well as they should? Or maybe the locks propagate, but the file contents have a lag - which would mean they get the wrong pid from inside the lockfile... maybe they're actually waiting for something else...
... and how about that? They're not waiting any more.
OK, that's an issue.
I see two issues here that weren't previously logged: this one, where the locks propagate maybe correctly but the file contents, which record the pid of the owning process, have a lag from one NFS client to another. and the one that shows up in the logfile entries above, where the file is empty and so the lock-seeker reports that it is locked by '' rather than an actual pid. That could be a symptom of the same lag issue, though.
At this Linux NFS FAQ I notice this line:
I would have to look into what NLM lock recovery is, but it sounds worth checking.
The pid-lag issue is [#298] now.
Related
Bugs: #298