WorkingWiki / Bugs / #263 overly persistent lock?

Lee Worden - 2013-02-16

The reboot was at about 19:20 UTC, far as I can reconstruct.

In the log messages I've been dumping, I have

[16-Feb-2013 19:34:13 UTC] 4511: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by 4408

which is from after the reboot, and the last few entries from PID 4408 are

[16-Feb-2013 06:29:05 UTC] 4408: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock without waiting
[16-Feb-2013 06:29:05 UTC] 4408: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock
[16-Feb-2013 06:29:05 UTC] 4408: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock without waiting
[16-Feb-2013 06:29:05 UTC] 4408: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock
[16-Feb-2013 06:29:05 UTC] 4408: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock without waiting

which is from before the reboot. There is no process 4408, currently. So why are we waiting on it, and will it time out at some point? I could have process 4511 check whether 4408 exists, but then what? Delete and recreate the lockfile? The lock should be released automatically when the process is gone.

Is it because we're using dio_fcntl instead of regular fcntl, to do file locking? But this is necessary because the working directories are NFS mounted on lalashan, and regular fcntl doesn't work on lalashan.

further investigation: the test-flock script I wrote a long time ago still seems good, and here's what it reports:

wonder@yushan:/usr/local/workingwiki/WorkingWiki$ sudo -u apache php test-flock.php --test /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
/usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is write locked.

wonder@yushan:/usr/local/workingwiki/WorkingWiki$ sudo -u apache php test-flock.php --fcntl --test /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
/usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is write locked.

But:

wonder@yushan:/usr/local/workingwiki/WorkingWiki$ sudo -u apache cat /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
wonder@yushan:/usr/local/workingwiki/WorkingWiki$

It's supposed to have the owning PID stored in the lock file...

I'm going to retry the preview operation that's waiting... Here's what I get:

[16-Feb-2013 20:12:31 UTC] 3715: /usr/local/workingwiki/working-directories/preview/.1951447361:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by 4511

test-flock and cat confirm this.

I'm sure I could free up the whole system by deleting the git project's lockfile, but I want to fix the cause.

Last edit: Lee Worden 2013-02-16

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-02-16

I'm uncertain, and I'm going to experiment a little, even though it may make the problem unsolvable by making it go away.

First I'm going to kill 4511 and see if its lock (Cascades_Paper.lock) gets released. Then I'll see whether the next process that's waiting (maybe 3715) gets stuck on the git project.

Answer:

[16-Feb-2013 20:22:22 UTC] 3715: Locked /usr/local/workingwiki/working-directories/preview/.1951447361:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock after 590787.20593452 microseconds
[16-Feb-2013 20:22:22 UTC] 3715: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
[16-Feb-2013 20:22:22 UTC] 3715: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by

Yes, 3715 gets Cascade_Paper.lock, and now it's next in line for the git project's lock. Which is apparently not about to get released anytime soon.

I don't know why the PID is no longer in the lock file.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Jonathan Dushoff - 2013-02-16

Commenting from a position of deep ignorance.

I've been given to understand that locks don't always play nice with NFS when crashes happen, and that's just the way it is. So at least consider the possibility that there's no cause to pinpoint?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-02-16

Is there a way to ask the OS what process has a file locked? I don't know of one.

Anyway, it looks to me like the NFS server and web server are not working together correctly to release locks on NFS files when the web server crashes and reboots.

I'm going to remove the .pe-git:git@github.com:worden-lee_bikh-on-network.git.lock file by hand, and see if my project starts working again.

Results:

1. the waiting projects don't spring into action. This is presumably because the lock file still exists, and has just been renamed to .nfsSOMETHING.

2. But when I try again, it will try to lock the file by name, and succeed? No, because it'll be waiting on Cascades_Paper. I'll have to kill the process that's holding it. So I kill 3715, and immediately 3379 gets the lock on Cascades_Paper, so 4508 is still waiting in line. When I kill 3379, 4509 (!) gets Cascades_Paper.lock, goes on to lock bikh_on_network.git, and gets stuck on the next one in line:

[16-Feb-2013 20:39:08 UTC] 4509: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock is locked by 4408

3. So I'll remove that one as well and kill the waiting process. Upon which 4508 gets those, but gets stuck on the next one:

[16-Feb-2013 20:41:32 UTC] 4508: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock is locked by 4408

4. Remove that one and kill 4508. Nothing happens, because nothing is waiting. Except that my browser tab finally stops spinning and reports that the connection was reset.

5. So now I reload the tab, re-submitting the request to the wiki. Maybe I did that wrong, but it reset the edit field to its original value and destroyed hours of work that I'd been trying to preview when the system froze last night. Fortunately I'm using the Lazarus Form Recovery firefox addon, which is great for recovering things like this, so that's okay. Now when I hit preview, it seems to get the locks it wants, and "top" reports that the server is running python, which is what my job should be doing. Hopefully it won't crash the server again!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-02-16

Preview succeeded.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-02-16

JD: I'm pretty ignorant myself. It does look iffy. For instance:

fcntl() locks are supposedly supported by NFSv3 and higher, but different kernels do different random things with it. Some kernels just lock the file locally, and don't notify the server. Some notify the server, but do it wrong. So you probably can't trust it.

That's even worse than the current problem. But it does work in general on linux, I tested it when we started using it. I'm willing to not worry too hard about other kernels.

I feel like it would be appropriate to ask RHPCS what's up and whether they can fix the problem on lalashan.

But if this crash-recovery issue won't work on other people's systems, maybe I should make WW more robust regardless of whether it gets fixed on ours.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-02-17

Crashed again, and now is locked up again.

This comment will probably be a repeat of what's above, but I'm recording my steps anyway in case.

I'm trying to create a background job (to run a demanding script on the cluster in hope of not crashing the head server yet again). It stops at

[17-Feb-2013 23:08:09 UTC] 5806: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385

(I added the hostname to the pid, since the point of using flock and NFS is that there could be multiple machines locking things.)

Searching backwards for 3385:

[17-Feb-2013 23:04:27 UTC] 3430: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385

[17-Feb-2013 23:03:19 UTC] 5810: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385

[17-Feb-2013 06:55:16 UTC] 3385: Locking /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock
[17-Feb-2013 06:55:16 UTC] 3385: Locked /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock without waiting
[17-Feb-2013 06:55:16 UTC] 3385: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
[17-Feb-2013 06:55:16 UTC] 3385: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by yushan:4513

Searching on 4513:

[17-Feb-2013 21:48:34 UTC] 3434: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513

[17-Feb-2013 21:42:25 UTC] 10068: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513

[17-Feb-2013 21:35:41 UTC] 3433: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513

[17-Feb-2013 19:52:30 UTC] 6846: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513

[17-Feb-2013 19:45:57 UTC] 10044: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513

[17-Feb-2013 19:39:50 UTC] 13669: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:4513

[16-Feb-2013 19:28:30 UTC] 4513 begin: /theobio/3SS/index.php?title=Main_Page&oldid=369
[16-Feb-2013 19:28:31 UTC] 4513 end

[17-Feb-2013 06:55:16 UTC] 3385: /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock is locked by yushan:4513

[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock without waiting
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock without waiting
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock without waiting
[17-Feb-2013 06:54:00 UTC] 4513: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock
[17-Feb-2013 06:54:00 UTC] 4513: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock without waiting

Juding by the long gap in the log between 17-Feb-2013 06:56:13 UTC and 17-Feb-2013 14:58:28 UTC (and noticing that the pid was reused later) I judge that process 4513 went down in the system crash (and it may have caused it, but that's another question).

So I think I'll just kill all these processes (the ones created since the restart).

wonder@yushan:~$ sudo -u apache kill 5806 3385 3430 5810 3434 10068 3433 6846 10044 13669
[sudo] password for wonder:
kill 3385: No such process
wonder@yushan:~$

Right, I guess 3385 was from before the crash.

Anyway, after that, retrying the "background make" link DOES NOT work any better, because the lock file is still locked by 3385:

[17-Feb-2013 23:25:18 UTC] 3431: /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock is locked by yushan:3385

I need to blow away all lock files that were stranded by the crash.

wonder@yushan:~$ sudo -u apache rm /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_bikh-on-network.git.lock /usr/local/workingwiki/working-directories/persistent/.pe-git:git@github.com:worden-lee_net-dyn.git.lock /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:worden:Bikh_on_network.lock /usr/local/workingwiki/working-directories/preview/.695171881:pe-ww:lalashan.mcmaster.ca:theobio:worden:Cascades_Paper.lock

Now background make goes through without hanging up on a lock.

I'll abandon the preview session and retry that after the background job is successfully merged, because I don't want to risk restarting the dangerous python script.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-02-17

milestone: lalashan-yushan_site-specific --> refactored_workingwiki_(with_projectengine)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-05-09

We are still getting some of these cases of processes waiting on a lock owned by a long-gone process. I don't know why these locks are persisting, or what to do about it. I created a script to summarize the logfile info on who's waiting on what processes, which makes it a lot easier to manually destroy them, but doesn't shed light on the cause.

We also have a new problem with sleeping mysql connections accumulating and making the database stop working. Some of them are owned by these lock-waiting processes, but I don't know how many.

I have worked out that it's possible to get the client process's hostname and port number from MySQL's "show processlist", and go to the client's host and use sudo netstat -p to find the client's pid, which makes it possible to find out whether it's waiting on a lock or what.

JD suggested having the locking process time out after 5 minutes, which I might agree with.

One issue in reconstructing the sequence of events from the logfiles is that PIDs keep getting reused. I'm wondering whether there's something in /proc or somewhere that I can use to disambiguate different processes with the same number - date/time of creation, maybe.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-05-10

So all this is nice, but basically fcntl + NFS is buggy.

There are other ways to do locking on NFS - mv and symlink operations are atomic, and you could also do it by putting lock data into memcached or even in a mysql table - but they require your process to unlock nicely, and we can't count on that. We need a system that removes the lock when a process dies holding it, and we just don't have one.

So maybe I need subsequent lock-seeking processes to catch when their predecessors have died. It would be a challenge to do that across hosts, but I could check when I happen to be on the same one.

There's an entry in /proc/*/stat that records each processes start time ("the time in jiffies the process started after system boot"). So I could record my pid and starttime in the lockfile, and successors could check whether that pid exists and still has the same starttime, if they're on the same host.

This is independent of Jonathan's suggestion about timing out when waiting.

But it also occurred to me that usually the locking process has a natural expiration time: make processes are expected to timeout after 3 minutes, so I could also put the expiration time in the lockfile and successors would be able to tell that I should be done by now. But maybe the pid/starttime thing is enough.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-05-10

I just cleared all the locks, and now new incidents are accumulating. My script says:

/usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock: yushan-n01:27254 is waiting for yushan-n05:2049 (defunct) since 10-May-2013 04:12:38 UTC yushan-n06:14086 is waiting for yushan-n05:2049 (defunct) since 10-May-2013 04:12:38 UTC

While in yushan-n05's log is

[09-May-2013 20:59:06 UTC] 2049 begin: /theobio/ClawedFrogs/index.php/Special:GetProjectFile?project=Eight_genes_model_dup-del_parameter_estimation_models&make=false&display=raw&filename=bd_dirDinvB-anc1.Rout.png [09-May-2013 20:59:06 UTC] 2049: Locking /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock [09-May-2013 20:59:06 UTC] 2049: /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock is locked by [09-May-2013 20:59:06 UTC] 2049: Locked /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock after 1.317024230957 microseconds [09-May-2013 20:59:06 UTC] 2049: Unlock /usr/local/workingwiki/working-directories/persistent/.pe-ww:lalashan.mcmaster.ca:theobio:ClawedFrogs:Eight_genes_model_dup-del_parameter_estimation_models.lock [09-May-2013 20:59:06 UTC] 2049 end

and after that are several more processes that successfully lock that lockfile, none of them with pid 2049.

So why would later processes on other nodes say they are waiting for that one? Maybe the locks don't propagate as well as they should? Or maybe the locks propagate, but the file contents have a lag - which would mean they get the wrong pid from inside the lockfile... maybe they're actually waiting for something else...

... and how about that? They're not waiting any more.

OK, that's an issue.

I see two issues here that weren't previously logged: this one, where the locks propagate maybe correctly but the file contents, which record the pid of the owning process, have a lag from one NFS client to another. and the one that shows up in the logfile entries above, where the file is empty and so the lock-seeker reports that it is locked by '' rather than an actual pid. That could be a symptom of the same lag issue, though.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-05-10

At this Linux NFS FAQ I notice this line:

To enable NLM lock recovery, ensure your client's host name, as returned by uname -n, matches the host name returned by DNS.

I would have to look into what NLM lock recovery is, but it sounds worth checking.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

Lee Worden - 2013-05-10

The pid-lag issue is [#298] now.

Related

Bugs: #298

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous
  
  Add attachments
  Cancel
  You seem to have CSS turned off. Please don't fill out this field.
  
  You seem to have CSS turned off. Please don't fill out this field.

overly persistent lock?

Group

Searches

Help

#263 overly persistent lock?

Related

Discussion

Related