Sometimes I wait a long time for a page on lalashan. When I don't want to wait, I sometimes hit Escape to stop the page loading, then reload, and get it right away. Because I'm at the far end of a long network connection (California <-> Ontario), I haven't been sure whether the request might be getting lost before it even reaches the wiki software. I've asked others to give me any relevant info, but as far as I can recall haven't gotten anything. So I've been thinking of it as a server performance issue that may or may not be related to my code, and that should be addressed at the system-administration level. I asked the RHP sysadmin team to try to diagnose lalashan's performance issues; they tweaked something and decided they were done, and nobody followed up.
Anyway, I recently had a long-ish wait for two pages on yushan, so I'm thinking it may be the same issue. If so, that seems to increase the likelihood that it's related to the WW code, because the two servers are configured quite differently - though both systems run the same MediaWiki code base and mostly the same LocalSettings configuration, so it could be in there as well.
But I'm wondering how I would diagnose whether there's a point of resource contention or something in my code.
I'm thinking putting PEMessage::debug_message( "time: " . time() ); at various points in the code would be sufficient to document whether there's a long delay at some particular point in the execution. Then I would have to wait until I get a slow response and go look in the log file.
No, I would use wfDebug(), because I want to start from the very beginning of the MW processing, not just the part where WW and PE are involved.
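For concreteness, the kind of call I have in mind is just a timestamp line written through MW's debug log (a minimal sketch; the checkpoint label is mine):

    // one-second granularity is plenty for hunting a multi-minute delay
    wfDebug( "WW timing: " . time() . " checkpoint A\n" );

Since wfDebug() appends to the file named by $wgDebugLogFile, the checkpoints from a slow request can be read back in order afterward.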
I am still using lalashan and this is still bothering me (and everyone else). I feel like it might be worth putting a bit of my time into trying to find out whether the holdup is within the PHP code or somewhere else in the browser-server-browser round trip.
Am looking into where wfDebug() is defined and how early I can insert a call to it in the MW code.
wfDebug() is defined in GlobalFunctions.php, as is wfErrorLog(), which does the actual writing (given the debug-logging configuration I'm using); it looks like I can use wfDebug() as soon as that file's been read in. It's read in by Setup.php, which is read in at the end of WebStart.php. So the best place to start logging is probably in index.php, just after it reads in WebStart.php.
I have added commands to log the time (to the nearest second) near the beginning and at the end of index.php on lalashan. When I get a very slow response, hopefully I'll be able to examine the log and see whether the php took a long time to execute or the problem happened somewhere else. If the start and end times are close together, I'll have to turn to error_log() to check the possibility that the part of index.php before the first log command is slow. If the php is slow, on the other hand, I can go on to add more time checks to find out what parts of it are slow.
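Roughly what the instrumentation looks like, as a sketch (the labels are mine, and MW's actual dispatch code in between is elided):

    require( dirname( __FILE__ ) . '/includes/WebStart.php' );  // reads Setup.php, which loads GlobalFunctions.php
    wfDebug( "timing: index.php begin " . time() . "\n" );      // earliest point wfDebug() is usable

    // ... MediaWiki's normal request handling ...

    wfDebug( "timing: index.php end " . time() . "\n" );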
Interesting - I'm waiting for a ManageProject page right now, 3 minutes and counting, and the log says MW finished serving the page within the first 5 seconds, then handled the subsequent requests for two JavaScript resources within the same second. The browser also reported that it was getting the MathJax resource from its server, and then sat at "Waiting for lalashan.mcmaster.ca" for 3 minutes, for no clear reason.
And after waiting all that time, it still can't make my project file successfully! :(
Had a slow response on a directory listing; had trouble sorting out the log statements because a series of requests from the MS Bing web spider is mixed in with them. On a hunch I added timing statements just before and after each call to ProjectEngine.
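That is, something like this around each call site (a sketch - the method name call_pe() is a stand-in for whatever the actual PE entry points are):

    wfDebug( "timing: before PE call " . time() . "\n" );
    $result = $pe->call_pe( $request );  // hypothetical PE call site
    wfDebug( "timing: after PE call " . time() . "\n" );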
Made a change in PEAPI.php: there's special behavior when Special:GetProjectFile is called with display=raw or display=download, in which case PE outputs the file contents directly and calls exit() rather than returning to the caller, because there's no need to finish constructing and outputting a wiki page. There should also be no need to unlock the various lock files that PE has locked, because exit() takes care of that - but it's better to do it explicitly, and who knows, maybe it makes a difference. For that matter, maybe exiting from MW makes a difference too. But one thing at a time: now it unlocks the lock files before exiting.
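The change amounts to something like this in the display=raw / display=download path (a sketch, assuming PE keeps its open lock handles in an array; the property name is hypothetical):

    foreach ( $this->lock_handles as $fh ) {
        flock( $fh, LOCK_UN );  // release the advisory lock explicitly
        fclose( $fh );
    }
    exit();  // exit() would release them anyway, but better not to depend on that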
OMG, I have a screenshot from Firebug - the GetProjectPage took 3 seconds to complete, and then it's the User:WuLi/monobook.css resource that hangs for 3 minutes, and for some reason nothing can display in the browser until it's done. What is taking it so long?
Screenshot of Firebug with slow resource
Just caught another one - this time it's commonPrint.css that took 4m 40s instead of the usual 400ms or so.
Looking through the debug statements. Firstly, a sort of oddity. The wiki page itself took 5 seconds to run index.php according to the timing checks (which could be as little as 4.1 seconds, I guess, given the one-second granularity) - but Firebug reports 1.06s connecting, 3.89s waiting, and 6.97s receiving, and I had the strong impression that MW does all its computing before outputting anything, because the parser can alter the content of the HTTP headers - i.e., the waiting phase should have been longer.
No - I think that 5s includes both computing and outputting. The time from the first timecheck to the penultimate one - the final PE call - is only 3s. So Firebug's waiting phase ends sometime after that 3s, when the browser first receives something, and from that point it's 6.97s until it receives the last packet. Which implies either that the last packet went out significantly before it was received, or that apache kept sending for a while after MediaWiki exited.
As to the main question: no timing information was logged for http://lalashan.mcmaster.ca/theobio/worden/skins/common/commonPrint.css?164, because that's not a call to the wiki code at all - it's a plain file served directly by apache. Which means I'm not sure how to diagnose why it took more than 4 minutes to serve.
I'm wondering if it's just due to plain old swapping. When lalashan's RAM is in demand, apache processes may get swapped out now and then, and take a while to get back into execution.
Whatever it is, the fact that it was a flat file that was slow this time makes it seem very unlikely that the delays are caused by the wiki software! (Or at least, unlikely that all of them are...)
I'm going to assume it's a swapping issue. We're moving all our wikis from the lalashan server to a bigger one, yushan, and if swapping is the culprit that should help. There are still ways to exhaust any amount of virtual memory, though, so it would probably be wise to add a setrlimit() call to pe-make.c to put an upper limit on WW jobs' virtual memory usage. To be honest, I'm putting off doing that because I'm curious to see how much moving to a server with more RAM helps before I add another confounding factor.
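For the record, the change I have in mind for pe-make.c would look something like this (a sketch in C; the 2 GiB cap is an arbitrary placeholder, and exactly where the call belongs depends on how pe-make sets up the job):

    #include <stdio.h>
    #include <sys/resource.h>

    /* cap the job's virtual memory so a runaway WW job can't exhaust swap */
    static void limit_job_memory( rlim_t bytes )
    {
        struct rlimit rl = { bytes, bytes };  /* soft and hard limits */
        if ( setrlimit( RLIMIT_AS, &rl ) != 0 )
            perror( "setrlimit(RLIMIT_AS)" );
    }

    /* e.g. in the child process, just before exec'ing the job:
       limit_job_memory( (rlim_t)2048 * 1024 * 1024 ); */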
Just observed a 5 minute response time for a page on yushan. Will upload a screenshot, and will aim to put in the same timing checks I used on lalashan, to see whether there's a sticky point in the software.
Screenshot of Firebug showing slow response time on a yushan wiki.
Another slow response from yushan, this time on a static resource, not a script execution.
And almost immediately after that, I get stuck waiting 5 minutes for another page, and this time the holdup is on user.gif - a tiny static image file that should be completely trivial to serve. I've uploaded a screenshot for this one too. Oddly, the one a few minutes ago took 5m8s, and this one took 5m14s. Coincidence? Or is there something with a 5-minute timeout? (For what it's worth, Apache's default Timeout directive is 300 seconds, so a connection that stalls server-side could plausibly get cut off right around the 5-minute mark.)
The top command on yushan indicates no WW-related processes running, by the way. The only process doing any meaningful amount of work is an rsync job owned by root.
Attaching another screenshot of a slow page, this one in a Chromium browser. In this case it looks like the browser waited 5 minutes for some of the essential CSS resources, then gave up and displayed the page without them, which is why it doesn't have the proper formatting. As the user, I was stuck looking at a blank white browser page for the full 5 minutes before I got to see the wiki page - which was itself a trivial request, and was in fact at the browser within the first 2 seconds.
Forgot to say: while I was waiting out that 5 minutes, I logged into yushan and ran top, which reported that the only jobs using nonzero CPU were five nfsd processes, at 1% CPU each.
5 minute wait for CSS resources on yushan in Chromium, 4-15-2012
5m1s wait for image on yushan (which is now also known as lalashan), in Firefox, 6-4-2012
OK, I've just uploaded another screenshot documenting a 5m1s wait for a static resource on yushan (which is now also known as lalashan). Here is a direct link to the screenshot: https://sourceforge.net/tracker/download.php?group_id=366300&atid=1527385&file_id=445221&aid=3473259. This makes at least three documented cases of the web server taking just over 5 minutes to return a simple file - one that DOES NOT go through php or any other script, let alone the wiki software; it's just a file - when it should take at most a few seconds.
Closing and reopening as https://sourceforge.net/p/workingwiki/bugs/246/ to start fresh, clear of the pile of old comments.