From: Neil G. <ne...@ni...> - 2006-08-28 17:32:27
|
Hi all, I am wondering if anybody can help me track down what is causing some apparent spikes in system CPU usage on my server. Here are the current graphs: http://www.crazyguyonabike.com/munin/localdomain/localhost.localdomain.html I just noticed these spikes a couple of weeks ago. I thought at first it was because Google had returned to my site after a long absence, but looking at the server logs, there doesn't seem to be much of a correlation between volume of visits by Googlebot and the spikes. The number of Googlebot visits per hour spans either side of when the spike begins and ends. Very weird. The server is mostly LAMP (Linux 2.6, AMD64, dual Opteron 265, custom kernel) with 4 GB RAM. It's running Apache 1.3.x, MySQL 4.0.x, and Perl 5.8. Sendmail and bind are also on the box, though the email side of things is generally very light. I don't know how much load bind puts on, though I don't think it's very much. Incidentally, I have never seen any graphing of sendmail activity. Also the ntp_goose graph disappeared at some point, not sure why. I've looked at whether I'm doing anything heavy at the times when system CPU usage spikes, but I just can't seem to find anything out of the ordinary. Is there anyone who knows more about Linux internals than I, who can interpret these graphs to give some clue as to where to start looking? Thanks! /Neil |
From: Klaus A. S. <kse...@gm...> - 2006-08-28 18:38:52
|
Neil Gunton <ne...@ni...> wrote: > I am wondering if anybody can help me track down what is causing some > apparent spikes in system CPU usage on my server. Here are the current > graphs: > > http://www.crazyguyonabike.com/munin/localdomain/localhost.localdomain.html Are you running any cron jobs at 6 and 18? Cheers, -- Klaus Alexander Seistrup Copenhagen · Denmark http://magnetic-ink.dk/ |
From: Neil G. <ne...@ni...> - 2006-08-28 19:58:32
|
Klaus Alexander Seistrup wrote: > Neil Gunton <ne...@ni...> wrote: > > >>I am wondering if anybody can help me track down what is causing some >>apparent spikes in system CPU usage on my server. Here are the current >>graphs: >> >>http://www.crazyguyonabike.com/munin/localdomain/localhost.localdomain.html > > > Are you running any cron jobs at 6 and 18? No, crontab has just the usual hourly, daily, weekly and monthly times, plus three small jobs that run every minute (small stuff that wouldn't cause these kinds of spikes). One of those is indexing changes to the website, the other two are simply checking for any emails that need to be sent, or catching email posts to the journals. I don't think these would be the source of the problem. I guess, to clarify, I was curious as to whether there are any clues from the other munin graphs as to something out of the ordinary in a system like this... I am not familiar with exactly how to interpret stuff like "entropy", for example. Thanks again, -Neil |
From: Klaus A. S. <kse...@gm...> - 2006-08-29 04:05:39
|
Neil Gunton wrote: > I guess, to clarify, I was curious as to whether there are any clues > from the other munin graphs as to something out of the ordinary in a > system like this... I am not familiar with exactly how to interpret > stuff like "entropy", for example. The "Fork rate" graph does show some spikes corresponding to the "CPU usage" and "Load average" graphs, but I have no idea what could have caused them, although it all seems to have started in week 34. Some of the "Fork rate" spikes reach more than 100 forks/second, which is pretty high. It doesn't show up on the "Number of Processes" graph, though, so my guess is that it's a burst of short-lived jobs (e.g. sending individual emails, or having 'find' spawning individual jobs on each match). By repeatedly using 'ps' at the right time I'd say you should be able to find the cause. Cheers, -- Klaus A Seistrup København · DK http://seistrup.dk/ |
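The fork rate discussed above can also be sampled by hand while a spike is in progress. The following is a minimal Python sketch, assuming a Linux /proc/stat whose `processes` line holds the cumulative number of forks since boot. The function names are illustrative only and are not part of Munin.

```python
import time


def parse_fork_count(stat_text):
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(before, after, seconds):
    """Forks per second between two readings of the counter."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (after - before) / seconds


def read_fork_count(path="/proc/stat"):
    """Read the current counter from the running system (Linux only)."""
    with open(path) as handle:
        return parse_fork_count(handle.read())


def watch_fork_rate(samples=12, interval=5.0):
    """Print one forks/second reading per interval, `samples` times."""
    previous = read_fork_count()
    for _ in range(samples):
        time.sleep(interval)
        current = read_fork_count()
        rate = fork_rate(previous, current, interval)
        print(f"{time.strftime('%H:%M:%S')} {rate:.1f} forks/s")
        previous = current
```

Running `watch_fork_rate()` in a terminal during a suspected spike gives a finer time resolution than a five-minute graph, which may help line the bursts up with entries in other logs.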
From: Neil G. <ne...@ni...> - 2006-08-29 07:06:59
|
Klaus Alexander Seistrup wrote: > Some of the "Fork rate" spikes reach more than 100 forks/second, which > is pretty high. It doesn't show up on the "Number of Processes" > graph, though, so my guess is that it's a burst of short-lived jobs > (e.g. sending individual emails, or having 'find' spawning individual > jobs on each match). By repeatedly using 'ps' at the right time I'd > say you should be able to find the cause. Yes, you're right about the high fork rates... I can't imagine what would be forking at that kind of rate. A fork is a new process - what on earth would be creating new processes hundreds of times per second??? There are three possibilities that I can see so far: 1. It's some kind of kernel-level bug 2. It's a bug in apache or one of its modules (or my code) 3. My server's been hacked and it's a spammer or whatever. I think option 3 is made less likely by the fact that the eth0 traffic does not spike along with the load spikes. So if it was a spammer, you'd expect the network usage to increase too. Option 1 is scary, but not so easy to test with new kernels - my RAID card needs a special driver from Adaptec (it's AMD64 and 2.6, and the dpt_i2o driver is not included by default). So I need to go ask Adaptec for a version of their driver that will work with whatever kernel version I am testing. Can be done, but I'd rather not be messing with new kernels on a stable production server if I can help it. However if nothing else shows up then this may be the only option I guess. Option 2 seems unlikely, since Apache 1.3 is very stable. Of course, it could be a config issue that only shows up under load. I did notice this evening a small increase in the system CPU load, so I tried a couple of things - tweaking MaxRequestsPerChild, in case this was being caused by some kind of issue with individual Apache processes being around longer than they should. 
I noticed that the 'system' load went down right after I restarted apache, which is what made me think about this. However, frustratingly, I noticed that Googlebot stopped visiting right then too, so it wasn't a real test, since system load might have gone down anyway. I forgot to mention in my first email that I use two custom apache servers on the same box - one is a lightweight front end caching reverse proxy (using Igor Sysoev's mod_accel and mod_deflate), and the "heavy" back end with mod_perl. This has been working just fine for years now, and it allows for high performance with dynamic websites. The only thing that really changed just recently was a huge increase in Googlebot activity - apparently due to some misunderstanding a while back, the good folks at Google had told their bot to not crawl my site. After this was cleared up, it was back - making thousands of requests per hour. However, the server should be able to handle all that without breaking a sweat. Would 'ps' show me by default the useful info I need in order to discover which processes are taking up the high system load at the spike times? I have of course used ps, but only in a very basic way. Looking at the man page, there are (as usual) a whole load of options, so if anyone is an expert with this and knows the right command line opts to use for what I am trying to discover, I'd be glad of advice. Thanks again, -Neil |
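One possible way to approach the 'ps' question is to take two process snapshots a short time apart and look at which commands appeared in between. This is only a sketch, assuming a ps that accepts the POSIX `-e` and `-o` options (with `pid=`, `ppid=` and `comm=` suppressing the header line). The helper names are hypothetical.

```python
import subprocess
from collections import Counter

# One -o per column so that the empty header ("=") applies to each column.
PS_COMMAND = ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "comm="]


def parse_ps(output):
    """Turn ps output into a list of (pid, ppid, command) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def new_processes(earlier, later):
    """Rows whose PID is in the later snapshot but not in the earlier one."""
    known = {pid for pid, _, _ in earlier}
    return [row for row in later if row[0] not in known]


def count_commands(rows):
    """Count how many rows belong to each command name."""
    return Counter(command for _, _, command in rows)


def snapshot():
    """Run ps once and return the parsed rows."""
    result = subprocess.run(PS_COMMAND, capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)
```

Taking `before = snapshot()`, waiting a second, then calling `count_commands(new_processes(before, snapshot()))` shows which command names were started in that second. Processes that start and exit between the two snapshots are missed, so the counts are a lower bound, but the parent PIDs of the ones that are caught still point at whatever is doing the forking.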
From: Douglas M. <dou...@gm...> - 2006-08-29 13:05:00
|
On 8/29/06, Neil Gunton <ne...@ni...> wrote: > > A fork is a new process - what on > earth would be creating new processes hundreds of times per second??? > A mailserver handling hundreds/thousands of emails in a very short time. I've seen it happen before. Spammers know no shame. -- Doug |
From: Klaus A. S. <kse...@gm...> - 2006-08-30 16:13:44
|
Nicolai Langfeldt wrote: > Please fix your mailserver graphs, the best suggestion this far is > that the cause is a spam attempt or some such. The graphs are password protected now, so I cannot conclude anything. However, unless the network throughput is usually very high, bursts of spam would certainly be noticeable on the network graphs, and I can't seem to recall such a correlation on the graphs I saw. Cheers, -- Klaus Alexander Seistrup Copenhagen · Denmark http://magnetic-ink.dk/ |
From: Neil G. <ne...@ni...> - 2006-08-30 17:05:29
|
Klaus Alexander Seistrup wrote: > Nicolai Langfeldt wrote: > > >>Please fix your mailserver graphs, the best suggestion this far is >>that the cause is a spam attempt or some such. > > > The graphs are password protected now, so I cannot conclude anything. > However, unless the network throughput is usually very high, bursts of > spam would certainly be noticeable on the network graphs, and I can't > seem to recall such a correlation on the graphs I saw. > > Cheers, > Sorry about the password protection. I was tweaking my apache configs and accidentally re-included munin in the restricted access list. I've taken it off again (for now - after a week or two I'll probably restrict it again, once this thread is done). I don't think spam is the problem here, since (as I previously mentioned), the mail log doesn't show anything really out of the ordinary, and nor does the eth0 traffic. Although I guess you don't need a lot of traffic in order for there to be a lot of attempted connections, each of which could cause forking. There are always attempted connections coming in from all over, but I use various blacklists to block some, and spamassassin to catch others. I don't see particular spikes in spam activity in the mail log during these times. I was able to get the sendmail statistics working last night. Apparently the sendmail.conf directives for producing stats weren't enabled. I can now see some mail stats in munin (though not the mqueue, but that could be some other issue - there are four files in the queue directory, but they start with Dfk rather than Q, as the munin plugin seems to look for). Also, last night I compiled my own perl, without threading, which I thought might influence things possibly. 
The reason I did this was because of 1) the extremely high fork rate during the CPU spikes, 2) the stock perl was built with threading, and 3) in the apache mod_perl error logs there was an entry "A thread exited while 2 threads were running" every time a child process would terminate. I have no idea why there would be more than one thread, I certainly don't start any, but perhaps this was some kind of clue - more threads than expected, high fork rate... it was worth a try. So then I discovered the dark side of Debian package management - perl is used by a hell of a lot of packages, and when I installed my own version built from source, it broke apt-get. So I ended up putting the old perl back in /usr/bin/perl for regular use, but building the mod_perl against the new perl. At least, I think that's what's happening now. At least apt-get works again, and I don't see those thread messages in my error log. I can see the fork rate creeping up around 30 forks per second now, and the CPU usage is small but noticeable too. Googlebot is visiting. Although I could see very similar numbers of requests for pages from Googlebot both during and after the previous CPU spikes, I now believe that they do have something to do with apache load. Why the CPU load suddenly drops off sometimes, while the traffic is still high, is still a mystery. I have tweaked the MaxRequestsPerChild, but it doesn't seem to affect much except for increasing the memory usage the longer the children are kept around. This is to be expected with mod_perl, because due to the way perl works, over time the children are able to share less and less memory. So it's good to have the children die fairly quickly for mod_perl (I have MaxRequestsPerChild set to 100 currently, it was 30 before). I'll let you know if I find anything, but at present I am still stumped as to exactly why this is happening... /Neil |
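A rough back-of-the-envelope check of how many forks MaxRequestsPerChild recycling could account for. This is a sketch only: the request volume below is an assumed figure chosen to match "thousands of requests per hour", not a number measured on this server, and the function name is illustrative.

```python
def recycle_forks_per_second(requests_per_hour, max_requests_per_child):
    """Forks per second needed only to replace children that reach the limit."""
    if max_requests_per_child < 1:
        # A value of 0 means "never recycle" in Apache, so there is nothing to estimate.
        raise ValueError("max_requests_per_child must be at least 1")
    children_replaced_per_hour = requests_per_hour / max_requests_per_child
    return children_replaced_per_hour / 3600.0


# ASSUMPTION: 10,000 requests per hour is a made-up illustrative volume.
ASSUMED_REQUESTS_PER_HOUR = 10_000

for limit in (30, 100):
    rate = recycle_forks_per_second(ASSUMED_REQUESTS_PER_HOUR, limit)
    print(f"MaxRequestsPerChild={limit}: about {rate:.3f} forks/s from recycling")
```

Under that assumption both settings come out well below one fork per second, so child recycling on its own would not explain readings of 30 forks per second or more. Something else would have to be forking, either inside the request handling or elsewhere on the box.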
From: Mike B. <mb...@te...> - 2006-08-29 12:34:28
|
Hi, On my system I noticed that my fork rates rumble around 20 and my spikes in fork rate, sometimes over 100, correspond to spikes in rejections in my exim stats graph. Sincerely, Mike -- Mike Brandonisio * Web Hosting Tech One Illustration * Internet Marketing tel (630) 759-9283 * e-Commerce mb...@ji... * http://www.jikometrix.net JIKOmetrix - Reliable web hosting On Aug 28, 2006, at 11:05 PM, Klaus Alexander Seistrup wrote: > > The "Fork rate" graph does show some spikes corresponding to the "CPU > usage" and "Load average" graphs, but I have no idea what could have > caused them, although it all seems to have started in week 34. > > Some of the "Fork rate" spikes reach more than 100 forks/second, which > is pretty high. It doesn't show up on the "Number of Processes" > graph, though, so my guess is that it's a burst of short-lived jobs > (e.g. sending individual emails, or having 'find' spawning individual > jobs on each match). By repeatedly using 'ps' at the right time I'd > say you should be able to find the cause. |
From: Neil G. <ne...@ni...> - 2006-08-29 21:07:57
|
Nicolai Langfeldt wrote: > Please fix your mailserver graphs, the best suggestion this far is that > the cause is a spam attempt or some such. I have no idea why the sendmail graphs are not working. I mentioned this in the original email, but nobody has offered any suggestions. There is no sign of unusual activity in the mail log file. Is it likely that spam attempts could be the cause of the spikes given this? I also have both tripwire and chkrootkit on the server, and nothing appears to be amiss. >>2. It's a bug in apache or one of its modules (or my code) > > You're not running an apache plugin it seems. Not sure what you mean by this, or how you can even tell what modules I have installed. Or are you referring to something other than modules? /Neil |
From: Nicolai L. <ja...@li...> - 2006-08-29 22:07:07
|
Neil Gunton wrote: > Nicolai Langfeldt wrote: >> Please fix your mailserver graphs, the best suggestion this far is that >> the cause is a spam attempt or some such. > > I have no idea why the sendmail graphs are not working. I mentioned this > in the original email, but nobody has offered any suggestions. Please check our FAQ at http://munin.projects.linpro.no/wiki/faq, the answer for "Why are the graphs for plugin xyz blank?" may help. >>> 2. It's a bug in apache or one of its modules (or my code) >> >> You're not running an apache plugin it seems. > > Not sure what you mean by this, or how you can even tell what modules I > have installed. Or are you referring to something other than modules? Referring to a munin apache plugin. Which you're not running according to your munin page. Nicolai |
From: Douglas M. <dou...@gm...> - 2006-08-29 04:07:34
|
On 8/28/06, Neil Gunton <ne...@ni...> wrote: > > I just noticed these spikes a couple of weeks ago. I thought at first it > was because Google had returned to my site after a long absence, but > looking at the server logs, there doesn't seem to be much of a > correlation between volume of visits by Googlebot and the spikes. The > number of Googlebot visits per hour spans either side of when the spike > begins and ends. Very weird. > You might want to check system and mail logs, as well as the crontab log for spikes of activity around those times. -- Doug |
From: Neil G. <ne...@ni...> - 2006-08-29 07:07:32
|
Douglas Muth wrote: > On 8/28/06, Neil Gunton <ne...@ni...> wrote: > >>I just noticed these spikes a couple of weeks ago. I thought at first it >>was because Google had returned to my site after a long absence, but >>looking at the server logs, there doesn't seem to be much of a >>correlation between volume of visits by Googlebot and the spikes. The >>number of Googlebot visits per hour spans either side of when the spike >>begins and ends. Very weird. >> > > > You might want to check system and mail logs, as well as the crontab > log for spikes of activity around those times. Yup, nothing at all in those logs to indicate anything out of the ordinary. |
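Checking a log for spikes "around those times" is easier with per-hour counts that can be laid next to the graphs. Below is a small Python sketch, assuming access-log lines carry an Apache-style timestamp such as `[28/Aug/2006:17:32:27 -0500]`. The actual log format on this server is not shown in the thread, so the pattern and the function name are assumptions.

```python
import re
from collections import Counter

# Apache common/combined log timestamp: [dd/Mon/yyyy:HH:MM:SS +zzzz]
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):(\d{2}):\d{2}:\d{2} [+-]\d{4}\]")


def requests_per_hour(lines, needle=""):
    """Count log lines per (date, hour); keep only lines containing `needle`."""
    counts = Counter()
    for line in lines:
        if needle and needle not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

For example, `requests_per_hour(open("access_log"), "Googlebot")` (with a hypothetical log path) gives Googlebot hits per hour, which can then be compared directly against the hours in which the CPU and fork-rate graphs rise and fall.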
From: Mike B. <mb...@te...> - 2006-08-29 12:44:14
|
Hi, You might also try using 'top' around the time the spikes occur and watch. When top is running, try Shift+P to sort by CPU usage. It should put the most active process on top. If you want to try writing 'top' output to a file, stretch the window really long and run top > top.txt. You'll have to CTRL+C out of it, but it will write the window to a file for each update of the screen. You will not see the updates, but you can later review them in a text editor. Sincerely, Mike -- Mike Brandonisio * Web Hosting Tech One Illustration * Internet Marketing tel (630) 759-9283 * e-Commerce mb...@ji... * http://www.jikometrix.net JIKOmetrix - Reliable web hosting On Aug 28, 2006, at 12:32 PM, Neil Gunton wrote: > Hi all, > > I am wondering if anybody can help me track down what is causing some > apparent spikes in system CPU usage on my server. Here are the current > graphs: > |
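As an alternative to redirecting the interactive display, which may leave terminal control sequences in the file, procps top has a batch mode (`-b`) with a fixed number of iterations (`-n`) and a delay between them (`-d`). A minimal Python sketch that records such a run to a file follows. The function names and the default values are illustrative assumptions.

```python
import subprocess


def top_batch_command(iterations, delay_seconds):
    """Build a top command line for non-interactive (batch) output."""
    if iterations < 1 or delay_seconds < 1:
        raise ValueError("iterations and delay_seconds must be at least 1")
    return ["top", "-b", "-n", str(iterations), "-d", str(delay_seconds)]


def record_top(path, iterations=60, delay_seconds=5):
    """Write `iterations` plain-text top screens to `path`, then exit."""
    with open(path, "w") as handle:
        subprocess.run(
            top_batch_command(iterations, delay_seconds),
            stdout=handle,
            check=True,
        )
```

With the defaults, `record_top("top.txt")` covers about five minutes and stops on its own, so it can be started shortly before a spike is expected and reviewed afterwards in a text editor.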
From: Nicolai L. <ja...@li...> - 2006-08-29 21:02:28
|
Neil Gunton wrote: > Yes, you're right about the high fork rates... I can't imagine what Please fix your mailserver graphs, the best suggestion this far is that the cause is a spam attempt or some such. > would be forking at that kind of rate. A fork is a new process - what on > earth would be creating new processes hundreds of times per second??? > > There are three possibilities that I can see so far: > > 1. It's some kind of kernel-level bug Not likely IMHO. > 2. It's a bug in apache or one of its modules (or my code) You're not running an apache plugin it seems. Nicolai |