From: Tom <tom...@gm...> - 2008-05-28 17:27:00
|
I read through the thread posted a while back about zombie processes but I don't see any evidence of any zombies on my system. However, after a few hours after restarting MH (sometimes longer, sometimes shorter), mh takes up around 90% of the cpu. it goes back down after a few seconds then goes back up. Is there a way to see a breakdown of what is running and causing the high utilization? I'm running on CentOS using the latest version from SVN. -------------------- As I was typing this email, I noticed something. I opened up 2 sessions over SSH and while watching TOP, I was also watching the MH output When MH is eating up the CPU, I see the following behavior: The spikes happens every minute when the object states are getting saved. The Saving object states is always followed by a Paused for x seconds 05/28/08 01:10:00 PM: Saving object states ... done 05/28/08 01:10:00 PM Warning, memory leak detected: 109.0 hours: 1.0 MB in 1.3 hours. 78.2 -> 79.1 at 0.72 MB/hour. Total: 0.37 MB/hour 05/28/08 01:10:00 PM Paused for 27 seconds 05/28/08 01:14:00 PM: Saving object states ... done 05/28/08 01:14:00 PM Paused for 27 seconds When I restarted MH, the CPU doesn't spike every minutes and object state saves happen without the "Paused for x seconds" ----------------------- Here's a top when mh is taking up 94.8% of the CPU. Tasks: 64 total, 2 running, 62 sleeping, 0 stopped, 0 zombie Cpu(s): 94.6% us, 5.4% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 515544k total, 428484k used, 87060k free, 111816k buffers Swap: 1048568k total, 236k used, 1048332k free, 148052k cached PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND 26917 root 25 0 94.8 2064:47 13.9 78256 70m 2896 R mh 9218 root 15 0 2.6 0:00.33 0.5 8832 2612 2100 S sshd 9254 root 16 0 2.6 0:00.83 0.2 3836 992 780 R top 1 root 16 0 0.0 0:17.17 0.1 1688 544 468 S init 2 root 34 19 0.0 0:00.02 0.0 0 0 0 S ksoftirqd/0 3 root 5 -10 0.0 0:00.02 0.0 0 0 0 S events/0 4 root 6 -10 0.0 0:00.03 0.0 0 0 0 S khelper 5 root 5 -10 0.0 0:00.00 0.0 0 0 0 S kblockd/0 6 root 15 0 0.0 0:00.00 0.0 0 0 0 S khubd 35 root 15 0 0.0 0:29.98 0.0 0 0 0 S kapmd 38 root 20 0 0.0 0:00.00 0.0 0 0 0 S pdflush 39 root 15 0 0.0 0:16.99 0.0 0 0 0 S pdflush |
From: Tom <tom...@gm...> - 2008-05-30 16:26:14
|
Any ideas? On Wed, May 28, 2008 at 1:26 PM, Tom <tom...@gm...> wrote: > I read through the thread posted a while back about zombie processes but I > don't see any evidence of any zombies on my system. However, after a few > hours after restarting MH (sometimes longer, sometimes shorter), mh takes up > around 90% of the cpu. it goes back down after a few seconds then goes back > up. > > Is there a way to see a breakdown of what is running and causing the high > utilization? > > I'm running on CentOS using the latest version from SVN. > > -------------------- > As I was typing this email, I noticed something. > I opened up 2 sessions over SSH and while watching TOP, I was also watching > the MH output > > When MH is eating up the CPU, I see the following behavior: > The spikes happens every minute when the object states are getting saved. > The Saving object states is always followed by a Paused for x seconds > > 05/28/08 01:10:00 PM: Saving object states ... done > 05/28/08 01:10:00 PM Warning, memory leak detected: 109.0 hours: 1.0 MB in > 1.3 hours. 78.2 -> 79.1 at 0.72 MB/hour. Total: 0.37 MB/hour > 05/28/08 01:10:00 PM Paused for 27 seconds > > 05/28/08 01:14:00 PM: Saving object states ... done > 05/28/08 01:14:00 PM Paused for 27 seconds > > When I restarted MH, the CPU doesn't spike every minutes and object state > saves happen without the "Paused for x seconds" > > ----------------------- > > Here's a top when mh is taking up 94.8% of the CPU. > > Tasks: 64 total, 2 running, 62 sleeping, 0 stopped, 0 zombie > Cpu(s): 94.6% us, 5.4% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si > Mem: 515544k total, 428484k used, 87060k free, 111816k buffers > Swap: 1048568k total, 236k used, 1048332k free, 148052k cached > > PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND > 26917 root 25 0 > 94.8 2064:47 13.9 78256 70m 2896 R mh > 9218 root 15 0 2.6 0:00.33 0.5 > 8832 2612 2100 S sshd > 9254 root 16 0 2.6 0:00.83 0.2 3836 992 780 R top > 1 root > 16 0 0.0 0:17.17 0.1 1688 544 468 S init > 2 root 34 19 0.0 0:00.02 > 0.0 0 0 0 S ksoftirqd/0 > 3 root 5 -10 0.0 0:00.02 0.0 0 0 0 S > events/0 4 > root 6 -10 0.0 0:00.03 0.0 0 0 0 S khelper > 5 root 5 -10 0.0 > 0:00.00 0.0 0 0 0 S kblockd/0 > 6 root 15 0 0.0 0:00.00 0.0 0 0 > 0 S khubd > 35 root 15 0 0.0 0:29.98 0.0 0 0 0 S kapmd > 38 root 20 0 > 0.0 0:00.00 0.0 0 0 0 S pdflush > 39 root 15 0 0.0 0:16.99 0.0 > 0 0 0 S pdflush > > |
From: Gregg L. <gr...@li...> - 2008-05-30 17:43:25
|
Hi Tom, Tom wrote: > Any ideas? ... stuff to consider ... 1) mh may spike CPU if it needs to "catch up" from being starved by anything else running at a higher priority. To rule this out, consider running the command renice to bump up the nice level: renice +5 -u mh Here, the "-u mh" applies a increase to the nice level for all processes owned by mh. For me, this is easier than collecting pids from files. But, if running as root, then you'd have to supply a pid unless you want all processes running as root to be similarly applied. I would definitely do this if I had other stuff running on the same system that should be considered lower priority. Be careful that you don't "overnice" mh and cause key OS processes to be starved. 2) I have an all to often bad habit of "tailing" my console output that is redirected to a log file via the -log argument. Usually, this is because I'm trying to catch or watch certain events for code that I'm working on. If you leave a session up like this for an extended period of time, then I have noticed pauses. 3) I can't quite tell from your discussion below whether the "spiking" is occasional or frequent. Also, is it short-lived or significant duration? Perhaps a better way to get some hint of history is to post your pause log from a restart up to a condition of frequent/recurring CPU spikes so that we can tell if the condition is linearly worsening or seemingly spontaneous. 4) Do you have open browsers to mh's ia5 interface when the spiking is occurring? While debugging the problem, make sure that they are all closed/killed. 5) Consider backing out (disabling) various modules that may be less essential. If you're running/using any of the rrd functionality--particularly graphing--considering disabling those first. 6) You mention a correlation between pausing and saving object states. Try watching the size of the saved_states files to see if they are fluctuating between the point right after restart to whenever you start seeing pausing: ls -l <mh_data_dir>/mh_temp.saved_states* Perhaps something from the above may give you some new insight/info that could lead to a diagnosis. Gregg > On Wed, May 28, 2008 at 1:26 PM, Tom <tom...@gm...> wrote: >> I read through the thread posted a while back about zombie processes but I >> don't see any evidence of any zombies on my system. However, after a few >> hours after restarting MH (sometimes longer, sometimes shorter), mh takes up >> around 90% of the cpu. it goes back down after a few seconds then goes back >> up. >> >> Is there a way to see a breakdown of what is running and causing the high >> utilization? >> >> I'm running on CentOS using the latest version from SVN. >> >> -------------------- >> As I was typing this email, I noticed something. >> I opened up 2 sessions over SSH and while watching TOP, I was also watching >> the MH output >> >> When MH is eating up the CPU, I see the following behavior: >> The spikes happens every minute when the object states are getting saved. >> The Saving object states is always followed by a Paused for x seconds >> >> 05/28/08 01:10:00 PM: Saving object states ... done >> 05/28/08 01:10:00 PM Warning, memory leak detected: 109.0 hours: 1.0 MB in >> 1.3 hours. 78.2 -> 79.1 at 0.72 MB/hour. Total: 0.37 MB/hour >> 05/28/08 01:10:00 PM Paused for 27 seconds >> >> 05/28/08 01:14:00 PM: Saving object states ... done >> 05/28/08 01:14:00 PM Paused for 27 seconds >> >> When I restarted MH, the CPU doesn't spike every minutes and object state >> saves happen without the "Paused for x seconds" >> >> ----------------------- >> >> Here's a top when mh is taking up 94.8% of the CPU. >> >> Tasks: 64 total, 2 running, 62 sleeping, 0 stopped, 0 zombie >> Cpu(s): 94.6% us, 5.4% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si >> Mem: 515544k total, 428484k used, 87060k free, 111816k buffers >> Swap: 1048568k total, 236k used, 1048332k free, 148052k cached >> >> PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND >> 26917 root 25 0 >> 94.8 2064:47 13.9 78256 70m 2896 R mh >> 9218 root 15 0 2.6 0:00.33 0.5 >> 8832 2612 2100 S sshd >> 9254 root 16 0 2.6 0:00.83 0.2 3836 992 780 R top >> 1 root >> 16 0 0.0 0:17.17 0.1 1688 544 468 S init >> 2 root 34 19 0.0 0:00.02 >> 0.0 0 0 0 S ksoftirqd/0 >> 3 root 5 -10 0.0 0:00.02 0.0 0 0 0 S >> events/0 4 >> root 6 -10 0.0 0:00.03 0.0 0 0 0 S khelper >> 5 root 5 -10 0.0 >> 0:00.00 0.0 0 0 0 S kblockd/0 >> 6 root 15 0 0.0 0:00.00 0.0 0 0 >> 0 S khubd >> 35 root 15 0 0.0 0:29.98 0.0 0 0 0 S kapmd >> 38 root 20 0 >> 0.0 0:00.00 0.0 0 0 0 S pdflush >> 39 root 15 0 0.0 0:16.99 0.0 >> 0 0 0 S pdflush >> >> > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > ________________________________________________________ > To unsubscribe from this list, go to: http://sourceforge.net/mail/?group_id=1365 > |
From: Tom <tom...@gm...> - 2008-06-03 15:58:00
|
Gregg Liming wrote: > Hi Tom, > > Tom wrote: > >> Any ideas? >> > > ... stuff to consider ... > > 1) mh may spike CPU if it needs to "catch up" from being starved by > anything else running at a higher priority. To rule this out, consider > running the command renice to bump up the nice level: > > renice +5 -u mh > > Here, the "-u mh" applies a increase to the nice level for all processes > owned by mh. For me, this is easier than collecting pids from files. > But, if running as root, then you'd have to supply a pid unless you want > all processes running as root to be similarly applied. > > I would definitely do this if I had other stuff running on the same > system that should be considered lower priority. Be careful that you > don't "overnice" mh and cause key OS processes to be starved. > > The machine only runs MH.. It doesn't even have any X installed. > 3) I can't quite tell from your discussion below whether the "spiking" > is occasional or frequent. Also, is it short-lived or significant > duration? Perhaps a better way to get some hint of history is to post > your pause log from a restart up to a condition of frequent/recurring > CPU spikes so that we can tell if the condition is linearly worsening or > seemingly spontaneous. > I think I'm on to something.. It seems to start with something at midnight. Here's a grep of my log for Paused.. I had restarted MH the previous day and noticed my first Paused at 12:31:00 AM At first the interval is 15 minutes but as time goes on it gets to a minute between each pause 06/02/08 12:31:00 AM Paused for 3 seconds 06/02/08 12:46:00 AM Paused for 3 seconds 06/02/08 01:01:00 AM Paused for 3 seconds 06/02/08 01:16:00 AM Paused for 3 seconds 06/02/08 01:31:00 AM Paused for 4 seconds 06/02/08 01:46:00 AM Paused for 3 seconds 06/02/08 02:01:00 AM Paused for 5 seconds 06/02/08 02:16:00 AM Paused for 3 seconds 06/02/08 02:31:00 AM Paused for 5 seconds 06/02/08 02:46:00 AM Paused for 4 seconds 06/02/08 03:00:00 AM Paused for 3 seconds 06/02/08 03:01:00 AM Paused for 6 seconds 06/02/08 03:16:00 AM Paused for 4 seconds 06/02/08 03:31:00 AM Paused for 6 seconds 06/02/08 03:46:00 AM Paused for 4 seconds 06/02/08 04:00:00 AM Paused for 3 seconds 06/02/08 04:01:00 AM Paused for 7 seconds 06/02/08 04:16:00 AM Paused for 4 seconds 06/02/08 04:31:00 AM Paused for 7 seconds 06/02/08 04:46:00 AM Paused for 5 seconds 06/02/08 05:00:00 AM Paused for 3 seconds 06/02/08 05:01:00 AM Paused for 8 seconds 06/02/08 05:15:00 AM Paused for 3 seconds 06/02/08 05:16:00 AM Paused for 5 seconds 06/02/08 05:30:00 AM Paused for 3 seconds 06/02/08 05:31:00 AM Paused for 8 seconds 06/02/08 05:34:00 AM Paused for 3 seconds 06/02/08 05:40:00 AM Paused for 3 seconds 06/02/08 05:41:00 AM Paused for 3 seconds 06/02/08 05:44:00 AM Paused for 3 seconds 06/02/08 05:45:00 AM Paused for 3 seconds 06/02/08 05:46:00 AM Paused for 5 seconds 06/02/08 05:48:00 AM Paused for 3 seconds 06/02/08 05:49:00 AM Paused for 3 seconds 06/02/08 05:50:00 AM Paused for 3 seconds > 4) Do you have open browsers to mh's ia5 interface when the spiking is > occurring? While debugging the problem, make sure that they are all > closed/killed. > I have an Audrey which is always on the MH page.. It's been like this for the last few years.. I'll try restarting MH and turning off the Audrey tonight. > 5) Consider backing out (disabling) various modules that may be less > essential. If you're running/using any of the rrd > functionality--particularly graphing--considering disabling those first. > > weather_rrd_upd takes up a lot of the CPU 25543 root 25 0 82.7 0:08.51 1.0 10688 5384 2472 R weather_rrd_upd If I disable that, I know the graphs will not get updated, but will all the other weather data be lost for the time that it's disabled? > 6) You mention a correlation between pausing and saving object states. > Try watching the size of the saved_states files to see if they are > fluctuating between the point right after restart to whenever you start > seeing pausing: > > ls -l <mh_data_dir>/mh_temp.saved_states* > > Perhaps something from the above may give you some new insight/info that > could lead to a diagnosis. > 55554 Jun 3 mh_temp.saved_states 1025 Jun 3 mh_temp.saved_states.persistent 131244 Jun 1 mh_temp.saved_states.unused The sizes stay the same after a restart thanks, tom |
From: Gregg L. <gr...@li...> - 2008-06-03 17:25:18
|
Hi Tom, Tom wrote: > I think I'm on to something.. It seems to start with something at midnight. > Here's a grep of my log for Paused.. > I had restarted MH the previous day and noticed my first Paused at > 12:31:00 AM At first the interval is 15 minutes but as time goes on it > gets to a minute between each pause > > 06/02/08 12:31:00 AM Paused for 3 seconds > 06/02/08 12:46:00 AM Paused for 3 seconds > 06/02/08 01:01:00 AM Paused for 3 seconds > 06/02/08 01:16:00 AM Paused for 3 seconds > 06/02/08 01:31:00 AM Paused for 4 seconds > 06/02/08 01:46:00 AM Paused for 3 seconds > 06/02/08 02:01:00 AM Paused for 5 seconds > 06/02/08 02:16:00 AM Paused for 3 seconds > 06/02/08 02:31:00 AM Paused for 5 seconds > 06/02/08 02:46:00 AM Paused for 4 seconds > 06/02/08 03:00:00 AM Paused for 3 seconds > 06/02/08 03:01:00 AM Paused for 6 seconds > 06/02/08 03:16:00 AM Paused for 4 seconds > 06/02/08 03:31:00 AM Paused for 6 seconds > 06/02/08 03:46:00 AM Paused for 4 seconds > 06/02/08 04:00:00 AM Paused for 3 seconds > 06/02/08 04:01:00 AM Paused for 7 seconds > 06/02/08 04:16:00 AM Paused for 4 seconds > 06/02/08 04:31:00 AM Paused for 7 seconds > 06/02/08 04:46:00 AM Paused for 5 seconds > 06/02/08 05:00:00 AM Paused for 3 seconds > 06/02/08 05:01:00 AM Paused for 8 seconds > 06/02/08 05:15:00 AM Paused for 3 seconds > 06/02/08 05:16:00 AM Paused for 5 seconds > 06/02/08 05:30:00 AM Paused for 3 seconds > 06/02/08 05:31:00 AM Paused for 8 seconds > 06/02/08 05:34:00 AM Paused for 3 seconds > 06/02/08 05:40:00 AM Paused for 3 seconds > 06/02/08 05:41:00 AM Paused for 3 seconds > 06/02/08 05:44:00 AM Paused for 3 seconds > 06/02/08 05:45:00 AM Paused for 3 seconds > 06/02/08 05:46:00 AM Paused for 5 seconds > 06/02/08 05:48:00 AM Paused for 3 seconds > 06/02/08 05:49:00 AM Paused for 3 seconds > 06/02/08 05:50:00 AM Paused for 3 seconds I would agree that this pattern looks interesting. If it were me, I'd try restarting again to see if the same pattern roughly occurs. If it is almost exact, then that should tell you something. I'd probably hold off disabling things for the time being. I find it very interesting that the time-stamp always shows 00 seconds. I went through all of my past pause logs and noticed that never occurred. And, although I do know what caused my pauses (specifically, tailing an ever lengthening log file), the time stamp corresponded to the number of paused seconds. For example: mm/dd/yy hh:mm:03 xM Paused for 3 seconds mm/dd/yy hh:mm:08 xM Paused for 8 seconds I'll try to mull on this a bit more to see if I can explain why yours is different. Just out of curiosity, are you syncing your computers time? If so, how (rdate, ntp, mh's internet time client)? If not, do you know if your clock tends to drift? FWIW: I've occasionally noticed bad behavior for applications that are very time sensitive and a system clock that is allowed to drift. For this reason, I always use ntp/ntpd. I'd probably start looking for possible culprits to things that may be going on that are time based. Consider grep-ing on time_ (or time_now, time_cron, etc. to be more exact) w/i your code dir to get insights. Gregg |
From: Tom <tom...@gm...> - 2008-06-05 14:31:57
|
Gregg Liming wrote: > I would agree that this pattern looks interesting. If it were me, I'd > try restarting again to see if the same pattern roughly occurs. If it > is almost exact, then that should tell you something. I'd probably hold > off disabling things for the time being. > > The same pattern does occur and it starts about 3-4 hours after restarting. I've noted restart times and the first occurrence of "Paused for..." started at 6/3/08 11:55:55 AM 06/03/08 04:01:00 PM Paused for 3 seconds started at 6/4/08 11:29:39 AM 06/04/08 03:01:00 PM Paused for 3 seconds started at 06/04/08 03:58:54PM 06/04/08 08:01:00 PM Paused for 3 seconds > Just out of curiosity, are you syncing your computers time? > .... For this reason, I always use ntp/ntpd. > I also use ntpd > I'd probably start looking for possible culprits to things that may be > going on that are time based. Consider grep-ing on time_ (or time_now, > time_cron, etc. to be more exact) w/i your code dir to get insights. > I saw some code using time_now and time_cron, but they aren't new code.. > Gregg > thanks, tom |
From: Tom <tom...@gm...> - 2008-06-17 15:57:08
|
I'm still stuck.. The patterns I've been seeing still continues every 3-4 hours after restart. does this give any clues? it's part of monitor_memory_leak.log Fri 06/13/08 14:45:56 -- Restarted --. Perl version: 5.008005 Fri 06/13/08 14:47:05 -- Restarted --. Perl version: 5.008005 Fri 06/13/08 15:50:00 1.0 hours: 1.1 MB in 0.9 hours. 38.7 -> 39.8 at 1.26 MB/hour. Total: 1.06 MB/hour Fri 06/13/08 17:20:00 2.5 hours: 0.5 MB in 1.5 hours. 39.8 -> 40.3 at 0.36 MB/hour. Total: 0.65 MB/hour Fri 06/13/08 19:10:00 4.4 hours: 0.6 MB in 1.8 hours. 40.3 -> 40.9 at 0.30 MB/hour. Total: 0.50 MB/hour Fri 06/13/08 19:50:00 5.0 hours: 9.2 MB in 0.7 hours. 40.9 -> 50.1 at 13.85 MB/hour. Total: 2.27 MB/hour Fri 06/13/08 22:40:20 -- Restarted --. Perl version: 5.008005 Sat 06/14/08 00:20:00 1.7 hours: 0.5 MB in 1.5 hours. 39.3 -> 39.8 at 0.34 MB/hour. Total: 0.30 MB/hour Sat 06/14/08 01:50:00 3.2 hours: 0.5 MB in 1.5 hours. 39.8 -> 40.3 at 0.34 MB/hour. Total: 0.32 MB/hour Sat 06/14/08 03:20:00 4.7 hours: 0.5 MB in 1.5 hours. 40.3 -> 40.9 at 0.36 MB/hour. Total: 0.34 MB/hour Sat 06/14/08 05:20:00 6.7 hours: 0.5 MB in 2.0 hours. 40.9 -> 41.4 at 0.25 MB/hour. Total: 0.31 MB/hour Sat 06/14/08 07:10:00 8.5 hours: 0.5 MB in 1.8 hours. 41.4 -> 41.9 at 0.30 MB/hour. Total: 0.31 MB/hour Sat 06/14/08 09:10:00 10.5 hours: 0.6 MB in 2.0 hours. 41.9 -> 42.6 at 0.32 MB/hour. Total: 0.31 MB/hour Sat 06/14/08 09:30:00 10.8 hours: 1.0 MB in 0.3 hours. 42.6 -> 43.6 at 3.01 MB/hour. Total: 0.39 MB/hour Sat 06/14/08 12:10:00 13.5 hours: 6.0 MB in 2.7 hours. 43.6 -> 49.5 at 2.23 MB/hour. Total: 0.76 MB/hour Sat 06/14/08 14:10:00 15.5 hours: 0.5 MB in 2.0 hours. 49.5 -> 50.1 at 0.27 MB/hour. Total: 0.69 MB/hour Sat 06/14/08 15:40:00 17.0 hours: 0.5 MB in 1.5 hours. 50.1 -> 50.6 at 0.35 MB/hour. Total: 0.66 MB/hour Sat 06/14/08 16:00:00 17.3 hours: 3.1 MB in 0.3 hours. 50.6 -> 53.7 at 9.30 MB/hour. Total: 0.83 MB/hour Sun 06/15/08 01:50:00 27.2 hours: 0.5 MB in 9.8 hours. 53.7 -> 54.2 at 0.05 MB/hour. Total: 0.55 MB/hour Sun 06/15/08 03:20:00 28.7 hours: 0.5 MB in 1.5 hours. 54.2 -> 54.7 at 0.35 MB/hour. Total: 0.54 MB/hour Sun 06/15/08 05:10:00 30.5 hours: 0.6 MB in 1.8 hours. 54.7 -> 55.3 at 0.31 MB/hour. Total: 0.52 MB/hour Sun 06/15/08 06:50:00 32.2 hours: 0.5 MB in 1.7 hours. 55.3 -> 55.8 at 0.30 MB/hour. Total: 0.51 MB/hour Sun 06/15/08 08:10:00 33.5 hours: 0.7 MB in 1.3 hours. 55.8 -> 56.4 at 0.49 MB/hour. Total: 0.51 MB/hour Sun 06/15/08 09:50:00 35.2 hours: 0.5 MB in 1.7 hours. 56.4 -> 56.9 at 0.30 MB/hour. Total: 0.50 MB/hour Sun 06/15/08 12:10:00 37.5 hours: 0.8 MB in 2.3 hours. 56.9 -> 57.7 at 0.33 MB/hour. Total: 0.49 MB/hour Sun 06/15/08 14:10:00 39.5 hours: 0.6 MB in 2.0 hours. 57.7 -> 58.3 at 0.31 MB/hour. Total: 0.48 MB/hour Sun 06/15/08 16:10:00 41.5 hours: 0.6 MB in 2.0 hours. 58.3 -> 59.0 at 0.32 MB/hour. Total: 0.47 MB/hour Sun 06/15/08 17:50:00 43.2 hours: 0.5 MB in 1.7 hours. 59.0 -> 59.5 at 0.30 MB/hour. Total: 0.47 MB/hour Sun 06/15/08 19:20:00 44.7 hours: 0.5 MB in 1.5 hours. 59.5 -> 60.0 at 0.36 MB/hour. Total: 0.46 MB/hour Sun 06/15/08 19:28:29 -- Restarted --. Perl version: 5.008005 Sun 06/15/08 20:10:00 0.7 hours: 0.7 MB in 0.5 hours. 37.5 -> 38.3 at 1.42 MB/hour. Total: 1.07 MB/hour Sun 06/15/08 21:50:00 2.4 hours: 0.5 MB in 1.7 hours. 38.3 -> 38.8 at 0.32 MB/hour. Total: 0.54 MB/hour Sun 06/15/08 23:20:00 3.9 hours: 0.5 MB in 1.5 hours. 38.8 -> 39.3 at 0.35 MB/hour. Total: 0.47 MB/hour Mon 06/16/08 01:40:00 6.2 hours: 0.5 MB in 2.3 hours. 39.3 -> 39.9 at 0.23 MB/hour. Total: 0.38 MB/hour Mon 06/16/08 03:10:00 7.7 hours: 0.5 MB in 1.5 hours. 39.9 -> 40.4 at 0.34 MB/hour. Total: 0.37 MB/hour Mon 06/16/08 04:50:00 9.4 hours: 0.5 MB in 1.7 hours. 40.4 -> 40.9 at 0.31 MB/hour. Total: 0.36 MB/hour Mon 06/16/08 06:20:00 10.9 hours: 0.5 MB in 1.5 hours. 40.9 -> 41.4 at 0.34 MB/hour. Total: 0.36 MB/hour Mon 06/16/08 08:10:00 12.7 hours: 0.6 MB in 1.8 hours. 41.4 -> 42.0 at 0.32 MB/hour. Total: 0.35 MB/hour Mon 06/16/08 08:50:00 13.4 hours: 1.0 MB in 0.7 hours. 42.0 -> 43.0 at 1.49 MB/hour. Total: 0.41 MB/hour Mon 06/16/08 11:10:00 15.7 hours: 0.6 MB in 2.3 hours. 43.0 -> 43.6 at 0.26 MB/hour. Total: 0.39 MB/hour Mon 06/16/08 13:20:00 17.9 hours: 8.9 MB in 2.2 hours. 43.6 -> 52.5 at 4.13 MB/hour. Total: 0.84 MB/hour Mon 06/16/08 23:10:00 27.7 hours: 0.6 MB in 9.8 hours. 52.5 -> 53.1 at 0.06 MB/hour. Total: 0.56 MB/hour Tue 06/17/08 00:50:00 29.4 hours: 0.5 MB in 1.7 hours. 53.1 -> 53.7 at 0.31 MB/hour. Total: 0.55 MB/hour Tue 06/17/08 03:10:00 31.7 hours: 0.6 MB in 2.3 hours. 53.7 -> 54.3 at 0.26 MB/hour. Total: 0.53 MB/hour Tue 06/17/08 04:50:00 33.4 hours: 0.5 MB in 1.7 hours. 54.3 -> 54.8 at 0.30 MB/hour. Total: 0.52 MB/hour Tue 06/17/08 06:20:00 34.9 hours: 0.5 MB in 1.5 hours. 54.8 -> 55.3 at 0.35 MB/hour. Total: 0.51 MB/hour Tue 06/17/08 08:10:00 36.7 hours: 0.6 MB in 1.8 hours. 55.3 -> 55.8 at 0.31 MB/hour. Total: 0.50 MB/hour Tue 06/17/08 09:00:00 37.5 hours: 0.5 MB in 0.8 hours. 55.8 -> 56.4 at 0.60 MB/hour. Total: 0.50 MB/hour Tue 06/17/08 11:10:00 39.7 hours: 0.6 MB in 2.2 hours. 56.4 -> 56.9 at 0.27 MB/hour. Total: 0.49 MB/hour Tom wrote: > Gregg Liming wrote: >> I would agree that this pattern looks interesting. If it were me, >> I'd try restarting again to see if the same pattern roughly occurs. >> If it is almost exact, then that should tell you something. I'd >> probably hold off disabling things for the time being. >> >> > The same pattern does occur and it starts about 3-4 hours after > restarting. I've noted restart times and the first occurrence of > "Paused for..." > > started at 6/3/08 11:55:55 AM > 06/03/08 04:01:00 PM Paused for 3 seconds > > started at 6/4/08 11:29:39 AM > 06/04/08 03:01:00 PM Paused for 3 seconds > > started at 06/04/08 03:58:54PM > 06/04/08 08:01:00 PM Paused for 3 seconds > > >> Just out of curiosity, are you syncing your computers time? .... For >> this reason, I always use ntp/ntpd. >> > I also use ntpd >> I'd probably start looking for possible culprits to things that may >> be going on that are time based. Consider grep-ing on time_ (or >> time_now, time_cron, etc. to be more exact) w/i your code dir to get >> insights. >> > I saw some code using time_now and time_cron, but they aren't new code.. >> Gregg >> > > thanks, > tom |
From: Gregg L. <gr...@li...> - 2008-06-18 18:56:51
|
Tom wrote: > I'm still stuck.. The patterns I've been seeing still continues every > 3-4 hours after restart. > > does this give any clues? it's part of monitor_memory_leak.log Wow--that's quite a lot of memory leaking! FYI - the memory monitor runs every 10 minutes--hence the slight pattern of minutes. Unfortunately, it doesn't immediately suggest anything in the way of a pattern (except that it's far too high). Can you either post or send me off-list the contents of your data/code_select.txt ? Gregg |
From: Tom <tom...@gm...> - 2008-06-18 19:05:12
|
> Wow--that's quite a lot of memory leaking! FYI - the memory monitor > runs every 10 minutes--hence the slight pattern of minutes. > Unfortunately, it doesn't immediately suggest anything in the way of a > pattern (except that it's far too high). Can you either post or send me > off-list the contents of your data/code_select.txt ? > > I thought the leak monitor would help pin down which code it causing problems.. It may be something in my code..is there any way to figure out which code (except for the obvious disabling and enabling one at a time.) Here's my code_select.txt It doesn't show my own code..should it? audrey_control.pl audreyspeak.pl benchmarks.pl callerid.pl comic_dailystrips.pl eliza_server.pl insteon_item_commands.pl menu.pl mh_control.pl mh_release.pl mh_sound.pl mhsend_server.pl monitor_memory.pl organizer.pl phone.pl phone_logs.pl photo_index.pl test_xap.pl tv_grid.pl weather_aws.pl weather_rrd_update.pl x10_item_commands.pl x10_reset.pl xAP_command.pl xAP_send.pl thanks, tom |
From: Gregg L. <gr...@li...> - 2008-06-18 19:15:52
|
Tom wrote: > I thought the leak monitor would help pin down which code it causing > problems.. Well, it may be symptomatic; but, the log itself isn't helpful to my eyes. > It may be something in my code..is there any way to figure > out which code (except for the obvious disabling and enabling one at a > time.) No. I will offer to review it as I get the time if you want. If inclined, tar/gzip it up and send it off-list. > Here's my code_select.txt It doesn't show my own code..should it? no. I'll remove the entries that I have enabled as I'm not hemorrhaging memory nor having the other problems that you are. That doesn't mean that what is left is a problem; but might be things to consider. Maybe other users that are fully asymptomatic (both to your pause rate and memory leak rate) can rule out more. Admittedly, I'd be more suspect of user code. > audrey_control.pl > audreyspeak.pl > callerid.pl > comic_dailystrips.pl > mh_release.pl > test_xap.pl > tv_grid.pl > weather_aws.pl > x10_reset.pl BTW: first on my list to ditch permanently would be text_xap.pl. |
From: Gregg L. <gr...@li...> - 2008-06-19 11:28:09
|
Gregg Liming wrote: > Tom wrote: >> Here's my code_select.txt It doesn't show my own code..should it? > > no. I'll remove the entries that I have enabled as I'm not hemorrhaging > memory nor having the other problems that you are. That doesn't mean > that what is left is a problem; but might be things to consider. Still, though, I was leaking some memory in a very noticeable pattern. And, leaking any--IMO--is not good. After some semi-educated guess-work and subsequent experiments, I narrowed it down to code/common/organizer.pl. I'm currently working on a fix as I believe I know where the problem lies. If correct, then you should see an ever increasing memory leak that worsens semi-linearly as a function of the size of your_data_dir/organizer/calendar.tab. It seems to occur because the tab file is being overzealously overwritten by ical2vsdb--which I'll separately need to investigate (possibly some condition is triggering it specific to my case). My tab file had just become noticeable w/ 368 entries. I'll post back once I can confirm having resolved the problem. In the mean-time, if you can easily live w/o organizer.pl, you might try disabling it and continuing to monitor your memory leak patterns. If they go away, then you're similarly affected. If you are a ical2vsdb user and you really need organizer.pl, you could try increasing ical_read_interval (at least to 15 and possibly more--like 30 or 60). Gregg |
From: Gregg L. <gr...@li...> - 2008-06-19 22:02:13
|
Tom wrote: > I changed my read interval to 90 and so far (6 hours after a restart) I > have not had any pauses (I was getting one starting about 3-4 hours > after a restart) > > Here's my info from monitor_memory_leak.log > > Thu 06/19/08 08:55:09 -- Restarted --. Perl version: 5.008005 > Thu 06/19/08 11:10:00 2.2 hours: 1.0 MB in 2.1 hours. 39.6 -> > 40.5 at 0.47 MB/hour. Total: 0.43 MB/hour > Thu 06/19/08 13:10:00 4.2 hours: 8.5 MB in 2.0 hours. 40.5 -> > 49.0 at 4.23 MB/hour. Total: 2.22 MB/hour BTW: this started out as "high CPU usage". And, some of your original data showed mh spiking the CPU. I would bet that's due to the very large number of accrued entries (that shouldn't exist) based upon organizer.pl's previous operation. But, assuming that the above is now partially or full resolved from the recent commit, I have observed that ical2vsdb is a real CPU "pig". It tries to consume as much CPU as it can on my system for at least 2 minutes. Now, obviously some optimization may be appropriate; but, it has made me think that it would be nice to be able to "nice" or "renice" the forked processes from mh so that mh always runs at a higher priority while the children processes are run at a much lower priority. I'll hope to take a look at this as it may have benefits to improving responsiveness beyond this specific situation. Gregg |
From: Gregg L. <gr...@li...> - 2008-06-20 13:15:50
|
Gregg Liming wrote: > BTW: this started out as "high CPU usage". And, some of your original > data showed mh spiking the CPU. I would bet that's due to the very > large number of accrued entries (that shouldn't exist) based upon > organizer.pl's previous operation. But, assuming that the above is now > partially or full resolved from the recent commit, I have observed that > ical2vsdb is a real CPU "pig". It tries to consume as much CPU as it > can on my system for at least 2 minutes. Now, obviously some > optimization may be appropriate; but, it has made me think that it would > be nice to be able to "nice" or "renice" the forked processes from mh so > that mh always runs at a higher priority while the children processes > are run at a much lower priority. I'll hope to take a look at this as > it may have benefits to improving responsiveness beyond this specific > situation. done and committed. The rest of this discussion applies to *nix users only. You can now set the ini parm: process_nice_level=x or set an individual nice level for a given process item: $some_proc_item->nice_level(x) where x is a "nice" level. Usually, nice defaults to 0 for most user programs. Increasing x (a positive number) makes the forked process "nicer" and therefore run with less favorable scheduling. Setting x to a number higher than mh will help to ensure mh runs more favorably than forked processes--particularly for forked processes with notable run times. This only applies to external programs (i.e., those that run out of mh/bin) and not forked functions native to mh. As a test, I set process_nice_level=10 and waited for ical2vsdb to be forked (as it runs as much CPU as is allowed and runs for a long time). I then kicked off a zoneminder montage display and triggered some alarms. I noticed no obvious impact to either mh's operation or zoneminder while ical2vsdb ran with as much CPU as would be allowed. *nix users that experience pause problems and have already ensured that slow I/O devices are running in a proxy may want to experiment with the above new capabilities. Gregg |
From: Tom <tom...@gm...> - 2008-06-19 21:10:26
|
Gregg Liming wrote: > Gregg Liming wrote: > > In the mean-time, if you can easily live w/o organizer.pl, you might try > disabling it and continuing to monitor your memory leak patterns. If > they go away, then you're similarly affected. If you are a ical2vsdb > user and you really need organizer.pl, you could try increasing > ical_read_interval (at least to 15 and possibly more--like 30 or 60). > That seems to be related to what I'm seeing as well.. I changed my read interval to 90 and so far (6 hours after a restart) I have not had any pauses (I was getting one starting about 3-4 hours after a restart) Here's my info from monitor_memory_leak.log Thu 06/19/08 08:55:09 -- Restarted --. Perl version: 5.008005 Thu 06/19/08 11:10:00 2.2 hours: 1.0 MB in 2.1 hours. 39.6 -> 40.5 at 0.47 MB/hour. Total: 0.43 MB/hour Thu 06/19/08 13:10:00 4.2 hours: 8.5 MB in 2.0 hours. 40.5 -> 49.0 at 4.23 MB/hour. Total: 2.22 MB/hour thanks for figuring this out.. it was driving me nuts :-) |
From: Gregg L. <gr...@li...> - 2008-06-19 21:50:57
|
Tom wrote: > Gregg Liming wrote: >> Gregg Liming wrote: >> >> In the mean-time, if you can easily live w/o organizer.pl, you might try >> disabling it and continuing to monitor your memory leak patterns. If >> they go away, then you're similarly affected. If you are a ical2vsdb >> user and you really need organizer.pl, you could try increasing >> ical_read_interval (at least to 15 and possibly more--like 30 or 60). >> > That seems to be related to what I'm seeing as well.. > > I changed my read interval to 90 and so far (6 hours after a restart) I > have not had any pauses (I was getting one starting about 3-4 hours > after a restart) > > Here's my info from monitor_memory_leak.log > > Thu 06/19/08 08:55:09 -- Restarted --. Perl version: 5.008005 > Thu 06/19/08 11:10:00 2.2 hours: 1.0 MB in 2.1 hours. 39.6 -> > 40.5 at 0.47 MB/hour. Total: 0.43 MB/hour > Thu 06/19/08 13:10:00 4.2 hours: 8.5 MB in 2.0 hours. 40.5 -> > 49.0 at 4.23 MB/hour. Total: 2.22 MB/hour > > > thanks for figuring this out.. it was driving me nuts :-) The fix that I've made to code/common/organizer.pl seems to have significantly reduced recurring memory leaks; so, I've just committed it. If you get a chance, would you mind giving it a try? Some comments: 1) organizer.pl will load a rather large array array of hashes that correspond directly to the contents of the vsDB "database" that holds your calendar. 2) the load only occurs after it sense that the vsDB has changed. Changes can either be manual--through the mh web interface--or "automated" via ical2vsdb. 3) The ical_read_interval will cause the ical2vsdb daemon to be run ical_read_interval minutes after startup; if something has changed, then the large array of hashes is created. This can be so large that it "looks" like a sudden memory leak. This is ok so long as it only happens once. The problem that had been occurring is that it kept on adding incrementally the massive array of hashes so that it was accrueing them rather than starting over. So, if you have a reasonably large set of events in your calendar DB and/or your vsDB might be getting regularly "tweaked" from ical2vsdb (sometimes it seems as though the updates are occurring when they shouldn't be; but, that's a different problem). This is obviously a very huge compounding problem because it not only starts chewing up more memory, but it also causes the evaluation in organizer.pl to become exponentially worse because of the huge number of iterations each new time. 4) From the above, it should be clear that there are a number of differing conditions that collectively can contribute to seeing a noticeable memory leak and/or it getting to a point that is causing real problem. However....... This memory leak has existed at least as long as 2.104 and perhaps slightly sooner. If you enable code/common/organizer.pl and your rev is at least as current as 2.104 or later, then you WILL have a memory leak--it just may not be severe. Assuming that the mod that I made does address the problem (my recurring leaks--despite being small have now almost disappeared), then everyone affected is encouraged to consider the update. At a minimum, you should be aware of the problem and assess whether it is a problem. Even if it isn't a problem now, conditions can change that will not be obvious to you and can create relatively sudden apparent changes in operation due to excess memory consumption and the corresponding performance drain. Hope the above helps... Gregg |