From: Stephan D. <ste...@gm...> - 2003-08-07 13:59:42
On Thursday 07 August 2003 15:10, Thomas E Jenkins wrote:
> For what it's worth I've had the same problems. It always follows my
> other problem, MySQL throwing (2013, 'Lost connection to MySQL server
> during query').

Once in a while, I've seen the same error message. Interestingly, this
was always in the early morning, and the problem went away after a
while. It looked as if the connection had run into a timeout overnight.
Maybe it's possible to check for this specific error message in the
DBPool.py code and reset all connections if this happens.

Other than that, I've seen several large processes (about 200MB) that
appeared and didn't go away. In that case there was no convenient error
message, but I suspect it has something to do with swish-e, an external
indexer that runs in an os.system call once in a while. Then again,
this is not consistent behaviour and happens only very seldom; it
probably has nothing to do with Webware at all and more to do with a
multithreaded environment.

> I get the same symptoms with the webkit still
> listening but not responding. I have been unable to reproduce this
> problem reliably, it does happen about once a day.
>
> Date: Thu, 07 Aug 2003 07:55:16 -0400
>
> On Wed, 2003-08-06 at 22:07, Hancock, David (DHANCOCK) wrote:
> > Adam: Thanks for the additional information. Stephan Diehl also has
> > seen this situation on his systems.
> >
> > I agree about the gap in the PIDs, but most of the time, they're
> > contiguous. We do sort of a "heartbeat" ping on our servers with an
> > HTTP request at least every 5 minutes, which is how we notice the
> > problem. We've got two machines running Apache and WebKit,
> > load-balanced, but each gets hit pretty often. There's a LOT of
> > memory on these machines (2GB physical); we've typically got 500MB
> > physical free and swap generally shows 0K used. We'll start capturing
> > memory data to see if we really are using some swap space.
> >
> > My understanding of swapping (which, granted, is apt to be faulty) is
> > that Linux isn't apt to swap something to disk while there's unused
> > physical memory.
> >
> > We are using mod_webkit, and even with the WebKit processes wedged,
> > the port (we're using 8086) is still listening, just not responding.
> >
> > If we were able to reproduce this situation on our development or test
> > systems, we could use the debugger to find out more about what's going
> > on, but in production, our first priority is to get the system
> > responding again.
> >
> > If/when I learn more, I'll follow up to the list. And if anybody else
> > has some ideas, I'd be grateful to hear them.
> >
> > Cheers!
> > --
> > David Hancock | dha...@ar... | 410-266-4384
> >
> > -----Original Message-----
> > From: Adam Kerrison [mailto:ad...@ti...]
> > Sent: Wednesday, August 06, 2003 10:26 AM
> > To: web...@li...
> > Subject: RE: [Webware-discuss] RE: Anyone seen WebKit
> > processes going into a weird state?
> >
> >
> > I can't say I've experienced this behaviour directly but few
> > points:
> >
> > - Process name in brackets does mean "swapped to disk"
> > probably because the process has been inactive for a while
> > (seems likely!)
> >
> > - The gap in the PID could just be that another process
> > started at that time - you can't rely on the PID's being
> > contiguous
> >
> > - I have had problems where I had to kill threaded apps
> > when the code raises an exception. In SOME cases the thread
> > dies and the application stops responding (depends a lot on
> > how the app is designed). I don't think I've seen this
> > specifically with Webware but if the socket handler dies then
> > the other threads will be waiting for things that will never
> > happen (and the process will be swapped out eventually).
> > I am assuming a lot about how the AppServer is working - I don't
> > know that this is right but I'm sure someone will correct
> > me :-)
> >
> > If you're using mod_webkit - and assuming that it maintains a
> > connection from apache to webkit - you should be able to see
> > this connection via netstat. If the socket handler has died
> > then the socket may have gone. Using gdb you should be able to
> > see the threads running and the state but that probably less
> > useful in python.
> >
> > Not sure that this helps or not - might be a red herring
> >
> > Adam
> > -----Original Message-----
> > From: Hancock, David (DHANCOCK)
> > [mailto:DHA...@ar...]
> > Sent: 06 August 2003 13:28
> > To: 'web...@li...'
> > Subject: [Webware-discuss] RE: Anyone seen WebKit
> > processes going into a weird state?
> >
> >
> >
> > Sorry to be replying to my own post, but I haven't
> > seen any list traffic related to my question below, so
> > maybe it didn't get out to the list. The situation
> > described below has occurred several times this week,
> > and in most cases there is a gap in the process
> > numbering. Every other time I've looked, the "python
> > Launch.py ThreadedAppServer" process numbers are
> > sequential, with no gaps. They must start up very
> > quickly. In the list below, there is a gap (25802 is
> > missing).
> >
> > I'm grasping at straws here. I think that the process
> > id in brackets with no command line means that the
> > process is swapped to disk, but I'm not sure about
> > that. When we see the processes looking like they do
> > below, they really ARE wedged, though, and require
> > manual termination.
> >
> > Cheers!
> > --
> > David Hancock | dha...@ar... | 410-266-4384
> >
> > -----Original Message-----
> > From: Hancock, David (DHANCOCK)
> > Sent: Friday, August 01, 2003 4:57 PM
> > To: web...@li...
> > Subject: Anyone seen WebKit processes
> > going into a weird state?
> >
> > Several times a week on our production
> > systems, we're seeing our WebKit processes
> > (normally entitled "python Launch.py
> > ThreadedAppServer") lose their command lines
> > in the output from ps. They're also well
> > wedged, and the processes need to be killed by
> > hand to clear this situation. Has anybody
> > else seen this and have some ideas to help us
> > troubleshoot? For now, we're detecting the
> > situation with automated monitoring (and
> > process-killing and webkit-restarting), but
> > we'd sure like to know how we can prevent it,
> > not just work around it.
> >
> > Output from ps auxww:
> >
> > adc  25799  0.1  1.6  130288  34252  ?  SN  Jul28  10:04  [python]
> > adc  25800  0.0  1.6  130288  34252  ?  SN  Jul28   0:00  [python]
> > adc  25801  0.0  1.6  130288  34252  ?  SN  Jul28   2:52  [python]
> > adc  25803  0.0  1.6  130288  34252  ?  SN  Jul28   1:37  [python]
> > adc  25804  0.0  1.6  130288  34252  ?  SN  Jul28   2:17  [python]
> > adc  25805  0.0  1.6  130288  34252  ?  SN  Jul28   1:37  [python]
> > adc  25806  0.0  1.6  130288  34252  ?  SN  Jul28   1:45  [python]
> > adc  25807  0.0  1.6  130288  34252  ?  SN  Jul28   1:27  [python]
> > adc  25808  0.0  1.6  130288  34252  ?  SN  Jul28   1:51  [python]
> > adc  25809  0.0  1.6  130288  34252  ?  SN  Jul28   1:08  [python]
> > adc  25810  0.0  1.6  130288  34252  ?  SN  Jul28   3:37  [python]
> >
> > Our setup includes:
> >
> > Python 2.2
> > Webware 0.8
> > RedHat Linux 7.3
> > A couple C extensions: DCOracle2 and
> > pymqi (interface to IBM's MQSeries)
> >
> >
> > Thanks in advance for any ideas and
> > assistance.
> >
> > P.S. We had an extreme example of something
> > similar several months ago, but even the
> > "[python]" was missing from the ps output.
> > Thus, it didn't look like WebKit was running
> > at all, but a start attempt couldn't bind to
> > the port. We could only find the culprit
> > process with "netstat -anp | grep 8086" run as
> > root. I don't know if that failure is
> > related, though, it was just weird.
> >
> > Cheers!
> > --
> > David Hancock | dha...@ar... | 410-266-4384
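
A note on the DBPool.py idea in the reply at the top of this message: a minimal sketch of detecting the lost-connection error and reconnecting might look like the following. It does not use DBPool's real interface. `ReconnectingConnection`, `is_lost_connection` and the `connect` factory are hypothetical names introduced only for illustration, and the code is written for current Python rather than the Python 2.2 mentioned in the thread. It assumes a DB-API driver (such as MySQLdb) that raises an exception whose first argument is the MySQL client error code: 2013 as quoted above, plus 2006 ("MySQL server has gone away"), which is the closely related code.

```python
# MySQL client error codes that indicate a dropped connection.
# 2013: lost connection during query (the error quoted in this thread)
# 2006: server has gone away (closely related; included as an assumption)
LOST_CONNECTION_CODES = {2006, 2013}


def is_lost_connection(exc):
    """Return True if the exception looks like a dropped MySQL connection."""
    return bool(exc.args) and exc.args[0] in LOST_CONNECTION_CODES


class ReconnectingConnection:
    """Hypothetical wrapper: reconnect once when the connection was lost.

    `connect` is any zero-argument callable returning a DB-API connection.
    """

    def __init__(self, connect):
        self._connect = connect
        self._conn = connect()

    def run(self, sql, params=()):
        for attempt in (1, 2):
            try:
                cursor = self._conn.cursor()
                cursor.execute(sql, params)
                return cursor.fetchall()
            except Exception as exc:
                # Give up on the second failure or on any unrelated error.
                if attempt == 2 or not is_lost_connection(exc):
                    raise
                # Replace the stale connection and try once more.
                self._conn = self._connect()
```

Retrying like this is only safe for statements that can be repeated without side effects (plain SELECTs, for example). For a pool, the same check could instead trigger discarding every pooled connection, which is closer to what was suggested above.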