From: Clark C . E. <cc...@cl...> - 2001-10-22 18:52:15
|
About two months ago, I was having problems where someone started to upload a file and canceled halfway through. This causes the webware server to abort and stop processing clients. I have a similar problem...

Suppose I have two servlets, A and B. Servlet A executes a long-running query against PostgreSQL (1-2 seconds) and returns quite a bit of information. Servlet B is much smaller. Also suppose that I have a pipe with some pretty decent latency. Now suppose a user using IE Explorer or Netscape on either Windows or Macintosh hits servlet B. Immediately after clicking on a link for this servlet (as the page starts to return) they click on servlet A. Bang. No logs, nothing. The back-end webware process crashes.

I have two questions:

  1. How do I set up a process so that when webware dies it auto-restarts? This is a temporary fix needed for this and other problems which may also have the same symptoms.

  2. How do I better debug this bugger? Sorry, I've been working on application logic for some time, so I don't have the nitty-gritty details of Webware debugging in my pocket.

Also, I just put the recent CVS snapshot into "production"; the same problem occurs.

Thanks!

Clark |
From: Geoff T. <gta...@na...> - 2001-10-22 20:20:02
|
I have some questions...

At 03:02 PM 10/22/01 -0400, Clark C . Evans wrote:
>About two months ago, I was having problems where someone
>started to upload a file and canceled halfway through.
>This causes the webware server to abort and stop processing
>clients. I have a similar problem...

What OS are you using, and which adapter are you using? Does it make a difference if you switch to the CGI adapter?

Does file uploading still cause Webware to abort even with the latest release? Is it reliably reproducible, for example with the FileUpload example servlet?

Does the WebKit process actually exit? Or is it still running but unable to handle requests? If the latter, does its CPU usage shoot up, or is it idle?

>Suppose I have two servlets A and B. Servlet A executes a
>long-running query against PostgreSQL (1-2 seconds) and returns
>quite a bit of information. Servlet B is much smaller.
>Also suppose that I have a pipe with some pretty decent
>latency. Now. Suppose a user using IE Explorer or Netscape
>on either Windows or Macintosh hits servlet B. Immediately after
>clicking on a link for this servlet (as the page starts to return)
>they click on servlet A. Bang. No logs, nothing. The
>back-end webware process crashes.

I seem to remember someone else having problems with Webware crashing due to a bug in the PostgreSQL module. They upgraded to a newer version of the PostgreSQL Python module and the crashes went away. You might want to look in the mail list archives for more information about that.

>I have two questions:
>
> 1. How do I set up a process so that when webware dies
>    it auto-restarts? This is a temporary fix needed
>    for this and other problems which may also have the
>    same symptoms.

On Unix, there's Monitor.py. I haven't ever used it though.

> 2. How do I better debug this bugger? Sorry, I've been
>    working on application logic for some time, so I don't
>    have the nitty-gritty details of Webware debugging
>    in my pocket.

With this type of problem, I usually try to put in lots of logging so I can figure out where it crashed. You can log the thread ID by printing out threading.currentThread().getName(), which can be helpful if you're trying to debug a problem that only wedges/crashes one thread.

>Also. I just put into "production" the recent CVS snapshot,
>the same problem occurs.
>
>Thanks!

--

- Geoff Talvola
gtalvola@NameConnector.com |
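A minimal sketch of the logging approach Geoff describes, written for current Python rather than the Python 2.1 used in this thread. The helper names `make_thread_logger` and `handle_request` are hypothetical and are not part of Webware; the only point is that each log line carries the name of the thread that emitted it.

```python
import logging
import threading


def make_thread_logger(stream, name="request-debug"):
    """Return a logger whose lines are prefixed with the emitting thread's name."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(threadName)s] %(message)s"))
    logger.handlers = [handler]
    return logger


def handle_request(logger, request_id):
    # Hypothetical stand-in for a servlet: log before and after each risky step,
    # so the last line written shows how far the thread got before it died.
    logger.debug("request %s: start", request_id)
    logger.debug("request %s: done", request_id)
```

Running `handle_request` in a thread named `worker-7` would produce lines starting with `[worker-7]`. If the last line for a thread is a "start" with no matching "done", that thread is the one that wedged or crashed.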
From: Jeff J. <je...@bo...> - 2001-10-22 20:37:00
|
I thought I sent in a bug email to the list last week but I can't find it now. Anyway, I think it _might_ explain this problem.

In ThreadedAppServer there is a generic exception handler around handleRequest that caught this problem when it happened, but I don't know if it left the threads or sockets in a bad state. The lines that call conn.recv, lines 351 and 361 and maybe others, don't have exception handlers, and I saw them raise an error when the browser terminated the connection. The generic handler around handleRequest printed the error, but conn.close() wasn't called as is normally the case.

So, exception handling around the socket calls might solve the problem.

-Jeff

> -----Original Message-----
> From: web...@li...
> [mailto:web...@li...]On Behalf Of Geoff
> Talvola
> Sent: Monday, October 22, 2001 4:24 PM
> To: Clark C . Evans; web...@li...
> Subject: Re: [Webware-discuss] Webware death...
>
>
> At 04:07 PM 10/22/01 -0400, Clark C . Evans wrote:
>
> >A tid bit of additional information. On Windows, with
> >PythonWin 2.1 I get a "StreamOut" error in the log ...
> >but the AppServer doesn't crash.
>
> On Windows, do you get a full traceback? If so, please
> forward it to the list.
>
> To my ears it sounds like an exception handler is needed
> around some socket
> code to handle broken connections.
>
>
> --
>
> - Geoff Talvola
> gtalvola@NameConnector.com
>
> _______________________________________________
> Webware-discuss mailing list
> Web...@li...
> https://lists.sourceforge.net/lists/listinfo/webware-discuss
> |
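The idea Jeff raises (guard the `recv` calls and make sure the connection is always closed) might look roughly like the sketch below. It is written for current Python and is not the ThreadedAppServer code; `read_request` is a hypothetical helper, and only the standard `socket` calls are real.

```python
import socket


def read_request(conn, bufsize=8192):
    """Read until the peer closes; return the bytes, or None if the connection broke.

    The socket is closed in every case, including when recv() raises.
    """
    chunks = []
    try:
        while True:
            chunk = conn.recv(bufsize)
            if not chunk:  # empty result: the peer shut the connection down cleanly
                break
            chunks.append(chunk)
        return b"".join(chunks)
    except OSError as exc:  # e.g. the browser or adapter dropped the connection
        print("recv failed, dropping request:", exc)
        return None
    finally:
        conn.close()  # runs on both the normal and the error path
```

The `finally` clause is the part that addresses the observation above: even when the generic handler would otherwise swallow the error, the socket does not stay open.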
From: Clark C . E. <cc...@cl...> - 2001-10-22 22:32:15
|
Thanks to Geoff, the problem is narrowed to a crash at the self._socket.send() call in the modified ThreadedAppServer.py:401 code below...

    print "start loop"
    while sent < reslen:
        try:
            print "<%d>" % sent
            val = self._buffer[sent:sent+8192]
            print "."
            cnt = self._socket.send(val)
            print ","
            sent = sent + cnt
        except socket.error, e:
            if e[0]==errno.EPIPE: #broken
                pass
            else:
                print "StreamOut Error: ", e
            self._socket.close()
            self._socket = None
            break
    print "end loop"

And the result I get... (before it dumps me to the prompt...)

    end loop
    start loop
    end loop
    start loop
    <0>
    .
    ,
    <8192>
    .
    ,
    <16384>
    .

So. It reads in what to write... but when it goes to write to the socket with self._socket.send(val), it crashes.

Thoughts as to how to proceed?

Clark

On Mon, Oct 22, 2001 at 05:19:42PM -0400, Geoff Talvola wrote:
| At 05:23 PM 10/22/01 -0400, you wrote:
| >You can duplicate the error by...
| >
| > 1. Having two servlets, A B
| > 2. Servlet A takes a while to return all of its data.
| > 3. User clicks on A, and before A finishes clicks on B.
| > 4. Latency (small pipe) helps to illustrate the problem
| >    if I duplicate what my user encounters locally...
| >    I don't get the error.
|
| Ah, you're right. I can reliably reproduce the problem (even locally on
| the same machine as the web server and app server). This is on WinNT
| 4.0. It doesn't crash the server, I just get the message "StreamOut
| Error: (10053, 'Software caused connection abort')".
|
| I can't understand why this would crash on Linux. The area to look is in
| ThreadedAppServer.py around line 407 -- that's where Windows is printing
| the error message. Try adding some extra exception handlers and debugging
| print statements to see what's really happening.
|
| I'll try to reproduce this at home on my Linux box, but since I only have a
| single machine, not a network, I may not be able to provoke the problem at
| home at all.
|
|
| --
|
| - Geoff Talvola
| gtalvola@NameConnector.com |
From: Geoff T. <gta...@na...> - 2001-10-22 22:49:45
|
To me, this is looking like an operating system bug or a bug in the Python socket libraries. Perhaps it's time to escalate to comp.lang.python and see if anyone has seen a crash like this in other contexts?

At 06:42 PM 10/22/01 -0400, you wrote:
>Thanks to Geoff, the problem is narrowed to a crash at
>self._socket.send() call in the modified ThreadedAppServer.py:401
>code below...
>
>    print "start loop"
>    while sent < reslen:
>        try:
>            print "<%d>" % sent
>            val = self._buffer[sent:sent+8192]
>            print "."
>            cnt = self._socket.send(val)
>            print ","
>            sent = sent + cnt
>        except socket.error, e:
>            if e[0]==errno.EPIPE: #broken
>                pass
>            else:
>                print "StreamOut Error: ", e
>            self._socket.close()
>            self._socket = None
>            break
>    print "end loop"
>
>And the result I get... (before it dumps me to the prompt...)
>
>    end loop
>    start loop
>    end loop
>    start loop
>    <0>
>    .
>    ,
>    <8192>
>    .
>    ,
>    <16384>
>    .
>
>So. It reads in what to write... but when it goes to
>write to the socket self._socket.send(val) it crashes.
>
>Thoughts as to how to proceed?
>
>Clark

--

- Geoff Talvola
gtalvola@NameConnector.com |
From: Clark C . E. <cc...@cl...> - 2001-10-22 22:56:55
|
Yes indeed. I was wondering if WebKit.cgi or mod_webkit is properly shutting down the socket when the browser cancels a request? If this isn't the case... then it's definitely a lower-level issue. Perhaps I'll go back one version of Python and see if it emerges there...

On Mon, Oct 22, 2001 at 06:48:36PM -0400, Geoff Talvola wrote:
| To me, this is looking like an operating system bug or a bug in the Python
| socket libraries. Perhaps it's time to escalate to comp.lang.python and
| see if anyone has seen a crash like this in other contexts?

Clark |
From: Geoffrey T. <gta...@me...> - 2001-10-23 13:41:31
|
Clark,

I am unable to reproduce the "Webware death" problem at home on my Linux Mandrake 8.1 box. Whenever I cancel a long, large servlet prematurely, the exception handler gets triggered due to the broken pipe, as you would expect. But nothing crashes.

The StreamOut Error: (10053, 'Software caused connection abort') message you see on Windows NT is also normal. In fact, the exception handler should be modified so that it simply ignores that message -- that's just what you get on Windows instead of EPIPE when the connection is canceled.

So I'm now 98% convinced that this is an OS bug, not a Webware bug or Python bug (I'm also using Python 2.1.1). I would suggest trying a newer version of Linux.

- Geoff |
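The handler change Geoff suggests (treat a canceled connection as routine on every platform) could be sketched as follows in current Python. This is not the actual ThreadedAppServer code; `send_all` is a hypothetical helper. It relies on the built-in `ConnectionError` family, which covers `BrokenPipeError` (EPIPE), `ConnectionResetError` and `ConnectionAbortedError` (the error reported as 10053 on Windows).

```python
import socket


def send_all(sock, data, chunk_size=8192):
    """Send data in chunks and return the number of bytes actually sent.

    A client that went away mid-response is treated as a normal event.
    """
    sent = 0
    try:
        while sent < len(data):
            sent += sock.send(data[sent:sent + chunk_size])
    except ConnectionError:
        # Broken pipe, reset, or abort: the browser canceled, so stop quietly.
        pass
    except OSError as exc:
        # Anything else is unexpected and still worth reporting.
        print("StreamOut Error:", exc)
    return sent
```

The caller can compare the return value with `len(data)` to tell whether the whole response went out, without needing to know which platform-specific error code ended the transfer.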
From: Clark C . E. <cc...@cl...> - 2001-10-23 15:54:54
|
On Tue, Oct 23, 2001 at 09:43:31AM -0400, Geoffrey Talvola wrote:
| So I'm now 98% convinced that this is an OS bug, not a Webware bug
| or Python bug (I'm also using Python 2.1.1). I would suggest
| trying a newer version of Linux

I just upgraded to kernel 2.2.19 and glibc to 2.1.3, the most recent security patches for Debian's "stable" branch, Potato. I still have the same problem. I realize that the recent stable kernel is 2.4.12 and that stable glibc also has a much higher version number; however, this involves moving to Debian's woody, which is in "testing".

Geoff and others,

  Could you tell me what Linux kernel version and glibc version you have?

  Perhaps I should move to Open/Free BSD? Any luck with these operating systems?

Best,

Clark |
From: Chuck E. <Chu...@ya...> - 2001-10-23 21:46:51
|
At 12:05 PM 10/23/2001 -0400, Clark C . Evans wrote:
>On Tue, Oct 23, 2001 at 09:43:31AM -0400, Geoffrey Talvola wrote:
>| So I'm now 98% convinced that this is an OS bug, not a Webware bug
>| or Python bug (I'm also using Python 2.1.1). I would suggest
>| trying a newer version of Linux
>
>I just upgraded to kernel 2.2.19 and glibc to 2.1.3,
>the most recent security patches for Debian's "stable"
>branch, Potato. I still have the same problem. I
>realize that the recent stable kernel is 2.4.12 and
>glibc stable is also much higher version number,
>however, this involves moving to Debian's woody,
>which is in "testing".
>
>Geoff and others,
>
>  Could you tell me what Linux kernel version and
>  glibc version you have?
>
>  Perhaps I should move to Open/Free BSD? Any luck
>  with these operating systems?
>
>Best,
>
>Clark

Clark,

I think your next best bet is to try an op sys that is known to work for some of us. Geoff can't reproduce your problem on Mandrake 8.1 and I can't reproduce Jeff's problem on Mandrake 8.1. If you could also try this same op sys and version, that would be great. If it breaks, then we can start looking more at your site. If it doesn't break, we'll know it's an op sys related problem. That's the only way I can think of to nail this down.

http://www.linux-mandrake.com/en/

You can purchase the CDs, or download and burn the first 2 ISO CD-ROM images and install that (which is what I did because I have already purchased 2 copies in the past).

I hope you can try this and let us know how it goes.

-Chuck |
From: Chuck E. <Chu...@ya...> - 2001-10-23 22:06:48
|
At 06:00 PM 10/23/2001 -0400, Clark C . Evans wrote: >Chuck, > >Thanks. I'm setting up an OpenBSD box and >will try to get everything running on it. > >Best, Ack! I've had reports of serious threading problems from a Webware user on OpenBSD. It was reported by Aleksandar Kacanski on 7/11/01 who solved his OpenBSD 2.8/2.9 woes by switching to RedHat Linux. I have CCed the discussion list for their benefit. Whatever you end up trying, let us know how it goes. -Chuck |
From: Clark C . E. <cc...@cl...> - 2001-10-26 05:44:01
|
Just to note my status. I've upgraded to the newest kernel, 2.4.13, and I'm still getting problems. The only thing that I have that is "old" on my Linux box is glibc. So, tomorrow I'll try to upgrade it, and hopefully that's where the problem is.

Best,

Clark

On Tue, Oct 23, 2001 at 12:05:00PM -0400, Clark C . Evans wrote:
| On Tue, Oct 23, 2001 at 09:43:31AM -0400, Geoffrey Talvola wrote:
| | So I'm now 98% convinced that this is an OS bug, not a Webware bug
| | or Python bug (I'm also using Python 2.1.1). I would suggest
| | trying a newer version of Linux
|
| I just upgraded to kernel 2.2.19 and glibc to 2.1.3,
| the most recent security patches for Debian's "stable"
| branch, Potato. I still have the same problem. I
| realize that the recent stable kernel is 2.4.12 and
| glibc stable is also much higher version number,
| however, this involves moving to Debian's woody,
| which is in "testing".
|
| Geoff and others,
|
|   Could you tell me what Linux kernel version and
|   glibc version you have?
|
|   Perhaps I should move to Open/Free BSD? Any luck
|   with these operating systems?
|
| Best,
|
| Clark |
From: Chuck E. <Chu...@ya...> - 2001-10-26 06:43:29
|
At 01:54 AM 10/26/2001 -0400, Clark C . Evans wrote:
>Just to note my status. I've upgraded to the newest
>kernel 2.4.13, and I'm still getting problems.
>The only thing that I have that is "old" on my
>linux box is glibc. So, tomorrow I'll try to
>upgrade it and hopefully that's where the problem is.

And you still have the option of trying Mandrake 8.1, which worked for Geoff. Then you could see which package versions differ.

-Chuck |
From: Jeff J. <je...@bo...> - 2001-10-23 13:49:09
|
Clark C . Evans wrote:
> Yes indeed. I was wondering if WebKit.cgi or mod_webkit is
> properly shutting down the socket when the browser cancels
> a request? If this isn't the case... then it's definitely
> a lower-level issue. Perhaps I'll go back one version
> of Python and see if it emerges there...

WebKit does not properly shut down connections when the browser cancels a request. I've only seen the error twice, both times within a split second of each other, so it may be hard to reproduce. They came from an unhandled exception in ThreadedAppServer.py raised by a call to conn.recv(). See my note from yesterday for more detail.

Jeff |
From: Jeff J. <je...@bo...> - 2001-10-23 16:30:52
|
> Perhaps I should move to Open/Free BSD? Any luck > with these operating systems? I'm running FreeBSD 4.3. The biggest problem so far has been that python raises an exception when trying to print to stdout/stderr if the console that started the program is closed. Chuck tested on Mandrake Linux and it did *not* raise an exception. I think he's adding a config option "discardDaemonOutput" which would prevent webkit from crashing in daemon mode on FreeBSD. Regards, Jeff |
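One way to avoid the failure Jeff describes is to point the Python-level output streams at the null device before detaching from the console. The sketch below is an assumption about the general technique in current Python, not the implementation of the "discardDaemonOutput" option mentioned above; `discard_output` is a hypothetical helper.

```python
import os
import sys


def discard_output():
    """Send print() output to the null device; return the previous streams.

    Only sys.stdout and sys.stderr are rebound. The underlying file
    descriptors 1 and 2 are left untouched in this sketch.
    """
    previous = (sys.stdout, sys.stderr)
    devnull = open(os.devnull, "w")
    sys.stdout = devnull
    sys.stderr = devnull
    return previous
```

After calling `discard_output()`, a `print` in servlet or server code writes to the null device instead of the closed console. The returned pair allows the original streams to be restored, for example when running interactively again.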
From: Clark C . E. <cc...@cl...> - 2001-10-22 19:57:11
|
On Mon, Oct 22, 2001 at 03:02:25PM -0400, Clark C . Evans wrote:
| Suppose I have two servlets A and B. Servlet A executes a
| long-running query against PostgreSQL (1-2 seconds) and returns
| quite a bit of information. Servlet B is much smaller.
| Also suppose that I have a pipe with some pretty decent
| latency. Now. Suppose a user using IE Explorer or Netscape
| on either Windows or Macintosh hits servlet B. Immediately after
| clicking on a link for this servlet (as the page starts to return)
| they click on servlet A. Bang. No logs, nothing. The
| back-end webware process crashes.

A tidbit of additional information. On Windows, with PythonWin 2.1, I get a "StreamOut" error in the log ... but the AppServer doesn't crash.

With Debian Potato Linux (stock install) using Python 2.1.1, Apache/1.3.19 mod_webkit/0.5 mod_ssl/2.4.10, the app server crashes without an error. This seems independent of the browser or the adapter used (OneShot.cgi, mod_webkit, WebKit.cgi).

Any thoughts?

Clark |
From: Geoff T. <gta...@na...> - 2001-10-22 20:24:43
|
At 04:07 PM 10/22/01 -0400, Clark C . Evans wrote:
>A tidbit of additional information. On Windows, with
>PythonWin 2.1 I get a "StreamOut" error in the log ...
>but the AppServer doesn't crash.

On Windows, do you get a full traceback? If so, please forward it to the list.

To my ears it sounds like an exception handler is needed around some socket code to handle broken connections.

--

- Geoff Talvola
gtalvola@NameConnector.com |
From: Clark C . E. <cc...@cl...> - 2001-10-22 20:44:15
|
Geoff, thank you for your help.

| What OS are you using?

Debian Potato (Linux 2.2.19)

| and which adapter are you using? Does it make a
| difference if you switch to the CGI adapter?

mod_webkit; the problem is independent of adapter.

| Does file uploading still cause Webware to abort even with the latest
| release? Is it reliably reproducible, for example with the FileUpload
| example servlet?

This was fixed in the latest release, thanks!

| Does the WebKit process actually exit?

Yes. If I have WebKit running in a window, it returns me to the command prompt.

| >Suppose I have two servlets A and B. Servlet A executes a
| >long-running query against PostgreSQL (1-2 seconds) and returns
| >quite a bit of information. Servlet B is much smaller.
| >Also suppose that I have a pipe with some pretty decent
| >latency. Now. Suppose a user using IE Explorer or Netscape
| >on either Windows or Macintosh hits servlet B. Immediately after
| >clicking on a link for this servlet (as the page starts to return)
| >they click on servlet A. Bang. No logs, nothing. The
| >back-end webware process crashes.
|
| I seem to remember someone else having problems with Webware crashing due
| to a bug in the PostgreSQL module. They upgraded to a newer version of the
| PostgreSQL Python module and the crashes went away. You might want to look
| in the mail list archives for more information about that.

Hmm. Perhaps this may be the problem, if the connection stops before PostgreSQL stops returning data. I'll look into this in more detail.

| On Unix, there's Monitor.py. I haven't ever used it though.

I'll see if it helps!

| On Windows, do you get a full traceback?

No traceback...

    Listening on ('127.0.0.1', 8086)
    Creating 10 threads..........
    Ready
    StreamOut Error: (10053, 'Software caused connection abort')
    StreamOut Error: (10053, 'Software caused connection abort')

Thank you... this bug hurts badly, since a few of my alpha users keep bringing the server down.

Best,

Clark |
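As a stopgap for the auto-restart question, a small supervisor loop is one possibility. The sketch below is written for current Python; it is not Monitor.py and does not reflect its interface. `supervise` is a hypothetical helper, and the command passed to it is whatever starts the app server on a given installation.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=1.0):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Returns the list of exit codes seen. Stops after a clean exit (status 0)
    or once max_restarts restarts have been used up.
    """
    codes = []
    while True:
        code = subprocess.call(cmd)  # blocks until the child process exits
        codes.append(code)
        if code == 0 or len(codes) > max_restarts:
            return codes
        print("process exited with status %s; restarting" % code)
        time.sleep(delay)  # brief pause so a crash loop does not spin the CPU
```

A child killed by a signal shows up as a negative status on Unix, so it is restarted as well. The cap on restarts is a deliberate choice: a server that dies immediately on every start needs attention rather than an endless loop.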