From: Richard B. <ri...@bu...> - 2008-05-17 19:08:25
|
I have been working on a problem that may be yaws, erlang or (horrors) my application. While doing the research I have made some discoveries. But first, a description of the problem:

PROBLEM: While my Yaws application is running and processing transactions, from time to time it stops accepting incoming socket connections. The client program receives "connection refused" responses to its connection requests. I have inspected the client and the server systems and confirmed that the server is running, however, not responding. I have attempted telnet connections from the localhost and from a remote client, with the same results. When I looked into the report.log file I found:

=ERROR REPORT==== 14-May-2008::02:16:56 ===
SSL accept failed: normal

... snip ...

=ERROR REPORT==== 14-May-2008::02:16:56 ===
** Generic server <0.894.2> terminating
** Last message in was {transport_accept,<0.888.2>,
                        {sslsocket,6,<0.77.0>},
                        10000}
** When Server state == {st,acceptor,<0.893.2>,<0.888.2>,<0.888.2>,nil,true,
                         [],nil,nil,nil,nil,false,false}
** Reason for termination ==
** {noproc,{gen_server,call,
            [<0.77.0>,
             {getopts,<0.894.2>,
              [nodelay,active,packet,mode,header,ip,backlog]},
             10000]}}

=ERROR REPORT==== 14-May-2008::02:16:56 ===
** Generic server <0.895.2> terminating
** Last message in was {transport_accept,<0.889.2>,
                        {sslsocket,5,<0.72.0>},
                        10000}
** When Server state == {st,acceptor,<0.893.2>,<0.889.2>,<0.889.2>,nil,true,
                         [],nil,nil,nil,nil,false,false}
** Reason for termination ==
** {noproc,{gen_server,call,
            [<0.72.0>,
             {getopts,<0.895.2>,
              [nodelay,active,packet,mode,header,ip,backlog]},
             10000]}}

HARDWARE & OS: The application is installed on two complete OpenBSD 4.2 systems, each with its own primary IP address and a common CARP IP. The CPU is a 1 GHz VIA with 1 GB of RAM and 80 GB of disk.

APPLICATION: The application is deployed in SSL mode only, with one application running on port 444 and one application on port 443.
Only the application on 443 runs on both the exclusive IP and the shared CARP IP.

THE TRANSACTION: A REST-like transaction with GET/POST parameters; it performs some processing and then returns a plaintext body to the caller (over https). The transactions themselves are all about the same in time and complexity. In fact, the error condition occurs while nagios is polling the application. (The application is not in production yet, and nagios runs three test transactions about every 10 to 30 seconds: one transaction to each of the private IP addresses and one to the CARP IP.) (My transaction makes use of Mnesia and replication too.) (Getting CARP to identify this error and take corrective action is on the agenda... but for now I want to find the root cause of the issue.)

STEPS I HAVE TAKEN: Klacke has recommended that I run c:i() and inet:i(). I do not understand the output; however, if there is interest from the list I will post it. I have performed several code reviews of my code and of Yaws. I found a non-tail-recursion problem in my code that I have since repaired. I'm hoping that it was the only one; however, the tail-recursion optimization is not fully documented or clear to me, so I cannot be sure that I have found them all. I have recommended that someone in the Erlang family might implement an erl-lint to find these things.

CONTINUED: There was mention that the Yaws server periodically goes off the air for no particular reason. lsof (fstat on OpenBSD) seems to indicate that the system is running out of file handles. That was not the case in my experience. The real issue is trying to capture the event that is causing the symptoms.

I think I have mentioned everything... except that the DEV tree of the Yaws code has some new features that I hope to try. They include escalation of some exceptions to force the "heart" process to restart Yaws, and a tool for dumping the state of Yaws that might provide the key to what's going wrong (I hope).
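[Editor's note: the non-tail-recursion problem mentioned above typically has the following shape. This is an illustrative sketch, not the actual application code:]

```erlang
-module(sum_demo).
-export([sum_body/1, sum_tail/1]).

%% Body-recursive: after the recursive call returns there is still
%% work to do ("H + ..."), so every element costs a stack frame.
%% A long-running loop written this way grows memory without bound.
sum_body([H|T]) -> H + sum_body(T);
sum_body([])    -> 0.

%% Tail-recursive: the recursive call is in tail position, so the
%% runtime reuses the current frame (last-call optimization) and the
%% loop runs in constant stack space.
sum_tail(L) -> sum_tail(L, 0).

sum_tail([H|T], Acc) -> sum_tail(T, H + Acc);
sum_tail([], Acc)    -> Acc.
```

In the shell, sum_demo:sum_tail(lists:seq(1, 1000000)) completes in constant stack, whereas a body-recursive loop held open by a process that never returns (an acceptor loop, for example) is the kind of leak a code review has to catch by eye.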
If anyone has any advice, I'm open to your suggestions. I have a test environment that can run millions of transactions in a day in order to exercise any hypothesis. Thanks to everyone who participated on this subject privately before I wrote this email.

/r |
From: Claes W. <kl...@ta...> - 2008-05-17 19:53:03
|
Richard Bucker wrote:
>
> STEPS I HAVE TAKEN: Klacke has recommended that I run c:i() and
> inet:i(). I do not understand the output,

Richard and I have been communicating privately and I think that maybe I was barking up the wrong tree. I thought that maybe he had problems similar to what yaws.hyber.org has had, i.e. an fd leak somewhere - not necessarily in yaws - maybe outside - still unknown.

To try to remedy this problem I've (in trunk) introduced a really nice debugging facility:

# yaws --debug-dump

That produces a lot of good info on stdout. For sites that appear to die out of the blue, it is typically a good idea to run this debug-dump from a cron script regularly.

> There was mention that the YAWS server periodically goes off the air
> for no particular reason. LSOF (fstat on OpenBSD) seems to indicate
> that the system is running out of file handles. This was not the case
> in my experience. The real issue is trying to capture the event that
> is causing the symptoms.
>
> I think I have mentioned everything... except that the DEV tree of the
> YAWS code has some new features that I hope to try. They include
> escalation of some exceptions to force the "heart" to restart YAWS.

Actually - I saw this the other day: yaws_sup.erl had a really odd restart strategy. I changed it radically from

-    {ok,{{one_for_all,10,30}, [YawsLog, YawsRSS, YawsServ, Sess,

to

+    {ok,{{one_for_all, 0, 1}, [YawsLog, YawsRSS, YawsServ, Sess,

I think that if yaws_server dies, we have the following choices:

1. embedded mode - some other supervisor restarts yaws
2. normal mode - die and let heart restart

Big change - few characters!!

As for Richard's problems - try to run the debug-dump from a cron script, and in particular run debug-dump once the system is dead.

/klacke |
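[Editor's note: Klacke's change amounts to tightening the restart intensity in yaws_sup's init/1. A minimal, self-contained sketch of the new strategy follows - only the {one_for_all, 0, 1} tuple comes from the quoted diff; the demo_worker child is hypothetical, standing in for the real yaws children:]

```erlang
-module(restart_demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% {one_for_all, MaxRestarts = 0, MaxSeconds = 1}: if any child dies,
%% all children are terminated, and because zero restarts are
%% tolerated the supervisor itself gives up and exits. In yaws'
%% normal mode that takes the node down, letting the external heart
%% program restart yaws from a clean state instead of limping along.
init([]) ->
    Child = {demo_worker,                        %% hypothetical child
             {demo_worker, start_link, []},
             permanent, 5000, worker, [demo_worker]},
    {ok, {{one_for_all, 0, 1}, [Child]}}.
```

The old {one_for_all, 10, 30} tolerated ten restarts in thirty seconds before escalating, which could leave a half-broken server running for quite a while.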
From: Michal S. <mi...@er...> - 2008-08-12 09:56:00
|
Hi,

2008/5/17 Richard Bucker <ri...@bu...>:
> =ERROR REPORT==== 14-May-2008::02:16:56 ===
> SSL accept failed: normal
>
> ... snip ...

I am now getting the same crash with yaws-1.77 and OTP R12B-3. The last time I saw the crash I didn't have enough time to investigate the state of the node, but I think the yaws server was still running. If it had died, my restart strategy would have brought the whole node down and then the heart script would have restarted it. It seemed to me that it was just the ssl gen_server which had stopped accepting new connections and then died. Maybe it had run out of file descriptors. I hope to investigate further the next time the crash occurs, but I would like to know if anyone has an update on this one?

Thanks,
Michal
--
Michal Slaski
http://www.erlang-consulting.com |
From: Oscar H. <os...@er...> - 2008-08-12 10:33:41
|
Hi,

Could this be connected to these issues?

http://www.erlang.org/pipermail/erlang-questions/2006-September/022755.html
http://www.erlang.org/pipermail/erlang-bugs/2008-June/000829.html
http://www.erlang.org/pipermail/erlang-patches/2008-July/000259.html

Michal Slaski wrote:
> I am now getting the same crash with yaws-1.77 and otp R12B-3. Last
> time I have seen the crash, I didn't have enough time to investigate
> the state of node, but I think the yaws server was still running.
> [...]

Best regards
--
Oscar Hellström, os...@er...
Office: +44 20 7655 0337
Mobile: +44 798 45 44 773
Erlang Training and Consulting
http://www.erlang-consulting.com/ |
From: Claes W. <kl...@ta...> - 2008-08-12 11:11:24
|
Michal Slaski wrote:
> I am now getting the same crash with yaws-1.77 and otp R12B-3.
> [...] Seemed to me that it was just
> the ssl gen_server which has stopped to accept new connections and
> then have died. Maybe it has run out of file descriptors.

I believe this is a bug in the ssl app - however, it is also a bug in yaws. Yaws should (probably??) supervise the ssl application and die (for restart) if ssl dies. This is, however, slightly problematic, since the way ssl is started today is ad hoc, by simply calling ssl:start() if necessary. The right thing would probably be to add ssl as a child in the supervision tree of yaws. This doesn't address the issue of the ssl bug, though. However, yaws should restart if ssl dies, and it doesn't today - this is a bug.

/klacke |
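[Editor's note: one way to wire up "yaws dies if ssl dies" is a small watchdog child under yaws_sup. The sketch below is hypothetical code, not from the yaws tree, and the registered name ssl_sup for the ssl application's top supervisor is an assumption that should be checked against the OTP release in use:]

```erlang
-module(ssl_watchdog).
-export([start_link/0, watch/1]).

%% Intended to be supervised by yaws_sup under its one_for_all
%% strategy: when this process exits abnormally, the supervisor
%% tears everything down and (in normal mode) heart restarts yaws.
start_link() ->
    ssl:start(),    %% ad hoc start, as yaws does today
    {ok, spawn_link(fun() -> watch(ssl_sup) end)}.

%% Block until the named process dies, then exit abnormally so the
%% supervisor above us reacts.
watch(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined ->
            exit({watched_process_not_running, Name});
        Pid ->
            Ref = erlang:monitor(process, Pid),
            receive
                {'DOWN', Ref, process, Pid, Reason} ->
                    exit({watched_process_died, Name, Reason})
            end
    end.
```

A cleaner alternative would be to declare ssl in the yaws .app file's applications list and start it with application:start_type semantics, but that changes how yaws is packaged.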
From: Michal S. <mi...@er...> - 2008-08-12 16:51:30
|
> On 12 Aug 2008, at 12:39, Oscar Hellström wrote:
>> http://www.erlang.org/pipermail/erlang-patches/2008-July/000259.html

Thanks for the patch; I have applied it and it works great.

Michal
--
Michal Slaski
http://www.erlang-consulting.com |
From: Michal S. <mi...@er...> - 2008-08-13 07:15:25
|
On 12 Aug 2008, at 13:11, Claes Wikström wrote:
> Yaws should (probably ??) supervise the ssl application
> and die (for restart) if ssl dies.
>
> [...]

The crash we were getting was causing the ssl_broker gen_server to die. This gen_server is supervised by the ssl_broker_sup supervisor, and its restart argument is set to temporary, so when ssl_broker dies it is not restarted. I believe that in this particular case, even if Yaws supervised the ssl supervision tree, it wouldn't help. I agree that in general Yaws could somehow supervise the ssl application, or they both could be supervised by a top supervisor.

Michal
--
Michal Slaski
http://www.erlang-consulting.com |
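[Editor's note: Michal's point hinges on the third field of an OTP child specification. A sketch of the spec shape he describes - the ssl_broker/ssl_broker_sup names are as he reports for the R12 ssl app, while the MFA arguments and Shutdown value here are illustrative:]

```erlang
%% OTP child spec:
%%   {Id, {Module, Function, Args}, Restart, Shutdown, Type, Modules}
%%
%% Restart = permanent -> always restarted after it dies
%%         | transient -> restarted only after an abnormal exit
%%         | temporary -> never restarted (Michal's case: a crashed
%%                        ssl_broker therefore stays dead, no matter
%%                        who supervises the tree above it)
{ssl_broker,
 {ssl_broker, start_link, []},   %% MFA is illustrative
 temporary,
 brutal_kill,
 worker,
 [ssl_broker]}
```

This is why supervising the ssl application from yaws would not help here: the supervision tree stays healthy while the temporary broker quietly disappears.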