From: Richard B. <ri...@bu...> - 2008-05-17 19:08:25
I have been working on a problem that may be in YAWS, Erlang, or (horrors) my application. While doing the research I have made some discoveries. But first, a description of the problem.

PROBLEM: While my YAWS application is running and processing transactions, from time to time it stops accepting incoming socket connections. The client program receives "connection refused" responses to its connection requests. I have inspected the client and server systems and confirmed that the server is running but not responding. I have attempted telnet connections from localhost and from a remote client, with the same results. When I looked into the report.log file I found:

    =ERROR REPORT==== 14-May-2008::02:16:56 ===
    SSL accept failed: normal
    ... snip ...
    =ERROR REPORT==== 14-May-2008::02:16:56 ===
    ** Generic server <0.894.2> terminating
    ** Last message in was {transport_accept,<0.888.2>,
                               {sslsocket,6,<0.77.0>},
                               10000}
    ** When Server state == {st,acceptor,<0.893.2>,<0.888.2>,<0.888.2>,nil,true,
                                [],nil,nil,nil,nil,false,false}
    ** Reason for termination ==
    ** {noproc,{gen_server,call,
                   [<0.77.0>,
                    {getopts,<0.894.2>,
                        [nodelay,active,packet,mode,header,ip,
                         backlog]},
                    10000]}}
    =ERROR REPORT==== 14-May-2008::02:16:56 ===
    ** Generic server <0.895.2> terminating
    ** Last message in was {transport_accept,<0.889.2>,
                               {sslsocket,5,<0.72.0>},
                               10000}
    ** When Server state == {st,acceptor,<0.893.2>,<0.889.2>,<0.889.2>,nil,true,
                                [],nil,nil,nil,nil,false,false}
    ** Reason for termination ==
    ** {noproc,{gen_server,call,
                   [<0.72.0>,
                    {getopts,<0.895.2>,
                        [nodelay,active,packet,mode,header,ip,
                         backlog]},
                    10000]}}

HARDWARE & OS: The application is installed on two complete OpenBSD 4.2 systems, each with its own primary IP address, plus a common CARP IP. The CPU is a VIA 1GHz, with 1GB of RAM and 80GB of disk.

APPLICATION: The application is deployed in SSL mode only, with one application running on port 444 and one application on port 443. Only the application on 443 runs on both the exclusive IP and the shared CARP IP.

THE TRANSACTION: This is a REST-like transaction that takes GET/POST parameters, performs some processing, and then returns a plaintext body to the caller (over https). The transactions themselves are all about the same in time and complexity. In fact, the error condition occurs while nagios is polling the application. (The application is not in production yet, and nagios runs three test transactions about every 10 to 30 seconds: one transaction to each of the private IP addresses and one to the CARP IP.) (My transaction makes use of Mnesia and replication too.) (Getting CARP to identify this error and take corrective action is on the agenda, but for now I want to find the root cause of the issue.)

STEPS I HAVE TAKEN: Klacke has recommended that I run c:i() and inet:i(). I do not understand the output; however, if there is interest from the list I will post it. I have performed several code reviews of my code and of YAWS. I found a non-tail-recursion problem in my code that I have since repaired. I'm hoping that it was the only one; however, the tail-recursion optimizer is not fully documented or clear to me, so I cannot be sure that I have found them all. I have recommended that someone in the Erlang family might implement an erl-lint to find these things.

CONTINUED: There was mention that the YAWS server periodically goes off the air for no particular reason. lsof (fstat on OpenBSD) seems to indicate that the system is running out of file handles. This was not the case in my experience. The real issue is trying to capture the event that is causing the symptoms.

I think I have mentioned everything, except that the DEV tree of the YAWS code has some new features that I hope to try. They include escalation of some exceptions to force "heart" to restart YAWS, and a tool for dumping the state of YAWS that might provide the key to what's going wrong (I hope).

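For readers unfamiliar with the non-tail-recursion problem mentioned under STEPS I HAVE TAKEN, here is a minimal sketch of the general pattern. It is a hypothetical illustration, not the actual application code: the module name tail_sketch, the function names, and the ping/pong message shapes are all made up for this example.

```erlang
-module(tail_sketch).
-export([loop_leaky/1, loop_tail/1]).

%% Hypothetical example; names and messages are assumptions, not taken
%% from the application or from YAWS.

%% NOT tail-recursive: the result of the recursive call is still needed
%% to build the {looped, ...} tuple, so each handled message leaves a
%% frame on the stack for as long as the process keeps looping.
loop_leaky(Count) ->
    receive
        {ping, From} ->
            From ! {pong, Count + 1},
            {looped, loop_leaky(Count + 1)};
        stop ->
            Count
    end.

%% Tail-recursive: the recursive call is the last expression evaluated
%% in the clause, so the current frame can be reused and the stack does
%% not grow with the number of messages handled.
loop_tail(Count) ->
    receive
        {ping, From} ->
            From ! {pong, Count + 1},
            loop_tail(Count + 1);
        stop ->
            Count
    end.
```

One way to tell the two apart at runtime, assuming the module above has been compiled, is to spawn each loop, send it a burst of {ping, self()} messages, and compare erlang:process_info(Pid, stack_size) before and after. The leaky version should show a stack that keeps growing, while the tail-recursive one should stay roughly flat. A recursive call placed inside the protected section of a try ... catch is also not a tail call, which is an easy one to miss in a code review.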
If anyone has any advice, I'm open to your suggestions. I have a test environment that can run millions of transactions in a day in order to test any hypothesis.

Thanks to everyone who participated on this subject privately before I wrote this email.

/r