I have a socket-server that is failing sporadically.
It uses roughly the same architecture as Frans Witte's single-client model, i.e. one process listening with a queue length of 1, and a number of other processes waiting to listen on the same port.
This model does have a small window where no process is listening on the socket, but I've catered for that on the client side and it is not the cause of this problem.
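For reference, my understanding of that model in plain socket terms. This is only an illustrative Python sketch, not the actual GT.M code; each worker grabs the port when it is free, takes exactly one connection, then releases the port:

```python
import socket
import time

def single_client_server(port, handle):
    """One worker in the single-client model: bind when the port is
    free, accept exactly one connection, close the listener so the
    next worker can take over, then service the client."""
    while True:
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            srv.bind(("", port))
        except OSError:
            srv.close()
            time.sleep(0.05)       # another worker holds the port; wait
            continue
        srv.listen(1)              # queue length of 1, as in the model
        conn, _ = srv.accept()
        srv.close()                # window: nobody is listening now
        handle(conn)
        conn.close()
```

The window mentioned above is the gap between `srv.close()` in one worker and `srv.bind()` succeeding in the next.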
With this problem the client appears to get a connection and can write to the server through the socket. However, there is no evidence that a GT.M process ever got going at the server end: I've not managed to get the GT.M process to log anything, no errors are trapped, and there isn't even a core dump to indicate that a GT.M process ever existed. It's as if the client process is connecting to thin air. The problem only happens about 1 in 30 requests, and only when there is a reasonable rate of requests.
I've tried to eliminate the possibility of a problem with the client-side script (Perl, using the IO::Socket::INET module) by running it against a Caché-based socket server. This works fine and is reliable, so as far as I can tell there is no problem with the Perl module I'm using.
The GT.M version is 4.2-002 but I've also observed it on 4.3-000. Does anyone have any ideas?
Further investigation reveals that there is a GT.M process listening on the socket, but it does not appear to receive the connection from the client. It happily accepts and processes subsequent connections.
Unless there is some really weird bug or timing window, it has to be a problem with the client (the Perl IO::Socket::INET module). But then this client code is known to work with other socket servers...
Would you check the following:
If it is always the first request, check whether there is a race condition (i.e. the client starting before the server).
Does the Perl client connect request hang or fail?
Run netstat after the failure, and grep for the source port number to see if it is established.
Finally, if you can create a reproducible case, would you send it? Thanks.
1. I'm handling the situation where the client starts before the server is listening. I can't see any other kind of race condition that could occur.
2. The Perl client connects OK to something; I'm checking for an error and don't get one.
If it's actually connecting to a GT.M process then it's as if that process is throwing away the connection and then listening for another connection. I can account for all of the GT.M listening processes; none are lost or missing.
3. I modified my Perl script to sleep after the error condition occurs, so at this point it still has its end of the socket connection open. This is what I see:
[root@cat georgej]# netstat -ap|grep 6500
tcp 0 0 *:6500 *:* LISTEN 10015/mumps
tcp 0 703 127.0.0.1:3079 127.0.0.1:6500 CLOSE 9280/perl
[root@cat georgej]# ps -Af|grep mumps
georgej 664 630 2 01:10 pts/1 00:01:35 /home/gtm42/mumps -direct
georgej 10015 1 0 02:06 ? 00:00:00 /home/gtm42/mumps -direct
[root@cat georgej]# ps -Af|grep perl
georgej 9280 12787 0 02:05 ? 00:00:00 perl -w /home/georgej/xxx.cgi
Process 664 is a GT.M process that starts a new listener whenever a connection is closed.
Process 10015 is the current GT.M process listening for a new connection on port 6500. This seems to be in an OK state and will happily accept and process the next connection.
Process 9280 is a perl script that thinks it has had a connection to a GT.M process, has written some data to the socket and attempted to read data back, but got nothing. At this point it is in a sleep state just before it closes its end of the socket.
What's interesting (but I don't know if it's significant) is that the Send-Q always contains about 700-750 bytes at this point.
The test-rig I'm using at the moment can reliably reproduce the problem but involves too many components to easily send to you. If this information doesn't shed any light I'll work on making a simpler test-rig for you.
With help from Malli, Roger and the rest of the Sanchez team it is now clear that this problem is happening due to a kind of race condition in the listener process.
A small window exists between the time that a socket connection is established from a client and the time that the listener socket is closed. During this window another client can connect to the listener process; however, because the listener is about to close its socket, it never sees that the second connection has been made.
The window occurs between the point at which the first connection is actually made and the point at which the listener socket is closed. The following code illustrates this:
u device w /listen(1)
u device w /wait(60)
; The window is here, between the acceptance of
; a connection and the closing of the socket
Because the low-level socket service calls are atomic, it is not possible to eliminate this window even if a special command were implemented in GT.M.
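The same window can be demonstrated at the raw socket level. Here is an illustrative Python sketch (not GT.M code): the second client's TCP handshake completes in the kernel's backlog queue, so its connect succeeds, but closing the listener discards the queued connection and the client ends up connected to thin air:

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))      # pick any free port
port = srv.getsockname()[1]
srv.listen(1)                   # same queue length as the GT.M listener

# Client 1 connects and is accepted -- the normal case.
c1 = socket.create_connection(("127.0.0.1", port))
conn, _ = srv.accept()

# Client 2 connects during the window: the kernel completes the TCP
# handshake into the backlog queue, so connect() succeeds even though
# no accept() will ever follow.
c2 = socket.create_connection(("127.0.0.1", port))
c2.sendall(b"request")          # the write also "succeeds" into the buffers

# The listener closes its socket; the queued connection is thrown away.
srv.close()

# Client 2 now sees a reset or EOF instead of a reply.
c2.settimeout(2)
try:
    reply = c2.recv(1024)
except OSError:                 # connection reset, or nothing ever arrives
    reply = b""
```

This matches the observed symptoms: the client's connect and write both appear to work, and the unread bytes show up in the Send-Q.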
The only way to fully solve this problem would be to have the listener process keep the socket open continuously and pass off the incoming socket connections to sub-processes. There is currently no way for GT.M to do this due to some architectural constraints.
One solution, if you have control of the client end of the process, is to explicitly check for this situation in the client: if a connection is made but no message is received from the server, just retry with a new connection.
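That check amounts to treating "connected but no reply" the same as "connection refused". A Python sketch of the idea (the host, port and payload are placeholders; the real client here is Perl, but the logic is the same):

```python
import socket

def request_with_retry(host, port, payload, retries=3, timeout=5.0):
    """Send one request; if the connection fell into the listener's
    close window we get no reply back, so retry on a fresh connection."""
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)
                s.sendall(payload)
                reply = s.recv(4096)
                if reply:               # a live server actually answered
                    return reply
                # empty read: we connected to thin air, so retry
        except OSError:
            pass                        # refused / reset / timeout: retry too
    raise RuntimeError("no reply after %d attempts" % retries)
```

With the problem rate at about 1 in 30, even a single retry makes a lost request vanishingly unlikely.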
I'm currently exploring the possibility of using a Perl socket listener which can fork for each socket connection and then fire up a GT.M process with the socket connected to STDIN and STDOUT. Don't know how feasible this would be...
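The shape of that listener would be roughly this (sketched in Python rather than Perl; `/bin/cat` is a stand-in worker, and the commented command line is only a guess at how one might start a GT.M routine):

```python
import os
import signal
import socket

# Stand-in worker command; in reality something like
# ["/home/gtm42/mumps", "-run", "HANDLER"] (hypothetical entry point).
WORKER = ["/bin/cat"]

signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # auto-reap worker children

def serve_forever(port):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(5)                    # the listener never closes: no window
    while True:
        conn, _ = srv.accept()
        if os.fork() == 0:           # child: the socket becomes stdin/stdout
            os.dup2(conn.fileno(), 0)
            os.dup2(conn.fileno(), 1)
            srv.close()
            os.execv(WORKER[0], WORKER)
        conn.close()                 # parent drops its copy, keeps listening
```

Because the listening socket stays open for the life of the parent, queued connections are always eventually accepted and the race window disappears entirely.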