Re: [Sqlrelay-discussion] listener hang with open database port
From: Cal H. <ca...@fb...> - 2010-05-27 16:04:42
Thanks for the info, Renat; this helps me a lot. All of these signal
interactions are pretty confusing the first time. I should also add that
the first operation works correctly; it's the second call that hangs
indefinitely.
-------------------------
sqlrconnection *conn = new sqlrconnection(...);
conn->autoCommitOff(); // hangs for 10 seconds, then returns false
sqlrcursor *cur = new sqlrcursor(conn);
cur->sendQuery(...); // hangs forever here
-------------------------
In addition, if I break out of my client app and fire it up a second time,
it hangs forever on the first autoCommitOff() call. So it seems like the
first operation is successful, but everything after that, even newly
forked listeners, ends up waiting on a series of semaphore blocks.
I've traced down the operation so far to this:
---------------
db2connection.C db2connection::logIn()
With the SQL_LOGIN_TIMEOUT attribute set to 5 seconds, I believe the
SQLConnect() call fails as it should, and logIn() returns false.
---------------
initconnection.C
I have reloginatstart="no" since in my case, if a server is dead I want the
client to go into a failure mode. (I have my own methods on the client of
either picking a different server, or showing a "we'll be back soon"
message)
At around line 102 in initConnection(), attemptLogIn() returns false, which
causes initConnection() to return false.
---------------
---------------
connections/db2/main.C
Since initConnection() returns false, the db2 connection process calls
_exit(1).
---------------
Meanwhile, scaler.C openMoreConnections() has called openOneConnection(),
which returned successfully, since it only checks that the fork() call
succeeded in the parent. It then goes into incConnections(), where it
waits on semaphore 8.
Since initConnection() has returned before calling
incrementConnectionCount(), semaphore 8 is never signaled, which appears to
cause the scaler to wait forever inside incConnections(). Since it's waiting
there, it will never start up any more connections, and we have a
downward spiral of clients and locked-up listener processes.
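A minimal sketch of that failure mode, outside SQL Relay: a parent forks a
stand-in "connection" child and waits for it to signal a semaphore. This is
not the scaler's code. The function name parentSawSignal, the private
one-semaphore set, and the use of the Linux-specific semtimedop() are all
assumptions made for illustration. When the child exits before signaling, an
untimed semop() would block indefinitely; the timed wait used here returns
instead.

```cpp
#include <cassert>
#include <cerrno>
#include <ctime>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// semctl(SETVAL) takes this union; on Linux the caller has to define it.
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

// Hypothetical helper: forks a child and waits for it to signal semaphore 0.
// If childLogsIn is false the child exits before signaling, like the
// failed-login path. Returns true only if the signal arrived in time.
bool parentSawSignal(bool childLogsIn, long timeoutSeconds) {
    assert(timeoutSeconds > 0);

    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) return false;

    union semun arg;
    arg.val = 0;  // start at 0 so the parent's decrement has to block
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return false;
    }

    pid_t pid = fork();
    if (pid == -1) {
        semctl(semid, 0, IPC_RMID);
        return false;
    }
    if (pid == 0) {
        if (!childLogsIn) _exit(1);  // dies without ever signaling
        struct sembuf signalOp = {0, 1, 0};
        semop(semid, &signalOp, 1);
        _exit(0);
    }

    // Parent: wait for the signal, but give up after timeoutSeconds.
    struct sembuf waitOp = {0, -1, 0};
    struct timespec timeout;
    timeout.tv_sec = timeoutSeconds;
    timeout.tv_nsec = 0;
    int rc;
    do {
        rc = semtimedop(semid, &waitOp, 1, &timeout);
    } while (rc == -1 && errno == EINTR);

    waitpid(pid, nullptr, 0);    // reap the child
    semctl(semid, 0, IPC_RMID);  // remove the private semaphore set
    return rc == 0;
}
```

Calling parentSawSignal(false, 1) takes the timeout path, since nothing ever
increments the semaphore; parentSawSignal(true, 2) returns as soon as the
child signals.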
---------------
Here's the scaler gdb run, with a few of my debugging statements added.
(gdb) r
Starting program: /usr/local/firstworks/bin/sqlr-scaler -id openport -debug
-fork -config /usr/local/firstworks/etc/sqlrelay.conf
openMoreConnections(): connections: 0
openMoreConnections(): sessions: 1
openMoreConnections(): grow loop: i=0
openMoreConnections(): start while loop
scaler::openOneConnection_fork(): doing fork with command:
sqlr-connection-db2 -silent -nodetatch -ttl 60 -id openport -connectionid
dev -config /usr/local/firstworks/etc/sqlrelay.conf -debug
scaler: forked pid 20163
openMoreConnections(): after openOneConnection() success=1
incrConnections() start
db2 main.C call initConnection()
Debugging to: /usr/local/firstworks/var/sqlrelay/debug/sqlr-connection.20163
db2connection::logIn() start connect
db2connection::logIn() error connect, return false
sqlrconnection_svr::initConnection(): attemptLogIn() fail
db2 main.C: connect fail, _exit(1)
Debugging to: /usr/local/firstworks/var/sqlrelay/debug/sqlr-listener.20433
listener: waiting for scaler
(hangs here, I did a ctrl C)
Program received signal SIGINT, Interrupt.
0x00000035058c83c9 in semop () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x00000035058c83c9 in semop () from /lib64/tls/libc.so.6
#1 0x0000002a956eeb9f in rudiments::semaphoreset::semOp ()
from /usr/local/firstworks/lib/librudiments-0.32.so.1
#2 0x0000002a956ee03c in rudiments::semaphoreset::wait ()
from /usr/local/firstworks/lib/librudiments-0.32.so.1
#3 0x000000000040486c in scaler::incConnections (this=0x5061f0) at
scaler.C:502
#4 0x0000000000404650 in scaler::openMoreConnections (this=0x5061f0) at
scaler.C:449
#5 0x00000000004049bf in scaler::loop (this=0x5061f0) at scaler.C:544
#6 0x0000000000404ba8 in main (argc=7, argv=0x7fbffff658) at main.C:26
(gdb) frame 3
#3 0x000000000040486c in scaler::incConnections (this=0x5061f0) at
scaler.C:502
502 if (! semset->wait(8) )
---------------
So it seems that the connection fails out, but the scaler just keeps waiting
for the connection process to increment the connection count.
In looking at the rudiments API for semaphores, what if I did a
semset->waitWithUndo() instead?
It looks like that might have solved it. Here's what my
scaler::incConnections() looks like, cleaned up.
void scaler::incConnections()
{
    /* wait for the connection count to increase. Time out at 10 seconds.
     * Since the login timeout is 5 seconds, this gives a bit of buffer time
     */
    if (! semset->waitWithUndo(8, 10, 0) )
        return;

    if (use_fork) {
        this->currentconnections++;
    }
}
I'm not sure if this change would cause other bugs though. I'm going to do
some more testing to see how this works, and I might email a patch in later
if I make any other changes.
Let me know if this is the wrong way to solve this.
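To help judge side effects, here is a small sketch of what an "undo" wait
means at the System V level, assuming waitWithUndo() corresponds to a
semop() with the SEM_UNDO flag (an assumption about rudiments, not checked
against its source). The helper name valueAfterChildExit is hypothetical.
With SEM_UNDO the kernel reverses a process's semaphore adjustments when
that process exits; without it the decrement stays. If that assumption
holds, it is probably the timeout arguments, not the undo behaviour, that
end the scaler's wait.

```cpp
#include <cassert>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// semctl(SETVAL) takes this union; on Linux the caller has to define it.
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

// Hypothetical helper: a child decrements a semaphore that starts at 1 and
// then exits. Returns the semaphore's value afterwards, or -1 on error.
int valueAfterChildExit(bool useUndo) {
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) return -1;

    union semun arg;
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        struct sembuf op;
        op.sem_num = 0;
        op.sem_op = -1;  // the "wait": take the semaphore from 1 to 0
        op.sem_flg = static_cast<short>(useUndo ? SEM_UNDO : 0);
        semop(semid, &op, 1);
        _exit(0);        // exit without incrementing it again
    }
    assert(pid > 0);     // only the parent reaches this point

    waitpid(pid, nullptr, 0);
    int value = semctl(semid, 0, GETVAL);
    semctl(semid, 0, IPC_RMID);
    return value;
}
```

On Linux, valueAfterChildExit(true) yields 1, because the kernel puts the
value back when the child exits, while valueAfterChildExit(false) yields 0.
One thing to watch under the same assumption: a scaler that waited with undo
and later exits would have its decrements reversed as well.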
Thanks!
--Cal
On Thu, May 27, 2010 at 2:21 AM, Renat Sabitov <sr...@st...> wrote:
> Hi Cal,
>
> I don't really understand what happens in your case, but have some ideas.
>
> > "waiting for the scaler..." which is from sqlrlistener.C around line
> > 1285. It hangs at that point until I manually kill the listener
> > process. I've been trying to study what is happening here between the
> > listener and scaler, but haven't determined anything so far.
>
> After this message listener waits for scaler to signal the semaphore
> number 7. You can see this with strace or looking at backtrace in gdb.
>
> Try to run a command like this against sqlr-listener (here I did it for
> sqlr-scaler, you can see that it waits for semaphore #6):
>
> $ sudo -u sqlrelay strace -p 2201
> Process 2201 attached - interrupt to quit
> semop(294921, {{6, -1, 0}}, 1
>
> Scaler always waits for signal 6 to start the procedure of firing up new
> connections. Then it counts sessions and connections and signals
> listener to keep going with signal 7.
>
> I believe that listener could freeze at this point if there is no scaler
> at all or if the semaphore #4 is acquired by any other process and
> scaler can't acquire it.
>
> You could examine the semaphore state with patched sqlr-status, if the
> value is 1 - then it's free for acquiring, 0 - already acquired.
>
> You could try "-fork" option to sqlr-start, in this case scaler doesn't
> use connection counter in shared memory and so doesn't use semaphore #4.
>
> Or you could just remove acquiring and releasing semaphore #4 from
> scaler::countConnections() because there is no need to serialize access
> to reading one value - who cares if some process write another value a
> bit earlier or later.
>
> But I don't really think that the problem is in the semaphores. You
> should examine the state of processes with strace and gdb first.
>
> --
> Renat Sabitov e-mail: sr...@st...
> Stack Soft jid: sr...@ja...