Thread: [Sqlrelay-discussion] listener hang with open database port
|
From: Cal H. <ca...@fb...> - 2010-05-26 20:04:22
|
David / Renat, I was wondering if you might know the answer to this. I have
a special failure condition with DB2 that isn't being handled correctly by
SQL Relay. In some rare cases, if DB2 is overloaded or crashes in a
mysterious way, the port is left open but unresponsive after the connection
is set up. This is different from the typical connection timeout: the
connection is made quickly, but it then hangs on authentication
indefinitely.
You can recreate this by setting up a DB2 catalog that points at some open
service that doesn't respond. I set up one pointing at a running Apache
instance and it recreated my scenario perfectly. (Even connecting with the
db2 command line utility hangs forever, and you can't break out or even
kill the process.) I've also looked for DB2CLI.INI timeout settings for
this, with no luck -- only connection timeout settings exist. I've applied
all of Renat's patches and still have the same problem.
The last entry I see in the listener debug log is "waiting for the
scaler...", which comes from sqlrlistener.C around line 1285. It hangs at
that point until I manually kill the listener process. I've been trying to
study what happens there between the listener and the scaler, but haven't
determined anything so far.
Let me know if you have any ideas. Thanks!
--Cal
|
|
From: Renat S. <sr...@st...> - 2010-05-27 07:21:13
|
Hi Cal,
I don't really understand what happens in your case, but have some ideas.
> "waiting for the scaler..." which is from sqlrlistener.C around line
> 1285. It hangs at that point until I manually kill the listener
> process. I've been trying to study what is happening here between the
> listener and scaler, but haven't determined anything so far.
After this message the listener waits for the scaler to signal semaphore
number 7. You can see this with strace or by looking at a backtrace in gdb.
Try running a command like this against sqlr-listener (here I did it for
sqlr-scaler; you can see that it waits on semaphore #6):
$ sudo -u sqlrelay strace -p 2201
Process 2201 attached - interrupt to quit
semop(294921, {{6, -1, 0}}, 1
The scaler always waits on semaphore 6 to start the procedure of firing up
new connections. It then counts sessions and connections and signals the
listener to keep going via semaphore 7.
I believe the listener could freeze at this point if there is no scaler at
all, or if semaphore #4 is held by some other process and the scaler can't
acquire it.
You can examine the semaphore state with the patched sqlr-status: if the
value is 1, it's free to acquire; if 0, it's already acquired.
You could try the "-fork" option to sqlr-start; in that case the scaler
doesn't use the connection counter in shared memory and so doesn't use
semaphore #4.
Or you could just remove acquiring and releasing semaphore #4 from
scaler::countConnections(), because there is no need to serialize access
when reading a single value -- who cares if some process writes another
value a bit earlier or later.
But I don't really think the problem is in the semaphores. You should
examine the state of the processes with strace and gdb first.
--
Renat Sabitov e-mail: sr...@st...
Stack Soft jid: sr...@ja...
|
|
From: Cal H. <ca...@fb...> - 2010-05-27 16:04:42
|
Thanks for the info Renat, this helps me a lot. All of these semaphore
interactions are pretty confusing the first time through. I should also add
that the first operation works correctly; it's the second call that hangs
indefinitely.
-------------------------
sqlrconnection *conn = new sqlrconnection(...);
conn->autoCommitOff(); // hangs for 10 seconds, then returns false
sqlrcursor *cur = new sqlrcursor(conn);
cur->sendQuery(...); // hangs forever here
-------------------------
In addition, I can break out of my client app and fire it up a second time;
it will then hang forever on the first autoCommitOff() call. So it seems
the first operation is successful, but everything after that -- even newly
forked listeners -- waits on a series of semaphore blocks.
I've traced down the operation so far to this:
---------------
db2connection.C, db2connection::logIn()
The SQL_LOGIN_TIMEOUT attribute, set to 5 seconds, I believe causes this to
fail correctly at the SQLConnect() call. The function returns false.
---------------
initconnection.C
I have reloginatstart="no" since, in my case, if a server is dead I want
the client to go into a failure mode. (I have my own methods on the client
for either picking a different server or showing a "we'll be back soon"
message.)
At around line 102 in initConnection(), attemptLogIn() returns false, which
causes initConnection() to return false.
---------------
connections/db2/main.C
Since initConnection() returns false, the db2 connection process does an
_exit(1);
---------------
Meanwhile, scaler.C's openMoreConnections() has called openOneConnection(),
which returned successfully, since it only checks the success of the fork()
call in the parent. It then goes into incConnections(), where it waits on
semaphore 8.
Since initConnection() returned before calling incrementConnectionCount(),
semaphore 8 is never signaled, which appears to cause the scaler to wait
inside incConnections(). Since it's waiting there, it will never start any
more connections after that, and we get a downward spiral of queued clients
and locked-up listener processes.
---------------
Here's the scaler gdb run, with a few of my debugging statements added.
(gdb) r
Starting program: /usr/local/firstworks/bin/sqlr-scaler -id openport -debug
-fork -config /usr/local/firstworks/etc/sqlrelay.conf
openMoreConnections(): connections: 0
openMoreConnections(): sessions: 1
openMoreConnections(): grow loop: i=0
openMoreConnections(): start while loop
scaler::openOneConnection_fork(): doing fork with command:
sqlr-connection-db2 -silent -nodetatch -ttl 60 -id openport -connectionid
dev -config /usr/local/firstworks/etc/sqlrelay.conf -debug
scaler: forked pid 20163
openMoreConnections(): after openOneConnection() success=1
incrConnections() start
db2 main.C call initConnection()
Debugging to: /usr/local/firstworks/var/sqlrelay/debug/sqlr-connection.20163
db2connection::logIn() start connect
db2connection::logIn() error connect, return false
sqlrconnection_svr::initConnection(): attemptLogIn() fail
db2 main.C: connect fail, _exit(1)
Debugging to: /usr/local/firstworks/var/sqlrelay/debug/sqlr-listener.20433
listener: waiting for scaler
(hangs here, I did a ctrl C)
Program received signal SIGINT, Interrupt.
0x00000035058c83c9 in semop () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x00000035058c83c9 in semop () from /lib64/tls/libc.so.6
#1 0x0000002a956eeb9f in rudiments::semaphoreset::semOp ()
from /usr/local/firstworks/lib/librudiments-0.32.so.1
#2 0x0000002a956ee03c in rudiments::semaphoreset::wait ()
from /usr/local/firstworks/lib/librudiments-0.32.so.1
#3 0x000000000040486c in scaler::incConnections (this=0x5061f0) at
scaler.C:502
#4 0x0000000000404650 in scaler::openMoreConnections (this=0x5061f0) at
scaler.C:449
#5 0x00000000004049bf in scaler::loop (this=0x5061f0) at scaler.C:544
#6 0x0000000000404ba8 in main (argc=7, argv=0x7fbffff658) at main.C:26
(gdb) frame 3
#3 0x000000000040486c in scaler::incConnections (this=0x5061f0) at
scaler.C:502
502 if (! semset->wait(8) )
---------------
So it seems the connection fails out, but the scaler just keeps waiting for
the connection process to increment the counter.
Looking at the rudiments API for semaphores: what if I did a
semset->waitWithUndo() instead?
It looks like that might have solved it. Here's what my
scaler::incConnections() looks like, cleaned up:
void scaler::incConnections() {
	/* Wait for the connection count to increase.  Time out at 10
	 * seconds; since the login timeout is 5 seconds, this gives a bit
	 * of buffer time. */
	if (!semset->waitWithUndo(8, 10, 0)) {
		return;
	}
	if (use_fork) {
		this->currentconnections++;
	}
}
I'm not sure whether this change would cause other bugs, though. I'm going
to do some more testing to see how it works, and I might email a patch
later if I make any other changes.
Let me know if this is the wrong way to solve this.
Thanks!
--Cal
|
|
From: Renat S. <sr...@st...> - 2010-05-27 17:10:47
|
27.05.2010 20:04, Cal Heldenbrand writes:
> #3 0x000000000040486c in scaler::incConnections (this=0x5061f0) at
> scaler.C:502
> 502 if (! semset->wait(8) )
>
OK, now I see. The scaler waits for the new connection to report that it
has started. So if the connection hangs (or exits) somewhere before calling
signal(8), the scaler will never stop waiting.
> void scaler::incConnections()
> {
> /* wait for the connection count to increase. Time out at 10 seconds.
> * Since the login timeout is 5 seconds, this gives a bit of
> buffer time
> */
> if (! semset->waitWithUndo(8, 10, 0) )
> return;
>
I think the best way is to move incrementing the connection counter to
somewhere before interacting with the db, just after the connection process
starts -- maybe in init. Or even get rid of semaphore #8 entirely in the
-fork case, because the scaler already knows whether the connection started
or not (it got the PID).
--
Renat
|
|
From: Cal H. <ca...@fb...> - 2010-05-27 17:28:58
|
Well I think I like the last suggestion the best, since it's the easiest to
implement. ;-)
void scaler::incConnections() {
	if (!use_fork) {
		semset->wait(8);
	}
	if (use_fork) {
		fprintf(stderr, "incConnections(): DO incr\n"); fflush(stderr);
		this->currentconnections++;
	}
}
This seems to work just the same; however, I've noticed that the scaler's
"Sessions" count doesn't decrease. Each failed connection causes the count
to increase by one.
I was playing around with something like:
if (!semset->waitWithUndo(8, 10, 0)) {
	// decrement the session counter; I'm not sure if this should
	// happen here
	shmdata *ptr = (shmdata *)idmemory->getPointer();
	ptr->connectionsinuse--;
	return;
}
Which seemed to work fine, but I'm not sure that's the correct place to
decrement that counter.
Any ideas on that?
Thanks Renat!
--Cal
|
|
From: Renat S. <sr...@st...> - 2010-05-27 20:20:54
|
27.05.2010 21:28, Cal Heldenbrand writes:
> Which seemed to work fine, but I'm not sure that's the correct place
> to decrement that counter.
>
> Any ideas on that?
>
The session counter is incremented in the listener when a client session is
authorized. It is decremented when the connection finishes processing the
client session's requests. What happens with your client session? Does it
get an error, or what? Does the listener complete the socket handoff
procedure to the connection? When the handoff exits with an error, that
might be the best place to decrement the counter:
----
if (dynamicscaling) {
	incrementSessionCount();
}
passstatus=handOffClient(clientsock);
// addition:
if (!passstatus) {
	decrementSessionCount();	// this function does not exist yet
}
----
--
Renat
|
|
From: Renat S. <sr...@st...> - 2010-05-27 19:26:53
|
27.05.2010 23:17, Renat Sabitov writes:
> Does listener complete socket handoff procedure to client?
I'm sorry -- of course "to connection", not "to client".
|
|
From: Cal H. <ca...@fb...> - 2010-05-27 20:51:00
|
> Session counter is incremented in listener when client session is
> authorized. It is decremented when connection finishes processing client
> session's requests. What happens with your client session? Does it get an
> error or what? Does listener complete socket handoff procedure to
> client?
The client seems to handle the errors correctly. After autoCommitOff()
returns false, sendQuery() fails with the error message:
"Failed to authenticate.
A network error may have ocurred."
> When handoff exits with error, there might be the best place to
> decrement counter?
>
> ----
> if (dynamicscaling) {
> incrementSessionCount();
> }
> passstatus=handOffClient(clientsock);
> // addition:
> if (!passstatus) {
> decrementSessionCount(); // this function is
> not exist yet
> }
> ----
>
This is the strange part -- I added a print statement after handOffClient()
to see what the return status is, and it never prints. It does get there,
though; authstatus is 1. I don't really understand how handOffClient()
works, but it must never return?
I have my listenertimeout config setting at 5 seconds, and it appears that
the listener times out each time the client receives an error back.
(Changing this value changes how fast the client gets an error message.)
Could that have something to do with it?
Thanks,
--Cal
|
|
From: Renat S. <sr...@st...> - 2010-05-28 04:52:43
|
28.05.2010 00:50, Cal Heldenbrand writes:
> I have my listenertimeout config setting to 5 seconds, and it appears
> that the listener times out each time the client receives an error
> back. (Changing this value changes how fast the client gets an error
> message) Could that have something to do with it?
My suggestion: the forked listener starts the handoff session to
connection-db2, but there is no connection to take the session. So the
listener times out with SIGALRM and catches it in
sqlrlistener::alarmHandler. That might be another good place to decrement
the session count, but only after making sure it was actually incremented
in the listener.