As per Jeff Rogers on 2006-01-12:
I found a bug in AOLserver 4.0.10 (and previous 4.x
versions; I'm not sure about earlier ones) that causes the server to
lock up. I'm fairly certain I understand the cause, and my
fix appears to work although I'm not sure it is the best
approach.
The bug: when benchmarking the server with a program like ab
with concurrency=1 (that is, it issues a single request,
waits for it to complete, then immediately issues the next
one) the server will lock up, consuming no CPU but not
responding to any requests.
My explanation: once the maximum number of threads is
reached, queueing a new connection (NsQueueConn) fails
because there is no free connection in the pool, and the
new connection is added to the wait list (waitPtr). If
there is a wait list then no drivers are
polled for new connections (driver.c:801), rather it waits
to be triggered (SockTrigger) to indicate that a thread is
available to handle the connection. The triggering is done
when the connection is completed, within NsSockClose.
NsSockClose in turn is going to be called somewhere within
the running of the connection (ConnRun - queue.c:617).
However, the available thread is not put back onto the queue
free list until after ConnRun has completed (queue.c:638).
So if the driver thread runs in the time slice after ConnRun
has completed for all active connections but before they are
added back to the free list, then it attempts to queue the
connection, fails, adds it to the wait list, then waits for
the trigger which will never come, and everything stops.
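The ordering problem described above can be sketched as a small, deterministic model. This is only an illustration under simplifying assumptions: it is single-threaded, it assumes one connection structure in the pool, and it assumes the driver has already consumed the earlier trigger by the time it parks the new connection. None of the names below (Model, driver_accept, simulate, and so on) come from the AOLserver source.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the pool state; hypothetical names, not AOLserver code. */
typedef struct {
    int  free_conns;  /* conn structures currently on the free list */
    bool waiting;     /* a connection is parked on the wait list */
    int  triggers;    /* unread wakeups sitting in the trigger pipe */
} Model;

/* Driver: try to queue a new connection; park it if no conn is free. */
static void driver_accept(Model *m) {
    if (m->free_conns > 0) {
        m->free_conns--;
    } else {
        m->waiting = true;
    }
}

/* Driver: if a trigger is pending, consume it and retry the parked conn. */
static void driver_wake(Model *m) {
    assert(m->free_conns >= 0 && m->triggers >= 0);
    if (m->triggers == 0) {
        return;                 /* nothing woke us, so we stay blocked */
    }
    m->triggers = 0;
    if (m->waiting && m->free_conns > 0) {
        m->waiting = false;
        m->free_conns--;
    }
}

static void worker_trigger(Model *m) { m->triggers++; }   /* wake the driver */
static void worker_release(Model *m) { m->free_conns++; } /* conn back on free list */

/* Returns true if the driver ends up stuck with a parked connection. */
bool simulate(bool trigger_after_release) {
    Model m = { 0, false, 0 };  /* the only conn is busy in a worker */

    if (!trigger_after_release) {
        worker_trigger(&m);     /* trigger sent while the conn is still busy */
        driver_wake(&m);        /* driver consumes it; nothing is waiting yet */
    }
    driver_accept(&m);          /* next request arrives: no free conn, parked */
    worker_release(&m);         /* only now does the conn return to the free list */
    if (trigger_after_release) {
        worker_trigger(&m);     /* trigger sent strictly after the release */
    }
    driver_wake(&m);
    return m.waiting;
}
```

In this model simulate(false) leaves the connection parked with no wakeup pending, while simulate(true) lets the driver retry and succeed, which is the ordering property the explanation depends on.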
The problem is a race condition, and as such is extremely
timing sensitive; I cannot reproduce the problem on a
generic setup, but when I'm benchmarking my OpenACS setup it
hits the bug very quickly and reliably. The explanation
suggests, and my testing confirms, that it occurs much
less reliably with concurrency > 1 or when there is a
small delay between requests. Together these
mean that the lockup is most likely to show up in exactly my
test case, while much less likely on a production server or
with high-concurrency load testing.
My solution is to register SockTrigger as a ready proc;
ready procs are run immediately after the freed conns are put
back onto the free queue (queue.c:645). This fixes the problem
by ensuring that the trigger pipe is notified strictly after
the free queue is updated, so the waiting conn will
successfully be queued. However I'm not sure this is best:
NsSockClose attempts to minimize the number of times
SockTrigger is called in the case when multiple connections
are being closed at the same time; my fix means it is called
exactly once for each connection, or twice counting the call
in NsSockClose. It's not clear to me what adverse impact
this has, if any, but one thing that could be done is to
remove the SockTrigger calls from NsSockClose as redundant.
Some additional logic could be added into SockTrigger to not
send to the trigger pipe under certain conditions (i.e., if
it has been triggered and not acknowledged yet, or if there
is no waiting connection), but that would require mutex
protection which could ultimately be more expensive than
just blindly triggering the pipe.
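For comparison, the conditional variant mentioned above might look roughly like the following. This is a hypothetical sketch, not the actual SockTrigger: the Trigger struct, its pending and waiting flags, and all function names are assumptions made for illustration. It skips the pipe write when a trigger is already outstanding or when nothing is waiting, at the cost of taking a mutex on every call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical coalescing trigger; not AOLserver's implementation. */
typedef struct {
    pthread_mutex_t lock;
    bool pending;   /* a byte was written and not yet acknowledged */
    bool waiting;   /* the driver has a connection on its wait list */
    int  fds[2];    /* fds[0] is the read end, fds[1] the write end */
} Trigger;

/* Returns 0 on success, -1 on failure. */
int trigger_init(Trigger *t) {
    assert(t != NULL);
    t->pending = false;
    t->waiting = false;
    if (pthread_mutex_init(&t->lock, NULL) != 0) {
        return -1;
    }
    return pipe(t->fds);
}

/* Driver side: record whether a connection is parked. */
void trigger_set_waiting(Trigger *t, bool waiting) {
    pthread_mutex_lock(&t->lock);
    t->waiting = waiting;
    pthread_mutex_unlock(&t->lock);
}

/* Worker side: returns 1 if a byte was written, 0 if skipped, -1 on error. */
int trigger_fire(Trigger *t) {
    int wrote = 0;
    pthread_mutex_lock(&t->lock);
    if (t->waiting && !t->pending) {
        if (write(t->fds[1], "", 1) == 1) {
            t->pending = true;
            wrote = 1;
        } else {
            wrote = -1;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return wrote;
}

/* Driver side: drain the byte and allow the next trigger. Returns 1 if drained. */
int trigger_ack(Trigger *t) {
    char c;
    int got = 0;
    pthread_mutex_lock(&t->lock);
    if (t->pending && read(t->fds[0], &c, 1) == 1) {
        t->pending = false;
        got = 1;
    }
    pthread_mutex_unlock(&t->lock);
    return got;
}
```

Note that skipping the write when nothing is waiting is only safe if the waiting flag is updated under the same lock as the failed queue attempt; otherwise the same kind of lost wakeup could reappear. That extra locking is the cost weighed above against simply writing to the pipe every time.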
Here's a context diff for my patch:
*** driver.c.orig Thu Jan 12 11:39:05 2006
--- driver.c Thu Jan 12 11:39:10 2006
***************
*** 773,778 ****
--- 773,781 ----
drvPtr = nextDrvPtr;
}
+ /* register a ready proc to trigger the poll */
+ Ns_RegisterAtReady(SockTrigger,NULL);
+
/*
* Loop forever until signalled to shutdown and all
* connections are complete and gracefully closed.
Dossy Shiobara
Architecture: Server (nsd)
aolserver_v40
Public