From: Zoran V. <zv...@ar...> - 2006-01-12 19:55:18
Vlad, Stephen,

What do you think?

Begin forwarded message:

> From: Jeff Rogers <dv...@DI...>
> Date: 12 January 2006 20:34:09 CET
> To: AOL...@LI...
> Subject: [AOLSERVER] aolserver bug
> Reply-To: AOLserver Discussion <AOL...@LI...>
>
> I found a bug in aolserver 4.0.10 (and previous 4.x versions, not sure about
> earlier) that causes the server to lock up. I'm fairly certain I understand
> the cause, and my fix appears to work, although I'm not sure it is the best
> approach.
>
> The bug: when benchmarking the server with a program like ab with
> concurrency=1 (that is, it issues a single request, waits for it to
> complete, then immediately issues the next one), the server will lock up,
> consuming no CPU but not responding to any requests.
>
> My explanation: when the max number of threads is hit, then when a new
> connection is queued (NsQueueConn) it will be unable to find a free
> connection in the pool, the queueing fails, and the new connection is
> added to the wait list (waitPtr). If there is a wait list, then no drivers
> are polled for new connections (driver.c:801); rather, it waits to be
> triggered (SockTrigger) to indicate that a thread is available to handle the
> connection. The triggering is done when the connection is completed, within
> NsSockClose. NsSockClose in turn is going to be called somewhere within the
> running of the connection (ConnRun - queue.c:617). However, the available
> thread is not put back onto the queue free list until after ConnRun has
> completed (queue.c:638). So if the driver thread runs in the time slice
> after ConnRun has completed for all active connections but before they are
> added back to the free list, then it attempts to queue the connection,
> fails, adds it to the wait list, then waits for the trigger, which will never
> come, and everything stops.
>
> The problem is a race condition, and as such is extremely timing sensitive;
> I cannot reproduce the problem on a generic setup, but when I'm benchmarking
> my OpenACS setup it hits the bug very quickly and reliably. The explanation
> suggests, and my testing confirms, that it seems to occur much less reliably
> with concurrency > 1 or if there is a small delay between sending the
> connections. Together these mean that the lockup is most likely to show up
> in exactly my test case, while much less likely on a production server or
> with high-concurrency load testing.
>
> My solution is to register SockTrigger as a ready proc; ready procs are run
> immediately after the freed conns are put back on to the free queue
> (queue.c:645). This fixes the problem by ensuring that the trigger pipe is
> notified strictly after the free queue is updated and the waiting conn will
> successfully be queued. However, I'm not sure this is best: NsSockClose
> attempts to minimize the number of times SockTrigger is called in the case
> when multiple connections are being closed at the same time; my fix means it
> is called exactly once for each connection, or twice counting the call in
> NsSockClose. It's not clear to me what adverse impact this has, if any, but
> one thing that could be done is to remove the SockTrigger calls from
> NsSockClose as redundant. Some additional logic could be added into
> SockTrigger to not send to the trigger pipe under certain conditions (i.e.,
> if it has been triggered and not acknowledged yet, or if there is no waiting
> connection), but that would require mutex protection, which could ultimately
> be more expensive than just blindly triggering the pipe.
>
> Here's a context diff for my patch:
>
> *** driver.c.orig Thu Jan 12 11:39:05 2006
> --- driver.c Thu Jan 12 11:39:10 2006
> ***************
> *** 773,778 ****
> --- 773,781 ----
>   drvPtr = nextDrvPtr;
>   }
>
> + /* register a ready proc to trigger the poll */
> + Ns_RegisterAtReady(SockTrigger,NULL);
> +
>   /*
>    * Loop forever until signalled to shutdown and all
>    * connections are complete and gracefully closed.
>
> -J
>
> --
> AOLserver - http://www.aolserver.com/
>
> To Remove yourself from this list, simply send an email to
> <lis...@li...> with the
> body of "SIGNOFF AOLSERVER" in the email message. You can leave the
> Subject: field of your email blank.