From: Cyan o. <cy...@ru...> - 2014-08-28 11:51:18
On Wed, Aug 27, 2014 at 9:34 PM, Torben Brosten <to...@de...> wrote:
> On 08/27/2014 12:02 PM, Cyan ogilvie wrote:
> ..
>> There doesn't seem to be a pattern to the failing requests,
>> sometimes it's small static files like favicon.ico, but mostly not
>> (although in our case we're not using fastpath for that - different
>> favicons are served based on the request context). At the moment I'm
>> leaning towards some sort of corrupted connection thread state - the
>> failures tend to cluster by time, server, user - so that, although
>> the failures are exceedingly rare overall (220 yesterday), it's often
>> the case that a given user will have to reload a page several times
>> before they get a successful response. The servers are fronted by
>> haproxy which will tend to send a given session back to the same
>> server.
> ..
>
> Have you ruled out a router issue, such as from ipv4 exhaustion or
> localized network flooding?

I'm pretty sure it's not network related at this stage. To test this I
built a man-in-the-middle relay, running on the same server as nsd,
which listens on port 8008, forwards all connection traffic to
127.0.0.1:80, and records the events and data it sees flowing in both
directions. I also capture packet traces with tcpdump of both the
requests arriving on eth0 and the relayed traffic on lo. An iptables
nat PREROUTING rule DNATs connections coming in on eth0 port 80 to the
relay's port 8008. Another process watches the haproxy logs for
indications of a failed request and, when it finds one, retrieves a
dump of the relay's ring buffers and saves it for later analysis.
(Rough sketches of these pieces are below, after the trace.)

Here is the packet trace of the relay -> nsd traffic on the lo
interface for a typical event (as captured by tcpdump, which indicated
that no packets had been dropped):

Time       Prot  Len   Info
34.750440  TCP     76  47576 > http [SYN] Seq=0 Win=32792 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750465  TCP     76  http > 47576 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750479  TCP     68  47576 > http [ACK] Seq=1 Ack=1 Win=32800 Len=0
34.750720  TCP   4412  [TCP segment of a reassembled PDU]
34.750756  TCP     68  http > 47576 [ACK] Seq=1 Ack=4345 Win=32768 Len=0
34.751274  HTTP   439  GET /item/531138-RR-1012/Elgin-Art-Deco-Dial-Pocket-Watch HTTP/1.1
34.751295  TCP     68  http > 47576 [ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751377  TCP     68  http > 47576 [FIN, ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751492  TCP     68  47576 > http [FIN, ACK] Seq=4716 Ack=2 Win=32800 Len=0
34.751515  TCP     68  http > 47576 [ACK] Seq=2 Ack=4717 Win=32768 Len=0

The connection reaches the ESTABLISHED state, and the HTTP request data
is sent to nsd and acked. Then, 0.8 ms later, the connection is closed
by nsd. The first Tcl code that should execute for this request is a
preauth filter, which starts by writing a log entry containing
[ns_conn request]. That log message is never reached in cases like
this. In this example the request is quite large (around 4.7 KB)
because of some large cookies, but the same pattern happens for
requests of around 400 bytes.
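For concreteness, that filter is registered along these lines (the proc
name and log text here are illustrative, not our actual code):

proc preauth_log {why} {
    # First Tcl code that should see the request: log the request line
    ns_log notice "preauth: [ns_conn request]"
    # ... the rest of the filter ...
    return filter_ok
}
ns_register_filter preauth GET /* preauth_log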
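In case the shape of the harness is useful to anyone else, the relay is
conceptually just the following. This is a minimal sketch, not the real
code - the real relay records into timestamped ring buffers and handles
half-close properly, and "record" below is a stand-in for that:

# Traffic arriving on eth0:80 is diverted to the relay with something
# along these lines (REDIRECT being the DNAT-to-local-port case):
#   iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 \
#       -j REDIRECT --to-ports 8008

proc record {dir event} {
    # Stand-in for the real recording: the actual relay appends the
    # timestamped event and data to a ring buffer for later retrieval
    puts stderr "[clock microseconds] $dir $event"
}

proc accept {client addr port} {
    set upstream [socket 127.0.0.1 80]
    foreach chan [list $client $upstream] {
        chan configure $chan -translation binary -blocking 0 -buffering none
    }
    chan event $client   readable [list relay $client   $upstream c2s]
    chan event $upstream readable [list relay $upstream $client   s2c]
}

proc relay {from to dir} {
    set data [chan read $from]
    if {$data ne ""} {
        record $dir "[string length $data] bytes"
        puts -nonewline $to $data
    }
    if {[chan eof $from]} {
        record $dir eof    ;# which side closed first is the interesting bit
        catch {close $from}
        catch {close $to}
    }
}

socket -server accept 8008
vwait forever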
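And the watcher is essentially a tail on the haproxy log - the
termination-state pattern and dump_ring_buffers here are placeholders
for the actual matching and retrieval:

proc dump_ring_buffers {logline} {
    # Placeholder: the real process fetches the relay's ring buffers
    # and saves them alongside the triggering haproxy log line
    puts stderr "would dump relay buffers for: $logline"
}

set log [open "|tail -F /var/log/haproxy.log" r]
chan configure $log -blocking 0

chan event $log readable {
    while {[gets $log line] >= 0} {
        # Placeholder pattern: haproxy termination states like SD-- /
        # CD-- flag connections that died mid-request or mid-response
        if {[regexp { [SC]D-- } $line]} {
            dump_ring_buffers $line
        }
    }
    if {[eof $log]} { close $log }
}

vwait forever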
There appears to be a less common failure mode where the request
processing happens normally and the Tcl code generates a normal
response (HTTP status 200), but the response data never hits the
network. The network trace looks the same as the example above, except
that the time between the ACK of the GET request and nsd's FIN, ACK is
longer - around 20-70 ms, which is in line with the normal times for a
successful request. I haven't yet caught the packet trace between the
relay and nsd for this case (only the trace on eth0 between haproxy and
the relay), so I'm not 100% certain of my interpretation of this
failure mode yet. There is usually a cluster of up to about 6 requests
failing this way on a given server within a span of around 2 seconds,
with long intervals between clusters during which no failures happen.

Cyan