From: Cyan o. <cy...@ru...> - 2014-08-28 11:51:18
On Wed, Aug 27, 2014 at 9:34 PM, Torben Brosten <to...@de...> wrote:
> On 08/27/2014 12:02 PM, Cyan ogilvie wrote:
> ..
>> There doesn't seem to be a pattern to the failing requests,
>> sometimes it's small static files like favicon.ico, but mostly not
>> (although in our case we're not using fastpath for that - different
>> favicons are served based on the request context). At the moment I'm
>> leaning towards some sort of corrupted connection thread state - the
>> failures tend to cluster by time, server, user - so that, although
>> the failures are exceedingly rare overall (220 yesterday), it's often
>> the case that a given user will have to reload a page several times
>> before they get a successful response. The servers are fronted by
>> haproxy which will tend to send a given session back to the same
>> server.
> ..
>
> Have you ruled out a router issue, such as from ipv4 exhaustion or
> localized network flooding?

I'm pretty sure it's not network related at this stage. To test this I
built a man-in-the-middle relay, running on the same server as nsd,
which listens on port 8008, forwards all connection traffic to
127.0.0.1:80, and records the events and data it sees flowing in both
directions. I also capture packet traces with tcpdump of both the
requests arriving on eth0 and the relayed traffic on lo. An iptables
nat PREROUTING rule DNATs connections coming in on eth0 port 80 to the
relay's port 8008. Another process watches the haproxy logs for
indications of a failed request and, when it finds one, retrieves a
dump of the relay's ring buffers and saves it for later analysis.
(Rough sketches of these pieces are below, after the trace.)

Here is the packet trace of the relay -> nsd traffic on the lo
interface for a typical event (as captured by tcpdump, which indicated
that no packets had been dropped):

Time       Prot  Len   Info
34.750440  TCP     76  47576 > http [SYN] Seq=0 Win=32792 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750465  TCP     76  http > 47576 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750479  TCP     68  47576 > http [ACK] Seq=1 Ack=1 Win=32800 Len=0
34.750720  TCP   4412  [TCP segment of a reassembled PDU]
34.750756  TCP     68  http > 47576 [ACK] Seq=1 Ack=4345 Win=32768 Len=0
34.751274  HTTP   439  GET /item/531138-RR-1012/Elgin-Art-Deco-Dial-Pocket-Watch HTTP/1.1
34.751295  TCP     68  http > 47576 [ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751377  TCP     68  http > 47576 [FIN, ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751492  TCP     68  47576 > http [FIN, ACK] Seq=4716 Ack=2 Win=32800 Len=0
34.751515  TCP     68  http > 47576 [ACK] Seq=2 Ack=4717 Win=32768 Len=0

The connection reaches the ESTABLISHED state, and the HTTP request data
is sent to nsd and acked. Then, 0.8 ms later, the connection is closed
by nsd. The first Tcl code that should execute for this request is a
preauth filter, which starts by writing a log entry containing
[ns_conn request]. That log message is never reached in cases like
this. In this example the request is quite large (around 4.7 KB)
because of some large cookies, but the same pattern happens for
requests of around 400 bytes.
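For concreteness, that filter is registered along these lines (the proc
name and log text here are illustrative, not our actual code):

proc preauth_log {why} {
    # First Tcl code that should see the request: log the request line
    ns_log notice "preauth: [ns_conn request]"
    # ... the rest of the filter ...
    return filter_ok
}
ns_register_filter preauth GET /* preauth_log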
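In case the shape of the harness is useful to anyone else, the relay is
conceptually just the following. This is a minimal sketch, not the real
code - the real relay records into timestamped ring buffers and handles
half-close properly, and "record" below is a stand-in for that:

# Traffic arriving on eth0:80 is diverted to the relay with something
# along these lines (REDIRECT being the DNAT-to-local-port case):
#   iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 \
#       -j REDIRECT --to-ports 8008

proc record {dir event} {
    # Stand-in for the real recording: the actual relay appends the
    # timestamped event and data to a ring buffer for later retrieval
    puts stderr "[clock microseconds] $dir $event"
}

proc accept {client addr port} {
    set upstream [socket 127.0.0.1 80]
    foreach chan [list $client $upstream] {
        chan configure $chan -translation binary -blocking 0 -buffering none
    }
    chan event $client   readable [list relay $client   $upstream c2s]
    chan event $upstream readable [list relay $upstream $client   s2c]
}

proc relay {from to dir} {
    set data [chan read $from]
    if {$data ne ""} {
        record $dir "[string length $data] bytes"
        puts -nonewline $to $data
    }
    if {[chan eof $from]} {
        record $dir eof    ;# which side closed first is the interesting bit
        catch {close $from}
        catch {close $to}
    }
}

socket -server accept 8008
vwait forever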
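And the watcher is essentially a tail on the haproxy log - the
termination-state pattern and dump_ring_buffers here are placeholders
for the actual matching and retrieval:

proc dump_ring_buffers {logline} {
    # Placeholder: the real process fetches the relay's ring buffers
    # and saves them alongside the triggering haproxy log line
    puts stderr "would dump relay buffers for: $logline"
}

set log [open "|tail -F /var/log/haproxy.log" r]
chan configure $log -blocking 0

chan event $log readable {
    while {[gets $log line] >= 0} {
        # Placeholder pattern: haproxy termination states like SD-- /
        # CD-- flag connections that died mid-request or mid-response
        if {[regexp { [SC]D-- } $line]} {
            dump_ring_buffers $line
        }
    }
    if {[eof $log]} { close $log }
}

vwait forever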
There appears to be a less common failure mode where the request
processing happens normally and the Tcl code generates a normal
response (HTTP status 200), but the response data never hits the
network. The network trace looks the same as the example above, except
that the time between the ACK of the GET request and nsd's FIN, ACK is
longer - around 20-70 ms, which is in line with the normal times for a
successful request. I haven't yet caught the packet trace between the
relay and nsd for this case (only the trace on eth0 between haproxy and
the relay), so I'm not 100% certain of my interpretation of this
failure mode yet. There is usually a cluster of up to about 6 requests
failing this way on a given server within a span of around 2 seconds,
with long intervals between clusters during which no failures happen.

Cyan