Sam Steingold writes:
> Don Cohen wrote:
> > > if you have a reproducible case, I am interested.
> > It happens regularly, so I can reproduce it in that sense.
> > If I record the relevant packets then I suppose it should
> > be possible to reproduce it on demand by some approximation
> > of replay. Or perhaps it would suffice for a start to simply
> > describe the packet(s) that cause the error. Shall we start
> > with that?
> I would prefer something like "start clisp, open socket server,
> connect to it from another shell using telnet, type '...' to
> telnet, kill telnet, type '...' to clisp, observe the error".
I understand, but this may be more difficult to produce, so let's
start with this: The errors seem to come (many times recently) from
packet traces that all look substantially the same:
11:40:18.824273 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
length 78: IP 216.240.130.195.49199 > 64.27.16.100.http: S 2265407414:22654074
14(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 3704081566 0,sackOK,eol>
0x0000: 4500 0040 c003 4000 3f06 cf81 d8f0 82c3 E..@..@.?.......
0x0010: 401b 1064 c02f 0050 8707 5fb6 0000 0000 @..d./.P.._.....
0x0020: b002 ffff 3a2a 0000 0204 05b4 0103 0301 ....:*..........
0x0030: 0101 080a dcc7 cc9e 0000 0000 0402 0000 ................
[tcp syn]
11:40:18.824361 00:30:1b:2c:c9:cf > 00:90:69:8a:f0:5d, ethertype IPv4 (0x0800),
length 74: IP 64.27.16.100.http > 216.240.130.195.49199: S 52171131:52171131(0
) ack 2265407415 win 5792 <mss 1460,sackOK,timestamp 826801476 3704081566,nop,wscale 6>
0x0000: 4500 003c 0000 4000 4006 8e89 401b 1064 E..<..@...
0x0010: d8f0 82c3 0050 c02f 031c 117b 8707 5fb7 .....P./...{.._.
0x0020: a012 16a0 f155 0000 0204 05b4 0402 080a .....U..........
0x0030: 3147 fd44 dcc7 cc9e 0103 0306 1G.D........
[tcp synack]
11:40:18.824602 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
length 66: IP 216.240.130.195.49199 > 64.27.16.100.http: . ack 1 win 33304 <nop,nop,timestamp 3704081566 826801476>
0x0000: 4500 0034 c012 4000 3f06 cf7e d8f0 82c3 E..4..@.?..~....
0x0010: 401b 1064 c02f 0050 8707 5fb7 031c 117c @..d./.P.._....|
0x0020: 8010 8218 b4a8 0000 0101 080a dcc7 cc9e ................
0x0030: 3147 fd44 1G.D
[ack]
11:40:18.825434 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
length 60: IP 216.240.130.195.49199 > 64.27.16.100.http: R 1:1(0) ack 1 win 33304
0x0000: 4500 0028 c037 4000 3f06 cf65 d8f0 82c3 E..(.7@.......
0x0010: 401b 1064 c02f 0050 8707 5fb7 031c 117c @..d./.P.._....|
0x0020: 5014 8218 c5ae 0000 c500 0000 0000 P.............
[reset]
Evidently that results in something filtering up to the
socket-status. In fact, it could well be dependent on my OS version,
in this case, Fedora Core 4 (2.6.17-1.2142_FC4).
This sequence could reasobably be interpreted as a tcp stream being
opened and then closed.
I've now added some debug output before calling socket-status and
afterward in the case of an error (which is now caught). Below shows
it called twice with timeout 1000. The first time it gets an error 11
min. later, the second time immediately (at least within one second):
(:|HTTP| :|SOCKET-STATUS| 1000. "12/23/2008 13:04:06")
(:|HTTP| :|SOCKET-STATUS-ERR| "12/23/2008 13:15:19")
(:|HTTP| :|SOCKET-STATUS| 1000. "12/23/2008 13:15:19")
(:|HTTP| :|SOCKET-STATUS-ERR| "12/23/2008 13:15:19")
After that it got another error at the same time (within the same
second, and then was called again without an immediate error.
There was only one sequence of packets similar to above at 13:15:19.
Perhaps the next question should be what OS call (select?) is being
used to do the socket-status and what errors are advertised as being
possible from that.
My man page says
ERRORS
EBADF An invalid file descriptor was given in one of the sets.
should always occur immediately
EINTR A non blocked signal was caught.
this includes what? Might a reset do it?
EINVAL n is negative or the value contained within timeout is invalid.
should always occur immediately
ENOMEM select was unable to allocate memory for internal tables.
should always occur immediately ?
> (BTW, what do the <1/101> and [98] mean here?)
see print_back_trace & print_stackitem in debug.d
I see where print_back_trace could print <1> where 1 is the value of
index, which seems to be incremented in various places.
I don't see where it would be printing <1/101> and I also don't see
where print_stackitem would be printing [98].
|