|
From: <don...@is...> - 2008-12-22 18:27:17
|
I've recently noticed that socket-status can signal an error.
Actually, this has been happening for a long time, but I only
recently got around to debugging it. I don't see anything about
this in the doc, so this seems like either a problem in the code
or a problem in the doc. If errors are (supposed to be) possible
I'd like the doc to describe which errors and under which
circumstances.
*** - UNIX error 104 (ECONNRESET): Connection reset by peer
The following restarts are available:
ABORT :R1 Abort main loop
Break 1 HTTP[19]> where
<1/101> #<SYSTEM-FUNCTION SOCKET:SOCKET-STATUS>
[98] EVAL frame for form (SOCKET:SOCKET-STATUS SSS::*SOCKET-STATUS-ARG* TIME)
(BTW, what do the <1/101> and [98] mean here?)
In this case, the timeout argument is non-zero.
I've now put an ignore-errors around this call, which I think will fix
this case, but I also have a number of calls to socket-status with
timeout zero - can those ever cause errors?
|
|
From: Sam S. <sd...@gn...> - 2008-12-22 19:18:26
|
Don Cohen wrote:
> I've recently noticed that socket-status can signal an error.
> Actually, this has been happening for a long time, but I only
> recently got around to debugging it. I don't see anything about
> this in the doc, so this seems like either a problem in the code
> or a problem in the doc. If errors are (supposed to be) possible
> I'd like the doc to describe which errors and under which
> circumstances.
I don't think socket-status should signal errors.
> *** - UNIX error 104 (ECONNRESET): Connection reset by peer
> The following restarts are available:
> ABORT :R1 Abort main loop
> Break 1 HTTP[19]> where
> <1/101> #<SYSTEM-FUNCTION SOCKET:SOCKET-STATUS>
> [98] EVAL frame for form (SOCKET:SOCKET-STATUS SSS::*SOCKET-STATUS-ARG* TIME)
>
> (BTW, what do the <1/101> and [98] mean here?)
see print_back_trace & print_stackitem in debug.d
> In this case, the timeout argument is non-zero.
> I've now put an ignore-errors around this call, which I think will fix
> this case, but I also have a number of calls to socket-status with
> timeout zero - can those ever cause errors?
if the socket is dead, many bad things can happen.
if you have a reproducible case, I am interested.
|
|
From: <don...@is...> - 2008-12-22 22:38:22
|
> if you have a reproducible case, I am interested.
It happens regularly, so I can reproduce it in that sense.
If I record the relevant packets then I suppose it should
be possible to reproduce it on demand by some approximation
of replay. Or perhaps it would suffice for a start to simply
describe the packet(s) that cause the error. Shall we start
with that?
|
|
From: Sam S. <sd...@gn...> - 2008-12-22 22:51:28
|
Don Cohen wrote:
> > if you have a reproducible case, I am interested.
> It happens regularly, so I can reproduce it in that sense.
> If I record the relevant packets then I suppose it should
> be possible to reproduce it on demand by some approximation
> of replay. Or perhaps it would suffice for a start to simply
> describe the packet(s) that cause the error. Shall we start
> with that?
I would prefer something like "start clisp, open socket server,
connect to it from another shell using telnet, type '...' to
telnet, kill telnet, type '...' to clisp, observe the error".
|
|
From: <don...@is...> - 2008-12-23 23:05:04
|
Sam Steingold writes:
> Don Cohen wrote:
> > > if you have a reproducible case, I am interested.
> > It happens regularly, so I can reproduce it in that sense.
> > If I record the relevant packets then I suppose it should
> > be possible to reproduce it on demand by some approximation
> > of replay. Or perhaps it would suffice for a start to simply
> > describe the packet(s) that cause the error. Shall we start
> > with that?
> I would prefer something like "start clisp, open socket server,
> connect to it from another shell using telnet, type '...' to
> telnet, kill telnet, type '...' to clisp, observe the error".
I understand, but this may be more difficult to produce, so let's
start with this: The errors seem to come (many times recently) from
packet traces that all look substantially the same:
11:40:18.824273 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
length 78: IP 216.240.130.195.49199 > 64.27.16.100.http: S 2265407414:22654074
14(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 3704081566 0,sackOK,eol>
0x0000: 4500 0040 c003 4000 3f06 cf81 d8f0 82c3 E..@..@.?.......
0x0010: 401b 1064 c02f 0050 8707 5fb6 0000 0000 @..d./.P.._.....
0x0020: b002 ffff 3a2a 0000 0204 05b4 0103 0301 ....:*..........
0x0030: 0101 080a dcc7 cc9e 0000 0000 0402 0000 ................
[tcp syn]
11:40:18.824361 00:30:1b:2c:c9:cf > 00:90:69:8a:f0:5d, ethertype IPv4 (0x0800),
length 74: IP 64.27.16.100.http > 216.240.130.195.49199: S 52171131:52171131(0
) ack 2265407415 win 5792 <mss 1460,sackOK,timestamp 826801476 3704081566,nop,wscale 6>
0x0000: 4500 003c 0000 4000 4006 8e89 401b 1064 E..<..@.@...@..d
0x0010: d8f0 82c3 0050 c02f 031c 117b 8707 5fb7 .....P./...{.._.
0x0020: a012 16a0 f155 0000 0204 05b4 0402 080a .....U..........
0x0030: 3147 fd44 dcc7 cc9e 0103 0306 1G.D........
[tcp synack]
11:40:18.824602 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
length 66: IP 216.240.130.195.49199 > 64.27.16.100.http: . ack 1 win 33304 <nop,nop,timestamp 3704081566 826801476>
0x0000: 4500 0034 c012 4000 3f06 cf7e d8f0 82c3 E..4..@.?..~....
0x0010: 401b 1064 c02f 0050 8707 5fb7 031c 117c @..d./.P.._....|
0x0020: 8010 8218 b4a8 0000 0101 080a dcc7 cc9e ................
0x0030: 3147 fd44 1G.D
[ack]
11:40:18.825434 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
length 60: IP 216.240.130.195.49199 > 64.27.16.100.http: R 1:1(0) ack 1 win 33304
0x0000: 4500 0028 c037 4000 3f06 cf65 d8f0 82c3 E..(.7@.?..e....
0x0010: 401b 1064 c02f 0050 8707 5fb7 031c 117c @..d./.P.._....|
0x0020: 5014 8218 c5ae 0000 c500 0000 0000 P.............
[reset]
Evidently that results in something filtering up to the
socket-status. In fact, it could well be dependent on my OS version,
in this case, Fedora Core 4 (2.6.17-1.2142_FC4).
This sequence could reasonably be interpreted as a tcp stream being
opened and then closed.
I've now added some debug output before calling socket-status and
afterward in the case of an error (which is now caught). Below shows
it called twice with timeout 1000. The first time it gets an error 11
min. later, the second time immediately (at least within one second):
(:|HTTP| :|SOCKET-STATUS| 1000. "12/23/2008 13:04:06")
(:|HTTP| :|SOCKET-STATUS-ERR| "12/23/2008 13:15:19")
(:|HTTP| :|SOCKET-STATUS| 1000. "12/23/2008 13:15:19")
(:|HTTP| :|SOCKET-STATUS-ERR| "12/23/2008 13:15:19")
After that it got another error at the same time (within the same
second), and then was called again without an immediate error.
There was only one sequence of packets similar to above at 13:15:19.
Perhaps the next question should be what OS call (select?) is being
used to do the socket-status and what errors are advertised as being
possible from that.
My man page says
ERRORS
EBADF An invalid file descriptor was given in one of the sets.
should always occur immediately
EINTR A non blocked signal was caught.
this includes what? Might a reset do it?
EINVAL n is negative or the value contained within timeout is invalid.
should always occur immediately
ENOMEM select was unable to allocate memory for internal tables.
should always occur immediately ?
> (BTW, what do the <1/101> and [98] mean here?)
see print_back_trace & print_stackitem in debug.d
I see where print_back_trace could print <1> where 1 is the value of
index, which seems to be incremented in various places.
I don't see where it would be printing <1/101> and I also don't see
where print_stackitem would be printing [98].
|
|
From: <don...@is...> - 2008-12-24 19:11:30
|
I've now seen a real error and the packet sequence looks similar to
what I sent before. My debug code also crashes with this amusing
result:
ignore error
UNIX error 104 (ECONNRESET): Connection reset by peer
*** - PRINT: Despite *PRINT-READABLY*, #<SYSTEM::SIMPLE-OS-ERROR #x20CA5D56>
cannot be printed readably.
> how about you compile clisp with DEBUG_OS_ERROR defined - this way
(defined where?)
> you will see which line in which file has signaled the error. or,
> better yet, configure --with-debug and run clisp under gdb, setting
$ ./configure --with-module=rawsock --cbc --with-debug build-dir
seems to work
$ gdb ../build-dir/clisp
...
warning: not using untrusted file ".gdbinit"
(I had hoped to use the one in src. so I did cd src before gdb.
I guess that didn't work.)
(gdb) break prepare_error
Function "prepare_error" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
I hope that's ok.
Breakpoint 1 (prepare_error) pending.
(gdb) run
Starting program: /root/clisp-2.47/build-dir/clisp
Reading symbols from shared object read from target memory...done.
Loaded system supplied DSO at 0x6dc000
STACK depth: 98206 [0xaf3f00 0xa94088]
I can't tell whether the break is now in effect.
...
[1]> (load "/home/devel/http-forward-test")
...
[../src/stream.d:6143]
ignore error
UNIX error 104 (ECONNRESET): Connection reset by peer
I guess above is the line number of an error that was caught and
ignored -- which is good, since I don't want to have to continue
on every intentionally ignored error that arrives before the one
of interest.
So now I wait for a break, I guess, and then do something like
(gdb) bt
if I understand correctly?
> a break in prepare_error and send the backtrace here. I suspect
> that the error is signaled by listen_char() which is called by
> socket-status to ensure that a whole unicode char is actually
> available if a byte is.
> the code is:
>
> if (FD_ISSET(in_sock,readfds) || (stream_isbuffered(sock) & bit(1)))
> rd = (char_p ? listen_char(sock) : listen_byte(sock));
>
> there is no error on in_sock, that has been checked, so, apparently, there is a
> race condition here: an error (ECONNRESET) arrives _after_ select() but
> _before_ listen_char() could finish its work.
This seems very plausible. I see no obvious difference between many
packet traces that don't cause the error and the few that do.
The timestamps reported by tcpdump show time to the microsec, and I
see many cases where the time between the ack creating the connection
and the reset closing it are ~1ms with no error, one of ~.2 sec (no
error), the one with the error:
08:35:20.354589 for the ack packet opening the connection
08:35:20.359225 for the reset packet closing it
i.e., 5ms.
|
|
From: <don...@is...> - 2008-12-29 22:08:28
|
Sam Steingold writes:
> We have a stream which will signal ECONNRESET on read.
> What should we return from SOCKET-STATUS?
> The doc seems to imply :ERROR,
How about whatever it returns for a closed stream?
I guess that's also an error.
> but select() (which we advertise to interface to) returns the FD as
It seems odd that select returns readable when a read will cause an
error.
> readable. Also, what should we return from LISTEN? CLHS says: On
> a non-interactive input-stream, listen returns true except when at
> end of file.
In that case I think the socket stream should be viewed as interactive.
It is certainly the case that there will be times when more data will
be available later but is not yet.
> What does ECONNRESET mean?
Evidently that the connection has been reset.
> SUS says that the peer did a shutdown(), so this is a kind of an EOF.
(SUS ?)
Or perhaps a form of error.
I think of it as analogous to a disk error when you read a file.
> but why then select says that it is readable?
I agree with you on that, but I guess there's not much we can do about
it.
> At any rate, I am tempted to treat this as an EOF (not least
> because it seems easiest to fit this condition into the trichotomy
> of ls_avail/ls_eof/ls_wait.
The easiest alternative I see is to simply document that
socket-status may signal an error, and that it has been known to do
so under the following circumstances...
> > BTW, as part of my search I started to suspect the checksum routines
> > and I did find a few small problems with them. More on that later...
> waiting.
I noticed first that ipcsum returns a value of the checksum with the
bytes swapped. Perhaps this is not an error since I saw no doc on
what value it should return. But the tcp checksum returns the
checksum without the bytes swapped.
I also thought I'd just compare the checksums of incoming packets, on
the assumption that they are probably correct, with those computed by
clisp code:
(defmacro 16bits(buffer index)
  `(+ (aref ,buffer (1+ ,index))
      (ash (aref ,buffer ,index) 8)))
(defun checkpktsums()
  (loop with i = 0 do
    (setf (fill-pointer default-buffer) 1518)
    (setf len (rawsock:recvfrom default-socket default-buffer default-device))
    (when (and (>= len 52)
               (= (aref default-buffer 12) 8) ;; ip
               (= (aref default-buffer 13) 0)
               (= (aref default-buffer 14) 69) ;; ipv4, len 5
               (= (aref default-buffer 23) 6) ;; tcp
               )
      (setf oldcsum (16bits default-buffer (+ 14 10)))
      (rawsock:ipcsum default-buffer)
      (setf fredcsum (16bits default-buffer (+ 14 10)))
      (unless (= oldcsum fredcsum )
        (break "ip sums differ ~a ~a" oldcsum fredcsum))
      (setf oldcsum (16bits default-buffer
                            (+ 14 16 (ash (logand (aref default-buffer 14) 15) 2))))
      (rawsock:tcpcsum default-buffer)
      (setf fredcsum (16bits default-buffer
                             (+ 14 16 (ash (logand (aref default-buffer 14) 15) 2))))
      (unless (= oldcsum fredcsum )
        (break "tcp sums differ ~a ~a" oldcsum fredcsum))
      (when (= 0 (mod (incf i) 100))(princ "*")))))
#|
[216]> (CHECKPKTSUMS)
*************************************************************************************************************************************************************************************************************************************************************** - Continuable Error
tcp sums differ FFFF 0
|#
This suggests that clisp code is adding the carry bit one time when it
should not. When I looked at the code I thought the IP looked
suspect, like it's swapping bytes, but it does seem to agree with the
incoming packets. Although it might well contain an extra carry add.
The error above was just the first that arrived.
> > Now in another process on the same client machine
> >
> > $ telnet 64.27.16.100 1234
> > Trying 64.27.16.100...
> > Connected to 64.27.16.100 (64.27.16.100).
> > Escape character is '^]'.
>
> I wish this could be folded into reset.lisp.
> it appears that it does matter that waitforpkt is started before
> telnet though.
The telnet could have been simulated entirely from the raw socket.
It was just easier to use some program that was already prepared to
open a connection.
Now that you mention it, the telnet is really no different from
opening a tcp connection from lisp. Or equivalently, lisp could start
a telnet from a shell command. The fact that the wait is running
during that time I think is really not essential cause I think the raw
socket does its own buffering. So you can open the raw socket, then
open the tcp connection, then read from the raw socket. Of course, if
you want to try out multithreading ...
If you were looking for something to add to the test suite, I think
the bigger problem is that you have to be root.
|
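The select()/read() behaviour discussed in the message above can be tried without raw sockets or root. The following is a minimal C sketch, not CLISP code, and it rests on several assumptions: a Linux-like system with a working loopback interface, and a reset forced by closing the client with SO_LINGER set to a zero timeout (a different technique from the rawsock-based reset.lisp elsewhere in this thread). The function name `reset_demo` is made up for illustration.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: open a loopback TCP connection, abort it from the
 * client side, then ask select() and read() about the server side.
 * On success returns 0, sets *readable to 1 if select() reported the
 * socket readable, and sets *read_errno to the errno of the following
 * read() (0 if the read did not fail or was skipped). */
int reset_demo(int *readable, int *read_errno)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct linger lg;
    struct timeval tv;
    fd_set rfds;
    char c;
    int lsock = -1, client = -1, server = -1, rc = -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */

    lsock = socket(AF_INET, SOCK_STREAM, 0);
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (lsock < 0 || client < 0) goto done;
    if (bind(lsock, (struct sockaddr *)&addr, sizeof addr) < 0) goto done;
    if (listen(lsock, 1) < 0) goto done;
    if (getsockname(lsock, (struct sockaddr *)&addr, &alen) < 0) goto done;
    if (connect(client, (struct sockaddr *)&addr, sizeof addr) < 0) goto done;
    server = accept(lsock, NULL, NULL);
    if (server < 0) goto done;

    /* Assumption: zero linger timeout makes close() send RST, not FIN. */
    lg.l_onoff = 1;
    lg.l_linger = 0;
    if (setsockopt(client, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0) goto done;
    close(client);
    client = -1;

    FD_ZERO(&rfds);
    FD_SET(server, &rfds);
    tv.tv_sec = 2;                          /* bounded wait for the reset */
    tv.tv_usec = 0;
    if (select(server + 1, &rfds, NULL, NULL, &tv) < 0) goto done;
    *readable = FD_ISSET(server, &rfds) ? 1 : 0;

    *read_errno = 0;
    if (*readable) {                        /* never block if select timed out */
        errno = 0;
        if (read(server, &c, 1) < 0)
            *read_errno = errno;
    }
    rc = 0;
done:
    if (client >= 0) close(client);
    if (server >= 0) close(server);
    if (lsock >= 0) close(lsock);
    return rc;
}
```

On Linux this is expected to report the socket as readable and then fail the read with ECONNRESET, which is the combination that puzzles both posters. Other systems may behave differently.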
|
From: Sam S. <sd...@gn...> - 2008-12-29 22:56:39
|
Don Cohen wrote:
> I noticed first that ipcsum returns a value of the checksum with the
> bytes swapped. Perhaps this is not an error since I saw no doc on
> what value it should return. But the tcp checksum returns the
> checksum without the bytes swapped.
this is your code.
just submit a patch.
|
|
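As a reference point for the checksum discussion, here is a small C sketch of the ones' complement Internet checksum in the style of RFC 1071. It is not the rawsock implementation, and the function names are invented for this example. It may help with the "FFFF 0" mismatch reported earlier in the thread: in ones' complement arithmetic 0x0000 and 0xFFFF both represent zero, so a receiver-style check accepts either value in the checksum field. Whether that is what happened with the packet in question is an assumption, not something the thread confirms.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of big-endian 16-bit words with end-around carry. */
uint16_t ones_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                            /* odd trailing byte, zero padded */
        sum += (uint32_t)data[i] << 8;
    while (sum >> 16)                       /* fold carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Value to store in the checksum field; the field must be zero in data. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    return (uint16_t)~ones_sum(data, len);
}

/* Receiver-style check: the sum over the data, stored checksum included,
 * must come out as 0xFFFF. */
int inet_checksum_ok(const uint8_t *data, size_t len)
{
    return ones_sum(data, len) == 0xFFFF;
}
```

Run over the 20-byte IP header of the first packet in the trace posted in this thread (4500 0040 c003 4000 3f06 cf81 d8f0 82c3 401b 1064), `inet_checksum_ok` accepts it, and `inet_checksum` with the checksum field zeroed gives 0xCF81, matching the value on the wire.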
From: <don...@is...> - 2008-12-29 23:22:42
|
Sam Steingold writes:
> Don Cohen wrote:
> > (rawsock:socket :inet :packet #+ignore :all #x300)
>
> Is there a symbolic name for #x300?
The code above suggests that I expected it to be :all but that didn't
work. And this matches my recollection. Perhaps it's different in
different systems. I'm trying to reproduce the path to the doc, and I
think it's this:
man socket -> man packet -> sys/if_ether.h
/usr/include/linux/if_ether.h :
#define ETH_P_ALL 0x0003 /* Every packet (be careful!!!) */
And somehow the byte order is then reversed - I might never have
understood why.
|
|
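A likely explanation for the reversed byte order, offered as an assumption rather than something confirmed in the thread: the Linux packet(7) manual page documents the protocol argument of a packet socket as being in network byte order, so C programs normally pass htons(ETH_P_ALL). On a little-endian host such as the i686 machines mentioned in this thread, that turns 0x0003 into 0x0300. A tiny C sketch, with helper names made up for illustration:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Value of ETH_P_ALL as quoted from if_ether.h in the message above. */
#define ETH_P_ALL_VALUE 0x0003

/* Protocol argument as a C program would pass it to socket(AF_PACKET, ...):
 * converted from host to network byte order. */
uint16_t packet_protocol_arg(void)
{
    return htons(ETH_P_ALL_VALUE);
}

/* Hypothetical helper: 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    uint16_t one = 1;
    return *(const unsigned char *)&one == 1;
}
```

On a little-endian host `packet_protocol_arg()` returns 0x0300, the #x300 used in reset.lisp; on a big-endian host it stays 0x0003, which fits the remark that the value may differ between systems.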
From: <don...@is...> - 2008-12-24 01:19:43
|
I just notice that socket-status returns two values, while the doc
only mentions one.
This confused me when I saw the output of
(multiple-value-bind
(val err)
(ignore-errors (ext:socket-status *socket-status-arg* time))
(when err
(http::logform (list :http :socket-status-err
(http::print-current-time nil)
err)) ))
Sam Steingold writes:
> Don Cohen wrote:
> > > if you have a reproducible case, I am interested.
> > It happens regularly, so I can reproduce it in that sense.
> > If I record the relevant packets then I suppose it should
> > be possible to reproduce it on demand by some approximation
> > of replay. Or perhaps it would suffice for a start to simply
> > describe the packet(s) that cause the error. Shall we start
> > with that?
>
> I would prefer something like "start clisp, open socket server, connect to it
> from another shell using telnet, type '...' to telnet, kill telnet, type '...'
> to clisp, observe the error".
|
|
From: Sam S. <sd...@gn...> - 2008-12-24 14:36:45
|
Don Cohen wrote:
> I just notice that socket-status returns two values, while the doc
> only mentions one.
http://clisp.cons.org/impnotes/socket.html#so-status
The second value returned is the number of objects with non-NIL status, i.e.,
“actionable” objects. SOCKET:SOCKET-STATUS returns either due to a timeout or
when this number is positive, i.e., if the timeout was NIL and
SOCKET:SOCKET-STATUS did return, then the second value is positive (this is the
reason NIL is not treated as an empty LIST, but as an invalid argument).
|
|
From: <don...@is...> - 2008-12-24 04:52:46
|
I just notice that socket-status returns two values, while the doc
only mentions one.
This confused me when I saw the output of
(multiple-value-bind
(val err)
(ignore-errors (ext:socket-status *socket-status-arg* time))
(when err
(http::logform (list :http :socket-status-err
(http::print-current-time nil)
err)) ))
This means that you should ignore my earlier message about the
resets. Those were just things that were causing socket-status
to return a second value! I'll report back when I see a second
value from ignore-errors that's actually an error.
|
|
From: Sam S. <sd...@gn...> - 2008-12-24 15:17:23
|
Don Cohen wrote:
> Sam Steingold writes:
> > Don Cohen wrote:
> > > > if you have a reproducible case, I am interested.
> > > It happens regularly, so I can reproduce it in that sense.
> > > If I record the relevant packets then I suppose it should
> > > be possible to reproduce it on demand by some approximation
> > > of replay. Or perhaps it would suffice for a start to simply
> > > describe the packet(s) that cause the error. Shall we start
> > > with that?
>
> > I would prefer something like "start clisp, open socket server,
> > connect to it from another shell using telnet, type '...' to
> > telnet, kill telnet, type '...' to clisp, observe the error".
>
> I understand, but this may be more difficult to produce, so let's
how about you compile clisp with DEBUG_OS_ERROR defined - this way you will see
which line in which file has signaled the error.
or, better yet, configure --with-debug and run clisp under gdb, setting a break
in prepare_error and send the backtrace here.
I suspect that the error is signaled by listen_char() which is called by
socket-status to ensure that a whole unicode char is actually available if a
byte is.
the code is:
if (FD_ISSET(in_sock,readfds) || (stream_isbuffered(sock) & bit(1)))
rd = (char_p ? listen_char(sock) : listen_byte(sock));
there is no error on in_sock, that has been checked, so, apparently, there is a
race condition here: an error (ECONNRESET) arrives _after_ select() but
_before_ listen_char() could finish its work.
> start with this: The errors seem to come (many times recently) from
> packet traces that all look substantially the same:
>
> 11:40:18.824273 00:90:69:8a:f0:5d > 00:30:1b:2c:c9:cf, ethertype IPv4 (0x0800),
> length 78: IP 216.240.130.195.49199 > 64.27.16.100.http: S 2265407414:22654074
> 14(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 3704081566 0,sackOK,eol>
> 0x0000: 4500 0040 c003 4000 3f06 cf81 d8f0 82c3 E..@..@.?.......
> 0x0010: 401b 1064 c02f 0050 8707 5fb6 0000 0000 @..d./.P.._.....
> 0x0020: b002 ffff 3a2a 0000 0204 05b4 0103 0301 ....:*..........
> 0x0030: 0101 080a dcc7 cc9e 0000 0000 0402 0000 ................
> [tcp syn]
sorry, this looks like total gibberish to me.
I admire people who understand these hex codes even more than I admire people
who can read assembly however.
|
|
From: Sam S. <sd...@gn...> - 2008-12-24 20:17:41
|
Don Cohen wrote:
> I've now seen a real error and the packet sequence looks similar to
> what I sent before. My debug code also crashes with this amusing
> result:
> ignore error
> UNIX error 104 (ECONNRESET): Connection reset by peer
> *** - PRINT: Despite *PRINT-READABLY*, #<SYSTEM::SIMPLE-OS-ERROR #x20CA5D56>
> cannot be printed readably.
what did you expect?
the code you posted prints the second return value of ignore-errors (i.e., the
error object) readably.
> > how about you compile clisp with DEBUG_OS_ERROR defined - this way
> (defined where?)
CFLAGS in Makefile or directly in lispbibl.d
> > you will see which line in which file has signaled the error. or,
> > better yet, configure --with-debug and run clisp under gdb, setting
>
> $ ./configure --with-module=rawsock --cbc --with-debug build-dir
> seems to work
>
> $ gdb ../build-dir/clisp
> ...
> warning: not using untrusted file ".gdbinit"
> (I had hoped to use the one in src. so I did cd src before gdb.
> I guess that didn't work.)
src/.gdbinit is copied to the build directory by the build process.
see http://clisp.cons.org/impnotes/faq.html#faq-debug
also, do not forget to pass "CFLAGS=''" to configure to avoid the
"-g -O2" idiocy.
> (gdb) break prepare_error
not needed if you follow the FAQ.
> [../src/stream.d:6143]
> ignore error
> UNIX error 104 (ECONNRESET): Connection reset by peer
> I guess above is the line number of an error that was caught and
> ignored -- which is good, since I don't want to have to continue
> on every intentionally ignored error that arrives before the one
> of interest.
yes, except that there is no errors on this line in cvs head.
please use cvs head.
> So now I wait for a break, I guess, and then do something like
> (gdb) bt
> if I understand correctly?
yes.
you might also want to examine the relevant local variables.
use "xout" and "zout" for lisp objects.
> > a break in prepare_error and send the backtrace here. I suspect
> > that the error is signaled by listen_char() which is called by
> > socket-status to ensure that a whole unicode char is actually
> > available if a byte is.
>
> > the code is:
> >
> > if (FD_ISSET(in_sock,readfds) || (stream_isbuffered(sock) & bit(1)))
> > rd = (char_p ? listen_char(sock) : listen_byte(sock));
> >
> > there is no error on in_sock, that has been checked, so, apparently, there is a
> > race condition here: an error (ECONNRESET) arrives _after_ select() but
> > _before_ listen_char() could finish its work.
>
> This seems very plausible. I see no obvious difference between many
> packet traces that don't cause the error and the few that do.
>
> The timestamps reported by tcpdump show time to the microsec, and I
> see many cases where the time between the ack creating the connection
> and the reset closing it are ~1ms with no error, one of ~.2 sec (no
> error), the one with the error:
> 08:35:20.354589 for the ack packet opening the connection
> 08:35:20.359225 for the reset packet closing it
> i.e., 5ms.
alas, if this conjecture were true, it would have been easy to reproduce:
connect to a clisp running under gdb, step through to the code above, kill
telnet before listen_char so that the connection is reset, then continue in gdb.
alas, no error is raised.
|
|
From: <don...@is...> - 2008-12-24 20:37:05
|
Sam Steingold writes:
> > $ ./configure --with-module=rawsock --cbc --with-debug build-dir
> > seems to work
> >
> > $ gdb ../build-dir/clisp
> > ...
> > warning: not using untrusted file ".gdbinit"
> > (I had hoped to use the one in src. so I did cd src before gdb.
> > I guess that didn't work.)
>
> src/.gdbinit is copied to the build directory by the build process.
> see
> http://clisp.cons.org/impnotes/faq.html#faq-debug
would've been worth while to mention that before
> also, do not forget to pass "CFLAGS=''" to configure to avoid the
> "-g -O2" idiocy.
The #faq-debug suggests this isn't needed.
In any case, too late now for this run.
> > (gdb) break prepare_error
> not needed if you follow the FAQ.
I was wondering whether this would also set other break points that I
would not want.
> > [../src/stream.d:6143]
> > ignore error
> > UNIX error 104 (ECONNRESET): Connection reset by peer
> > I guess above is the line number of an error that was caught and
> > ignored -- which is good, since I don't want to have to continue
> > on every intentionally ignored error that arrives before the one
> > of interest.
>
> yes, except that there is no errors on this line in cvs head.
> please use cvs head.
Again too late now. I hope 2.47 will suffice.
This is all on a live server where I don't plan to use cvs head.
> > So now I wait for a break, I guess, and then do something like
> > (gdb) bt
> > if I understand correctly?
>
> yes.
> you might also want to examine the relevant local variables.
> use "xout" and "zout" for lisp objects.
If you can tell me exactly what to type when it breaks, I'll do that.
I'd rather not leave it in the break waiting for further instructions.
> alas, if this conjecture were true, it would have been easy to reproduce:
> connect to a clisp running under gdb, step through to the code above, kill
> telnet before listen_char so that the connection is reset, then continue in gdb.
> alas, no error is raised.
Just killing telnet probably does not result in a reset.
So I think there's still hope.
|
|
From: Sam S. <sd...@gn...> - 2008-12-25 04:37:59
|
> * Don Cohen <qba...@vf...> [2008-12-24 12:37:07 -0800]:
>
> Sam Steingold writes:
> > http://clisp.cons.org/impnotes/faq.html#faq-debug
> would've been worth while to mention that before
you have been using clisp for many many years.
maybe it would be a good idea for you to spend 5 minutes skimming over
the FAQ.
> > also, do not forget to pass "CFLAGS=''" to configure to avoid the
> > "-g -O2" idiocy.
> The #faq-debug suggests this isn't needed.
I now modified makemake in the cvs head, it should not be.
> > > (gdb) break prepare_error
> > not needed if you follow the FAQ.
> I was wondering whether this would also set other break points that I
> would not want.
just disable them when you hit them.
"info break" in gdb.
> > alas, if this conjecture were true, it would have been easy to
> > reproduce: connect to a clisp running under gdb, step through to
> > the code above, kill telnet before listen_char so that the
> > connection is reset, then continue in gdb. alas, no error is
> > raised.
>
> Just killing telnet probably does not result in a reset.
what does?
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 8.10 (intrepid)
http://mideasttruth.com http://jihadwatch.org http://honestreporting.com
http://ffii.org http://iris.org.il http://thereligionofpeace.com
Daddy, what does "format disk c: complete" mean?
|
|
From: <don...@is...> - 2008-12-25 07:01:39
|
Sam Steingold writes:
> you have been using clisp for many many years.
> maybe it would be a good idea for you to spend 5 minutes skimming over
> the FAQ.
I have many times. Even more than 5 min. But many times may be once
every year or two. And things I do very rarely, like c level
debugging, tend not to recall to mind such relevant instructions.
> > Just killing telnet probably does not result in a reset.
> what does?
A number of things could be done to get the reset but none are
particularly convenient for you.
Start by testing whether you get it by running tcpdump.
If you're opening the listener and telnetting to it from the same
machine then you want something like (probably have to be root)
tcpdump -i lo &
which, when you do
telnet localhost 1234
will generate a few lines like this:
$ tcpdump: listening on lo
22:33:40.700926 localhost.localdomain.32890 > localhost.localdomain.1234:
[time] [ip.port sender] [ip.port receiver]
S 1036002160:1036002160(0) win 32767
[S means SYN, the rest is tcp detail]
<mss 16396,sackOK,timestamp 379067074 0,nop,wscale 0> (DF) [tos 0x10]
22:33:40.700949 localhost.localdomain.1234 > localhost.localdomain.32890:
[this is the reply if there's nobody listening on port 1234]
R 0:0(0) ack 1036002161 win 0 (DF) [tos 0x10]
[R means reset - cause nobody is listening for such a packet]
If someone IS listening you get, typically, 3 packets,
S from client
S from server and an
ACK from client (no S, no R)
Now kill telnet and see whether you get any more packets.
I expect not.
The way to get a reset would be to either manufacture it on your own
or, perhaps a little easier, to manufacture another packet that causes
the system that used to run the client to generate the reset.
There's something called spak for generating and sending packets.
I also have code for doing it from lisp if you have rawsock.
Let me know if you want to go down that path.
In the mean while I've seen lots of connections reset by peer but
still waiting for one that is not caught.
|
|
From: Sam S. <sd...@gn...> - 2008-12-25 15:52:40
|
> * Don Cohen <qba...@vf...> [2008-12-24 23:01:45 -0800]:
>
> Sam Steingold writes:
> > you have been using clisp for many many years.
> > maybe it would be a good idea for you to spend 5 minutes skimming over
> > the FAQ.
> I have many times. Even more than 5 min. But many times may be once
> every year or two. And things I do very rarely, like c level
> debugging, tend not to recall to mind such relevant instructions.
of course, I did not expect you to remember the C debugging
instructions, but I thought you might remember that FAQ has them.
> I also have code for doing it from lisp if you have rawsock.
> Let me know if you want to go down that path.
yes, please.
I want the code that would let me debug this without an external
telnet. i.e., clisp should open the connection, kill it, and send a
reset.
thanks
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 8.10 (intrepid)
http://camera.org http://memri.org http://openvotingconsortium.org
http://dhimmi.com http://pmw.org.il http://palestinefacts.org
Failure is not an option. It comes bundled with your Microsoft product.
|
|
From: <don...@is...> - 2008-12-29 18:02:47
|
Sam Steingold writes:
 > > I also have code for doing it from lisp if you have rawsock.
 > > Let me know if you want to go down that path.
 > yes, please.
 > I want the code that would let me debug this without an external
 > telnet. i.e., clisp should open the connection, kill it, and send a
 > reset.

I'm trying to make this as simple as possible, but you'll see it's
really not there. I actually spent way too much time on this trying
to figure out what I was doing wrong before I discovered that it
behaved as I expected over the net but not for communication within
one machine (localhost). I still don't understand why that is, but
I suggest you do the testing on two different machines as described
below. It could be related to the OS version, in this case both
machines are FC4:
Linux don-eve.dyndns.org 2.6.17-1.2142_FC4 #1 Tue Jul 11 22:41:14 EDT 2006 i686 i686 i386 GNU/Linux

BTW, as part of my search I started to suspect the checksum routines
and I did find a few small problems with them. More on that later...

Anyhow, here's my demo:
The code I'm sending uses rawsock, so of course you need a clisp built
with rawsock. And you have to be root to use it.

On the server machine (where you'll be adding gdb to what I do):
run lisp
[1]> (setf ss (socket:socket-server 1234 :interface "64.27.16.100"))
  64.27.16.100 => ip address of your server, not 127.0.0.1
  Same substitution applies below.
#<SOCKET-SERVER 64.27.16.100:1234>
[2]> (setf str (socket:socket-accept ss))

on the client machine as root run clisp with rawsock
(load "reset.lisp") ;; provided below
(waitforpkt '(64 27 16 100) 1234)
This is likely to print some *'s to show you that it sees packets
other than what it's waiting for.

Now in another process on the same client machine
$ telnet 64.27.16.100 1234
Trying 64.27.16.100...
Connected to 64.27.16.100 (64.27.16.100).
Escape character is '^]'.

This should cause the other two lisp processes that are waiting
for relevant packets to return.
server:
[3]>
client:
...
*****
T
[2]>

now you can send the reset from the client lisp process:
[2]> (reset)

and now on the server I get this
[3]> (socket:socket-status str)

[../src/stream.d:6143]
*** - UNIX error 104 (ECONNRESET): Connection reset by peer
The following restarts are available:
ABORT :R1 Abort main loop
Break 1 [4]>

By the way, at this point you can now return to the telnet and simply
type enter to get this:
Connection closed by foreign host.

==== reset.lisp
(defvar default-socket (rawsock:socket :inet :packet #+ignore :all #x300))
(unless (> default-socket 0) (error "socket failed - running as root?"))
(defvar default-buffer
  (make-array 1518 :element-type '(unsigned-byte 8) :fill-pointer 100))
(defvar default-device
  (rawsock:make-sockaddr
   :packet (make-array 14 :element-type '(unsigned-byte 8))))

(defun waitforpkt(serverip serverport &key (show 100))
  (loop for i from 0 with len do
    (setf (fill-pointer default-buffer) 1518)
    (setf len (rawsock:recvfrom default-socket default-buffer default-device))
    (when (= 0 (mod i show))(princ "*"))
    (when (and (>= len 52)
               (= (aref default-buffer 12) 8) ;; ip
               (= (aref default-buffer 13) 0)
               (= (aref default-buffer 14) 69) ;; ipv4, len 5
               (= (aref default-buffer 23) 6) ;; tcp
               (loop for j from 0 as a in serverip always
                 (= a (aref default-buffer (+ j 30)))) ;; correct ip addr
               (= serverport (+ (aref default-buffer 37)
                                (* 256 (aref default-buffer 36))))
               (= 16 (aref default-buffer 47)))
      (return t))))

(defun reset()
  ;; change flags from ack to rst+ack
  (setf (aref default-buffer 47) 20)
  (setf (aref default-buffer 46) 80) ;tcp header length = 5 words
  ;; set buffer length
  (setf (fill-pointer default-buffer) 54) ;; 60 ??
  (setf (aref default-buffer 17) 40)
  (rawsock:tcpcsum default-buffer) ;; not returning right answer?
  (setf ipc (rawsock:ipcsum default-buffer)) ;; returns bytes reversed?
  (rawsock:sendto default-socket default-buffer default-device))

#|
(waitforpkt '(64 27 16 100) 1234)
(reset)
|#
|
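For readers following the byte indices in reset.lisp: they correspond to a 14-byte Ethernet II header, a 20-byte IPv4 header without options, and the start of the TCP header. The C sketch below names those offsets and mirrors the two steps (match the client's ACK, then turn it into RST+ACK). It is only an illustration under those framing assumptions; the names `frame_matches` and `make_rst_ack` are hypothetical and not part of rawsock, and the checksums would still have to be recomputed before sending.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets into an Ethernet II frame carrying IPv4 (IHL = 5) and TCP.
   Illustrative names only; assumes no VLAN tag and no IP options. */
enum {
    OFF_ETHERTYPE = 12,  /* 2 bytes, 0x0800 = IPv4 */
    OFF_IP_VHL    = 14,  /* 0x45 = version 4, header length 5 words */
    OFF_IP_TOTLEN = 16,  /* 2 bytes, big-endian */
    OFF_IP_PROTO  = 23,  /* 6 = TCP */
    OFF_IP_DST    = 30,  /* 4 bytes */
    OFF_TCP_DPORT = 36,  /* 2 bytes, big-endian */
    OFF_TCP_DOFF  = 46,  /* high nibble = TCP header length in words */
    OFF_TCP_FLAGS = 47,
    FRAME_LEN_RST = 54   /* 14 (Ethernet) + 20 (IP) + 20 (TCP) */
};
#define TCP_FLAG_RST 0x04
#define TCP_FLAG_ACK 0x10

/* Same test as the Lisp predicate: a bare ACK sent to ip:port. */
static int frame_matches(const uint8_t *f, size_t len,
                         const uint8_t ip[4], uint16_t port) {
    return len >= 52   /* same bound as the Lisp code */
        && f[OFF_ETHERTYPE] == 0x08 && f[OFF_ETHERTYPE + 1] == 0x00
        && f[OFF_IP_VHL] == 0x45
        && f[OFF_IP_PROTO] == 6
        && memcmp(f + OFF_IP_DST, ip, 4) == 0
        && ((f[OFF_TCP_DPORT] << 8) | f[OFF_TCP_DPORT + 1]) == port
        && f[OFF_TCP_FLAGS] == TCP_FLAG_ACK;
}

/* Rewrite a captured ACK in place as RST+ACK and return the new frame
   length. Unlike the Lisp code, this also clears the high byte of the
   IP total length. Checksums are NOT updated here. */
static size_t make_rst_ack(uint8_t *f) {
    assert(f != NULL);
    f[OFF_TCP_FLAGS] = TCP_FLAG_RST | TCP_FLAG_ACK;  /* 0x14 = 20 */
    f[OFF_TCP_DOFF] = 5 << 4;                        /* 0x50 = 80 */
    f[OFF_IP_TOTLEN] = 0;
    f[OFF_IP_TOTLEN + 1] = 40;                       /* 20 IP + 20 TCP */
    return FRAME_LEN_RST;
}
```

Because the sequence and acknowledgement numbers are left untouched, the rewritten frame reuses the values of the captured ACK, which is what makes the reset acceptable to the server.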
|
From: Sam S. <sd...@gn...> - 2008-12-30 04:14:14
|
> * Bruno Haible <oe...@py...> [2008-12-30 02:33:31 +0100]:
>
>> Let us start with what we should do at the high level.
>> We have a stream which will signal ECONNRESET on read.
>
> I would vote for signaling a STREAM-ERROR condition - so that the
> program gets alerted about the abrupt termination of the socket - and
> at the same time set the stream to a state equivalent to EOF - because
> ECONNRESET is not a transient error condition.

cool! this is what we do now - OS_filestream_error.
no changes needed.

however, I think returning :ERROR from socket-status makes more sense.
(although this is certainly a step away from a simple select()).

however, doing that is nontrivial from a coding POV:
we would need to change low_fill_buffered_handle not to call
OS_filestream_error but return a status instead &c &c.
- is persev==persev_immediate.
maybe it will even have to accept a separate argument because
non-listen calls might also use persev_immediate...

BTW, BufferedStreamLow_fill can be either low_fill_buffered_handle or
low_fill_buffered_socket, and the latter does not check the sock_read's
return value the way the former checks fd_read's return value...

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 8.10 (intrepid)
http://dhimmi.com http://camera.org http://iris.org.il
http://mideasttruth.com http://pmw.org.il http://palestinefacts.org
http://openvotingconsortium.org
Do not tell me what to do and I will not tell you where to go.
|
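The idea of returning a status instead of signaling can be illustrated outside CLISP. The following is a minimal C sketch, not CLISP's code: `sock_state` and `probe_socket` are hypothetical names, and `MSG_DONTWAIT` is assumed to be available (it is on Linux and the BSDs). It peeks at a socket without blocking and reports one of four outcomes, so that an error such as ECONNRESET becomes a return value the caller can map to something like :ERROR.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical four-way result. The thread mentions the three listen
   states ls_avail/ls_eof/ls_wait; ST_ERROR is the extra case. */
typedef enum { ST_AVAIL, ST_EOF, ST_WAIT, ST_ERROR } sock_state;

/* Peek at fd without blocking and without consuming data.
   On ST_ERROR, *err (if non-NULL) receives the errno value. */
static sock_state probe_socket(int fd, int *err) {
    char byte;
    ssize_t n;
    do {
        n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted */
    if (n > 0)
        return ST_AVAIL;                 /* at least one byte is waiting */
    if (n == 0)
        return ST_EOF;                   /* orderly shutdown by the peer */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return ST_WAIT;                  /* nothing yet, would block */
    if (err)
        *err = errno;                    /* e.g. ECONNRESET */
    return ST_ERROR;
}
```

A caller in the style of socket-status would then translate `ST_ERROR` into its own error keyword rather than unwinding through an error handler.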
|
From: Sam S. <sd...@gn...> - 2008-12-31 17:00:26
|
Sam Steingold wrote:
>> * Bruno Haible <oe...@py...> [2008-12-30 02:33:31 +0100]:
>>
>>> Let us start with what we should do at the high level.
>>> We have a stream which will signal ECONNRESET on read.
>> I would vote for signaling a STREAM-ERROR condition - so that the
>> program gets alerted about the abrupt termination of the socket - and
>> at the same time set the stream to a state equivalent to EOF - because
>> ECONNRESET is not a transient error condition.
>
> cool! this is what we do now - OS_filestream_error.
> no changes needed.

alas, OS_filestream_error signaled a simple OS_error for non-file streams.
this has now been fixed in the CVS.
please take a look at
http://clisp.podval.org/impnotes/socket.html#so-status
(the penultimate sentence documents the possible error signaling by
SOCKET-STATUS).
|
|
From: Sam S. <sd...@gn...> - 2008-12-29 20:40:10
|
Don Cohen wrote:
> Sam Steingold writes:
> > > I also have code for doing it from lisp if you have rawsock.
> > > Let me know if you want to go down that path.
> > yes, please.
> > I want the code that would let me debug this without an external
> > telnet. i.e., clisp should open the connection, kill it, and send a
> > reset.
> I'm trying to make this as simple as possible, but you'll see it's
> really not there. I actually spent way too much time on this trying
> to figure out what I was doing wrong before I discovered that it
> behaved as I expected over the net but not for communication within
> one machine (localhost). I still don't understand why that is, but
> I suggest you do the testing on two different machines as described
> below.

I followed your instructions and I did get the error.
I see what is going on, but I see no obvious way to fix this.

Let us start with what we should do at the high level.
We have a stream which will signal ECONNRESET on read.
What should we return from SOCKET-STATUS?
The doc seems to imply :ERROR, but select() (which we advertise to
interface to) returns the FD as readable.
Also, what should we return from LISTEN?
CLHS says:
  On a non-interactive input-stream, listen returns true except when at
  end of file.

What does ECONNRESET mean?
SUS says that the peer did a shutdown(), so this is a kind of an EOF.
but why then select says that it is readable?

At any rate, I am tempted to treat this as an EOF (not least because it seems
easiest to fit this condition into the trichotomy of ls_avail/ls_eof/ls_wait.
especially given that on win32 we already treat WSAESHUTDOWN as eof.

> BTW, as part of my search I started to suspect the checksum routines
> and I did find a few small problems with them. More on that later...

waiting.

> Now in another process on the same client machine
>
> $ telnet 64.27.16.100 1234
> Trying 64.27.16.100...
> Connected to 64.27.16.100 (64.27.16.100).
> Escape character is '^]'.

I wish this could be folded into reset.lisp.
it appears that it does matter that waitforpkt is started before telnet though.
|
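On the wish for a self-contained reproduction: a different technique from the rawsock packet injection used in this thread is to have the client abort its own connection. Setting SO_LINGER with a zero linger time before close() makes common TCP stacks send RST instead of FIN. The C sketch below assumes Linux-like behaviour and a usable loopback interface; it has not been checked against the localhost problem described above, and `provoke_reset` is a hypothetical name.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the errno seen by read() on the accepted socket after the
   client aborts the connection, 0 if read() did not fail, or -1 if
   the setup itself failed. */
static int provoke_reset(void) {
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct linger lg;
    struct pollfd pfd;
    char buf[16];
    int result = -1;
    int ssock = -1;
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    int csock = socket(AF_INET, SOCK_STREAM, 0);

    if (lsock < 0 || csock < 0) goto out;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* kernel picks a free port */
    if (bind(lsock, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    if (listen(lsock, 1) < 0) goto out;
    if (getsockname(lsock, (struct sockaddr *)&addr, &alen) < 0) goto out;
    if (connect(csock, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    ssock = accept(lsock, NULL, NULL);
    if (ssock < 0) goto out;

    /* Zero linger time: close() aborts the connection (RST, not FIN). */
    lg.l_onoff = 1;
    lg.l_linger = 0;
    if (setsockopt(csock, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0) goto out;
    close(csock);
    csock = -1;

    /* Wait until the server side notices, then try to read. */
    pfd.fd = ssock;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 2000) <= 0) goto out;
    result = (read(ssock, buf, sizeof buf) < 0) ? errno : 0;
out:
    if (csock >= 0) close(csock);
    if (ssock >= 0) close(ssock);
    if (lsock >= 0) close(lsock);
    return result;
}
```

On a system that behaves as assumed, `provoke_reset()` returns `ECONNRESET`, which is the same condition that socket-status ran into on the server side.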
|
From: Bruno H. <br...@cl...> - 2008-12-30 01:58:44
|
Hi Sam, Don,

> What does ECONNRESET mean?
> SUS says that the peer did a shutdown(), so this is a kind of an EOF.

SUS (I think you mean SUSv3, which is the same as POSIX:2001) is no longer
the current standard. The current one is POSIX:2008, also known as
"Base Specifications Issue 7" and available online at
<http://www.opengroup.org/onlinepubs/9699919799/>

About ECONNRESET it says
  - for read():
    "A read was attempted on a socket and the connection was forcibly
     closed by its peer."
  - for write():
    "A write was attempted on a socket that is not connected."
whereas EPIPE occurs when
    "A write was attempted on a socket that is shut down for writing, or
     is no longer connected."

Also see <http://www.wlug.org.nz/ECONNRESET>:
    "This usually means that the program on the other end has crashed, or
     closed the socket unexpectedly."

I think what this means is that the normal way of shutdown of a socket
connection - when the peer does a close() or shutdown() call - causes
read() to return 0 and write() to fail with SIGPIPE or EPIPE. Whereas
a ECONNRESET indicates a more severe kind of shutdown.

> but why then select says that it is readable?

This is normal. select() does not distinguish between "data available"
and "no more data available". It distinguishes between "data or EOF
available now" and "I/O would block; you must wait until you can get
data or EOF".

> Let us start with what we should do at the high level.
> We have a stream which will signal ECONNRESET on read.

I would vote for signaling a STREAM-ERROR condition - so that the
program gets alerted about the abrupt termination of the socket - and
at the same time set the stream to a state equivalent to EOF - because
ECONNRESET is not a transient error condition.

> At any rate, I am tempted to treat this as an EOF (not least because it seems
> easiest to fit this condition into the trichotomy of ls_avail/ls_eof/ls_wait.

This is right w.r.t. the stream's internal state, but ...

> especially given that on win32 we already treat WSAESHUTDOWN as eof.

ESHUTDOWN is something different again, see
<http://www.wlug.org.nz/ESHUTDOWN>

Bruno
|
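The point about select() can be checked with a few lines of C. This is a minimal sketch using a Unix-domain socketpair as a stand-in for a TCP connection; `readable_now` is a hypothetical helper name. It shows that select() reports the descriptor as readable both when data is waiting and when the peer has closed, and that only the following read() tells the two apart.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Poll fd with a zero timeout.
   Returns 1 if select() marks fd readable, 0 if not, -1 on error. */
static int readable_now(int fd) {
    fd_set rfds;
    struct timeval tv;
    int r;
    tv.tv_sec = 0;
    tv.tv_usec = 0;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

With `int sv[2]` from `socketpair(AF_UNIX, SOCK_STREAM, 0, sv)`, `readable_now(sv[0])` is 0 while the connection is idle, 1 after the peer writes a byte, and 1 again after the peer closes, at which point `read()` on `sv[0]` returns 0.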
|
From: Sam S. <sd...@gn...> - 2008-12-29 21:23:37
|
Don Cohen wrote:
> and now on the server I get this
> [3]> (socket:socket-status str)
>
> [../src/stream.d:6143]
> *** - UNIX error 104 (ECONNRESET): Connection reset by peer
> The following restarts are available:
> ABORT :R1 Abort main loop
> Break 1 [4]>
with the appended patch:
[3]> (socket:socket-status str)
:APPEND ;
1
Alas:
[4]> (write-line "foo" str)
Program received signal SIGPIPE, Broken pipe.
0x00000033c82c0e50 in __write_nocancel () from /lib64/libc.so.6
(gdb) up
#1 0x00000000005eacef in fd_write (fd=8, bufarea=0x7fff75f78a60, nbyte=3,
persev=persev_full) at ../src/unixaux.d:462
462 var ssize_t retval = write(fd,buf,nbyte);
the socket appears to be in a totally broken state!
Bruno, why am I getting SIGPIPE instead of an EPIPE?
--- unixaux.d.~1.64.~ 2008-12-05 09:40:57.000000000 -0500
+++ unixaux.d 2008-12-29 16:17:12.000395000 -0500
@@ -319,6 +319,10 @@ global ssize_t fd_read (int fd, void* bu
errno = ENOENT;
break;
} else if (retval < 0) {
+ #ifdef ECONNRESET
+ if (errno == ECONNRESET)
+ { errno = ENOENT; break; }
+ #endif
#ifdef EINTR
if (errno != EINTR)
#endif
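
As general POSIX background to the SIGPIPE question (this says nothing about what CLISP or gdb do here): a write to a connection whose peer is gone raises SIGPIPE, and write() only gets as far as returning EPIPE when that signal is ignored, blocked, or handled. The C sketch below demonstrates this with a Unix-domain socketpair as a stand-in for TCP; `write_after_peer_close` is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write to a socket whose peer is already closed, with SIGPIPE ignored.
   Returns the errno from write(), 0 if write() unexpectedly succeeded,
   or -1 if the socketpair could not be created. */
static int write_after_peer_close(void) {
    int sv[2];
    void (*old)(int);
    int result;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    close(sv[1]);                       /* the peer goes away */
    old = signal(SIGPIPE, SIG_IGN);     /* default action would kill us */
    result = (write(sv[0], "foo", 3) < 0) ? errno : 0;
    signal(SIGPIPE, old);               /* restore previous disposition */
    close(sv[0]);
    return result;
}
```

With the signal ignored, the function returns `EPIPE`; with the default disposition the process would be terminated by SIGPIPE before write() returned.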
|
|
From: <don...@is...> - 2008-12-29 22:08:25
|
Sam Steingold writes:
 > Don Cohen wrote:
 > > and now on the server I get this
 > > [3]> (socket:socket-status str)
 > >
 > > [../src/stream.d:6143]
 > > *** - UNIX error 104 (ECONNRESET): Connection reset by peer
 > > The following restarts are available:
 > > ABORT :R1 Abort main loop
 > > Break 1 [4]>
 >
 > with the appended patch:
 > [3]> (socket:socket-status str)
 > :APPEND ;
 > 1

I don't think this is desired. FIN means that the peer is done
sending and you are still allowed to send. RST means the peer doesn't
even recognize this connection.

 > Alas:
 > [4]> (write-line "foo" str)
 > Program received signal SIGPIPE, Broken pipe.

That seems right to me.

 > 0x00000033c82c0e50 in __write_nocancel () from /lib64/libc.so.6
 > (gdb) up
 > #1 0x00000000005eacef in fd_write (fd=8, bufarea=0x7fff75f78a60, nbyte=3,
 > persev=persev_full) at ../src/unixaux.d:462
 > 462 var ssize_t retval = write(fd,buf,nbyte);
 >
 > the socket appears to be in a totally broken state!

The socket should be viewed as closed.
|
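The FIN half of that distinction (the peer is done sending, but you may still send) can be sketched in C with shutdown(). This is only an analogy: it uses a Unix-domain socketpair rather than TCP, so no real FIN segment is involved, and `half_close_roundtrip` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One end announces "done sending"; the other end sees EOF on read but
   can still write back. Returns 1 if every step behaved that way,
   0 if some step did not, -1 if the socketpair could not be created. */
static int half_close_roundtrip(void) {
    int sv[2];
    char buf[8];
    int ok = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (shutdown(sv[1], SHUT_WR) == 0             /* sv[1]: done sending  */
        && read(sv[0], buf, sizeof buf) == 0      /* sv[0] sees EOF       */
        && write(sv[0], "ok", 2) == 2             /* but may still send   */
        && read(sv[1], buf, sizeof buf) == 2      /* and the data arrives */
        && memcmp(buf, "ok", 2) == 0)
        ok = 1;
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

After a reset there is no such usable direction left, which is why mapping ECONNRESET to the same result as an orderly end of input hides information from the caller.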