From: Sam S. <sd...@gn...> - 2006-01-17 14:22:12

clisp 2.37 has a bug on open :if-exists :append, so I want to release
2.38 ASAP.
the only blocker is the mac os x socket bug.
it has been reported that this bug was not present in 2.36, so the brute
force "fill everything with 0" patch will NOT go in.
since the mac os x SF CF hosts are dead, I cannot debug this.
it is up to you, the mac os x users, to close this issue.
what you have to do is:

1. make sure that it was the Dec 19
   "(SOCKET-SERVER): accept keywords :INTERFACE :BACKLOG"
   patch that broke mac os x

2. find out why it fails (using, e.g., strace). specifically,
   (socket-server) and (socket-server 3456) - do they work or fail on
   2.37 and 2.36? what are the traces (around the failed calls and the
   corresponding successful calls)?

it is up to you to ensure that sockets work on your system in clisp
2.38.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
http://www.openvotingconsortium.org http://pmw.org.il
http://www.memri.org http://ffii.org http://www.mideasttruth.com
http://truepeace.org
Trespassers will be shot. Survivors will be prosecuted.

From: Doug P. <dg...@ma...> - 2006-01-17 14:38:27

>clisp 2.37 has a bug on open :if-exists :append, so I want to release
>2.38 ASAP.

Cool!

>the only blocker is the mac os x socket bug.
>it has been reported that this bug was not present in 2.36, so the brute
>force "fill everything with 0" patch will NOT go in.

Sam,

I must strenuously object to this line of thinking. You cannot prove
the bug did not exist because you cannot prove that the structure
didn't just happen to have 0s in the right places by accident.

If you really want to prove that the 2.36 version was OK, you would
have to assert that the structure's memory (not just the defined
fields, since padding is inaccessible from defined field access) was
filled with something other than zeroes. Was it 0xFF filled first?

I strongly urge you to put the fill everything with 0 patch IN. We
know it works, and anyone who has done software for as long as I (and
you, all of us: we) have should know that uninitialized variable bugs
can hide for a long, long time. That this wasn't caught until now
doesn't make Mac OS X any less buggy before, nor does it indicate that
anything more ominous happened than random stack garbage changing the
uninitialized values of the structure.

--Doug

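Doug's point about memory that field assignments never reach can be checked directly. The sketch below is an illustration only, not CLISP code: the helper names are hypothetical, and the 0xFF pre-fill merely stands in for random stack garbage. It pre-fills a struct sockaddr_in, assigns only the named fields, and counts how many sin_zero bytes are still non-zero.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Count the sin_zero bytes that are not zero. */
static int dirty_padding_bytes(const struct sockaddr_in *a) {
    int n = 0;
    assert(a != NULL);
    for (size_t i = 0; i < sizeof a->sin_zero; i++)
        if (a->sin_zero[i] != 0)
            n++;
    return n;
}

/* Assign only the named fields, as field-by-field init code does. */
static void set_named_fields(struct sockaddr_in *a, unsigned short port) {
    a->sin_family = AF_INET;
    a->sin_port = htons(port);
    a->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
}

/* Pre-fill with fill_byte (0xFF simulates stack garbage, 0 is the
 * "fill everything with 0" idea), then initialize by field and
 * report how many sin_zero bytes are left non-zero. */
int padding_left_dirty(int fill_byte) {
    struct sockaddr_in a;
    memset(&a, fill_byte, sizeof a);
    set_named_fields(&a, 3456);
    return dirty_padding_bytes(&a);
}
```

Calling padding_left_dirty(0xFF) reports every sin_zero byte as dirty, while padding_left_dirty(0) reports none. A run that happens to start from zeroed stack memory therefore cannot show that field-wise initialization alone is sufficient.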
From: Sam S. <sd...@gn...> - 2006-01-17 14:59:27

Doug,

> * Doug Philips <qtbh@znp.pbz> [2006-01-17 09:38:07 -0500]:
>
> I must strenuously object to this line of thinking. You cannot prove
> the bug did not exist because you cannot prove that the structure
> didn't just happen to have 0s in the right places by accident.
>
> If you really want to prove that the 2.36 version was OK, you would
> have to assert that the structure's memory (not just the defined
> fields since padding is inaccessible from defined field access) was
> filled with something other than zeroes. Was it 0xFF filled first?
>
> I strongly urge you to put the fill everything with 0 patch IN. We
> know it works, and anyone who has done software for as long as I (and
> you, all of us: we) have should know that uninitialized variable bugs
> can hide for a long, long time. That this wasn't caught until now
> doesn't make Mac OS X any less buggy before, nor does it indicate that
> anything more ominous happened than random stack garbage changing the
> uninitialized values of the structure.

sounds very convincing - if we assume that you are right and, indeed,
the problem is uninitialized slots.
the fact that the FILL0 patch fixes the bug does not prove it - just
like the fact that 2.36 worked does not disprove it either (as you
yourself pointed out).

Now, to actually prove that FILL0 is TRT (the right thing), you need,
e.g., to apply it to 2.36 with 0 replaced with 0xFF and see if it
breaks 2.36.
You need to trace a "working" clisp against a "broken" one.

BTW, what's the word from Apple - did you report the bug to them?
Is it a documented behavior?
What is the reference?

From: Doug P. <dg...@ma...> - 2006-01-17 15:36:01

Sam,

>Now, to actually prove that FILL0 is TRT, you need, e.g., to apply it to
>2.36 with 0 replaced with 0xFF and see if it breaks 2.36.
>You need to trace a "working" clisp against a "broken" one.

Assuming everything goes as planned, I should be able to grab a 2.36
tree tonight. We have four scenarios to test:

  2.36 fill 0
  2.36 fill 0xFF
  2.37 fill 0
  2.37 fill 0xFF

(I will get to this as soon as I can, but if someone else can beat me
to it, all the better!)

>BTW, what's the word from Apple - did you report the bug to them?
>Is it a documented behavior?
>What is the reference?

No, I think someone else was going to report it to Apple, but I don't
have access to my email archives from work to double check who it
was. (I'm terrible with names, sorry). I was pretty sure you and this
other person had exchanged emails re: The first/third editions of
Stevens saying that this 0fill was needed, etc.

--Doug

From: Sam S. <sd...@gn...> - 2006-01-17 15:56:55

> * Doug Philips <qtbh@znp.pbz> [2006-01-17 10:05:25 -0500]:
>
>>Now, to actually prove that FILL0 is TRT, you need, e.g., to apply it to
>>2.36 with 0 replaced with 0xFF and see if it breaks 2.36.
>>You need to trace a "working" clisp against a "broken" one.
>
> Assuming everything goes as planned, I should be able to grab a 2.36
> tree tonight. We have four scenarios to test: 2.36 fill 0, 2.36 fill
> 0xFF, 2.37 fill 0, 2.37 fill 0xFF.

Thanks.

>>BTW, what's the word from Apple - did you report the bug to them?
>>Is it a documented behavior?
>>What is the reference?

see http://www.macdevcenter.com/pub/a/mac/2002/12/26/cocoa.html?page=3

> No, I think someone else was going to report it to Apple, but I don't
> have access to my email archives from work to double check who it
> was. (I'm terrible with names, sorry). I was pretty sure you and this
> other person had exchanged emails re: The first/third editions of
> Stevens saying that this 0fill was needed, etc.

Gregory Wright? Lennart Staflin? Dan Starr?

all the erg files I have seen so far point to the 2005-12-19
socket-server patch.

From: Douglas P. <dg...@ma...> - 2006-01-18 06:24:46

On 2006 Jan 17, at 10:55 AM, Sam Steingold wrote:
>> * Doug Philips <qtbh@znp.pbz> [2006-01-17 10:05:25 -0500]:
>> Assuming everything goes as planned, I should be able to grab a 2.36
>> tree tonight. We have four scenarios to test: 2.36 fill 0, 2.36 fill
>> 0xFF, 2.37 fill 0, 2.37 fill 0xFF.
>
> Thanks.

Results so far:
  "out of the box" 2.36 passes its socket tests.
  Adding FILL0 (and calls to it) in 2.36 makes for no change, the
    tests all pass either way.
  Changing FILL0 to fill with 0xFF causes no change in the 2.36
    socket tests.
  Changing FILL0 to fill with 0xFF causes 30 errors in the 2.37 test.
  Running the 2.37 socket test under 2.36 (0 and FF) causes a
    segmentation fault:

(PROGN (SETQ *SERVER* (SOCKET-SERVER 9090)
             *SOCKET-1* (SOCKET-CONNECT 9090 "localhost" :TIMEOUT 0 :BUFFERED NIL)
             *SOCKET-2* (SOCKET-ACCEPT *SERVER* :BUFFERED NIL))
       (WRITE-CHAR #\a *SOCKET-1*))
EQL-OK: #\a
(LISTP (SHOW (LIST (MULTIPLE-VALUE-LIST (SOCKET-STREAM-LOCAL *SOCKET-1*))
                   (MULTIPLE-VALUE-LIST (SOCKET-STREAM-PEER *SOCKET-1*))
                   (MULTIPLE-VALUE-LIST (SOCKET-STREAM-LOCAL *SOCKET-2*))
                   (MULTIPLE-VALUE-LIST (SOCKET-STREAM-PEER *SOCKET-2*)))
             :PRETTY T))
make: *** [socket2.erg] Segmentation fault
make: *** Deleting file `socket2.erg'

When I get over my head cold, I'll investigate further. diff on the
2.36 and 2.37 socket test was annoyingly filled with uninteresting
changes, but I haven't looked more deeply.

--Doug

From: Sam S. <sd...@gn...> - 2006-01-18 14:26:12

> * Douglas Philips <qtbh@znp.pbz> [2006-01-18 01:24:33 -0500]:
>
> On 2006 Jan 17, at 10:55 AM, Sam Steingold wrote:
>>> * Doug Philips <qtbh@znp.pbz> [2006-01-17 10:05:25 -0500]:
>>> Assuming everything goes as planned, I should be able to grab a 2.36
>>> tree tonight. We have four scenarios to test: 2.36 fill 0, 2.36 fill
>>> 0xFF, 2.37 fill 0, 2.37 fill 0xFF.
>>
>> Thanks.
>
> Results so far:
>   "out of the box" 2.36 passes its socket tests.
>   Adding FILL0 (and calls to it) in 2.36 makes for no change, the
>     tests all pass either way.
>   Changing FILL0 to fill with 0xFF causes no change in the 2.36
>     socket tests.

So it looks like the problem is not that FILL0 is required on mac os x,
but that something is wrong in CLISP. right?

now you need to run CLISP 2.36 and 2.37 under gdb and find the bind call
that fails on 2.37 but succeeds on 2.36 and compare the arguments.
just type (socket-server 3456) in 2.36 and 2.37 and see what happens.

>   Changing FILL0 to fill with 0xFF causes 30 errors in the 2.37 test.

as I said, the bug was introduced on 2005-12-19, so it would make more
sense to compare the pre-2005-12-19 snapshot with a post-2005-12-19 one
rather than 2.36 with 2.37.

>   Running the 2.37 socket test under 2.36 (0 and FF) causes a
>     segmentation fault:
> (PROGN (SETQ *SERVER* (SOCKET-SERVER 9090) *SOCKET-1* (SOCKET-CONNECT
> 9090 "localhost" :TIMEOUT 0 :BUFFERED NIL) *SOCKET-2* (SOCKET-ACCEPT
> *SERVER* :BUFFERED NIL)) (WRITE-CHAR #\a *SOCKET-1*))
> EQL-OK: #\a
> (LISTP (SHOW (LIST (MULTIPLE-VALUE-LIST (SOCKET-STREAM-LOCAL
> *SOCKET-1*)) (MULTIPLE-VALUE-LIST (SOCKET-STREAM-PEER *SOCKET-1*))
> (MULTIPLE-VALUE-LIST (SOCKET-STREAM-LOCAL *SOCKET-2*))
> (MULTIPLE-VALUE-LIST (SOCKET-STREAM-PEER *SOCKET-2*))) :PRETTY T))
> make: *** [socket2.erg] Segmentation fault
> make: *** Deleting file `socket2.erg'

this bug was fixed on 2005-12-15

From: Sam S. <sd...@gn...> - 2006-01-18 16:28:04

> * Sam Steingold <fq...@ta...t> [2006-01-18 09:24:39 -0500]:
>
> now you need to run CLISP 2.36 and 2.37 under gdb and find the bind call
> that fails on 2.37 but succeeds on 2.36 and compare the arguments.

to put it simply: on SF CF openpower-linux1 (ppc64):

$ strace ./clisp -norc -q -x '(socket-server-close (socket-server))' 2>&1 1>/dev/null | tail -20
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(6, 1) = 0
getsockname(6, {sa_family=AF_INET, sin_port=htons(39632), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
close(6) = 0

just do this same thing with 2.37 and 2.36 on macosx with and without
FILL0 and FILLxFF.

I cannot believe this is taking so long.

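The system call sequence in the trace above can also be driven from a small standalone C function, which takes CLISP out of the picture when comparing a zero-filled and a 0xFF-filled address structure. This is a sketch under assumptions, not code from CLISP: the function name and the fill_byte parameter are made up for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Replays socket / setsockopt / bind / listen / getsockname / close
 * on 127.0.0.1 with port 0. The address structure is pre-filled with
 * fill_byte before the named fields are assigned, so fill_byte 0
 * mimics FILL0 and 0xFF mimics the FILLxFF experiment.
 * Returns the kernel-assigned port in host byte order, or -1 if any
 * call fails. */
int loopback_listener_port(int fill_byte) {
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int one = 1;
    int port = -1;
    int fd;

    assert(fill_byte >= 0 && fill_byte <= 0xFF);

    fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, fill_byte, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);                       /* let the kernel pick */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) == 0 &&
        bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        listen(fd, 1) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);

    close(fd);
    return port;
}
```

Comparing loopback_listener_port(0) with loopback_listener_port(0xFF) on the affected system would show whether bind itself is sensitive to the bytes that the field assignments leave untouched. Note that on systems whose sockaddr_in carries a sin_len field, the 0xFF fill also leaves that field dirty.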
From: Lennart S. <le...@ly...> - 2006-01-25 17:34:59

On 17 jan 2006, at 15:20, Sam Steingold wrote:

> clisp 2.37 has a bug on open :if-exists :append, so I want to release
> 2.38 ASAP.
> the only blocker is the mac os x socket bug.
> it has been reported that this bug was not present in 2.36, so the
> brute
> force "fill everything with 0" patch will NOT go in.
> since the mac os x SF CF hosts are dead, I cannot debug this.
> it is up to you, the mac os x users, to close this issue.
> what you have to do is:
>
> 1. make sure that it was the Dec 19
> "(SOCKET-SERVER): accept keywords :INTERFACE :BACKLOG"
> patch that broke mac os x
>
> 2. find out why it fails (using, e.g., strace). specifically,
> (socket-server) and (socket-server 3456) - do they work or fail on
> 2.37 and 2.36? what are the traces (around the failed calls and the
> corresponding successful calls)
>
> it is up to you to ensure that sockets work on your system in clisp
> 2.38.
>

OK, clisp 2.38 has been released without this resolved, just as I am
getting enough compute cycles to perhaps get to the bottom of this.

1. Indeed, before the addition of :INTERFACE etc., the socket-server
code worked, and after it didn't. I have looked at the call to bind in
gdb in both instances, and the difference is that before it tried to
bind to address 0 (0.0.0.0) and after it is binding to 127.0.0.1. With
the non-zero address, bind fails. I verified this by setting the
address to zero in the debugger and stepping past bind.

(gdb) print ({struct sockaddr_in} addr).sin_addr.s_addr=0

I found this interesting comment in the OpenMCL source:

;; Darwin includes the SIN_ZERO field of the sockaddr_in when
;; comparing the requested address to the addresses of configured
;; interfaces (as if the zeros were somehow part of either address.)
;; "rletz" zeros out the stack-allocated structure, so those zeros
;; will be 0.

Next, instead of setting sin_addr to zero, I zeroed sin_zero:

(gdb) print ({struct sockaddr_in} addr).sin_zero="\0\0\0\0\0\0\0"

That worked too, with 127.0.0.1 in sin_addr.

Perhaps it is like this: If the bind address is 0 (i.e. the any
address) it is accepted directly. If the address is not 0, the list of
interfaces is scanned to find a matching interface, and the sin_zero
part of the struct is included in the comparison (perhaps that part is
used for ipv6?).

I'm not sure if this is documented. But clearing sin_zero seems
reasonable.

2. Given the above, (socket-server 0 :interface "0.0.0.0") should
work. But it doesn't. I looked at the code for socket-server in
stream.d and it looks strange to me. It seems that if interface is
specified as a string, create_server_socket_by_string will be called
twice, the second time with "127.0.0.1" hardcoded as the address, and
the socket from the first call is leaked. Perhaps I am misreading
the code, but it looks wrong to me.

//Lennart Staflin

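Lennart's hypothesis can be written down as a toy model. The code below is not Darwin's kernel logic, only a guess at its shape taken from the description above: the wildcard address is accepted directly, and any other address must match a configured interface with sin_zero included in the comparison. All names are hypothetical, and the only "interface" in the model is a zero-filled 127.0.0.1.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Build an AF_INET address after pre-filling the struct with
 * fill_byte, so bytes not covered by a named field keep that value. */
static struct sockaddr_in make_addr(uint32_t host_order_ip, int fill_byte) {
    struct sockaddr_in a;
    assert(fill_byte >= 0 && fill_byte <= 0xFF);
    memset(&a, fill_byte, sizeof a);
    a.sin_family = AF_INET;
    a.sin_port = htons(0);
    a.sin_addr.s_addr = htonl(host_order_ip);
    return a;
}

/* Toy model (an assumption, not real kernel code): returns 1 if the
 * hypothesized bind check would accept the request, 0 otherwise. */
int model_bind_accepts(uint32_t host_order_ip, int fill_byte) {
    struct sockaddr_in req = make_addr(host_order_ip, fill_byte);
    struct sockaddr_in lo = make_addr(INADDR_LOOPBACK, 0);

    /* The "any" address is accepted without scanning interfaces. */
    if (req.sin_addr.s_addr == htonl(INADDR_ANY))
        return 1;

    /* Otherwise both the address and sin_zero must match. */
    return memcmp(&req.sin_addr, &lo.sin_addr, sizeof req.sin_addr) == 0
        && memcmp(req.sin_zero, lo.sin_zero, sizeof req.sin_zero) == 0;
}
```

Under this model a request for 127.0.0.1 with garbage in sin_zero is rejected, the same request with sin_zero cleared is accepted, and 0.0.0.0 is accepted either way, which is consistent with both gdb experiments reported above.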
From: Sam S. <sd...@gn...> - 2006-01-25 18:43:13

> * Lennart Staflin <yr...@yl...> [2006-01-25 18:34:36 +0100]:
>
> 2. Given the above, (socket-server 0 :interface "0.0.0.0") should
> work. But it doesn't. I looked at the code for socket-server in
> stream.d and it looks strange to me. It seems that if interface is
> specified as a string, create_server_socket_by_string will be called
> twice. The second time with "127.0.0.1" hardcoded as the address and
> also leaking the socket from the first call. Perhaps I am misreading
> the code, but it looks wrong to me.

thanks, I just fixed this bug in the CVS