From: <don...@is...> - 2010-01-11 20:54:49
|
I've been doing a binary search in cvs time.
I've seen the ebadf problem in a build as of 2009-08-22 --
the transcript ends like this:

UNIX error 9 (EBADF): Bad file number

Segmentation fault

After running for a week (though very light usage) in a build as of
2009-08-15 I don't see the problem. So I now want to build as of
8-17 or 8-18. I'm getting this error:

config.status: creating config.h
configure: ** I18N (Done)
make[1]: Entering directory `/home/clisp-build/cvs-date/2009-08-17T00:00:00-00/clisp/build-mt-no/i18n'
Makefile:24: *** target pattern contains no `%'. Stop.
make[1]: Leaving directory `/home/clisp-build/cvs-date/2009-08-17T00:00:00-00/clisp/build-mt-no/i18n'
make: *** [i18n] Error 2

The configure line:
./configure --with-debug --with-threads=POSIX_THREADS --with-module=rawsock --with-dynamic-modules=no build-mt-no

If the error above is related in internationalization code I'd be
happy to add a configuration option to not include that.

BTW, the with-dynamic-modules=no is due to the fact that without it my
builds on this machine end with this:

base/lisp.run -B . -M base/lispinit.mem -norc -q -i i18n/i18n -i syscalls/posix -i regexp/regexp -i readline/readline -x (saveinitmem "base/lispinit.mem")
STACK size: 98206 [0x172f00 0x113088]
./clisp-link: line 97: 6041 Segmentation fault "$@"
./clisp-link: failed in /home/clisp-build/clisp/build-mt
make: *** [base] Error 1

I consider that to be another problem still on the queue, so I'd be
happy to hear any theories about what causes it or what might fix it.

Since the dynamic modules error appears on only one of the two
machines doing nightly builds, it occurs to me that it's possible
that the ebadf error might also appear only on this machine.
It's relatively inconvenient for me to use the other nightly build
machine (running a much more recent linux) to test for ebadf, but I
begin to think it might be worth trying.
|
From: <don...@is...> - 2010-02-03 20:16:51
|
I sent this on Jan 11 and saw no reply, but now that I search for it
on gmane, I don't see it there. And yet the message I forward did
come from the list, it's not just a cc to myself.
In any case I resend in hope of an answer.

The version I've been running from cvs as of 2009-08-15 continues to
work with no ebadf's, whereas the version as of 2009-08-22 ended with

UNIX error 9 (EBADF): Bad file number

UNIX error 9 (EBADF): Bad file number

Segmentation fault

so I'm pretty sure that the problem was introduced somewhere in that
interval. On the other hand, this is one particular server running
a rather old linux version, and it seems possible that this problem
would not occur on newer OS versions.

Don Cohen writes:
>
> I've been doing a binary search in cvs time.
> I've seen the ebadf problem in a build as of 2009-08-22 --
> the transcript ends like this:
>
> UNIX error 9 (EBADF): Bad file number
>
> Segmentation fault
>
> After running for a week (though very light usage) in a build as of
> 2009-08-15 I don't see the problem. So I now want to build as of
> 8-17 or 8-18. I'm getting this error:
>
> config.status: creating config.h
> configure: ** I18N (Done)
> make[1]: Entering directory `/home/clisp-build/cvs-date/2009-08-17T00:00:00-00/clisp/build-mt-no/i18n'
> Makefile:24: *** target pattern contains no `%'. Stop.
> make[1]: Leaving directory `/home/clisp-build/cvs-date/2009-08-17T00:00:00-00/clisp/build-mt-no/i18n'
> make: *** [i18n] Error 2
>
> The configure line:
> ./configure --with-debug --with-threads=POSIX_THREADS --with-module=rawsock --with-dynamic-modules=no build-mt-no
>
> If the error above is related in internationalization code I'd be
> happy to add a configuration option to not include that.
>
> BTW, the with-dynamic-modules=no is due to the fact that without it my
> builds on this machine end with this:
>
> base/lisp.run -B . -M base/lispinit.mem -norc -q -i i18n/i18n -i syscalls/posix -i regexp/regexp -i readline/readline -x (saveinitmem "base/lispinit.mem")
> STACK size: 98206 [0x172f00 0x113088]
> ./clisp-link: line 97: 6041 Segmentation fault "$@"
> ./clisp-link: failed in /home/clisp-build/clisp/build-mt
> make: *** [base] Error 1
>
> I consider that to be another problem still on the queue, so I'd be
> happy to hear any theories about what causes it or what might fix it.
>
> Since the dynamic modules error appears on only one of the two
> machines doing nightly builds, it occurs to me that it's possible
> that the ebadf error might also appear only on this machine.
> It's relatively inconvenient for me to use the other nightly build
> machine (running a much more recent linux) to test for ebadf, but I
> begin to think it might be worth trying.
|
From: Sam S. <sd...@gn...> - 2010-02-03 21:16:46
|
Don Cohen wrote:
> The version I've been running from cvs as of 2009-08-15 continues to
> work with no ebadf's, whereas the version as of 2009-08-22 ended with
> UNIX error 9 (EBADF): Bad file number
>
> UNIX error 9 (EBADF): Bad file number
>
> Segmentation fault
>
> so I'm pretty sure that the problem was introduced somewhere in that
> interval. On the other hand, this is one particular server running
> a rather old linux version, and it seems possible that this problem
> would not occur on newer OS versions.

here are the changes between these two dates:

2009-08-20  Sam Steingold  <sd...@gn...>

    accept -disable-readline run-time option
    * lispbibl.d (disable_readline): declare
    * spvw.d (disable_readline): define
    (usage): document -disable-readline
    (parse_options): set disable_readline when -disable-readline is given
    * stream.d (make_terminal_stream_): do not use readline when
    disable_readline is true

2009-08-19  Vladimir Tzankov  <vtz...@gm...>

    * package.d (symbol_list_lookup): search for symbol name in a list
    (symtab_find, shadowing_lookup): use it

2009-08-18  Sam Steingold  <sd...@gn...>

    * makemake.in (XCC_PICFLAG) [cygwin]: empty: "warning: -fPIC
    ignored for target (all code is position independent)"

2009-08-17  Bruno Haible  <br...@cl...>

    * spvw_sigsegv.d (stackoverflow_handler_continuation): Update
    reference to Linux/arm register to match current API.

2009-08-16  Sam Steingold  <sd...@gn...>

    * modules/berkeley-db/Makefile.in, modules/bindings/glibc/Makefile:
    * modules/bindings/win32/Makefile, modules/clx/mit-clx/Makefile:
    * modules/clx/new-clx/Makefile.in, modules/dbus/Makefile.in:
    * modules/dirkey/Makefile.in, modules/fastcgi/Makefile.in:
    * modules/gdbm/Makefile.in, modules/gtk2/Makefile.in:
    * modules/i18n/Makefile.in, modules/libsvm/Makefile:
    * modules/matlab/Makefile, modules/netica/Makefile:
    * modules/oracle/Makefile.in, modules/pari/Makefile.in:
    * modules/pcre/Makefile.in, modules/postgresql/Makefile.in:
    * modules/queens/Makefile, modules/rawsock/Makefile.in:
    * modules/readline/Makefile.in, modules/regexp/Makefile.in:
    * modules/syscalls/Makefile.in, modules/wildcard/Makefile.in:
    * modules/zlib/Makefile.in: avoid GNU extensions
    Reported by Aleksej Saushev <as...@in...>

2009-08-16  Vladimir Tzankov  <vtz...@gm...>

    [MULTITHREAD]: make packages threads safe
    * package.d (rehash_symtab): do not reuse old cons cell. allocate
    new symtab
    (make_present, unexport, make_external): assign returned symtab -
    possibly newly allocated
    (unuse_1package): do not lock anything. caller should have
    obtained both package mutexes
    (unuse_package): obtain package locks before calling unuse_1package
    (USE-PACKAGE, UNUSE-PACKAGE): obtain global packages lock since
    more than one package mutex will be locked at a time
    (%IN-PACKAGE): lock while modifying existing packages
    (DELETE-PACKAGE): lock existing package during unuse_1package
    (WITH_PACKAGE_LIST_MUTEX_LOCK): macro for obtaining all mutexes of
    a list of packages. on unwinding releases them
    (use_package): use it
    (make_package): guard insertion into all_packages

the only changes related to MT are making packages thread-safe.
are you manipulating packages in any way?
interning stuff?

you can try reverting these changes (by Vladimir on 2009-08-16 &
2009-08-19) in your 2009-08-15 tree and rebuilding.
|
From: <don...@is...> - 2010-02-03 21:29:53
|
Sam Steingold writes: > Don Cohen wrote: > > The version I've been running from cvs as of 2009-08-15 continues to > > work with no ebadf's, whereas the version as of 2009-08-22 ended with > > UNIX error 9 (EBADF): Bad file number > > > > UNIX error 9 (EBADF): Bad file number > > > > Segmentation fault > > > > so I'm pretty sure that the problem was introduced somewhere in that > > interval. On the other hand, this is one particular server running > > a rather old linux version, and it seems possible that this problem > > would not occur on newer OS versions. > > here are the changes between these two dates: ... > the only changes related to MT are making packages thread-safe. > are you manipulating packages in any way? > interning stuff? Almost certainly. And the various threads are almost certainly changing the same packages, possibly even trying to create the same symbols, using gentemp unless gentemp is aware of MT and doing its own locking. But all of that was going on before the change, so if anything I'd expect things to work better after the change. It hadn't occurred to me that this could be related to ebadf, which I thought was related to files. How do you get from packages to ebadf? > you can try reverting these changes (by Vladimir on 2009-08-16 & > 2009-08-19) in your 2009-08-15 tree and rebuilding. How does one revert a particular change in cvs? And isn't it likely that some later update also changed that same code? Which would leave me trying to guess at the correct way to incorporate the new changes without the older ones, right? So I guess you don't think it's worth while to narrow down to the single change before which things work and after which they don't. |
From: Sam S. <sd...@gn...> - 2010-02-03 21:47:33
|
Don Cohen wrote:
> Sam Steingold writes:
> > the only changes related to MT are making packages thread-safe.
> > are you manipulating packages in any way?
> > interning stuff?
>
> Almost certainly. And the various threads are almost certainly
> changing the same packages, possibly even trying to create the same
> symbols, using gentemp unless gentemp is aware of MT and doing its

gentemp does no locking.
it creates interned symbols which are never uninterned, so it introduces a
potential memory leak - unless you are careful.
why not use gensym instead?

> own locking. But all of that was going on before the change, so if
> anything I'd expect things to work better after the change.

... if the change was bug-free :-)

> It hadn't occurred to me that this could be related to ebadf, which I
> thought was related to files. How do you get from packages to ebadf?

if the change leads to memory corruption, _anything_ can happen.

> > you can try reverting these changes (by Vladimir on 2009-08-16 &
> > 2009-08-19) in your 2009-08-15 tree and rebuilding.
>
> How does one revert a particular change in cvs?

get the diff (cvs diff -r ... -r ... > patch)
and apply it (patch -R < patch)

> And isn't it likely that some later update also changed that same
> code? Which would leave me trying to guess at the correct way to
> incorporate the new changes without the older ones, right?

huh?

> So I guess you don't think it's worth while to narrow down to the
> single change before which things work and after which they don't.

we are already doing that.

1. take the working 2009-08-15 tree
2. apply the suspicious patch (see above)
3. build clisp 2009-08-15 + the suspicious patch

alternatively:

1'. take the working 2009-08-22 tree
2'. revert the suspicious patch (see above)
3'. build clisp 2009-08-22 - the suspicious patch

now you have a clisp executable which you should run and see if you get
the error.
if the executable from [3] works fine (or the executable from [3'] fails),
then the patch is OK. otherwise it is broken.

Sam.
|
From: <don...@is...> - 2010-02-04 19:34:26
|
> gentemp does no locking. > it creates interned symbols which are never uninterned, so it introduces a > potential memory leak - unless you are careful. > why not use gensym instead? I now remember some other cases besides gentemp where I intern. Typically I'm defining functions that are to be called and possibly later redefined by other code. These functions are, in my mind, not very different from the ones defined directly by source code. I might even want to call them myself while debugging. Admittedly I could implement my own symbol table, but why do that when it's already there? > if the change leads to memory corruption, _anything_ can happen. Ah, so that's the kind of bug we're looking for. > 1. take the working 2009-08-15 tree > 2. apply the suspicious patch (see above) > 3. build clisp 2009-08-15 + the suspicious patch I had imagined that the easy way to do this was to check out the cvs tree from just before and just after the suspicious patch. The problem was that I got those build errors. So you want to avoid those errors by directly applying a patch to a state other than the one where it was made - we're assuming that the suspicious patch is independent of all of the others between it and my good state. It's on my queue... |
From: <don...@is...> - 2010-02-17 08:29:55
|
Sam Steingold writes:
> 1. take the working 2009-08-15 tree
> 2. apply the suspicious patch (see above)
> 3. build clisp 2009-08-15 + the suspicious patch

After running the version of 2009-08-15 + the result of
 cvs diff -D "20090815" -D "20090817" package.d
 (a 535 line patch - 1.120)
for a week without incident I tried a new version adding to that
the result of
 cvs diff -D "20090817" -D "20090822" package.d
 (a 60 line patch - 1.121)
This promptly produced

UNIX error 9 (EBADF): Bad file number

Segmentation fault

So I guess this is good news - it looks like the problem is in the
small patch rather than the big one. I hope you can find it.
I'm now back to running the previous version (8-17). Let me know
if you want me to try some intermediate version between 1.120 and
1.121.
|
From: Sam S. <sd...@gn...> - 2010-02-17 18:28:21
|
Don Cohen wrote:
> Sam Steingold writes:
> > 1. take the working 2009-08-15 tree
> > 2. apply the suspicious patch (see above)
> > 3. build clisp 2009-08-15 + the suspicious patch
> After running the version of 2009-08-15 + the result of
> cvs diff -D "20090815" -D "20090817" package.d
> (a 535 line patch - 1.120)
> for a week without incident I tried a new version adding to that
> the result of
> cvs diff -D "20090817" -D "20090822" package.d
> (a 60 line patch - 1.121)
> This promptly produced
>
> UNIX error 9 (EBADF): Bad file number
>
> Segmentation fault
>
> So I guess this is good news - it looks like the problem is in the
> small patch rather than the big one. I hope you can find it.
> I'm now back to running the previous version (8-17). Let me know
> if you want me to try some intermediate version between 1.120 and
> 1.121.

Thanks for your effort!
The patch between 1.120 & 1.121 looks quite innocuous.
I see nothing wrong with it.
One tiny tweak won't hurt though:

--- package.d.~1.136.~ 2009-11-13 09:40:45.000000000 -0500
+++ package.d 2010-02-17 13:22:59.000737000 -0500
@@ -3009,10 +3009,9 @@ LISPFUNN(package_iterate,1) {
            shadowing-list of pack),
         2. itself not already present in pack (because in this case
            the accessibility would be :INTERNAL or :EXTERNAL). */
-      var object shadowingsym;
       if (!(eq(Car(PIS(state,FLAGS)),S(Kinherited))
             && (shadowing_lookup(Symbol_name(value2),false,
-                                 PIS(state,PACK),&shadowingsym)
+                                 PIS(state,PACK),NULL)
                 || symtab_find(value2,
                                ThePackage(PIS(state,PACK))->
                                pack_internal_symbols)

try applying this to the broken clisp and see if it fixes it.
thanks.
|
From: <don...@is...> - 2010-02-17 19:59:52
|
Sam Steingold writes:
> One tiny tweak won't hurt though:
>
> --- package.d.~1.136.~ 2009-11-13 09:40:45.000000000 -0500
> +++ package.d 2010-02-17 13:22:59.000737000 -0500
> @@ -3009,10 +3009,9 @@ LISPFUNN(package_iterate,1) {
>            shadowing-list of pack),
>         2. itself not already present in pack (because in this case
>            the accessibility would be :INTERNAL or :EXTERNAL). */
> -      var object shadowingsym;
>        if (!(eq(Car(PIS(state,FLAGS)),S(Kinherited))
>              && (shadowing_lookup(Symbol_name(value2),false,
> -                                 PIS(state,PACK),&shadowingsym)
> +                                 PIS(state,PACK),NULL)
>                  || symtab_find(value2,
>                                 ThePackage(PIS(state,PACK))->
>                                 pack_internal_symbols)
>
> try applying this to the broken clisp and see if it fixes it.

The code I see doesn't seem to quite correspond to yours:

        2. itself not already present in pack (because in this case
           the accessibility would be :INTERNAL or :EXTERNAL). */
      {
        var object shadowingsym;
[I guess you want to delete the line above]
        if (!(eq(Car(TheSvector(state)->data[5]),S(Kinherited))
              && (shadowing_lookup(Symbol_name(value2),false,
                                   TheSvector(state)->data[4],
                                   &shadowingsym)
[I guess you want to change the line above to "NULL)"]
                  || symtab_find(value2,
                                 ThePackage(TheSvector(state)->data[4])->
                                 pack_internal_symbols)
                  || symtab_find(value2,
                                 ThePackage(TheSvector(state)->data[4])->
                                 pack_external_symbols)))) {
          /* Symbol value2 is really accessible. */

Is that correct?
You did look at 1.121, right? This seems to be unrelated to it.
|
From: Sam S. <sd...@gn...> - 2010-02-17 20:05:44
|
Don Cohen wrote: > Sam Steingold writes: > > > One tiny tweak won't hurt though: > > > > --- package.d.~1.136.~ 2009-11-13 09:40:45.000000000 -0500 > > +++ package.d 2010-02-17 13:22:59.000737000 -0500 > > @@ -3009,10 +3009,9 @@ LISPFUNN(package_iterate,1) { > > shadowing-list of pack), > > 2. itself not already present in pack (because in this case > > the accessibility would be :INTERNAL or :EXTERNAL). */ > > - var object shadowingsym; > > if (!(eq(Car(PIS(state,FLAGS)),S(Kinherited)) > > && (shadowing_lookup(Symbol_name(value2),false, > > - PIS(state,PACK),&shadowingsym) > > + PIS(state,PACK),NULL) > > || symtab_find(value2, > > ThePackage(PIS(state,PACK))-> > > pack_internal_symbols) > > > > try applying this to the broken clisp and see if it fixes it. > > The code I see doesn't seem to quite correspond to yours: > 2. itself not already present in pack (because in this case > the accessibility would be :INTERNAL or :EXTERNAL). */ > { > var object shadowingsym; > [I guess you want to delete the line above] > if (!(eq(Car(TheSvector(state)->data[5]),S(Kinherited)) > && (shadowing_lookup(Symbol_name(value2),false, > TheSvector(state)->data[4], > &shadowingsym) > [I guess you want to change the line above to "NULL)"] > || symtab_find(value2, > ThePackage(TheSvector(state)->data[4])-> > pack_internal_symbols) > || symtab_find(value2, > ThePackage(TheSvector(state)->data[4])-> > pack_external_symbols)))) { > /* Symbol value2 is really accessible. */ > > Is that correct? yes. > You did look at 1.121, right? This seems to be unrelated to it. 1.121 made shadowing_lookup into a macro from a function. |
From: <don...@is...> - 2010-02-19 02:21:34
|
Sam Steingold writes: > > > One tiny tweak won't hurt though: I just got the ebadf from 2009-08-15 plus package.d 1.120 plus package.d 1.121 plus tiny tweak. Now back to running 2009-08-15 plus package.d 1.120 which previously ran for a week without error. So until I get an ebadf from that one it appears that the problem comes from package.d 1.121 and is not then fixed by tiny tweak. |
From: Sam S. <sd...@gn...> - 2010-02-23 18:15:38
|
Don Cohen wrote: > Sam Steingold writes: > > > > One tiny tweak won't hurt though: > I just got the ebadf from > 2009-08-15 > plus package.d 1.120 > plus package.d 1.121 > plus tiny tweak. > > Now back to running > 2009-08-15 > plus package.d 1.120 > which previously ran for a week without error. > > So until I get an ebadf from that one it appears that the problem > comes from package.d 1.121 and is not then fixed by tiny tweak. Vladimir, do you have any comment on this? Don, could you please try also the cvs head? maybe by some lucky coincidence the bug has been fixed :-) |
From: Vladimir T. <vtz...@gm...> - 2010-02-23 18:33:52
|
On 2/23/10, Sam Steingold <sd...@gn...> wrote: > Don Cohen wrote: >> Sam Steingold writes: >> > > > One tiny tweak won't hurt though: >> I just got the ebadf from >> 2009-08-15 >> plus package.d 1.120 >> plus package.d 1.121 >> plus tiny tweak. >> >> Now back to running >> 2009-08-15 >> plus package.d 1.120 >> which previously ran for a week without error. >> >> So until I get an ebadf from that one it appears that the problem >> comes from package.d 1.121 and is not then fixed by tiny tweak. > > Vladimir, do you have any comment on this? No. I was following the topic but do not see something in this commit that can cause the problem. |
From: <don...@is...> - 2010-02-23 18:45:33
|
Sam Steingold writes: > Don, could you please try also the cvs head? > maybe by some lucky coincidence the bug has been fixed :-) That turns out to be very easy, since this is one of my nightly build machines. Now running. I'll let you know what happens. |
From: <don...@is...> - 2010-02-24 21:35:44
|
Don Cohen writes:
> Sam Steingold writes:
> > Don, could you please try also the cvs head?
> > maybe by some lucky coincidence the bug has been fixed :-)
> That turns out to be very easy, since this is one of my nightly build
> machines. Now running. I'll let you know what happens.

Here's the answer:

  ...
  [../src/stream.d:13063]
  2010/2/23 12:42:37 ignore error from process-output
  UNIX error 32 (EPIPE): Broken pipe, child process terminated or socket closed

this is normal - even in the version of 8/15, though the line number
there is 13054

  [../src/stream.d:13063]
  2010/2/23 12:42:39 ignore error from process-output
  UNIX error 32 (EPIPE): Broken pipe, child process terminated or socket closed

again normal - above two were probably from the same client process,
which has two connections, I think. It's the same client program as
below, though probably a different incarnation.
Process-output is supposed to be sending output to a client over a
socket, so I guess the error above makes sense. Though I'd prefer if
it could tell me that it came from a closed socket and not from a
child process terminated.

All below seems related to the crash, you see all at the same time.
Also at this time I got a similar error from the (java) client losing
its connection(s).
Note the different line numbers from above

  [../src/stream.d:6195]
  2010/2/24 12:43:18 ignore error from wait

Wait is supposed to wait for activity on any of the connected sockets.
So this error should be coming from socket-status.

  UNIX error 104 (ECONNRESET): Connection reset by peer
  [../src/stream.d:6195]
  [../src/stream.d:4530]
  2010/2/24 12:43:18 ignore error from disconnect-connection

When we think a connection has been lost we do this
 (when (open-stream-p stream) (close stream))
so the error above I think could be coming from either open-stream-p
or close. I think my earlier debugging indicated close. But, of
course one might argue that open-stream-p should also have returned
nil.
But sockets can be half open - what should open-stream-p return then?
I guess open. Clearly close should work whether the stream is open,
closed or half open (of either type).

  UNIX error 9 (EBADF): Bad file number

  UNIX error 9 (EBADF): Bad file number

  Segmentation fault

I don't know whether the two error messages are due to two different
streams or something else.

Let me know if you can think of some other data worth collecting.
If we think the problem is related to this particular client (which I
think it is, since that's likely the only client running at that
moment) I could record packets to see what is actually being sent.
In the mean while I'm back to running the 8/15+large patch version.
|
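On the wish to tell "socket closed" apart from "child process terminated": at the system-call level both cases report the same errno, so the combined wording of the message is about as specific as errno alone allows. The following is a minimal sketch of that, not CLISP code; the function names are made up for illustration, and it uses a local AF_UNIX socket pair instead of a TCP connection so that it stays self-contained.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write one byte to a pipe whose read end is already closed.
   Returns the errno set by write(2), or 0 if the write succeeded. */
int errno_write_to_closed_pipe(void) {
    int p[2];
    signal(SIGPIPE, SIG_IGN);   /* get EPIPE back instead of dying on SIGPIPE */
    if (pipe(p) == -1) return errno;
    close(p[0]);                /* the reader goes away */
    int err = (write(p[1], "x", 1) == -1) ? errno : 0;
    close(p[1]);
    return err;
}

/* Same experiment for a stream socket whose peer has closed. */
int errno_write_to_closed_socket(void) {
    int sv[2];
    signal(SIGPIPE, SIG_IGN);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return errno;
    close(sv[1]);               /* the peer goes away */
    int err = (write(sv[0], "x", 1) == -1) ? errno : 0;
    close(sv[0]);
    return err;
}
```

Both functions are expected to return EPIPE. With a real TCP peer the first write after a reset may instead succeed or report ECONNRESET, so this only illustrates the shared error code, not the exact sequence seen in the transcript.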
From: Sam S. <sd...@gn...> - 2010-02-24 21:54:52
|
Don Cohen wrote: > But sockets can be half open - what should > open-stream-p return then? what is a "half open" socket? |
From: <don...@is...> - 2010-02-24 22:21:11
|
Sam Steingold writes: > Don Cohen wrote: > > But sockets can be half open - what should > > open-stream-p return then? > > what is a "half open" socket? Tcp sockets are really bidirectional streams. You can send a "FIN" to close one direction and the other direction can continue to send until it closes with its own "FIN". I think you can also send data after you get an ACK for your SYN even if it doesn't contain its own SYN. That is A sends SYN to B B replies with ACK (not SYNACK) to A now the stream from A to B is usable |
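The FIN case described above (one direction closed, the other still usable) is commonly called a half-close. A minimal sketch of it at the C level follows; it assumes a local AF_UNIX socket pair as a stand-in for a TCP connection, and the function name is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Half-close demo on a local socket pair.
   Returns 0 if, after sv[0] shuts down its sending direction,
   sv[1] sees end-of-file but can still send data back to sv[0]. */
int half_close_demo(void) {
    int sv[2];
    char buf[8];
    int ok = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;

    /* A stops sending; on a TCP socket this is what emits the FIN. */
    if (shutdown(sv[0], SHUT_WR) == -1) goto done;

    /* B reads end-of-file from A ... */
    if (read(sv[1], buf, sizeof buf) != 0) goto done;

    /* ... but the direction from B to A is still open. */
    if (write(sv[1], "pong", 4) != 4) goto done;
    if (read(sv[0], buf, sizeof buf) != 4) goto done;
    if (memcmp(buf, "pong", 4) != 0) goto done;
    ok = 1;
done:
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

Note that both descriptors remain valid throughout and are each closed exactly once at the end, whatever was shut down in between.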
From: Sam S. <sd...@gn...> - 2010-02-24 23:04:19
|
Don Cohen wrote: > Sam Steingold writes: > > Don Cohen wrote: > > > But sockets can be half open - what should > > > open-stream-p return then? > > > > what is a "half open" socket? > > Tcp sockets are really bidirectional streams. > You can send a "FIN" to close one direction and the other direction > can continue to send until it closes with its own "FIN". > I think you can also send data after you get an ACK for your SYN > even if it doesn't contain its own SYN. > That is > A sends SYN to B > B replies with ACK (not SYNACK) to A > now the stream from A to B is usable isn't this what shutdown(2) does? then this has nothing to do with closing. you can, I believe, shutdown as many times as you like, and you still have to close _once_ |
From: <don...@is...> - 2010-02-24 23:28:01
|
Sam Steingold writes:
> isn't this what shutdown(2) does?
I've never seen that page before, but it does seem to describe what
I was describing from a different point of view.
(Interesting that you seem to understand the middle of this diagram
  lisp -- c -- network
better than the right side and I'm the other way around.)

> then this has nothing to do with closing.
> you can, I believe, shutdown as many times as you like, and you
> still have to close _once_

The end of the transcript I sent says

  [../src/stream.d:6195]
  [../src/stream.d:4530]
  2010/2/24 12:43:18 ignore error from disconnect-connection
  UNIX error 9 (EBADF): Bad file number

  UNIX error 9 (EBADF): Bad file number

  Segmentation fault

The "ignore error from disconnect-connection" comes from this source:

 (ignore-errs "disconnect-connection" (shut-down-connection connection))

 (defmethod shut-down-connection ((connection connection))
   (let ((stream (sstream connection)))
     (when (open-stream-p stream) (close stream))))

I don't know what the relation is between close and shutdown(2),
but the man page does say that shutdown can cause ebadf and the
output above does indicate that just before the first ebadf was
reported we were doing either open-stream-p or close.
I also don't know how the order of unix error messages and the other
messages is related to the order in which they happen. Perhaps the
two ebadf errors come from places earlier than they appear in the
transcript.
I guess it would be useful to know how stream.d 6195 and 4530 are
related to these or other functions...
|
From: Sam S. <sd...@gn...> - 2010-02-24 23:49:23
|
Don Cohen wrote:
> Sam Steingold writes:
> > isn't this what shutdown(2) does?
> I've never seen that page before, but it does seem to describe what
> I was describing from a different point of view.
> (Interesting that you seem to understand the middle of this diagram
> lisp -- c -- network
> better than the right side and I'm the other way around.)

;-)

> > then this has nothing to do with closing.
> > you can, I believe, shutdown as many times as you like, and you
> > still have to close _once_
>
> The end of the transcript I sent says
> [../src/stream.d:6195]

low_fill_buffered_handle
this is read(2)

> [../src/stream.d:4530]

low_close_handle
this is close(2)

> I don't know what the relation is between close and shutdown(2),

close is an "os-level" function: it removes the fd from the table.
shutdown is a "protocol-level" function: it sends FIN down the pipe.

> but the man page does say that shutdown can cause ebadf

only if the fd is already bad.
i.e., a successful shutdown cannot cause ebadf in a future system call.
cf. a successful close which does cause ebadf on any further syscall.
|
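The distinction drawn above can be checked with a few system calls. This is a small sketch under the assumption that a local AF_UNIX socket pair behaves like a connected socket for this purpose; `errno_after_close` is a made-up helper name, not anything from stream.d.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Shut a socket down, close it once, then make one more call on the
   stale descriptor and report the errno of that last call.
   use_shutdown: 0 = call close(2) again, 1 = call shutdown(2).
   Returns -1 if one of the earlier steps fails unexpectedly. */
int errno_after_close(int use_shutdown) {
    int sv[2];
    int err = -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;

    int ok = shutdown(sv[0], SHUT_RDWR) == 0   /* protocol-level */
          && fcntl(sv[0], F_GETFD) != -1       /* fd is still valid */
          && close(sv[0]) == 0;                /* OS-level: frees the fd */
    if (ok) {
        int r = use_shutdown ? shutdown(sv[0], SHUT_RDWR) : close(sv[0]);
        err = (r == -1) ? errno : 0;
    }
    close(sv[1]);
    return err;
}
```

Both `errno_after_close(0)` and `errno_after_close(1)` are expected to yield EBADF: the successful shutdown leaves the descriptor usable, while the successful close makes every later call on that number fail.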
From: Sam S. <sd...@gn...> - 2010-02-24 23:54:34
|
I think we need a new theory of when ebadf can cause a segfault.

  char t1[] = "/tmp/clisp-x-io-XXXXXX";
  char t2[] = "/tmp/clisp-x-io-XXXXXX";
  char t3[] = "/tmp/clisp-x-io-XXXXXX";
  int fd1 = mkstemp(t1);
  int fd2 = mkstemp(t2);
  printf("==MKSTEMP==\n%s == %d\n%s == %d\n",t1,fd1,t2,fd2);
  if (fd1 != -1) close(fd1);
  if (close(fd1)) perror("close-1");
  if (close(fd1)) perror("close-2");
  if (close(fd1)) perror("close-3");
  if (close(fd1)) perror("close-4");
  if (close(fd1)) perror("close-5");
  if (fd2 != -1) close(fd2);
  fd1 = mkstemp(t3);
  printf("%s == %d\n",t3,fd1);
  if (fd1 != -1) close(fd1);
  if (remove(t1)) perror(t1);
  if (remove(t2)) perror(t2);
  if (remove(t3)) perror(t3);
  if (close(fd1)) perror("close1");
  if (close(fd2)) perror("close2");
  if (close(fd1)) perror("close3");
  if (close(fd2)) perror("close4");
  if (close(fd1)) perror("close5");
  if (close(fd2)) perror("close6");
  if (close(fd1)) perror("close7");
  if (close(fd2)) perror("close8");

==MKSTEMP==
/tmp/clisp-x-io-yckWIX == 3
/tmp/clisp-x-io-ETQ2GN == 4
close-1: Bad file descriptor
close-2: Bad file descriptor
close-3: Bad file descriptor
close-4: Bad file descriptor
close-5: Bad file descriptor
/tmp/clisp-x-io-49ndFD == 3
close1: Bad file descriptor
close2: Bad file descriptor
close3: Bad file descriptor
close4: Bad file descriptor
close5: Bad file descriptor
close6: Bad file descriptor
close7: Bad file descriptor
close8: Bad file descriptor

as you can see, repeated ebadf does not cause a segfault.
|
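The output above also shows, in passing, that descriptor 3 is handed out again by the third mkstemp once it has been closed. A repeated close is therefore only guaranteed to be a harmless EBADF while nothing else has reopened that number in between. The sketch below illustrates that case; it is not a diagnosis of the crash in this thread, the function name is made up, and it assumes /dev/null can be opened.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a stale second close on an old descriptor number ends up
   closing a newer file that reused the number, 0 if the number was not
   reused, and -1 if /dev/null cannot be opened. */
int stale_close_hits_new_file(void) {
    int old = open("/dev/null", O_RDONLY);
    if (old == -1) return -1;
    close(old);                               /* the legitimate close */

    int fresh = open("/dev/null", O_RDONLY);  /* lowest free number: same as old */
    if (fresh == -1) return -1;
    if (fresh != old) { close(fresh); return 0; }

    close(old);                               /* stale close: succeeds, no EBADF */
    return fcntl(fresh, F_GETFD) == -1;       /* the new file is now gone */
}
```

In a single-threaded run this is expected to return 1, because open(2) returns the lowest available descriptor number. The stale close reports no error at all, and it is the next user of `fresh` that sees EBADF.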
From: <don...@is...> - 2010-02-25 08:57:56
|
Sam Steingold writes: > I think we need a new theory of when ebadf can cause a segfault. I didn't expect that ebadf would directly cause a segfault. If it did then we would never see two of them, which does seem to be the norm before a crash. And I've seen a lot of them appear as single events with no crash. But none before 8/15. So something related to the changes c.8/20 is causing both ebadf's and segfaults. As I recall, the first time I reported this I had managed to get a break on the ebadf and then when I tried to close the stream I got the segfault. I'd have hoped that lisp close would not generate errors when trying to close a stream that was already closed, or at least only generate lisp errors, not OS errors. Is it possible that when the connection is lost, the stream is now closed (but was not before), and that close is not aware of that? So that close then tries to close it and is not expecting (and does not properly handle) the resulting ebadf ? |
From: <don...@is...> - 2010-02-25 19:18:03
|
Sam Steingold writes:
> I am not sure what your questions are.

I was only suggesting a plausible (from my naive viewpoint) mechanism for generating the observed symptoms.

> 1. if the lisp stream object is already marked as closed, do
> nothing, return immediately.

One issue is whether there could ever be a difference between what streams are "marked as closed" and what streams are "really" closed. By which I mean some disagreement between clisp and the OS about whether a stream is closed.

> 2. flush buffers (see close_buffered, close_ochannel, close_ichannel)

Below you say that this can cause errors that are not correctly handled but it sounds like the problem is only when abort is true. I don't think I'm using with-open-stream to create network streams, and I'm pretty sure that it's the network streams that are causing the errors.

> 3. call close(2) (see low_close_handle); if the abort argument is
> non-NIL, ignore errors from close(2).

What sort of errors could occur and what would be the result if abort is nil?

> 4. mark the lisp stream object as closed; remove it from the list
> of open streams.
>
> normally abort is NIL (unless you are using with-open-stream, see
> macroexpand), so if close(2) fails, lisp CLOSE will _not_ mark the
> stream as closed and will not remove it from the list of open
> streams, i.e., if the lisp stream object is abandoned, then GC will
> close it before collecting (i.e., call builtin_stream_close with

But this close would probably produce the same errors as before. So we end up with a stream that lisp thinks is closed (and can be GC'd) but the OS thinks it's still open? Maybe that does no harm, since it's only an integer fh that, as far as lisp can see, simply never gets allocated again.

> abort=true - thus not printing the ebadf message, see
> stream.d:close_some_files, called from spvw_garcol.d:gar_col_done).
>
> so, after you call CLOSE on a lisp stream object with a bad FD, you
> see the EBADF error and the lisp stream object is NOT marked
> closed. Whenever you do something else with it, you are probably
> getting an ebadf, until you either call (CLOSE :ABORT) on it or
> abandon the object, it is GCed, and GC calls (CLOSE :ABORT) on it.

I think the segfault occurred when I called close (without abort) the first time, and it seems likely (to me) that this is also how all the other segfaults arise.

> however, it just occurred to me that the buffer flushing (see "2."
> above) does not know about :ABORT, so, in fact, this seems to be a
> way to break the system: if the FD is bad, buffer flushing will
> keep raising errors, including in GC (which could crash the system,
> I guess).

I don't follow how this breaks things.

Sam Steingold writes:
> Sam Steingold wrote:
> (gdb) br builtin_stream_close
> (gdb) run
> clisp> (setq s (socket-connect 21 "ftp.gnu.org" :buffered t))
> clisp> (write-char #\a s)
> clisp> (close s)
>
> now we stop in builtin_stream_close and close the FD before clisp
> does its thing:

This corresponds to my theory that the broken connection was causing the fd to be closed. Perhaps that's what I've not been adequately explaining to you. Is it possible that when the OS detects that a tcp connection is broken, it closes the fd? So at that point lisp thinks that it's open and the OS thinks otherwise?

> now the stream is no longer referenced and can be GCed.
>
> clisp> (gc)
> clisp> (gc)
> clisp> (gc)
> ** - UNIX error 9 (EBADF): Bad file number
>
> an error in GC!!!

cause it's trying to close the stream, I gather?

> this is, of course, no good.
> however, this is a far cry from a segfault.
> maybe if this happens under MT this could cause a segfault?
|
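The four CLOSE steps quoted in this message, and the way "marked as closed" and "really closed" can drift apart, can be sketched in C. This is a toy model under stated assumptions and not CLISP's code: struct toy_stream and toy_close are hypothetical names, buffer flushing (step 2) is left out, and "remove it from the list of open streams" is reduced to setting a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for a lisp stream object; not CLISP's layout. */
struct toy_stream {
    int  fd;      /* descriptor as the OS knows it */
    bool closed;  /* what the lisp side believes */
};

/* Returns 0 on success, or the errno from close(2) when abort_p is false. */
int toy_close(struct toy_stream *s, bool abort_p) {
    assert(s != NULL);
    if (s->closed)
        return 0;                 /* step 1: already marked closed */
    /* step 2 (flushing buffers) is omitted in this sketch */
    if (close(s->fd) != 0 && !abort_p)
        return errno;             /* step 3: error reported, stream stays "open" */
    s->closed = true;             /* step 4: mark the stream closed */
    return 0;
}
```

If the descriptor is closed behind the object's back, as in the gdb experiment, toy_close(&s, false) reports EBADF and leaves s.closed false, so every later attempt hits the same error. Only a call with abort_p set to true, the analogue of (CLOSE :ABORT), gets the object into the closed state.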
From: Sam S. <sd...@gn...> - 2010-02-25 19:38:46
|
Don Cohen wrote:
> Sam Steingold writes:
>
> > 1. if the lisp stream object is already marked as closed, do
> > nothing, return immediately.
> One issue is whether there could ever be a difference between what
> streams are "marked as closed" and what streams are "really" closed.
> By which I mean some disagreement between clisp and the OS about
> whether a stream is closed.

if CLOSE signals an error, then the underlying FD _might_ have been closed, but the lisp stream object is not marked as closed. CLOSE can signal errors if some buffer flushing operation fails (unless :ABORT is T, in which case all errors are ignored).

> > 2. flush buffers (see close_buffered, close_ochannel, close_ichannel)
> Below you say that this can cause errors that are not correctly handled
> but it sounds like the problem is only when abort is true.

what does "correctly handled" mean in this context?
ABORT T: all errors are ignored
ABORT NIL: debugger is invoked on the first error, all further processing is abandoned.

> I don't think I'm using with-open-stream to create network streams,
> and I'm pretty sure that it's the network streams that are causing
> the errors.

OK, so you never call CLOSE :ABORT T, right?

> > 3. call close(2) (see low_close_handle); if the abort argument is
> > non-NIL, ignore errors from close(2).
> What sort of errors could occur and what would be the result if abort
> is nil?

see the close(2) man page as to what errors can happen.
if ABORT=T, these errors are silently ignored.
if ABORT=NIL, lisp function ERROR is called.

> > 4. mark the lisp stream object as closed; remove it from the list
> > of open streams.
> >
> > normally abort is NIL (unless you are using with-open-stream, see
> > macroexpand), so if close(2) fails, lisp CLOSE will _not_ mark the
> > stream as closed and will not remove it from the list of open
> > streams, i.e., if the lisp stream object is abandoned, then GC will
> > close it before collecting (i.e., call builtin_stream_close with
> But this close would probably produce the same errors as before.
> So we end up with a stream that lisp thinks is closed (and can be
> GC'd) but the OS thinks it's still open?
> Maybe that does no harm, since it's only an integer fh that, as far as
> lisp can see, simply never gets allocated again.

if you do not use with-open-stream, then, indeed, if CLOSE fails, you may end up with a lisp stream object which is considered open by lisp but has a closed FD under the hood.

> > abort=true - thus not printing the ebadf message, see
> > stream.d:close_some_files, called from spvw_garcol.d:gar_col_done).
> >
> > so, after you call CLOSE on a lisp stream object with a bad FD, you
> > see the EBADF error and the lisp stream object is NOT marked
> > closed. Whenever you do something else with it, you are probably
> > getting an ebadf, until you either call (CLOSE :ABORT) on it or
> > abandon the object, it is GCed, and GC calls (CLOSE :ABORT) on it.
> I think the segfault occurred when I called close (without abort)
> the first time, and it seems likely (to me) that this is also how all
> the other segfaults arise.

so, the fundamental question now is - how come you have bad FDs?

> > (gdb) br builtin_stream_close
> > (gdb) run
> > clisp> (setq s (socket-connect 21 "ftp.gnu.org" :buffered t))
> > clisp> (write-char #\a s)
> > clisp> (close s)
> >
> > now we stop in builtin_stream_close and close the FD before clisp
> > does its thing:
>
> This corresponds to my theory that the broken connection was
> causing the fd to be closed. Perhaps that's what I've not been
> adequately explaining to you. Is it possible that when the OS
> detects that a tcp connection is broken, it closes the fd?

I have no idea. I think it is best to ask this on a dedicated unix forum. Maybe Fred knows that off hand?

> So at that point lisp thinks that it's open and the OS thinks
> otherwise?

if, e.g., shutdown(2) in both directions calls close(2), then, yes, lisp will think the FD is open, but OS will think it is closed.

> > now the stream is no longer referenced and can be GCed.
> >
> > clisp> (gc)
> > clisp> (gc)
> > clisp> (gc)
> > ** - UNIX error 9 (EBADF): Bad file number
> >
> > an error in GC!!!
> cause it's trying to close the stream, I gather?

yes, when a stream object is GCed, it is automatically closed.
|
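The open question in this exchange, whether the OS itself invalidates a descriptor when the peer goes away or when shutdown(2) is called in both directions, can be probed with a short C program. The sketch below is only that: it assumes a POSIX system, it uses a local socketpair(2) instead of a real TCP connection, and the names fd_is_open and fd_survives_peer_close_and_shutdown are hypothetical. As far as POSIX describes it, a descriptor stays allocated until the process itself calls close(2), so the expectation is that the descriptor remains valid in both cases.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd still names an open descriptor, 0 otherwise. */
static int fd_is_open(int fd) {
    return fcntl(fd, F_GETFD) != -1;
}

/* Returns 1 if our end stays open after the peer disappears and after
   shutdown(2) in both directions, and is gone only after close(2). */
int fd_survives_peer_close_and_shutdown(void) {
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    close(sv[1]);                          /* the peer goes away */
    int open_after_peer_close = fd_is_open(sv[0]);

    shutdown(sv[0], SHUT_RDWR);            /* shut down both directions */
    int open_after_shutdown = fd_is_open(sv[0]);

    close(sv[0]);                          /* explicit close by the process */
    int open_after_close = fd_is_open(sv[0]);

    return open_after_peer_close && open_after_shutdown && !open_after_close;
}
```

If this returns 1 on the affected machine as well, a broken connection alone would not explain the bad FDs, and the more likely source is some code path in the process closing the descriptor before the lisp stream object is marked closed.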
From: <don...@is...> - 2010-02-25 20:00:49
|
Sam Steingold writes:
> Sam Steingold wrote:
> I just fixed that.
> please apply the last 2 patches to stream.d to your "bad" clisp and
> see if your problem goes away.

I guess you mean stream 671,672. And bad means 2009/8/15 plus two patches + your tiny patch. Do you expect such patches to apply cleanly to old versions of stream? Against all expectations,

$ patch < /tmp/diff-2010-2-25
patching file ChangeLog
patching file stream.d
Hunk #1 succeeded at 4505 (offset 2 lines).
Hunk #3 succeeded at 4510 (offset 2 lines).
Hunk #5 succeeded at 4524 (offset 2 lines).
Hunk #6 succeeded at 5431 (offset -1 lines).
Hunk #7 succeeded at 5857 (offset 2 lines).
Hunk #8 succeeded at 5860 (offset -1 lines).
Hunk #9 succeeded at 8350 (offset -2 lines).
Hunk #10 succeeded at 8363 (offset -1 lines).
Hunk #11 succeeded at 15873 (offset -87 lines).
Hunk #12 succeeded at 16027 (offset -1 lines).
Hunk #13 succeeded at 16053 (offset -87 lines).
Hunk #14 succeeded at 17062 (offset 2 lines).
Hunk #15 succeeded at 16976 (offset -87 lines).
Hunk #16 succeeded at 16979 (offset -87 lines).
Hunk #17 succeeded at 16995 (offset -87 lines).
Hunk #18 succeeded at 17086 (offset 2 lines).
Hunk #19 succeeded at 17057 (offset -33 lines).
Hunk #20 succeeded at 17067 (offset -33 lines).
Hunk #21 succeeded at 17098 (offset -33 lines).
Hunk #22 succeeded at 17129 (offset -33 lines).

I then rm the build dir, ./configure (seems to work) and make =>

gcc -I/home/clisp-build/cvs-date/2009-08-15-patch2/clisp/build-mt-no/gllib -g -O2 -W -Wswitch -Wcomment -Wpointer-arith -Wimplicit -Wreturn-type -Wmissing-declarations -Wno-sign-compare -Wno-format-nonliteral -falign-functions=4 -pthread -g -O0 -DDEBUG_OS_ERROR -DDEBUG_SPVW -DDEBUG_BYTECODE -DSAFETY=3 -DUNICODE -DMULTITHREAD -DPOSIX_THREADS -DDYNAMIC_FFI -I. -c stream.c
In file included from ../src/stream.d:10:
../src/lispbibl.d:9060: warning: volatile register variables don't work as you might wish
../src/stream.d: In function 'stream_lend_handle':
../src/stream.d:16980: error: case label not within a switch statement
../src/stream.d:16983: error: case label not within a switch statement
../src/stream.d: In function 'C_read_byte':
../src/stream.d:17108: warning: no previous declaration for 'C_read_byte_lookahead'
../src/stream.d:17121: warning: no previous declaration for 'C_read_byte_will_hang_p'
../src/stream.d:17128: warning: no previous declaration for 'C_read_byte_no_hang'
../src/stream.d: In function 'C_read_byte_no_hang':
../src/stream.d:17151: warning: no previous declaration for 'C_read_integer'
../src/stream.d:17191: warning: no previous declaration for 'C_read_float'
../src/stream.d:17237: warning: no previous declaration for 'C_write_byte'
../src/stream.d:17248: warning: no previous declaration for 'C_write_integer'
../src/stream.d:17284: warning: no previous declaration for 'C_write_float'
../src/stream.d:17356: error: invalid storage class for function 'check_open_file_stream'
../src/stream.d:17356: warning: no previous declaration for 'check_open_file_stream'
../src/stream.d:17394: warning: no previous declaration for 'open_file_stream_handle'
../src/stream.d:17407: warning: no previous declaration for 'handle_length'
../src/stream.d:17422: warning: no previous declaration for 'C_file_position'
../src/stream.d:17621: warning: no previous declaration for 'C_file_length'
../src/stream.d:17650: warning: no previous declaration for 'C_file_string_length'
../src/stream.d:17781: error: invalid storage class for function 'stream_isbuffered_low'
../src/stream.d:17781: warning: no previous declaration for 'stream_isbuffered_low'
../src/stream.d:17817: warning: no previous declaration for 'stream_isbuffered'
../src/stream.d:17824: warning: no previous declaration for 'stream_line_number'
../src/stream.d:17834: warning: no previous declaration for 'C_line_number'
../src/stream.d:17849: warning: no previous declaration for 'stream_get_fasl'
../src/stream.d:17870: warning: no previous declaration for 'stream_set_fasl'
../src/stream.d:17890: warning: no previous declaration for 'C_stream_fasl_p'
../src/stream.d:17907: warning: no previous declaration for 'C_defgray'
../src/stream.d:17928: error: syntax error at end of input
make: *** [stream.o] Error 1

Maybe it would make more sense to test current cvs if we could solve the latest build problem.
|