From: William H. N. <wil...@ai...> - 2001-05-14 22:34:13

Despite the fact that I haven't been checking anything in, I've actually been working on SBCL. I started some days ago by trying to make the Alpha patched version work on OpenBSD. Today, some ten hours of work later, I lowered my sights to making it work reliably on Linux. So far it's still not done.

Progress would have been faster if I hadn't stumbled across two different ways to hang the OpenBSD kernel. Why me? And why now? OpenBSD 2.6 and 2.7 were rock-solid for me and for SBCL, despite being put through much more abuse than this. Around June 1, OpenBSD 2.9 will come out, and I'll try it and decide whether to use it or just go back to Debian. :-|

I fixed -- I think -- problems with current_dynamic_space being uninitialized (but still used..) when GENCGC is defined. But for some time I've been unable to fix another pair of problems: first, that the system sometimes fails a GC assertion because an alloc_region isn't reset when it's expected to be reset, and second, that if that doesn't happen, the system fails with a SIGINT when it tries to load a core file that it's written.

I've been having all kinds of fun wandering through the GENCGC code trying to figure out what it does, enjoying the way that a "reset" state is referred to as an "empty" state elsewhere, and that gc_alloc_large() actually means gc_alloc_possibly_large(), and the boxedness or unboxedness of an alloc_region is logically associated with the alloc_region but physically passed around separately from the alloc_region, and so forth. It's marvellous, actually: generational GC is after all so simple and straightforward that it might be boring, too easy to understand, if it weren't for all these nice little creative touches.

Anyway, hopefully this will take only a finite amount of time to fix and then I'll be able to go back to less obnoxious problems. Or failing that I'll go back to a version which worked and make changes in smaller steps.

-- 
William Harold Newman <wil...@ai...>
"Tweak alpha so it sends SIGBUS for unaligned access, and does NOT do a fixup. This encourages people to fix their code."
-- a commit note from <http://www.OpenBSD.org/plus29.html>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C
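
The current_dynamic_space problem mentioned in this message (a variable left uninitialized, but still used, when GENCGC is defined) can be illustrated with a small C sketch. This is not the SBCL source: the function names init_spaces_buggy and init_spaces_fixed and the address values are hypothetical, and the "fixed" variant shows only one plausible way to close the gap, not necessarily the change that was actually made.

```c
#include <stdint.h>

/* Assumption: build with the generational collector enabled,
 * so the behaviour of this sketch is deterministic. */
#define GENCGC 1

/* Placeholder addresses for illustration only. */
#define DYNAMIC_SPACE_START   ((uintptr_t)0x48000000)
#define DYNAMIC_0_SPACE_START ((uintptr_t)0x48000000)

/* File-scope object: zero until something assigns to it. */
static uintptr_t current_dynamic_space;

/* Hypothetical initializer that only covers the non-GENCGC build.
 * With GENCGC defined, the assignment is compiled out entirely. */
static void init_spaces_buggy(void)
{
#ifndef GENCGC
    current_dynamic_space = DYNAMIC_0_SPACE_START;
#endif
}

/* Hypothetical repair: every configuration assigns a value. */
static void init_spaces_fixed(void)
{
#ifdef GENCGC
    current_dynamic_space = DYNAMIC_SPACE_START;
#else
    current_dynamic_space = DYNAMIC_0_SPACE_START;
#endif
}
```

With GENCGC defined, calling init_spaces_buggy() leaves current_dynamic_space at zero, so any later code that reads it sees a bogus heap base. Calling init_spaces_fixed() gives it a value in both configurations.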
From: Daniel B. <da...@te...> - 2001-05-15 08:13:37

William Harold Newman <wil...@ai...> writes:

> I fixed -- I think -- problems with current_dynamic_space being
> uninitialized (but still used..) when GENCGC is defined. But for some
> time I've been unable to fix another pair of problems, first that the
> system sometimes fails a GC assertion because an alloc_region isn't
> reset when it's expected to be reset, and second that if that doesn't
> happen, the system fails with a SIGINT when it tries to load a core
> file that it's written.

Ew. Is this a problem that can be reproduced in the snapshot (i.e. was it my fault?)

If so and if you have a reasonable way of reproducing it, I can have a look for anything I might have done that could be causing it.

-dan

-- 
http://ww.telent.net/cliki/ - Link farm for free CL-on-Unix resources
From: William H. N. <wil...@ai...> - 2001-05-15 14:17:39

I've now gotten the system to build and sort of run on both Linux and OpenBSD by turning off PURIFY. In the build process, after dumping the system at the end of warm init, there's sometimes an error, but it seems to occur only after the system is already written. The result doesn't pass the regression tests, though -- it gets a little ways, but then starts to just spin for many CPU minutes.

I've started a flaky1 branch to store the current state in CVS. This branch is basically just intended for my own use, so that I can do things like "cvs diff" back to it, and so that I can experiment with working with CVS branches. But in this case, since it's been a while since I've been able to make the system stable enough to check in to the main development branch, it's also the only way that anyone with a morbid interest can check what I'm doing.

Very likely flakyxxx branches, or some similar device, will be a recurring theme in the CVS tree in the future, because there have been other times that I wanted to do a CVS checkin but held off because the system was still too broken to inflict on anyone else. I'm also tentatively thinking of making a stable_0_6 branch for the dying embers of 0.6.x once the conflagration that is 0.7.x starts, but I'll burn that bridge when I come to it.

Incidentally, if anyone thinks I'm doing something less than optimal in the CVS admin, you're probably right, so please speak right up, since I'm pretty unfamiliar with any but the really basic features.

On Mon, 14 May 2001 15:48:29 -0700, Michael Vanier wrote:
> Have you seen this?
>
> http://mindprod.com/unmain.html

No, I hadn't. Thank you. I had a good, if somewhat crazed, laugh.

On Tue, May 15, 2001 at 09:12:45AM +0100, Daniel Barlow wrote:
> William Harold Newman <wil...@ai...> writes:
>
> > I fixed -- I think -- problems with current_dynamic_space being
> > uninitialized (but still used..) when GENCGC is defined. But for some
> > time I've been unable to fix another pair of problems, first that the
> > system sometimes fails a GC assertion because an alloc_region isn't
> > reset when it's expected to be reset, and second that if that doesn't
> > happen, the system fails with a SIGINT when it tries to load a core
> > file that it's written.
>
> Ew. Is this a problem that can be reproduced in the snapshot
> (i.e. was it my fault?)

The "sometimes fails a GC assertion" problem doesn't seem to be all that reproducible even in my build. I'm guessing that it may depend on the exact size of the src/runtime/runtime executable. In the version in flaky1, it happens in OpenBSD

  T
  * NIL
  * NIL
  * [undoing binding stack... Argh! alloc_region not reset in gc_alloc_new_region()
  alloc_region *0x459ba4:
    first_page=0x00000000, last_page=0xffffffff,
    start_addr=0x48000000, free_pointer=0x49becaa0, end_addr=0x48000000
  fatal error encountered in SBCL runtime system
  done]
  [saving current Lisp image into output/sbcl.core:
  writing 1688(0x698) bytes from the read-only(3) space at 0x10000000
  writing 1424(0x590) bytes from the static(2) space at 0x28000000
  writing 106274816(0x655a000) bytes from the dynamic(1) space at 0x48000000
  done]
  LDB monitor
  ldb>

(where the funny order is probably because this output was collected with "2>&1 | tee make.tmp", so that stderr and stdout get a bit mixed up) but it doesn't happen on Linux. A few builds ago, though, something similar (another "alloc_region not reset" assertion failure) happened on Linux too, and I've basically just added assertions and refactored code since then, nothing which should've truly fixed the problem.

I suspect that at least some of the current problem is "your fault", perhaps because of changes you made in the way that current_region_free_pointer is used. But the behavior of current_region_free_pointer and SymbolValue(ALLOCATION_POINTER) in gencgc.c seems more bizarre than you might reasonably have expected, so much so that interacting with it while trying to unscrew the all-the-world's-an-x86 simplifications I made when first setting up SBCL is asking for trouble. And I can certainly understand testing the system, seeing that it works, and thinking it's OK.

I do think causing the GENCGC code to use current_dynamic_space without initializing it was a bad idea, though:

  +#ifndef GENCGC
  + current_dynamic_space = DYNAMIC_0_SPACE_START;
  #endif

and then all the stuff which used to use DYNAMIC_SPACE_START using current_dynamic_space instead. But then it did work somehow in sbcl-0.6.12.7, so maybe I'm confused; and anyway, again, it's basically pretty reasonable to test the system, see that it works, and send it in.

In general I really like specific tests accompanying patches, but unfortunately it's pretty hard to test this kind of GC stuff much more than "I built it, and did some things which put some stress on the GC, and it still worked". So I don't know what more I can ask for, other than you never making any mistakes. :-|

> If so and if you have a reasonable way of reproducing it, I can have a
> look for anything I might have done that could be causing it.

If you (Dan, or anyone else, especially Alpha users) want to take a look at the flaky1 branch, you can:

  $ cvs $whatever_magic_you_use_to_get_to_sourceforge checkout -rflaky1 sbcl

I think.

I hope I'll be able to fix the x86 problems soon, so with luck it won't be worth trying to debug it from your end. However, I would be interested in knowing whether it still builds on an Alpha, since I don't have any good way to test whether I've made a mistake which affects the non-GENCGC code.

Also, I'm accumulating a list of cleanups that I'd like to make in the GC code once things are more stable again, and I'll want to get your (Dan, or any other Alpha hackers) opinion of them, to try to make sure that I don't mess up the non-GENCGC world too badly.

-- 
William Harold Newman <wil...@ai...>
"To foil the maintenance programmer, you have to understand how he thinks." -- <http://mindprod.com/unmain.html>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C
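
The "alloc_region not reset" failure reported in this message can be read against a small C sketch. The struct layout below is hypothetical, modeled only on the five fields printed in the log, and the definition of "reset" (first_page of 0, last_page of -1, free_pointer equal to end_addr) is an assumption made for illustration, not a statement of what gencgc.c actually checks. The helper names region_is_reset and region_reset are likewise invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: one member per field shown in the log. */
struct alloc_region {
    int32_t   first_page;   /* printed as 0x00000000 in the log */
    int32_t   last_page;    /* printed as 0xffffffff, i.e. -1   */
    uintptr_t start_addr;
    uintptr_t free_pointer;
    uintptr_t end_addr;
};

/* Assumed meaning of "reset": no pages claimed and no space in use. */
static bool region_is_reset(const struct alloc_region *r)
{
    return r->first_page == 0
        && r->last_page == -1
        && r->free_pointer == r->end_addr;
}

/* Put a region back into the assumed reset state. */
static void region_reset(struct alloc_region *r, uintptr_t heap_base)
{
    r->first_page   = 0;
    r->last_page    = -1;
    r->start_addr   = heap_base;
    r->free_pointer = heap_base;
    r->end_addr     = heap_base;
}
```

Under these assumptions, the region in the log has the page fields of a reset region, but its free_pointer (0x49becaa0) no longer matches its end_addr (0x48000000), so a check like region_is_reset() would reject it.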
From: William H. N. <wil...@ai...> - 2001-05-19 01:20:57

(I just checked in sbcl-0.6.12.8, which is the result of merging the flaky1 branch back into the main branch now that it passes its tests and rebuilds itself, at least under Linux. OpenBSD is probably still broken.)

On Tue, May 15, 2001 at 07:09:31PM +0100, Daniel Barlow wrote:
> William Harold Newman <wil...@ai...> writes:
>
> > Incidentally, if anyone thinks I'm doing something less than optimal
> > in the CVS admin, you're probably right, so please speak right up,
> > since I'm pretty unfamiliar with any but the really basic features.
>
> My experience has been that it's generally better not to develop on a
> branch if it's avoidable - use the HEAD for developing and the
> branches for "stable releases that might need bugs fixed". That's
> more of a guideline than a rule, though; it makes "cvs log" and the
> web-based tools a lot more informative. And means that all the new
> files you create on the branch don't end up in the Attic for most of
> their working life.

I just reviewed the "cvs2cl.pl" output and it looks reasonable to me. The flaky1 checkins appear as part of the synthesized ChangeLog in chronological order, with their checkin messages. For the purposes of reading the ChangeLog, they might as well be on the main branch.

I'm not sure what problems you were referring to, and I can't really guess from the "cvs log" output. I've hardly ever used bare "cvs log" at all, or the web-based tools based on it, since I use "cvs2cl.pl" (from <http://www.red-bean.com/cvs2cl/>) almost exclusively. So I don't really know what people expect from bare "cvs log".

If Dan or anyone else wants to take a look at the CVS log from the past 10 days or so and comment on whether it reflects the flaky1 branch, flaky1 changes, and flaky1 merge in a useful way, I'd be interested to hear from you.

-- 
William Harold Newman <wil...@ai...>
"A TRUE Klingon Warrior does not explain his commits!"
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C