Thread: [Sbcl-devel] Argh. (progress, or lack thereof, on sbcl-0.6.12.8)

[Sbcl-devel] Argh. (progress, or lack thereof, on sbcl-0.6.12.8)

From: William H. N. <wil...@ai...> - 2001-05-14 22:34:13

Despite the fact that I haven't been checking anything in, I've
actually been working on SBCL. I started some days ago by trying to
make the Alpha patched version work on OpenBSD. Today, some ten hours
of work later, I lowered my sights to making it work reliably on
Linux. So far it's still not done.

Progress would have been faster if I hadn't stumbled across two
different ways to hang the OpenBSD kernel. Why me? And why now?
OpenBSD (2.6 and 2.7) were rock-solid for me and for SBCL, despite
being put through much more abuse than this.. Around June 1, OpenBSD
2.9 will come out, and I'll try it and decide whether to use it or
just go back to Debian.:-|

I fixed -- I think -- problems with current_dynamic_space being
uninitialized (but still used..) when GENCGC is defined. But for some
time I've been unable to fix another pair of problems, first that the
system sometimes fails a GC assertion because an alloc_region isn't
reset when it's expected to be reset, and second that if that doesn't
happen, the system fails with a SIGINT when it tries to load a core
file that it's written.

I've been having all kinds of fun wandering through the GENCGC code
trying to figure out what it does, enjoying the way that a "reset"
state is referred to as an "empty" state elsewhere, and that
gc_alloc_large() actually means gc_alloc_possibly_large(), and the
boxedness or unboxedness of an alloc_region is logically associated
with the alloc_region but physically passed around separately from the
alloc_region, and so forth. It's marvellous, actually: generational GC
is after all so simple and straightforward that it might be boring,
too easy to understand, if it weren't for all these nice little
creative touches..
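
As a rough illustration of the boxedness complaint, here is a minimal sketch of the alternative, in which the region itself records what kind of objects it holds. Everything in it is hypothetical (struct region, region_alloc, region_needs_scan); it is not the layout or the API used in gencgc.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-simplified region: a bump pointer, a limit, and
 * (the point of the sketch) a record of which kind of objects it holds. */
struct region {
    char *free_pointer;
    char *end_addr;
    bool unboxed;   /* true: raw data only, nothing for the GC to scan */
};

/* Bump-allocate nbytes from the region, or return NULL if it is full. */
static void *region_alloc(struct region *r, size_t nbytes)
{
    assert(r->free_pointer <= r->end_addr);
    if ((size_t)(r->end_addr - r->free_pointer) < nbytes)
        return NULL;
    void *p = r->free_pointer;
    r->free_pointer += nbytes;
    return p;
}

/* Code that cares about boxedness asks the region, so a caller cannot
 * pass a flag that disagrees with the region it is passing. */
static bool region_needs_scan(const struct region *r)
{
    return !r->unboxed;
}
```

The design choice is only that the flag travels with the data it describes; whether that fits the real allocator is a separate question.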

Anyway, hopefully this will take only a finite amount of time to fix
and then I'll be able to go back to less obnoxious problems. Or
failing that I'll go back to a version which worked and make changes
in smaller steps..

-- 
William Harold Newman <wil...@ai...>
"Tweak alpha so it sends SIGBUS for unaligned access, and does NOT
do a fixup. This encourages people to fix their code." -- a commit
note from <http://www.OpenBSD.org/plus29.html>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C  B9 25 FB EE E0 C3 E5 7C

Re: [Sbcl-devel] Argh. (progress, or lack thereof, on sbcl-0.6.12.8)

From: Daniel B. <da...@te...> - 2001-05-15 08:13:37

William Harold Newman <wil...@ai...> writes:

> I fixed -- I think -- problems with current_dynamic_space being
> uninitialized (but still used..) when GENCGC is defined. But for some
> time I've been unable to fix another pair of problems, first that the
> system sometimes fails a GC assertion because an alloc_region isn't
> reset when it's expected to be reset, and second that if that doesn't
> happen, the system fails with a SIGINT when it tries to load a core
> file that it's written.

Ew.  Is this a problem that can be reproduced in the snapshot
(i.e. was it my fault?)

If so and if you have a reasonable way of reproducing it, I can have a
look for anything I might have done that could be causing it.

-dan

-- 

  http://ww.telent.net/cliki/ - Link farm for free CL-on-Unix resources

[Sbcl-devel] Argh, continued. (FLAKY branch, finding fault, unmain.html..)

From: William H. N. <wil...@ai...> - 2001-05-15 14:17:39

I've now gotten the system to build and sort of run on both Linux and
OpenBSD by turning off PURIFY. In the build process, after dumping the
system at the end of warm init, there's sometimes an error, but it
seems to occur only after the system is already written.

The result doesn't pass the regression tests, though -- it gets a
little ways, but then starts to just spin for many CPU minutes.

I've started a flaky1 branch to store the current state in CVS. This
branch is basically just intended for my own use, so that I can do
things like "cvs diff" back to it, and so that I can experiment with
working with CVS branches. But in this case, since it's been a while
since I've been able to make the system stable enough to check in to
the main development branch, it's also the only way that anyone with a
morbid interest in it can check what I'm doing.

Very likely flakyxxx branches, or some similar device, will be a
recurring theme in the CVS tree in the future, because there have been
other times that I wanted to do a CVS checkin but held off because the
system was still too broken to inflict on anyone else. I'm also
tentatively thinking of making a stable_0_6 branch for the dying
embers of 0.6.x once the conflagration that is 0.7.x starts, but I'll
burn that bridge when I come to it.

Incidentally, if anyone thinks I'm doing something less than optimal
in the CVS admin, you're probably right, so please speak right up,
since I'm pretty unfamiliar with any but the really basic features.

On Mon, 14 May 2001 15:48:29 -0700, Michael Vanier wrote:
> Have you seen this?
>
> http://mindprod.com/unmain.html
No, I hadn't. Thank you. I had a good, if somewhat crazed, laugh.

On Tue, May 15, 2001 at 09:12:45AM +0100, Daniel Barlow wrote:
> William Harold Newman <wil...@ai...> writes:
> 
> > I fixed -- I think -- problems with current_dynamic_space being
> > uninitialized (but still used..) when GENCGC is defined. But for some
> > time I've been unable to fix another pair of problems, first that the
> > system sometimes fails a GC assertion because an alloc_region isn't
> > reset when it's expected to be reset, and second that if that doesn't
> > happen, the system fails with a SIGINT when it tries to load a core
> > file that it's written.
> 
> Ew.  Is this a problem that can be reproduced in the snapshot
> (i.e. was it my fault?)

The "sometimes fails a GC assertion" problem doesn't seem to be all
that reproducible even in my build. I'm guessing that it may depend
on the exact size of the src/runtime/runtime executable. In the 
version in flaky1, it happens in OpenBSD
  T
  * 
  NIL
  * 
  NIL
  * [undoing binding stack... Argh! alloc_region not reset in gc_alloc_new_region()
  alloc_region *0x459ba4:
    first_page=0x00000000, last_page=0xffffffff,
    start_addr=0x48000000, free_pointer=0x49becaa0, end_addr=0x48000000
  fatal error encountered in SBCL runtime system
  done]
  [saving current Lisp image into output/sbcl.core:
  writing 1688(0x698) bytes from the read-only(3) space at 0x10000000
  writing 1424(0x590) bytes from the static(2) space at 0x28000000
  writing 106274816(0x655a000) bytes from the dynamic(1) space at 0x48000000
  done]
  LDB monitor
  ldb> 
(where the funny order is probably because this output was 
collected with "2>&1 | tee make.tmp", so that stderr and stdout get
a bit mixed up) but it doesn't happen on Linux. A few builds ago, 
though, something similar (another "alloc_region not reset" assertion
failure) happened on Linux too, and I've basically just added assertions
and refactored code since then, nothing which should've truly fixed
the problem.
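
The dump above hints at what the failed check is looking for. Here is a minimal sketch, assuming (this is a guess from the printed values, not a quote from gencgc.c) that a "reset" region has first_page equal to 0, last_page equal to -1 (which prints as 0xffffffff on a 32-bit machine), and free_pointer equal to end_addr. The field names come from the dump; the field types and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the runtime's alloc_region. Field names are
 * taken from the error dump; the types are assumptions. */
struct alloc_region {
    long first_page;
    long last_page;
    char *start_addr;
    char *free_pointer;
    char *end_addr;
};

/* Assumed meaning of "reset": no pages claimed, no free space left. */
static bool region_is_reset(const struct alloc_region *r)
{
    return r->first_page == 0
        && r->last_page == -1
        && r->free_pointer == r->end_addr;
}

/* Put a region into the assumed reset state, anchored at base. */
static void region_reset(struct alloc_region *r, char *base)
{
    r->first_page = 0;
    r->last_page = -1;
    r->start_addr = base;
    r->free_pointer = base;
    r->end_addr = base;
    assert(region_is_reset(r));
}

/* Say which conditions are violated; returns the number of violations. */
static int region_report(const struct alloc_region *r, FILE *out)
{
    int bad = 0;
    if (r->first_page != 0) {
        fprintf(out, "first_page=%ld, expected 0\n", r->first_page);
        bad++;
    }
    if (r->last_page != -1) {
        fprintf(out, "last_page=%ld, expected -1\n", r->last_page);
        bad++;
    }
    if (r->free_pointer != r->end_addr) {
        fprintf(out, "free_pointer=%p != end_addr=%p\n",
                (void *)r->free_pointer, (void *)r->end_addr);
        bad++;
    }
    return bad;
}
```

Under these assumptions, the values in the dump violate only the last condition: free_pointer (0x49becaa0) differs from end_addr (0x48000000), while first_page and last_page already look reset.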

I suspect that at least some of the current problem is partly "your
fault", perhaps because of changes you made in the way that
current_region_free_pointer is used. But the behavior of
current_region_free_pointer and SymbolValue(ALLOCATION_POINTER) in
gencgc.c seems more bizarre than you might reasonably have expected,
so much so that interacting with it while trying to unscrew the
all-the-world's-an-x86 simplifications I made when first setting up
SBCL is asking for trouble. And I can certainly understand testing the
system, seeing that it works, and thinking it's OK.

I do think causing the GENCGC code to use current_dynamic_space
without initializing it was a bad idea, though:
  +#ifndef GENCGC
  +	current_dynamic_space = DYNAMIC_0_SPACE_START;
   #endif
and then all the stuff which used to use DYNAMIC_SPACE_START using
current_dynamic_space instead. But then it did work somehow in
sbcl-0.6.12.7, so maybe I'm confused; and anyway, again, it's
basically pretty reasonable to test the system, see that it works, and
send it in. In general I really like specific tests accompanying
patches, but unfortunately it's pretty hard to test this kind of GC
stuff much more than "I built it, and did some things which put some
stress on the GC, and it still worked". So I don't know what more I
can ask for, other than you never making any mistakes.:-|
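
To make the hazard concrete, here is a minimal sketch of one way to keep current_dynamic_space from being read before it is set. It is not necessarily the change that was checked in. The addresses are placeholders borrowed from the build output above, and init_dynamic_space() and dynamic_space_start() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder addresses for illustration; 0x48000000 is the dynamic space
 * address shown in the build output, the real headers may differ. */
#define DYNAMIC_SPACE_START   ((uintptr_t)0x48000000)
#define DYNAMIC_0_SPACE_START ((uintptr_t)0x48000000)

/* File-scope variable, so it is zero until something assigns it. */
static uintptr_t current_dynamic_space;

/* Initialize in both configurations, rather than only under
 * #ifndef GENCGC as in the quoted patch fragment. */
static void init_dynamic_space(void)
{
#ifdef GENCGC
    current_dynamic_space = DYNAMIC_SPACE_START;
#else
    current_dynamic_space = DYNAMIC_0_SPACE_START;
#endif
}

/* Accessor that fails loudly instead of silently handing back zero. */
static uintptr_t dynamic_space_start(void)
{
    assert(current_dynamic_space != 0);
    return current_dynamic_space;
}
```

Routing reads through an accessor like this turns "uninitialized but still used" into an immediate assertion failure in either build, which is about as close to a specific test as this kind of change allows.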

> If so and if you have a reasonable way of reproducing it, I can have a
> look for anything I might have done that could be causing it.

If you (Dan, or anyone else, especially Alpha users) want to take a
look at the flaky1 branch, you can:
  $ cvs $whatever_magic_you_use_to_get_to_sourceforge checkout -rflaky1 sbcl
I think. I hope I'll be able to fix the x86 problems soon, so with luck
it won't be worth trying to debug it from your end. However, I would
be interested in knowing whether it still builds on an Alpha, since I
don't have any good way to test whether I've made a mistake which
affects the non-GENCGC code.

Also, I'm accumulating a list of cleanups that I'd like to make in the
GC code once things are more stable again, and I'll want to get your
(Dan, or any other Alpha hackers) opinion of them, to try to make sure
that I don't mess up the non-GENCGC world too badly.

-- 
William Harold Newman <wil...@ai...>
"To foil the maintenance programmer, you have to understand how he
thinks." -- <http://mindprod.com/unmain.html>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C  B9 25 FB EE E0 C3 E5 7C

[Sbcl-devel] CVS logs after branching (in sbcl-0.6.12.8)

From: William H. N. <wil...@ai...> - 2001-05-19 01:20:57

(I just checked in sbcl-0.6.12.8, which is the result of merging the
flaky1 branch back into the main branch now that it passes its tests
and rebuilds itself, at least under Linux. OpenBSD is probably still
broken.)

On Tue, May 15, 2001 at 07:09:31PM +0100, Daniel Barlow wrote:
> William Harold Newman <wil...@ai...> writes:
> 
> > Incidentally, if anyone thinks I'm doing something less than optimal
> > in the CVS admin, you're probably right, so please speak right up,
> > since I'm pretty unfamiliar with any but the really basic features.
> 
> My experience has been that it's generally better not to develop on a
> branch if it's avoidable - use the HEAD for developing and the
> branches for "stable releases that might need bugs fixed".  That's
> more of a guideline than a rule, though; it makes "cvs log" and the
> web-based tools a lot more informative.  And means that all the new
> files you create on the branch don't end up in the Attic for most of
> their working life.

I just reviewed the "cvs2cl.pl" output and it looks reasonable to me.
The flaky1 checkins appear as part of the synthesized ChangeLog in
chronological order, with their checkin messages. For the purposes of
reading the ChangeLog, they might as well be on the main branch.

I'm not sure what problems you were referring to, and I can't really
guess from the "cvs log" output. I've hardly ever used bare "cvs log"
at all, or the web-based tools based on it, since I use "cvs2cl.pl"
(from <http://www.red-bean.com/cvs2cl/>) almost exclusively. So I don't
really know what people expect from bare "cvs log". If Dan or anyone
else wants to take a look at the CVS log from the past 10 days or so
and comment on whether it reflects the flaky1 branch, flaky1 changes,
and flaky1 merge in a useful way, I'd be interested to hear from you.

-- 
William Harold Newman <wil...@ai...>
"A TRUE Klingon Warrior does not explain his commits!" 
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C  B9 25 FB EE E0 C3 E5 7C