On 05/14/2010 01:35 PM, Nikodemus Siivola wrote:
On 14 May 2010 11:48, Valentin Pavlov <x.pavlov@gmail.com> wrote:

  
SBCL goes SuperBallistic! I have hacked together an SBCL port for the
IBM BlueGene/P supercomputer's Compute Node Kernel (CNK). This allows
    
This sounds very cool! I at least would love to hear more.

Cheers,

 -- Nikodemus
  

OK, here goes a [not-so-]short story:

As a part-time job, I'm doing system administration for a BlueGene/P supercomputer in the Bulgarian National Supercomputing Centre. The BlueGene/P consists of a front-end node (SLES10 powerpc64) and a bunch (in my case 2048, max 72x1024) of computing nodes, which run IBM's proprietary kernel (Compute Node Kernel, CNK). Each node has 4 PowerPC 450 processors, and the CNK provides a Linux-esque environment by providing access to many Linux syscalls. There is a cross-compiler toolchain (gcc & friends) that generates code for the CNK. Jobs are submitted via a scheduler, and during submission you specify how many CPUs you want. When the machine accepts your job, it boots the CNK on each node and loads and runs your executable. CNK is a single-process kernel; each process can run up to 4 threads (but since sb-thread is not ported to ppc, this is irrelevant). The different copies of your executable can communicate via standard MPI and do something in parallel. On the machine that I have access to, you can run 8192 parallel copies of your program.

Now, I have a parallel quantum computing simulator written in CL and I wanted to give it a shove on the BlueGene. Unfortunately, the machine only has C/C++/Fortran (the cross-compiling gcc toolchain and the IBM XL suite of compilers) and (?) Python (the language). So I decided to try to port SBCL. Here's what happened:
  - first, I compiled sbcl for ppc and installed it on the front-end. This would be my SBCL_XC_HOST;
  - then, I decided that since I would want both regular ppc compilation (for the front-end) and a CNK-targeted compilation, I would control this via an environment variable. So, if I want to compile for the CNK I run 'BGPCNK=1 sh make.sh';
  - I decided to base everything on the ppc-linux arch/os setting and made a copy of various related files in src/runtime;
  - a small change in make-config.sh to put :bgpcnk in the features and select the appropriate arch/os files;
  - I ran make.sh and problems started to appear one by one in make-target-2 in a month-long odyssey, as follows:

  - it turns out that CNK's mmap requires MAP_FIXED, but then only if addr is not null. That was easy;
  - in a signal handler, when you make changes to the context (e.g. arch_skip_next_instruction) nothing happens. It turns out that the context you get passed is a read-only copy, and the real thing that gets copied back to the caller before sigreturn is further down the stack. Moreover, due to cache-line alignment, it is not at a fixed offset from SP, so I have to find it first. Thus, a wrapper around the signal handlers (e.g. SIGTRAP, SIGSEGV, etc.) first finds the offset of the real context (I named it Amber after Zelazny's novels), and then, before returning, copies the contents of the context to Amber. The sigreturn then copies the modified Amber context to the caller. Suddenly, the runtime started to execute cold-sbcl.core and all was OK, until the first garbage collection, when all hell broke (loose);
  - it took me a considerable amount of time to get to know gencgc and hack through it, until I finally realized that CNK's mprotect does nothing -- it's just a stub that returns 'no error'. So, the write barrier is not working and everything goes to hell. I decided to switch to Cheney GC;
  - and after several sleepless nights I understood that Cheney GC also relies on mprotect -- to trigger collection. So, I decided to try the Solaris hack and call os_invalidate on the region after current_auto_gc_trigger. Well, it turns out that this also does nothing and I can read and write to munmapped addresses however I wish. So I decided to hack the allocation macro;
  - I changed the allocation macro to look something like the gencgc allocation -- it first checks whether the new memory address would be above current_auto_gc_trigger and, if so, issues a trap, which triggers the garbage collection, much in the same way as handle_wp_violation. Well, this adds a little overhead to the allocation, but it's no more than the overhead for the gencgc allocation. I decided that I can live with it;
  - next, it turns out that CNK's select call does not work for stdin (and for that matter, for stdout, too). Which is too bad, since sysread-may-block-p no longer works for stdin and that causes a whole lot of problems. After futile wandering here and there and trying out various stuff, I decided to add a new command line argument (--stdin) which provides the name of a file that will be opened and its fd used instead of 0 in the *stdin* fd-stream. This proved OK, and the runtime actually started to process make-target-2.lisp and make-target-2-load.lisp, and so make-target-2.sh was finally done;
  - I immediately decided to run-sbcl.sh and found out that some getpwdir is not working. I quickly found out that for some unknown reason calloc (used in wrap.c) ends up allocating 4096 bytes, but zeroing out 4108 (wtf!?). I decided not to cope with it and just changed calloc to malloc and then memset with zero. This done, run-sbcl.sh actually ran, repl-ed, I declared major success and opened up a beer;
  - having the first version running, I decided to make it better. First of all, default compilation mode of the C cross-compiler is static, which is OK, since the CNK only uses 1 process and having shared libraries is not of much benefit. Dynamic linking is supported however, and I decided to make the next step and rebuild everything with dynamic linking;
  - this turned out to be a major problem, because the CNK, when the program is dynamically linked, places the heap at a virtual address that is above 2GB. So I was stuck. Until recently, when I found out that there is a feature called persistent memory (which allows a memory area to remain between job invocations). The amount of persistent memory and the time when it is reset are controlled by environment variables, so I decided to ask for 1MB of persistent memory, just to see what happens. Suddenly, the CNK decided to put my heap around the 1GB virtual address, which suits me just right. I still don't understand the exact mechanics of how CNK decides what to put where, but I get consistent behavior and that's all I care about for now;
  - I moved on to building contribs. sb-grovel is not usable, since it forks a process and compiles foo.c, which is just not possible on CNK (it is a single-process kernel). But, I did it myself on the front-end and created a pre-defined constants.lisp file, just like the win32 case;
  - then, it turns out that dlopen(0) is not working and I had to ldso-stubify some of the routines that are required in the contribs but are actually in libc;
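
To make the allocation-macro change above more concrete, here is a minimal C sketch of the idea: a bump allocator that checks the new free pointer against current_auto_gc_trigger before handing out memory. Only current_auto_gc_trigger is a name from the story; heap, free_pointer, request_gc and allocate are hypothetical stand-ins, and the real port issues a trap from the Lisp allocation macro rather than calling a C function. The "collection" here simply pretends everything was garbage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for runtime state (not SBCL's real layout). */
static uint8_t heap[4096];
static uint8_t *free_pointer = heap;
static uint8_t *current_auto_gc_trigger = heap + 1024;
static int gc_requests = 0;

/* Stand-in for the trap handler. The real port traps and runs the
   collector; this sketch only counts the request and resets the free
   pointer as if nothing had survived the collection. */
static void request_gc(void)
{
    gc_requests++;
    free_pointer = heap;
}

/* Bump allocation with an explicit trigger check, replacing the
   mprotect-based trigger that CNK silently ignores. */
static void *allocate(size_t nbytes)
{
    nbytes = (nbytes + 7) & ~(size_t)7;   /* round up to 8 bytes */
    /* Sketch-only precondition: a single request must fit below the trigger. */
    assert(nbytes <= (size_t)(current_auto_gc_trigger - heap));

    if (free_pointer + nbytes > current_auto_gc_trigger)
        request_gc();                     /* would be a trap in the port */

    void *result = free_pointer;
    free_pointer += nbytes;
    return result;
}
```

The extra cost per allocation is one compare and one (normally untaken) branch, which matches the "little overhead" mentioned above.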

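The mmap quirk from the first item in the list (MAP_FIXED is required, but only when addr is not null) can be sketched in C as follows. This is an illustration under assumptions, not the actual patch: region_mmap_flags and map_region are hypothetical names, and the anonymous private mapping is an assumption about how the regions are requested.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: pick mmap flags the way the story describes,
   adding MAP_FIXED only when a concrete address is requested. */
static int region_mmap_flags(const void *addr)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (addr != NULL)
        flags |= MAP_FIXED;
    return flags;
}

/* Hypothetical wrapper: map len bytes read/write at addr (or anywhere
   when addr is NULL). Returns NULL on failure. */
static void *map_region(void *addr, size_t len)
{
    assert(len > 0);
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   region_mmap_flags(addr), -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Note that on an ordinary Linux system MAP_FIXED silently replaces any existing mapping at that address, so a wrapper like this should only be given addresses the runtime is known to own.
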
Well, more or less that's the story until now. I have a working swank on the CNK and an emacs/slime connected to it from my home machine via an ssh tunnel. The next steps are to clean up all the messy bits and pieces and come up with a patchset that I can apply to new sbcl versions, and to try the foreign interface to MPI so I can make the parallel runtimes communicate with each other.

cheers,
-Valentin