[Ray-devel] VM Redesign, GC, Coroutines (please read)

Hello everybody!

It's time to bring some life back to the list.

1. Coroutines
-------------

Using stream processing and kernels lets us support the architectures 
that are likely to come up in the future: multi-core SMP-like systems, 
which will probably be in consumer boxes within a few years (see 
recent "Spektrum der Wissenschaft"), as well as more radical ones like 
the Cell architecture. 
However, one of the problems is how to implement streams and kernels 
adequately on a (currently) standard UP (uniprocessor) box. 

One solution for that could be the use of threads combined with the 
use of coroutines. For details, see Knuth, "The Art of Computer 
Programming" or the documentation for coroutine.{cc,h} in  
devel-ray/src/lib/threads/. 
The coroutine implementation in there is actually a thread-safe port 
of libPCL (portable coroutine library) by Davide Libenzi. 

Basically, coroutines allow fast, cooperative user-space switching of 
contexts without involving the operating system. 
(On my linux-2.6 UP system, a coroutine context switch is 7 times as fast 
as a thread context switch.)

Because coroutine switching is cooperative, it also eliminates the 
locking that concurrently running threads would require. 
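To make the idea concrete, here is a minimal sketch of a producer kernel 
feeding a one-element stream to a consumer by cooperative switching. It 
assumes a POSIX system and uses the <ucontext.h> calls (getcontext, 
makecontext, swapcontext), not the API of coroutine.{cc,h} or libPCL; all 
names in the sketch namespace are made up for illustration. Note that 
swapcontext may itself save and restore the signal mask, so this shows 
the control flow only and says nothing about switching cost.

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

// Hypothetical names throughout; this is NOT the API of coroutine.{cc,h}.
namespace sketch {

static ucontext_t main_ctx;    // saved context of the consumer
static ucontext_t kernel_ctx;  // saved context of the producer coroutine
static int current_value = 0;  // one-element "stream" between the two
static bool finished = false;  // set by the kernel when it is done

// Producer kernel: emits 0, 1, 4, 9, 16 and yields after each element.
static void square_kernel() {
    for (int i = 0; i < 5; ++i) {
        current_value = i * i;
        swapcontext(&kernel_ctx, &main_ctx);  // cooperative yield
    }
    finished = true;
    // Returning from here resumes uc_link, i.e. main_ctx.
}

// Consumer: resumes the kernel until it finishes, collecting the stream.
std::vector<int> run_stream() {
    static char stack[64 * 1024];  // private stack for the coroutine
    std::vector<int> out;
    finished = false;

    int rc = getcontext(&kernel_ctx);
    assert(rc == 0);
    (void)rc;
    kernel_ctx.uc_stack.ss_sp = stack;
    kernel_ctx.uc_stack.ss_size = sizeof(stack);
    kernel_ctx.uc_link = &main_ctx;
    makecontext(&kernel_ctx, square_kernel, 0);

    for (;;) {
        swapcontext(&main_ctx, &kernel_ctx);  // resume the kernel
        if (finished) break;
        out.push_back(current_value);
    }
    return out;
}

}  // namespace sketch
```

No lock protects current_value: only one of the two contexts runs at any 
time, and control changes hands only at the explicit swapcontext calls. 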

2. VM
-----

After Andrew's remarks concerning the VM and my own recent feeling 
that we need more performance, I thought about changing the (planned) 
internal VM layout somewhat. 

For some background, we were having the following problems: 
- We need a garbage collector because the user of the VM cannot be 
  relied on to properly deallocate objects. However, during the GC 
  run, other threads need to be stopped, which is not easily possible 
  using pthreads and similar APIs. 
- The VM has a two-stage lookup/dereferencing scheme for pointers. 
  Advantage: It makes safe pointers easy (the user cannot crash the VM) 
  and allows on-demand requests of objects over the network to be 
  implemented at the VM level. 
  However, using real pointers would of course be faster, and the impact 
  is not limited to the VM because all exported objects need to use the 
  same notion of pointers as the VM itself. 
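As an illustration of the two-stage lookup described above, here is a 
minimal sketch. The names (Handle, HandleTable, insert/resolve/erase) 
are hypothetical and not taken from the VM sources; the table does not 
own the objects, it only records where they are.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names; a sketch of the idea, not the VM's actual layout.
namespace sketch {

using Handle = std::uint32_t;  // what VM code holds instead of a raw pointer

class HandleTable {
public:
    // Register an object and hand out its slot index.
    Handle insert(void* object) {
        slots_.push_back(object);
        return static_cast<Handle>(slots_.size() - 1);
    }

    // Two-stage dereference: handle -> slot -> object address.
    // O(1); an out-of-range or cleared handle yields nullptr, not a crash.
    void* resolve(Handle h) const {
        return h < slots_.size() ? slots_[h] : nullptr;
    }

    // Explicit deletion clears the shared slot, so every copy of the
    // handle resolves to nullptr afterwards.
    void erase(Handle h) {
        if (h < slots_.size()) slots_[h] = nullptr;
    }

private:
    std::vector<void*> slots_;
};

}  // namespace sketch
```

The extra indirection is also the natural hook for on-demand loading: 
resolve() is the one place where a missing object could be fetched. The 
cost is one additional memory access on every dereference. 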

So, a possible way to go seems to be the following: 
- Use real pointers in the VM. The VM can use a checking mode where 
  every pointer value is first looked up in a hash table to verify 
  its validity. This is much slower than the previous two-stage lookup 
  (which is O(1)) but the non-checking version of the VM (for real 
  renderings) is faster. 
- Network distribution or on-demand requests of objects then need to 
  be handled at a higher level. (This will probably require an 
  indirection layer for objects capable of such on-demand loading.)
- Garbage collection is performed by the Boehm conservative GC 
  (http://www.hpl.hp.com/personal/Hans_Boehm/gc/), which is also used 
  in gcc and other projects. 
  Hans Boehm et al. spent a lot of time writing a state-of-the-art 
  GC which nowadays even supports multi-threading (by stopping other 
  threads; they have platform-dependent code for that part). 
  This also saves us some development work. 
- SPL (in my current design) allows for explicit deletion of objects 
  (despite the GC). If the user explicitly deletes all his allocated 
  objects, the application will run faster because the GC has less work 
  (I verified that with boehm_gc). 
  We can allow the user to not use garbage collection (and do his own 
  memory management) if he wishes to do so (at his own risk, as always). 
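A minimal sketch of the checking mode from the first point above, 
assuming the VM records every live object address in a hash table. 
CheckedHeap and load_int are hypothetical names, and plain malloc/free 
stand in for the real allocator (no GC involved here). Only object base 
addresses are validated; interior pointers would need a range lookup 
instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_set>

// Hypothetical names; malloc/free stand in for the VM's real allocator.
namespace sketch {

class CheckedHeap {
public:
    // Allocate and remember the base address of the new object.
    void* allocate(std::size_t n) {
        void* p = std::malloc(n);
        if (p) live_.insert(p);
        return p;
    }

    // Explicit deletion: only addresses we handed out are freed.
    void release(void* p) {
        if (live_.erase(p) == 1) std::free(p);
    }

    // Hash lookup performed by the checking VM before each dereference.
    bool is_valid(const void* p) const { return live_.count(p) != 0; }

private:
    std::unordered_set<const void*> live_;
};

// Checking == true:  validate first and refuse bad pointers.
// Checking == false: trust the pointer (fast path for real renderings).
template <bool Checking>
bool load_int(const CheckedHeap& heap, const int* p, int& out) {
    if (Checking && !heap.is_valid(p)) return false;
    out = *p;
    return true;
}

}  // namespace sketch
```

With Checking fixed at compile time, the non-checking instantiation 
contains no lookup at all, so the two VM variants can share one source. 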

The main problems with this approach are: 
- The VM needs to know the object base location to do dynamic casts which 
  are also needed for virtual function calls. This can be solved by 
  attaching an offset value to all base classes in all instances. 
  Actually, no additional memory is required unless the object is a 
  (base) class of non-zero size but has _no_ virtual functions. These 
  cases are probably rare. 
- Pointer size is no longer constant for the VM since we support 32-bit 
  and 64-bit systems. This means that the compiler cannot easily calculate 
  offsets for the target machine. The easiest solution that comes to my 
  mind is that the VM assembly uses 2 values for each address/offset 
  specification, one for 32-bit and one for 64-bit systems. The VM can 
  then select the correct version when loading the assembly file. 
- Explicit deletion changes its behaviour. Previously, a pointer to an 
  explicitly deleted object would be NULL after deleting the object 
  because the shared indirection layer index contains NULL. Now, a 
  pointer is not automagically NULL and dereferencing a deleted object 
  may crash the VM (or trigger an assertion for the checking VM). 
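The two-values-per-offset idea from the list above could look roughly 
like this. It is a sketch under assumptions: DualOffset and 
select_offset are hypothetical names, and the actual encoding in the VM 
assembly file is not decided here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names; the real assembly file format is not specified.
namespace sketch {

// One address/offset operand as the compiler would emit it:
// both variants are stored, the loader keeps only one.
struct DualOffset {
    std::uint32_t for32;  // value valid with 4-byte pointers
    std::uint32_t for64;  // value valid with 8-byte pointers
};

// Called once per operand while loading the assembly file.
// pointer_size defaults to the pointer size of the running VM.
inline std::uint32_t select_offset(const DualOffset& o,
                                   std::size_t pointer_size = sizeof(void*)) {
    assert(pointer_size == 4 || pointer_size == 8);
    return pointer_size == 8 ? o.for64 : o.for32;
}

}  // namespace sketch
```

For example, for an object laid out as { pointer a; int b; } the 
compiler would emit {4, 8} as the offset of b, and the VM would resolve 
it to a single value at load time, so there is no per-access cost. 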

3. SPL
------

Just by chance, I stumbled across a tool called "treecc" 
(www.southern-storm.com.au/treecc.html). It helps in compiler development 
mainly by protecting the programmer from forgetting cases. 
I'm currently evaluating whether to use it for the SPL compiler 
implementation. This is the reason why compilation of the code currently 
fails just before the end in the spl/ directory. 

Since treecc is quite small and not very common, I will put it into the 
3rdparty/ directory once it is clear that it will be used. 

The same holds for the garbage collector. Binary distributions of it do 
exist, but we must make sure that we have the thread-safe version, and 
we may also want to experiment with parallel marking and other 
compile-time tuning. 

(Things already work in my development version but I would like to avoid 
adding lots of 3rdparty code to CVS that will be removed again later.) 

4. RT core
----------

Any more ideas/suggestions on core design / stream processing / SIMD?
(Andrew: I would be very happy to see your ideas before you have to 
         leave us.)

Regards,
Wolfgang