From: Wolfgang W. <ww...@gm...> - 2005-03-13 22:13:24
On Sunday 13 March 2005 04:06, Andrew Clinton wrote:
> Quoting Wolfgang Wieser <ww...@gm...>:
> > [Coroutines]
>
> I think that this is actually a really good way to do it, and it is
> supported by some of the research that I've been doing recently. In
> particular, you should check out the Kilauea renderer:
> http://portal.acm.org/citation.cfm?id=569673.569675&dl=GUIDE&dl=ACM&type=series&idx=SERIES10714&part=Proceedings&WantType=Proceedings&title=ACM%20International%20Conference%20Proceeding%20Series
>
> (If you don't have ACM access I can mail you the paper)

Thanks; I'll check that tomorrow.

> The way they do it is to match the number of threads to the number of
> processors on the machine. Within each thread, they use task parallelism
> like you said, with application-controlled task switches.

This is also what I had in mind. (Maybe using some more threads than CPUs
to fill up disc latencies while reading data and the like.)

[BTW, a coroutine switch takes about as long as "for(int i=0; i<72; i++);"
on my box.]

> One thing that Kilauea supports that we probably won't need is the
> ability to split a large scene database among a collection of machines,
> which adds a lot of complexity to their design.

This is a nice feature, but I think we should not implement it unless (for
some reason) we could do it easily (which I actually doubt). Instead, I
would like to have on-demand loading of large images, meshes and NURBS,
which are the things that usually use up most of the storage.

> I've been doing a lot of reading in my spare time, esp. on global
> illumination topics. I've been making my way through "Physically Based
> Rendering" by Pharr, Humphreys, which is turning out to be a really
> excellent book (so far).

I just had a look at the table of contents of that book (available at
Amazon). Anyway, I need to get more of a clue about physically based
rendering...
> I've been busy lately, unfortunately the next two weeks I need to catch
> up with a compiler project for school so I can't comment on everything
> you've mentioned.

Compiler project - that's nice :) You know that I'm just starting to write
a compiler? (For SPL.)

> - floats or doubles: I'm leaning towards using floats in the renderer, at
> least for storage and the majority of computation. Reasons: approx. 1/2
> of memory use for meshes, better cache performance, and improved
> computation speed (but not twice as fast; float divide is 24 cycles on
> AMD64 while double is 28 cycles). We could change to doubles for some
> computation paths if it turns out there are precision issues.

Well, I'd rather stick to double for most of the computation. Okay, meshes,
NURBS parameters and all the color computation could be done using floats,
but for the rest of the geometry and intersection computations (especially
everything involving polynomial root solving), I would like to use double
precision. At least that's my feeling after all the numerics I've been
doing. We should, however, try to code in a way that allows changing this
easily at compile time. Most of the code in lib/numerics is provided as
float/double templates (notable exception: the root solvers).

> - SIMD processing: After some more research, I'm less confident of using
> a SIMD approach to coherent rays, esp. for global illumination. I'm
> wondering if we could get a better speedup just by making specific
> optimizations for low-level vector operations.

I've actually been a bit reluctant about SIMD right from the beginning, and
I am happy that what you propose here is pretty much what was planned from
the start:

- Using Vector2,3,4 C++ classes which can implement vector operations using
  the processor's SIMD instructions (inline assembly or built-in gcc
  functions). (See e.g. lib/numerics/3d/vector3.h; SIMD not yet added.)
- Implementing Vector2,3,4 types as native types in the VM to allow the VM
  to internally do SIMD operations on them.

Collecting values and evaluating several of them SIMD-like in the VM (e.g.
for isosurfaces) would probably still increase performance, especially when
not JIT-compiling the code. This could be done with a stream design using
lots of kernels (e.g. one for isosurface evaluation). One then has to make
sure that a kernel will process its input _if_needed_ (even if there is
only _one_ value in the input stream) to prevent deadlock. This is similar
to what you mentioned with ray priority. When not using a stream design, we
can forget about higher-level SIMD anyway.

> - Triangles only: We could optimize our design by having most primitives
> tessellate to triangles. Then the accelerator could be constructed with
> geometry specified inline with the leaf nodes. We could do this by
> templating the accelerator on the type of data that will be processed, in
> our case either triangles or object pointers. Non-triangle primitives
> like isosurfaces would be placed into a separate accelerator.

The accelerator data structure is one of the things about the RT core that
I worry about a lot because of its impact on intersection-finding speed.
Actually, I would like to define a clear interface to the accelerator to
make it exchangeable and to be able to have BSP/kD-tree/grid/octree
structures. (You probably know the paper by Havran et al., "Statistical
Comparison of Ray-Shooting Efficiency Schemes"; kD-trees seem to perform
best in most situations.)

If I understand you correctly, you propose that scene shapes can either be
represented as meshes (via tessellation of the shape) or via a bounding box
and their own intersection function (which computes exact intersections).
All objects must be able to provide an intersection function; some may also
provide a tessellation function.
Regarding the tessellation: we can do this; however, I don't like it very
much, because one of the nice features of RT is that it can easily be used
on arbitrary geometry, not only on triangles and NURBS, and because a
tessellated object takes much more space in memory. But OTOH, if it
improves speed, why not allow it on a per-object basis?

But there are some other problems:

- Transferring the acceleration structure to the clients. This becomes
  "interesting" especially for on-demand downloading of the scene graph in
  case we want to do that (the new VM design will not do that for us due to
  the lack of an indirection layer).

- Objects may not be representable by just an object pointer. I like the
  idea of "procedural transformations" (if one can call it that): an object
  transformation node in the CSG tree can store properties like "1000
  objects in a line" without having to allocate 1000 object transformation
  matrices. This means that an object should be represented by the
  transformation node pointer plus the transformation index (0..999 here).
  Otherwise (if we just have the node pointer), we would have to test all
  1000 contained "objects" for intersections. Obviously, one index is not
  enough as soon as these things get nested.

I'll re-read the other parts of your email once I've had a look at the
"Kilauea" paper. Thanks for the hint.

Cheers,
Wolfgang