From: <ajc...@en...> - 2005-03-13 23:26:05
> > One thing that Kilauea supports that we probably won't need is the
> > ability to split a large scene database among a collection of
> > machines, which adds a lot of complexity to their design.
>
> This is a nice feature but I think we should not implement that unless
> (for some reason) we could do it easily (which I actually doubt).
> Instead, I would like to have on-demand loading of large images, meshes
> and NURBS, which are the things which usually use up most of the
> storage.

I agree; they needed to develop distributed algorithms for ray
intersection and photon gathering, which require rays to be sent to
multiple hosts and the partial results combined. It would be much more
useful to encourage users to develop scenes so that objects can be
demand-loaded or built on the fly, letting the system manage memory
effectively on a single host.

> > - floats or doubles: I'm leaning towards using floats in the
> > renderer, at least for storage and the majority of computation.
> > Reasons: approx. 1/2 of memory use for meshes, better cache
> > performance, and improved computation speed (but not twice as fast;
> > float divide is 24 cycles on AMD64 while double is 28 cycles). We
> > could change to doubles for some computation paths if it turns out
> > there are precision issues.
>
> Well, I'd rather stick to double for most of the computation.
>
> Okay, meshes, NURBS parameters and all the color computation could be
> done using floats, but for the rest of the geometry and intersection
> computations (especially everything involving polynomial root solving),
> I would like to use double precision. At least that's my feeling after
> all the numerics I've been doing.
>
> We should, however, try to code in a way that allows changing that
> easily at compile time. Most of the code in lib/numerics is provided as
> float/double templates (notable exception: root solvers).
I would aim for using floats everywhere except where there is a known
numerical problem, so we could use doubles only for the root-solving
components of the system. This makes sense because root finding has few
storage or caching issues, so doubles should perform only slightly worse
than floats there. For triangle meshes, patch surfaces, bounding boxes,
and accelerator nodes, we could use floats to improve performance
(probably approaching 2x). One more advantage of floats is that if we
ever perform a SIMD optimization of a code path, we will get a 2x
improvement over what we would get with doubles: SSE registers fit 4
floats but only 2 doubles.

> > - SIMD processing: After some more research, I'm less confident of
> > using a SIMD approach to coherent rays, esp. for global illumination.
> > I'm wondering if we could get a better speedup just by making
> > specific optimizations for low-level vector operations.
>
> I've actually been a bit reluctant with SIMD right from the beginning,
> and I am happy that what you propose here is pretty much what was
> planned right from the beginning:
> - Using a Vector2,3,4 C++ class which can implement vector operations
>   using the processor's SIMD instructions (inline assembly or built-in
>   gcc functions). (E.g. lib/numerics/3d/vector3.h; SIMD not yet added.)
> - Implementing Vector2,3,4 types as native types in the VM to allow the
>   VM to internally do SIMD operations on them.
>
> Collecting values and evaluating several of them SIMD-like in the VM
> (e.g. for isosurfaces) would probably still increase performance,
> especially when not JIT-compiling the code. This could be done with a
> stream design using lots of kernels (e.g. one for isosurface
> evaluation). One then has to make sure that the kernel will process its
> input _if_needed_ (even if there is only _one_ value in the input
> stream) to prevent deadlock.
> This is similar to what you mentioned with ray priority.

One idea I've been considering is to allow the user to flag a
computation (e.g. an isosurface kernel) as "parallel". In this case the
system will make the optimization you've described and collect samples
to evaluate together. By letting the user specify which kernels should
be optimized, we can avoid the large sorting problem of collecting
samples for all kernels. This might also make sense given the
conventional use of isosurfaces (for the most part in POV-Ray), where a
single large isosurface is used as terrain or as a major feature of the
scene.

> > - Triangles only: We could optimize our design by having most
> > primitives tesselate to triangles. Then the accelerator could be
> > constructed with geometry specified inline with the leaf nodes. We
> > could do this by templating the accelerator to the type of data that
> > will be processed, in our case either triangles or object pointers.
> > Non-triangle primitives like isosurfaces would be placed into a
> > separate accelerator.
>
> The accelerator data structure is one of the things about the RT core
> which I worry much about, because of its impact on intersection-finding
> speed. Actually, I would like to define a clear interface to the
> accelerator to make it changeable and be able to have
> BSP/kD/grid/octree structures.

This is usually done by making the accelerator implement the same
interface as the primitives themselves. This is what I've done in my
raytracer, and it is also what is done in pbrt. I would focus only on
kd-trees. You can use my patch to start; I'm not sure that I'll have
time to make the necessary adaptations.
Additional optimizations in the book include the following:

- Represent all tree nodes with 8 bytes (64 bits):

      struct KDNode {
          union {
              unsigned int flags;       // Both
              float split;              // Interior
              unsigned int nprimitives; // Leaf
          };
          union {
              unsigned int children;    // Interior
              T *primitive;             // Leaf
              T **primitives;           // Leaf
          };
      };

  The trick is to encode the flags in the two least significant bits of
  the floating-point split value. This has huge caching benefits when
  iterating through the accelerator.

- Use a cost model with different costs for traversal steps and object
  intersections. My accelerator (I think) assumes traversal steps cost
  nothing.

> (You probably know the paper of Havran et al., "Statistical Comparison
> of Ray-Shooting Efficiency Schemes"; kD seems to perform best in most
> situations.)

Yes. In case you were also not aware, the BSP patch I wrote for POV-Ray
was in fact an implementation of many of the features in Havran's PhD
thesis ("Heuristic Ray Shooting Algorithms").

> If I understand you correctly, you propose that scene shapes can either
> be represented as meshes (via tesselation of the shape) or via a
> bounding box and their own intersection function (which computes exact
> intersections). All objects must be able to provide an intersection
> function; some may also provide a tesselation function.
> (We can do this; however, I don't like the tesselation very much
> because it is one of the features of RT that it can easily be used on
> arbitrary geometry and not only on triangles and NURBS. And because a
> tesselated object takes much more space in memory. But OTOH, if it
> improves speed, why not allow that on a per-object basis?)

It actually improves speed immensely for most cases (I would not be
surprised at 1 or 2 orders of magnitude difference between intersecting
a finely tesselated NURBS surface vs. using an analytic method).
I would make tesselation the primary method of handling such patch
surfaces and ignore analytic methods for anything but user-defined
isosurfaces or infinite objects. Even for isosurfaces we might want to
build in a conversion to be able to get fast previews.

I think our system should make use of ray differentials to choose the
level of tesselation; see the paper "Ray Differentials and
Multiresolution Geometry Caching for Distribution Ray Tracing in Complex
Scenes" by Christensen et al. This way, highly incoherent rays for GI
can use a low-res tesselation, saving a lot of effort.

One more thing (I don't think I was that clear): tesselated geometry can
be put directly into the accelerator, giving a memory layout like

  ILGGGGLGGGILGGLLGGG

  I = interior kd-tree node
  L = leaf node
  G = triangle

This would have really nice cache performance. My current implementation
in POV-Ray stores the leaves, object lists, and geometry in entirely
separate memory locations, so there are 2 layers of indirection just to
intersect against the object list. With the inline layout, almost all
the processing for a leaf could be done without any indirections. The
disadvantage is that finely subdivided accelerators need to duplicate a
sizable number of triangles (but with good termination criteria during
construction, this can be kept to a minimum).

> But there are some other problems:
> - Transferring the acceleration structure to the clients. This becomes
>   "interesting", especially for on-demand downloading of the scene
>   graph in case we want to do that (the new VM design will not do that
>   for us due to the lack of an indirection layer).

I would recommend building accelerators locally and on demand. This does
not prohibit on-demand geometry transfers.

> - Objects may not be representable by just an object pointer.
> I like the idea of "procedural transformations" (if one can call it
> like that), which means that an object transformation node in the CSG
> tree can store properties like "1000 objects in a line" without having
> to allocate 1000 object transformation matrices.
> This means that the object should be represented by the transformation
> node pointer and the transformation index (0..999 here).
> Otherwise (if we just have the node) we would have to test all 1000
> contained "objects" for intersections.
> Obviously, one index is not enough as soon as these things get nested.

Yes, procedurals should be expanded and put into an accelerator locally.
There is no need to transfer anything except the initial program to the
rendering servers.

Andrew

----------------------------------------
This mail sent through www.mywaterloo.ca