From: Wolfgang W. <ww...@gm...> - 2005-03-13 22:13:24
On Sunday 13 March 2005 04:06, Andrew Clinton wrote:
> Quoting Wolfgang Wieser <ww...@gm...>:
> > [Coroutines]
>
> I think that this is actually a really good way to do it, and it is
> supported by some of the research that I've been doing recently. In
> particular, you should check out the Kilauea renderer:
> http://portal.acm.org/citation.cfm?id=569673.569675&dl=GUIDE&dl=ACM&type=series&idx=SERIES10714&part=Proceedings&WantType=Proceedings&title=ACM%20International%20Conference%20Proceeding%20Series
>
> (If you don't have ACM access I can mail you the paper)

Thanks; I'll check that tomorrow.

> The way they do it is to match the number of threads to the number of
> processors on the machine. Within each thread, they use task parallelism
> like you said, with application-controlled task switches.

This is also what I had in mind. (Maybe using some more threads than CPUs
to fill up disc latencies while reading data and the like.)

[BTW, a coroutine switch takes about as long as "for(int i=0; i<72; i++);"
on my box.]

> One thing that Kilauea supports that we probably won't need is the
> ability to split a large scene database among a collection of machines,
> which adds a lot of complexity to their design.

This is a nice feature, but I think we should not implement it unless (for
some reason) we could do it easily (which I actually doubt). Instead, I
would like to have on-demand loading of large images, meshes and NURBS,
which are the things that usually use up most of the storage.

> I've been doing a lot of reading in my spare time, esp. on global
> illumination topics. I've been making my way through "Physically Based
> Rendering" by Pharr, Humphreys, which is turning out to be a really
> excellent book (so far).

I just had a look at the table of contents of that book (available at
Amazon). Anyway, I need to get more of a clue about physically based
rendering...
> I've been busy lately, unfortunately the next two weeks I need to catch
> up with a compiler project for school so I can't comment on everything
> you've mentioned.

Compiler project - that's nice :) You know that I'm just starting to write
a compiler? (For SPL.)

> - floats or doubles: I'm leaning towards using floats in the renderer, at
> least for storage and the majority of computation. Reasons: approx. 1/2
> of memory use for meshes, better cache performance, and improved
> computation speed (but not twice as fast; float divide is 24 cycles on
> AMD64 while double is 28 cycles). We could change to doubles for some
> computation paths if it turns out there are precision issues.

Well, I'd rather stick to double for most of the computation. Okay, meshes,
NURBS parameters and all the color computation could be done using floats,
but for the rest of the geometry and intersection computations (especially
everything involving polynomial root solving), I would like to use double
precision. At least that's my feeling after all the numerics I've been
doing. We should, however, try to code in a way that allows changing this
easily at compile time. Most of the code in lib/numerics is provided as
float/double templates (notable exception: the root solvers).

> - SIMD processing: After some more research, I'm less confident of using
> a SIMD approach to coherent rays, esp. for global illumination. I'm
> wondering if we could get a better speedup just by making specific
> optimizations for low-level vector operations.

I've actually been a bit reluctant about SIMD right from the beginning, and
I am happy that what you propose here is pretty much what was planned from
the start:

- Using Vector2,3,4 C++ classes which can implement vector operations using
  the processor's SIMD instructions (inline assembly or built-in gcc
  functions). (See e.g. lib/numerics/3d/vector3.h; SIMD not yet added.)
- Implementing Vector2,3,4 types as native types in the VM to allow the VM
  to internally do SIMD operations on them.

Collecting values and evaluating several of them SIMD-like in the VM (e.g.
for isosurfaces) would probably still increase performance, especially when
not JIT-compiling the code. This could be done with a stream design using
lots of kernels (e.g. one for isosurface evaluation). One then has to make
sure that a kernel will process its input _if_needed_ (even if there is
only _one_ value in the input stream) to prevent deadlock. This is similar
to what you mentioned with ray priority. When not using a stream design, we
can forget about higher-level SIMD anyway.

> - Triangles only: We could optimize our design by having most primitives
> tessellate to triangles. Then the accelerator could be constructed with
> geometry specified inline with the leaf nodes. We could do this by
> templating the accelerator on the type of data that will be processed, in
> our case either triangles or object pointers. Non-triangle primitives
> like isosurfaces would be placed into a separate accelerator.

The accelerator data structure is one of the things about the RT core that
I worry about a lot because of its impact on intersection-finding speed.
Actually, I would like to define a clear interface to the accelerator to
make it exchangeable and to be able to have BSP/kD-tree/grid/octree
structures. (You probably know the paper by Havran et al., "Statistical
Comparison of Ray-Shooting Efficiency Schemes"; kD-trees seem to perform
best in most situations.)

If I understand you correctly, you propose that scene shapes can either be
represented as meshes (via tessellation of the shape) or via a bounding box
and their own intersection function (which computes exact intersections).
All objects must be able to provide an intersection function; some may also
provide a tessellation function.
Regarding the tessellation: we can do this; however, I don't like it very
much, because one of the nice features of RT is that it can easily be used
on arbitrary geometry, not only on triangles and NURBS, and because a
tessellated object takes much more space in memory. But OTOH, if it
improves speed, why not allow it on a per-object basis?

But there are some other problems:

- Transferring the acceleration structure to the clients. This becomes
  "interesting" especially for on-demand downloading of the scene graph in
  case we want to do that (the new VM design will not do that for us due to
  the lack of an indirection layer).

- Objects may not be representable by just an object pointer. I like the
  idea of "procedural transformations" (if one can call it that): an object
  transformation node in the CSG tree can store properties like "1000
  objects in a line" without having to allocate 1000 object transformation
  matrices. This means that an object should be represented by the
  transformation node pointer plus the transformation index (0..999 here).
  Otherwise (if we just have the node pointer), we would have to test all
  1000 contained "objects" for intersections. Obviously, one index is not
  enough as soon as these things get nested.

I'll re-read the other parts of your email once I've had a look at the
"Kilauea" paper. Thanks for the hint.

Cheers,
Wolfgang