From: Wolfgang W. <ww...@gm...> - 2005-06-25 21:23:04
It's so quiet here... If anybody on this list is still interested in this project in any way, then PLEASE speak up NOW.

Regards, Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-06-22 20:48:46
Hello everybody!

As we all know, RAY aims to be a full-featured extensible raytracer, portable between POSIX and Windows, with native support for multi-threading and network distribution. Up to now, quite a lot has been done (mainly by implementing libraries which are largely core-independent) but we need to go for the main parts now, and hence it is probably a good idea to decide roughly on how to go on. This is my suggestion concerning the next steps in the development roadmap. I would like to encourage everybody to comment on it.

* RT core design. Before we implement it, we should get a good idea of the flow of information, the flow of execution, and the required classes and interfaces. Andrew offered to do some more overall RT core design / planning some time back and I'd like to kindly invite him to do so, especially because he has much more experience with raytracers than me. Apart from the "RT core event model", I did not do much more planning; however, I am very confident I can nicely implement a slightly modified event model fixing the portability problems. More on that in a separate email.

* General discussion about rendering on lots of computers: how things should operate, how to distribute information (scene graph, ...).

* VM and compiler implementation. This is what I'll do in my free time now. After more reasoning, I decided to implement the "real pointer VM" (I probably said that already), which requires a change in the assembly format in order to be portable across 32 and 64 bit platforms. Hence, I'll re-implement the linker as well (thereby tidying up the code, too).

I'm still deeply convinced that the project will eventually be a great raytracer and that more supporters can be found once it is operating. However, as usual, getting it started is hard and that is what we're currently going through. Yet another important point is the amount of time everyone is able or willing to invest. I'll have two important exams after this semester and therefore will be much more limited in the amount of free time compared to the months before. However, as always, I'll do the best I can.

Regards, Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-06-22 20:01:49
GNet has now been removed from the CVS. There is no good reason to have GNet in the source tree any more:

- GNet will not, or only very slowly, be developed further.
- I implemented / ported a lot of the core functionality of GNet (name resolution, TCP socket, UDP socket, address abstraction) in src/lib/net.
- It is much easier to compile everything without it.
- We do MUCH better error reporting than GNet; and the code should generally be more efficient now, too.

Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-05-28 20:05:33
> I will probably have a more in-depth view of the problem but let me first
> compile gcc-4.1.x on the P4 box...

Well, I cannot find the problem now and don't have enough time to dig further into the issue. Since there's IMO nothing wrong with the C++ code, there is little we can do anyway. So, in the end, the same code will produce ridiculous timings on the P4 while reporting fairly quick timings on Athlons. Either the program has difficulty measuring the timings, or the Intel chips just lose badly on these constructs -- or something else I did not think about.

Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-05-26 15:11:23
On Thursday 26 May 2005 06:19, Andrew Clinton wrote:
> I've updated the tests
> - Changed constants to single precision (neither float nor double seems to
> introduce any casts in the assembly, now)

I made some tests and it seems that gcc-4.1.0 20050526 ("today's version") does not make any difference between 0.5 and 0.5f as FP constants. While earlier versions of gcc apparently automatically widened float -> double at compile time while doing the down-conversion double -> float at run time, current (development) versions make no difference. In particular, I verified that the following snippets of test-speed.cc compile into binary-equivalent code:

v[0]=res; v[1]=res-0.1f;   v[2]=res-0.2f;
v[0]=res; v[1]=res-0.1;    v[2]=res-0.2;
v[0]=res; v[1]=res-(T)0.1; v[2]=res-(T)0.2;

Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-05-26 12:40:02
On Thursday 26 May 2005 06:19, ajc...@en... wrote:
> I've updated the tests
> - Changed constants to single precision (neither float nor double seems to
> introduce any casts in the assembly, now)

Yes, as I verified, the compiler will automatically do the float -> double widening at compile time while the double -> float conversion is done at run time.

> - Added dependency between successive iterations for each test

I slightly changed that again: instead of having the same variable on both sides (as in "a=a*b"), I introduced 2 tests: one for "a*=b" and one for (a=b*c, b=c*a, c=a*b). This is because commonly the result of an operation will not overwrite one input variable but be stored somewhere else. This has an impact on compiler-generated temporaries. The overall picture of the benchmarks is not much affected by this.

> These changes make a substantial difference on my system:
>
> --AMD 64 3500+ 2.2Ghz

Ah, great - you have a fairly fast AMD64! Could you please run the tests in src/lib/threads and send me the output? (Also compiler & kernel version.) I'd be very much interested in the switch and lock times on Linux/AMD64. (The timings on FreeBSD/AMD64 are not so good and the question is whether it's the architecture or the operating system. I don't have access to an idle Linux/AMD64 system.)

> So, I suppose we can conclude that doubles really are preferred to floats
> for straight arithmetic, at least on AMD64, for algorithms with only
> additions and multiplications.

It seems logical that the 64 bit doubles are the natural data type for 64 bit platforms and preferable over floats as long as storage size does not matter. However, for the 32 bit Athlon, float is considerably faster than double.
Here are updated timings:

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>----------------------------
(nothing)                : 0.358208 nsec/cyc
vector3::length          : flt: 37.1814 dbl: 37.179  nsec/cyc (1.00)
vector3*vector3          : flt: 17.4478 dbl: 25.3944 nsec/cyc (1.46)
vector3 x vector3        : flt: 10.6335 dbl: 15.1611 nsec/cyc (1.43)
matrix3*vector3          : flt: 20.4215 dbl: 32.536  nsec/cyc (1.59)
trafomatrix*vector3      : flt: 20.3866 dbl: 28.9683 nsec/cyc (1.42)
trafomatrix::inverse     : flt: 120.664 dbl: 112.873 nsec/cyc (0.94)
trafomatrix*trafomatrix  : flt: 88.0527 dbl: 97.4989 nsec/cyc (1.11)
trafomatrix*=trafomatrix : flt: 78.0263 dbl: 75.8576 nsec/cyc (0.97)
matrix3*matrix3          : flt: 65.0964 dbl: 109.145 nsec/cyc (1.68)
matrix3*=matrix3         : flt: 68.4487 dbl: 67.6107 nsec/cyc (0.99)

--<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>---------------------------------------
(nothing)                : 0.28741 nsec/cyc
vector3::length          : flt: 21.1455 dbl: 25.4859 nsec/cyc (1.21)
vector3*vector3          : flt: 3.86673 dbl: 3.79015 nsec/cyc (0.98)
vector3 x vector3        : flt: 8.95599 dbl: 8.94856 nsec/cyc (1.00)
matrix3*vector3          : flt: 17.1492 dbl: 14.7704 nsec/cyc (0.86)
trafomatrix*vector3      : flt: 16.4591 dbl: 13.9329 nsec/cyc (0.85)
trafomatrix::inverse     : flt: 58.0311 dbl: 61.0099 nsec/cyc (1.05)
trafomatrix*trafomatrix  : flt: 62.8014 dbl: 57.8739 nsec/cyc (0.92)
trafomatrix*=trafomatrix : flt: 65.3974 dbl: 55.8263 nsec/cyc (0.85)
matrix3*matrix3          : flt: 51.4784 dbl: 50.3268 nsec/cyc (0.98)
matrix3*=matrix3         : flt: 54.4237 dbl: 49.8042 nsec/cyc (0.92)

But now the bad news: your changes somehow heavily screwed up the gcc-3.4/Pentium4 pair. Look at that: (r1.6, r1.7 are the CVS revisions)

r1.6 (AJC)
(nothing)                : 0.116446 nsec/cyc
vector3::length          : flt: 2519.8  dbl: 2517.85 nsec/cyc (1.00)
vector3*vector3          : flt: 628.565 dbl: 628.868 nsec/cyc (1.00)
vector3 x vector3        : flt: 3032.07 dbl: 3033.61 nsec/cyc (1.00)
matrix3*vector3          : flt: 4726.15 dbl: 4727.36 nsec/cyc (1.00)
trafomatrix*vector3      : flt: 4718    dbl: 4732.51 nsec/cyc (1.00)
trafomatrix::inverse     : flt: 115.973 dbl: 73.5815 nsec/cyc (0.63)
trafomatrix*trafomatrix  : flt: 20309.1 dbl: 20257.4 nsec/cyc (1.00)
matrix3*matrix3          : flt: 14475.6 dbl: 14460.1 nsec/cyc (1.00)

My changes based on Andrew's don't make things better either...

r1.7 (WW)
(nothing)                : 0.11606 nsec/cyc
vector3::length          : flt: 2523.47 dbl: 2527.31 nsec/cyc (1.00)
vector3*vector3          : flt: 630.039 dbl: 628.347 nsec/cyc (1.00)
vector3 x vector3        : flt: 3055.63 dbl: 3057.65 nsec/cyc (1.00)
matrix3*vector3          : flt: 4725.59 dbl: 4722.21 nsec/cyc (1.00)
trafomatrix*vector3      : flt: 4722.81 dbl: 4727.08 nsec/cyc (1.00)
trafomatrix::inverse     : flt: 115.964 dbl: 74.8951 nsec/cyc (0.65)
trafomatrix*trafomatrix  : flt: 20279.7 dbl: 20276.5 nsec/cyc (1.00)
trafomatrix*=trafomatrix : flt: 20135.6 dbl: 20150.9 nsec/cyc (1.00)
matrix3*matrix3          : flt: 14451.3 dbl: 14454.8 nsec/cyc (1.00)
matrix3*=matrix3         : flt: 14400.8 dbl: 14450.7 nsec/cyc (1.00)

The "(nothing)" and inversion tests are okay but the rest is just horrible. I will probably have a more in-depth view of the problem but let me first compile gcc-4.1.x on the P4 box...

Wolfgang
From: <ajc...@en...> - 2005-05-26 04:19:46
I've updated the tests
- Changed constants to single precision (neither float nor double seems to introduce any casts in the assembly, now)
- Added dependency between successive iterations for each test

These changes make a substantial difference on my system:

--AMD 64 3500+ 2.2Ghz
(nothing)               : 0.250217 nsec/cyc
vector3::length         : flt: 18.2921 dbl: 22.4885 nsec/cyc (1.23)
vector3*vector3         : flt: 3.37153 dbl: 3.33858 nsec/cyc (0.99)
vector3 x vector3       : flt: 6.43606 dbl: 6.51059 nsec/cyc (1.01)
matrix3*vector3         : flt: 14.3812 dbl: 13.4929 nsec/cyc (0.94)
trafomatrix*vector3     : flt: 13.8013 dbl: 12.6405 nsec/cyc (0.92)
trafomatrix::inverse    : flt: 51.8531 dbl: 53.4506 nsec/cyc (1.03)
trafomatrix*trafomatrix : flt: 60.8745 dbl: 55.2666 nsec/cyc (0.91)
matrix3*matrix3         : flt: 46.5577 dbl: 51.3582 nsec/cyc (1.10)

It turns out I was somewhat wrong in stating that single precision operations have a longer latency than double. At least on AMD64, additions and multiplications of floats and doubles have the same latency (4 cycles). Only division and square root have a longer latency for double precision operations. I spent a while trying to figure out why there are still cases where double outperforms float, without coming to any conclusion. I'm guessing that the remaining difference is due to unaligned accesses to some of the floats. The AMD64 docs state that some of the move operations are slower when a floating point value is not 8-byte aligned. So, I suppose we can conclude that doubles really are preferred to floats for straight arithmetic, at least on AMD64, for algorithms with only additions and multiplications.

Quoting Wolfgang Wieser <ww...@gm...>:

> Hello Andrew!
>
> On Wednesday 25 May 2005 02:00, Andrew Clinton wrote:
> > The only way I think you'll be able to make sure this isn't
> > happening is to check the assembly code manually.
>
> The easiest thing is simply to USE the output (e.g. by summing it all up)
> and slightly vary the input (e.g. by adding 0.1 each iteration).
> This is what is actually done (right?)
>
> Another possibility is to look at execution time scaling (which I did as
> well):
> - The basic operations are inline functions which are called 4 times in
>   each loop. If I comment out 2 of them, the time halves.
> - There is one no-op measurement, i.e. a loop with 4 inlines which do
>   nothing. This is the "(nothing)" line. Its execution time indicates the
>   required time if the stmts are optimized away and does not change when
>   2 of the calls are commented out.
>
> While only looking at and understanding the assembly gives a definite
> answer, the scaling already gives confidence, and using/varying the vars
> prevents the compiler from removing operations.
>
> I tried the explicit float casting (by adding an "f" to all FP constants)
> and it did not change the timings.
>
> > It looks like your code is subject to some of these problems. I've made
> > some attempts at fixing it but it is difficult to verify (looking through
> > lots of assembly).
>
> Did you see any change in the timings?
> Feel free to commit a "fixed" version; I'd like to have a look at it.
>
> > Especially problematic might be the production of a
> > result value that is never used again,
>
> Well, the results in the loop statements are usually summed up and hence
> "used". The calculated sum is in the end not used but that should not make
> any difference.
>
> The fact that produced results are not directly needed for the next
> calculation may enable the compiler/processor to make better use of
> pipelining. I expect this to be beneficial to both types (float, double).
>
> > Maybe a better test would be to do
> > something useful with each type of arithmetic (intersect a ray with a
> > sphere?).
>
> Sure.
>
> Let me tell you something about the history of why I did this: I just
> wanted to get a feeling for the time needed for typical floating point
> operations as compared to the time required by thread-safe locking and
> context switching.
>
> So, as I wrote in the email the day before, I'd be very happy if you
> could post some real ray-shape intersection timings for your raytracer and
> a "typical" scene (i.e. lots of basic CSG / a big triangle mesh / ...).
>
> Because the point is basically that I would like to compare the cost of a
> ray-shape intersection to the cost of a coroutine switch.
> This is required to decide if it makes sense to use ray-shape intersections
> as request type or if the introduced overhead is too high.
>
> Wolfgang

----------------------------------------
This mail sent through www.mywaterloo.ca
From: Wolfgang W. <ww...@gm...> - 2005-05-25 16:41:40
Hello Andrew!

On Wednesday 25 May 2005 02:00, Andrew Clinton wrote:
> The only way I think you'll be able to make sure this isn't
> happening is to check the assembly code manually.

The easiest thing is simply to USE the output (e.g. by summing it all up) and slightly vary the input (e.g. by adding 0.1 each iteration). This is what is actually done (right?)

Another possibility is to look at execution time scaling (which I did as well):
- The basic operations are inline functions which are called 4 times in each loop. If I comment out 2 of them, the time halves.
- There is one no-op measurement, i.e. a loop with 4 inlines which do nothing. This is the "(nothing)" line. Its execution time indicates the required time if the stmts are optimized away and does not change when 2 of the calls are commented out.

While only looking at and understanding the assembly gives a definite answer, the scaling already gives confidence, and using/varying the vars prevents the compiler from removing operations.

I tried the explicit float casting (by adding an "f" to all FP constants) and it did not change the timings.

> It looks like your code is subject to some of these problems. I've made
> some attempts at fixing it but it is difficult to verify (looking through
> lots of assembly).

Did you see any change in the timings? Feel free to commit a "fixed" version; I'd like to have a look at it.

> Especially problematic might be the production of a
> result value that is never used again,

Well, the results in the loop statements are usually summed up and hence "used". The calculated sum is in the end not used but that should not make any difference.

The fact that produced results are not directly needed for the next calculation may enable the compiler/processor to make better use of pipelining. I expect this to be beneficial to both types (float, double).

> Maybe a better test would be to do
> something useful with each type of arithmetic (intersect a ray with a
> sphere?).

Sure.

Let me tell you something about the history of why I did this: I just wanted to get a feeling for the time needed for typical floating point operations as compared to the time required by thread-safe locking and context switching.

So, as I wrote in the email the day before, I'd be very happy if you could post some real ray-shape intersection timings for your raytracer and a "typical" scene (i.e. lots of basic CSG / a big triangle mesh / ...).

Because the point is basically that I would like to compare the cost of a ray-shape intersection to the cost of a coroutine switch. This is required to decide if it makes sense to use ray-shape intersections as request type or if the introduced overhead is too high.

Wolfgang
From: <ajc...@en...> - 2005-05-25 00:07:52
I'm almost certain that there is some problem in your testing procedure. I've run a similar test (vector normalization) for floats and doubles, and will always get about a 20% improvement with floats. If you look at the manuals for the processors, you'll see longer latency for the double operations. I'd wager one of the following things might be messing up the tests...

- unnecessary conversions: when you specify a float literal, always add 'f' (e.g. 1.0f instead of 1.0). Otherwise the compiler will synthesize conversions. Conversions can consume a lot of clock cycles.
- missing dependencies: the compiler might be optimizing some code or taking it out of the loop. If you compute a result and then never use it, the code might disappear. If you compute the same result many times, it might only execute once.

The only way I think you'll be able to make sure this isn't happening is to check the assembly code manually. It looks like your code is subject to some of these problems. I've made some attempts at fixing it but it is difficult to verify (looking through lots of assembly). Especially problematic might be the production of a result value that is never used again, so you aren't really timing the latency of the operation at all. Maybe a better test would be to do something useful with each type of arithmetic (intersect a ray with a sphere?).

Andrew

Quoting Wolfgang Wieser <ww...@gm...>:

> Hello everybody,
>
> In order to get an estimate of how long basic numerical calculations take
> on current hardware, I ran the speed test
> (src/lib/numerics/3d/test-speed.cc) on different platforms.
>
> The results are below. This is just plain C++ code compiled; there are
> no hand-coded SSE vectorisations or similar.
>
> The most remarkable fact is that using floats instead of doubles may have
> the opposite effect than expected... Especially note the value in brackets
> at the end of the lines: it is the ratio of double and float calc times,
> i.e. values larger than 1 mean "double takes longer" while values smaller
> than 1 mean "double is faster than float".
>
> Wolfgang
>
> BTW, in the last email I forgot to mention that longjmp'ing between thread
> contexts is undefined in POSIX. So, we need to be prepared for the case
> that it cannot be done.
>
> --<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>---------------------------
> (nothing)               : 0.533687 nsec/cyc
> vector3::length         : flt: 28.0004 dbl: 26.718  nsec/cyc (0.95)
> vector3*vector3         : flt: 10.2434 dbl: 22.804  nsec/cyc (2.23)
> vector3 x vector3       : flt: 15.1952 dbl: 34.8585 nsec/cyc (2.29)
> matrix3*vector3         : flt: 26.2452 dbl: 40.8993 nsec/cyc (1.56)
> trafomatrix*vector3     : flt: 25.6166 dbl: 25.7406 nsec/cyc (1.00)
> trafomatrix::inverse    : flt: 126.307 dbl: 96.1046 nsec/cyc (0.76)
> trafomatrix*trafomatrix : flt: 99.8809 dbl: 113.816 nsec/cyc (1.14)
> matrix3*matrix3         : flt: 75.9704 dbl: 131.277 nsec/cyc (1.73)
>
> --<P4 2.80 GHz Linux-2.6.11 gcc-3.4.4, LinuxThreads>------------------------
> (nothing)               : 0.118761 nsec/cyc
> vector3::length         : flt: 15.69   dbl: 15.5786 nsec/cyc (0.99)
> vector3*vector3         : flt: 17.0779 dbl: 5.04801 nsec/cyc (0.30)
> vector3 x vector3       : flt: 16.3483 dbl: 11.8949 nsec/cyc (0.73)
> matrix3*vector3         : flt: 18.4656 dbl: 17.3939 nsec/cyc (0.94)
> trafomatrix*vector3     : flt: 18.4069 dbl: 17.374  nsec/cyc (0.94)
> trafomatrix::inverse    : flt: 120.072 dbl: 84.1365 nsec/cyc (0.70)
> trafomatrix*trafomatrix : flt: 79.1698 dbl: 86.2536 nsec/cyc (1.09)
> matrix3*matrix3         : flt: 61.501  dbl: 68.3568 nsec/cyc (1.11)
>
> --<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>--------------------------------------
> (nothing)               : 0.287412 nsec/cyc
> vector3::length         : flt: 16.2991 dbl: 16.4841 nsec/cyc (1.01)
> vector3*vector3         : flt: 10.1551 dbl: 5.49865 nsec/cyc (0.54)
> vector3 x vector3       : flt: 16.7179 dbl: 9.67714 nsec/cyc (0.58)
> matrix3*vector3         : flt: 27.3062 dbl: 19.6488 nsec/cyc (0.72)
> trafomatrix*vector3     : flt: 27.463  dbl: 19.1422 nsec/cyc (0.70)
> trafomatrix::inverse    : flt: 70.7865 dbl: 64.1694 nsec/cyc (0.91)
> trafomatrix*trafomatrix : flt: 83.5633 dbl: 68.334  nsec/cyc (0.82)
> matrix3*matrix3         : flt: 67.8196 dbl: 59.2467 nsec/cyc (0.87)
>
> --<AthlonXP 1.47GHz Linux-2.6.11 gcc-3.4.2 for Win32, NPTL, WINE EMULATION!>--
> (nothing)               : 0.356384 nsec/cyc
> vector3::length         : flt: 29.7528 dbl: 27.0064 nsec/cyc (0.91)
> vector3*vector3         : flt: 15.3446 dbl: 7.66602 nsec/cyc (0.50)
> vector3 x vector3       : flt: 21.3162 dbl: 13.4919 nsec/cyc (0.63)
> matrix3*vector3         : flt: 26.0698 dbl: 26.6165 nsec/cyc (1.02)
> trafomatrix*vector3     : flt: 26.3388 dbl: 25.8365 nsec/cyc (0.98)
> trafomatrix::inverse    : flt: 126.99  dbl: 114.622 nsec/cyc (0.90)
> trafomatrix*trafomatrix : flt: 128.008 dbl: 134.121 nsec/cyc (1.05)
> matrix3*matrix3         : flt: 102.492 dbl: 105.253 nsec/cyc (1.03)
From: Wolfgang W. <ww...@gm...> - 2005-05-22 20:54:11
Hello everybody,

In order to get an estimate of how long basic numerical calculations take on current hardware, I ran the speed test (src/lib/numerics/3d/test-speed.cc) on different platforms.

The results are below. This is just plain C++ code compiled; there are no hand-coded SSE vectorisations or similar.

The most remarkable fact is that using floats instead of doubles may have the opposite effect than expected... Especially note the value in brackets at the end of the lines: it is the ratio of double and float calc times, i.e. values larger than 1 mean "double takes longer" while values smaller than 1 mean "double is faster than float".

Wolfgang

BTW, in the last email I forgot to mention that longjmp'ing between thread contexts is undefined in POSIX. So, we need to be prepared for the case that it cannot be done.

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>----------------------------
(nothing)               : 0.533687 nsec/cyc
vector3::length         : flt: 28.0004 dbl: 26.718  nsec/cyc (0.95)
vector3*vector3         : flt: 10.2434 dbl: 22.804  nsec/cyc (2.23)
vector3 x vector3       : flt: 15.1952 dbl: 34.8585 nsec/cyc (2.29)
matrix3*vector3         : flt: 26.2452 dbl: 40.8993 nsec/cyc (1.56)
trafomatrix*vector3     : flt: 25.6166 dbl: 25.7406 nsec/cyc (1.00)
trafomatrix::inverse    : flt: 126.307 dbl: 96.1046 nsec/cyc (0.76)
trafomatrix*trafomatrix : flt: 99.8809 dbl: 113.816 nsec/cyc (1.14)
matrix3*matrix3         : flt: 75.9704 dbl: 131.277 nsec/cyc (1.73)

--<P4 2.80 GHz Linux-2.6.11 gcc-3.4.4, LinuxThreads>-------------------------
(nothing)               : 0.118761 nsec/cyc
vector3::length         : flt: 15.69   dbl: 15.5786 nsec/cyc (0.99)
vector3*vector3         : flt: 17.0779 dbl: 5.04801 nsec/cyc (0.30)
vector3 x vector3       : flt: 16.3483 dbl: 11.8949 nsec/cyc (0.73)
matrix3*vector3         : flt: 18.4656 dbl: 17.3939 nsec/cyc (0.94)
trafomatrix*vector3     : flt: 18.4069 dbl: 17.374  nsec/cyc (0.94)
trafomatrix::inverse    : flt: 120.072 dbl: 84.1365 nsec/cyc (0.70)
trafomatrix*trafomatrix : flt: 79.1698 dbl: 86.2536 nsec/cyc (1.09)
matrix3*matrix3         : flt: 61.501  dbl: 68.3568 nsec/cyc (1.11)

--<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>---------------------------------------
(nothing)               : 0.287412 nsec/cyc
vector3::length         : flt: 16.2991 dbl: 16.4841 nsec/cyc (1.01)
vector3*vector3         : flt: 10.1551 dbl: 5.49865 nsec/cyc (0.54)
vector3 x vector3       : flt: 16.7179 dbl: 9.67714 nsec/cyc (0.58)
matrix3*vector3         : flt: 27.3062 dbl: 19.6488 nsec/cyc (0.72)
trafomatrix*vector3     : flt: 27.463  dbl: 19.1422 nsec/cyc (0.70)
trafomatrix::inverse    : flt: 70.7865 dbl: 64.1694 nsec/cyc (0.91)
trafomatrix*trafomatrix : flt: 83.5633 dbl: 68.334  nsec/cyc (0.82)
matrix3*matrix3         : flt: 67.8196 dbl: 59.2467 nsec/cyc (0.87)

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-3.4.2 for Win32, NPTL, WINE EMULATION!>--
(nothing)               : 0.356384 nsec/cyc
vector3::length         : flt: 29.7528 dbl: 27.0064 nsec/cyc (0.91)
vector3*vector3         : flt: 15.3446 dbl: 7.66602 nsec/cyc (0.50)
vector3 x vector3       : flt: 21.3162 dbl: 13.4919 nsec/cyc (0.63)
matrix3*vector3         : flt: 26.0698 dbl: 26.6165 nsec/cyc (1.02)
trafomatrix*vector3     : flt: 26.3388 dbl: 25.8365 nsec/cyc (0.98)
trafomatrix::inverse    : flt: 126.99  dbl: 114.622 nsec/cyc (0.90)
trafomatrix*trafomatrix : flt: 128.008 dbl: 134.121 nsec/cyc (1.05)
matrix3*matrix3         : flt: 102.492 dbl: 105.253 nsec/cyc (1.03)
From: Wolfgang W. <ww...@gm...> - 2005-05-22 14:41:38
|
Hello everybody! Recently, I've been thinking about the RT core model, i.e. how the actual calculation should be split for different boxes. The idea I would like to present below is based on discussion with Andrew about stream processing and the "Kilauea paper" and follows my proposal from some time back about using as many worker threads as CPUs in the system and additionally doing userspace (coroutine) switching. Overview ======== The core consists of the following main parts: - a request PQ (priority queue) - processing kernels; one for each request type - a list of saved states (coroutines) - a response queue - an (optional) network interface Requests all have a priority, a return-path (i.e. which computer and couroutine generated the request and needs to receive the corresponding response) and a type. The type can be something like "calculate ray shape intersection", "shade this ray", "completely trace this ray",... The system below is independent of these types; there just needs to be a kernel which can process a request of each type. Operation ========= 1. Requests are queued in the request PQ according to their priority (higher prio first). Inputs for the queue are network (received request) or internal sources. 2. There are several worker threads (as many as there are CPUs). As soon as a request is available (and there are no responses to be handeled locally) and a thread is "free" (i.e. needs more work to do), it [creates a new coroutine context and] dequeues the request and then processes it using the apropriate kernel routines. 3. The kernel then computes a response ("result") which is fed into the response queue to be "sent" back according to the request's return path (either via network or internally, see below). 4. The kernel itself may need to make further requests in order to process the current request. E.g. for shading a surface, tracing several rays (e.g. shadow rays) may be needed. 
The kernel will then enqueue these requests in the request PQ with a priority which is larger than the prio of the current request. The request PQ will decide whether to keep them in the local queue or send the requests over the network (thereby removing them from the local queue). When all the sub-requests are queued, the thread must "wait" for the responses. Instead of doing nothing, the current coroutine state is saved and the thread continues as "free" (see 2. and 5.). 5. If a worker thread is "free" and there is a response (in the response Q) which needs to be processed locally, the thread takes this response and continues with the saved coroutine state which corresponds to the response. It is important that 4. takes precedence over 2. so that responses are handeled before new requests. Here is a figure showing the flow of execution for the threads: http://www.cip.physik.uni-muenchen.de/~wwieser/tmp/files/ray-struct.png Portability =========== In order for this concept to be portable, at least 3 threads are required: 1. There is one network receive thread [could be more as well] which simply "poll(2)"s/listens for incoming requests/responses over the network and queues them in the request PQ. 2. There is a network send thread which is either blocked by the request/ response queues waiting for something to send or blocked in poll(2) [or analogous windows function] trying to send something. [This cannot be done by the thread from 1 because therefore the thread would have to be able to simultaniously to a poll() and monitor the two queues.] 3. At least one worker thread with several coroutine states. The design above demands for coroutine switching between threads, i.e. that one thread can execute a coroutine created by another thread. According to my tests, this makes no trouble under Linux and FreeBSD but is currently not possible under Windows (src/lib/threads/test-switch.cc). 
This means that [as long as we find no solution] the Windows version will - use only one worker thread (i.e. suitable for uniprocessor boxes) or - use one thread per coroutine context or - use several threads but keep coroutine contexts tied to their threads [which means that instead of processing responses in the order they arrive the next "free" thread processes the first response with a compatible coroutine context] Performance =========== The most critical part is performance and how much overhead is introduced by the above approach. I made some tests under Linux (ix86), FreeBSD (AMD64) and Windows (ix86). For a more detailed view of timings, see my next email. As for now, a the basic work cycle is: - Take a request from a queue. - Switch to the apropriate coroutine. - [Process the request. This is a no-op here.] - Put a response into a queue. - Switch the coroutine back. [see src/lib/threads/test-switch.cc] One such work cycle is 370 nsec (0.37 microseconds) on an AhlonXP with 1.47 GHz running Linux-2.6.11 using gcc-4.1 and a glibc with NPTL support. This is about 15 vector transformations (matrix * vector) or 3 matrix inversions. OTOH, on an AMD64 (Athlon64) with 1.8 GHz running FreeBSD-5.4 and gcc-3.4, one such work cycle is 5100 nsec. (Yes, 5.1 usec!) This sucks because it is something like 200 vector transformations on that box. There are no good timings for Windows because the code fails to run properly when optimization is enabled during compiling (use -O0). Without optimization, I get 4.5 usec on an P4-M, 1.18 GHz. (Note that optimization (-O2) improves runtime by a factor of 4.5 on the Linux box.) Conclusion ========== Threads & coroutines can be used to implement a request-response stream-like system. However, the requests should be chosen in a way that the pure kernel calculation time of each such request consumes an order of magnitude more CPU time than the overhead introduced by the queues and switching. If anyone (Andrew?) 
has estimates for the ray-surface intersection times on current hardware for "typical" scenes (e.g. a large triangle mesh), please post them for comparison. Wolfgang |
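The queue discipline described in the email above (sub-requests enter the request PQ at higher priority than their parent; a "free" worker handles pending responses before new requests) can be sketched roughly as follows. This is a minimal single-queue illustration only; all names are invented here, not taken from the actual RAY sources, and the coroutine switching and network threads are omitted.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct Request {
    int prio;   // sub-requests get a higher prio than their parent request
    int id;
};

struct ByPrio {
    bool operator()(const Request &a, const Request &b) const
    { return a.prio < b.prio; }
};

class WorkQueues {
    std::priority_queue<Request, std::vector<Request>, ByPrio> requests;
    std::queue<int> responses;   // ids of saved coroutine states
public:
    void pushRequest(const Request &r) { requests.push(r); }
    void pushResponse(int coroutineId) { responses.push(coroutineId); }

    // What a "free" worker picks up next; -1 means idle.
    // Responses take precedence over new requests (step 5 before step 2).
    int next(bool &isResponse) {
        if (!responses.empty()) {
            isResponse = true;
            int id = responses.front();
            responses.pop();
            return id;
        }
        if (!requests.empty()) {
            isResponse = false;
            int id = requests.top().id;
            requests.pop();
            return id;
        }
        return -1;
    }
};
```

A real implementation would guard both queues with a mutex/condition variable and, on picking up a response, switch to the saved coroutine state rather than just returning an id.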
From: Wolfgang W. <ww...@gm...> - 2005-05-09 08:32:20
|
Hello Everybody! While not advancing much in the development of the code itself, I spent a lot of hours in the last 2 weeks getting the code to compile for the MS Windows (win32) platform. And good news: I finally succeeded! So, the complete code in devel-ray/src/ can now be compiled for win32 using the MinGW environment (Minimalist GNU for Windows; I recommend the Debian packages) under Linux using a cross-compiling gcc (i586-mingw32msvc-gcc, version 3.4.2). Also, I downloaded Tor Lillqvist's pre-compiled GLib. The nice thing is that one can do all of it on the local Linux PC in a second shell and just type "make" into the native and the cross build terminal. All that without losing the comfort of Unix shells... Of course, the check programs ("make check") are not executed on Linux, but I tested them on a Windows XP box. So, what we get are native Windows console programs. Don't mix that up with DOS mode. It is native Windows, 32 bit, and we can access all RAM (not just 640kb), threads, and so on. The only thing not yet working is coroutines. Unfortunately, I currently do not understand why because (I think at least) it is the same switching code as used in libpcl, which works -- at least according to the author... Anybody here with some experience or knowledge on how to write a non-console application for Windows? Regards, Wolfgang |
From: Wolfgang W. <ww...@gm...> - 2005-03-31 22:57:58
|
On Tuesday 22 March 2005 22:10, Andrew Clinton wrote: > remember reading somewhere that for a good accelerator, the number of cells >"stabbed" by a ray will be something like O(n^(1/3)) (cube root). If we only >get the first intersection, its probably more like O(log(n)). > Very good. Thanks for the infomation. > I think we could probably get away with an even simpler hash table. For > example, just use a fixed-size table (eg. 256 entries), and allow for some > small number of re-intersection tests (in this case, maybe around 1/128 of > tests would have a collision). This would be really simple to implement, > fast, and uses little space while getting most of the benefit of other data > structures. > Yes, another interesting idea. We could even get along with 256 booleans in a bitfield and let the hash compute the bit index. > The problem with using boolean values for this purpose is that we need some > way to reset them once the ray intersections have been completed for a > single ray > Of course. But as this bitfield is stored in one consecutive area in the ray data structure, the reset operation is basically a memset(bitfield,0,size); Especially when combining it with the hash approach introduced by you above, the reset for a 256-entry hash is setting 8 integers (with 32bit each) to 0. Really cheap. And especially it fits nicely into the cache. > Anyways, the user can > always be able to store more geometry by going and buying more RAM... > Sure. But for example I myself cannot go and buy more RAM for the >=25-node P4 HT cluster I have access to... Each of these boxes has 1Gb, so the total is 25Gb which means the cluster can store a lot more scene information than one box. Concerning scenes with huge amounts of data: Being able to handle them has been a design goal right from the beginning. (Because I am doing these planet topography renders: E.g. the easiest way for a flight on Mars would be to simply load the complete scene information -- which is 2.4Gb... 
Currently I am handling that with a POVRay hack which loads only parts of the image into RAM -- but it is such a crude hack that I do not dare to publish the patch, and it requires knowledge of the required area in advance.) PRIMITIVES ---------- I'm currently making my way through PBRT (not in-depth, yet). I like the simplicity of the design of their interface for Primitives. However, I would do things a little differently: - All objects live in object coordinates and are instanced for the scene. I.e. all things seen in the scene are "instance primitives". This way one can eliminate one transformation for each instance. - There are special parametric transformations which allow e.g. having 1000 objects in a line without instancing them 1000 times. The first problem is how to identify these in an accelerator (e.g. using a pointer and an index as mentioned earlier by me), but I failed to think up an efficient and simple way to handle that for arbitrary depths. So, I'd suggest to instead use separate accelerators for these parametric transformations. The advantage is that the mentioned "objects-in-a-line" trafo could internally do an O(1)-like acceleration in case the objects' AABBs do not overlap much. - CSG seems to fit naturally into the framework by using CSG difference and CSG intersection primitives. This would directly allow the use of BVH bounding. Andrew, since you are the expert for that: How would you suggest introducing CSG with Kd-tree bounding when starting from the known "Primitive" from PBRT? (I would like to do CSG with just about anything which can possibly be given an inside and an outside, much like POVRay.) (BTW, it seems that one could easily allow the user to do a special type of bounding on certain primitives, thereby giving an opportunity to compensate for the performance loss at shells7.pov using BSP...) Another question: Does anybody know a good way to prevent self-shadowing (other than the usual minimum epsilon depth)? Regards, Wolfgang |
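The fixed-size mailbox table discussed in this thread (256 entries, cheap per-ray reset via memset, collisions causing only a few harmless re-intersection tests) could look roughly like this. A minimal sketch under stated assumptions: the direct-mapped layout, the hash of the primitive pointer, and all names are invented here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Direct-mapped mailbox table: one slot per hash bucket. A collision merely
// evicts the old entry, causing a redundant re-test of that primitive --
// never a missed intersection.
struct MailboxTable {
    static const std::size_t N = 256;    // fixed size, power of two
    const void *slot[N];

    // Per-ray reset is a single memset over 256 pointers -- cheap, and the
    // whole table fits nicely into the cache.
    void reset() { std::memset(slot, 0, sizeof(slot)); }

    // Returns true if 'prim' still needs to be intersected for this ray.
    bool mark(const void *prim) {
        // Shift away alignment zero bits before masking into the table.
        std::size_t h =
            (reinterpret_cast<std::uintptr_t>(prim) >> 4) & (N - 1);
        if (slot[h] == prim)
            return false;     // already tested for this ray
        slot[h] = prim;       // record (possibly evicting a colliding entry)
        return true;
    }
};
```

Each worker thread (or coroutine/ray) would own one such table, so no locking is needed and the accelerator and scene graph stay read-only.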
From: <ajc...@en...> - 2005-03-22 21:10:41
|
> I wanted to make the point about storing the information in the _ray_. > > Which data structure to choose here is an interesting question. > Direct storage in the primitive is of course the fastest way. > > Using a tree or hash requires dynamic resizing of this structure, which > is slow; however, using chunked allocation will fix that. > Another way might be to associate a "cost" with intersection calculation > and only store a fixed number of tested objects: those which cost most. > I think we could probably get away with an even simpler hash table. For example, just use a fixed-size table (eg. 256 entries), and allow for some small number of re-intersection tests (in this case, maybe around 1/128 of tests would have a collision). This would be really simple to implement, fast, and uses little space while getting most of the benefit of other data structures. > Another idea would be to give each primitive a serial number in the > range 0..N-1 (with N being the number of primitives) and use a > per-coroutine bitfield indexed with the primitive serial number to store > the information. Memory consumption would be fixed to N/8 bytes > per coroutine (or per simultaneous ray). However, the information > would not be cache-local and would probably waste memory, although being > faster (algorithmically) than a hash. > The problem with using boolean values for this purpose is that we need some way to reset them once the ray intersections have been completed for a single ray (I thought about this before when working on the POV-Ray patch). This is possible but it will require auxiliary storage for the list of primitives, so that we can go back and reset the booleans when the ray is completed. > How many such (already intersected) objects do we have to expect to be > stored for one ray? > I remember reading somewhere that for a good accelerator, the number of cells "stabbed" by a ray will be something like O(n^(1/3)) (cube root). 
If we only get the first intersection, it's probably more like O(log(n)). So I'd imagine that in almost all cases there will be fewer than ~256 objects intersected by the ray, for scenes that fit in memory. > <Splitting the scene information> > This is an interesting question and it will require more investigation. However, I'm still concerned that this type of scene splitting will be overkill in almost all cases. It would be much better to have good support for instancing, geometry/texture caching, and procedurals so that the user can actively control the memory use of their scene. Anyways, the user can always store more geometry by going and buying more RAM... Andrew ---------------------------------------- This mail sent through www.mywaterloo.ca |
From: Wolfgang W. <ww...@gm...> - 2005-03-20 22:24:44
|
<News> Seems like I finally got the first version of the C++-like SPL grammar (see src/spl/grammar.yy). I spent much more time on that than I thought. Special SPL features (like nice specification of blend maps) still need to be added but are generally unlikely to cause difficulties (i.e. introduce s/r or r/r conflicts) as long as they are prefixed by a reserved word (like "map"). I'll now push forward with the SPL compiler. On Friday 18 March 2005 02:43, Andrew Clinton wrote: > > It seems to me that the natural way which remains is to store a list of > > intersected primitives in the ray. The downside is that this sort of > > thing depends a bit on the accelerator used (BVH does not need mail > > box numbers, I guess...) > > Yes, a list would work but it will scale poorly if the ray needs to test a > lot of primitives (need to test everything in the list on each > intersection). That's why I mentioned we might want to do it with a hash > table. > I wanted to make the point about storing the information in the _ray_. Which data structure to choose here is an interesting question. Direct storage in the primitive is of course the fastest way. Using a tree or hash requires dynamic resizing of this structure, which is slow; however, using chunked allocation will fix that. Another way might be to associate a "cost" with intersection calculation and only store a fixed number of tested objects: those which cost most. Another idea would be to give each primitive a serial number in the range 0..N-1 (with N being the number of primitives) and use a per-coroutine bitfield indexed with the primitive serial number to store the information. Memory consumption would be fixed to N/8 bytes per coroutine (or per simultaneous ray). However, the information would not be cache-local and would probably waste memory, although being faster (algorithmically) than a hash. How many such (already intersected) objects do we have to expect to be stored for one ray? 
<Splitting the scene information> I think CSG could be split among several computers if we transfer the CSG root and (maybe) the first few levels to all computers and the subsequent subtrees to different boxes. This works if we store a description of which computer has the required information at the place where a subtree was cut off. However, this is a bit more complicated than just passing rays. Another interesting thing one tends to overlook is how to make "instancing" work with split scene information and CSG. Kilauea (without CSG) could simply make sure that one box has all instances of a given object. This, however, undermines equal load balancing. For CSG, it is even harder. The easy solution is (of course) to give each box a copy. As a general consideration, the Kilauea people have correctly pointed out that moving rays is better than moving geometry because geometry can be huge. This is the "only" reason why I am considering it. However, the problem is that if geometry does not only consist of triangles, it is impossible to know in advance how to split it so that all boxes will be finished at the same time. (Otherwise, that half of the boxes which got the isosurface is still busy while all those with just triangles are idle and cannot help.) This, it seems, does not happen when moving geometry (on-demand loading) because all computers would process rays as long as there are rays to compute. However, loading and possibly re-loading the geometry (if it is too large to fit into the memory of one box and hence has to be cached) eats up a lot of time and bandwidth. Furthermore, the server(s) tend(s) to become the bottleneck. So, the interesting question is whether there is an easy way to combine these two approaches. Cheers, Wolfgang |
From: <ajc...@en...> - 2005-03-18 01:43:28
|
> In order for multi-threading to work we may store the mailbox > number neither in the accelerator nor in the scene graph. Both should be used > read-only by the rendering threads. > > It seems to me that the natural way which remains is to store a list of > intersected primitives in the ray. The downside is that this sort of thing > depends a bit on the accelerator used (BVH does not need mail > box numbers, I guess...) > Yes, a list would work but it will scale poorly if the ray needs to test a lot of primitives (need to test everything in the list on each intersection). That's why I mentioned we might want to do it with a hash table. > (Anyways, the above about mailbox numbers was probably not the whole > story because we may want to get all intersections with an object, so some > sort of inside/outside information must be used as well, I guess.) > If we want to get all intersections, then the primitive intersection routine should do that for us. There should not be a need to call intersection routines more than once for a ray. I use a list of Intersection objects in my ray tracer for tests that need to get all the intersections for a ray (ie. I have two virtual methods on all objects: nearestIntersection and allIntersections. We might even want an anyIntersection method for simple visibility tests). > One other thing which is not described in the paper is how the "first- > come-first-served" works. If box A wants to have a ray traced and > sends a message to all the other boxes, how can box C and D know > that B accepted it without introducing a race? I see only 2 solutions: > 1. Use one server S which keeps track of all rays. A sends the ray to S > which sends it to B. > 2. A directly sends the ray to B without telling anybody else. > Unfortunately, 1 introduces a bottleneck: the server S. Fortunately, I > already have an idea how one could do 2. (later) > This is something that has confused me as well. The only clue they give is in section 3.3. 
They mention that rays can be balanced by allowing hosts to take rays from an input queue on first-come-first-served basis, but the rest of the discussion does not mention any global queue to do this. I'm guessing they do something like (2) with local queues on each host, by sending load information messages between the sets. If there is not enough space in the local queue to process a ray, the host will consult the loading table and send the ray appropriately. Just speculation... Andrew ---------------------------------------- This mail sent through www.mywaterloo.ca |
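The three-method intersection interface Andrew describes above (nearestIntersection, allIntersections, and a cheap anyIntersection for visibility tests) could be sketched like this. The method names come from the mail; the Ray, Intersection, and toy Slab types are invented here purely for illustration and are not from anyone's actual ray tracer.

```cpp
#include <cassert>
#include <vector>

struct Ray { double tmin, tmax; };     // toy 1-D ray: just a valid t-range
struct Intersection { double t; };     // real code would carry surface data

struct Primitive {
    virtual ~Primitive() {}
    virtual bool nearestIntersection(const Ray &r, Intersection &is) const = 0;
    virtual void allIntersections(const Ray &r,
                                  std::vector<Intersection> &out) const = 0;
    // Cheap early-out test for shadow/visibility rays.
    virtual bool anyIntersection(const Ray &r) const = 0;
};

// Toy primitive hit at exactly t = a and t = b (think: entry/exit points of
// a solid, a < b), so CSG-style all-intersection queries make sense.
class Slab : public Primitive {
    double a, b;
public:
    Slab(double a_, double b_) : a(a_), b(b_) {}
    bool nearestIntersection(const Ray &r, Intersection &is) const {
        if (r.tmin <= a && a <= r.tmax) { is.t = a; return true; }
        if (r.tmin <= b && b <= r.tmax) { is.t = b; return true; }
        return false;
    }
    void allIntersections(const Ray &r, std::vector<Intersection> &out) const {
        if (r.tmin <= a && a <= r.tmax) out.push_back(Intersection{a});
        if (r.tmin <= b && b <= r.tmax) out.push_back(Intersection{b});
    }
    bool anyIntersection(const Ray &r) const {
        Intersection dummy;
        return nearestIntersection(r, dummy);
    }
};
```

An accelerator could implement the same Primitive interface itself (as discussed earlier in the thread for pbrt-style designs), so callers never care whether they talk to a leaf object or a whole tree.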
From: Wolfgang W. <ww...@gm...> - 2005-03-18 00:12:53
|
On Tuesday 15 March 2005 02:39, ajc...@en... wrote: > I'm not sure whether there is any other benefit to the ray-level > partitioning except to make it easier to split up the scene database when > it is too large for a single host. > I guess there is in case the image is very inhomogeneous, like having tiny spots which trigger lots of secondary rays. Something which could happen when rendering an image from space (lots of black background, some interesting spots). > Yes, I think their partitioning strategy is only going to work with simple > primitives. Other stuff will need to be duplicated on all the nodes > (doesn't seem too harsh a constraint, I don't think we expect to have > millions of isosurfaces or CSG models). > Well, but one can do CSG operations on meshes when defining inside/outside of the mesh. IIRC POVRay can do that. > > > If we take this approach to threading, we'll need to worry about the > > > following access control issues: > > > 1. Access to mailbox numbers on primitives in the accelerator > > > 2. Access to a read/write irradiance cache (if we use one) > > > 3. Writeback to the frame buffer (we can do this in blocks - this is > > > one reason we should probably use blocks of pixels rather than rays as > > > the basic unit of parallelism). > 1. Mailboxes are used to store the ray number on a primitive so that while > traversing the accelerator we only intersect it once (before intersecting > any primitive, compare the mailbox number with the current ray => if equal, > then the primitive has already been tested). > In order for multi-threading to work we may store the mailbox number neither in the accelerator nor in the scene graph. Both should be used read-only by the rendering threads. It seems to me that the natural way which remains is to store a list of intersected primitives in the ray. The downside is that this sort of thing depends a bit on the accelerator used (BVH does not need mailbox numbers, I guess...) 
If the maximum number of "threads" is known in advance (and this may not be the case if coroutines are created on demand), then the array method you mentioned could be used as well. (Anyways, the above about mailbox numbers was probably not the whole story because we may want to get all intersections with an object, so some sort of inside/outside information must be used as well, I guess.) > [2] ... Depending on how we implement it, it will result in > nondeterministic image output: eg. a simple race between two threads will > cause a different image result when one gets the cached value and the other > needs to create it (or vice-versa). > Well, it is right that this will cause the image to be non-deterministic. But this is due to the fact that the irradiance cache is somewhat flawed by design in this respect because the result depends on the order in which rays are calculated (and thereby cache entries are generated). > 3. I mentioned so that we should avoid locking whenever a ray makes a > contribution back into the buffer. > Don't worry too much about locking. Locking mainly becomes a problem for scalability if the thread holding the lock is likely to hold it for a longer time. For the above problem, there is e.g. a very simple solution which can work on single pixels as well by using private output queues in each thread: The thread will queue 8x8=64 pixels in its private output queue and then lock once to transfer the complete queue to the framebuffer handler (this is 4 pointer assignments and hence the lock will not be held very long). BTW, thanks for pointing me at the paper by Christensen (ray differentials). Although I find the "original" Igehy paper more informative concerning ray differential computation, the former has some more information about Kilauea. Seems like they tesselate everything beforehand. This (IMO) means that 1. they do not have a problem splitting the scene (it's just meshes) 2. all ray intersections take about equally long. 
Concerning (2): In case Kilauea (as described) had to calculate isosurface intersections as well, those boxes which happen to have that part of the scene without an isosurface object will in the end be idle and wait for all those computers which still have slow isosurface intersections to calculate. This is due to the way they divide up the scene but probably they do not see the problem because they'd tesselate everything first in any case. One other thing which is not described in the paper is how the "first- come-first-served" works. If box A wants to have a ray traced and sends a message to all the other boxes, how can box C and D know that B accepted it without introducing a race? I see only 2 solutions: 1. Use one server S which keeps track of all rays. A sends the ray to S which sends it to B. 2. A directly sends the ray to B without telling anybody else. Unfortunately, 1 introduces a bottleneck: the server S. Fortunately, I already have an idea how one could do 2. (later) Wolfgang |
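The per-thread private output queue described above (buffer an 8x8 block, then take the framebuffer lock once per 64-pixel transfer instead of once per pixel) could look like this minimal sketch. All class names are invented; the real handler would presumably swap queue pointers rather than copy, as the mail suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

struct Pixel { int x, y; float r, g, b; };

// Shared framebuffer handler: its lock is taken once per block transfer
// and held only briefly.
class FrameBufferSink {
    std::mutex mtx;
    std::vector<Pixel> frame;   // stand-in for the real framebuffer
public:
    void flushBlock(std::vector<Pixel> &local) {
        std::lock_guard<std::mutex> lock(mtx);
        frame.insert(frame.end(), local.begin(), local.end());
        local.clear();
    }
    std::size_t pixelCount() {
        std::lock_guard<std::mutex> lock(mtx);
        return frame.size();
    }
};

// Per-thread private queue: completely lock-free until a block is full.
class PixelQueue {
    std::vector<Pixel> local;
public:
    void put(FrameBufferSink &fb, const Pixel &p) {
        local.push_back(p);
        if (local.size() >= 64)      // 8x8 block complete -> one locked flush
            fb.flushBlock(local);
    }
    void finish(FrameBufferSink &fb) {      // drain the remainder at the end
        if (!local.empty())
            fb.flushBlock(local);
    }
};
```

Because each worker touches only its own PixelQueue, contention is limited to one short critical section per 64 pixels, which keeps the scalability concern from point 3 of the earlier mail small.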
From: <ajc...@en...> - 2005-03-15 18:51:20
|
I've just seen the film "Robots" (Blue Sky Studios) and was really impressed with the graphics. Notably, a number of the shots had effects that clearly couldn't have been done without ray tracing... Here is a page about their rendering software: http://www.blueskystudios.com/content/process-tools.php It turns out that the entire movie was actually ray traced! Another interesting point: they trace curved patches (even subdivision surfaces) without tesselation. I'd be really interested to know how they have been able to do this efficiently. Of course they also have a renderfarm with 500+ machines :) Andrew ---------------------------------------- This mail sent through www.mywaterloo.ca |
From: <ajc...@en...> - 2005-03-15 01:40:02
|
Quoting Wolfgang Wieser <ww...@gm...>: > > (If you don't have ACM access I can mail you the paper) > > > Well, it turned out that we do not have ACM in the list of subscribed > journals at the LMU, so I walked over to the guys at the TUM to get an > account there (something I should have done eariler anyways). > > The design seems to be a quite good one. Especially, I like the idea of > using and passing rays and photons because the simplicity in which > the massive parallelism is achieved is appealing. Furthermore, this design > is highly parallel without using a "hard-core" stream design with small > kernels for everything and hence nearer to a traditional sequential renderer > > and also easier to implement. Actually, it is somehow like I dreamt of > being able to do some months ago: Dividing up work at the ray level. > Although they don't mention it in the paper, I'd imagine that the partitioning of eye rays to rendering engine threads uses some sort of dynamic tile assignment, I've seen this mentioned in some other papers. I'm not sure whether there is any other benefit to the ray-level partitioning except to make it easier to split up the scene database when it is too large for a single host. > However, as usual, people don't comment on the interesting details. > Let's consider a scene which consists only of one very large CSG tree. > Can it be split without having to copy some non-leaf nodes? > If not, a uniform distribution would cause lots of overhead and consequently, > > subtrees would have to be distributed. This, in turn, may make some boxes > (with important subtrees) much more busy than others. > > Or let's say one box happens to get the isosurface while the others only > have triangle meshes. Guess who will be waiting? > Yes, I think their partitioning strategy is only going to work with simple primitives. 
Other stuff will need to be duplicated on all the nodes (doesn't seem too harsh a constraint, I don't think we expect to have millions of isosurfaces or CSG models). > Furthermore, where is the texture information stored? Where the interior > information (like index of refraction, media)? > Probably these things are assigned as object level rather than triangle level, so it is not too unreasonable to duplicate it among the nodes. > > This is usually done by making the accelerator have the same interface as > > the primitives themselves. This is what I've done in my raytracer, and is > > also what is done in pbrt. > > > ...and would you advise against doing so? > I would advise in favor of doing so. > > > I think our system should make use of ray differentials to choose the > level > > of tesselation. Check online for the paper "Ray differentials for > > multiresolution geometry caching" by Christensen. This way, highly > > incoherent rays for GI can use a low-res tesselation, saving lots of > > effort. > > > Something similar could be done for isosurfaces by bowering the solver > accuracy in this case. I actually spent some thoughts on that already. > (I'll check out the paper because estimating the ray differential was the > point I got stuck at.) > Lowering accuracy of isosurface sounds like a good idea, I wonder if it will work in practise. > > One more thing (I don't think I was that clear) is that tesselated > geometry > > can be put directly into the accelerator. > > > Actually, you were clear enough, although I was not aware on how to exactly > put the information into the leaf. > And I do not yet fully understand it (triangle duplication problem). > Triangles that lie on accelerator cell boundaries will need to be duplicated. The alternative is to use lists of pointers to refer to them, in which case we introduce a level of indirection (making caching performance worse), and need to store lists of pointers (increasing memory use). 
> [previously] > > If we take this approach to threading, we'll need to worry about the > > following access control issues: > > 1. Access to mailbox numbers on primitives in the accelerator > > 2. Access to a read/write irradiance cache (if we use one) > > 3. Writeback to the frame buffer (we can do this in blocks - this is one > > reason we should probably use blocks of pixels rather than rays as the > > basic unit of parallelism). > > > Maybe I am blind but I barely see problems. > 1. is used read-only (don't we?), > 2. is a classical locking issue and probably a distribution problem (OK) and > > 3. is just a minor locking/queuing problem > (Using blocks of pixels for each node is interesting because of the > resulting positive effect on caching due to image space coherence -- > especially when demand-loading textures/geometry.) > I was just trying to enumerate the cases where we will need to lock global data structures. 1. Mailboxes are used to store the ray number on a primitive so that while traversing the accelerator we only intersect it once (before intersecting any primitive, compare the mailbox number with the current ray => if equal, then the primitive has already been tested). This can improve performance a lot, check out the source code for my patch (I added mailbox numbers to the OBJECT struct). There are multithreading issues because two threads could need to update the mailbox number of a primitive independently. One alternative I can think of is to make an array of mailbox numbers per primitive, but memory use will increase. We could also potentially use an alternative method of recording primitives that are visited within a thread (perhaps some sort of hashing would do the trick...) 2. This is probably not going to be too much trouble, but we will need to make a lot of access to the irradiance cache during rendering. I'm don't really have any idea how much impact this will have on performance but we should be aware of it. 
Depending on how we implement it, it will result in nondeterministic image output: eg. a simple race between two threads will cause a different image result when one gets the cached value and the other needs to create it (or vice-versa). 3. I mentioned so that we should avoid locking whenever a ray makes a contribution back into the buffer. If we are allocating blocks of pixels to individual threads, then queue up the pixels in local storage and then lock the frame buffer and write the block all at once. I'm guessing that other multithreading issues will arise in the future, these were just what I've thought of so far. ---------------------------------------- This mail sent through www.mywaterloo.ca |
From: Wolfgang W. <ww...@gm...> - 2005-03-14 23:00:58
|
> (If you don't have ACM access I can mail you the paper) > Well, it turned out that we do not have ACM in the list of subscribed journals at the LMU, so I walked over to the guys at the TUM to get an account there (something I should have done earlier anyway). The design seems to be a quite good one. Especially, I like the idea of using and passing rays and photons because the simplicity with which the massive parallelism is achieved is appealing. Furthermore, this design is highly parallel without using a "hard-core" stream design with small kernels for everything and is hence nearer to a traditional sequential renderer and also easier to implement. Actually, it is somehow like what I dreamt of being able to do some months ago: dividing up work at the ray level. However, as usual, people don't comment on the interesting details. Let's consider a scene which consists only of one very large CSG tree. Can it be split without having to copy some non-leaf nodes? If not, a uniform distribution would cause lots of overhead and consequently, subtrees would have to be distributed. This, in turn, may make some boxes (with important subtrees) much busier than others. Or let's say one box happens to get the isosurface while the others only have triangle meshes. Guess who will be waiting? Furthermore, where is the texture information stored? Where the interior information (like index of refraction, media)? On Monday 14 March 2005 00:26, Andrew Clinton wrote: > I agree; they needed to develop distributed algorithms for ray intersection > and photon gather, which require rays to be sent to multiple hosts and then > the results combined to get the result. It would be much more useful to > encourage users to develop scenes so that objects can be demand-loaded or > built on the fly, so that the system can manage memory effectively on a > single host. > Of course. However, being able to handle such large scenes would also be nice in any case... 
I'll try and see if I can come up with a rough design based on what I've read before going on "dreaming"... > SSE instructions fit 4 floats but only 2 doubles. > ...which unfortunately makes them pretty useless in a 3-dimensional world represented with double floats... I'll add something like a "main" floating point type which can be float or double, switchable at compile time. > This is usually done by making the accelerator have the same interface as > the primitives themselves. This is what I've done in my raytracer, and is > also what is done in pbrt. > ...and would you advise against doing so? Another advantage of this would be to allow procedural transformations (as defined previously) to have their own local accelerator. (Of course O(log(n)+log(m)) > O(log(n+m)), so one big accelerator is better than several nested ones.) > I would focus only on KD trees. You can use my patch to start, > I've had it on my HD for some time already :) > Additional optimizations in the book include the following: > Thanks for pointing these out. >> [tesselation] > finely tesselated nurb vs. an analytic method). I would make this the > primary method of handling these types of patch surfaces and ignore > analytic methods for anything but user-defined isosurfaces or infinite > objects. ...and box, sphere, disc, quadric, I guess. > Even for isosurfaces we might want to build in a conversion to be > able to get fast previews. > One of the main points of using isosurfaces (e.g. for landscapes) is that a high degree of complexity can be achieved with a low memory footprint. I'd rather focus on efficient (mesh-like) bounding schemes for isosurfaces to speed up root finding than on tesselating them. (This probably requires a rough inside and an outside mesh with roots in between. Not a simple thing if the surface has bubbles and holes.) > I think our system should make use of ray differentials to choose the level > of tesselation. 
Check online for the paper "Ray differentials for > multiresolution geometry caching" by Christensen. This way, highly > incoherent rays for GI can use a low-res tesselation, saving lots of > effort. > Something similar could be done for isosurfaces by lowering the solver accuracy in this case. I actually spent some thought on that already. (I'll check out the paper because estimating the ray differential was the point I got stuck at.) > One more thing (I don't think I was that clear) is that tesselated geometry > can be put directly into the accelerator. > Actually, you were clear enough, although I was not aware of how exactly to put the information into the leaf. And I do not yet fully understand it (triangle duplication problem). [previously] > If we take this approach to threading, we'll need to worry about the > following access control issues: > 1. Access to mailbox numbers on primitives in the accelerator > 2. Access to a read/write irradiance cache (if we use one) > 3. Writeback to the frame buffer (we can do this in blocks - this is one > reason we should probably use blocks of pixels rather than rays as the > basic unit of parallelism). > Maybe I am blind but I barely see problems. 1. is used read-only (isn't it?), 2. is a classical locking issue and probably a distribution problem (OK) and 3. is just a minor locking/queuing problem. (Using blocks of pixels for each node is interesting because of the resulting positive effect on caching due to image space coherence -- especially when demand-loading textures/geometry.) > Whatever advantage we gain from locality within a > coroutine might be offset by the cost of maintaining suspended states. > I completely agree. This is a very nice formulation of what I meant by "overhead" in previous postings. Regards, Wolfgang |
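The "main" floating point type switchable at compile time, mentioned above, is usually just a guarded typedef. A minimal sketch, with the macro name RAY_FLOAT_SINGLE and the name real_t invented here (the actual code in lib/numerics may use different names):

```cpp
#include <cassert>

// Hypothetical compile-time switch: build with -DRAY_FLOAT_SINGLE to use
// single precision everywhere that is written against real_t. Numerically
// sensitive code (e.g. the root solvers) can still name 'double' explicitly.
#ifdef RAY_FLOAT_SINGLE
typedef float real_t;
#else
typedef double real_t;   // default: double, as preferred in the discussion
#endif

// Example of geometry code written against real_t rather than a fixed type:
inline real_t lerp(real_t a, real_t b, real_t t)
{
    return a + (b - a) * t;
}
```

Writing the bulk of the code against such a typedef (or, as lib/numerics already does, against float/double templates) keeps the float-vs-double decision a one-line build option instead of a codebase-wide rewrite.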
From: <ajc...@en...> - 2005-03-13 23:26:05
|
> > One thing that Kilauea supports that we probably won't need is the ability to split a large scene database among a collection of machines, which adds a lot of complexity to their design.

> This is a nice feature but I think we should not implement that unless (for some reason) we could do it easily (which I actually doubt). Instead, I would like to have on-demand loading of large images, meshes and NURBS, which are the things that usually use up most of the storage.

I agree; they needed to develop distributed algorithms for ray intersection and photon gathering, which require rays to be sent to multiple hosts and the results combined. It would be much more useful to encourage users to develop scenes so that objects can be demand-loaded or built on the fly, so that the system can manage memory effectively on a single host.

> > - floats or doubles: I'm leaning towards using floats in the renderer, at least for storage and the majority of computation. Reasons: approx. 1/2 of memory use for meshes, better cache performance, and improved computation speed (but not twice as fast, float divide is 24 cycles on AMD64 while double is 28 cycles). We could change to doubles for some computation paths if it turns out there are precision issues.

> Well, I'd rather stick to double for most of the computation.
> Okay, meshes, NURBS parameters and all the color computation could be done using floats, but for the rest of the geometry and intersection computations (especially everything involving polynomial root solving), I would like to use double precision. At least that's my feeling after all the numerics I've been doing.
> We should, however, try to code in a way that allows changing that easily at compile time. Most of the code in lib/numerics is provided as float/double templates (notable exception: root solvers).
I would aim for using floats everywhere except where there is a known numerical problem. So we could use doubles for only the root solving components of the system. This makes sense because there will be few storage or caching issues for root finding, so doubles should perform only slightly worse than floats. For triangle meshes, patch surfaces, bounding boxes, and accelerator nodes, we could use floats to improve performance (probably approaching 2 times better performance). One more advantage of using floats is that if we ever perform a SIMD optimization of the code path, we will get a 2x improvement over what we would get if doubles were used. SSE instructions fit 4 floats but only 2 doubles.

> > - SIMD processing: After some more research, I'm less confident of using a SIMD approach to coherent rays, esp. for global illumination. I'm wondering if we could get a better speedup just by making specific optimizations for low-level vector operations.

> I've actually been a bit reluctant with SIMD right from the beginning and I am happy that what you propose here is pretty much what was planned right from the beginning:
> - Using a Vector2,3,4 C++ class which can implement vector operations using the processor's SIMD instructions (inline assembly or built-in gcc functions). (e.g. lib/numerics/3d/vector3.h, SIMD not yet added.)
> - Implementing Vector2,3,4 types as native types in the VM to allow the VM to internally do SIMD operations on them.
> Collecting values and evaluating several of them SIMD-like in the VM (e.g. for isosurfaces) would probably still increase performance, especially when not JIT-compiling the code. This could be done with a stream design using lots of kernels (e.g. one for isosurface evaluation). One then has to make sure that the kernel will process its input _if_needed_ (even if there is only _one_ value in the input stream) to prevent deadlock.
> This is similar to what you mentioned with ray priority.

One idea I've been considering is to allow the user to flag a computation (e.g. an isosurface kernel) as "parallel". In this case, the system will make the optimization you've described and collect samples to evaluate together. By allowing the user to specify which kernels should be optimized, we can avoid the large sorting problem of collecting samples for all kernels. This might also make sense given the conventional use of isosurfaces (for the most part in POV-Ray), where a single large isosurface is used as a terrain or a major feature of the scene.

> > - Triangles only: We could optimize our design by having most primitives tessellate to triangles. Then the accelerator could be constructed with geometry specified inline with the leaf nodes. We could do this by templating the accelerator to the type of data that will be processed, in our case either triangles or object pointers. Non-triangle primitives like isosurfaces would be placed into a separate accelerator.

> The accelerator data structure is one of the things about the RT core which I worry about a lot because of its impact on intersection finding speed. Actually, I would like to define a clear interface to the accelerator to make it changeable and be able to have BSP/kD/grid/octree structures.

This is usually done by making the accelerator have the same interface as the primitives themselves. This is what I've done in my raytracer, and is also what is done in pbrt. I would focus only on KD trees. You can use my patch to start; I'm not sure that I'll have time to make the necessary adaptations.
Additional optimizations in the book include the following:

- Represent all tree nodes with 8 bytes (64 bits):

    struct KDNode {
        union {
            unsigned int flags;        // Both
            float split;               // Interior
            unsigned int nprimitives;  // Leaf
        };
        union {
            unsigned int children;     // Interior
            T *primitive;              // Leaf
            T **primitives;            // Leaf
        };
    };

  The trick is to encode the flags in the two least significant bits of the floating point split value. This will have huge caching benefits when iterating through the accelerator.

- Use a cost model with different costs for iteration steps and object intersections. My accelerator (I think) assumes traversal steps cost nothing.

> (You probably know the paper of Havran et al., "Statistical Comparison of Ray-Shooting Efficiency Schemes"; kD seems to perform best in most situations.)

Yes; in case you also were not aware, the BSP patch I wrote for POV-Ray was in fact an implementation of a lot of the features in Havran's PhD thesis (Heuristic Ray Shooting Algorithms).

> If I understand you correctly, you propose that scene shapes can either be represented as meshes (via tessellation of the shape) or via a bounding box and their own intersection function (which computes exact intersections). All objects must be able to provide an intersection function; some may also provide a tessellation function.
> (We can do this, however I don't like the tessellation very much because it is one of the features of RT that it can easily be used on arbitrary geometry and not only on triangles and NURBS. And because a tessellated object takes much more space in memory. But OTOH, if it improves speed, why not allow that on a per-object basis?)

It actually improves speed immensely for most cases (I would not be surprised at 1 or 2 orders of magnitude difference between intersecting a finely tessellated NURBS vs. an analytic method).
I would make this the primary method of handling these types of patch surfaces and ignore analytic methods for anything but user-defined isosurfaces or infinite objects. Even for isosurfaces we might want to build in a conversion to be able to get fast previews.

I think our system should make use of ray differentials to choose the level of tessellation. Check online for the paper "Ray differentials for multiresolution geometry caching" by Christensen. This way, highly incoherent rays for GI can use a low-res tessellation, saving lots of effort.

One more thing (I don't think I was that clear) is that tessellated geometry can be put directly into the accelerator. So we could have the memory layout:

    ILGGGGLGGGILGGLLGGG

    I = interior KD tree node
    L = leaf node
    G = triangle

This would have really nice cache performance. My current implementation in POV-Ray stores the leaves, object lists, and geometry in entirely separate memory locations, so there are 2 layers of indirection just to intersect against the object list. This way almost all the processing for a leaf could be done without any indirections. The disadvantage is that for finely subdivided accelerators, we will need to duplicate a sizable number of triangles (but with good termination criteria for the construction, this can be reduced to a minimum).

> But there are some other problems:
> - Transferring the acceleration structure to the clients. This becomes "interesting" especially for on-demand downloading of the scene graph in case we want to do that (the new VM design will not do that for us due to the lack of an indirection layer).

I would recommend building accelerators locally and on-demand. This does not prohibit on-demand geometry transfers.

> - Objects may not be representable by just an object pointer.
> I like the idea of "procedural transformations" (if one can call it that) which means that an object transformation node in the CSG tree can store properties like "1000 objects in a line" without having to allocate 1000 object transformation matrices.
> This means that the object should be represented by the transformation node pointer and the transformation index (0..999 here).
> Otherwise (if we just have the node) we would have to test all 1000 contained "objects" for intersections.
> Obviously, one index is not enough as soon as these things get nested.

Yes, procedurals should be expanded and put into an accelerator locally. There is no need to transfer anything except for the initial program to the rendering servers.

Andrew

----------------------------------------
This mail sent through www.mywaterloo.ca |
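To make the 8-byte node trick from the book concrete, here is a small sketch (illustrative code, not the actual implementation) of packing the node-type flag into the two least significant bits of the float split value. Losing those 2 bits of split precision is harmless for plane placement.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch of the flag-in-the-split-float encoding described for the
// 8-byte KDNode above (hypothetical names).
struct PackedKDNode {
    uint32_t bits;           // split float with flags in the low 2 bits
    uint32_t child_or_prims; // child offset (interior) or primitive ref (leaf)

    void init_interior(float split, uint32_t axis) {  // axis: 0, 1 or 2
        std::memcpy(&bits, &split, sizeof split);     // type-pun safely
        bits = (bits & ~3u) | axis;                   // overwrite low 2 bits
    }
    uint32_t axis() const { return bits & 3u; }       // e.g. 3 marks a leaf
    float split() const {
        uint32_t b = bits & ~3u;                      // clear the flag bits
        float s;
        std::memcpy(&s, &b, sizeof s);
        return s;
    }
};
```

Since both unions fit into the first 4 bytes, a node stays at 8 bytes total, which is what gives the cache benefits mentioned above.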
From: Wolfgang W. <ww...@gm...> - 2005-03-13 22:13:24
|
On Sunday 13 March 2005 04:06, Andrew Clinton wrote:
> Quoting Wolfgang Wieser <ww...@gm...>:
>> [Coroutines]
> I think that this is actually a really good way to do it, and it is supported by some of the research that I've been doing recently. In particular, you should check out the Kilauea renderer:
> http://portal.acm.org/citation.cfm?id=569673.569675&dl=GUIDE&dl=ACM&type=series&idx=SERIES10714&part=Proceedings&WantType=Proceedings&title=ACM%20International%20Conference%20Proceeding%20Series
> (If you don't have ACM access I can mail you the paper)

Thanks; I'll check that tomorrow.

> The way they do it is to match the number of threads to the number of processors on the machine. Within each thread, they use task parallelism like you said, with application-controlled task switches.

This is also what I had in mind. (Maybe using somewhat more threads than CPUs to fill up disk latencies while reading data and the like.) [BTW, a coroutine switch is about the time of "for(int i=0; i<72; i++);" on my box.]

> One thing that Kilauea supports that we probably won't need is the ability to split a large scene database among a collection of machines, which adds a lot of complexity to their design.

This is a nice feature but I think we should not implement that unless (for some reason) we could do it easily (which I actually doubt). Instead, I would like to have on-demand loading of large images, meshes and NURBS, which are the things that usually use up most of the storage.

> I've been doing a lot of reading in my spare time, esp. on global illumination topics. I've been making my way through "Physically Based Rendering" by Pharr and Humphreys, which is turning out to be a really excellent book (so far).

I just had a look at the table of contents of that book (available at Amazon). Anyway, I need to get more of a clue about physically based rendering...
> I've been busy lately, unfortunately the next two weeks I need to catch up with a compiler project for school so I can't comment on everything you've mentioned.

Compiler project - that's nice :) You know that I'm just starting to write a compiler? (For SPL.)

> - floats or doubles: I'm leaning towards using floats in the renderer, at least for storage and the majority of computation. Reasons: approx. 1/2 of memory use for meshes, better cache performance, and improved computation speed (but not twice as fast, float divide is 24 cycles on AMD64 while double is 28 cycles). We could change to doubles for some computation paths if it turns out there are precision issues.

Well, I'd rather stick to double for most of the computation.

Okay, meshes, NURBS parameters and all the color computation could be done using floats, but for the rest of the geometry and intersection computations (especially everything involving polynomial root solving), I would like to use double precision. At least that's my feeling after all the numerics I've been doing.

We should, however, try to code in a way that allows changing that easily at compile time. Most of the code in lib/numerics is provided as float/double templates (notable exception: root solvers).

> - SIMD processing: After some more research, I'm less confident of using a SIMD approach to coherent rays, esp. for global illumination. I'm wondering if we could get a better speedup just by making specific optimizations for low-level vector operations.

I've actually been a bit reluctant with SIMD right from the beginning and I am happy that what you propose here is pretty much what was planned right from the beginning:
- Using a Vector2,3,4 C++ class which can implement vector operations using the processor's SIMD instructions (inline assembly or built-in gcc functions). (e.g. lib/numerics/3d/vector3.h, SIMD not yet added.)
- Implementing Vector2,3,4 types as native types in the VM to allow the VM to internally do SIMD operations on them.

Collecting values and evaluating several of them SIMD-like in the VM (e.g. for isosurfaces) would probably still increase performance, especially when not JIT-compiling the code. This could be done with a stream design using lots of kernels (e.g. one for isosurface evaluation). One then has to make sure that the kernel will process its input _if_needed_ (even if there is only _one_ value in the input stream) to prevent deadlock. This is similar to what you mentioned with ray priority. When not using a stream design, we can forget about higher-level SIMD anyway.

> - Triangles only: We could optimize our design by having most primitives tessellate to triangles. Then the accelerator could be constructed with geometry specified inline with the leaf nodes. We could do this by templating the accelerator to the type of data that will be processed, in our case either triangles or object pointers. Non-triangle primitives like isosurfaces would be placed into a separate accelerator.

The accelerator data structure is one of the things about the RT core which I worry about a lot because of its impact on intersection finding speed. Actually, I would like to define a clear interface to the accelerator to make it changeable and be able to have BSP/kD/grid/octree structures. (You probably know the paper of Havran et al., "Statistical Comparison of Ray-Shooting Efficiency Schemes"; kD seems to perform best in most situations.)

If I understand you correctly, you propose that scene shapes can either be represented as meshes (via tessellation of the shape) or via a bounding box and their own intersection function (which computes exact intersections). All objects must be able to provide an intersection function; some may also provide a tessellation function.
(We can do this, however I don't like the tessellation very much because it is one of the features of RT that it can easily be used on arbitrary geometry and not only on triangles and NURBS. And because a tessellated object takes much more space in memory. But OTOH, if it improves speed, why not allow that on a per-object basis?)

But there are some other problems:
- Transferring the acceleration structure to the clients. This becomes "interesting" especially for on-demand downloading of the scene graph in case we want to do that (the new VM design will not do that for us due to the lack of an indirection layer).
- Objects may not be representable by just an object pointer. I like the idea of "procedural transformations" (if one can call it that) which means that an object transformation node in the CSG tree can store properties like "1000 objects in a line" without having to allocate 1000 object transformation matrices. This means that the object should be represented by the transformation node pointer and the transformation index (0..999 here). Otherwise (if we just have the node) we would have to test all 1000 contained "objects" for intersections. Obviously, one index is not enough as soon as these things get nested.

I'll re-read the other parts of your email once I've had a look at the "Kilauea" paper. Thanks for the hint.

Cheers, Wolfgang |
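A minimal sketch of how such a procedural transformation could avoid the 1000 matrices (hypothetical names; a full version would generate complete 4x4 transforms and, for nesting, would need a path of indices rather than a single one):

```cpp
#include <cassert>

// Sketch of a "procedural transformation" node (illustrative only): the
// CSG node stores "count objects spaced step_x apart" and generates the
// transform for instance i on demand instead of storing count matrices.
struct LineArrayNode {
    double step_x;   // spacing between instances along x
    unsigned count;  // number of instances, indices 0..count-1

    // A full version would return a 4x4 matrix; here only the x
    // translation of instance i is generated.
    double translation_x(unsigned i) const { return step_x * i; }
};

// What an accelerator leaf would store to identify one instance:
// the transformation node pointer plus the transformation index.
struct InstanceRef {
    const LineArrayNode *node;
    unsigned index;
};
```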
From: <ajc...@en...> - 2005-03-13 03:06:16
|
Quoting Wolfgang Wieser <ww...@gm...>:
> Hello everybody!
>
> It's time to get some life onto the list again.
>
> 1. Coroutines
> -------------
>
> Using stream processing and kernels allows for support of the modern architectures coming up in the future (multi-core SMP-like systems which will probably be in consumer boxes in a few years already (see the recent "Spektrum der Wissenschaft") as well as more radical ones like the Cell architecture).
> However, one of the problems is how to adequately implement streams and kernels on a (currently) standard UP (uniprocessor) box.
>
> One solution for that could be the use of threads combined with the use of coroutines. For details, see Knuth, "The Art of Computer Programming", or the documentation for coroutine.{cc,h} in devel-ray/src/lib/threads/.
> The coroutine implementation in there is actually a thread-safe port of libPCL (portable coroutine library) by Davide Libenzi.
>
> Basically, coroutines allow for fast cooperative user-space switching of contexts without system interaction.
> (On my linux-2.6 UP system, a coroutine context switch is 7 times as fast as a thread context switch.)
>
> Coroutine switching is done cooperatively, which additionally eliminates the locking required for concurrent threads.

I think that this is actually a really good way to do it, and it is supported by some of the research that I've been doing recently. In particular, you should check out the Kilauea renderer:

http://portal.acm.org/citation.cfm?id=569673.569675&dl=GUIDE&dl=ACM&type=series&idx=SERIES10714&part=Proceedings&WantType=Proceedings&title=ACM%20International%20Conference%20Proceeding%20Series

(If you don't have ACM access I can mail you the paper)

The way they do it is to match the number of threads to the number of processors on the machine. Within each thread, they use task parallelism like you said, with application-controlled task switches.
One thing that Kilauea supports that we probably won't need is the ability to split a large scene database among a collection of machines, which adds a lot of complexity to their design.

If we take this approach to threading, we'll need to worry about the following access control issues:
- Access to mailbox numbers on primitives in the accelerator
- Access to a read/write irradiance cache (if we use one)
- Writeback to the frame buffer (we can do this in blocks - this is one reason we should probably use blocks of pixels rather than rays as the basic unit of parallelism).

There are also issues that need to be resolved to make the stream model work:
- How to handle dependent rays: will we allow shaders to trace rays that depend on the results of previous ray queries? A simple example of this is an adaptive sampler. Kilauea handles this by suspending shader execution (saving state) and then waiting until the query rays return. Deeper rays are given priority.
- How to store the state of tasks: For example, if we have separate coroutines for running the accelerator traversal and triangle intersection, then it is necessary to cache the state of the accelerator while batches of rays are tested against primitives. If we use a KD-tree, then we'll need to cache an entire stack of untraversed nodes so that we can continue where we left off. On the other hand, if we use grids, the amount of data in suspended state would be much less.

After a more careful look at Purcell's thesis (Ray Tracing on a Stream Processor), I've noticed that these issues are avoided by ignoring them (i.e. dependent ray spawns are disallowed, and each ray carries its weight to the final image throughout the computation). They also use a basic grid accelerator that will be insufficient for use in a complex scene. After doing more research, the benefit of the stream model seems dubious on current architectures.
Whatever advantage we gain from locality within a coroutine might be offset by the cost of maintaining suspended states. We'll need to weigh the potential benefits of the architecture against the complexity of implementation.

> 4. RT core
> ----------
>
> Any more ideas/suggestions on core design / stream processing / SIMD?
> (Andrew: I would be very happy to see your ideas before you have to leave us.)

I've been doing a lot of reading in my spare time, esp. on global illumination topics. I've been making my way through "Physically Based Rendering" by Pharr and Humphreys, which is turning out to be a really excellent book (so far). The book comes with a raytracer that they've used for teaching, but it is also fairly optimized in some areas (e.g. they have implemented a highly efficient KD tree, with some optimizations that I did not know about before).

I've been busy lately; unfortunately, the next two weeks I need to catch up with a compiler project for school, so I can't comment on everything you've mentioned. Here are some other design issues I've been thinking about:

- floats or doubles: I'm leaning towards using floats in the renderer, at least for storage and the majority of computation. Reasons: approx. 1/2 of memory use for meshes, better cache performance, and improved computation speed (but not twice as fast, float divide is 24 cycles on AMD64 while double is 28 cycles). We could change to doubles for some computation paths if it turns out there are precision issues.

- Triangles only: We could optimize our design by having most primitives tessellate to triangles. Then the accelerator could be constructed with geometry specified inline with the leaf nodes. We could do this by templating the accelerator to the type of data that will be processed, in our case either triangles or object pointers. Non-triangle primitives like isosurfaces would be placed into a separate accelerator.
- SIMD processing: After some more research, I'm less confident of using a SIMD approach to coherent rays, esp. for global illumination. I'm wondering if we could get a better speedup just by making specific optimizations for low-level vector operations.

I'll get back with some more comments as soon as I find some time...

Andrew

----------------------------------------
This mail sent through www.mywaterloo.ca |
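The float/double question and the low-level vector optimizations meet naturally in a templated vector class. A sketch (illustrative only; the real lib/numerics/3d/vector3.h interface may differ): storage-heavy code instantiates float while root solving stays double, and a float specialization of the operators is the natural hook for SSE intrinsics later, without touching callers.

```cpp
#include <cassert>

// Illustrative Vector3 template (not the actual lib/numerics interface).
// Instantiate with float for meshes/accelerator data, with double for
// numerically sensitive paths such as root solving.
template <typename T>
struct Vec3 {
    T x, y, z;

    Vec3 operator+(const Vec3 &b) const { return {x + b.x, y + b.y, z + b.z}; }
    Vec3 operator*(T s) const { return {x * s, y * s, z * s}; }
    T dot(const Vec3 &b) const { return x * b.x + y * b.y + z * b.z; }
};

typedef Vec3<float>  Vec3f;  // storage / accelerator type
typedef Vec3<double> Vec3d;  // root solving and other sensitive paths
```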
From: Wolfgang W. <ww...@gm...> - 2005-03-12 19:59:43
|
Hello everybody!

It's time to get some life onto the list again.

1. Coroutines
-------------

Using stream processing and kernels allows for support of the modern architectures coming up in the future (multi-core SMP-like systems which will probably be in consumer boxes in a few years already (see the recent "Spektrum der Wissenschaft") as well as more radical ones like the Cell architecture). However, one of the problems is how to adequately implement streams and kernels on a (currently) standard UP (uniprocessor) box.

One solution for that could be the use of threads combined with the use of coroutines. For details, see Knuth, "The Art of Computer Programming", or the documentation for coroutine.{cc,h} in devel-ray/src/lib/threads/. The coroutine implementation in there is actually a thread-safe port of libPCL (portable coroutine library) by Davide Libenzi.

Basically, coroutines allow for fast cooperative user-space switching of contexts without system interaction. (On my linux-2.6 UP system, a coroutine context switch is 7 times as fast as a thread context switch.)

Coroutine switching is done cooperatively, which additionally eliminates the locking required for concurrent threads.

2. VM
-----

After Andrew's remarks concerning the VM and a recent feeling on my side about the need for more performance, I thought about changing the (planned) internal VM layout somewhat. For some background, we were having the following problems:

- We need a garbage collector because the user of the VM cannot be relied on to properly deallocate objects. However, during the GC run, other threads need to be stopped, which is not easily possible using pthreads and similar APIs.
- The VM has a two-stage lookup/dereferencing for pointers. Advantage: this easily allows the use of safe pointers (the user cannot crash the VM) and allows implementing on-demand requests of objects over the network on the VM level.
However, using real pointers would of course be faster, and the impact is not limited to the VM because all exported objects need to use the same notion of pointers as the VM itself. So, a possible way to go seems to be the following:

- Use real pointers in the VM. The VM can use a checking mode where every pointer value is first looked up in a hash table to verify its validity. This is much slower than the previous two-stage lookup (which is O(1)), but the non-checking version of the VM (for real renderings) is faster.
- Network distribution or on-demand requests of objects then need to be handled at a higher level. (This will probably demand an indirection layer for objects capable of such on-demand loading.)
- The garbage collection is performed by the conservative GC (http://www.hpl.hp.com/personal/Hans_Boehm/gc/) which is also used in gcc and other projects. Hans Boehm et al. spent a lot of time writing a state-of-the-art GC which nowadays even supports multi-threading (by stopping other threads; they have platform-dependent code for that part). This also eliminates some development work for us.
- SPL (as of my current design) allows for explicit deletion of objects (despite the GC). If the user explicitly deletes all his allocated objects, the application will run faster because the GC has less work (I verified that with boehm_gc). We can allow the user to not use garbage collection (and do his own memory management) if he wishes to do so (at his own risk, as always).

The main problems with this approach are:

- The VM needs to know the object base location to do dynamic casts, which are also needed for virtual function calls. This can be solved by attaching an offset value to all base classes in all instances. Actually, no additional memory is required unless the object is a (base) class of non-zero size but has _no_ virtual functions. These cases are probably rare.
- Pointer size is no longer constant for the VM since we support 32bit and 64bit systems.
This means that the compiler cannot easily calculate offsets for the target machine. The easiest solution that comes to my mind is that the VM assembly uses 2 values for each address/offset specification, one for 32bit and one for 64bit systems. The VM can then select the correct version when loading the assembly file.

- The behaviour of explicit deletion changes. Previously, a pointer to an explicitly deleted object would be NULL after deleting the object because the shared indirection layer index contains NULL. Now, a pointer is not automagically NULL, and dereferencing a deleted object may crash the VM (or trigger an assertion in the checking VM).

3. SPL
------

Just by chance, I stumbled across a tool called "treecc" (www.southern-storm.com.au/treecc.html). It helps in compiler development mainly by protecting the programmer from forgetting cases. I'm currently seeing if I will use it for the SPL compiler implementation. This is the reason why compilation of the code currently fails just before the end in the spl/ directory.

Since treecc is quite small and not very common, I will put it into the 3rdparty/ directory once it is clear that it will be used. The same holds for the garbage collector. It is correct that there exist binary distributions of it, but we must make sure that we have the thread-safe version and may also want to play with parallel marking and other compile-time tuning. (Things already work in my development version but I would like to avoid adding lots of 3rdparty code to the CVS which will be removed again later.)

4. RT core
----------

Any more ideas/suggestions on core design / stream processing / SIMD? (Andrew: I would be very happy to see your ideas before you have to leave us.)

Regards, Wolfgang |
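The proposed dual offset encoding is simple to sketch (names invented for illustration): the assembly carries a 32-bit and a 64-bit variant of every layout offset, and the VM picks one at load time based on the host pointer size.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the "2 values per address/offset specification" idea above
// (illustrative names): every offset in the VM assembly is stored for
// both pointer widths; the loader selects the one matching the host.
struct DualOffset {
    uint32_t off32;  // field offset assuming 4-byte pointers
    uint32_t off64;  // field offset assuming 8-byte pointers
};

inline uint32_t host_offset(const DualOffset &o) {
    return sizeof(void *) == 8 ? o.off64 : o.off32;
}
```

In a real loader this selection would happen once per assembly file, after which only host-native offsets remain in memory.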
From: Wolfgang W. <ww...@gm...> - 2005-02-26 10:42:26
|
Hello everybody...

GLib has now been removed from the 3rdparty directory. Developers need to have glib installed (at least version 2.4.6).

Furthermore, there is an Athlon64 box running Linux at university and I successfully compiled the code on it. "Porting" was easy; the only required change was fixing an incorrect use of va_list.

Wolfgang |
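The va_list issue mentioned is a classic 32/64-bit portability trap: on AMD64, va_list is an array type, so code that assigns one va_list to another (which happens to work on 32-bit x86) breaks, and the portable fix is va_copy. A sketch of the pattern with a hypothetical function (not the actual code that was fixed):

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical example of the portable va_list pattern: a function that
// walks its variadic arguments twice must duplicate the list with
// va_copy instead of assigning it ('ap2 = ap' does not compile on AMD64,
// where va_list is an array type).
int sum_twice(int count, ...) {
    va_list ap, ap2;
    va_start(ap, count);
    va_copy(ap2, ap);            // portable duplicate of the list

    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += va_arg(ap, int);  // first pass
    for (int i = 0; i < count; ++i)
        sum += va_arg(ap2, int); // second pass over the duplicate

    va_end(ap2);
    va_end(ap);
    return sum;
}
```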