From: Wolfgang W. <ww...@gm...> - 2005-06-25 21:23:04
It's so quiet here... If anybody on this list is still interested in this project in any way, then PLEASE speak up NOW.

Regards, Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-06-22 20:48:46
Hello everybody!

As we all know, RAY aims to be a full-featured extensible raytracer, portable between POSIX and Windows, with native support for multi-threading and network distribution. Up to now, quite a lot has been done (mainly by implementing libraries which are largely core-independent) but we need to go for the main parts now, and hence it is probably a good idea to decide roughly on how to go on. This is my suggestion concerning the next steps in the development roadmap. I would like to encourage everybody to comment on it.

* RT core design. Before we implement it, we should get a good idea of the flow of information, the flow of execution, and the required classes and interfaces. Andrew offered to do some more overall RT core design / planning some time back and I'd like to kindly invite him to do so, especially because he has much more experience with raytracers than me. Apart from the "RT core event model", I did not do much more planning; however, I am very confident I can nicely implement a slightly modified event model fixing the portability problems. More on that in a separate email.

* General discussion about rendering on lots of computers: how things should operate, how to distribute information (scene graph, ...).

* VM and compiler implementation. This is what I'll do in my free time now. After more reasoning, I decided to implement the "real pointer VM" (I probably said that already), which requires a change in the assembly format in order to be portable across 32 and 64 bit platforms. Hence, I'll re-implement the linker as well (thereby tidying up the code, too).

I'm still deeply convinced that the project will eventually be a great raytracer and that more supporters can be found once it is operating. However, as usual, getting it started is hard and that is what we're currently going through. Yet another important point is the amount of time everyone is able or willing to invest. I'll have two important exams after this semester and therefore will be much more limited in the amount of free time compared to the months before. However, as always, I'll do the best I can.

Regards, Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-06-22 20:01:49
GNet has now been removed from the CVS. There is no good reason to have GNet in the source tree any more:

- GNet will not, or only very slowly, be developed further.
- I implemented / ported a lot of the core functionality of GNet (name resolution, TCP socket, UDP socket, address abstraction) in src/lib/net.
- It is much easier to compile everything without it.
- We do MUCH better error reporting than GNet; and the code should generally be more efficient now, too.

Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-05-28 20:05:33
> I will probably have a more in-depth view of the problem but let me first
> compile gcc-4.1.x on the P4 box...

Well, I cannot find the problem now and don't have enough time to dig further into the issue. Since there's IMO nothing wrong with the C++ code, there is little we can do anyway. So, in the end, the same code will produce ridiculous timings on the P4 while reporting fairly quick timings on Athlons. Either the program has difficulty measuring the timings, or the Intel chips just lose badly on these constructs -- or something else I did not think about.

Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-05-26 15:11:23
On Thursday 26 May 2005 06:19, Andrew Clinton wrote:
> I've updated the tests
> - Changed constants to single precision (neither float nor double seems to
> introduce any casts in the assembly, now)

I made some tests and it seems that gcc-4.1.0 20050526 ("today's version") does not make any difference between 0.5 and 0.5f as FP constants. While earlier versions of gcc apparently automatically widened float -> double at compile time while doing the down-conversion double -> float at run time, current (development) versions make no difference. In particular, I verified that the following snippets of test-speed.cc compile into binary-equivalent code:

v[0]=res; v[1]=res-0.1f;   v[2]=res-0.2f;
v[0]=res; v[1]=res-0.1;    v[2]=res-0.2;
v[0]=res; v[1]=res-(T)0.1; v[2]=res-(T)0.2;

Wolfgang
From: Wolfgang W. <ww...@gm...> - 2005-05-26 12:40:02
On Thursday 26 May 2005 06:19, ajc...@en... wrote:
> I've updated the tests
> - Changed constants to single precision (neither float nor double seems to
> introduce any casts in the assembly, now)

Yes, as I verified, the compiler will automatically do the float -> double widening at compile time while the double -> float conversion is done at run time.

> - Added dependency between successive iterations for each test

I slightly changed that again: instead of having the same variable on both sides (as in "a=a*b"), I introduced 2 tests: one for "a*=b" and one for (a=b*c, b=c*a, c=a*b). This is because commonly the result of an operation will not overwrite one input variable but be stored somewhere else. This has an impact on compiler-generated temporaries. The overall picture of the benchmarks is not much affected by this.

> These changes make a substantial difference on my system:
>
> --AMD 64 3500+ 2.2Ghz

Ah, great - you have a fairly fast AMD64! Could you please run the tests in src/lib/threads and send me the output? (Also compiler & kernel version.) I'd be very much interested in the switch and lock times on Linux/AMD64. (The timings on FreeBSD/AMD64 are not so good and the question is whether it's the architecture or the operating system. I don't have access to an idle Linux/AMD64 system.)

> So, I suppose we can conclude that doubles really are preferred to floats
> for straight arithmetic, at least on AMD64, for algorithms with only
> additions and multiplications.

It seems logical that the 64 bit doubles are the natural data type for 64 bit platforms and preferable over floats as long as storage size does not matter. However, for the 32 bit Athlon, float is considerably faster than double.
Here are updated timings:

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>----------------------------
(nothing)                : 0.358208 nsec/cyc
vector3::length          : flt: 37.1814 dbl: 37.179  nsec/cyc (1.00)
vector3*vector3          : flt: 17.4478 dbl: 25.3944 nsec/cyc (1.46)
vector3 x vector3        : flt: 10.6335 dbl: 15.1611 nsec/cyc (1.43)
matrix3*vector3          : flt: 20.4215 dbl: 32.536  nsec/cyc (1.59)
trafomatrix*vector3      : flt: 20.3866 dbl: 28.9683 nsec/cyc (1.42)
trafomatrix::inverse     : flt: 120.664 dbl: 112.873 nsec/cyc (0.94)
trafomatrix*trafomatrix  : flt: 88.0527 dbl: 97.4989 nsec/cyc (1.11)
trafomatrix*=trafomatrix : flt: 78.0263 dbl: 75.8576 nsec/cyc (0.97)
matrix3*matrix3          : flt: 65.0964 dbl: 109.145 nsec/cyc (1.68)
matrix3*=matrix3         : flt: 68.4487 dbl: 67.6107 nsec/cyc (0.99)

--<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>---------------------------------------
(nothing)                : 0.28741 nsec/cyc
vector3::length          : flt: 21.1455 dbl: 25.4859 nsec/cyc (1.21)
vector3*vector3          : flt: 3.86673 dbl: 3.79015 nsec/cyc (0.98)
vector3 x vector3        : flt: 8.95599 dbl: 8.94856 nsec/cyc (1.00)
matrix3*vector3          : flt: 17.1492 dbl: 14.7704 nsec/cyc (0.86)
trafomatrix*vector3      : flt: 16.4591 dbl: 13.9329 nsec/cyc (0.85)
trafomatrix::inverse     : flt: 58.0311 dbl: 61.0099 nsec/cyc (1.05)
trafomatrix*trafomatrix  : flt: 62.8014 dbl: 57.8739 nsec/cyc (0.92)
trafomatrix*=trafomatrix : flt: 65.3974 dbl: 55.8263 nsec/cyc (0.85)
matrix3*matrix3          : flt: 51.4784 dbl: 50.3268 nsec/cyc (0.98)
matrix3*=matrix3         : flt: 54.4237 dbl: 49.8042 nsec/cyc (0.92)

But now the bad news: your changes somehow heavily screwed up the gcc-3.4/Pentium4 pair. Look at that: (r1.6, r1.7 are the CVS revisions)

r1.6 (AJC)
(nothing)                : 0.116446 nsec/cyc
vector3::length          : flt: 2519.8  dbl: 2517.85 nsec/cyc (1.00)
vector3*vector3          : flt: 628.565 dbl: 628.868 nsec/cyc (1.00)
vector3 x vector3        : flt: 3032.07 dbl: 3033.61 nsec/cyc (1.00)
matrix3*vector3          : flt: 4726.15 dbl: 4727.36 nsec/cyc (1.00)
trafomatrix*vector3      : flt: 4718    dbl: 4732.51 nsec/cyc (1.00)
trafomatrix::inverse     : flt: 115.973 dbl: 73.5815 nsec/cyc (0.63)
trafomatrix*trafomatrix  : flt: 20309.1 dbl: 20257.4 nsec/cyc (1.00)
matrix3*matrix3          : flt: 14475.6 dbl: 14460.1 nsec/cyc (1.00)

My changes based on Andrew's don't make things better either...

r1.7 (WW)
(nothing)                : 0.11606 nsec/cyc
vector3::length          : flt: 2523.47 dbl: 2527.31 nsec/cyc (1.00)
vector3*vector3          : flt: 630.039 dbl: 628.347 nsec/cyc (1.00)
vector3 x vector3        : flt: 3055.63 dbl: 3057.65 nsec/cyc (1.00)
matrix3*vector3          : flt: 4725.59 dbl: 4722.21 nsec/cyc (1.00)
trafomatrix*vector3      : flt: 4722.81 dbl: 4727.08 nsec/cyc (1.00)
trafomatrix::inverse     : flt: 115.964 dbl: 74.8951 nsec/cyc (0.65)
trafomatrix*trafomatrix  : flt: 20279.7 dbl: 20276.5 nsec/cyc (1.00)
trafomatrix*=trafomatrix : flt: 20135.6 dbl: 20150.9 nsec/cyc (1.00)
matrix3*matrix3          : flt: 14451.3 dbl: 14454.8 nsec/cyc (1.00)
matrix3*=matrix3         : flt: 14400.8 dbl: 14450.7 nsec/cyc (1.00)

The "(nothing)" and inversion tests are okay but the rest is just horrible. I will probably have a more in-depth view of the problem but let me first compile gcc-4.1.x on the P4 box...

Wolfgang
From: <ajc...@en...> - 2005-05-26 04:19:46
I've updated the tests
- Changed constants to single precision (neither float nor double seems to introduce any casts in the assembly, now)
- Added dependency between successive iterations for each test

These changes make a substantial difference on my system:

--AMD 64 3500+ 2.2Ghz
(nothing)               : 0.250217 nsec/cyc
vector3::length         : flt: 18.2921 dbl: 22.4885 nsec/cyc (1.23)
vector3*vector3         : flt: 3.37153 dbl: 3.33858 nsec/cyc (0.99)
vector3 x vector3       : flt: 6.43606 dbl: 6.51059 nsec/cyc (1.01)
matrix3*vector3         : flt: 14.3812 dbl: 13.4929 nsec/cyc (0.94)
trafomatrix*vector3     : flt: 13.8013 dbl: 12.6405 nsec/cyc (0.92)
trafomatrix::inverse    : flt: 51.8531 dbl: 53.4506 nsec/cyc (1.03)
trafomatrix*trafomatrix : flt: 60.8745 dbl: 55.2666 nsec/cyc (0.91)
matrix3*matrix3         : flt: 46.5577 dbl: 51.3582 nsec/cyc (1.10)

It turns out I was somewhat wrong in stating that single precision operations have a longer latency than double. At least on AMD64, additions and multiplications of floats and doubles have the same latency (4 cycles). Only division and square root have a longer latency for double precision operations. I spent a while trying to figure out why there are still cases where double outperforms float, without coming to any conclusion. I'm guessing that the remaining difference is due to unaligned accesses to some of the floats. The AMD64 docs state that some of the move operations are slower when a floating point value is not 8-byte aligned. So, I suppose we can conclude that doubles really are preferred to floats for straight arithmetic, at least on AMD64, for algorithms with only additions and multiplications.

Quoting Wolfgang Wieser <ww...@gm...>:

> Hello Andrew!
>
> On Wednesday 25 May 2005 02:00, Andrew Clinton wrote:
> > The only way I think you'll be able to make sure this isn't
> > happening is to check the assembly code manually.
>
> The easiest thing is simply to USE the output (e.g. by summing it all up)
> and slightly vary the input (e.g. by adding 0.1 each iteration).
> This is what is actually done (right?)
>
> Another possibility is to look at execution time scaling (which I did as
> well):
> - The basic operations are inline functions which are called 4 times in
>   each loop. If I comment out 2 of them, the time halves.
> - There is one no-op measurement, i.e. a loop with 4 inlines which do
>   nothing. This is the "(nothing)" line. Its execution time indicates the
>   required time if the stmts are optimized away and does not change when
>   2 of the calls are commented out.
>
> While only looking at and understanding the assembly gives a definite
> answer, the scaling already gives confidence, and using/varying the vars
> prevents the compiler from removing operations.
>
> I tried the explicit float casting (by adding an "f" to all FP constants)
> and it did not change the timings.
>
> > It looks like your code is subject to some of these problems. I've made
> > some attempts at fixing it but it is difficult to verify (looking through
> > lots of assembly).
>
> Did you see any change in the timings?
> Feel free to commit a "fixed" version; I'd like to have a look at it.
>
> > Especially problematic might be the production of a
> > result value that is never used again,
>
> Well, the results in the loop statements are usually summed up and hence
> "used". The calculated sum is in the end not used but that should not make
> any difference.
>
> The fact that produced results are not directly needed for the next
> calculation may enable the compiler/processor to make better use of
> pipelining. I expect this to be beneficial to both types (float, double).
>
> > Maybe a better test would be to do
> > something useful with each type of arithmetic (intersect a ray with a
> > sphere?).
>
> Sure.
>
> Let me tell you something about the history of why I did this: I just
> wanted to get a feeling for the time needed for typical floating point
> operations as compared to the time required by thread-safe locking and
> context switching.
>
> So, as I wrote in the email the day before, I'd be very happy if you
> could post some real ray-shape intersection timings for your raytracer and
> a "typical" scene (i.e. lots of basic CSG / a big triangle mesh / ...).
>
> Because the point is basically that I would like to compare the cost of a
> ray-shape intersection to the cost of a coroutine switch.
> This is required to decide if it makes sense to use ray-shape intersections
> as request type or if the introduced overhead is too high.
>
> Wolfgang

----------------------------------------
This mail sent through www.mywaterloo.ca
From: Wolfgang W. <ww...@gm...> - 2005-05-25 16:41:40
Hello Andrew!

On Wednesday 25 May 2005 02:00, Andrew Clinton wrote:
> The only way I think you'll be able to make sure this isn't
> happening is to check the assembly code manually.

The easiest thing is simply to USE the output (e.g. by summing it all up) and slightly vary the input (e.g. by adding 0.1 each iteration). This is what is actually done (right?)

Another possibility is to look at execution time scaling (which I did as well):
- The basic operations are inline functions which are called 4 times in each loop. If I comment out 2 of them, the time halves.
- There is one no-op measurement, i.e. a loop with 4 inlines which do nothing. This is the "(nothing)" line. Its execution time indicates the required time if the stmts are optimized away and does not change when 2 of the calls are commented out.

While only looking at and understanding the assembly gives a definite answer, the scaling already gives confidence, and using/varying the vars prevents the compiler from removing operations.

I tried the explicit float casting (by adding an "f" to all FP constants) and it did not change the timings.

> It looks like your code is subject to some of these problems. I've made
> some attempts at fixing it but it is difficult to verify (looking through
> lots of assembly).

Did you see any change in the timings? Feel free to commit a "fixed" version; I'd like to have a look at it.

> Especially problematic might be the production of a
> result value that is never used again,

Well, the results in the loop statements are usually summed up and hence "used". The calculated sum is in the end not used but that should not make any difference.

The fact that produced results are not directly needed for the next calculation may enable the compiler/processor to make better use of pipelining. I expect this to be beneficial to both types (float, double).

> Maybe a better test would be to do
> something useful with each type of arithmetic (intersect a ray with a
> sphere?).

Sure.

Let me tell you something about the history of why I did this: I just wanted to get a feeling for the time needed for typical floating point operations as compared to the time required by thread-safe locking and context switching.

So, as I wrote in the email the day before, I'd be very happy if you could post some real ray-shape intersection timings for your raytracer and a "typical" scene (i.e. lots of basic CSG / a big triangle mesh / ...).

Because the point is basically that I would like to compare the cost of a ray-shape intersection to the cost of a coroutine switch. This is required to decide if it makes sense to use ray-shape intersections as request type or if the introduced overhead is too high.

Wolfgang
From: <ajc...@en...> - 2005-05-25 00:07:52
I'm almost certain that there is some problem in your testing procedure. I've run a similar test (vector normalization) for floats and doubles, and will always get about a 20% improvement with floats. If you look at the manuals for the processors, you'll see longer latency for the double operations. I'd wager one of the following things might be messing up the tests...

- unnecessary conversions: when you specify a float literal, always add 'f' (e.g. 1.0f instead of 1.0). Otherwise the compiler will synthesize conversions. Conversions can consume a lot of clock cycles.
- missing dependencies: the compiler might be optimizing some code or taking it out of the loop. If you compute a result and then never use it, the code might disappear. If you compute the same result many times, it might only execute once.

The only way I think you'll be able to make sure this isn't happening is to check the assembly code manually. It looks like your code is subject to some of these problems. I've made some attempts at fixing it but it is difficult to verify (looking through lots of assembly). Especially problematic might be the production of a result value that is never used again, so you aren't really timing the latency of the operation at all. Maybe a better test would be to do something useful with each type of arithmetic (intersect a ray with a sphere?).

Andrew

Quoting Wolfgang Wieser <ww...@gm...>:

> Hello everybody,
>
> In order to get an estimate of how long basic numerical calculations take
> on current hardware, I ran the speed test
> (src/lib/numerics/3d/test-speed.cc) on different platforms.
>
> The results are below. This is just plain C++ code compiled; there are
> no hand-coded SSE vectorisations or similar.
>
> The most remarkable fact is that using floats instead of doubles may have
> the opposite effect than expected... Especially note the value in brackets
> at the end of the lines: it is the ratio of double and float calc times,
> i.e. values larger than 1 mean "double takes longer" while values smaller
> than 1 mean "double is faster than float".
>
> Wolfgang
>
> BTW, in the last email I forgot to mention that longjmp'ing between thread
> contexts is undefined in POSIX. So, we need to be prepared for the case
> that it cannot be done.
>
> --<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>---------------------------
> (nothing)               : 0.533687 nsec/cyc
> vector3::length         : flt: 28.0004 dbl: 26.718  nsec/cyc (0.95)
> vector3*vector3         : flt: 10.2434 dbl: 22.804  nsec/cyc (2.23)
> vector3 x vector3       : flt: 15.1952 dbl: 34.8585 nsec/cyc (2.29)
> matrix3*vector3         : flt: 26.2452 dbl: 40.8993 nsec/cyc (1.56)
> trafomatrix*vector3     : flt: 25.6166 dbl: 25.7406 nsec/cyc (1.00)
> trafomatrix::inverse    : flt: 126.307 dbl: 96.1046 nsec/cyc (0.76)
> trafomatrix*trafomatrix : flt: 99.8809 dbl: 113.816 nsec/cyc (1.14)
> matrix3*matrix3         : flt: 75.9704 dbl: 131.277 nsec/cyc (1.73)
>
> --<P4 2.80 GHz Linux-2.6.11 gcc-3.4.4, LinuxThreads>------------------------
> (nothing)               : 0.118761 nsec/cyc
> vector3::length         : flt: 15.69   dbl: 15.5786 nsec/cyc (0.99)
> vector3*vector3         : flt: 17.0779 dbl: 5.04801 nsec/cyc (0.30)
> vector3 x vector3       : flt: 16.3483 dbl: 11.8949 nsec/cyc (0.73)
> matrix3*vector3         : flt: 18.4656 dbl: 17.3939 nsec/cyc (0.94)
> trafomatrix*vector3     : flt: 18.4069 dbl: 17.374  nsec/cyc (0.94)
> trafomatrix::inverse    : flt: 120.072 dbl: 84.1365 nsec/cyc (0.70)
> trafomatrix*trafomatrix : flt: 79.1698 dbl: 86.2536 nsec/cyc (1.09)
> matrix3*matrix3         : flt: 61.501  dbl: 68.3568 nsec/cyc (1.11)
>
> --<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>--------------------------------------
> (nothing)               : 0.287412 nsec/cyc
> vector3::length         : flt: 16.2991 dbl: 16.4841 nsec/cyc (1.01)
> vector3*vector3         : flt: 10.1551 dbl: 5.49865 nsec/cyc (0.54)
> vector3 x vector3       : flt: 16.7179 dbl: 9.67714 nsec/cyc (0.58)
> matrix3*vector3         : flt: 27.3062 dbl: 19.6488 nsec/cyc (0.72)
> trafomatrix*vector3     : flt: 27.463  dbl: 19.1422 nsec/cyc (0.70)
> trafomatrix::inverse    : flt: 70.7865 dbl: 64.1694 nsec/cyc (0.91)
> trafomatrix*trafomatrix : flt: 83.5633 dbl: 68.334  nsec/cyc (0.82)
> matrix3*matrix3         : flt: 67.8196 dbl: 59.2467 nsec/cyc (0.87)
>
> --<AthlonXP 1.47GHz Linux-2.6.11 gcc-3.4.2 for Win32, NPTL, WINE EMULATION!>--
> (nothing)               : 0.356384 nsec/cyc
> vector3::length         : flt: 29.7528 dbl: 27.0064 nsec/cyc (0.91)
> vector3*vector3         : flt: 15.3446 dbl: 7.66602 nsec/cyc (0.50)
> vector3 x vector3       : flt: 21.3162 dbl: 13.4919 nsec/cyc (0.63)
> matrix3*vector3         : flt: 26.0698 dbl: 26.6165 nsec/cyc (1.02)
> trafomatrix*vector3     : flt: 26.3388 dbl: 25.8365 nsec/cyc (0.98)
> trafomatrix::inverse    : flt: 126.99  dbl: 114.622 nsec/cyc (0.90)
> trafomatrix*trafomatrix : flt: 128.008 dbl: 134.121 nsec/cyc (1.05)
> matrix3*matrix3         : flt: 102.492 dbl: 105.253 nsec/cyc (1.03)
From: Wolfgang W. <ww...@gm...> - 2005-05-22 20:54:11
Hello everybody,

In order to get an estimate of how long basic numerical calculations take on current hardware, I ran the speed test (src/lib/numerics/3d/test-speed.cc) on different platforms.

The results are below. This is just plain C++ code compiled; there are no hand-coded SSE vectorisations or similar.

The most remarkable fact is that using floats instead of doubles may have the opposite effect than expected... Especially note the value in brackets at the end of the lines: it is the ratio of double and float calc times, i.e. values larger than 1 mean "double takes longer" while values smaller than 1 mean "double is faster than float".

Wolfgang

BTW, in the last email I forgot to mention that longjmp'ing between thread contexts is undefined in POSIX. So, we need to be prepared for the case that it cannot be done.

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>----------------------------
(nothing)               : 0.533687 nsec/cyc
vector3::length         : flt: 28.0004 dbl: 26.718  nsec/cyc (0.95)
vector3*vector3         : flt: 10.2434 dbl: 22.804  nsec/cyc (2.23)
vector3 x vector3       : flt: 15.1952 dbl: 34.8585 nsec/cyc (2.29)
matrix3*vector3         : flt: 26.2452 dbl: 40.8993 nsec/cyc (1.56)
trafomatrix*vector3     : flt: 25.6166 dbl: 25.7406 nsec/cyc (1.00)
trafomatrix::inverse    : flt: 126.307 dbl: 96.1046 nsec/cyc (0.76)
trafomatrix*trafomatrix : flt: 99.8809 dbl: 113.816 nsec/cyc (1.14)
matrix3*matrix3         : flt: 75.9704 dbl: 131.277 nsec/cyc (1.73)

--<P4 2.80 GHz Linux-2.6.11 gcc-3.4.4, LinuxThreads>-------------------------
(nothing)               : 0.118761 nsec/cyc
vector3::length         : flt: 15.69   dbl: 15.5786 nsec/cyc (0.99)
vector3*vector3         : flt: 17.0779 dbl: 5.04801 nsec/cyc (0.30)
vector3 x vector3       : flt: 16.3483 dbl: 11.8949 nsec/cyc (0.73)
matrix3*vector3         : flt: 18.4656 dbl: 17.3939 nsec/cyc (0.94)
trafomatrix*vector3     : flt: 18.4069 dbl: 17.374  nsec/cyc (0.94)
trafomatrix::inverse    : flt: 120.072 dbl: 84.1365 nsec/cyc (0.70)
trafomatrix*trafomatrix : flt: 79.1698 dbl: 86.2536 nsec/cyc (1.09)
matrix3*matrix3         : flt: 61.501  dbl: 68.3568 nsec/cyc (1.11)

--<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>---------------------------------------
(nothing)               : 0.287412 nsec/cyc
vector3::length         : flt: 16.2991 dbl: 16.4841 nsec/cyc (1.01)
vector3*vector3         : flt: 10.1551 dbl: 5.49865 nsec/cyc (0.54)
vector3 x vector3       : flt: 16.7179 dbl: 9.67714 nsec/cyc (0.58)
matrix3*vector3         : flt: 27.3062 dbl: 19.6488 nsec/cyc (0.72)
trafomatrix*vector3     : flt: 27.463  dbl: 19.1422 nsec/cyc (0.70)
trafomatrix::inverse    : flt: 70.7865 dbl: 64.1694 nsec/cyc (0.91)
trafomatrix*trafomatrix : flt: 83.5633 dbl: 68.334  nsec/cyc (0.82)
matrix3*matrix3         : flt: 67.8196 dbl: 59.2467 nsec/cyc (0.87)

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-3.4.2 for Win32, NPTL, WINE EMULATION!>--
(nothing)               : 0.356384 nsec/cyc
vector3::length         : flt: 29.7528 dbl: 27.0064 nsec/cyc (0.91)
vector3*vector3         : flt: 15.3446 dbl: 7.66602 nsec/cyc (0.50)
vector3 x vector3       : flt: 21.3162 dbl: 13.4919 nsec/cyc (0.63)
matrix3*vector3         : flt: 26.0698 dbl: 26.6165 nsec/cyc (1.02)
trafomatrix*vector3     : flt: 26.3388 dbl: 25.8365 nsec/cyc (0.98)
trafomatrix::inverse    : flt: 126.99  dbl: 114.622 nsec/cyc (0.90)
trafomatrix*trafomatrix : flt: 128.008 dbl: 134.121 nsec/cyc (1.05)
matrix3*matrix3         : flt: 102.492 dbl: 105.253 nsec/cyc (1.03)
From: Wolfgang W. <ww...@gm...> - 2005-05-22 14:41:38
|
Hello everybody! Recently, I've been thinking about the RT core model, i.e. how the actual calculation should be split for different boxes. The idea I would like to present below is based on discussion with Andrew about stream processing and the "Kilauea paper" and follows my proposal from some time back about using as many worker threads as CPUs in the system and additionally doing userspace (coroutine) switching. Overview ======== The core consists of the following main parts: - a request PQ (priority queue) - processing kernels; one for each request type - a list of saved states (coroutines) - a response queue - an (optional) network interface Requests all have a priority, a return-path (i.e. which computer and couroutine generated the request and needs to receive the corresponding response) and a type. The type can be something like "calculate ray shape intersection", "shade this ray", "completely trace this ray",... The system below is independent of these types; there just needs to be a kernel which can process a request of each type. Operation ========= 1. Requests are queued in the request PQ according to their priority (higher prio first). Inputs for the queue are network (received request) or internal sources. 2. There are several worker threads (as many as there are CPUs). As soon as a request is available (and there are no responses to be handeled locally) and a thread is "free" (i.e. needs more work to do), it [creates a new coroutine context and] dequeues the request and then processes it using the apropriate kernel routines. 3. The kernel then computes a response ("result") which is fed into the response queue to be "sent" back according to the request's return path (either via network or internally, see below). 4. The kernel itself may need to make further requests in order to process the current request. E.g. for shading a surface, tracing several rays (e.g. shadow rays) may be needed. 
The kernel will then enqueue these requests in the request PQ with a priority which is larger than the prio of the current request. The request PQ will decide whether to keep them in the local queue or send the requests over the network (thereby removing them from the local queue). When all the sub-requests are queued, the thread must "wait" for the responses. Instead of doing nothing, the current coroutine state is saved and the thread continues as "free" (see 2. and 5.). 5. If a worker thread is "free" and there is a response (in the response Q) which needs to be processed locally, the thread takes this response and continues with the saved coroutine state which corresponds to the response. It is important that 4. takes precedence over 2. so that responses are handeled before new requests. Here is a figure showing the flow of execution for the threads: http://www.cip.physik.uni-muenchen.de/~wwieser/tmp/files/ray-struct.png Portability =========== In order for this concept to be portable, at least 3 threads are required: 1. There is one network receive thread [could be more as well] which simply "poll(2)"s/listens for incoming requests/responses over the network and queues them in the request PQ. 2. There is a network send thread which is either blocked by the request/ response queues waiting for something to send or blocked in poll(2) [or analogous windows function] trying to send something. [This cannot be done by the thread from 1 because therefore the thread would have to be able to simultaniously to a poll() and monitor the two queues.] 3. At least one worker thread with several coroutine states. The design above demands for coroutine switching between threads, i.e. that one thread can execute a coroutine created by another thread. According to my tests, this makes no trouble under Linux and FreeBSD but is currently not possible under Windows (src/lib/threads/test-switch.cc). 
This means that [as long as we find no solution] the Windows version will - use only one worker thread (i.e. suitable for uniprocessor boxes) or - use one thread per coroutine context or - use several threads but keep coroutine contexts tied to their threads [which means that instead of processing responses in the order they arrive the next "free" thread processes the first response with a compatible coroutine context] Performance =========== The most critical part is performance and how much overhead is introduced by the above approach. I made some tests under Linux (ix86), FreeBSD (AMD64) and Windows (ix86). For a more detailed view of timings, see my next email. As for now, a the basic work cycle is: - Take a request from a queue. - Switch to the apropriate coroutine. - [Process the request. This is a no-op here.] - Put a response into a queue. - Switch the coroutine back. [see src/lib/threads/test-switch.cc] One such work cycle is 370 nsec (0.37 microseconds) on an AhlonXP with 1.47 GHz running Linux-2.6.11 using gcc-4.1 and a glibc with NPTL support. This is about 15 vector transformations (matrix * vector) or 3 matrix inversions. OTOH, on an AMD64 (Athlon64) with 1.8 GHz running FreeBSD-5.4 and gcc-3.4, one such work cycle is 5100 nsec. (Yes, 5.1 usec!) This sucks because it is something like 200 vector transformations on that box. There are no good timings for Windows because the code fails to run properly when optimization is enabled during compiling (use -O0). Without optimization, I get 4.5 usec on an P4-M, 1.18 GHz. (Note that optimization (-O2) improves runtime by a factor of 4.5 on the Linux box.) Conclusion ========== Threads & coroutines can be used to implement a request-response stream-like system. However, the requests should be chosen in a way that the pure kernel calculation time of each such request consumes an order of magnitude more CPU time than the overhead introduced by the queues and switching. If anyone (Andrew?) 
has estimates for the ray-surface intersection times on current hardware for "typical" scenes (e.g. a large triangle mesh), please post them for comparison. Wolfgang |
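The queue discipline described in the email above (sub-requests enter the request PQ at higher priority than their parent; a "free" worker handles pending responses before new requests) can be sketched roughly as follows. This is a minimal single-queue illustration only; all names are invented here, not taken from the actual RAY sources, and the coroutine switching and network threads are omitted.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct Request {
    int prio;   // sub-requests get a higher prio than their parent request
    int id;
};

struct ByPrio {
    bool operator()(const Request &a, const Request &b) const
    { return a.prio < b.prio; }
};

class WorkQueues {
    std::priority_queue<Request, std::vector<Request>, ByPrio> requests;
    std::queue<int> responses;   // ids of saved coroutine states
public:
    void pushRequest(const Request &r) { requests.push(r); }
    void pushResponse(int coroutineId) { responses.push(coroutineId); }

    // What a "free" worker picks up next; -1 means idle.
    // Responses take precedence over new requests (step 5 before step 2).
    int next(bool &isResponse) {
        if (!responses.empty()) {
            isResponse = true;
            int id = responses.front();
            responses.pop();
            return id;
        }
        if (!requests.empty()) {
            isResponse = false;
            int id = requests.top().id;
            requests.pop();
            return id;
        }
        return -1;
    }
};
```

A real implementation would guard both queues with a mutex/condition variable and, on picking up a response, switch to the saved coroutine state rather than just returning an id.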
From: Wolfgang W. <ww...@gm...> - 2005-05-09 08:32:20
|
Hello Everybody! While not advancing much in the development of the code itself, I spent a lot of hours in the last 2 weeks getting the code to compile for the MS Windows (win32) platform. And good news: I finally succeeded! So, the complete code in devel-ray/src/ can now be compiled for win32 using the MinGW environment (Minimalist GNU for Windows; I recommend the Debian packages) under Linux using a cross-compiling gcc (i586-mingw32msvc-gcc, version 3.4.2). Also, I downloaded Tor Lillqvist's pre-compiled GLib. The nice thing is that one can do all of it on the local Linux PC in a second shell and just type "make" into the native and the cross build terminal. All that without losing the comfort of Unix shells... Of course, the check programs ("make check") are not executed on Linux, but I tested them on a Windows XP box. So, what we get are native Windows console programs. Don't mix that up with DOS mode. It is native Windows, 32 bit, and we can access all RAM (not just 640kb), threads, and so on. The only thing not yet working is coroutines. Unfortunately, I currently do not understand why because (I think at least) it is the same switching code as used in libpcl, which works -- at least according to the author... Anybody here with some experience or knowledge on how to write a non-console application for Windows? Regards, Wolfgang |
From: Wolfgang W. <ww...@gm...> - 2005-03-31 22:57:58
|
On Tuesday 22 March 2005 22:10, Andrew Clinton wrote: > remember reading somewhere that for a good accelerator, the number of cells >"stabbed" by a ray will be something like O(n^(1/3)) (cube root). If we only >get the first intersection, its probably more like O(log(n)). > Very good. Thanks for the infomation. > I think we could probably get away with an even simpler hash table. For > example, just use a fixed-size table (eg. 256 entries), and allow for some > small number of re-intersection tests (in this case, maybe around 1/128 of > tests would have a collision). This would be really simple to implement, > fast, and uses little space while getting most of the benefit of other data > structures. > Yes, another interesting idea. We could even get along with 256 booleans in a bitfield and let the hash compute the bit index. > The problem with using boolean values for this purpose is that we need some > way to reset them once the ray intersections have been completed for a > single ray > Of course. But as this bitfield is stored in one consecutive area in the ray data structure, the reset operation is basically a memset(bitfield,0,size); Especially when combining it with the hash approach introduced by you above, the reset for a 256-entry hash is setting 8 integers (with 32bit each) to 0. Really cheap. And especially it fits nicely into the cache. > Anyways, the user can > always be able to store more geometry by going and buying more RAM... > Sure. But for example I myself cannot go and buy more RAM for the >=25-node P4 HT cluster I have access to... Each of these boxes has 1Gb, so the total is 25Gb which means the cluster can store a lot more scene information than one box. Concerning scenes with huge amounts of data: Being able to handle them has been a design goal right from the beginning. (Because I am doing these planet topography renders: E.g. the easiest way for a flight on Mars would be to simply load the complete scene information -- which is 2.4Gb... 
Currently I am handling that with a POVRay hack which loads only parts of the image into RAM -- but it is such a crude hack that I do not dare to publish the patch, and it requires knowledge of the required area in advance.) PRIMITIVES ---------- I'm currently making my way through PBRT (not in-depth, yet). I like the simplicity of the design of their interface for Primitives. However, I would do things a little differently: - All objects live in object coordinates and are instanced for the scene. I.e. all things seen in the scene are "instance primitives". This way one can eliminate one transformation for each instance. - There are special parametric transformations which allow e.g. having 1000 objects in a line without instancing them 1000 times. The first problem is how to identify these in an accelerator (e.g. using a pointer and an index as mentioned earlier by me), but I failed to think up an efficient and simple way to handle that for arbitrary depths. So, I'd suggest to instead use separate accelerators for these parametric transformations. The advantage is that the mentioned "objects-in-a-line" trafo could internally do an O(1)-like acceleration in case the objects' AABBs do not overlap much. - CSG seems to fit naturally into the framework by using CSG difference and CSG intersection primitives. This would directly allow the use of BVH bounding. Andrew, since you are the expert for that: How would you suggest introducing CSG with Kd-tree bounding when starting from the known "Primitive" from PBRT? (I would like to do CSG with just about anything which can possibly be given an inside and an outside, much like POVRay.) (BTW, it seems that one could easily allow the user to do a special type of bounding on certain primitives, thereby giving an opportunity to compensate for the performance loss at shells7.pov using BSP...) Another question: Does anybody know a good way to prevent self-shadowing (other than the usual minimum epsilon depth)? Regards, Wolfgang |
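The fixed-size mailbox table discussed in this thread (256 entries, cheap per-ray reset via memset, collisions causing only a few harmless re-intersection tests) could look roughly like this. A minimal sketch under stated assumptions: the direct-mapped layout, the hash of the primitive pointer, and all names are invented here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Direct-mapped mailbox table: one slot per hash bucket. A collision merely
// evicts the old entry, causing a redundant re-test of that primitive --
// never a missed intersection.
struct MailboxTable {
    static const std::size_t N = 256;    // fixed size, power of two
    const void *slot[N];

    // Per-ray reset is a single memset over 256 pointers -- cheap, and the
    // whole table fits nicely into the cache.
    void reset() { std::memset(slot, 0, sizeof(slot)); }

    // Returns true if 'prim' still needs to be intersected for this ray.
    bool mark(const void *prim) {
        // Shift away alignment zero bits before masking into the table.
        std::size_t h =
            (reinterpret_cast<std::uintptr_t>(prim) >> 4) & (N - 1);
        if (slot[h] == prim)
            return false;     // already tested for this ray
        slot[h] = prim;       // record (possibly evicting a colliding entry)
        return true;
    }
};
```

Each worker thread (or coroutine/ray) would own one such table, so no locking is needed and the accelerator and scene graph stay read-only.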
From: <ajc...@en...> - 2005-03-22 21:10:41
|
> I wanted to make the point about storing the information in the _ray_. > > Which data structure to choose here is an interesting question. > Direct storage in the primitive is of course the fastest way. > > Using a tree or hash requires dynamic resizing of this structure, which > is slow; however, using chunked allocation will fix that. > Another way might be to associate a "cost" with intersection calculation > and only store a fixed number of tested objects: those which cost most. > I think we could probably get away with an even simpler hash table. For example, just use a fixed-size table (eg. 256 entries), and allow for some small number of re-intersection tests (in this case, maybe around 1/128 of tests would have a collision). This would be really simple to implement, fast, and uses little space while getting most of the benefit of other data structures. > Another idea would be to give each primitive a serial number in the > range 0..N-1 (with N being the number of primitives) and use a > per-coroutine bitfield indexed with the primitive serial number to store > the information. Memory consumption would be fixed to N/8 bytes > per coroutine (or per simultaneous ray). However, the information > would not be cache-local and would probably waste memory, although being > faster (algorithmically) than a hash. > The problem with using boolean values for this purpose is that we need some way to reset them once the ray intersections have been completed for a single ray (I thought about this before when working on the POV-Ray patch). This is possible but it will require auxiliary storage for the list of primitives, so that we can go back and reset the booleans when the ray is completed. > How many such (already intersected) objects do we have to expect to be > stored for one ray? > I remember reading somewhere that for a good accelerator, the number of cells "stabbed" by a ray will be something like O(n^(1/3)) (cube root). 
If we only get the first intersection, it's probably more like O(log(n)). So I'd imagine that in almost all cases there will be fewer than ~256 objects intersected by the ray, for scenes that fit in memory. > <Splitting the scene information> > This is an interesting question and it will require more investigation. However, I'm still concerned that this type of scene splitting will be overkill in almost all cases. It would be much better to have good support for instancing, geometry/texture caching, and procedurals so that the user can actively control the memory use of their scene. Anyways, the user can always store more geometry by going and buying more RAM... Andrew ---------------------------------------- This mail sent through www.mywaterloo.ca |
From: Wolfgang W. <ww...@gm...> - 2005-03-20 22:24:44
|
<News> Seems like I finally got the first version of the C++-like SPL grammar (see src/spl/grammar.yy). I spent much more time on that than I thought. Special SPL features (like nice specification of blend maps) still need to be added but are generally unlikely to cause difficulties (i.e. introduce s/r or r/r conflicts) as long as they are prefixed by a reserved word (like "map"). I'll now push forward with the SPL compiler. On Friday 18 March 2005 02:43, Andrew Clinton wrote: > > It seems to me that the natural way which remains is to store a list of > > intersected primitives in the ray. The downside is that this sort of > > thing depends a bit on the accelerator used (BVH does not need mail > > box numbers, I guess...) > > Yes, a list would work but it will scale poorly if the ray needs to test a > lot of primitives (need to test everything in the list on each > intersection). That's why I mentioned we might want to do it with a hash > table. > I wanted to make the point about storing the information in the _ray_. Which data structure to choose here is an interesting question. Direct storage in the primitive is of course the fastest way. Using a tree or hash requires dynamic resizing of this structure, which is slow; however, using chunked allocation will fix that. Another way might be to associate a "cost" with intersection calculation and only store a fixed number of tested objects: those which cost most. Another idea would be to give each primitive a serial number in the range 0..N-1 (with N being the number of primitives) and use a per-coroutine bitfield indexed with the primitive serial number to store the information. Memory consumption would be fixed to N/8 bytes per coroutine (or per simultaneous ray). However, the information would not be cache-local and would probably waste memory, although being faster (algorithmically) than a hash. How many such (already intersected) objects do we have to expect to be stored for one ray? 
<Splitting the scene information> I think CSG could be split among several computers if we transfer the CSG root and (maybe) the first few levels to all computers and the subsequent subtrees to different boxes. This works if we store a description of which computer has the required information at the place where a subtree was cut off. However, this is a bit more complicated than just passing rays. Another interesting thing one tends to overlook is how to make "instancing" work with split scene information and CSG. Kilauea (without CSG) could simply make sure that one box has all instances of a given object. This, however, undermines equal load balancing. For CSG, it is even harder. The easy solution is (of course) to give each box a copy. As a general consideration, the Kilauea people have correctly pointed out that moving rays is better than moving geometry because geometry can be huge. This is the "only" reason why I am considering it. However, the problem is that if geometry does not only consist of triangles, it is impossible to know in advance how to split it so that all boxes will be finished at the same time. (Otherwise, that half of the boxes which got the isosurface is still busy while all those with just triangles are idle and cannot help.) This, it seems, does not happen when moving geometry (on-demand loading) because all computers would process rays as long as there are rays to compute. However, loading and possibly re-loading the geometry (if it is too large to fit into the memory of one box and hence has to be cached) eats up a lot of time and bandwidth. Furthermore, the server(s) tend(s) to become the bottleneck. So, the interesting question is whether there is an easy way to combine these two approaches. Cheers, Wolfgang |
From: <ajc...@en...> - 2005-03-18 01:43:28
|
> In order for multi-threading to work we may store the mailbox > number neither in the accelerator nor in the scene graph. Both should be used > read-only by the rendering threads. > > It seems to me that the natural way which remains is to store a list of > intersected primitives in the ray. The downside is that this sort of thing > depends a bit on the accelerator used (BVH does not need mail > box numbers, I guess...) > Yes, a list would work but it will scale poorly if the ray needs to test a lot of primitives (need to test everything in the list on each intersection). That's why I mentioned we might want to do it with a hash table. > (Anyways, the above about mailbox numbers was probably not the whole > story because we may want to get all intersections with an object, so some > sort of inside/outside information must be used as well, I guess.) > If we want to get all intersections, then the primitive intersection routine should do that for us. There should not be a need to call intersection routines more than once for a ray. I use a list of Intersection objects in my ray tracer for tests that need to get all the intersections for a ray (ie. I have two virtual methods on all objects: nearestIntersection and allIntersections. We might even want an anyIntersection method for simple visibility tests). > One other thing which is not described in the paper is how the "first- > come-first-served" works. If box A wants to have a ray traced and > sends a message to all the other boxes, how can box C and D know > that B accepted it without introducing a race? I see only 2 solutions: > 1. Use one server S which keeps track of all rays. A sends the ray to S > which sends it to B. > 2. A directly sends the ray to B without telling anybody else. > Unfortunately, 1 introduces a bottleneck: the server S. Fortunately, I > already have an idea how one could do 2. (later) > This is something that has confused me as well. The only clue they give is in section 3.3. 
They mention that rays can be balanced by allowing hosts to take rays from an input queue on first-come-first-served basis, but the rest of the discussion does not mention any global queue to do this. I'm guessing they do something like (2) with local queues on each host, by sending load information messages between the sets. If there is not enough space in the local queue to process a ray, the host will consult the loading table and send the ray appropriately. Just speculation... Andrew ---------------------------------------- This mail sent through www.mywaterloo.ca |
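The three-method intersection interface Andrew describes above (nearestIntersection, allIntersections, and a cheap anyIntersection for visibility tests) could be sketched like this. The method names come from the mail; the Ray, Intersection, and toy Slab types are invented here purely for illustration and are not from anyone's actual ray tracer.

```cpp
#include <cassert>
#include <vector>

struct Ray { double tmin, tmax; };     // toy 1-D ray: just a valid t-range
struct Intersection { double t; };     // real code would carry surface data

struct Primitive {
    virtual ~Primitive() {}
    virtual bool nearestIntersection(const Ray &r, Intersection &is) const = 0;
    virtual void allIntersections(const Ray &r,
                                  std::vector<Intersection> &out) const = 0;
    // Cheap early-out test for shadow/visibility rays.
    virtual bool anyIntersection(const Ray &r) const = 0;
};

// Toy primitive hit at exactly t = a and t = b (think: entry/exit points of
// a solid, a < b), so CSG-style all-intersection queries make sense.
class Slab : public Primitive {
    double a, b;
public:
    Slab(double a_, double b_) : a(a_), b(b_) {}
    bool nearestIntersection(const Ray &r, Intersection &is) const {
        if (r.tmin <= a && a <= r.tmax) { is.t = a; return true; }
        if (r.tmin <= b && b <= r.tmax) { is.t = b; return true; }
        return false;
    }
    void allIntersections(const Ray &r, std::vector<Intersection> &out) const {
        if (r.tmin <= a && a <= r.tmax) out.push_back(Intersection{a});
        if (r.tmin <= b && b <= r.tmax) out.push_back(Intersection{b});
    }
    bool anyIntersection(const Ray &r) const {
        Intersection dummy;
        return nearestIntersection(r, dummy);
    }
};
```

An accelerator could implement the same Primitive interface itself (as discussed earlier in the thread for pbrt-style designs), so callers never care whether they talk to a leaf object or a whole tree.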
From: Wolfgang W. <ww...@gm...> - 2005-03-18 00:12:53
|
On Tuesday 15 March 2005 02:39, ajc...@en... wrote: > I'm not sure whether there is any other benefit to the ray-level > partitioning except to make it easier to split up the scene database when > it is too large for a single host. > I guess there is in case the image is very inhomogeneous, like having tiny spots which trigger lots of secondary rays. Something which could happen when rendering an image from space (lots of black background, some interesting spots). > Yes, I think their partitioning strategy is only going to work with simple > primitives. Other stuff will need to be duplicated on all the nodes > (doesn't seem too harsh a constraint, I don't think we expect to have > millions of isosurfaces or CSG models). > Well, but one can do CSG operations on meshes when defining inside/outside of the mesh. IIRC POVRay can do that. > > > If we take this approach to threading, we'll need to worry about the > > > following access control issues: > > > 1. Access to mailbox numbers on primitives in the accelerator > > > 2. Access to a read/write irradiance cache (if we use one) > > > 3. Writeback to the frame buffer (we can do this in blocks - this is > > > one reason we should probably use blocks of pixels rather than rays as > > > the basic unit of parallelism). > 1. Mailboxes are used to store the ray number on a primitive so that while > traversing the accelerator we only intersect it once (before intersecting > any primitive, compare the mailbox number with the current ray => if equal, > then the primitive has already been tested). > In order for multi-threading to work we may store the mailbox number neither in the accelerator nor in the scene graph. Both should be used read-only by the rendering threads. It seems to me that the natural way which remains is to store a list of intersected primitives in the ray. The downside is that this sort of thing depends a bit on the accelerator used (BVH does not need mailbox numbers, I guess...) 
If the maximum number of "threads" is known in advance (and this may not be the case if coroutines are created on demand), then the array method you mentioned could be used as well. (Anyways, the above about mailbox numbers was probably not the whole story because we may want to get all intersections with an object, so some sort of inside/outside information must be used as well, I guess.) > [2] ... Depending on how we implement it, it will result in > nondeterministic image output: eg. a simple race between two threads will > cause a different image result when one gets the cached value and the other > needs to create it (or vice-versa). > Well, it is right that this will cause the image to be non-deterministic. But this is due to the fact that the irradiance cache is somewhat flawed by design in this respect because the result depends on the order in which rays are calculated (and thereby cache entries are generated). > 3. I mentioned so that we should avoid locking whenever a ray makes a > contribution back into the buffer. > Don't worry too much about locking. Locking mainly becomes a problem for scalability if the thread holding the lock is likely to hold it for a longer time. For the above problem, there is e.g. a very simple solution which can work on single pixels as well by using private output queues in each thread: The thread will queue 8x8=64 pixels in its private output queue and then lock once to transfer the complete queue to the framebuffer handler (this is 4 pointer assignments and hence the lock will not be held very long). BTW, thanks for pointing me at the paper by Christensen (ray differentials). Although I find the "original" Igehy paper more informative concerning ray differential computation, the former has some more information about Kilauea. Seems like they tesselate everything beforehand. This (IMO) means that 1. they do not have a problem splitting the scene (it's just meshes) 2. all ray intersections take about equally long. 
Concerning (2): In case Kilauea (as described) had to calculate isosurface intersections as well, those boxes which happen to have that part of the scene without an isosurface object will in the end be idle and wait for all those computers which still have slow isosurface intersections to calculate. This is due to the way they divide up the scene but probably they do not see the problem because they'd tesselate everything first in any case. One other thing which is not described in the paper is how the "first- come-first-served" works. If box A wants to have a ray traced and sends a message to all the other boxes, how can box C and D know that B accepted it without introducing a race? I see only 2 solutions: 1. Use one server S which keeps track of all rays. A sends the ray to S which sends it to B. 2. A directly sends the ray to B without telling anybody else. Unfortunately, 1 introduces a bottleneck: the server S. Fortunately, I already have an idea how one could do 2. (later) Wolfgang |
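The per-thread private output queue described above (buffer an 8x8 block, then take the framebuffer lock once per 64-pixel transfer instead of once per pixel) could look like this minimal sketch. All class names are invented; the real handler would presumably swap queue pointers rather than copy, as the mail suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

struct Pixel { int x, y; float r, g, b; };

// Shared framebuffer handler: its lock is taken once per block transfer
// and held only briefly.
class FrameBufferSink {
    std::mutex mtx;
    std::vector<Pixel> frame;   // stand-in for the real framebuffer
public:
    void flushBlock(std::vector<Pixel> &local) {
        std::lock_guard<std::mutex> lock(mtx);
        frame.insert(frame.end(), local.begin(), local.end());
        local.clear();
    }
    std::size_t pixelCount() {
        std::lock_guard<std::mutex> lock(mtx);
        return frame.size();
    }
};

// Per-thread private queue: completely lock-free until a block is full.
class PixelQueue {
    std::vector<Pixel> local;
public:
    void put(FrameBufferSink &fb, const Pixel &p) {
        local.push_back(p);
        if (local.size() >= 64)      // 8x8 block complete -> one locked flush
            fb.flushBlock(local);
    }
    void finish(FrameBufferSink &fb) {      // drain the remainder at the end
        if (!local.empty())
            fb.flushBlock(local);
    }
};
```

Because each worker touches only its own PixelQueue, contention is limited to one short critical section per 64 pixels, which keeps the scalability concern from point 3 of the earlier mail small.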
From: <ajc...@en...> - 2005-03-15 18:51:20
|
I've just seen the film "Robots" (Blue Sky Studios) and was really impressed with the graphics. Notably, a number of the shots had effects that clearly couldn't have been done without ray tracing... Here is a page about their rendering software: http://www.blueskystudios.com/content/process-tools.php It turns out that the entire movie was actually ray traced! Another interesting point: they trace curved patches (even subdivision surfaces) without tesselation. I'd be really interested to know how they have been able to do this efficiently. Of course they also have a renderfarm with 500+ machines :) Andrew ---------------------------------------- This mail sent through www.mywaterloo.ca |
From: <ajc...@en...> - 2005-03-15 01:40:02
|
Quoting Wolfgang Wieser <ww...@gm...>: > > (If you don't have ACM access I can mail you the paper) > > > Well, it turned out that we do not have ACM in the list of subscribed > journals at the LMU, so I walked over to the guys at the TUM to get an > account there (something I should have done eariler anyways). > > The design seems to be a quite good one. Especially, I like the idea of > using and passing rays and photons because the simplicity in which > the massive parallelism is achieved is appealing. Furthermore, this design > is highly parallel without using a "hard-core" stream design with small > kernels for everything and hence nearer to a traditional sequential renderer > > and also easier to implement. Actually, it is somehow like I dreamt of > being able to do some months ago: Dividing up work at the ray level. > Although they don't mention it in the paper, I'd imagine that the partitioning of eye rays to rendering engine threads uses some sort of dynamic tile assignment, I've seen this mentioned in some other papers. I'm not sure whether there is any other benefit to the ray-level partitioning except to make it easier to split up the scene database when it is too large for a single host. > However, as usual, people don't comment on the interesting details. > Let's consider a scene which consists only of one very large CSG tree. > Can it be split without having to copy some non-leaf nodes? > If not, a uniform distribution would cause lots of overhead and consequently, > > subtrees would have to be distributed. This, in turn, may make some boxes > (with important subtrees) much more busy than others. > > Or let's say one box happens to get the isosurface while the others only > have triangle meshes. Guess who will be waiting? > Yes, I think their partitioning strategy is only going to work with simple primitives. 
Other stuff will need to be duplicated on all the nodes (doesn't seem too harsh a constraint, I don't think we expect to have millions of isosurfaces or CSG models). > Furthermore, where is the texture information stored? Where the interior > information (like index of refraction, media)? > Probably these things are assigned as object level rather than triangle level, so it is not too unreasonable to duplicate it among the nodes. > > This is usually done by making the accelerator have the same interface as > > the primitives themselves. This is what I've done in my raytracer, and is > > also what is done in pbrt. > > > ...and would you advise against doing so? > I would advise in favor of doing so. > > > I think our system should make use of ray differentials to choose the > level > > of tesselation. Check online for the paper "Ray differentials for > > multiresolution geometry caching" by Christensen. This way, highly > > incoherent rays for GI can use a low-res tesselation, saving lots of > > effort. > > > Something similar could be done for isosurfaces by bowering the solver > accuracy in this case. I actually spent some thoughts on that already. > (I'll check out the paper because estimating the ray differential was the > point I got stuck at.) > Lowering accuracy of isosurface sounds like a good idea, I wonder if it will work in practise. > > One more thing (I don't think I was that clear) is that tesselated > geometry > > can be put directly into the accelerator. > > > Actually, you were clear enough, although I was not aware on how to exactly > put the information into the leaf. > And I do not yet fully understand it (triangle duplication problem). > Triangles that lie on accelerator cell boundaries will need to be duplicated. The alternative is to use lists of pointers to refer to them, in which case we introduce a level of indirection (making caching performance worse), and need to store lists of pointers (increasing memory use). 
> [previously] > > If we take this approach to threading, we'll need to worry about the > > following access control issues: > > 1. Access to mailbox numbers on primitives in the accelerator > > 2. Access to a read/write irradiance cache (if we use one) > > 3. Writeback to the frame buffer (we can do this in blocks - this is one > > reason we should probably use blocks of pixels rather than rays as the > > basic unit of parallelism). > > > Maybe I am blind but I barely see problems. > 1. is used read-only (don't we?), > 2. is a classical locking issue and probably a distribution problem (OK) and > > 3. is just a minor locking/queuing problem > (Using blocks of pixels for each node is interesting because of the > resulting positive effect on caching due to image space coherence -- > especially when demand-loading textures/geometry.) > I was just trying to enumerate the cases where we will need to lock global data structures. 1. Mailboxes are used to store the ray number on a primitive so that while traversing the accelerator we only intersect it once (before intersecting any primitive, compare the mailbox number with the current ray => if equal, then the primitive has already been tested). This can improve performance a lot, check out the source code for my patch (I added mailbox numbers to the OBJECT struct). There are multithreading issues because two threads could need to update the mailbox number of a primitive independently. One alternative I can think of is to make an array of mailbox numbers per primitive, but memory use will increase. We could also potentially use an alternative method of recording primitives that are visited within a thread (perhaps some sort of hashing would do the trick...) 2. This is probably not going to be too much trouble, but we will need to make a lot of access to the irradiance cache during rendering. I'm don't really have any idea how much impact this will have on performance but we should be aware of it. 
Depending on how we implement it, it will result in nondeterministic image output: eg. a simple race between two threads will cause a different image result when one gets the cached value and the other needs to create it (or vice-versa). 3. I mentioned so that we should avoid locking whenever a ray makes a contribution back into the buffer. If we are allocating blocks of pixels to individual threads, then queue up the pixels in local storage and then lock the frame buffer and write the block all at once. I'm guessing that other multithreading issues will arise in the future, these were just what I've thought of so far. ---------------------------------------- This mail sent through www.mywaterloo.ca |
From: Wolfgang W. <ww...@gm...> - 2005-03-14 23:00:58
|
> (If you don't have ACM access I can mail you the paper) > Well, it turned out that we do not have ACM in the list of subscribed journals at the LMU, so I walked over to the guys at the TUM to get an account there (something I should have done earlier anyway). The design seems to be a quite good one. Especially, I like the idea of using and passing rays and photons because the simplicity with which the massive parallelism is achieved is appealing. Furthermore, this design is highly parallel without using a "hard-core" stream design with small kernels for everything and is hence nearer to a traditional sequential renderer and also easier to implement. Actually, it is somehow like what I dreamt of being able to do some months ago: dividing up work at the ray level. However, as usual, people don't comment on the interesting details. Let's consider a scene which consists only of one very large CSG tree. Can it be split without having to copy some non-leaf nodes? If not, a uniform distribution would cause lots of overhead and consequently, subtrees would have to be distributed. This, in turn, may make some boxes (with important subtrees) much busier than others. Or let's say one box happens to get the isosurface while the others only have triangle meshes. Guess who will be waiting? Furthermore, where is the texture information stored? Where the interior information (like index of refraction, media)? On Monday 14 March 2005 00:26, Andrew Clinton wrote: > I agree; they needed to develop distributed algorithms for ray intersection > and photon gather, which require rays to be sent to multiple hosts and then > the results combined to get the result. It would be much more useful to > encourage users to develop scenes so that objects can be demand-loaded or > built on the fly, so that the system can manage memory effectively on a > single host. > Of course. However, being able to handle such large scenes would also be nice in any case... 
I'll try and see if I can come up with a rough design based on what I've read before going on "dreaming"... > SSE instructions fit 4 floats but only 2 doubles. > ...which unfortunately makes them pretty useless in a 3-dimensional world represented with double floats... I'll add something like a "main" floating point type which can be float or double, switchable at compile time. > This is usually done by making the accelerator have the same interface as > the primitives themselves. This is what I've done in my raytracer, and is > also what is done in pbrt. > ...and would you advise against doing so? Another advantage of this would be to allow procedural transformations (as defined previously) to have their own local accelerator. (Of course O(log(n)+log(m)) > O(log(n+m)), so one big accelerator is better than several nested ones.) > I would focus only on KD trees. You can use my patch to start, > I've had it on my HD for some time already :) > Additional optimizations in the book include the following: > Thanks for pointing these out. >> [tesselation] > finely tesselated nurb vs. an analytic method). I would make this the > primary method of handling these types of patch surfaces and ignore > analytic methods for anything but user-defined isosurfaces or infinite > objects. ...and box, sphere, disc, quadric, I guess. > Even for isosurfaces we might want to build in a conversion to be > able to get fast previews. > One of the main points of using isosurfaces (e.g. for landscapes) is that a high degree of complexity can be achieved with a low memory footprint. I'd rather focus on efficient (mesh-like) bounding schemes for isosurfaces to speed up root finding than on tesselating them. (This probably requires a rough inside and an outside mesh with roots in between. Not a simple thing if the surface has bubbles and holes.) > I think our system should make use of ray differentials to choose the level > of tesselation. 
Check online for the paper "Ray differentials for > multiresolution geometry caching" by Christensen. This way, highly > incoherent rays for GI can use a low-res tesselation, saving lots of > effort. > Something similar could be done for isosurfaces by lowering the solver accuracy in this case. I actually spent some thought on that already. (I'll check out the paper because estimating the ray differential was the point I got stuck at.) > One more thing (I don't think I was that clear) is that tesselated geometry > can be put directly into the accelerator. > Actually, you were clear enough, although I was not aware of how exactly to put the information into the leaf. And I do not yet fully understand it (triangle duplication problem). [previously] > If we take this approach to threading, we'll need to worry about the > following access control issues: > 1. Access to mailbox numbers on primitives in the accelerator > 2. Access to a read/write irradiance cache (if we use one) > 3. Writeback to the frame buffer (we can do this in blocks - this is one > reason we should probably use blocks of pixels rather than rays as the > basic unit of parallelism). > Maybe I am blind but I barely see problems. 1. is used read-only (isn't it?), 2. is a classical locking issue and probably a distribution problem (OK) and 3. is just a minor locking/queuing problem. (Using blocks of pixels for each node is interesting because of the resulting positive effect on caching due to image space coherence -- especially when demand-loading textures/geometry.) > Whatever advantage we gain from locality within a > coroutine might be offset by the cost of maintaining suspended states. > I completely agree. This is a very nice formulation of what I meant by "overhead" in previous postings. Regards, Wolfgang |
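The "main" floating point type switchable at compile time, mentioned above, is usually just a guarded typedef. A minimal sketch, with the macro name RAY_FLOAT_SINGLE and the name real_t invented here (the actual code in lib/numerics may use different names):

```cpp
#include <cassert>

// Hypothetical compile-time switch: build with -DRAY_FLOAT_SINGLE to use
// single precision everywhere that is written against real_t. Numerically
// sensitive code (e.g. the root solvers) can still name 'double' explicitly.
#ifdef RAY_FLOAT_SINGLE
typedef float real_t;
#else
typedef double real_t;   // default: double, as preferred in the discussion
#endif

// Example of geometry code written against real_t rather than a fixed type:
inline real_t lerp(real_t a, real_t b, real_t t)
{
    return a + (b - a) * t;
}
```

Writing the bulk of the code against such a typedef (or, as lib/numerics already does, against float/double templates) keeps the float-vs-double decision a one-line build option instead of a codebase-wide rewrite.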
From: <ajc...@en...> - 2005-03-13 23:26:05
|
> > One thing that Kilauea supports that we probably won't need is the ability to split a large scene database among a collection of machines, which adds a lot of complexity to their design.

> This is a nice feature but I think we should not implement that unless (for some reason) we could do it easily (which I actually doubt). Instead, I would like to have on-demand loading of large images, meshes and NURBS, which are the things that usually use up most of the storage.

I agree; they needed to develop distributed algorithms for ray intersection and photon gathering, which require rays to be sent to multiple hosts and the results combined. It would be much more useful to encourage users to develop scenes so that objects can be demand-loaded or built on the fly, so that the system can manage memory effectively on a single host.

> > - floats or doubles: I'm leaning towards using floats in the renderer, at least for storage and the majority of computation. Reasons: approx. 1/2 of memory use for meshes, better cache performance, and improved computation speed (but not twice as fast, float divide is 24 cycles on AMD64 while double is 28 cycles). We could change to doubles for some computation paths if it turns out there are precision issues.

> Well, I'd rather stick to double for most of the computation.
> Okay, meshes, NURBS parameters and all the color computation could be done using floats, but for the rest of the geometry and intersection computations (especially everything involving polynomial root solving), I would like to use double precision. At least that's my feeling after all the numerics I've been doing.
> We should, however, try to code in a way that allows changing that easily at compile time. Most of the code in lib/numerics is provided as float/double templates (notable exception: root solvers).
I would aim for using floats everywhere except where there is a known numerical problem. So we could use doubles for only the root solving components of the system. This makes sense because there will be few storage or caching issues for root finding, so doubles should perform only slightly worse than floats. For triangle meshes, patch surfaces, bounding boxes, and accelerator nodes, we could use floats to improve performance (probably approaching 2 times better performance). One more advantage of using floats is that if we ever perform a SIMD optimization of the code path, we will get a 2x improvement over what we would get if doubles were used. SSE instructions fit 4 floats but only 2 doubles.

> > - SIMD processing: After some more research, I'm less confident of using a SIMD approach to coherent rays, esp. for global illumination. I'm wondering if we could get a better speedup just by making specific optimizations for low-level vector operations.

> I've actually been a bit reluctant with SIMD right from the beginning and I am happy that what you propose here is pretty much what was planned right from the beginning:
> - Using a Vector2,3,4 C++ class which can implement vector operations using the processor's SIMD instructions (inline assembly or built-in gcc functions). (e.g. lib/numerics/3d/vector3.h, SIMD not yet added.)
> - Implementing Vector2,3,4 types as native types in the VM to allow the VM to internally do SIMD operations on them.
> Collecting values and evaluating several of them SIMD-like in the VM (e.g. for isosurfaces) would probably still increase performance, especially when not JIT-compiling the code. This could be done with a stream design using lots of kernels (e.g. one for isosurface evaluation). One then has to make sure that the kernel will process its input _if_needed_ (even if there is only _one_ value in the input stream) to prevent deadlock.
> This is similar to what you mentioned with ray priority.

One idea I've been considering is to allow the user to flag a computation (e.g. an isosurface kernel) as "parallel". In this case, the system will make the optimization you've described and collect samples to evaluate together. By allowing the user to specify which kernels should be optimized, we can avoid the large sorting problem of collecting samples for all kernels. This might also make sense given the conventional use of isosurfaces (for the most part in POV-Ray), where a single large isosurface is used as a terrain or a major feature of the scene.

> > - Triangles only: We could optimize our design by having most primitives tessellate to triangles. Then the accelerator could be constructed with geometry specified inline with the leaf nodes. We could do this by templating the accelerator to the type of data that will be processed, in our case either triangles or object pointers. Non-triangle primitives like isosurfaces would be placed into a separate accelerator.

> The accelerator data structure is one of the things about the RT core which I worry about a lot because of its impact on intersection finding speed. Actually, I would like to define a clear interface to the accelerator to make it changeable and be able to have BSP/kD/grid/octree structures.

This is usually done by making the accelerator have the same interface as the primitives themselves. This is what I've done in my raytracer, and is also what is done in pbrt. I would focus only on KD trees. You can use my patch to start; I'm not sure that I'll have time to make the necessary adaptations.
Additional optimizations in the book include the following:

- Represent all tree nodes with 8 bytes (64 bits):

    struct KDNode {
        union {
            unsigned int flags;        // Both
            float split;               // Interior
            unsigned int nprimitives;  // Leaf
        };
        union {
            unsigned int children;     // Interior
            T *primitive;              // Leaf
            T **primitives;            // Leaf
        };
    };

  The trick is to encode the flags in the two least significant bits of the floating point split value. This will have huge caching benefits when iterating through the accelerator.

- Use a cost model with different costs for iteration steps and object intersections. My accelerator (I think) assumes traversal steps cost nothing.

> (You probably know the paper of Havran et al., "Statistical Comparison of Ray-Shooting Efficiency Schemes"; kD seems to perform best in most situations.)

Yes; in case you also were not aware, the BSP patch I wrote for POV-Ray was in fact an implementation of a lot of the features in Havran's PhD thesis (Heuristic Ray Shooting Algorithms).

> If I understand you correctly, you propose that scene shapes can either be represented as meshes (via tessellation of the shape) or via a bounding box and their own intersection function (which computes exact intersections). All objects must be able to provide an intersection function; some may also provide a tessellation function.
> (We can do this, however I don't like the tessellation very much because it is one of the features of RT that it can easily be used on arbitrary geometry and not only on triangles and NURBS. And because a tessellated object takes much more space in memory. But OTOH, if it improves speed, why not allow that on a per-object basis?)

It actually improves speed immensely for most cases (I would not be surprised at 1 or 2 orders of magnitude difference between intersecting a finely tessellated NURBS vs. an analytic method).
I would make this the primary method of handling these types of patch surfaces and ignore analytic methods for anything but user-defined isosurfaces or infinite objects. Even for isosurfaces we might want to build in a conversion to be able to get fast previews.

I think our system should make use of ray differentials to choose the level of tessellation. Check online for the paper "Ray differentials for multiresolution geometry caching" by Christensen. This way, highly incoherent rays for GI can use a low-res tessellation, saving lots of effort.

One more thing (I don't think I was that clear) is that tessellated geometry can be put directly into the accelerator. So we could have the memory layout:

    ILGGGGLGGGILGGLLGGG

    I = interior KD tree node
    L = leaf node
    G = triangle

This would have really nice cache performance. My current implementation in POV-Ray stores the leaves, object lists, and geometry in entirely separate memory locations, so there are 2 layers of indirection just to intersect against the object list. This way almost all the processing for a leaf could be done without any indirections. The disadvantage is that for finely subdivided accelerators, we will need to duplicate a sizable number of triangles (but with good termination criteria for the construction, this can be reduced to a minimum).

> But there are some other problems:
> - Transferring the acceleration structure to the clients. This becomes "interesting" especially for on-demand downloading of the scene graph in case we want to do that (the new VM design will not do that for us due to the lack of an indirection layer).

I would recommend building accelerators locally and on-demand. This does not prohibit on-demand geometry transfers.

> - Objects may not be representable by just an object pointer.
> I like the idea of "procedural transformations" (if one can call it that) which means that an object transformation node in the CSG tree can store properties like "1000 objects in a line" without having to allocate 1000 object transformation matrices.
> This means that the object should be represented by the transformation node pointer and the transformation index (0..999 here).
> Otherwise (if we just have the node) we would have to test all 1000 contained "objects" for intersections.
> Obviously, one index is not enough as soon as these things get nested.

Yes, procedurals should be expanded and put into an accelerator locally. There is no need to transfer anything except for the initial program to the rendering servers.

Andrew

----------------------------------------
This mail sent through www.mywaterloo.ca |
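To make the 8-byte node trick from the book concrete, here is a small sketch (illustrative code, not the actual implementation) of packing the node-type flag into the two least significant bits of the float split value. Losing those 2 bits of split precision is harmless for plane placement.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch of the flag-in-the-split-float encoding described for the
// 8-byte KDNode above (hypothetical names).
struct PackedKDNode {
    uint32_t bits;           // split float with flags in the low 2 bits
    uint32_t child_or_prims; // child offset (interior) or primitive ref (leaf)

    void init_interior(float split, uint32_t axis) {  // axis: 0, 1 or 2
        std::memcpy(&bits, &split, sizeof split);     // type-pun safely
        bits = (bits & ~3u) | axis;                   // overwrite low 2 bits
    }
    uint32_t axis() const { return bits & 3u; }       // e.g. 3 marks a leaf
    float split() const {
        uint32_t b = bits & ~3u;                      // clear the flag bits
        float s;
        std::memcpy(&s, &b, sizeof s);
        return s;
    }
};
```

Since both unions fit into the first 4 bytes, a node stays at 8 bytes total, which is what gives the cache benefits mentioned above.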
From: Wolfgang W. <ww...@gm...> - 2005-03-13 22:13:24
|
On Sunday 13 March 2005 04:06, Andrew Clinton wrote:
> Quoting Wolfgang Wieser <ww...@gm...>:
>> [Coroutines]
> I think that this is actually a really good way to do it, and it is supported by some of the research that I've been doing recently. In particular, you should check out the Kilauea renderer:
> http://portal.acm.org/citation.cfm?id=569673.569675&dl=GUIDE&dl=ACM&type=series&idx=SERIES10714&part=Proceedings&WantType=Proceedings&title=ACM%20International%20Conference%20Proceeding%20Series
> (If you don't have ACM access I can mail you the paper)

Thanks; I'll check that tomorrow.

> The way they do it is to match the number of threads to the number of processors on the machine. Within each thread, they use task parallelism like you said, with application-controlled task switches.

This is also what I had in mind. (Maybe using somewhat more threads than CPUs to fill up disk latencies while reading data and the like.) [BTW, a coroutine switch is about the time of "for(int i=0; i<72; i++);" on my box.]

> One thing that Kilauea supports that we probably won't need is the ability to split a large scene database among a collection of machines, which adds a lot of complexity to their design.

This is a nice feature but I think we should not implement that unless (for some reason) we could do it easily (which I actually doubt). Instead, I would like to have on-demand loading of large images, meshes and NURBS, which are the things that usually use up most of the storage.

> I've been doing a lot of reading in my spare time, esp. on global illumination topics. I've been making my way through "Physically Based Rendering" by Pharr and Humphreys, which is turning out to be a really excellent book (so far).

I just had a look at the table of contents of that book (available at Amazon). Anyway, I need to get more of a clue about physically based rendering...
> I've been busy lately, unfortunately the next two weeks I need to catch up with a compiler project for school so I can't comment on everything you've mentioned.

Compiler project - that's nice :) You know that I'm just starting to write a compiler? (For SPL.)

> - floats or doubles: I'm leaning towards using floats in the renderer, at least for storage and the majority of computation. Reasons: approx. 1/2 of memory use for meshes, better cache performance, and improved computation speed (but not twice as fast, float divide is 24 cycles on AMD64 while double is 28 cycles). We could change to doubles for some computation paths if it turns out there are precision issues.

Well, I'd rather stick to double for most of the computation.

Okay, meshes, NURBS parameters and all the color computation could be done using floats, but for the rest of the geometry and intersection computations (especially everything involving polynomial root solving), I would like to use double precision. At least that's my feeling after all the numerics I've been doing.

We should, however, try to code in a way that allows changing that easily at compile time. Most of the code in lib/numerics is provided as float/double templates (notable exception: root solvers).

> - SIMD processing: After some more research, I'm less confident of using a SIMD approach to coherent rays, esp. for global illumination. I'm wondering if we could get a better speedup just by making specific optimizations for low-level vector operations.

I've actually been a bit reluctant with SIMD right from the beginning and I am happy that what you propose here is pretty much what was planned right from the beginning:
- Using a Vector2,3,4 C++ class which can implement vector operations using the processor's SIMD instructions (inline assembly or built-in gcc functions). (e.g. lib/numerics/3d/vector3.h, SIMD not yet added.)
- Implementing Vector2,3,4 types as native types in the VM to allow the VM to internally do SIMD operations on them.

Collecting values and evaluating several of them SIMD-like in the VM (e.g. for isosurfaces) would probably still increase performance, especially when not JIT-compiling the code. This could be done with a stream design using lots of kernels (e.g. one for isosurface evaluation). One then has to make sure that the kernel will process its input _if_needed_ (even if there is only _one_ value in the input stream) to prevent deadlock. This is similar to what you mentioned with ray priority. When not using a stream design, we can forget about higher-level SIMD anyway.

> - Triangles only: We could optimize our design by having most primitives tessellate to triangles. Then the accelerator could be constructed with geometry specified inline with the leaf nodes. We could do this by templating the accelerator to the type of data that will be processed, in our case either triangles or object pointers. Non-triangle primitives like isosurfaces would be placed into a separate accelerator.

The accelerator data structure is one of the things about the RT core which I worry about a lot because of its impact on intersection finding speed. Actually, I would like to define a clear interface to the accelerator to make it changeable and be able to have BSP/kD/grid/octree structures. (You probably know the paper of Havran et al., "Statistical Comparison of Ray-Shooting Efficiency Schemes"; kD seems to perform best in most situations.)

If I understand you correctly, you propose that scene shapes can either be represented as meshes (via tessellation of the shape) or via a bounding box and their own intersection function (which computes exact intersections). All objects must be able to provide an intersection function; some may also provide a tessellation function.
(We can do this, however I don't like the tessellation very much because it is one of the features of RT that it can easily be used on arbitrary geometry and not only on triangles and NURBS. And because a tessellated object takes much more space in memory. But OTOH, if it improves speed, why not allow that on a per-object basis?)

But there are some other problems:
- Transferring the acceleration structure to the clients. This becomes "interesting" especially for on-demand downloading of the scene graph in case we want to do that (the new VM design will not do that for us due to the lack of an indirection layer).
- Objects may not be representable by just an object pointer. I like the idea of "procedural transformations" (if one can call it that) which means that an object transformation node in the CSG tree can store properties like "1000 objects in a line" without having to allocate 1000 object transformation matrices. This means that the object should be represented by the transformation node pointer and the transformation index (0..999 here). Otherwise (if we just have the node) we would have to test all 1000 contained "objects" for intersections. Obviously, one index is not enough as soon as these things get nested.

I'll re-read the other parts of your email once I've had a look at the "Kilauea" paper. Thanks for the hint.

Cheers, Wolfgang |
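A minimal sketch of how such a procedural transformation could avoid the 1000 matrices (hypothetical names; a full version would generate complete 4x4 transforms and, for nesting, would need a path of indices rather than a single one):

```cpp
#include <cassert>

// Sketch of a "procedural transformation" node (illustrative only): the
// CSG node stores "count objects spaced step_x apart" and generates the
// transform for instance i on demand instead of storing count matrices.
struct LineArrayNode {
    double step_x;   // spacing between instances along x
    unsigned count;  // number of instances, indices 0..count-1

    // A full version would return a 4x4 matrix; here only the x
    // translation of instance i is generated.
    double translation_x(unsigned i) const { return step_x * i; }
};

// What an accelerator leaf would store to identify one instance:
// the transformation node pointer plus the transformation index.
struct InstanceRef {
    const LineArrayNode *node;
    unsigned index;
};
```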
From: <ajc...@en...> - 2005-03-13 03:06:16
|
Quoting Wolfgang Wieser <ww...@gm...>:
> Hello everybody!
>
> It's time to get some life onto the list again.
>
> 1. Coroutines
> -------------
>
> Using stream processing and kernels allows for support of the modern architectures coming up in the future (multi-core SMP-like systems which will probably be in consumer boxes in a few years already (see the recent "Spektrum der Wissenschaft") as well as more radical ones like the Cell architecture).
> However, one of the problems is how to adequately implement streams and kernels on a (currently) standard UP (uniprocessor) box.
>
> One solution for that could be the use of threads combined with the use of coroutines. For details, see Knuth, "The Art of Computer Programming", or the documentation for coroutine.{cc,h} in devel-ray/src/lib/threads/.
> The coroutine implementation in there is actually a thread-safe port of libPCL (portable coroutine library) by Davide Libenzi.
>
> Basically, coroutines allow for fast cooperative user-space switching of contexts without system interaction.
> (On my linux-2.6 UP system, a coroutine context switch is 7 times as fast as a thread context switch.)
>
> Coroutine switching is done cooperatively, which additionally eliminates the locking required for concurrent threads.

I think that this is actually a really good way to do it, and it is supported by some of the research that I've been doing recently. In particular, you should check out the Kilauea renderer:

http://portal.acm.org/citation.cfm?id=569673.569675&dl=GUIDE&dl=ACM&type=series&idx=SERIES10714&part=Proceedings&WantType=Proceedings&title=ACM%20International%20Conference%20Proceeding%20Series

(If you don't have ACM access I can mail you the paper)

The way they do it is to match the number of threads to the number of processors on the machine. Within each thread, they use task parallelism like you said, with application-controlled task switches.
One thing that Kilauea supports that we probably won't need is the ability to split a large scene database among a collection of machines, which adds a lot of complexity to their design.

If we take this approach to threading, we'll need to worry about the following access control issues:
- Access to mailbox numbers on primitives in the accelerator
- Access to a read/write irradiance cache (if we use one)
- Writeback to the frame buffer (we can do this in blocks - this is one reason we should probably use blocks of pixels rather than rays as the basic unit of parallelism).

There are also issues that need to be resolved to make the stream model work:
- How to handle dependent rays: will we allow shaders to trace rays that depend on the results of previous ray queries? A simple example of this is an adaptive sampler. Kilauea handles this by suspending shader execution (saving state) and then waiting until the query rays return. Deeper rays are given priority.
- How to store the state of tasks: For example, if we have separate coroutines for running the accelerator traversal and triangle intersection, then it is necessary to cache the state of the accelerator while batches of rays are tested against primitives. If we use a KD-tree, then we'll need to cache an entire stack of untraversed nodes so that we can continue where we left off. On the other hand, if we use grids, the amount of data in suspended state would be much less.

After a more careful look at Purcell's thesis (Ray Tracing on a Stream Processor), I've noticed that these issues are avoided by ignoring them (i.e. dependent ray spawns are disallowed, and each ray carries its weight to the final image throughout the computation). They also use a basic grid accelerator that will be insufficient for use in a complex scene. After doing more research, the benefit of the stream model seems dubious on current architectures.
Whatever advantage we gain from locality within a coroutine might be offset by the cost of maintaining suspended states. We'll need to weigh the potential benefits of the architecture against the complexity of implementation.

> 4. RT core
> ----------
>
> Any more ideas/suggestions on core design / stream processing / SIMD?
> (Andrew: I would be very happy to see your ideas before you have to leave us.)

I've been doing a lot of reading in my spare time, esp. on global illumination topics. I've been making my way through "Physically Based Rendering" by Pharr and Humphreys, which is turning out to be a really excellent book (so far). The book comes with a raytracer that they've used for teaching, but it is also fairly optimized in some areas (e.g. they have implemented a highly efficient KD tree, with some optimizations that I did not know about before).

I've been busy lately; unfortunately, the next two weeks I need to catch up with a compiler project for school, so I can't comment on everything you've mentioned. Here are some other design issues I've been thinking about:

- floats or doubles: I'm leaning towards using floats in the renderer, at least for storage and the majority of computation. Reasons: approx. 1/2 of memory use for meshes, better cache performance, and improved computation speed (but not twice as fast, float divide is 24 cycles on AMD64 while double is 28 cycles). We could change to doubles for some computation paths if it turns out there are precision issues.

- Triangles only: We could optimize our design by having most primitives tessellate to triangles. Then the accelerator could be constructed with geometry specified inline with the leaf nodes. We could do this by templating the accelerator to the type of data that will be processed, in our case either triangles or object pointers. Non-triangle primitives like isosurfaces would be placed into a separate accelerator.
- SIMD processing: After some more research, I'm less confident of using a SIMD approach to coherent rays, esp. for global illumination. I'm wondering if we could get a better speedup just by making specific optimizations for low-level vector operations.

I'll get back with some more comments as soon as I find some time...

Andrew

----------------------------------------
This mail sent through www.mywaterloo.ca |
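The float/double question and the low-level vector optimizations meet naturally in a templated vector class. A sketch (illustrative only; the real lib/numerics/3d/vector3.h interface may differ): storage-heavy code instantiates float while root solving stays double, and a float specialization of the operators is the natural hook for SSE intrinsics later, without touching callers.

```cpp
#include <cassert>

// Illustrative Vector3 template (not the actual lib/numerics interface).
// Instantiate with float for meshes/accelerator data, with double for
// numerically sensitive paths such as root solving.
template <typename T>
struct Vec3 {
    T x, y, z;

    Vec3 operator+(const Vec3 &b) const { return {x + b.x, y + b.y, z + b.z}; }
    Vec3 operator*(T s) const { return {x * s, y * s, z * s}; }
    T dot(const Vec3 &b) const { return x * b.x + y * b.y + z * b.z; }
};

typedef Vec3<float>  Vec3f;  // storage / accelerator type
typedef Vec3<double> Vec3d;  // root solving and other sensitive paths
```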
From: Wolfgang W. <ww...@gm...> - 2005-03-12 19:59:43
|
Hello everybody!

It's time to get some life onto the list again.

1. Coroutines
-------------

Using stream processing and kernels allows for support of the modern architectures coming up in the future (multi-core SMP-like systems which will probably be in consumer boxes in a few years already (see the recent "Spektrum der Wissenschaft") as well as more radical ones like the Cell architecture). However, one of the problems is how to adequately implement streams and kernels on a (currently) standard UP (uniprocessor) box.

One solution for that could be the use of threads combined with the use of coroutines. For details, see Knuth, "The Art of Computer Programming", or the documentation for coroutine.{cc,h} in devel-ray/src/lib/threads/. The coroutine implementation in there is actually a thread-safe port of libPCL (portable coroutine library) by Davide Libenzi.

Basically, coroutines allow for fast cooperative user-space switching of contexts without system interaction. (On my linux-2.6 UP system, a coroutine context switch is 7 times as fast as a thread context switch.)

Coroutine switching is done cooperatively, which additionally eliminates the locking required for concurrent threads.

2. VM
-----

After Andrew's remarks concerning the VM and a recent feeling on my side about the need for more performance, I thought about changing the (planned) internal VM layout somewhat. For some background, we were having the following problems:

- We need a garbage collector because the user of the VM cannot be relied on to properly deallocate objects. However, during the GC run, other threads need to be stopped, which is not easily possible using pthreads and similar APIs.
- The VM has a two-stage lookup/dereferencing for pointers. Advantage: this easily allows the use of safe pointers (the user cannot crash the VM) and allows implementing on-demand requests of objects over the network on the VM level.
However, using real pointers would of course be faster, and the impact is not limited to the VM because all exported objects need to use the same notion of pointers as the VM itself. So, a possible way to go seems to be the following:

- Use real pointers in the VM. The VM can use a checking mode where every pointer value is first looked up in a hash table to verify its validity. This is much slower than the previous two-stage lookup (which is O(1)), but the non-checking version of the VM (for real renderings) is faster.
- Network distribution or on-demand requests of objects then need to be handled at a higher level. (This will probably demand an indirection layer for objects capable of such on-demand loading.)
- The garbage collection is performed by the conservative GC (http://www.hpl.hp.com/personal/Hans_Boehm/gc/) which is also used in gcc and other projects. Hans Boehm et al. spent a lot of time writing a state-of-the-art GC which nowadays even supports multi-threading (by stopping other threads; they have platform-dependent code for that part). This also eliminates some development work for us.
- SPL (as of my current design) allows for explicit deletion of objects (despite the GC). If the user explicitly deletes all his allocated objects, the application will run faster because the GC has less work (I verified that with boehm_gc). We can allow the user to not use garbage collection (and do his own memory management) if he wishes to do so (at his own risk, as always).

The main problems with this approach are:

- The VM needs to know the object base location to do dynamic casts, which are also needed for virtual function calls. This can be solved by attaching an offset value to all base classes in all instances. Actually, no additional memory is required unless the object is a (base) class of non-zero size but has _no_ virtual functions. These cases are probably rare.
- Pointer size is no longer constant for the VM since we support 32bit and 64bit systems.
This means that the compiler cannot easily calculate offsets for the target machine. The easiest solution that comes to my mind is that the VM assembly uses 2 values for each address/offset specification, one for 32bit and one for 64bit systems. The VM can then select the correct version when loading the assembly file.

- The behaviour of explicit deletion changes. Previously, a pointer to an explicitly deleted object would be NULL after deleting the object because the shared indirection layer index contains NULL. Now, a pointer is not automagically NULL, and dereferencing a deleted object may crash the VM (or trigger an assertion in the checking VM).

3. SPL
------

Just by chance, I stumbled across a tool called "treecc" (www.southern-storm.com.au/treecc.html). It helps in compiler development mainly by protecting the programmer from forgetting cases. I'm currently seeing if I will use it for the SPL compiler implementation. This is the reason why compilation of the code currently fails just before the end in the spl/ directory.

Since treecc is quite small and not very common, I will put it into the 3rdparty/ directory once it is clear that it will be used. The same holds for the garbage collector. It is correct that there exist binary distributions of it, but we must make sure that we have the thread-safe version and may also want to play with parallel marking and other compile-time tuning. (Things already work in my development version but I would like to avoid adding lots of 3rdparty code to the CVS which will be removed again later.)

4. RT core
----------

Any more ideas/suggestions on core design / stream processing / SIMD? (Andrew: I would be very happy to see your ideas before you have to leave us.)

Regards, Wolfgang |
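The proposed dual offset encoding is simple to sketch (names invented for illustration): the assembly carries a 32-bit and a 64-bit variant of every layout offset, and the VM picks one at load time based on the host pointer size.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the "2 values per address/offset specification" idea above
// (illustrative names): every offset in the VM assembly is stored for
// both pointer widths; the loader selects the one matching the host.
struct DualOffset {
    uint32_t off32;  // field offset assuming 4-byte pointers
    uint32_t off64;  // field offset assuming 8-byte pointers
};

inline uint32_t host_offset(const DualOffset &o) {
    return sizeof(void *) == 8 ? o.off64 : o.off32;
}
```

In a real loader this selection would happen once per assembly file, after which only host-native offsets remain in memory.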
From: Wolfgang W. <ww...@gm...> - 2005-02-26 10:42:26
|
Hello everybody...

GLib has now been removed from the 3rdparty directory. Developers need to have glib installed (at least version 2.4.6).

Furthermore, there is an Athlon64 box running Linux at university and I successfully compiled the code on it. "Porting" was easy; the only required change was fixing an incorrect use of va_list.

Wolfgang |
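The va_list issue mentioned is a classic 32/64-bit portability trap: on AMD64, va_list is an array type, so code that assigns one va_list to another (which happens to work on 32-bit x86) breaks, and the portable fix is va_copy. A sketch of the pattern with a hypothetical function (not the actual code that was fixed):

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical example of the portable va_list pattern: a function that
// walks its variadic arguments twice must duplicate the list with
// va_copy instead of assigning it ('ap2 = ap' does not compile on AMD64,
// where va_list is an array type).
int sum_twice(int count, ...) {
    va_list ap, ap2;
    va_start(ap, count);
    va_copy(ap2, ap);            // portable duplicate of the list

    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += va_arg(ap, int);  // first pass
    for (int i = 0; i < count; ++i)
        sum += va_arg(ap2, int); // second pass over the duplicate

    va_end(ap2);
    va_end(ap);
    return sum;
}
```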