COcclusionCuller code

writhe
2007-11-09
2013-04-25
  • writhe

    writhe - 2007-11-09

    I've been profiling the Pixie source code and have found that the COcclusionCuller code takes up a large percentage of the rendering time and a large amount of memory. On a trivial scene (e.g. if I want to just preview a shader on a sphere) this can be more than 50% of the total rendering time and several MB of memory, depending on the size of the image.

    Initially, I was going to change the COcclusionCuller code to use a much faster and smaller array implementation (the current version wastes a lot of memory for node pointers in the tree).

    From the looks of things, the COcclusionCuller code was put in to add some sort of ‘hierarchical’ depth buffer. However, I've looked around the rest of Pixie's source code and can't find any parts that actually make use of this functionality.

    All I've found is that the Reyes renderers (CStochastic and CZbuffer) make use of the maximum depth in the COcclusionCuller (when using the ‘midpoint’ depth filter), but this can be computed much more easily than keeping a tree of the depths (and all the associated overhead). Also, the actual depth values are duplicated in individual pixels / fragments anyway.

    Because of this I was thinking of just ripping the whole COcclusionCuller code out and making some minor changes to put back the functionality that actually is used. This would likely yield non-trivial performance and memory savings (as well as much less code).

    Can any of the Pixie developers give me some pointers as to how COcclusionCuller is used? Thanks.

     
    • Okan Arikan

      Okan Arikan - 2007-11-09

          Hi Writhe,

          Initially the occlusion culler was implemented to figure out whether a screen-aligned box was completely occluded or not (in the probeQuad and probeTri functions). The occlusion culler also keeps track of the minimum and maximum depth values in every node for quickly determining the depth extent of a bucket.

          This initial implementation was later stripped down to only manage the maximum depth values, because the grid rasterization code was already fast enough and this early culling did not contribute any savings.

          Right now, we update the depth of a sample using the touchNode function in stochastic/zbuffer. This in turn updates the hierarchy of nodes. During the rendering, we use the maximum depth of the root node (which corresponds to the maximum depth of the bucket) to cull objects that are completely hidden.
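
          Roughly, the flow looks like this (a simplified sketch with illustrative names, not the actual code):

              #include <algorithm>

              // Illustrative node: zmax is the maximum depth under this node,
              // so the root's zmax is the maximum depth of the whole bucket.
              struct TNode {
                  float zmax;
                  TNode *parent;       // NULL at the root
                  TNode *children[4];  // NULL at the leaves
              };

              // Record a new (nearer) sample depth at a leaf and propagate
              // the maxima upward, stopping as soon as a parent is unchanged
              // (sample depths only ever decrease, so ancestors can't change
              // if the parent didn't).
              void touchNode(TNode *leaf, float z) {
                  leaf->zmax = z;
                  for (TNode *n = leaf->parent; n != NULL; n = n->parent) {
                      float m = n->children[0]->zmax;
                      for (int i = 1; i < 4; i++)
                          m = std::max(m, n->children[i]->zmax);
                      if (m == n->zmax) break;  // ancestors unchanged too
                      n->zmax = m;
                  }
              }

              // An object whose nearest depth is behind the root's zmax can
              // never be visible in this bucket, so it can be culled outright.
              bool bucketCulled(const TNode *root, float objectZmin) {
                  return objectZmin > root->zmax;
              }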

          I'm curious about the performance penalty that you're seeing. According to my tests, you only see the effect of COcclusionCuller in a completely empty scene. I think before writing a lot of code, we should positively identify the problem so that your efforts are not wasted.

          Right now, one of the biggest problems in occlusion culling is the grid cracks that yield missing samples, which in turn leave gaps in the occlusion.

          Okan

       
    • George Harker

      George Harker - 2007-11-09

      Hi Writhe,

      > I've been profiling the Pixie source code and have found that the COcclusionCuller code takes up a large percentage of the rendering time
      > and a large amount of memory. On a trivial scene (e.g. if I want to just preview a shader on a sphere) this can be more than 50% of the total
      > rendering time and several MB of memory, depending on the size of the image.

      That's really rather odd.  With reasonable bucket sizes, and any sort of scene other than the most trivial, I would seriously expect the memory consumption to be dwarfed by other uses.
       
      > Initially, I was going to change the COcclusionCuller code to use a much faster and smaller array implementation (the current version wastes
      > a lot of memory for node pointers in the tree).

      It's possibly true that its memory usage could be tightened up - though I expect that would come at a performance penalty.  I don't see the same sort of wastage you describe.

      > From the looks of things, the COcclusionCuller code was put in to add some sort of ‘hierarchical’ depth buffer. However, I've looked around
      > the rest of Pixie's source code and can't find any parts that actually make use of this functionality.

      In fact, I'm improving this area right now.  The hierarchical nature is very important, and it's currently used to do gross occlusion culling.  This will be more heavily used in the next release to optimize render speeds and memory usage.  Honestly, the saving that you get from the data the structure provides more than makes up for the memory it would usually use.

      > All I've find are that the Reyes renderers (CStochastic and CZbuffer) make use of the maximum depth in the COcclusionCuller (when using
      > the ‘midpoint’ depth filter), but this can be computed much more easily than having to keep a tree of the depths (and all the associated
      > overhead). Also, the actual depth values are duplicated in individual pixels / fragments anyway.

      That's not entirely true.  The implementation is _always_ used; it's just that the culling depth in midpoint may be one sample behind the occluded depth.

      > Because of this I was thinking of just ripping the whole COcclusionCuller code out and making some minor changes to put in the
      > functionality that actually is used. This would likely make some non-trivial performance and memory savings (as well as much less code).

      I would very much recommend against doing this.  I believe it's possible to tighten it up, but in my experience with non-trivial scenes the speed penalty is vanishingly small, and the saving (especially in the next release) is very much worth it.  The idea is to be able to resolve visibility easily and quickly for sub-bucket areas, which means that you don't have to waste time sampling / dicing / displacing objects which are occluded.

      I agree with Okan: in trivial scenes, probing an area takes time (if not done right) and doesn't save much.  However, in geometry-heavy renders, we can now save a lot of memory (what would have been 800 MB can be done in ~140 MB in some test scenes).  This can be kept fast by not querying at the end depth and using the hierarchical nature of the structure.
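
      A rough sketch of such a sub-bucket probe, reusing the illustrative TNode from Okan's post above (assumed structure and names, not the actual probe code):

          struct TRect { int x0, y0, x1, y1; };  // half-open sample bounds

          static bool overlaps(const TRect &a, const TRect &b) {
              return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
          }

          // True if every sample of 'query' under this node is already
          // nearer than zmin; descend only where zmax alone can't decide.
          bool probeRect(const TNode *n, const TRect &area,
                         const TRect &query, float zmin) {
              if (!overlaps(area, query)) return true;   // nothing of ours here
              if (zmin > n->zmax)         return true;   // whole area is nearer
              if (n->children[0] == NULL) return false;  // leaf: not occluded
              const int mx = (area.x0 + area.x1 + 1) / 2;
              const int my = (area.y0 + area.y1 + 1) / 2;
              const TRect sub[4] = { { area.x0, area.y0, mx,      my      },
                                     { mx,      area.y0, area.x1, my      },
                                     { area.x0, my,      mx,      area.y1 },
                                     { mx,      my,      area.x1, area.y1 } };
              for (int i = 0; i < 4; i++)
                  if (!probeRect(n->children[i], sub[i], query, zmin))
                      return false;
              return true;
          }

      When the query is answered high up the tree, the probe only touches a handful of nodes, which is what keeps it cheap.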

      Cheers

      George

       
    • George Harker

      George Harker - 2007-11-09

      Hi Writhe,

      Do you have a test scene we can look at, given that we're working on this stuff anyway?

      Cheers

      George

       
    • writhe

      writhe - 2007-11-14

      I was going to write a quite long-winded reply but I thought it would just be easier to show you the code:
          http://writhe.org.uk/pixie/occlusion.h
          http://writhe.org.uk/pixie/occlusion.cpp

      I also noticed that the Subversion repository has been updated with the probeRect functionality. I'll get round to adding this at some point.

      Note that you can't just slot this code in as you have to make some changes in the CStochastic / CZbuffer constructors and also in rasterBegin().

      Essentially all I'm doing is storing all the depth values in a contiguous array and then being clever about the indexing for each level of the hierarchy.

      There are some other optimisations:
      - not rounding the array size up to a power of 2 (less wasted space),
      - separate width and height instead of using the maximum (less wasted space in the case of different X/Y pixel samples),
      - earlier termination of touchNode() in some cases.
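
      In outline, the layout and the optimisations above look something like this (a rough sketch of the idea, not a paste of the actual files at the URLs):

          #include <vector>
          #include <algorithm>

          struct OcclusionGrid {
              struct Level { int w, h, offset; };
              std::vector<Level> levels;  // level 0 = leaves, last = 1x1 root
              std::vector<float> zmax;    // every level, one contiguous array

              void init(int w, int h, float clipMax) {
                  int offset = 0;
                  for (;;) {
                      Level l = { w, h, offset };
                      levels.push_back(l);
                      offset += w * h;
                      if (w == 1 && h == 1) break;
                      w = (w + 1) >> 1;  // round up - no power-of-2 padding,
                      h = (h + 1) >> 1;  // width and height kept separate
                  }
                  zmax.assign(offset, clipMax);
              }

              float &at(int level, int x, int y) {
                  const Level &l = levels[level];
                  return zmax[l.offset + y * l.w + x];
              }

              // Store a new sample depth and propagate the maxima up the
              // levels, stopping early once a parent's zmax is unaffected.
              void touchNode(int x, int y, float z) {
                  at(0, x, y) = z;
                  for (int l = 1; l < (int) levels.size(); l++) {
                      x >>= 1; y >>= 1;
                      const Level &c = levels[l - 1];
                      const int x0 = x << 1, y0 = y << 1;
                      float m = at(l - 1, x0, y0);
                      if (x0 + 1 < c.w) m = std::max(m, at(l - 1, x0 + 1, y0));
                      if (y0 + 1 < c.h) m = std::max(m, at(l - 1, x0, y0 + 1));
                      if (x0 + 1 < c.w && y0 + 1 < c.h)
                          m = std::max(m, at(l - 1, x0 + 1, y0 + 1));
                      if (m == at(l, x, y)) break;  // ancestors unchanged too
                      at(l, x, y) = m;
                  }
              }
          };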

      In the scenes I'm rendering this provides a significant improvement: just over half the previous rendering time and a massive reduction in memory. For example, if I'm using 16 pixel samples then peak memory usage is down by about 100 MB (using 2 threads and the default bucket size).

      I think the memory usage is the most important optimisation here as there is a separate COcclusionCuller for each thread. With multicore systems becoming the norm, this is going to be very important.

      The touchNode() method now has two versions. Because CStochastic::CPixel currently stores a pointer to a COcclusionNode, the X and Y indices must be calculated using an expensive integer division.

      If you can calculate the X and Y indices (as in CZbuffer) then you can use the faster touchNode(int x, int y, float z) method.
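
      Continuing the sketch above, the pointer-based version might look something like this (again an assumption about the shape of the code, not the code itself):

          // The slower entry point: recover x and y from a pointer into the
          // leaf level with an integer division and a modulo, then do the
          // same walk as the indexed version.
          void touchNode(OcclusionGrid &g, float *leaf, float z) {
              const int i = (int) (leaf - &g.zmax[0]);  // index into level 0
              g.touchNode(i % g.levels[0].w, i / g.levels[0].w, z);
          }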

       
      • George Harker

        George Harker - 2007-11-14

        Hi Writhe,

        I can absolutely see what you're getting at...  I know the pointer usage is inefficient (especially now that we have the depths array for the probeArea functionality).  The performance improvement from probeArea is really quite big, so I wanted to get that out of the way first.

        I don't doubt that this is more efficient memory-wise.  By the way, what's your pixel samples / filterSize?  Which architecture are you on, and is it 64-bit?

        Though the current code is not clear, I believe the termination criteria are similar because of the way it's constructed.

        What I'm really curious about is the performance difference.  I'll take a look at the code and examine the performance I get on my machines too.

        Very good point about using +1 for the odd buckets - I have something similar in there which can be simplified with that (I was using (w+(w&1))>>1 but that can be simpler).
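
        For the record, the two forms agree for every non-negative w - both compute ceil(w/2) (the helper name here is hypothetical):

            // even w: w & 1 == 0, so (w + 0) >> 1 == w / 2, and (w + 1) >> 1
            //         also gives w / 2 since the shift drops the low bit
            // odd  w: w & 1 == 1, so both are exactly (w + 1) >> 1
            int parentLevelSize(int w) { return (w + 1) >> 1; }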

        Cheers

        George

         
        • writhe

          writhe - 2007-11-14

          I'm on a 32-bit x86 (MacBook Pro with 2 GHz Intel Core Duo). Memory savings will be much bigger on a 64-bit machine (COcclusionNode is 28 / 48 bytes on 32 / 64 bit machines).

          I typically render with 6–16 pixel samples. The "100 MB" peak memory saving comes from rendering with 16 pixel samples (with 2 threads, so 2 separate COcclusionCullers). I'm using the default pixel filter size (2).

          To be honest, the performance difference is only really noticeable on simple scenes, where the overhead of COcclusionCuller is relatively large. The memory savings should be universal (which indirectly speeds up the code through better cache usage and less swapping, TLB misses, etc.).

          A more realistic scene will probably only see a few percent difference in rendering times.

          I've also changed the rasterBegin() implementation in CStochastic / CZbuffer by removing the call to initToZero() and the ‘node clearing’ parts in the loop that look like:

          cNode = getNode(j,i);
          cNode->zmax = CRenderer::clipMax;

          There is now just a single call to resetHierarchy(CRenderer::clipMax) at the end. This is one of the main performance hotspots I was seeing.
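
          In terms of the array sketch from my earlier post, resetHierarchy can be a single linear pass (a hedged illustration, not the exact code):

              #include <algorithm>

              // Refill every level with the far clipping depth in one pass,
              // instead of clearing nodes one at a time inside the loop.
              void resetHierarchy(OcclusionGrid &g, float clipMax) {
                  std::fill(g.zmax.begin(), g.zmax.end(), clipMax);
              }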

          Having looked a bit more at the CPixel / CStochastic code, I think you can remove the COcclusionNode pointer in CPixel and change all touchNode() calls to use the current x and y indices (this is a bit confusing with all the macros but I think you can drop this right in). This avoids recalculating x and y from the pointer with an integer division as mentioned earlier.

           
          • George Harker

            George Harker - 2007-11-14

            >  I'm on a 32-bit x86 (MacBook Pro with 2 GHz Intel Core Duo). Memory savings will be much bigger on a 64-bit machine (COcclusionNode is 28 / 48 bytes on 32 / 64 bit machines).

            Good stuff, normally I am too.

            > I typically render with 6–16 pixel samples. The "100 MB" peak memory saving comes from rendering with 16 pixel samples (with 2 threads, so 2 separate
            > COcclusionCullers). I'm using the default pixel filter size (2).

            Absolutely.  This equates to the 50 MB saving I saw in 32-bit.

            > To be honest, the performance difference is only really noticeable on simple scenes, where the overhead of COcclusionCuller is relatively large. The
            > memory savings should be universal (which indirectly speeds up the code through better cache usage and less swapping, TLB misses, etc.).

            The cache coherency of the algorithm is interesting.  I'm actually not sure whether the current striping strategy combined with your no-pointers approach might be best.  It might make the math a total pain though.  But it does have nice properties in that each node has reasonable locality to its parents - and this is the tight loop, when we recurse up to update the hierarchy.  I'd have to think about this a bit more and test it to work out what's canonically best.

            > A more realistic scene will probably only see a few percent difference in rendering times.

            Yeah, that probably equates roughly to what I'm seeing.  

            > I've also changed the rasterBegin() implementation in CStochastic / CZbuffer by removing the call to initToZero() and the ‘node clearing’ parts in the
            > loop that look like:
            >
            > cNode = getNode(j,i);
            > cNode->zmax = CRenderer::clipMax;

            > There is now just a single call to resetHierarchy(CRenderer::clipMax) at the end. This is one of the main performance hotspots I was seeing.

            I saw that; that's how I integrated your code also.  So I guess you're using Shark then?  Nice, isn't it?

            > Having looked a bit more at the CPixel / CStochastic code, I think you can remove the COcclusionNode pointer in CPixel and change all touchNode() calls to
            > use the current x and y indices (this is a bit confusing with all the macros but I think you can drop this right in). This avoids recalculating x and y
            > from the pointer with an integer division as mentioned earlier.

            That's what I did when I tested it.  You can just do touchNode(x,y,z) - the new defines in stochastic.cpp need modifying if you go to the new code.

            Cheers

            George

             
    • George Harker

      George Harker - 2007-11-14

      Hi Writhe,

      Looks good - there aren't too many modifications to make to slot something like that in.

      On a relatively simple scene (but totally covered with geometry) the timings were very similar. 

      I used 32x32 pixels per bucket and 16x16 pixel samples - I wouldn't normally do this unless I had heavy DOF / MB.  I'm using 2 threads on a Xeon.

      Memory is definitely reduced, ~25 MB per thread, which is worthwhile.  But the timings were very similar (2-5s in favor of your code) in this case.  It seems to perform slightly slower with 4x4 pixel samples and 16x16 pixels per bucket.

      Are there particular types / styles of scenes which have this performance characteristic?  If there's lots of time in touchNode, then I'm going to take a guess that either you have lots of MB / DOF, or there is a lot of small geometry.

      Let us know so we can improve Pixie.  Thanks again for your input.

      Cheers

      George

       
      • writhe

        writhe - 2007-11-14

        Yeah, the scenes I'm rendering are relatively simple - mainly small amounts of geometry which I use to tweak various shaders. There are also some high-quality render scenes with DOF.

        I was mainly interested in getting the render / edit shader / re-render cycle to be faster.

         
