Rendering with low capacity

Forum: Developers
Creator: Pete
Created: 2016-07-22
Updated: 2016-07-26
  • Pete

    Pete - 2016-07-22

    Hi.

    I'm wondering why my PC does not render at full capacity?

    This machine has a 6-core processor that can handle 12 "threads" -- it appears as a 12-core, or 12-virtual-core, thing. At render time, however, AoI does not seem to use even half of that capacity. Any idea why that is? While building caustics maps the CPU usage does go to 100%.

    In the picture you'll see the CPU usage when rendering a movie. The CPU% drops momentarily between frames. The output is a .png image sequence.

     
    Last edit: Pete 2016-07-22
  • Luke S

    Luke S - 2016-07-22

    Try pulling up the Resource Monitor to track how many cores are actually given any work. (You'll get a graph for each logical processor.)

    If you find that not all cores are being used, go to the Details tab of the Task Manager and check the processor affinity for java. Your graph looks sort of like only one virtual core per physical core is being used.

    Depending on the size of your scene, you may also be running into the limits of how much data can be moved from main RAM to cache at any given time.

     
  • Pete

    Pete - 2016-07-22

    All the virtual processors have the affinity and all of them are working. What seems to be happening is that each physical processor is working at about 40% and has split the work so that the first virtual one is going at 60% and the other at 20%. This happens systematically with all of the processors.

    And yet: this happens when rendering the picture. I have not tested all possible things, but when generating caustics maps there is no problem using all the available capacity... Somehow the limit seems specific to AoI's render stage.

    This scene contains 3 actual objects and something for the environment (2 cylinders, 1 spline mesh, 3 directional lights, 1 camera). I wouldn't expect any memory-related issues. :)

     
  • Luke S

    Luke S - 2016-07-23

    No, I would not expect memory issues either.

    I've got an i7 processor myself (mobile version, much smaller!) and my tests have shown AoI saturating all virtual cores at render time. Now for the real questions:

    • What AOI Version?
    • What JVM?
    • What render settings?
    • What render size?

    Something to try: change your processor affinity so that only the even-numbered virtual cores are selected. This will simulate running without hyperthreading, at least closely enough for our purposes today. Some CPU instructions don't take well to being hyperthreaded, while others are very good about sharing the processor. Perhaps this particular set of render options uses some of those "heavy" operations once the JVM gets through compiling to native code.

    Also try running VisualVM and grabbing a thread dump. It would be interesting to see if any AoI threads are waiting for anything. Actually, grab three or four at random times during the rendering of a frame, just to see how they compare.
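    A quick way to capture those dumps from the command line, if VisualVM feels heavy, is the JDK's jstack tool (jstack <pid> > dump.txt, with the pid taken from jps). And as a rough in-process stand-in -- the class below is only an illustrative sketch, not AoI code -- you can list every live thread with its state from inside the JVM:

```java
import java.util.Map;

// Illustrative sketch (not AoI code): a rough in-process stand-in for a
// jstack dump. Lists every live thread in the current JVM with its state,
// so threads stuck in BLOCKED or WAITING stand out.
class ThreadStates {
    static Map<Thread, StackTraceElement[]> snapshot() {
        return Thread.getAllStackTraces();
    }

    public static void main(String[] args) {
        for (Map.Entry<Thread, StackTraceElement[]> e : snapshot().entrySet()) {
            Thread t = e.getKey();
            System.out.println(t.getName() + " [" + t.getState() + "]");
        }
    }
}
```

    jstack output additionally shows which monitor each BLOCKED thread is waiting on, which is exactly what we want here, so prefer the real tool when you can.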

    EDIT: I just re-read your original post. The one thing that is definitely normal about your graph is the drop between frames. Writing out the PNG and re-building the scene graph for the next frame are single-threaded operations, and all other cores go idle while this is happening.

     
    Last edit: Luke S 2016-07-23
  • Pete

    Pete - 2016-07-24

    AoI = 3.0.2
    Java = jre1.8.0_65 or jdk1.8.0_45 -- I'm not exactly sure which one is used when just launching AoI. It should be the JDK?
    Render settings:
    Antialiasing on, for example 4/4 or 8/64
    Soft shadows on
    Gloss/Translucency on
    No GI or caustics or anything other special
    Image size: 960 x 540

    I have not noticed render settings or image size making any difference to CPU usage.

    When I shut down half of the virtual processors, the total usage remained the same, so the remaining processors took over the load from the disabled ones. Curiously, it did not matter which half of them I turned off. Except that when processors 0 - 5 were off and the rest on (or the other way around), the usage increased from some 36% to about 42%. -- Switching more back on started to drop the total usage again.

    Shutting down more started to affect the total CPU, but only when I had just 3 left (a pair and a single one) did the usage per processor get close to full. Typically some 95+% for the pair and sometimes 100% for the lone one. On the latest trial, with just 0 and 1 on, they both went happily to 100%.

    One more detail: no matter what happens with the affinity settings, the clock speed stays a bit over 100%. When the rendering stops, it drops to "idle", some 30%+.

    And yes: I thought the momentary drop of CPU% was expected, for saving each image. When rendering a QT movie, the drop does not happen.

     
    Last edit: Pete 2016-07-24
  • Pete

    Pete - 2016-07-24

    Ok.. and got some thread dumps .... (Eventually found jstack.exe... :D)

    Three from rendering images for the movie. One at building a caustics map for a different scene.

     
  • Pete

    Pete - 2016-07-24

    At least in the case of the caustics maps all "Worker Threads" are marked RUNNABLE, whereas in the ones captured during rendering they are mostly BLOCKED. I made more of these while rendering just a single image, and in those all the worker threads were BLOCKED (I missed a few seconds at the beginning, though...).

    When the rendering is done, then the Worker Threads disappear (which seems to make sense to me).

    EDIT: I ran the same thing again on a 4-core laptop and then again on this 12-virtual-core machine. On the 4-core there naturally are 4 Worker Threads, and the processor goes to 100% without a problem. The state of the worker threads keeps changing between WAITING, RUNNABLE and BLOCKED on both of them. Most usually jstack sees them BLOCKED on both platforms, but somehow I've got a feeling that BLOCKED kind of dominates the 12-core case? I don't know if that makes any sense?

    A strange question, but could it be that the RayTracer core can not feed the threads fast enough?

     
    Last edit: Pete 2016-07-24
  • Pete

    Pete - 2016-07-24

    ... and more results: I revisited the case of building a caustics map. The thread dumps look exactly like those taken during rendering. The state of the Worker Threads keeps changing between RUNNABLE, WAITING and BLOCKED, BLOCKED being the favorite...

    Still, the total processing power goes to 100% for the caustics map but only to some 37% for the render... :(

    Obviously all the virtual processors are recognized, but something goes wrong when the software shifts from one stage to another.

     
    Last edit: Pete 2016-07-24
  • Luke S

    Luke S - 2016-07-24

    Just for clarity: When you say that you get only 37% "At Render" does that mean "only when you run the scene without Caustics," or does it mean "Building a Caustics map runs at 100%, but the raytrace phase for the same render only uses 37%?"

    When you run the Caustics map, are you using a different scene?

    If so:

    • Please check how much memory you have allocated to AoI, and how much is used, during one of the stalled renders. jvisualvm.exe should be just a few items below jstack. You want the "size" and "used" numbers from the Monitor tab.
    • When playing with processor affinity, it's imperative that you use only even or only odd processors: {0, 2, 4, 6, 8, 10} or {1, 3, 5, 7, 9, 11}. This is because processors {0, 1} share the same core, as do {2, 3}, etc. The point here is to make sure that AoI is not trying to run two threads on the same physical core.
    • If this is something that only comes up with one scene for you, would you be willing to post the file?
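    On Windows the even-only set can also be applied at launch instead of by clicking in Task Manager. A sketch, assuming cmd.exe and the stock jar name (the mask is hexadecimal):

```bat
:: hex 555 = binary 0101 0101 0101 -> logical processors 0,2,4,6,8,10
start /affinity 555 javaw -jar ArtOfIllusion.jar
```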

    Some render/scene settings that might be handy to know:

    • What surface accuracy is the scene set to? (0.02 is default)
    • What are Max Ray Tree Depth and Min Ray intensity?

    I noticed that all the worker threads are blocked at the same location:

    artofillusion.raytracer.Ray.getNextID()

    It seems possible that threads are piling up on this method's lock, as for safety it is synchronized, but we need to check a few other things first.

    If you still see this happening with an entire odd group of processors, please grab a thread dump there as well. (I want to see if they're blocking at the same place.)

     
    Last edit: Luke S 2016-07-24
  • Pete

    Pete - 2016-07-25

    Just for clarity: .... "Building a Caustics map runs at 100%, but the raytrace phase for the same render only uses 37%?"

    That is what I meant. This time the percentages were a bit higher, but the phenomenon is still there. See "Caustic_to_render.jpg". There are two cases, one with nothing reflective on the scene and one with a mirror texture on one object.

    When you run the Caustics map, are you using a different scene?

    Before, I had a different scene to verify what happens with the caustics map, but now I made a combination. The new file is attached (Rendertest.aoi). See "Scene_dim_or_shiny.jpg" for the different cases I mentioned above.

    Please check how much memory you have allocated

    I use one of these. I haven't noticed any difference in the behavior between them.

    javaw -Xmx12288m -jar ArtOfIllusion.jar
    javaw -Xmx1024m -jar ArtOfIllusion.jar
    

    See screen captures from jvisualvm.exe on "Memory_VVM.jpg"

    This scene is so small that the memory does not get challenged very badly. I don't know what would happen with something like a tree with 20 000 leaves...? (That kind of thing used to run my old XP machine out of memory.)

    Caustics maps sometimes ran into memory limits with that diamond file I had some time back, but then I started using -Xmx12288m to start AoI.

    When playing with processor affinity, it's imperative that you use only even or only odd processors: {0, 2, 4, 6, 8, 10} or {1, 3, 5, 7, 9, 11}. This is because processors {0, 1} share the same core, as do {2, 3}, etc. The point here is to make sure that AoI is not trying to run two threads on the same physical core.

    Yes, I got the point. :) See "Processor_sets.jpg".

    The results may be a bit strange: with, for example, processors {0,2,4,6,8,10} there really is no noticeable change in performance during the actual rendering, but with, for example, {0,1,2,3,4,5} the situation improves to some extent. Actually, this time (in the latter case) all the available processors ran at 100% capacity and the total CPU utilization went to about 55% (about 1% of that being other processes).

    If this is something that only comes up with one scene for you

    It happens with all scenes though the CPU% may vary. "Rendertest.aoi" attached.

    Surface Accuracy, Ray Tree ...?

    Those are at defaults.

    If you still see this happening with an entire odd group of processers, please grab a thread dump there as well ( I want to see if they're blocking at the same place)

    A series of thread dumps is in "Thread_Dumps_half_aff.zip", done with processors {0,2,4,6,8,10}. The first and last are from before and after the processing, and 2 and 3 were taken during building the caustics map. -- And it looks like thread dumps from jvisualvm have different content from those captured by jstack? These are from jstack.

    I hope this helps to track down the problem, and if there's anything else you'd wish me to check, just let me know... :) And I have the source on my PC, so if you have a fix, or some options for a fix, I can do some quick testing here and report back.

     
    Last edit: Pete 2016-07-25
  • Pete

    Pete - 2016-07-25

    One more.

    I don't know if this helps, but after the caustics map is done, the rendering starts at about 40% with about a 60/20 split between processor pairs. The %'s start to go up when antialiasing reaches the areas where the caustic photons appear in the picture. Then at the bottom end it returns to 40%.

    Interestingly, the rise happens mostly on those halves of the processor pairs that until then ran at lower capacity.

     
    Last edit: Pete 2016-07-25
  • Luke S

    Luke S - 2016-07-25

    Thanks for the detailed info. I think I've got it. Definitely not a memory issue at all. :) Please drop the attached Ray.java into your raytracer source tree, compile, and try out the results.

    What's happening is that the raytracer creates a new, incremented ID# for each new ray. (It keeps them cached, so that you never need to recalculate a ray that has already been calculated.)

    The version in production at this point uses a synchronized method to keep two threads from reading the same ID. This was the only option available at the time the code was created, but it has an issue: if two threads try to access the method at the same time, one will be blocked and forced to wait. That is not too bad on its own, but the wait does have a measurable, finite cost. When you add more and more threads to the mix, the odds of several stepping on each other's toes go up substantially, and they spend more time waiting for each other than they do on useful work.

    You seem to have a pathological case: all your threads reach for the lock at the same time, and never get far enough out of step to avoid each other. Running some checks today, I saw a similar effect with 8 threads, but their timings would drift just far enough out of step for it not to be a major issue.

    I think the AA passes in the caustics region managed to break out of this because each ray takes a slightly different number of cycles to calculate, depending on how many caustics photons are nearby. This staggers the attempts to access the lock just enough to sort of cut through the bottleneck.

    Why was this never noticed? I think you are the first person to throw this much processing power at a single AOI instance.

    Solution:

    Replaced the lock with an atomic get-and-increment. No more lock contention. On my machine, which could always peg the CPU at 100%, I get a ~6-7% speed improvement. I suspect you will get more.
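    For anyone curious what the change amounts to, here is a minimal sketch of the two approaches (class and method names are illustrative, not the actual Ray.java code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the fix (names are not AoI's actual code).
class RayIds {
    // Before: a synchronized method. Safe, but every caller must acquire
    // the same monitor, so under contention threads park and wake serially.
    private static int lockedId = 0;

    static synchronized int nextIdLocked() {
        return lockedId++;
    }

    // After: an atomic get-and-increment. Lock-free: a contending thread
    // simply retries a hardware compare-and-swap instead of blocking.
    private static final AtomicInteger atomicId = new AtomicInteger(0);

    static int nextIdAtomic() {
        return atomicId.getAndIncrement();
    }
}
```

    Both versions hand out unique, increasing IDs; the difference is only in what a waiting thread does while another thread is using the counter.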

     
  • Pete

    Pete - 2016-07-25

    Thanks! It works! Great!!! :D

    So I was not too far off contemplating "...that the RayTracer core can not feed the threads fast enough," though the technical side was a mystery to me...

    And a practical note for anyone else trying this: copy the new, compiled Renderers.jar to the Plugins folder that you are using. That is what changed. :)

    I ran a couple of test cases:

    1) The same scene, 800 x 450, 64/256, 200 000 photons:
    * From 55 sec to 36 sec
    * Speed increase 53% - That is because generating the caustics map did not change much.

    2) And then 1600 x 900, 64/256, no caustics:
    * From 1:34 to 0:25
    * Speed increase 276%

    Wow! So it definitely did not just go from 40 to 100 as I had hoped but rather from 40 to 150!

    Thank you very much!

    EDIT: The compiled Renderers.jar is attached for anyone who needs it. :) Used jdk1.8.0_45 to compile.

     
    Last edit: Pete 2016-07-25
  • Peter Eastman

    Peter Eastman - 2016-07-25

    Wow, that's fantastic. Atomics are definitely the modern way of doing this, but I wouldn't have guessed the difference would be that big. I guess with 12 virtual cores, contention starts to become a much bigger issue.

    Peter

     
  • Luke S

    Luke S - 2016-07-26

    Glad that you found that much improvement! For the simple ray tracing case, that's enormous!

    @Peter: I think that Pete found the absolute pathological case for contention in threads that do not depend on each other's results.

    A straightforward scene: most rays are traced quickly. I suspect that once the initial contention happened, each thread would block on the lock every time it tried to spawn a new ray. So the work was limited, to a very real extent, by how fast the lock could cycle threads through, revolving-door fashion. In cases where the per-ray calculations take longer, the contention would have a smaller impact.

     
