what improvement should be expected?

  • Alberto

    Alberto - 2009-05-01

    I have read on this and other forums about improvements ranging from 50x to 200x when using flam4 instead of flam3.
    I'm experiencing only a 7x improvement instead.

    My configuration is:
    Asus G71v (notebook) 4GB ram
    CPU: core 2 duo 2.53ghz
    GPU: geforce 9700M GT (32 cores) 512MB
    Windows XP 32
    Driver 185.81 beta including Cuda 2.2

    I monitored CPU behaviour with Task Manager while rendering and noticed that both CPU cores go to 100%, almost all of it kernel time, versus no kernel time while rendering with apo.
    This made me wonder whether there is a bottleneck in the CPU-GPU communication link, or maybe it is the beta driver.
    Could you post your performance gain here?

    Thank you

    • Keldor

      Keldor - 2009-05-07

      Your problem is very likely your GPU: at 32 cores, the 9700M GT is pretty scrawny compared to the high-end cards that achieve those benchmarks.  I have a slightly faster Core 2 Duo (2.8 GHz iirc) with a GTX 295, which has 240x2 = 480 cores.  If we take your 7x speed increase and multiply it by the ratio of core counts (480/32 = 15), we get an expected 7*15 = 105x improvement for the GTX 295, which is right in line with my observed performance of "about 100x".  If you had two GTX 295s in SLI, you might get the 200x.
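The estimate above can be written out as a few lines of Python. This is only a sketch of the arithmetic in the post: it assumes speedup scales linearly with core count, which ignores clock speed, memory bandwidth and architecture differences, and `scale_speedup` is a made-up helper name.

```python
def scale_speedup(measured_speedup, measured_cores, target_cores):
    """Scale a measured speedup linearly by the ratio of GPU core counts."""
    return measured_speedup * target_cores / measured_cores

# Figures quoted in this thread.
laptop_speedup = 7        # 9700M GT with 32 cores
gtx295_cores = 240 * 2    # two GPUs on one card

estimate = scale_speedup(laptop_speedup, 32, gtx295_cores)
print(f"Estimated GTX 295 speedup: {estimate:.0f}x")
```

With the numbers from the thread this gives 105x, against the "about 100x" actually observed.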

      The bottom line?  There's a really, really big difference between the low end and high end of current-gen Nvidia cards, and laptops are almost exclusively "low end" relative to their desktop counterparts because of power consumption constraints.

      • Alberto

        Alberto - 2009-05-07

        Ah thanks, you are right, I have been a bit naive and lazy about this question.
        I found the answer myself after looking up the card specs on the nvidia site.
        So, a very coarse rule of thumb could be x_improvement=nvidia_cores/5.
        Still wondering why both cores of the *CPU* are still busy at 100% in kernel time while GPU is rendering.
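As a quick sanity check on where the divisor of 5 comes from, here is a small sketch using only the two data points mentioned in this thread (32 cores at 7x, 480 cores at about 100x). The function name and the dictionary are illustrative, and two data points are of course not much of a fit.

```python
# (cores, observed speedup over flam3) as reported in this thread.
observations = {
    "9700M GT": (32, 7),
    "GTX 295": (480, 100),  # "about 100x"
}

def rule_of_thumb(cores, cores_per_x=5):
    """Coarse estimate: one 'x' of speedup per cores_per_x GPU cores."""
    return cores / cores_per_x

for name, (cores, observed) in observations.items():
    implied = cores / observed  # cores needed per 1x of observed speedup
    print(f"{name}: {implied:.1f} cores per 1x, "
          f"rule predicts {rule_of_thumb(cores):.1f}x, observed {observed}x")
```

Both cards come out a little under 5 cores per 1x, so the rule slightly underestimates the observed figures.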

    • Keldor

      Keldor - 2009-05-12

      What's happening (in the current version, anyway) is that even though the threads yield immediately upon starting a GPU kernel or issuing a command to the GPU control thread(s), Windows appears to say "Oh, look, there's another flam4 thread waiting to go!" and switches right back into one of the threads that had just been switched out a moment ago.

      Actually, this is a good thing, since the control threads need to be able to respond immediately to a GPU finishing its workload, or else that GPU will sit there idling until its control thread happens to get scheduled again.  That's what I managed to improve in version 0.66, when I discovered that each GPU was spending more than 50% of the time idling!  Threads just weren't getting scheduled properly, so there was way too much latency between the command thread issuing a command and the GPU threads picking it up.

      Basically, two things were going on.  First off, the GPU threads were spin-waiting for their kernels to finish, which would have reduced the command latency, except that there weren't enough CPU cores to allow each GPU control thread and the command thread to run at the same time.  Thus it was at the mercy of Windows' thread scheduler to ensure that each thread got its fair share, which simply wasn't working well at all.

      Second, the GPU control threads each needed local copies of the flame parameters for their current motion sample.  This is because each motion sample is different from the last, and so each GPU will be working against a slightly different parameter set at any given time.  The problem was that before copying a new set of parameters, it had to wait until the GPU thread was finished with its last task, and so each GPU control thread would have to wait on its local copy before sending each task to its associated GPU.

      The first problem was solved with a new feature in CUDA 2.2, which allowed threads to yield once they sent a kernel off to the GPU to run.  This gave the control thread more than enough time to run in the meanwhile, as well as giving all the other GPU control threads time to send their work units off to their assigned GPUs.  This alone practically doubled performance.
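To make the difference between spin-waiting and yielding concrete, here is a CPU-only analogy in Python. It is not flam4 code: a timer stands in for the GPU finishing a kernel, and the two wait functions are hypothetical. In the CUDA runtime, this kind of behaviour is selected with `cudaSetDeviceFlags(cudaDeviceScheduleYield)`; that flam4 uses exactly that call is an assumption, not something stated in this thread.

```python
import threading

def wait_spinning(done):
    """Busy-wait: poll the flag in a tight loop, occupying a CPU core."""
    polls = 0
    while not done.is_set():
        polls += 1
    return polls

def wait_blocking(done):
    """Yielding wait: sleep until signalled, leaving the core to other threads."""
    done.wait()
    return 1  # a single call, however long the work takes

def run(waiter, work_seconds=0.1):
    done = threading.Event()
    # Stand-in for the GPU finishing a kernel after work_seconds.
    timer = threading.Timer(work_seconds, done.set)
    timer.start()
    polls = waiter(done)
    timer.join()
    return polls

spin_polls = run(wait_spinning)
block_polls = run(wait_blocking)
print("spin polls:", spin_polls, "blocking waits:", block_polls)
```

The spinning version burns through a large number of polls for the same amount of "GPU work", and on a dual-core machine with several control threads those polls are exactly the CPU time the other threads are waiting for.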

      The second problem was improved by adding a flag to the GPU control thread struct that tells the main control thread the moment it is safe to write in a new parameter set, since the current one has been safely shipped over to the GPU.  This means that the main control thread can be preparing the next workload for a given GPU control thread while that GPU is still processing the previous unit (which has had its parameters loaded into the GPU's memory, so the CPU memory copy is no longer needed).
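The flag handoff described above can be sketched in Python as follows. This is an illustration under assumptions rather than the flam4 implementation: `GpuControl`, `slot_free` and `slot_full` are invented names, the "GPU" is just a thread squaring numbers, and a real control struct would hold flame parameters rather than integers.

```python
import threading

class GpuControl:
    """Hypothetical per-GPU control struct: one parameter slot plus two flags."""
    def __init__(self):
        self.params = None
        self.slot_free = threading.Event()  # safe for the main thread to write
        self.slot_full = threading.Event()  # new parameters are waiting
        self.slot_free.set()

def gpu_thread(ctrl, results):
    while True:
        ctrl.slot_full.wait()
        ctrl.slot_full.clear()
        local = ctrl.params            # stand-in for the copy into GPU memory
        ctrl.slot_free.set()           # main thread may now prepare the next unit
        if local is None:              # sentinel: no more work
            return
        results.append(local * local)  # stand-in for rendering the motion sample

def main_thread(ctrl, samples):
    for sample in list(samples) + [None]:
        ctrl.slot_free.wait()          # waits for the slot, not for the whole render
        ctrl.slot_free.clear()
        ctrl.params = sample
        ctrl.slot_full.set()

ctrl = GpuControl()
results = []
worker = threading.Thread(target=gpu_thread, args=(ctrl, results))
worker.start()
main_thread(ctrl, range(5))
worker.join()
print(results)
```

The point of the design is that `slot_free` is raised right after the copy, before the render starts, so the next parameter set can be written while the previous one is still being processed.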

      I hope my long-winded technical explanation about CPU usage satisfied your curiosity <.<

    • Alberto

      Alberto - 2009-05-12

      Yes it did, thanks for the deep explanation.
      So if I understand correctly, CPU speed, the number of CPUs and the number of CPU cores do affect overall performance, because the CPU has to feed work to the GPU and gather its results, even though the computation itself happens on the GPU.

  • Steven Brodhead Sr.

    My Macbook Pro laptop gets a similar performance boost to what you are seeing. It is running the Mac version of Flam4, by the way. I have Vista on the laptop too and have run the PC Flam4 on that, but I have no benchmark info for that.

    The Macbook Pro has 2 GPUs, a 9600M GT and a 9400M. Flam4 can render using both at the same time.

    The Mac version will also break big images into separate tiles to get around the small GPU memory amounts on laptops.
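For anyone curious what tiling under a memory limit might look like, here is a minimal sketch. The post does not say how the Mac version actually splits images, so the horizontal-strip layout, the function name and all the numbers below are assumptions for illustration only.

```python
def plan_tiles(width, height, bytes_per_pixel, budget_bytes):
    """Split an image into horizontal strips that each fit the memory budget."""
    row_bytes = width * bytes_per_pixel
    rows_per_tile = budget_bytes // row_bytes
    if rows_per_tile < 1:
        raise ValueError("budget too small for even one row")
    tiles = []
    for top in range(0, height, rows_per_tile):
        # Each tile is a (first_row, end_row) pair, end_row exclusive.
        tiles.append((top, min(top + rows_per_tile, height)))
    return tiles

# Hypothetical numbers: 4000x3000 image, 16 bytes per pixel, 64 MiB budget.
tiles = plan_tiles(4000, 3000, 16, 64 * 1024 * 1024)
print(len(tiles), tiles[0], tiles[-1])
```

Each strip would be rendered separately and the results stitched together, at the cost of some repeated work per tile.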

