CUDA and the mobile GPU

2009-03-19
2013-04-22
  • Aaron Reilly
    2009-03-19

    It would seem that nVidia has released CUDA 2.1 for mobile GPUs. It is still in beta, but as far as I have tested with multiple applications, it is fully stable and supported. It's awesome, to say the least. I have been rendering from FLAM4 for some time now, and did my first flame at 1920x1080 at 200,000 quality in a little over an hour and a half. Pretty amazing, considering it would have taken at least 10 days on my computer's CPU alone.

    One question I do have, though it is more of a feature request than anything... argh, I'll just list them:

    - A section showing the available (usable) memory and how much memory the renderer will take.
    - A section for controlling oversampling (or multisampling).
    - Transparency for PNGs. JPEG support is not really required, as PNG is a better format.
    - Ability to render in multiple parts (like Apophysis, which can split a render into parts when there is not enough memory to do it in one go).
    - Ability to integrate FLAM4 into Apophysis. (This would mean giving FLAM4 command-line functions similar to flam3, at which point it could also be used with ES for rendering flames to upload to the server.)

    Also, I am having a rendering problem with resolutions above 1920x1080. I get an "Out of memory" error followed by a CUDA error, and I usually end up having to reboot.

     
    • Joe
      2009-03-19

      That's quite a tall order for feature requests, though integrating any of those into flam4 would be great.

      Your out-of-memory error comes down to how much on-board memory your GPU has. It sounds like you have 256MB. I have a 512MB GPU, and resolutions greater than ~3200x3200 give me problems. I don't think CUDA addresses TurboCache's shared system memory, though that would be great, since recent GPUs can take advantage of a large amount of system RAM (for example, my notebook will allocate up to 1.7GB in TurboCache).

      One other sidenote:
      Why would you use 200K quality at that resolution? Seems like overkill, no? :P

      ---------
      If I may add one feature request:

      flam3 has the ability to morph movies from fractal to fractal. It sounds like some complicated math, but that's the thing I find most lacking in flam4: movies are limited to a full fractal rotation.

       
      • Aaron Reilly
        2009-03-20

        My requests are for future design and program development, of course. None of these are expected in the next release; they are very complicated requests and I do understand that. Just something for the to-do list, you know?

        You are correct, I do have 256MB of memory on my GPU (8600M GT). However, I will be attempting to get this to run on a newer card, hopefully a 9800GT or even a GTX295, depending on when I get my newest system. My notebook will allocate up to 1.7GB; the question is how to implement it.

        As for the 200k render at 1920x1080: just for the heck of it! Why not? I am still fascinated by the power demonstrated by even low-priced GPUs. So, that's why. It's somewhat unfortunate, because I do all of my still rendering at 7296x6652@20k quality. If I were able to use that power on my current GPU without having to buy the 4GB Quadro CX, that would be nice... (Which I wouldn't even if I had the money; heck, I'd get 2 GTX 295s, or a Tesla.)

         
        • Joe
          2009-03-20

          Yeah, a feature request list is always good, especially with newer programs like this.

          I don't know if the extra 1.7GB will work with CUDA; I'm not entirely sure, though. Implementation will certainly be an issue. Since flam4 is a Visual Studio program, I wonder if it would be possible to just skip TurboCache and go straight to system RAM, and then, as you mentioned before, render in slices the GPU can handle.

          The power is certainly something to be in awe of! Your 32 stream processors and my 64 totally put even a desktop quad-core CPU to shame. Granted, it took some work to get it going, but as you said, even a halfway decent mid-range GPU can compete with a high-end CPU.

          I'd say wait for software to mature before jumping on a Tesla (though having one would be awesome! Teraflops at your disposal...). I feel like something as simple as a memory issue can be solved through software algorithms.

          I wonder what progress Keldor's made lately...

           
          • Aaron Reilly
            2009-03-20

            As for the memory issue, I wonder if it's a protection thing. I'm not sure; I've actually never worked with CUDA before, and I should probably start learning it. If it were able to use system RAM, that would eliminate most resolution issues, depending on how much RAM you have, of course.

            As for a Tesla system, if FLAM4 were able to run on one of those, a 1920x1080@200k render would take mere minutes. Full HD loops at 50k would take 20 minutes. Especially if you OC the cards... *starts drooling*

            I do also wonder what progress Keldor has made recently...

             
            • Keldor
              2009-03-21

              The GPU can't even see system memory. Moreover, even if it could, it'd have to communicate over the PCIe bus, which is much too slow. Remember, every single iteration must read in a value from the output buffer, add the new point's contribution, and then write it back. Flam4 will easily maintain over 10 GB/s of memory activity. Compare that to the 2-3 GB/s maximum bandwidth of the PCIe bus and there's probably a 5x performance drop already, and that's not even taking into account the overhead of setting up the transfers! This overhead would be pretty huge, since the data must go from system memory through the CPU, through the south(?) bridge to the PCIe bus, into graphics memory, and back again for every iteration.
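
The slowdown estimate above is simple division, and can be sketched as follows. This is only a back-of-the-envelope illustration using the figures quoted in the post (10 GB/s of memory traffic, a 2-3 GB/s bus); the function name `slowdown` is made up for the example, and it ignores the transfer setup overhead entirely.

```python
def slowdown(required_gb_s, bus_gb_s):
    """Factor by which a purely memory-bound workload slows down
    when its available bandwidth drops to bus_gb_s."""
    return required_gb_s / bus_gb_s


flam4_traffic = 10.0  # GB/s, the figure quoted in the post

# 2-3 GB/s is the bus bandwidth range quoted in the post.
for bus in (2.0, 3.0):
    factor = slowdown(flam4_traffic, bus)
    print(f"bus at {bus:.0f} GB/s: roughly {factor:.1f}x slower")
```

The 5x figure corresponds to the low end of the quoted bus range; at 3 GB/s the ratio is a little over 3x, before any per-transfer overhead is counted.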

               
              • Keldor
                2009-03-21

                Let's put it this way: having the GPU use system memory is just as bad as having the CPU use the hard drive.

                 
                • Joe
                  2009-03-21

                  Is there no way to expand the available memory to the GPU, then? Not even TurboCache (which is basically system memory)?

                   
                  • Keldor
                    2009-03-22

                    Well, CUDA 2.2 will add the ability for the GPU to read/write to pinned system memory, similar to TurboCache.  Still, unless you are rendering images larger than perhaps 5x your GPU's memory (AND have system memory in these amounts), it'll still be faster to render the image as a set of chunks, and then stitch them together to produce the full image.  Flam4 doesn't (yet) have a way to automatically break renders into strips, but you can do this by hand by editing the parameter files to render to each strip's appropriate view window and resolution.
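
Splitting a render into strips by hand mostly means working out how many rows fit in memory at once. A minimal sketch of that bookkeeping, assuming full-width horizontal strips and 16 bytes per output pixel (both are assumptions for illustration, not Flam4's documented behavior); `plan_strips` and the 160 MiB budget are hypothetical, and translating each strip into a view window in the parameter file is left out.

```python
import math


def plan_strips(width, height, bytes_per_pixel, budget_bytes):
    """Return (top_row, rows) pairs for full-width strips whose
    buffers each fit inside budget_bytes."""
    row_bytes = width * bytes_per_pixel
    max_rows = budget_bytes // row_bytes
    if max_rows < 1:
        raise ValueError("even a single row does not fit in the budget")
    count = math.ceil(height / max_rows)
    # Spread rows as evenly as possible so no strip exceeds max_rows.
    base, extra = divmod(height, count)
    strips = []
    top = 0
    for i in range(count):
        rows = base + (1 if i < extra else 0)
        strips.append((top, rows))
        top += rows
    return strips


# Hypothetical case: a 7296x6652 image with 160 MiB usable per strip.
strips = plan_strips(7296, 6652, 16, 160 * 2**20)
for top, rows in strips:
    print(f"strip starting at row {top}: 7296x{rows}")
```

Each strip would then be rendered on its own at 7296 pixels wide and the listed height, and the results stacked top to bottom.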

                     
    • Keldor
      2009-03-21

      A Tesla system will in theory "just work", though Flam4 still doesn't support multi-GPU.  Thus, there'd be little point in running on a Tesla 1070 over a GTX 280, since they both use the same actual processor.  The Tesla would have more memory though...

       
    • Aaron Reilly
      2009-03-23

      Is there any way that you could put in a function that would show the maximum resolution that you could render a flame at?
      For instance, something that would query the card for available memory and then work out from there approximately what the size could be? Just because you have 256MB of GPU memory doesn't mean it's all free.

      For instance, an image at 7296x6652 will take 650MB of system memory with no oversampling, and an image at 3680x2760 takes 160MB of system memory with no oversampling. I should easily have 160MB of GPU memory free when using FLAM4. Unless FLAM4 is oversampling, which would explain the issue I am having at that resolution.
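
The kind of estimate being asked for can be sketched as below. The 16 bytes per pixel (four 32-bit floats) is an assumption, chosen because it roughly reproduces the 160MB figure for 3680x2760; it is not a confirmed detail of FLAM4's buffers. The names `buffer_bytes` and `max_resolution` are hypothetical. In a real tool the free-memory number would come from the driver (the CUDA runtime exposes `cudaMemGetInfo` for this); here it is just passed in.

```python
import math

BYTES_PER_PIXEL = 16  # assumption: four 32-bit floats per pixel


def buffer_bytes(width, height, oversample=1,
                 bytes_per_pixel=BYTES_PER_PIXEL):
    """Estimated buffer size for a render at the given resolution."""
    return (width * oversample) * (height * oversample) * bytes_per_pixel


def max_resolution(free_bytes, aspect_w, aspect_h, oversample=1,
                   bytes_per_pixel=BYTES_PER_PIXEL):
    """Largest (width, height) with the given aspect ratio whose
    estimated buffer fits in free_bytes."""
    max_pixels = free_bytes // (bytes_per_pixel * oversample ** 2)
    scale = math.isqrt(max_pixels // (aspect_w * aspect_h))
    return aspect_w * scale, aspect_h * scale


mib = 2 ** 20
print(buffer_bytes(3680, 2760) / mib)                # no oversampling
print(buffer_bytes(3680, 2760, oversample=2) / mib)  # 2x2 oversampling
print(max_resolution(256 * mib, 4, 3))               # if all 256 MiB were free
```

Under these assumptions, 2x2 oversampling quadruples the buffer, which would push a 3680x2760 render well past a 256MB card.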

      Also, about the GPU using system memory: I can see where you are coming from. There would be a huge performance hit when transferring between GPU and system memory because of the interface, if it were possible in FLAM4. But I do believe that FLAM4 still has more rendering power than Apophysis or FLAM3 could ever have. Taking the hit, especially to take advantage of MUCH higher resolutions, could be very worth it to some people. Unfortunately, CUDA 2.2 will probably come out for non-mobile GPUs first, which means it will take a fair amount of time before we can use that feature.

       
    • Aaron Reilly
      2009-03-23

      http://forums.nvidia.com/index.php?showtopic=92290

      Check that out. This may be one of the answers to the memory issue.

       
  • This will not help on the PC version of Flam4, but the Mac version does tiling when necessary to create really large fractal images. It also reports the memory currently available on the GPU. The memory check for tiling is made before every render.

    Since the entire Mac user environment sits on OpenGL, the amount of free memory on the GPU can change without any notice. So memory blow-ups can still happen, but they are not common.