
Accelerating OpenMPS with GPUs

Technical
2019-09-26

    Matthew Jones - 2019-09-26

    Hi Daniel,

    There are many ways to accelerate OpenMPS with GPUs. In previous discussions, you've mentioned putting some meaningful thought into this. Would you mind detailing your approach here?

    It would be good to push this general discussion into the limelight. Others might have additional insights on how best to approach this, and additional requirements for accomplishing the task.

    MJ

      Daniel Jaschke - 2019-09-26

      Hi Matt,

      Thanks for setting up the discussion here. The approaches are not
      limited to GPUs, but you can assume, for instance, that you have a
      cluster of n GPUs and then generalize to other architectures. I can
      think of the following layers of parallelization:

      1) Parallelization over sites / qudits in your system: TEBD on
      L = 2 * n sites can apply the n two-site operators on the even (or odd)
      bonds at the same time (a minimal sketch follows this list). A similar
      scheme has been outlined for variational algorithms, but it seems more
      cumbersome (maybe that is just my impression from having implemented
      the parallel TEBD once). I don't know if we can parallelize all of our
      algorithms this way; we would have to check, e.g., TDVP, Krylov, LRK,
      and the excited state search. They all contain variational elements, so
      there is some chance we can reuse ideas. This approach (1) is the most
      work to implement: you must rewrite at a very high level, employing
      tensor-network intuition without being backed up by any standard
      libraries.

      2) Parallelize over symmetries (only if present): If symmetries are
      actively conserved in the simulation, we always have a loop over the
      quantum numbers, similar to a block-diagonal structure. The matrix
      operations (SVDs, QRs, tensor contractions) can be parallelized because
      the entries in the loop are independent of each other. OSMPS v3 is
      structured in a way that you can put an OpenACC directive right into
      all the right spots (see the second sketch below).

      3) Parallelize matrix operations: You can accelerate each matrix
      operation a bit; I never got past a speed-up of 2.5 or 3 on CPUs. That
      basically works if you link against the Intel MKL library with OpenMP,
      i.e., you can profit from libraries without much effort yourself.
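
      To make (1) concrete, here is a minimal sketch of the even-bond half of
      a TEBD sweep. The module mps_types, the derived types, and the routine
      apply_two_site_gate are hypothetical placeholders for the corresponding
      OpenMPS internals; the point is only that the even bonds touch disjoint
      pairs of sites, so their updates are independent.

      subroutine tebd_even_sweep(psi, gates, ll)
        use mps_types, only: mps_tensor, gate_t  ! hypothetical module
        implicit none
        integer, intent(in) :: ll                    ! number of sites
        type(mps_tensor), intent(inout) :: psi(ll)   ! one tensor per site
        type(gate_t), intent(in) :: gates(ll - 1)    ! one gate per bond
        integer :: ii

        ! Bonds (1,2), (3,4), ... act on disjoint site pairs, so the
        ! two-site updates can run concurrently; the odd bonds form a
        ! second, analogous half sweep.
        !$omp parallel do
        do ii = 1, ll - 1, 2
           call apply_two_site_gate(psi(ii), psi(ii + 1), gates(ii))
        end do
        !$omp end parallel do
      end subroutine tebd_even_sweep

      The same loop structure generalizes from OpenMP threads to distributing
      bonds across devices; only the gate application itself changes.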

      I assume that on a GPU cluster you could parallelize (2) over the rows
      of the GPU cores and (3) over the columns of the GPU cores (just
      assuming a 2d layout to address the GPU cores).
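
      As a sketch of where the OpenACC directive from (2) would go, assume
      for illustration that the blocks of one shape class have been padded to
      a common size bs and packed into contiguous arrays; the real block
      bookkeeping in OSMPS is more involved, so this is a toy version of the
      loop over quantum numbers.

      subroutine contract_blocks(a, b, c, bs, nblocks)
        implicit none
        integer, intent(in) :: bs, nblocks
        real(8), intent(in)  :: a(bs, bs, nblocks), b(bs, bs, nblocks)
        real(8), intent(out) :: c(bs, bs, nblocks)
        integer :: ib, ii, jj, kk

        ! One gang per quantum-number block; the blocks never couple,
        ! mirroring the block-diagonal structure described in (2).
        !$acc parallel loop gang copyin(a, b) copyout(c)
        do ib = 1, nblocks
           !$acc loop vector collapse(2)
           do jj = 1, bs
              do ii = 1, bs
                 c(ii, jj, ib) = 0.0d0
                 do kk = 1, bs
                    c(ii, jj, ib) = c(ii, jj, ib) + a(ii, kk, ib) * b(kk, jj, ib)
                 end do
              end do
           end do
        end do
      end subroutine contract_blocks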

      If that is not enough, you can think about other levels of
      parallelization:
      a) Run time evolution and measurement in parallel, although this highly
      depends on how much you measure.
      b) Parallelize the measurements themselves; otherwise you get stuck in
      a serial section of the code during each measurement. Some are trivial
      to parallelize (see the sketch just below).
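
      For (b), the independent local observables are the trivial case. A
      hypothetical sketch, where mps_types, observable_t, and measure_local
      again stand in for the actual OpenMPS measurement machinery:

      subroutine measure_all(psi, obs, results, nobs, ll)
        use mps_types, only: mps_tensor, observable_t  ! hypothetical module
        implicit none
        integer, intent(in) :: nobs, ll
        type(mps_tensor), intent(in) :: psi(ll)
        type(observable_t), intent(in) :: obs(nobs)
        real(8), intent(out) :: results(nobs)
        real(8), external :: measure_local             ! hypothetical
        integer :: ii

        ! Local observables only read the state, so they can be evaluated
        ! concurrently; correlation functions would need a smarter schedule.
        !$omp parallel do
        do ii = 1, nobs
           results(ii) = measure_local(psi, obs(ii))
        end do
        !$omp end parallel do
      end subroutine measure_all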

      I hope that gives a good overview; I can detail the points if
      necessary. But in most of my research we had a huge parameter space on
      top of a single simulation, and parallelization over the parameter
      space is trivial and the most efficient use of limited resources. That
      does not sound as fancy as the approaches above, but it should always
      remain the priority (and an option) for parallelization in OSMPS.

      Best regards,

      Daniel

      P.S. There has been some discussion that v3 lost some speed in
      comparison to v2. I have not forgotten that discussion, but it turns
      out not to be a trivial issue. My impression is that operations in v2
      implicitly assumed that subroutines are executed in a certain order
      (e.g., are elements unique, or have they been sorted?), and I got rid
      of such assumptions in v3, leading to some overhead. I already got some
      speed-up by throwing away some entries in our tensor and tensorc types
      to reduce memory allocations. I had a branch for that, but no chance to
      follow up on it.


        Matthew Jones - 2019-09-26

        This is a good high-level summary. Thanks! I'll refer to your
        observations as (1), (2), etc.

        Regarding (1), you aptly point out that accelerating the functions
        within OpenMPS involves two tasks: (a) restructuring the algorithms
        for the target architecture and (b) restructuring the underlying data
        structures. I would break the work down along those lines.

        Algorithms of interest (in order of relative importance): excited
        state search, TDVP, LRK, Krylov. On a first pass, we can batch GEMMs
        within the algorithms; this will give us an estimate for coverage (a
        sketch follows below). On a second pass, we can look at restructuring
        the underlying data to implement cyclic/recursive distributions that
        optimally load-balance the operations. We'll also need to minimize
        certain kinds of transactions (memory utilization, and the staging
        associated with it, will be very important for performance). Getting
        the data structure right is of paramount importance when taking this
        route. I can detail this more if needed.
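
        To illustrate the first pass: assuming the GEMMs of one shape class
        have been packed into a contiguous array-of-blocks layout (an
        assumption for this sketch), batching on the CPU is a threaded loop
        over dgemm; on the GPU the same batch would go to a batched routine
        such as cublasDgemmBatched.

        subroutine batched_gemm(a, b, c, m, n, k, nbatch)
          implicit none
          integer, intent(in) :: m, n, k, nbatch
          real(8), intent(in)  :: a(m, k, nbatch), b(k, n, nbatch)
          real(8), intent(out) :: c(m, n, nbatch)
          integer :: ib

          ! The batch entries are independent, so each thread runs its own
          ! (single-threaded) BLAS GEMM: C_ib = A_ib * B_ib.
          !$omp parallel do
          do ib = 1, nbatch
             call dgemm('N', 'N', m, n, k, 1.0d0, a(:, :, ib), m, &
                        b(:, :, ib), k, 0.0d0, c(:, :, ib), m)
          end do
          !$omp end parallel do
        end subroutine batched_gemm

        Counting how many GEMMs land in such batches, versus how many remain
        irregular, gives the coverage estimate mentioned above.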

        Regarding (2, 3), I agree that we should be using OpenACC to
        accelerate the various logical blocks and nested loop structures. We
        need to be very careful when applying "functional" acceleration over
        matrix operations: the benefit can easily be erased by, for example,
        repeated requests for results within a routine. I think our first
        priority should be to decorate OpenMPS with the appropriate
        preprocessor directives backed by OpenACC and let the device driver /
        compiler manage the translation to compiled code.
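
        For instance, an OpenACC data region keeps operands resident on the
        device across consecutive operations, so intermediate results are not
        shipped back to the host after every call. The routine and array
        names below are placeholders, not OpenMPS identifiers:

        ! Keep the tensors on the device for the whole sequence.
        !$acc data copyin(amat, bmat) create(tmp) copyout(cmat)
        call contract_pair(amat, bmat, tmp)  ! hypothetical device routine
        call truncate_block(tmp, cmat)       ! hypothetical device routine
        !$acc end data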

        On one of your assumptions, that we have a cluster of N GPUs:
        performance will also be impacted by the number of available CPU
        threads (call it M) per GPU (M/N). The underlying network and PCIe
        topologies can also grossly impact our ability to distribute and
        execute a workload; this is not as simple as specifying
        high-bandwidth networking. We'll need to rework a significant portion
        of the communications infrastructure within OpenMPS to facilitate
        discovery of the optimal data path.

        Regarding the performance regression from v2 to v3: if sorting was
        the reason for the regression, we shouldn't prioritize addressing it.
        That is performance we can easily recover in the GPU implementation.

        An implication of the above is that we'll need to rewrite the
        bindings interface between Fortran and Python to minimize the
        transient overheads associated with data migration. These can and
        will be significant.
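
        One possible direction, sketched here with hypothetical module and
        routine names: a C-interoperable entry point that keeps the state
        resident on the Fortran (and hence device) side, so that only scalars
        cross the Python boundary.

        module ompms_capi                    ! hypothetical module name
          use iso_c_binding, only: c_double, c_int
          implicit none
        contains
          subroutine ompms_evolve(dt, nsteps, energy) bind(c, name="ompms_evolve")
            real(c_double), value :: dt      ! time step
            integer(c_int), value :: nsteps  ! number of steps
            real(c_double), intent(out) :: energy
            ! Advance the state held in module/device storage and return
            ! only a scalar; the MPS tensors never cross the boundary.
            energy = 0.0d0                   ! placeholder
          end subroutine ompms_evolve
        end module ompms_capi

        Python would then attach via ctypes or cffi and never touch the
        tensors directly.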

        All of the above must be preceded by detailed profiles of all
        pertinent code paths; these will tell us where the largest
        bottlenecks are. Profiling and memory analysis will be required at
        every stage of this development. We will also have to worry about the
        effective duty cycle of each kernel executing on each GPU and of the
        associated managerial threads on the CPU.

        Let me know what your thoughts are on the above, or if you have questions.
        MJ

