Hi Daniel,
There are many ways to accelerate OpenMPS with GPUs. From previous discussions, you've mentioned putting some meaningful thought into this. Would you mind detailing your approach here?
It would be good to push this general discussion into the limelight. Others might have additional insights on how best to approach this, and additional requirements for accomplishing the task.
MJ
Hi Matt,
Thanks for setting up the discussion here. The approaches are not limited
to GPUs, but you can assume, for instance, that you have a cluster of n GPUs
and generalize to other architectures. I can think of the following layers
of parallelization:
1) Parallelization over sites / qudits in your system: TEBD could
parallelize over L = 2 * n sites, that is, apply the two-site operators at the
same time. Similar schemes have been outlined for variational algorithms,
but they seem more cumbersome (maybe that is just my impression from
having implemented the parallel TEBD once). I don't know if we can
parallelize all of our algorithms this way; e.g., we would have to check
TDVP, Krylov, LRK, and the excited state search. They all contain
variational elements, so there is some chance we can reuse ideas. This
approach (1) is the most work to implement: you must rewrite at a very
high level, employing tensor-network intuition without being backed by
any standard libraries.
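The even/odd pairing behind parallel TEBD can be sketched as follows. This is a toy illustration, not the OSMPS data layout: the point is only that the two-site updates within one half-sweep touch disjoint site pairs and can therefore run concurrently, one pair per worker (or GPU). The gate is a placeholder identity so the example stays self-contained.

```python
# Toy sketch of point 1: with L = 2 * n sites, all n even-pair two-site
# updates are independent and can run in parallel (one per GPU/worker).
import numpy as np
from concurrent.futures import ThreadPoolExecutor

d = 2            # local (qudit) dimension
n = 4            # number of workers (e.g. GPUs)
L = 2 * n        # sites, so every even pair is busy simultaneously

rng = np.random.default_rng(0)
gate = np.eye(d * d)  # placeholder two-site gate (identity)
# Stand-in "two-site blocks" of the state, one per even pair.
blocks = [rng.standard_normal(d * d) for _ in range(n)]

def apply_gate(block):
    # Each update is independent of every other pair in the half-sweep.
    return gate @ block

with ThreadPoolExecutor(max_workers=n) as pool:
    updated = list(pool.map(apply_gate, blocks))
```

In the real algorithm the odd pairs follow in a second half-sweep, which is again fully parallel within itself.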
2) Parallelize over symmetries (only if present): If symmetries are
actively conserved in the simulation, we always have a loop over the
quantum numbers, similar to a block-diagonal structure. The matrix
operations (SVDs, QRs, tensor contractions) can be parallelized, since
the entries in the loop are independent of each other. OSMPS v3 is
structured in a way that you can put in an OpenACC statement right away
in all the right spots.
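A minimal sketch of this block-diagonal loop, under the assumption that blocks are stored per quantum number (the dict here is hypothetical, not the OSMPS storage format): each block's SVD is independent, so the loop maps directly onto workers.

```python
# Sketch of point 2: with conserved quantum numbers the tensor is block
# diagonal; each block can be decomposed independently and in parallel.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
# Hypothetical storage: quantum number -> dense block.
blocks = {q: rng.standard_normal((4 + q, 4 + q)) for q in range(3)}

def svd_block(item):
    q, m = item
    u, s, vt = np.linalg.svd(m)
    return q, (u, s, vt)

# The loop iterations share no data, so they map onto threads here
# (or onto OpenACC gangs / separate GPUs in the real code).
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(svd_block, blocks.items()))
```

Reassembling u @ diag(s) @ vt recovers each block exactly, so correctness is unaffected by the parallel order.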
3) Parallelize matrix operations: You can accelerate each matrix
operation a bit. I never got past a speed-up of 2.5 or 3 on CPUs. That
basically works if you link the Intel MKL library with OpenMP, i.e., you
can profit from libraries without much effort yourself.
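This "free" library-level parallelism needs no changes to the numerical code; the thread count is set in the environment before the threaded BLAS (MKL with OpenMP, OpenBLAS, ...) is loaded. A minimal sketch, assuming NumPy is linked against such a BLAS:

```python
# Sketch of point 3: thread-level BLAS parallelism comes from the library.
# OMP_NUM_THREADS must be set before the BLAS is loaded.
import os
os.environ.setdefault("OMP_NUM_THREADS", "4")  # MKL/OpenMP thread count

import numpy as np

a = np.random.default_rng(2).standard_normal((512, 512))
b = a @ a.T  # this GEMM runs on up to 4 threads if the BLAS is threaded
```

The observed speed-up saturates quickly for the small matrices typical of MPS blocks, which matches the factor of 2.5 to 3 mentioned above.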
I assume that on a GPU cluster you could parallelize (2) over the rows of
the GPU cores and (3) over the columns of the GPU cores (just assuming a
2D layout to address the GPU cores).
If that is not enough, you can think about other levels of parallelization:
a) Run time evolution and measurement in parallel, but this highly depends
on how much you measure.
b) Parallelize the measurements, otherwise you get stuck in a serial
section of the code during each measurement. Some are trivial to
parallelize.
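For point (b), the trivially parallel case is local observables: each site's expectation value is independent of the others. A toy sketch with Pauli-z on single-site product states (not the OSMPS measurement API):

```python
# Sketch of point (b): local measurements are independent per site, so
# the serial measurement section parallelizes trivially.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

sz = np.diag([1.0, -1.0])                     # Pauli-z
sites = [np.array([1.0, 0.0]),                # |0>  ->  +1
         np.array([0.0, 1.0]),                # |1>  ->  -1
         np.array([1.0, 1.0]) / np.sqrt(2)]   # |+>  ->   0

def measure(psi):
    # <psi| sz |psi> for one site; no shared state between sites.
    return float(psi @ sz @ psi)

with ThreadPoolExecutor() as pool:
    values = list(pool.map(measure, sites))
```

Correlation functions need more care (shared contractions along the chain), which is why only "some are trivial to parallelize."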
I hope that gives a good overview; I can detail the points if necessary.
But in most of my research, we had a huge parameter space on top of a
single simulation, and parallelization over the parameter space is
trivial and the most efficient use of limited resources. That does not
sound as fancy as the above approaches, but it should always remain the
priority (and an option) for parallelization in OSMPS.
Best regards,
Daniel
P.S. There has been some discussion that v3 lost some speed in comparison
to v2. I have not forgotten that discussion, but it turns out not to be a
trivial issue. My impression was that operations in v2 assumed that
subroutines are executed in a certain order (implicitly, e.g., are
elements unique, or have they been sorted?), and I got rid of such
assumptions in v3, leading to some overhead. I already got some speed-up
by throwing away some entries in our tensor and tensorc to reduce memory
allocations. I had a branch for that, but no chance to follow up on it.
This is a good high-level summary. Thanks! I'll refer to your observations as (1), (2), etc.
Regarding (1), you aptly point out that we have two tasks before us to accelerate the functions within OpenMPS: (a) restructuring the algorithms for the target architecture, and (b) restructuring the data structures themselves.
Algorithms of interest (in order of relative importance): excited state search, TDVP, LRK, Krylov. At first pass, we can batch GEMMs within the algorithms; this will give us an estimate for coverage. At second pass, we can look at restructuring the underlying data to implement cyclic/recursive distributions that optimally load-balance the operations. We'll also need to minimize certain kinds of transactions (memory utilization, and how data is staged, will be very important for performance). Getting the data structure right is of paramount importance when taking this route. I can detail this more if needed.
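The "batch the GEMMs" idea can be sketched quickly: many small, equally shaped matrix products are stacked into one 3-D array so a single batched call replaces a loop (on the GPU, cuBLAS's strided-batched GEMM plays this role; here NumPy's stacked matmul illustrates the same pattern).

```python
# Sketch of batched GEMM: one call over a stack of small matrices
# replaces a loop of individual products.
import numpy as np

rng = np.random.default_rng(3)
batch, m, k, n = 64, 8, 8, 8
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

C_batched = A @ B                                       # one batched GEMM
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])  # the loop it replaces

assert np.allclose(C_batched, C_loop)
```

The win on GPUs comes from amortizing kernel-launch overhead and keeping the device saturated, which is exactly where many small symmetry blocks otherwise hurt.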
Regarding (2, 3), I agree that we should be using OpenACC to accelerate various logical blocks and nested loop structures. We need to be very careful when applying "functional" acceleration over matrix operations: the benefit of doing so can easily be erased by repeated requests for results within a routine (as one example). I think our first priority should be to decorate OpenMPS with the appropriate preprocessor directives backed by OpenACC and let the device driver / compiler manage the translation to compiled code.
On the point of one of your assumptions, that we have a cluster of N GPUs: performance will also be impacted by the number of available CPU threads (call it M) per GPU, i.e., M/N. The underlying network and PCIe topologies can also grossly impact our ability to distribute and execute a workload. This is not as simple as specifying high-bandwidth networking; we'll need to rework a significant portion of the communications infrastructure within OpenMPS to facilitate discovery of the optimal data path.
Regarding the performance regression from v2 to v3: if sorting was the reason for the regression, we shouldn't prioritize addressing it. This is performance we can easily recover in the GPU implementation.
An implication of the above is that we'll need to rewrite the bindings interface between Fortran and Python to minimize the transient overheads associated with data migration. These can and will be significant.
All of the above must be predicated on detailed profiles of all pertinent code paths. These will tell us where the largest bottlenecks are. Profiling and memory analysis will be required at every stage of this development. We will have to worry about the effective duty cycle of each kernel executing on each GPU and of the associated managerial threads on the CPU.
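The profiling-first workflow can start on the CPU side with the standard library; the same idea carries over to Nsight/nvprof for the GPU kernels. A minimal sketch (the `hot_path` function is a hypothetical stand-in for a representative OpenMPS run):

```python
# Sketch of the profiling-first workflow: measure where time goes before
# deciding what to offload to the GPU.
import cProfile
import io
import pstats

def hot_path():
    # Placeholder for a representative simulation workload.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_path()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()  # top entries name the hottest code paths
```

Running this on real workloads tells us which of layers (1)-(3) actually dominates before we commit engineering effort.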
Let me know what your thoughts are on the above, or if you have questions.
MJ