Hi Daniel,
There are many ways to accelerate OpenMPS with GPUs. From previous discussions, you've mentioned putting some meaningful thought into this. Would you mind detailing your approach here?
It would be good to push this general discussion into the limelight. Others might have additional insights on how best to approach this, and additional requirements for accomplishing the task.
MJ
Hi Matt,
Thanks for setting up the discussion here. The approaches are not limited
to GPUs, but you can assume, for instance, that you have a cluster of n GPUs
and generalize to other architectures. I can think of the following layers
of parallelization:
1) Parallelization over sites / qudits in your system: TEBD could
parallelize over L = 2 * n sites, that is, apply the two-site operators at the
same time. Similar schemes have been outlined for variational algorithms,
but they seem more cumbersome (maybe that is just my impression from
having implemented the parallel TEBD once). I don't know if we can
parallelize all of our algorithms this way; e.g., we would have to check
TDVP, Krylov, LRK, and the excited state search. They all contain
variational elements, so there is some chance we can reuse ideas. This
approach (1) is the most work to implement: you must rewrite at a very
high level, employing tensor-network intuition without being backed by
any standard libraries.
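The even/odd pairing behind parallel TEBD can be sketched as follows. This is a toy illustration, not the OSMPS data layout: the point is only that the two-site updates within one half-sweep touch disjoint site pairs and can therefore run concurrently, one pair per worker (or GPU). The gate is a placeholder identity so the example stays self-contained.

```python
# Toy sketch of point 1: with L = 2 * n sites, all n even-pair two-site
# updates are independent and can run in parallel (one per GPU/worker).
import numpy as np
from concurrent.futures import ThreadPoolExecutor

d = 2            # local (qudit) dimension
n = 4            # number of workers (e.g. GPUs)
L = 2 * n        # sites, so every even pair is busy simultaneously

rng = np.random.default_rng(0)
gate = np.eye(d * d)  # placeholder two-site gate (identity)
# Stand-in "two-site blocks" of the state, one per even pair.
blocks = [rng.standard_normal(d * d) for _ in range(n)]

def apply_gate(block):
    # Each update is independent of every other pair in the half-sweep.
    return gate @ block

with ThreadPoolExecutor(max_workers=n) as pool:
    updated = list(pool.map(apply_gate, blocks))
```

In the real algorithm the odd pairs follow in a second half-sweep, which is again fully parallel within itself.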
2) Parallelize over symmetries (only if present): If symmetries are
actively conserved in the simulation, we always have a loop over the
quantum numbers, similar to a block-diagonal structure. The matrix
operations (SVDs, QRs, tensor contractions) can be parallelized, since
the entries in the loop are independent of each other. OSMPS v3 is
structured in a way that you can put in an OpenACC statement right away
in all the right spots.
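A minimal sketch of this block-diagonal loop, under the assumption that blocks are stored per quantum number (the dict here is hypothetical, not the OSMPS storage format): each block's SVD is independent, so the loop maps directly onto workers.

```python
# Sketch of point 2: with conserved quantum numbers the tensor is block
# diagonal; each block can be decomposed independently and in parallel.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
# Hypothetical storage: quantum number -> dense block.
blocks = {q: rng.standard_normal((4 + q, 4 + q)) for q in range(3)}

def svd_block(item):
    q, m = item
    u, s, vt = np.linalg.svd(m)
    return q, (u, s, vt)

# The loop iterations share no data, so they map onto threads here
# (or onto OpenACC gangs / separate GPUs in the real code).
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(svd_block, blocks.items()))
```

Reassembling u @ diag(s) @ vt recovers each block exactly, so correctness is unaffected by the parallel order.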
3) Parallelize matrix operations: You can accelerate each matrix
operation a bit. I never got past a speed-up of 2.5 or 3 on CPUs. That
basically works if you link the Intel MKL library with OpenMP, i.e., you
can profit from libraries without much effort yourself.
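This "free" library-level parallelism needs no changes to the numerical code; the thread count is set in the environment before the threaded BLAS (MKL with OpenMP, OpenBLAS, ...) is loaded. A minimal sketch, assuming NumPy is linked against such a BLAS:

```python
# Sketch of point 3: thread-level BLAS parallelism comes from the library.
# OMP_NUM_THREADS must be set before the BLAS is loaded.
import os
os.environ.setdefault("OMP_NUM_THREADS", "4")  # MKL/OpenMP thread count

import numpy as np

a = np.random.default_rng(2).standard_normal((512, 512))
b = a @ a.T  # this GEMM runs on up to 4 threads if the BLAS is threaded
```

The observed speed-up saturates quickly for the small matrices typical of MPS blocks, which matches the factor of 2.5 to 3 mentioned above.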
I assume that on a GPU cluster you could parallelize (2) over the rows of
the GPU cores and (3) over the columns of the GPU cores (just assuming a
2D layout to address the GPU cores).
If that is not enough, you can think about other levels of parallelization:
a) Run time evolution and measurement in parallel, but this highly depends
on how much you measure.
b) Parallelize the measurements, otherwise you get stuck in a serial
section of the code during each measurement. Some are trivial to
parallelize.
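For point (b), the trivially parallel case is local observables: each site's expectation value is independent of the others. A toy sketch with Pauli-z on single-site product states (not the OSMPS measurement API):

```python
# Sketch of point (b): local measurements are independent per site, so
# the serial measurement section parallelizes trivially.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

sz = np.diag([1.0, -1.0])                     # Pauli-z
sites = [np.array([1.0, 0.0]),                # |0>  ->  +1
         np.array([0.0, 1.0]),                # |1>  ->  -1
         np.array([1.0, 1.0]) / np.sqrt(2)]   # |+>  ->   0

def measure(psi):
    # <psi| sz |psi> for one site; no shared state between sites.
    return float(psi @ sz @ psi)

with ThreadPoolExecutor() as pool:
    values = list(pool.map(measure, sites))
```

Correlation functions need more care (shared contractions along the chain), which is why only "some are trivial to parallelize."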
I hope that gives a good overview; I can detail the points if necessary.
But in most of my research, we had a huge parameter space on top of a
single simulation, and parallelization over the parameter space is
trivial and the most efficient use of limited resources. That does not
sound as fancy as the above approaches, but it should always remain the
priority (and an option) for parallelization in OSMPS.
Best regards,
Daniel
P.S. There has been some discussion that v3 lost some speed in comparison
to v2. I have not forgotten that discussion, but it turns out not to be a
trivial issue. My impression was that operations in v2 assumed that
subroutines are executed in a certain order (implicitly, e.g., are
elements unique, or have they been sorted?), and I got rid of such
assumptions in v3, leading to some overhead. I already got some speed-up
by throwing away some entries in our tensor and tensorc to reduce memory
allocations. I had a branch for that, but no chance to follow up on it.
This is a good high-level summary. Thanks! I'll refer to your observations as (1), (2), etc.
Regarding (1), you aptly point out that we have two tasks before us to accelerate the functions within OpenMPS: (a) restructuring the algorithms for the target architecture, and (b) restructuring the data structures themselves.
Algorithms of interest (in order of relative importance): excited state search, TDVP, LRK, Krylov. At first pass, we can batch GEMMs within the algorithms; this will give us an estimate for coverage. At second pass, we can look at restructuring the underlying data to implement cyclic/recursive distributions that optimally load-balance the operations. We'll also need to minimize certain kinds of transactions (memory utilization, and how data is staged, will be very important for performance). Getting the data structure right is of paramount importance when taking this route. I can detail this more if needed.
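The "batch the GEMMs" idea can be sketched quickly: many small, equally shaped matrix products are stacked into one 3-D array so a single batched call replaces a loop (on the GPU, cuBLAS's strided-batched GEMM plays this role; here NumPy's stacked matmul illustrates the same pattern).

```python
# Sketch of batched GEMM: one call over a stack of small matrices
# replaces a loop of individual products.
import numpy as np

rng = np.random.default_rng(3)
batch, m, k, n = 64, 8, 8, 8
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

C_batched = A @ B                                       # one batched GEMM
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])  # the loop it replaces

assert np.allclose(C_batched, C_loop)
```

The win on GPUs comes from amortizing kernel-launch overhead and keeping the device saturated, which is exactly where many small symmetry blocks otherwise hurt.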
Regarding (2, 3), I agree that we should be using OpenACC to accelerate various logical blocks and nested loop structures. We need to be very careful when applying "functional" acceleration over matrix operations: the benefit of doing so can easily be erased by repeated requests for results within a routine (as one example). I think our first priority should be to decorate OpenMPS with the appropriate preprocessor directives backed by OpenACC and let the device driver / compiler manage the translation to compiled code.
On the point of one of your assumptions, that we have a cluster of N GPUs: performance will also be impacted by the number of available CPU threads (call it M) per GPU, i.e., M/N. The underlying network and PCIe topologies can also grossly impact our ability to distribute and execute a workload. This is not as simple as specifying high-bandwidth networking; we'll need to rework a significant portion of the communications infrastructure within OpenMPS to facilitate discovery of the optimal data path.
Regarding the performance regression from v2 to v3: if sorting was the reason for the regression, we shouldn't prioritize addressing it. This is performance we can easily recover in the GPU implementation.
An implication of the above is that we'll need to rewrite the bindings interface between Fortran and Python to minimize the transient overheads associated with data migration. These can and will be significant.
All of the above must be predicated on detailed profiles of all pertinent code paths. These will tell us where the largest bottlenecks are. Profiling and memory analysis will be required at every stage of this development. We will have to worry about the effective duty cycle of each kernel executing on each GPU and of the associated managerial threads on the CPU.
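The profiling-first workflow can start on the CPU side with the standard library; the same idea carries over to Nsight/nvprof for the GPU kernels. A minimal sketch (the `hot_path` function is a hypothetical stand-in for a representative OpenMPS run):

```python
# Sketch of the profiling-first workflow: measure where time goes before
# deciding what to offload to the GPU.
import cProfile
import io
import pstats

def hot_path():
    # Placeholder for a representative simulation workload.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_path()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()  # top entries name the hottest code paths
```

Running this on real workloads tells us which of layers (1)-(3) actually dominates before we commit engineering effort.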
Let me know what your thoughts are on the above, or if you have questions.
MJ