Please note that I am still new to programming. My question is likely a simple one, but I'm afraid I don't yet know enough about the problem to solve it myself.
In order to calculate the ground state using Open MPI (for which my install is configured and all tests have passed), I have followed others' examples by simply using something of the form "mpirun -np 4 elk". I'm reasonably certain that I'm doing something wrong, because this seems to initiate four separate, identical ground-state calculations. I'm not very familiar with how the MPI implementation of Elk handles GS calculations, but my impression was that the calculation time would be reduced, rather than four identical outputs being produced in each file.
Aside from modifying the make.inc and Makefile as indicated in the Elk documentation, is there anything else I need to do to benefit from using MPI for basic GS calculations? I haven't seen anyone else changing their elk.in files, so I assumed I didn't need to.
Hi Sean,
Are you sure you compiled the code correctly, as indicated in the manual? You need to comment out the mpi_stub.f90 line in the Makefile, since that file replaces the real MPI routines with do-nothing stubs.
On how many nodes are you trying to run the code? mpirun -np 4 elk looks like running on just one computer. Do you also use OpenMP? You should not try to parallelize with both OpenMP and MPI on the same node. Either switch off OpenMP (remove the flag from the compiler options) or make sure you use MPI only to distribute across the nodes (hybrid parallelism):
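If I remember correctly, the relevant part of the Makefile looks roughly like this (the variable name is from memory, so compare it against your own copy); leaving the line in gives you the serial stubs, commenting it out enables real MPI:
SRC_MPI = mpi_stub.f90
and in make.inc the compiler should then be the MPI wrapper, e.g.
F90 = mpif90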
mpirun -np 10 -pernode -hostfile ~/.mpi_hostfile ~/bin/elk
-pernode makes sure that only one MPI process is run on each node.
By the way, in most cases where MPI is used, it parallelizes over k-points. Thus, you benefit from MPI only if you have enough k-points.
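The hostfile itself is just a plain-text list of node names, one per line, for example (the names here are made up):
node01
node02
node03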
Cheers
Markus
Hi Markus,
I've re-checked my Makefile and make.inc, and I believe everything is as it should be (according to the manual).
You are correct -- I am actually just running this on my laptop for the moment (to become comfortable with the process). According to Intel, my number of "cores/threads" is "2/4". I understand that this will hardly be an improvement, but I'd like to get a better idea on a small scale before I attempt to work on the supercomputer.
With regard to switching off OpenMP, would I just comment out the third and sixth lines (e.g. #F90_OPTS = -O3 -ffast-math -funroll-loops -fopenmp) in make.inc?
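To be concrete, by "remove the flag" I assume you mean just dropping -fopenmp from that line, i.e. going from
F90_OPTS = -O3 -ffast-math -funroll-loops -fopenmp
to
F90_OPTS = -O3 -ffast-math -funroll-loops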
I have run "make clean" and "make" whenever I've made changes. First clue that I'm doing something incorrectly:
Info(main): several copies of Elk may be running in this path
(this could be intentional, or result from a previous crash,
or arise from an incorrect MPI compilation)
I'm still seeing the same results after commenting out the -fopenmp lines (if that is actually the correct thing to do). For example, with mpirun -np 2 elk, I get two GS calculations for each loop (two loop 1, two loop 2, etc.). As a first test, I tried the Al example with 770 k-points, and the mpirun took twice as long (with two loops each) compared to the ordinary elk run.
I appreciate any help getting started with this.
Thanks
Sean
Edit: I also haven't seen anything in INFO.OUT that indicates the number of MPI processes, as described in the manual.
Last edit: Sean 2013-07-08
I think I've solved the problem. It appears to have been a silly mistake: I did not run "make clean" inside the src directory when I last made changes, only in the top-level elk directory. Now everything appears to be working very nicely. Thank you for your patience, Markus.
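For anyone who finds this later, the rebuild sequence that fixed it was essentially (the directory name depends on where you unpacked Elk):
cd elk/src
make clean
cd ..
make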
Best
Sean
I'm finally up and running on the university cluster, and I'm finding a substantial improvement in my calculation times, especially with larger cells. I'd like to know more about how to use multiple processors/threads efficiently. For example, for a 48-atom cell with an ngridk of 4 4 2 (18 reduced k-points), what might be an appropriate approach? My submission script is structured as follows:
-pe openmpi* x
-l dedicated=y
which, according to the explanation I was given, means that each of the x processes gets y processors on the same machine; in other words, the program runs x processes, each executing with y threads.
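A fuller sketch of the submission script, assuming a Grid Engine style setup where the #$ lines are the scheduler directives; the OMP_NUM_THREADS line and the mpirun call are my guesses at how x and y come into play, and the path to elk is only an example:
#$ -pe openmpi* x
#$ -l dedicated=y
export OMP_NUM_THREADS=y
mpirun -np x -pernode ~/bin/elk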
Since I'm not entirely sure how Elk behaves in parallel beyond the k-point parallelization, I'd appreciate advice from those who understand it better.
Elk uses MPI for k-point parallelization. Thus, the maximum number of nodes that can do useful work is given by the actual number of k-points in your calculation (18 reduced k-points in your example above).
OpenMP is also used for the k-points, and, in a nested way, for further parallelism as well. You have to activate nested parallelism for this to work by setting the environment variable OMP_NESTED=TRUE. That's easiest with a wrapper script around your elk call:
export OMP_NESTED=TRUE
elk
Then you need to call this script with mpirun.
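Put together, the wrapper could look like this (the file name is only an example):
#!/bin/bash
# enable nested OpenMP, then start Elk
export OMP_NESTED=TRUE
exec elk
Save it as, say, run_elk.sh, make it executable (chmod +x run_elk.sh), and call it with something like
mpirun -np 4 -pernode ./run_elk.sh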
You may also want to consider the iterative diagonalization (tseqit = .true.) to get some speedup in the first variational calculation. However, that's not so interesting if you need second variation, i.e., with magnetism and/or spin-orbit coupling.
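In elk.in that is just a block of the form (check the exact variable name against the manual for your Elk version):
tseqit
 .true.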
Good luck!
Last edit: Markus 2013-07-09
Thanks Markus. I assume I can just use export OMP_NESTED=TRUE in the command line, correct?
Do you have any suggestions for initial "x" and "y" values, based on my entry above?
Markus,
I'm finding that when I include export OMP_NESTED=TRUE in my submission file, the calculations are actually taking longer than when I do not include it. Does this make any sense?
Sean
Sean,
this is where my expertise ends. You may have to play with OMP_NUM_THREADS, OMP_THREAD_LIMIT, OMP_MAX_ACTIVE_LEVELS and other OpenMP variables. My guess is that your calculation spawns too many threads, which then block each other. Can you check this via top or something similar?
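If you want to experiment, a starting point might be something like this in the job script (the numbers are only placeholders to tune for your nodes):
export OMP_NESTED=TRUE
export OMP_NUM_THREADS=4       # threads per MPI process
export OMP_MAX_ACTIVE_LEVELS=2 # allow one level of nested parallelism
export OMP_THREAD_LIMIT=8      # hard upper limit on threads per process
mpirun -np 2 -pernode elk
Running top with the -H option (or htop) on a compute node should then show roughly np times OMP_NUM_THREADS busy threads, rather than far more threads competing for the same cores.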
I use a rather small cluster with 5 nodes / 30 cores, so I never have to use this kind of parallelism.
Markus
P.S.: Make sure you do not use HyperThreading. Elk does not benefit from it and actually runs slower with it.