Dear All.
I want to run Elk on a Slurm cluster, and I have compiled the Elk code with MPI. When I submit the job with "srun -N 1 -n 1 -c 24 ~/elk-6.3.2/src/elk elk.in", it runs successfully. However, when I
use "srun -N 2 -n 2 -c 24" to run on two nodes, the job fails with the following error report (here yhrun is equivalent to srun):
yhrun: error: slurm_receive_msg: Socket timed out on send/recv operation
yhrun: Job step creation temporarily disabled, retrying
yhrun: Job step created
forrtl: No such file or directory
forrtl: severe (28): CLOSE error, unit 95, file "Unknown"
Image          PC                Routine            Line     Source
elk            0000000002034CD8  for__io_return     Unknown  Unknown
elk            0000000002032BB6  for_close          Unknown  Unknown
elk            000000000043A42B  Unknown            Unknown  Unknown
elk            000000000043B936  Unknown            Unknown  Unknown
elk            000000000042971E  Unknown            Unknown  Unknown
libc-2.12.so   0000003921A1ED1D  __libc_start_main  Unknown  Unknown
elk            0000000000429629  Unknown            Unknown  Unknown
yhrun: error: cn10342: task 1: Exited with exit code 28
yhrun: First task exited 60s ago
yhrun: task 0: running
yhrun: task 1: exited abnormally
yhrun: Terminating job step 14634421.0
slurmd[cn10341]: STEP 14634421.0 KILLED AT 2020-05-11T21:50:57 WITH SIGNAL 9
yhrun: Job step aborted: Waiting up to 2 seconds for job step to finish.
yhrun: error: cn10341: task 0: Killed
Can someone point out what mistake I am making? Thank you very much!
Regards,
Y. Gu
Dear Y.Gu,
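From what I know, the elk executable does not handle MPI launching by itself; it has to be started through an MPI launcher.
I.e. it should be executed as "mpirun -np 2 elk elk.in" rather than just "./elk elk.in".
I recommend checking this post for more.
Thus, you should probably do something like:
srun -N 2 -n 2 -c 24 mpirun -np 2 ~/elk-6.3.2/src/elk elk.in
That, however, depends on how multinode / MPI jobs are properly submitted on that particular server; mpirun is just one possible option. Note that with "srun -n 2" in front, mpirun itself may be started twice, so on many Slurm systems mpirun is instead called from inside an sbatch or salloc allocation, or srun launches the MPI ranks directly.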
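To make the "depends on the server" part a bit more concrete, here is a rough sketch of a batch script for 2 MPI ranks with 24 OpenMP threads each. It assumes that the MPI library was built with Slurm support so that srun can start the ranks directly, and that Elk reads elk.in from the working directory. The ELK_BIN variable and the elk_launch_cmd helper are just names made up for this example, and on your cluster srun would be yhrun.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24

# Hypothetical variable: path to the MPI-enabled elk binary, adjust as needed.
ELK_BIN="${ELK_BIN:-$HOME/elk-6.3.2/src/elk}"

# One OpenMP team per MPI rank, sized to the cores Slurm assigned per task
# (falls back to 24 when run outside of Slurm).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-24}"

# Hypothetical helper: print the launch line (2 ranks, one per node).
elk_launch_cmd() {
    echo "srun --ntasks=${SLURM_NTASKS:-2} --cpus-per-task=${OMP_NUM_THREADS} ${ELK_BIN}"
}

# Launch only inside a Slurm job; otherwise just show what would be run.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Assumption: elk.in sits in the directory the job was submitted from.
    cd "${SLURM_SUBMIT_DIR:-.}" && $(elk_launch_cmd)
else
    elk_launch_cmd
fi
```

If srun cannot start MPI ranks on your system, the same script with the srun line swapped for the site's recommended mpirun call (inside the allocation, not wrapped in srun) would be the alternative to try.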
Also, just to be sure, check whether your 1-node task does indeed use all 24 threads. If it does, you've got the OpenMP part right.
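One possible way to do that check, assuming a Linux compute node where you can log in while the job is running: read the thread count of the elk process from /proc. The thread_count function is a made-up helper name, not part of Elk or Slurm.

```shell
# Count the threads of a running process by reading /proc (Linux only).
thread_count() {
    awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Look for the newest process named exactly "elk" owned by the current user.
pid="$(pgrep -n -u "$(id -u)" -x elk 2>/dev/null || true)"
if [ -n "$pid" ]; then
    echo "elk (PID $pid) is running $(thread_count "$pid") threads"
fi
```

The number should be close to 24 once Elk has entered its parallel sections; a value of 1 would suggest that OMP_NUM_THREADS or the -c setting is not reaching the process.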
Good luck!
Andrew.