I installed the slurm-roll (for the first time) on a Dell cluster (5 compute nodes, each with 2 sockets and 12 cores per socket) yesterday. I'm running Rocks 6.2. The install went smoothly, and after the reboot and kickstart it seemed like everything would work correctly, but sinfo shows that the "CLUSTER" partition is down:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
DEBUG up infinite 0 n/a
CLUSTER up infinite 5 down compute-0-[0-4]
WHEEL up infinite 5 down compute-0-[0-4]
WHEEL up infinite 1 idle wintermute
I then found in /var/log/munge on the compute-0-0 node (and subsequently on the rest of the nodes) that it cannot find /var/run/munge/munge.socket.2. As root, if I run "tentakel sinfo" I get the following for each node:
### compute-0-0(stat: 0, dur(s): 2.32):
sinfo: error: If munged is up, restart with --num-threads=10
sinfo: error: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
sinfo: error: authentication: Munged communication error
slurm_load_partitions: Protocol authentication error
On the head node, /var/run/munge has munge.socket.2:
# ls -l /var/run/munge
total 4
-rw-r--r-- 1 munge munge 5 May 16 16:40 munged.pid
srwxrwxrwx 1 munge munge 0 May 16 16:40 munge.socket.2
For setup, I installed using the commands in the PDF that came with the latest release. I checked the compute nodes to see if munged is running (ps aux | grep munge), and it is not. I know nothing about munge; is there some setup I should have done to get munged running on the compute nodes?
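As an aside, a quick way to run the same check on all of the compute nodes at once (assuming the standard munge init script that ships with CentOS 6) is something like:
tentakel "service munge status"
and the log files under /var/log/munge on a node usually say why munged refused to start.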
Thanks for your help.
~Lee Harding
Please check /etc/passwd and /etc/group for the munge user and group on the head node and the compute nodes.
Regards
Werner
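One quick way to compare those entries between the head node and a compute node (hostname as in the sinfo output above) is, for example:
grep munge /etc/passwd /etc/group
ssh compute-0-0 grep munge /etc/passwd /etc/group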
They are both there, although the group on the compute nodes is "munge:x:399:" and on the head node it is "munge:x:402:". The munge entry in /etc/passwd on the compute nodes is
munge:x:399:399:Runs Uid 'N' Gid Emporium:/var/run/munge:/sbin/nologin
and on the head node it is
munge:x:402:402:Runs Uid 'N' Gid Emporium:/var/run/munge:/bin/sh
So it looks like there is a conflict in the group number. On the compute node, /var/log/munge/munged says:
2016-05-18 07:24:57 Notice: Running on "compute-0-3.local" (8.8.8.251)
2016-05-18 07:24:57 Info: PRNG seeded with 1024 bytes from "/dev/urandom"
2016-05-18 07:24:57 Error: Keyfile is insecure: "/etc/munge/munge.key" should be owned by uid=399
On compute-0-0, ls -l /etc/munge/munge.key gives:
-r-------- 1 402 402 1.0K May 18 07:24 /etc/munge/munge.key
On compute-0-0 I just did "chown 399:399 /etc/munge/munge.key", after which ls -l /etc/munge/munge.key gives:
-r-------- 1 munge munge 1.0K May 18 07:24 /etc/munge/munge.key
Then I rebooted, and after the reboot it goes back to 402. I've tried "rocks sync users", "rocks sync slurm", and /boot/kickstart/cluster-kickstart, and it still goes back to 402. Should I just change the gid on the head node to 399?
Hi,
the correct setting is uid and gid 399 on all nodes.
Try to correct this, and do not use munge from the EPEL repository.
Best regards
Werner
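As an aside, to see which repository the installed munge packages came from (and rule out the EPEL build mentioned above), something like this works on each node:
yum list installed "munge*"
The repository shows up in the last column of the output (for example "@epel" for packages installed from EPEL).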
Excellent. I have a user running in batch mode right now (since I don't have scheduling working), so it might be a few days before I can correct this. I will let you know if changing the gid on the headnode works.
Thanks so much for the quick replies!
I got munge and munge.socket.2 working with your advice (correcting the group from 402 to 399).
Just for reference, it took a bit to get it working (but it was all gid and uid related):
1. I changed the gid on the head node to 399 (groupmod -g 399 munge). Then I did "rocks sync users" and "rocks sync slurm". The munge directories on the compute nodes were still showing a uid and gid of 402.
2. I eventually realized that /etc/passwd now had "munge:399:402", so I changed this with "usermod -u 399 -g 399 munge", which corrected /etc/passwd. After that I ran "find / -user 402" and "find / -group 402" and used "chown 399:399" to correct all of the directories and files that still had the incorrect uid/gid.
3. I then did "service munge restart", then "rocks sync users" and "rocks sync slurm" (not sure if that was necessary), and then I kickstarted the nodes (also not sure if that was necessary). The head-node side of this is summarized below.
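For reference, those three steps boil down to something like the following on the head node (uid/gid 399 as above; the find/chown pattern here is just one way to do the cleanup in step 2):
groupmod -g 399 munge
usermod -u 399 -g 399 munge
find / -uid 402 -exec chown munge {} +
find / -gid 402 -exec chgrp munge {} +
service munge restart
rocks sync users
rocks sync slurm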
After all of that, sinfo showed the CLUSTER partition in an idle state. After trying to launch something using sbatch, the state now shows as "drain" and "scontrol show node" shows "Reason=Bad core count", but that is a new issue (I'll start another thread if I can't figure it out).
Werner, thanks for the help resolving this.
~Lee
Hi,
I really don't know why you are having trouble with the slurm roll.
If it's possible for you, reinstall the head node with only the rolls that are tested (see the documentation).
Then install the slurm-roll and install the compute nodes.
If everything runs fine, you can then install the additional rolls that you need (but not the area51 roll) and reinstall the compute nodes.
I think that is the fastest way (about 2-3 hours).
Best regards
Werner
Werner,
I have it up and running now, without having to reinstall the head node. I'm not sure how I got the "Bad Core Count", but I just reset the node state by using the following commands:
scontrol
scontrol: update NodeName=compute-0-0 state=RESUME
And the CLUSTER partition went back to idle. After that, I just had to figure out how to limit jobs to 24 cores (instead of 48 threads, since the nodes have hyperthreading enabled), and I did that in the sbatch scripts using #SBATCH --ntasks-per-core.
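A minimal sbatch script along those lines might look like the following (the partition name comes from the sinfo output above, and ./my_program is just a placeholder):
#!/bin/bash
#SBATCH --partition=CLUSTER
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --ntasks-per-core=1
srun ./my_program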
I think the entire problem was just that somehow the munge uid:gid didn't sync correctly between the head node and the compute nodes.
The important thing is that, thanks to your slurm-roll, I now have scheduling on my cluster. Thanks so much!
Hi,
when all compute nodes are running, you can execute "rocks report slurm_hwinfo | sh".
This command will create the attribute slurm_hwinfo for each compute node, with a detailed node and CPU layout, for example:
rocks set host attr compute-0-0 attr=slurm_hwinfo value='Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1'
If you then run "rocks sync slurm", you will get a better description of the nodes in /etc/slurm/nodenames.conf, and you can use those settings with sbatch.
You can also manually lower the values in the slurm_hwinfo attribute.
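For the Dell nodes in this thread (2 sockets, 12 cores per socket, hyperthreading enabled), the resulting entry in /etc/slurm/nodenames.conf would presumably end up looking something like:
NodeName=compute-0-0 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2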
Best regards
Werner