I installed the slurm-roll (for the first time) on a Dell cluster (5 compute nodes, each with 2 sockets and 12 cores per socket) yesterday. I'm running Rocks 6.2. The install went smoothly, and after the reboot and kickstart it seemed like everything would work correctly, but sinfo shows that the "CLUSTER" partition is down:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
DEBUG up infinite 0 n/a
CLUSTER up infinite 5 down compute-0-[0-4]
WHEEL up infinite 5 down compute-0-[0-4]
WHEEL up infinite 1 idle wintermute
I then found in /var/log/munge on the compute-0-0 node (and subsequently on the rest of the nodes) that it cannot find /var/run/munge/munge.socket.2. As root, if I run "tentakel sinfo" I get the following for each node:
### compute-0-0(stat: 0, dur(s): 2.32):
sinfo: error: If munged is up, restart with --num-threads=10
sinfo: error: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
sinfo: error: authentication: Munged communication error
slurm_load_partitions: Protocol authentication error
On the head node, /var/run/munge has munge.socket.2:
# ls -l /var/run/munge
total 4
-rw-r--r-- 1 munge munge 5 May 16 16:40 munged.pid
srwxrwxrwx 1 munge munge 0 May 16 16:40 munge.socket.2
For setup, I installed using the commands in the PDF that came with the latest release. I checked the compute nodes to see if munged is running (ps aux | grep munge), and it is not. I know nothing about munge; is there some setup I should have done to get munged running on the compute nodes?
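As an aside, a quick way to run the same check on all of the compute nodes at once (assuming the standard munge init script that ships with CentOS 6) is something like:
tentakel "service munge status"
and the log files under /var/log/munge on a node usually say why munged refused to start.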
Thanks for your help.
~Lee Harding
Please check /etc/passwd and /etc/group for the munge user and group on the head node and the compute nodes.
Regards
Werner
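One quick way to compare those entries between the head node and a compute node (hostname as in the sinfo output above) is, for example:
grep munge /etc/passwd /etc/group
ssh compute-0-0 grep munge /etc/passwd /etc/group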
They are both there, although the group on the compute nodes is "munge:x:399:" and on the head node it is "munge:x:402:". The munge entry in /etc/passwd on the compute nodes is
munge:x:399:399:Runs Uid 'N' Gid Emporium:/var/run/munge:/sbin/nologin
and on the head node it is
munge:x:402:402:Runs Uid 'N' Gid Emporium:/var/run/munge:/bin/sh
So it looks like there is a conflict in the group number. On the compute node, /var/log/munge/munged says:
2016-05-18 07:24:57 Notice: Running on "compute-0-3.local" (8.8.8.251)
2016-05-18 07:24:57 Info: PRNG seeded with 1024 bytes from "/dev/urandom"
2016-05-18 07:24:57 Error: Keyfile is insecure: "/etc/munge/munge.key" should be owned by uid=399
On compute-0-0, ls -l /etc/munge/munge.key gives:
-r-------- 1 402 402 1.0K May 18 07:24 /etc/munge/munge.key
On compute-0-0 I just did "chown 399:399 /etc/munge/munge.key", after which ls -l /etc/munge/munge.key gives:
-r-------- 1 munge munge 1.0K May 18 07:24 /etc/munge/munge.key
Then I rebooted, and after the reboot it goes back to 402. I've tried "rocks sync users", "rocks sync slurm", and /boot/kickstart/cluster-kickstart, and it still goes back to 402. Should I just change the gid on the head node to 399?
Hi,
the correct setting is uid and gid 399 on all nodes.
Try to correct this, and do not use munge from the EPEL repository.
Best regards
Werner
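As an aside, to see which repository the installed munge packages came from (and rule out the EPEL build mentioned above), something like this works on each node:
yum list installed "munge*"
The repository shows up in the last column of the output (for example "@epel" for packages installed from EPEL).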
Excellent. I have a user running in batch mode right now (since I don't have scheduling working), so it might be a few days before I can correct this. I will let you know if changing the gid on the headnode works.
Thanks so much for the quick replies!
I got munge and munge.socket.2 working with your advice (correcting the group from 402 to 399).
Just for reference, it took a bit to get it working (but it was all gid and uid related):
1. I changed the gid on the head node to 399 (groupmod -g 399 munge). Then I did "rocks sync users" and "rocks sync slurm". The munge directories on the compute nodes were still showing a uid and gid of 402.
2. I eventually realized that /etc/passwd now had "munge:399:402", so I changed this with "usermod -u 399 -g 399 munge", which corrected /etc/passwd. After that I ran "find / -user 402" and "find / -group 402" and used "chown 399:399" to correct all of the directories and files that still had the incorrect uid/gid.
3. I then did "service munge restart", then "rocks sync users" and "rocks sync slurm" (not sure if that was necessary), and then I kickstarted the nodes (also not sure if that was necessary). The head-node side of this is summarized below.
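For reference, those three steps boil down to something like the following on the head node (uid/gid 399 as above; the find/chown pattern here is just one way to do the cleanup in step 2):
groupmod -g 399 munge
usermod -u 399 -g 399 munge
find / -uid 402 -exec chown munge {} +
find / -gid 402 -exec chgrp munge {} +
service munge restart
rocks sync users
rocks sync slurm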
After all of that, sinfo showed the CLUSTER partition in an idle state. After trying to launch something using sbatch, the state now shows as "drain" and "scontrol show node" shows "Reason=Bad core count", but that is a new issue (I'll start another thread if I can't figure it out).
Werner, thanks for the help resolving this.
~Lee
Hi,
I really don't know why you are having trouble with the slurm roll.
If it's possible for you, reinstall the head node with only the rolls that are tested (see the documentation).
Then install the slurm-roll and install the compute nodes.
If everything runs fine, you can then install the additional rolls that you need (but not the area51 roll) and reinstall the compute nodes.
I think that is the fastest way (about 2-3 hours).
Best regards
Werner
Werner,
I have it up and running now, without having to reinstall the head node. I'm not sure how I got the "Bad Core Count", but I just reset the node state by using the following commands:
scontrol
scontrol: update NodeName=compute-0-0 state=RESUME
And the CLUSTER partition went back to idle. After that, I just had to figure out how to limit jobs to 24 cores (instead of 48 threads, since the nodes have hyperthreading enabled), and I did that in the sbatch scripts using #SBATCH --ntasks-per-core.
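A minimal sbatch script along those lines might look like the following (the partition name comes from the sinfo output above, and ./my_program is just a placeholder):
#!/bin/bash
#SBATCH --partition=CLUSTER
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --ntasks-per-core=1
srun ./my_program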
I think the entire problem was just that somehow the munge uid:gid didn't sync correctly between the head node and the compute nodes.
The important thing is that, thanks to your slurm-roll, I now have scheduling on my cluster. Thanks so much!
Hi,
when all compute nodes are running, you can execute "rocks report slurm_hwinfo | sh".
This command will create the attribute slurm_hwinfo for each compute node, with a detailed node and CPU layout, for example:
rocks set host attr compute-0-0 attr=slurm_hwinfo value='Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1'
If you then run "rocks sync slurm", you will get a better description of the nodes in /etc/slurm/nodenames.conf, and you can use those settings with sbatch.
You can also manually lower the values in the slurm_hwinfo attribute.
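For the Dell nodes in this thread (2 sockets, 12 cores per socket, hyperthreading enabled), the resulting entry in /etc/slurm/nodenames.conf would presumably end up looking something like:
NodeName=compute-0-0 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2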
Best regards
Werner