
munged not running on compute nodes

Lee H.
2016-05-17
2016-05-25
  • Lee H.

    Lee H. - 2016-05-17

I installed the slurm-roll (for the first time) on a Dell cluster (5 compute nodes, each with 2 sockets, 12 cores each) yesterday. I'm running Rocks 6.2. The install went smoothly, and after reboot and kickstart it seemed like everything would work correctly, but sinfo shows that the "CLUSTER" partition is down:

    # sinfo
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    DEBUG up infinite 0 n/a
    CLUSTER up infinite 5 down compute-0-[0-4]
    WHEEL up infinite 5 down compute-0-[0-4]
    WHEEL up infinite 1 idle wintermute

I then found in /var/log/munge on the compute-0-0 node (and subsequently on the rest of the nodes) that it is not finding /var/run/munge/munge.socket.2. As root, if I run "tentakel sinfo" I get the following for each node:

    ### compute-0-0(stat: 0, dur(s): 2.32):
    sinfo: error: If munged is up, restart with --num-threads=10
    sinfo: error: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
    sinfo: error: authentication: Munged communication error
slurm_load_partitions: Protocol authentication error

    On the head node, /var/run/munge has munge.socket.2:
    # ls -l /var/run/munge
    total 4
    -rw-r--r-- 1 munge munge 5 May 16 16:40 munged.pid
    srwxrwxrwx 1 munge munge 0 May 16 16:40 munge.socket.2

For setup, I installed using the commands in the PDF that came with the latest release, and I have checked the compute nodes to see if munged is running (ps aux | grep munge); it is not. I know nothing about munge; is there some setup I should have done to get munged running on the compute nodes?
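For completeness, the sort of checks I ran on a compute node were roughly these (a sketch; the munged.log file name is the stock munge default and may differ here):

ssh compute-0-0 'ps aux | grep [m]unged'            # is the daemon running at all?
ssh compute-0-0 'ls -l /var/run/munge'              # does munge.socket.2 exist?
ssh compute-0-0 'tail /var/log/munge/munged.log'    # why did it fail to start?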

    Thanks for your help.

    ~Lee Harding

     
    • Werner Saar

      Werner Saar - 2016-05-18

      Please check /etc/passwd and /etc/group for munge user and group on the headnode and the compute nodes.
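A quick way to compare them side by side could be something like this (a sketch; compute-0-0 stands in for any compute node):

grep '^munge:' /etc/passwd /etc/group                      # on the headnode
ssh compute-0-0 "grep '^munge:' /etc/passwd /etc/group"    # on a compute node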

      Regards
      Werner

       
  • Lee H.

    Lee H. - 2016-05-18

They are both there, although the group on the compute nodes is "munge:x:399:" and on the head node it is "munge:x:402". The munge entry in /etc/passwd on the compute nodes is
munge:x:399:399:Runs Uid 'N' Gid Emporium:/var/run/munge:/sbin/nologin
and on the headnode it is
munge:x:402:402:Runs Uid 'N' Gid Emporium:/var/run/munge:/bin/sh

So it looks like there is a conflict in the group number. On the compute node, /var/log/munge/munged says:
    2016-05-18 07:24:57 Notice: Running on "compute-0-3.local" (8.8.8.251)
    2016-05-18 07:24:57 Info: PRNG seeded with 1024 bytes from "/dev/urandom"
    2016-05-18 07:24:57 Error: Keyfile is insecure: "/etc/munge/munge.key" should be owned by uid=399

On compute-0-0, "ls -l /etc/munge/munge.key" gives:
    -r-------- 1 402 402 1.0K May 18 07:24 /etc/munge/munge.key

On compute-0-0 I just did "chown 399:399 /etc/munge/munge.key"; ls -l /etc/munge/munge.key then gives:
    -r-------- 1 munge munge 1.0K May 18 07:24 /etc/munge/munge.key

I then rebooted, and after the reboot it goes back to 402. I've tried "rocks sync users", "rocks sync slurm", and /boot/kickstart/cluster-kickstart, and it still goes back to 402. Should I just change the gid on the head node to 399?

     
    • Werner Saar

      Werner Saar - 2016-05-19

      Hi,

      the correct setting is: 399 for uid and gid on all nodes.
      Try to correct this.
And do not use munge from the EPEL repository.
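A minimal sketch of what correcting the headnode could look like, assuming uid/gid 399 is still free there (the find step only matters for files left behind with the old ids):

getent passwd 399 && echo "uid 399 already taken"    # check before changing anything
getent group 399 && echo "gid 399 already taken"
groupmod -g 399 munge
usermod -u 399 -g 399 munge
find / -xdev \( -uid 402 -o -gid 402 \) -exec chown munge:munge {} +
service munge restart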

      Best regards
      Werner


       
  • Lee H.

    Lee H. - 2016-05-19

    Excellent. I have a user running in batch mode right now (since I don't have scheduling working), so it might be a few days before I can correct this. I will let you know if changing the gid on the headnode works.

    Thanks so much for the quick replies!

     
  • Lee H.

    Lee H. - 2016-05-23

I got munge (and munge.socket.2) working with your advice (correcting the group from 402 to 399).
Just for reference, it took a bit to get it working (but it was all gid and uid related):
1. I changed the gid of munge on the headnode to 399 (groupmod -g 399 munge). Then I did "rocks sync users" and "rocks sync slurm". The munge directories on the compute nodes were still showing a uid and gid of 402.
2. I eventually realized that /etc/passwd on the headnode now had "munge:399:402", so I changed this with "usermod -u 399 -g 399 munge". That corrected /etc/passwd. After that I did "find / -user 402" and "find / -group 402" and ran "chown 399:399" on the directories and files that still had the incorrect uid/gid.
3. I then did "service munge restart", then "rocks sync users" and "rocks sync slurm" (not sure if that was necessary), and then I kickstarted the nodes (not sure if that was necessary).

After all of that, sinfo showed the CLUSTER partition in an idle state. After trying to launch something with sbatch, the nodes now show the state "drain" and "scontrol show node" reports "Reason=Bad core count", but that is a new issue (I'll start another thread if I can't figure it out).
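For anyone who hits the same thing, a sketch of the commands that show the drain reason and clear it once the underlying cause is fixed (compute-0-0 stands in for each drained node):

scontrol show node compute-0-0 | grep -i reason
scontrol update NodeName=compute-0-0 State=RESUME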

    Werner, thanks for the help resolving this.

    ~Lee

     
    • Werner Saar

      Werner Saar - 2016-05-24

      Hi,

I really don't know why you have trouble with the slurm-roll.
If it's possible for you, reinstall the headnode with only the rolls
that are tested (see the documentation).
Then install the slurm-roll and install the compute nodes.
If all runs fine, you can then install the additional rolls that you need
(but not the roll area51) and reinstall the compute nodes.
I think that is the fastest way (about 2-3 hours).

      Best regards
      Werner


       
  • Lee H.

    Lee H. - 2016-05-24

    Werner,

    I have it up and running now, without having to reinstall the head node. I'm not sure how I got the "Bad Core Count", but I just reset the node status by using the following commands:

    scontrol

    scontrol: update NodeName=compute-0-0 state=RESUME

And the CLUSTER partition went back to idle. After that, I just had to figure out how to limit jobs to 24 cores (instead of 48 threads, since the nodes have hyperthreading enabled). I did that in the sbatch scripts using #SBATCH --ntasks-per-core.
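For example, a minimal batch script along those lines might look like this (a sketch; the partition name and ./my_program are placeholders for this cluster):

#!/bin/bash
#SBATCH --partition=CLUSTER
#SBATCH --ntasks=24
# at most one task per physical core, so the hyperthread siblings stay unused
#SBATCH --ntasks-per-core=1
srun ./my_program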

I think the entire problem was just that somehow the munge uid:gid didn't sync correctly between the head node and the compute nodes.

The important thing is that, thanks to your slurm-roll, I now have scheduling on my cluster. Thanks so much!

     
    • Werner Saar

      Werner Saar - 2016-05-25

      Hi,

when all compute nodes are running,
you can execute: rocks report slurm_hwinfo | sh
This command will create the attribute slurm_hwinfo for each compute node,
with the detailed node and cpu layout.

example:
rocks set host attr compute-0-0 attr=slurm_hwinfo value='Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1'

If you then run: rocks sync slurm
you will have a better description of the nodes in /etc/slurm/nodenames.conf
and you can use the settings with sbatch.
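With the example attribute above, the node entry in nodenames.conf should end up looking roughly like this (a sketch):

NodeName=compute-0-0 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1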

      You can also manually lower the values in the attribute slurm_hwinfo.

      Best regards
      Werner


       
