
Vanilla Rocks 6.2 / SLURM install test

Mark
2015-05-19
2015-07-22
  • Mark

    Mark - 2015-05-19

    I tried a vanilla ROCKS 6.2 install with the latest SLURM roll. I get the following in slurmctld.log, and nothing is working. I'm going to take a wild stab and guess that the same changes to the schema that broke the torque roll also broke slurm. The SQL table you are expecting is simply no longer there. Most of the *_attributes tables were removed last April.

    fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.

    https://github.com/rocksclusters/base/commit/ed19a154c09c8ffb481faafded204cb8cefd538b

     
  • Mark

    Mark - 2015-05-19

    Hmm, so much for that theory. Apparently you use /usr/bin/mysql, not the Rocks mysql. I can get into the database with the info in /etc/slurm/slurmdbd.conf, but I see nothing at all in the log file when the slurmdbd service starts.
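
    For the record, this is roughly how I poked at it (just a sketch; the Storage* field names come from my slurmdbd.conf, and the log path is whatever LogFile points at):

    # pull the DB credentials out of slurmdbd.conf (eval will choke if the password has shell metacharacters)
    eval $(awk -F= '/^Storage(User|Pass|Loc)=/ { print $1 "=" $2 }' /etc/slurm/slurmdbd.conf)
    /usr/bin/mysql -u "$StorageUser" -p"$StoragePass" "$StorageLoc" -e 'show tables;'
    # then restart the daemon and check its log
    service slurmdbd restart
    tail -n 50 /var/log/slurmdbd.log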

     
  • James

    James - 2015-05-19

    Try running the following to set up the DB:
    export CLUSTER=$(/opt/rocks/bin/rocks list attr|awk ' /Info_ClusterName:/ { print $2 }')
    sacctmgr -i create cluster $CLUSTER
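
    You can check that it took with something like the following (a rough sketch; adjust the service handling to your setup):

    sacctmgr list cluster                        # the new cluster should show up here
    scontrol show config | grep -i clustername   # what the controller thinks the cluster is called
    service slurmctld restart                    # so slurmctld picks up the new association data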

     
  • Werner Saar

    Werner Saar - 2015-05-19

    @mark, is the problem solved now, or do you need help?

     
  • Mark

    Mark - 2015-05-19

    That seems to work now. Any idea why the cluster name was not set at install time?

    When I "qsub -I" in slurm I remain on the head node. When we ran torque you would find yourself on the "MOM" node. Is this normal behavior, or is it configurable?

     
    • Werner Saar

      Werner Saar - 2015-05-20

      Hi,

      At the end of the install script there is a loop that tries to set the cluster name.
      But if the database is still busy or locked, setting the cluster name fails,
      and the script writes a warning message telling you to set the
      cluster name yourself.
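
      Roughly like this (a simplified sketch, not the actual code shipped in the roll):

      # retry a few times, because slurmdbd/mysql may not be ready yet at install time
      CLUSTER=$(/opt/rocks/bin/rocks list attr | awk '/Info_ClusterName:/ { print $2 }')
      for i in 1 2 3 4 5; do
          sacctmgr -i create cluster "$CLUSTER" && break
          sleep 10
      done
      # if registration still failed, warn the admin to run the command by hand
      sacctmgr -n list cluster | grep -q "$CLUSTER" || \
          echo "WARNING: please run: sacctmgr -i create cluster $CLUSTER"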

      Best Regards
      Werner


       
      • Mark

        Mark - 2015-05-20

        So unlike torque, perhaps the slurm roll should be installed after ROCKS has been installed. In my experience SGE and torque really can't be added after the fact; both have to be installed at the same time as ROCKS.
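
        For reference, adding a roll to a running frontend is roughly the usual procedure below (a sketch; check the slurm roll README for the exact steps it expects):

        rocks add roll slurm-*.iso
        rocks enable roll slurm
        (cd /export/rocks/install && rocks create distro)
        rocks run roll slurm | bash
        # then reinstall (or at least re-sync) the compute nodes so they pick up the roll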

         
    • Werner Saar

      Werner Saar - 2015-05-20

      With slurm, you always remain on the node where you started the batch job.
      There is no MOM node.
      You can install one or more login nodes to start batch jobs from.
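
      For an interactive shell on a compute node you can use srun or salloc, for example:

      srun --pty bash       # allocate resources and open a shell on a compute node
      salloc -N 1           # or reserve a node first ...
      srun hostname         # ... and run commands inside the allocation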


       
  • Mark

    Mark - 2015-05-20

    Am I remembering correctly that once a node is allocated to a job, no other users can ssh to that node? This is from when I looked at slurm over a year ago, and I remember being surprised that slurm affects access control once a job has been started. The question is what happens when two different people each have some of the CPU cores on a node running jobs, since our new nodes are 64 cores each. Or am I remembering this wrong?

    We plan on having 2 or more "login" nodes and limiting access to the head node.

     
    • Werner Saar

      Werner Saar - 2015-05-21

      Hi,

      a user can only ssh to a node where he has resources allocated (batch
      job or salloc).
      If two people have resources allocated on one node, both can ssh
      to this node.
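
      This is normally enforced through PAM on the compute nodes; if the roll sets it up, /etc/pam.d/sshd will contain a line similar to this (a sketch, the exact setup may differ):

      # /etc/pam.d/sshd on a compute node (excerpt)
      account    required     pam_slurm.so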

      Werner


       
  • Mark

    Mark - 2015-07-22

    Using the latest slurm roll on a ROCKS 6.2 install, you still appear to need "sacctmgr -i create cluster $CLUSTER" to get things working.

     
