I tried a vanilla ROCKS 6.2 install with the latest SLURM roll. I get the following in slurmctld.log, and nothing is working. I'm going to take a wild stab and guess that the same changes to the schema that broke the torque roll also broke slurm. The SQL table you are expecting is simply no longer there. Most of the *_attributes tables were removed last April.
fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.

https://github.com/rocksclusters/base/commit/ed19a154c09c8ffb481faafded204cb8cefd538b

Hmm, so much for that theory. Apparently you use /usr/bin/mysql, not the rocks mysql. I can get into the database with the info in /etc/slurm/slurmdbd.conf, but I see nothing at all in the log file when the slurmdbd service starts.
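For reference, if slurmdbd writes nothing to its log, two standard slurm commands (not roll-specific; the config path is the one mentioned above) will show where it is configured to log and what it is actually doing:

grep -i LogFile /etc/slurm/slurmdbd.conf
# run slurmdbd in the foreground with verbose logging to stdout
slurmdbd -D -vvv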
Try running the following to set up the DB:
export CLUSTER=$(/opt/rocks/bin/rocks list attr | awk '/Info_ClusterName:/ { print $2 }')
sacctmgr -i create cluster $CLUSTER
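Afterwards, a quick sanity check that the cluster record now exists (standard sacctmgr usage; -n just suppresses the header line):

sacctmgr -n list cluster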
@mark, is the problem solved now, or do you need help?

On 05/19/2015 07:30 PM, Mark wrote:

That seems to work now. Any idea why the cluster name was not set at install time?

When I "qsub -I" in slurm I remain on the head node. When we ran torque you would find yourself on the "MOM" node. Is this normal behavior, or is it configurable?

Hi,

at the end of the script there is a loop that tries to set the cluster name. But if the database is still busy or locked, setting the cluster name fails, and the script writes a warning that you should set the cluster name yourself.

Best Regards
Werner
So unlike torque, perhaps you should install the slurm roll after ROCKS has been installed. With SGE or torque you really can't add them after the fact; in my experience, both must be installed while ROCKS itself is installed.
With slurm, you always remain on the node where you started the batch job.
There is no MOM node.
You can install one or more login nodes to start batch jobs from.
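The slurm-native equivalent of torque's interactive "qsub -I" is an interactive srun, which does land your shell on an allocated compute node (the "qsub" available under slurm is presumably a compatibility wrapper, which behaves differently):

srun -N 1 --pty bash -i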
On 05/20/2015 07:35 PM, Mark wrote:

Am I remembering correctly that once a node is allocated to a job, none of the other users can ssh to that node? I remember this from when I looked at slurm over a year ago, and I was surprised that slurm affects access control once a job has started. The question is what happens when two different people each have some of the CPU cores running jobs, since our new nodes are 64 cores each. Or am I remembering this wrong?

We plan on having 2 or more "login" nodes and limiting access to the head node.
Hi,

a user can only ssh to a node where he has resources allocated (batch job or salloc).

If 2 people have resources allocated on one node, both can ssh to this node.

Werner
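For what it's worth, that restriction is normally enforced on the compute nodes through slurm's PAM module; whether the roll sets it up exactly this way is an assumption, but the usual entry in /etc/pam.d/sshd is:

# allow ssh only to users with an active allocation on this node (assumed setup)
account    required     pam_slurm.so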
Using the latest slurm roll on a ROCKS 6.2 install, you still appear to need "sacctmgr -i create cluster $CLUSTER" to get things working.
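If you need to script that workaround, here is a sketch that is safe to run repeatedly (it assumes the Info_ClusterName attribute is set, as above):

CLUSTER=$(/opt/rocks/bin/rocks list attr | awk '/Info_ClusterName:/ { print $2 }')
# only create the cluster record if it is not already there
sacctmgr -n list cluster | grep -qw "$CLUSTER" || sacctmgr -i create cluster "$CLUSTER"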