Hi,
I am trying to get Slurm working on Rocks 7.0.
The installation went fine.
Here is the error output from rocks report slurm_hwinfo:
slurmd: error while loading shared libraries: libltdl.so.7: cannot open shared object file: No such file or directory
slurmd: error while loading shared libraries: libltdl.so.7: cannot open shared object file: No such file or directory
slurmd: error while loading shared libraries: libltdl.so.7: cannot open shared object file: No such file or directory
The same error shows up when I try to sync Slurm:
rocks sync slurm
compute-0-1: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
pdsh@jcluster: compute-0-1: ssh exited with exit code 1
compute-0-2: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
pdsh@jcluster: compute-0-2: ssh exited with exit code 1
compute-0-0: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
pdsh@jcluster: compute-0-0: ssh exited with exit code 1
If I run "systemctl status slurmd.service" on compute-0-1, I get:
Starting Slurm node daemon...
Mar 14 18:14:27 compute-0-1.local slurmd[1774]: /usr/sbin/slurmd: error while loading shared libraries: libltdl.so.7: cannot open shared obje...irectory
Mar 14 18:14:27 compute-0-1.local systemd[1]: slurmd.service: control process exited, code=exited status=127
Mar 14 18:14:27 compute-0-1.local systemd[1]: Failed to start Slurm node daemon.
Mar 14 18:14:27 compute-0-1.local systemd[1]: Unit slurmd.service entered failed state.
Mar 14 18:14:27 compute-0-1.local systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
It looks like libltdl.so.7 is missing and that breaks everything.
However, find does locate it:
find / -name "libltdl.so*"
/usr/lib64/libltdl.so.7.3.0
/usr/lib64/libltdl.so.7
/opt/condor/lib/condor/libltdl.so.7
Update: on the compute nodes (e.g. compute-0-1), libltdl.so is only found in /opt/condor/lib/condor/libltdl.so.7
Any ideas?
Thanks
Last edit: Christophe Guilbert 2018-03-14
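To compare the frontend with a compute node, a check along these lines may help. It is only a sketch: find_soname is a hypothetical helper name, and the directories you pass to it are whatever you want to inspect (the ones in the note below are taken from the find output above, plus /lib64 as a guess).

```shell
#!/bin/sh
# Hypothetical helper (not part of Rocks or Slurm): print every directory
# from the argument list that contains the given library file name.
find_soname() {
    soname=$1
    shift
    for dir in "$@"; do
        # -e follows symlinks, so a dangling libltdl.so.7 link does not count
        if [ -e "$dir/$soname" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Running "find_soname libltdl.so.7 /usr/lib64 /lib64 /opt/condor/lib/condor" on a node lists the directories that really hold the library. If only the Condor directory is printed, the dynamic loader will most likely not find it for slurmd, since that directory is normally not on the default library search path.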
You have to configure the system so that slurmd loads the library from
/usr/lib64
Best regards
Werner
Hi,
I don't have Condor installed, so on my system there is
no such line:
/opt/condor/lib/condor/libltdl.so.7
You have to configure the system so that slurmd loads the library from
/usr/lib64
Best regards
Werner
Thanks for the answer, Werner, but how do you do that? It seems to me that the Slurm roll for Rocks should take care of it. Correct?
I think that the Condor roll should not ship
a shared library that is already present in the system.
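For anyone hitting the same error, one possible way to get libltdl.so.7 into /usr/lib64 on a compute node is sketched below. This is not confirmed anywhere in this thread. It assumes the library is simply not installed on the nodes and that, as on CentOS 7, it comes from the libtool-ltdl package. The LIBDIR and DRY_RUN variables and the function names exist only for this example. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: make sure libltdl.so.7 is available in the system library dir.
# Assumptions (not from the thread): the package name libtool-ltdl, and the
# LIBDIR / DRY_RUN variables, which are made up for this example.
LIBDIR=${LIBDIR:-/usr/lib64}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

ensure_libltdl() {
    if [ -e "$LIBDIR/libltdl.so.7" ]; then
        echo "libltdl.so.7 already present in $LIBDIR"
        return 0
    fi
    run yum -y install libtool-ltdl   # assumed package providing the library
    run ldconfig                      # refresh the dynamic loader cache
    run systemctl restart slurmd      # retry the daemon that failed to start
}
```

Calling ensure_libltdl on a node with DRY_RUN=0 would actually run the three commands. A manual install like this may not survive a node reinstall, so a lasting fix probably belongs in the node configuration or in the roll itself.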