Hello All
I have downloaded and installed LCM on my 35-node Beowulf Linux
cluster, but I can't get the tool to monitor any of the compute nodes.
Monitoring the headnode works just fine. In the main window, the
compute nodes all show yellow dots, although they all have the client
installed and running.
When I try to list running processes on the compute nodes, for instance,
I get messages similar to this for all of them:
"PARSLNXC1-n001
unable to connect to host"
From the command line I can ping the hostname and the IP address, and I
can rsh to the compute node.
So I am stuck on why LCM can't "see" the compute nodes. I even
copied the /etc/cluster.conf file from the headnode to all the compute
nodes, and that did not help.
Thanks for any help!
Sincerely
Dan Roberts
PS: I get these types of errors as well:
[nm@parslnxc1-a lcm]$ ./lcmexec -n parslnxc1-n001 -c ls
can't read "node_list": no such variable
while executing
"# Compiled -- no source code available
error "called a copy of a compiled script""
(procedure "do_work" line 1)
invoked from within
"# Compiled -- no source code available
error "called a copy of a compiled script""
invoked from within
"tbcload::bceval {
TclPro ByteCode 2 0 1.4 8.4
21 0 178 36 1 0 132 1 5 21 21 -1 -1
178
w0E<!-fSs!&-<<!,l4pv,TA9v=htt!M%?6#4;tl#/HW<!PiA=!?M1qvLl76,;yUN..."
(file "/usr/local/lcm/lcmexec/lib/application/lcmexec.tcl" line 4)
invoked from within
"source $startup"
(file "/usr/local/lcm/lcmexec/main.tcl" line 18)
[nm@parslnxc1-a lcm]$
Well, the second one is easy: you have a - inside your host name, which is confusing lcmexec. I don't have a good workaround for this; the best I can suggest is to change the names to something without special characters. Note that this is just a label for LCM; it has nothing to do with host resolution or any of the functions, which all work from the IP address.
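As an illustration of the renaming suggestion, here is a minimal Python sketch that derives a label with no special characters from an existing host name. The helper name lcm_safe_label is hypothetical, and the rule of keeping only letters and digits is an assumption; the thread does not say exactly which characters lcmexec accepts.

```python
import re


def lcm_safe_label(hostname: str) -> str:
    """Return the host name with every non-alphanumeric character removed.

    Assumption: plain letters and digits are safe as an LCM label.
    This only changes the label text; it does not touch DNS or /etc/hosts.
    """
    return re.sub(r"[^A-Za-z0-9]", "", hostname)


if __name__ == "__main__":
    for name in ("parslnxc1-n001", "parslnxc1-a"):
        print(name, "->", lcm_safe_label(name))
```

Since the label is not used for host resolution, the real host names can stay as they are; only the names given to LCM would change.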
Now for the client nodes. It sounds like lcmclient isn't running properly.
1) You don't need the cluster.conf file anywhere but on your management node.
2) When you install the client rpm (I assume you didn't put the full lcm package on every node, although that will work too), you need to either reboot or start the service with /etc/init.d/lcm start.
3) You can check /var/log/lcm/lcm.log on any of the nodes to see if there is any additional information about why they are not starting properly.
4) If you still think they are working, try to telnet to any node:
telnet <node IP> 60000
The node should respond with
OK
Try sending "status"
The node should respond with the cpu and network count
If that works, then we still have an additional problem, so let me know. In the meantime I will have a look at the code and see if I can figure out a graceful way of handling your - character.
Michael
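If no telnet client is handy, the port check in step 4 can be approximated with a short Python script. This is only a sketch: the port 60000, the "OK" greeting and the "status" command come from the post above, but the line framing (newline-terminated replies, CR LF after the command, as a telnet client would send) is an assumption, and probe_lcmclient is a hypothetical helper name.

```python
import socket


def _read_line(sock: socket.socket, limit: int = 4096) -> str:
    """Read bytes until a newline, end of stream, or the size limit."""
    data = bytearray()
    while len(data) < limit:
        chunk = sock.recv(1)
        if not chunk or chunk == b"\n":
            break
        data += chunk
    return data.decode("ascii", errors="replace").strip()


def probe_lcmclient(host: str, port: int = 60000, timeout: float = 5.0):
    """Connect, read the greeting, send "status", and read one reply line.

    Returns (greeting, status_reply). Raises OSError (for example
    ConnectionRefusedError or a timeout) if the node cannot be reached.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        greeting = _read_line(sock)
        sock.sendall(b"status\r\n")  # assumed line terminator
        status_reply = _read_line(sock)
    return greeting, status_reply
```

Called from the management node as probe_lcmclient("<node IP>"), a ConnectionRefusedError would suggest that nothing is listening on the port, which matches the case where lcmclient is not running.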
Thanks for the reply
I now see that lcm is NOT running on any of my compute nodes. Because /etc/init.d/lcm start echoed back "starting LCM daemons", I wrongly surmised the process had started correctly.
When I start the client by hand on a compute node I see this:
[root@parslnxc1-n001 lcm]# /usr/local/lcm/lcmclient
/usr/local/lcm/lcmclient: error while loading shared libraries: libstdc++-libc6.1-1.so.2: cannot open shared object file: No such file or directory
Offhand, do you know what I might be missing on the compute nodes?
Also, telnet is disabled on the compute nodes. I can only ssh or rsh to them from the headnode.
Thanks for the help!
Dan
You need the compat-libstdc++ libraries. It is probably compat-2004.<date>.rpm, depending on your distro.
I have noted this problem before and will add it to the rpm prerequisites for the next release.
Also, you can still use telnet to reach lcmclient. It isn't really a telnet session; you are just using the telnet client to attach to a TCP port, so it does not matter that the telnet service is disabled on the nodes. You can do the same with a mail server on port 25, for example.
Michael
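One way to see which shared libraries a binary such as /usr/local/lcm/lcmclient cannot resolve is the standard ldd tool, which marks unresolved dependencies as "not found". The following Python sketch wraps that check; the function names are hypothetical and it assumes ldd is on the PATH of the node where it is run.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reported as "not found"."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.startswith("not found"):
            missing.append(name)
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


if __name__ == "__main__":
    # Path taken from the thread; adjust for your installation.
    print(check_binary("/usr/local/lcm/lcmclient"))
```

Run on a compute node, an entry such as libstdc++-libc6.1-1.so.2 in the result would point at the missing compat package before the client is ever started.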
Hello
Once I installed the correct libraries, LCM worked correctly throughout my cluster.
Thanks!
Dan
Additional info:
On my headnode, where LCM is running without problems, I see this:
rpm -qf /usr/lib/libstdc++-libc6.1-1.so.2
file /usr/lib/libstdc++-libc6.1-1.so.2 is not owned by any package
So given what I have on the headnode and what I don't have on the compute node, as shown below, where does libstdc++-libc6.1-1.so.2 come from?
Dan
[root@parslnxc1-n001 lcm]# /usr/local/lcm/lcmclient
/usr/local/lcm/lcmclient: error while loading shared libraries: libstdc++-libc6.1-1.so.2: cannot open shared object file: No such file or directory
RPMs installed on the headnode:
rpm -qa | grep libstd
libstdc++-ssa-3.5ssa-0.20030801.47
compat-libstdc++-devel-7.3-2.96.128
libstdc++-3.2.3-34
libstdc++-devel-3.2.3-34
libstdc++-ssa-devel-3.5ssa-0.20030801.47
compat-libstdc++-7.3-2.96.128
RPMs installed on the compute node:
rpm -qa | grep libstd
libstdc++-3.2.3-34
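The two listings above can also be compared mechanically. Below is a minimal Python sketch that takes the text of `rpm -qa | grep libstd` from each machine and reports the packages present on the headnode but absent on the compute node. The function name is hypothetical; the package lists are copied from this post.

```python
from typing import List

HEADNODE = """\
libstdc++-ssa-3.5ssa-0.20030801.47
compat-libstdc++-devel-7.3-2.96.128
libstdc++-3.2.3-34
libstdc++-devel-3.2.3-34
libstdc++-ssa-devel-3.5ssa-0.20030801.47
compat-libstdc++-7.3-2.96.128
"""

COMPUTE_NODE = """\
libstdc++-3.2.3-34
"""


def packages_missing_on(target: str, reference: str) -> List[str]:
    """Return packages listed in reference but not in target, sorted."""
    ref = {line.strip() for line in reference.splitlines() if line.strip()}
    tgt = {line.strip() for line in target.splitlines() if line.strip()}
    return sorted(ref - tgt)


if __name__ == "__main__":
    for pkg in packages_missing_on(COMPUTE_NODE, HEADNODE):
        print(pkg)
```

Not every package in the difference is needed by lcmclient; the compat-libstdc++ entry is the one relevant to the loader error shown earlier.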
Try rpm -ql compat-libstdc++-7.3-2.96.128. When I search for compat-libstdc++-7.3-2.96.128 on the net, it shows libstdc++-libc6.1-1.so.2 as being provided but not included in the file list.
Michael