
LCM Not Connecting to Nodes

Dan
2006-02-24
2013-04-08
  • Dan

    Dan - 2006-02-24

    Hello All
    I have downloaded and installed LCM onto my 35 node beowulf linux
    cluster, but I can't get the tool to monitor any of the compute nodes.
    Monitoring the headnode works just fine.  When I go to the main window
    the compute nodes all have yellow dots although they all have the client
    installed and running.
    When I try to list running processes on the compute nodes, for instance, I
    get messages similar to this for all of them:
    "PARSLNXC1-n001
    unable to connect to host"

    From the command line I can ping the hostname and the IP address.  I can
    rsh to the compute node. 
    So I am kind of stuck on why LCM can't "see" the compute nodes.. I even
    copied the /etc/cluster.conf file from the headnode to all the compute
    nodes and that did not help..

    Thanks for any help!
    Sincerely
    Dan Roberts

    PS: I also get errors like this:
    [nm@parslnxc1-a lcm]$ ./lcmexec -n parslnxc1-n001 -c ls
    can't read "node_list": no such variable
        while executing
    "# Compiled -- no source code available
    error "called a copy of a compiled script""
        (procedure "do_work" line 1)
        invoked from within
    "# Compiled -- no source code available
    error "called a copy of a compiled script""
        invoked from within
    "tbcload::bceval {
    TclPro ByteCode 2 0 1.4 8.4
    21 0 178 36 1 0 132 1 5 21 21 -1 -1
    178
    w0E<!-fSs!&-<<!,l4pv,TA9v=htt!M%?6#4;tl#/HW<!PiA=!?M1qvLl76,;yUN..."
        (file "/usr/local/lcm/lcmexec/lib/application/lcmexec.tcl" line 4)
        invoked from within
    "source      $startup"
        (file "/usr/local/lcm/lcmexec/main.tcl" line 18)
    [nm@parslnxc1-a lcm]$

     
    • Michael England

      Michael England - 2006-02-25

      Well, the second one is an easy one: you have a - inside your host name, which is confusing lcmexec.  I don't have a good workaround for this; the best I can suggest is to change the names to something without special characters.  Note that this is just a label for LCM; it has nothing to do with host resolution or any of the functions.  They all work from the IP address.

      Now for the client nodes.  It sounds like lcmclient isn't running properly.
      1) You don't need the cluster.conf file anywhere but on your management node
      2) When you install the client rpm (I assume you didn't put the full lcm package on every node, although that will work too), you need to either reboot or start the service: /etc/init.d/lcm start
      3) You can check /var/log/lcm/lcm.log on any of the nodes to see if there is any additional information why they are not starting properly
      4) If you still think they are working, try to telnet to any node:
      telnet <node IP> 60000
      The node should respond with
      OK
      Try sending "status"
      The node should respond with the cpu and network count
      If that works then we still have an additional problem, let me know.  In the meantime I will have a look at the code and see if I can figure out a graceful way of handling your - character.

      Michael
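
The telnet check in step 4 can also be scripted, which helps when the telnet client is not available. The following is a minimal Python sketch, not part of LCM. It assumes the protocol is line-oriented (each message ends in a newline) and that the client sends its greeting as soon as the connection opens. The function name probe_lcmclient is hypothetical; only port 60000, the "OK" greeting and the "status" command come from this thread.

```python
import socket


def probe_lcmclient(host, port=60000, timeout=5.0):
    """Connect to the lcmclient TCP port, read its greeting, then send "status".

    Returns (greeting, status_reply) as stripped strings.
    Assumption: messages are newline-terminated; if they are not,
    readline() will block until the timeout expires.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Wrap the socket in a text-mode file object to read and write whole lines.
        with sock.makefile("rw", encoding="ascii", newline="\n") as stream:
            greeting = stream.readline().strip()      # expected to be "OK" per the thread
            stream.write("status\n")
            stream.flush()
            status_reply = stream.readline().strip()  # CPU and network count per the thread
            return greeting, status_reply
```

Calling probe_lcmclient("<node IP>") from the management node raises ConnectionRefusedError if nothing is listening on the port, which is what you would expect when lcmclient has not started.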

       
    • Dan

      Dan - 2006-02-27

      Thanks for the reply

      I now see that lcm is NOT running on any of my compute nodes.  Because /etc/init.d/lcm start echoed back "starting LCM daemons", I wrongly surmised that the process had started correctly.

      When I start the service by hand on a compute node I see this>>
      [root@parslnxc1-n001 lcm]# /usr/local/lcm/lcmclient
      /usr/local/lcm/lcmclient: error while loading shared libraries: libstdc++-libc6.1-1.so.2: cannot open shared object file: No such file or directory

      Offhand, do you know what I might be missing on the compute nodes?
      Also, telnet is disabled on the compute nodes.  I can only ssh or rsh to them from the headnode.
      Thanks for the help!
      Dan
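
A loader error like the one above can be checked without starting the daemon by asking ldd which shared libraries fail to resolve. The following is a minimal Python sketch, assuming ldd is installed on the node and prints unresolved entries in its usual "name => not found" form. The helper names missing_libraries and check_binary are hypothetical.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as "not found"."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path="/usr/local/lcm/lcmclient"):
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

Running check_binary() on a compute node and on the headnode, then comparing the two results, shows which libraries exist on only one side.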

       
    • Michael England

      Michael England - 2006-02-27

      You need the compat-libstdc++ libraries.  It is probably compat-2004.<date>.rpm, depending on your distro.

      I have noted this problem before and will make a change to the rpm pre-requisite for the next release.

      Also, you can telnet to the lcmclient.  Well, it isn't really telnet; you are just using the telnet program to attach to a TCP port.  You can do the same on a mail server by using port 25 (for example).

      Michael

       
      • Dan

        Dan - 2006-03-03

        Hello
        Once I installed the correct libraries, the LCM worked correctly throughout my cluster..
        Thanks!
        Dan

         
    • Dan

      Dan - 2006-02-27

      Additional info..
      On my headnode where LCM is running without problems..I see this>

      rpm -qf /usr/lib/libstdc++-libc6.1-1.so.2
      file /usr/lib/libstdc++-libc6.1-1.so.2 is not owned by any package

      So given what I have on the headnode and what I don't have on the compute node, as shown below, where does libstdc++-libc6.1-1.so.2 come from?

      Dan

      [root@parslnxc1-n001 lcm]# /usr/local/lcm/lcmclient
      /usr/local/lcm/lcmclient: error while loading shared libraries: libstdc++-libc6.1-1.so.2: cannot open shared object file: No such file or directory

       
    • Dan

      Dan - 2006-02-27

      RPMs installed on the headnode:
      rpm -qa | grep libstd
      libstdc++-ssa-3.5ssa-0.20030801.47
      compat-libstdc++-devel-7.3-2.96.128
      libstdc++-3.2.3-34
      libstdc++-devel-3.2.3-34
      libstdc++-ssa-devel-3.5ssa-0.20030801.47
      compat-libstdc++-7.3-2.96.128

      RPMs installed on the compute node:
      rpm -qa | grep libstd
      libstdc++-3.2.3-34
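
The two listings above can be compared mechanically to see what the compute node lacks. The following is a minimal Python sketch using the package names exactly as posted. The function name missing_packages is hypothetical, and on a real cluster the lists would come from the output of rpm -qa on each machine.

```python
def missing_packages(reference, candidate):
    """Return packages present in `reference` but absent from `candidate`, sorted."""
    return sorted(set(reference) - set(candidate))


# Output of "rpm -qa | grep libstd" on each machine, as posted in this thread.
headnode = [
    "libstdc++-ssa-3.5ssa-0.20030801.47",
    "compat-libstdc++-devel-7.3-2.96.128",
    "libstdc++-3.2.3-34",
    "libstdc++-devel-3.2.3-34",
    "libstdc++-ssa-devel-3.5ssa-0.20030801.47",
    "compat-libstdc++-7.3-2.96.128",
]
compute_node = ["libstdc++-3.2.3-34"]

for package in missing_packages(headnode, compute_node):
    print(package)
```

The compat-libstdc++ packages appear in the difference, which matches the suggestion earlier in the thread that the compat libraries are what the compute nodes are missing.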

       
    • Michael England

      Michael England - 2006-02-27

      Try rpm -ql compat-libstdc++-7.3-2.96.128.  When I search for compat-libstdc++-7.3-2.96.128 on the net, it shows libstdc++-libc6.1-1.so.2 as being provided but not included in the file list.

      Michael

       
