
jsd and jsh

Help
pete_G
2006-08-15
2013-04-24
  • pete_G

    pete_G - 2006-08-15

    Hi

    THX!! for this great distributed app!!

    Just installed 2.4 on a bunch of FreeBSD 6.1 i386 machines.

    all works well, and dsh is great!

    i do have a question on jsd and jsh

    Looking at the previous answers you gave, should jsd (and barrierd, if used) run ONLY on the "control" machine, or "host" as you call it?
    I assume the "slave nodes" only process the commands via rsh. We use rsh because this is a small CLOSED LAN that we use for processing.

    dsh works great for us, but there are also times when a set of many jobs needs to be distributed, each one going to the next available machine. We have 9 machines available as of now. The jobs may differ in the switches (parameters) we submit with each, i.e.:

    #!/bin/sh
    exec job 1 params a b c
    exec job 2 params d e f
    ..
    exec job 120 params x y z

    So I thought that jsh (with jsd) would handle this and give the NEXT job to the machine that JUST finished the job it was working on, and so on.

    At first I started jsd on the control machine and tried to use jsh just as I used dsh.. nothing comes back.

    Then I tried installing and starting jsd on the slave nodes.. nothing again.

    Can you clarify the intended usage?

    thanx again!!!!!
    --Pete

     
    • Tim Rightnour

      Tim Rightnour - 2006-08-16

      Basically, jsd tries to keep N machines running 1 job each constantly.  What you described is more or less the proper usage: run jsd on your master node, and then issue jsh commands from the master node.  jsd should be logging its output to syslog under the daemon facility.  You can also run jsd with the -d option; make sure syslog is capturing daemon.debug.  This will show you exactly what jsd is attempting to do.

      Here is a run from my machine:
      Aug 15 17:11:54 polaris jsd[6840]: Job Scheduling Daemon Started
      Aug 15 17:11:54 polaris jsd[6840]: No JSD_BENCH_CMD environment setting, assuming homogenus cluster.
      Aug 15 17:11:54 polaris jsd[6840]: Entering main loop
      Aug 15 17:12:02 polaris jsd[6840]: We have a connection
      Aug 15 17:12:02 polaris jsd[6840]: Someone wants a node
      Aug 15 17:12:02 polaris jsd[6840]: Handing out node alshain to a jsh process
      Aug 15 17:12:02 polaris jsd[6840]: Entering main loop
      Aug 15 17:12:12 polaris jsd[6840]: We have a connection
      Aug 15 17:12:12 polaris jsd[6840]: Someone wants to free a node
      Aug 15 17:12:12 polaris jsd[6840]: Entered free_node
      Aug 15 17:12:12 polaris jsd[6840]: accepted new connection
      Aug 15 17:12:12 polaris jsd[6840]: got node alshain from client
      Aug 15 17:12:12 polaris jsd[6840]: freeing node alshain
      Aug 15 17:12:12 polaris jsd[6840]: Entering main loop

      In the above, I ran:
      jsd -d
      jsh date
      and got back the date on "alshain"
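
For a job list like the one in the original question, one way this might look is a small loop that hands each job to its own jsh call, so that jsd picks the node each time.  This is only a sketch: it assumes a jsh call waits until jsd can give it a free node (not confirmed in this thread, so check the jsh manpage), and the job-file format, the JSH variable and the dispatch_jobs name are made up for illustration.

```shell
#!/bin/sh
# Sketch: pass each line of a job file to jsh so jsd chooses the node.
# JSH can be overridden (e.g. JSH=echo) to dry-run the loop without a cluster.
JSH="${JSH:-jsh}"

dispatch_jobs() {
    # $1: file with one command line per job (hypothetical format)
    while IFS= read -r job; do
        # skip blank lines and comment lines
        case "$job" in ''|'#'*) continue ;; esac
        # Start in the background so several jobs can be in flight at once.
        # ASSUMPTION: jsh blocks until jsd hands it a free node.
        # $job is left unquoted on purpose so it splits into command + args.
        $JSH $job &
    done < "$1"
    # wait for every background jsh to finish
    wait
}

# Usage (jobs.txt is a hypothetical file name):
#   dispatch_jobs jobs.txt
```

Running it first with JSH=echo prints the commands that would be issued, which is a cheap way to check the job file before involving jsd.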

      Also note that by default jsd and jsh communicate on ports 2001 and 2002.  Make sure these are not being used by something else; if they are, check the manpage for the environment settings or command-line options that change them.
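
A quick way to check whether something is already listening on those ports is to filter netstat output.  The helper below is a rough sketch, not part of the package: port_in_use is a made-up name, and it assumes TCP listening sockets and the usual `netstat -an` layout where the local address is the fourth column (host.port on BSD, host:port on Linux).

```shell
#!/bin/sh
# port_in_use PORT
# Reads `netstat -an` style output on stdin and succeeds (exit 0) if a
# LISTEN line has a local address whose port part equals PORT.
# ASSUMPTION: the local address is column 4, written host.port or host:port.
port_in_use() {
    awk -v port="$1" '
        /LISTEN/ {
            # split the local address on "." or ":" and keep the last piece
            n = split($4, parts, /[.:]/)
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage:
#   netstat -an | port_in_use 2001 && echo "port 2001 is taken"
#   netstat -an | port_in_use 2002 && echo "port 2002 is taken"
```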

       
    • Tim Rightnour

      Tim Rightnour - 2006-08-16

      Also, to clarify: jsd and barrierd can actually be run anywhere.  Each should only be running on one machine in your cluster, and if you want you can specify that machine with -h hostname on the command line when running jsh or barrier.  It's easiest to run them on the same node you issue commands from, because jsh and barrier default to contacting localhost.  In the case of barrier, more often than not you will end up wanting the -h option, because ideally barrier itself is run on the remote nodes, to synchronize them all together.  For example, a script like:

      #!/bin/sh
      date
      barrier -h master -s 5
      touch foo

      executed with:
      dsh -w node1,node2,node3,node4,node5 script.sh

      This will run date simultaneously on all nodes, then every node will wait for the others' date commands to complete, and only then touch foo.  Obviously this example is useless, but if "date" were replaced with something that took a long time, the use becomes more obvious.  (For example, every node could wait for one node to start a database, and then they would all start up applications which connect to it.)
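
The database case could be sketched along the same lines as the script above.  Everything named here is a placeholder: start_database and start_app stand in for whatever you actually run, DB_NODE and the run_node function are made up for illustration, and "master" with -s 5 is simply carried over from the example script.

```shell
#!/bin/sh
# Sketch only: one node starts a database, all nodes wait at a barrier,
# then every node starts an application that connects to it.
# start_database and start_app are HYPOTHETICAL commands you must supply.
DB_NODE="${DB_NODE:-node1}"      # assumption: node that hosts the database
BARRIER="${BARRIER:-barrier}"    # overridable so the flow can be dry-run

run_node() {
    # $1: name of the node this copy of the script is running on
    if [ "$1" = "$DB_NODE" ]; then
        start_database           # only the database node does this step
    fi
    # every node, the database node included, stops here until all arrive
    $BARRIER -h master -s 5
    start_app                    # reached only after the barrier releases
}

# Usage on each node, e.g. from a script started via dsh:
#   run_node "$(hostname -s)"
```

The point of the design is that the slow step sits before the barrier on one node only, so the other nodes spend that time waiting rather than failing to connect.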

       
