Hi
Thanks for this great distributed app!
Just installed 2.4 on a bunch of FreeBSD 6.1 i386 machines.
All works well, and dsh is great!
I do have a question about jsd and jsh.
Looking at the previous answers you gave, should jsd (and barrierd, if used) run ONLY on the "control" machine, or "host" as you call it?
I assume the "slave nodes" only process the commands via rsh. We use rsh because this is a small, CLOSED LAN that we use only for processing.
We can use dsh, and it works great, but there are also times when a set of many jobs needs to be distributed to the next available machine. We currently have 9 machines we can make use of, and the jobs may differ in the command-line switches we submit with each, i.e.:
#!/bin/sh
exec job 1 params a b c
exec job 2 params d e f
..
exec job 120 params x y z
So I thought that for this, jsh (jsd) would do it and give the NEXT job to the machine that JUST finished the job it was working on, and so on.
At first I started jsd on the control machine and tried to use jsh just as I used dsh... nothing comes back.
Then I tried installing and starting jsd on the slave nodes... nothing again.
Can you clarify how to use it?
Thanks again!
--Pete
Basically, jsd tries to keep N machines constantly running one job each. What you described is more or less the proper usage: run jsd on your master node, then issue jsh commands from the master node. jsd should be logging to the syslog daemon facility. You can also run jsd with the -d option and make sure syslog is capturing daemon.debug; this will show you exactly what jsd is attempting to do.
Here is a run from my machine:
Aug 15 17:11:54 polaris jsd[6840]: Job Scheduling Daemon Started
Aug 15 17:11:54 polaris jsd[6840]: No JSD_BENCH_CMD environment setting, assuming homogenus cluster.
Aug 15 17:11:54 polaris jsd[6840]: Entering main loop
Aug 15 17:12:02 polaris jsd[6840]: We have a connection
Aug 15 17:12:02 polaris jsd[6840]: Someone wants a node
Aug 15 17:12:02 polaris jsd[6840]: Handing out node alshain to a jsh process
Aug 15 17:12:02 polaris jsd[6840]: Entering main loop
Aug 15 17:12:12 polaris jsd[6840]: We have a connection
Aug 15 17:12:12 polaris jsd[6840]: Someone wants to free a node
Aug 15 17:12:12 polaris jsd[6840]: Entered free_node
Aug 15 17:12:12 polaris jsd[6840]: accepted new connection
Aug 15 17:12:12 polaris jsd[6840]: got node alshain from client
Aug 15 17:12:12 polaris jsd[6840]: freeing node alshain
Aug 15 17:12:12 polaris jsd[6840]: Entering main loop
In the above, I ran:
jsd -d
jsh date
and got back the date on "alshain"
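For a job list like Pete's, the single "jsh date" above can be scaled up by starting one jsh per job in the background and letting jsd choose the node for each. This is only a minimal sketch under assumptions not confirmed in this thread: that a jsh started while every node is busy simply waits until jsd frees one, and that each job fits on one line of simple space-separated arguments. The function name run_jobs, the JSH override variable and the jobs file are hypothetical, for illustration only.

```shell
#!/bin/sh
# Sketch: hand every line of a job file to jsh and let jsd pick the node.
# Assumptions (not confirmed by the ClusterIt docs): a jsh started while all
# nodes are busy waits until jsd frees one, and each line holds one command
# made of simple space-separated arguments (no quoting).
# run_jobs and JSH are hypothetical names; JSH lets you substitute a stub.

run_jobs() (
    set -f                       # keep * and ? in job lines from globbing locally
    jsh_cmd=${JSH:-jsh}
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;; # skip blank lines and comments
        esac
        # One background jsh per job; jsd decides which node runs it.
        $jsh_cmd $line </dev/null &
    done < "$1"
    wait                         # return only after every job has finished
)
```

Usage: put one command per line in a file (for example "job1 a b c") and run "run_jobs jobs.txt" on the master node where jsd is running. Leave out the exec from the lines in Pete's script: in a shell script, exec replaces the shell, so only the first job would ever run. The design choice here is to start all jsh processes at once and leave the queuing to jsd; with 120 jobs that means 120 waiting jsh processes on the master.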
Also note that by default jsd and jsh communicate on ports 2001 and 2002. Make sure nothing else is using these ports; if something is, check the manpage for the environment settings or command-line options that change them.
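One way to check whether anything is already listening on 2001 or 2002 is to scan the output of "netstat -an". The sketch below assumes the usual BSD and Linux netstat layouts, where the local address is the fourth column and ends in ".port" (BSD) or ":port" (Linux), and LISTEN is the last column; port_in_use and check_jsd_ports are made-up helper names.

```shell
#!/bin/sh
# Sketch: report whether something already LISTENs on jsd's default ports.
# Assumes "netstat -an" prints the local address in column 4, ending in
# .PORT (BSD) or :PORT (Linux), with LISTEN as the last column.
# port_in_use and check_jsd_ports are hypothetical helper names.

# Reads netstat output on stdin; exits 0 if port $1 has a listener.
port_in_use() {
    awk -v p="$1" '
        $NF == "LISTEN" {
            n = split($4, part, "[.:]")   # last piece is the port number
            if (part[n] == p) found = 1
        }
        END { exit !found }'
}

check_jsd_ports() {
    for port in 2001 2002; do
        if netstat -an | port_in_use "$port"; then
            echo "port $port is already in use"
        else
            echo "port $port looks free"
        fi
    done
}
```

Run check_jsd_ports on the node where jsd will run. On FreeBSD, "sockstat -l" lists the same listening sockets together with the process that owns each one, which helps identify what is in the way.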
Also, to clarify: jsd and barrierd can actually run on any machine, but each should run on only one machine in your cluster. You can point jsh or barrier at that machine with -h hostname on the command line. It's easiest to run the daemons on the same node you issue commands from, because jsh and barrier contact localhost by default. With barrier, though, you will more often than not want the -h option, because barrier itself is ideally run on the remote nodes to synchronize them with one another. For example, a script like:
#!/bin/sh
date
barrier -h master -s 5
touch foo
executed with:
dsh -w node1,node2,node3,node4,node5 script.sh
will run date simultaneously on all five nodes; each node then waits until every node has finished its date command, and only then touches foo. This example is obviously useless on its own, but the point becomes clearer if date is replaced with something that takes a long time. For example, every node could wait for one node to start a database, and then they would all start applications that connect to it.