I'm wondering if I'm going about this correctly.
No one has discussed this aspect of queue very much, so maybe I missed something important!
Typically, when using queue, I need to run many jobs (one per CPU) and harvest the results into a summary file when all the jobs complete. Each job takes a few input parameters, writes some results, and needs its own temporary work space. The number of workstations (about 3) is usually much smaller than the number of jobs (about 1000) in a batch.
How best to allocate these to queue? Perhaps,
while read each_problem
do
    queue --queue --batch -n -- solver "$each_problem"
done < problems.txt
Where:
problems.txt -- a file listing the problem input parameters, one problem per line
solver -- a script (with no standard input or output) that takes a single problem, builds a working directory on the remote host (under remotehost:/tmp), runs a calculation engine, deposits the result in a file in a shared directory (named something like localhost:./`hostname`$$), and then removes the /tmp directory; a sketch follows
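For concreteness, a sketch of what I mean by solver might look like this (untested; "engine" and the $HOME/results path are placeholders for the real calculation engine and the shared directory):

#!/bin/sh
# Sketch of the solver wrapper: one problem per invocation.
problem="$1"
workdir=/tmp/solver.$$                    # private scratch space on the remote host
mkdir "$workdir" || exit 1
cd "$workdir" || exit 1
engine "$problem" > result                # "engine" is a placeholder
cp result "$HOME/results/`hostname`.$$"   # deposit under a unique name
cd /
rm -rf "$workdir"                         # remove the /tmp work space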
I haven't tried this yet, because it leaves some problems open, e.g. how do I detect batch completion or node failure? I'm trying to avoid something like
(queue --queue --wait -n -- solver) &
and then wait for background processes, because the do loop might generate more stub processes than the OS can contain at once.
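Though if I did go that route, one way to bound the stub count might be to launch the stubs in throttled batches, something like (untested; MAXSTUBS is just an illustrative cap, not a queue parameter):

#!/bin/sh
MAXSTUBS=32                 # illustrative cap on concurrent stubs
running=0
while read each_problem
do
    queue --queue --wait -n -- solver "$each_problem" &
    running=$((running + 1))
    if [ "$running" -ge "$MAXSTUBS" ]; then
        wait                # block until this batch of stubs finishes
        running=0
    fi
done < problems.txt
wait                        # harvest the final partial batch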
Regards,
Mike White
Hi Mike,
I'm just starting to play with queue, so I may be off base
as well...
It occurred to me that a good way to fire off a large batch of jobs could be with GNU make's -j option. Your makefile could look something like:
SOLVER = queue -i -w -- solver
%.out: %.in
	$(SOLVER)
Then you could queue up all of your jobs to run 3 at a time with:
make -j 3 1.out 2.out 3.out 4.out ...
A downside of this approach is that you have to specify up front how many jobs you want running concurrently.
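If the host list lived in a file, the -j value could at least be derived instead of hard-coded; assuming a hypothetical hosts.txt with one host per line:

# hosts.txt is a made-up file name; "all" stands in for your real targets
make -j `wc -l < hosts.txt` all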
I've tried this approach and it seems to work, except that
I'm having (I think) unrelated sporadic problems with queue.
-- Trey
Trey,
Thanks for the clever tip with make -j!
I have been running a few experiments, with pretty good results. I let the makefile generate the names of the targets, and hand the solver its input and output file names as command-line arguments, with something like:
SOLVER = queue -i -w -- solver

# Each pending input (*.tbd) maps to a completed output (*.done).
list := $(patsubst %.tbd,%.done,$(wildcard *.tbd))

finally.dat: $(list)
	cat $(list) > finally.dat

%.done: %.tbd
	$(SOLVER) $< $@
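Invoked with something like:

make -j 3 finally.dat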
With make -j 3, and all 3 machines at a load under 0.10, the initial allocation is uneven: for example, two jobs might start on machine A, none on B, and one on C. Perhaps because "queue" looks at the one-minute load average, it doesn't anticipate quickly enough that a job just submitted to "queued" is about to raise the load by one.
However, things start to balance once one of the jobs on the doubly loaded machine completes. So this works pretty well as long as the number of submitted jobs is much larger than the number of machines.
I've had no luck so far with "queue" using the -r option or the -q option. I can sometimes see the command start on the remote machine, and it shows up in the process list with ps, but for some reason it doesn't run. The combination queue -i -w seems much more reliable.
Regards,
Mike