An issue with switcher (bug #1282422) revealed a rare
bug that causes NUM_PROCS to be set incorrectly when
ssh to a cluster node prints additional error messages.
In the concrete case, ssh returned something like:
switcher:mpi: Cannot find modulefile for lam-6.5.6 -- skipping
in addition to the output it was expected to print.
This led to NUM_PROCS being set to 3 on a dual-CPU
machine, which caused problems with batch jobs.
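To illustrate the failure mode, here is a minimal sketch, not the
actual scripts/post_install code. It assumes the CPU count is derived
by counting lines of ssh output that come from something like
"grep processor /proc/cpuinfo" on the node; the function names and the
sample output are hypothetical. Under that assumption, one stray line
on a dual-CPU node yields 3, while filtering for the expected line
format yields 2.

```python
import re


def count_cpus_naive(ssh_output: str) -> int:
    """Count every non-empty line; any stray message inflates the result."""
    return sum(1 for line in ssh_output.splitlines() if line.strip())


def count_cpus_filtered(ssh_output: str) -> int:
    """Count only lines that look like /proc/cpuinfo 'processor : N' entries."""
    pattern = re.compile(r"^processor\s*:\s*\d+\s*$")
    return sum(1 for line in ssh_output.splitlines() if pattern.match(line))


# Hypothetical ssh output from a dual-CPU node with one extra message.
sample = (
    "switcher:mpi: Cannot find modulefile for lam-6.5.6 -- skipping\n"
    "processor\t: 0\n"
    "processor\t: 1\n"
)

print(count_cpus_naive(sample))     # counts the stray message as a CPU
print(count_cpus_filtered(sample))  # ignores lines that are not CPU entries
```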
Logged In: YES
user_id=338721
Fixed in trunk (r3681).
Logged In: YES
user_id=11722
What is NUM_PROCS?
And who is using lam-6.5.6? That is *WAY* old and is not part of any
OSCAR that I'm aware of...? (The 6.5.x series is no longer supported)
How was this fixed?
Logged In: YES
user_id=338721
NUM_PROCS is the number of CPUs of a node. It is used in
the PBS configuration to set the number of virtual CPUs
(or number of processes which should be allowed to run on
a node). It is set in scripts/post_install.
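As a rough sketch of how such a value might end up in the PBS
configuration: a PBS/Torque nodes file lists each host with an
"np=<count>" attribute. The helper name and hostname below are
hypothetical, and this is not taken from the OSCAR scripts.

```python
def pbs_node_entry(hostname: str, num_procs: int) -> str:
    """Format one nodes-file line, e.g. 'oscarnode1 np=2'."""
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    return f"{hostname} np={num_procs}"


# Hypothetical dual-CPU node.
print(pbs_node_entry("oscarnode1", 2))
```

With a wrong NUM_PROCS of 3, the same line would advertise one more
slot than the node has CPUs.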
Forget about the error message; it is just an example to
show you the problem. Replace it with
MY_MPI_WITH_STRANGE_IB_DRIVER, if you want. And even if
you don't believe it, lam-6.5.X is frequently used by ISVs
because their codes are certified with it, and
recompilation and recertification take far too much
effort. No, I'm not requesting that OSCAR support it.