#720 NUM_PROCS wrong (sometimes)

Milestone: 5.0 (deprecated)
Status: open-fixed
Priority: 5
Updated: 2005-11-29
Created: 2005-09-06
Creator: Erich Focht
Private: No

An issue with switcher (bug #1282422) revealed a rare
bug: NUM_PROCS is set incorrectly when ssh to a cluster
node prints additional error messages.
In the case at hand, ssh returned something like:
switcher:mpi: Cannot find modulefile for lam-6.5.6 -- skipping
in addition to the output it was expected to print.
This led to NUM_PROCS being set to 3 on a dual-CPU
machine, which in turn caused problems with batch jobs.
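
The failure mode can be illustrated with a minimal sketch. The ticket does not say how scripts/post_install derives NUM_PROCS, so the code below only assumes, for illustration, that the value comes from counting lines of ssh output resembling the "processor" entries of /proc/cpuinfo. The function names and the sample text are hypothetical and are not taken from OSCAR.

```python
import re

# Hypothetical illustration only: the real logic in scripts/post_install
# is not shown in this ticket.

def naive_num_procs(ssh_output: str) -> int:
    """Count every non-empty line, so any stray message inflates the result."""
    return sum(1 for line in ssh_output.splitlines() if line.strip())


# Matches lines shaped like the "processor : N" entries of /proc/cpuinfo.
_PROCESSOR_LINE = re.compile(r"^processor\s*:\s*\d+\s*$")


def robust_num_procs(ssh_output: str) -> int:
    """Count only lines that look like 'processor' entries, ignoring the rest."""
    return sum(
        1 for line in ssh_output.splitlines() if _PROCESSOR_LINE.match(line)
    )


# Assumed sample: what ssh might return from a dual-CPU node when the
# switcher message is mixed into the expected output.
sample = (
    "switcher:mpi: Cannot find modulefile for lam-6.5.6 -- skipping\n"
    "processor\t: 0\n"
    "processor\t: 1\n"
)

print(naive_num_procs(sample))   # counts the switcher message as well
print(robust_num_procs(sample))  # counts only the processor entries
```

With this sample the naive count is 3 and the filtered count is 2, which matches the symptom described above. Filtering on the expected line shape is one possible remedy; it is not necessarily what was committed for this ticket.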

Discussion

  • Erich Focht

    Erich Focht - 2005-09-06

    Logged In: YES
    user_id=338721

    Fixed in trunk (r3681).

     
  • Erich Focht

    Erich Focht - 2005-09-06
    • assigned_to: nobody --> efocht
    • status: open --> open-fixed
     
  • Jeff Squyres

    Jeff Squyres - 2005-09-06
    • assigned_to: efocht --> nobody
    • status: open-fixed --> open
     
  • Jeff Squyres

    Jeff Squyres - 2005-09-06

    Logged In: YES
    user_id=11722

    What is NUM_PROCS?

    And who is using lam-6.5.6? That is *WAY* old and is not part of any
    OSCAR that I'm aware of...? (The 6.5.x series is no longer supported)

    How was this fixed?

     
  • Erich Focht

    Erich Focht - 2005-09-06
    • status: open --> open-fixed
     
  • Erich Focht

    Erich Focht - 2005-09-06

    Logged In: YES
    user_id=338721

    NUM_PROCS is the number of CPUs of a node. It is used in
    the PBS configuration to set the number of virtual CPUs
    (or number of processes which should be allowed to run on
    a node). It is set in scripts/post_install.

    Forget about the error message, it is just an example to
    show you the problem. Replace it with
    MY_MPI_WITH_STRANGE_IB_DRIVER, if you want. And: even if
    you don't believe it, lam-6.5.X is frequently used by ISVs
    because their codes are certified with it, and
    recompilation and new certification take far too much
    effort. No, I'm not requesting OSCAR to support that.

     
  • Erich Focht

    Erich Focht - 2005-09-06
    • assigned_to: nobody --> efocht
     
  • John

    John - 2005-11-29
    • milestone: 473437 --> 5.0 (deprecated)