Re: [LaunchMON-devel] Increasing the level of abstraction offered by LaunchMON

Hi all,

In a recent conference call, Dong Ahn, David and I discussed ways to
enhance LaunchMON's abstraction of the "launch a job and co-locate
these daemons" task. Ideally we'd like to see LaunchMON provide an
MPI-independent interface to the API user, at least for simple/common
cases. I've summarized our thoughts here; Dong suggested bouncing this
to the general distribution list because he thought it was worth
thinking through carefully.

From a tool writer's perspective, LaunchMON would ideally abstract
away enough of the RM to enable starting and attaching to jobs in the
majority of cases. Currently, it abstracts the RM details away when
attaching daemons to a running job: only the hostname and PID of the
RM process are necessary. However, when starting an MPI job with
co-hosted daemons, the path to the RM and a list of its arguments must
be provided. This pushes the RM dependency back onto the tool, which
then needs to discover the MPIs installed on the system, select the
correct one, work out whether it uses -np, -n or -proc to specify the
number of processes, and so on. Of course, RMs typically provide a vast
array of options, many of which are unique to a particular RM; we don't
expect LaunchMON to abstract away all the differences between RMs,
just the features they have in common and that are most often used. In
this case, the path of the target program, its arguments and the
number of processes might be enough.

To do this, LaunchMON could expose an additional, higher-level way to
start a job, requiring only the path to the target executable, its
arguments, the number of processes to launch and an additional array
of strings to pass to the RM as arguments. This is similar to the
level of abstraction DDT's GUI offers the user: the RM is detected at
install time, and after that most users simply choose their program
and the number of processes. If they wish to use specific RM features
not abstracted by the GUI, they can pass extra arguments directly to
the RM.

Note: I think it would be acceptable for LaunchMON not to attempt to
abstract away the handling of queuing systems, as this can be very
complex. Instead, the caller would be responsible for ensuring that,
where necessary, it runs under the batch system with the appropriate
configuration options set.

As mentioned, DDT already implements the above abstractions for a long
list of RMs, and this experience might be something we could
contribute to the LaunchMON project. From a purely technical
perspective, repackaging these abstractions into a reusable bundle
would not require an undue amount of effort.

Mark
Allinea Software