
#12 Attaching based on Cray ALPS apid in addition to aprun pid

open
Dong Ahn
3
2012-05-29
2012-05-29
Dong Ahn
No

On the Cray XT/XE/XK systems, a running app can be identified by either its 'apid' or the 'pid' of the aprun that started it (along with the node that the aprun is running on). STAT and STATGUI use the pid method. For STAT, our users often accidentally provide the apid instead of the pid. When they make that mistake, they get the sequence of error messages included below. Not surprisingly, the user has a hard time realizing their mistake after reading that.

I am sure that the call to alps_get_apid failed and returned a bad status, which no doubt started the stream of failure messages. Could something terser, yet to the point, be reported instead?

Now it might be possible to cope with this user error more transparently. If the STAT interface were "STAT <aprun pid | apid>", then the user could supply either.

To implement this, the code could try both the alps_get_apid(aprun_nid, aprun_pid) and the alps_get_appinfo(apid,...) routines with whatever the user provided:

alps_get_apid   alps_get_appinfo
-------------   ----------------
      F                 F           # Bogus user input (probably pid from wrong nid)
      F                 T           # It is an apid, proceed
      T                 F           # It is a pid, proceed
      T                 T           # "Almost" zero possibility: Error/warning

(Note: before writing any code, be aware that those two routines differ in their return code semantics.)
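
The table above could be sketched roughly as follows. This is only an illustration of the decision logic in Python, not STAT code: `IdKind`, `classify_id`, `terse_message`, `resolve`, and the two probe callables are hypothetical names. The probes are assumed to wrap alps_get_apid and alps_get_appinfo and to normalize their differing return codes to a plain True (lookup succeeded) or False (lookup failed).

```python
from enum import Enum
from typing import Callable


class IdKind(Enum):
    """Outcome of trying the user's number as both a pid and an apid."""
    BOGUS = "bogus"          # F F
    APID = "apid"            # F T
    PID = "pid"              # T F
    AMBIGUOUS = "ambiguous"  # T T


def classify_id(pid_lookup_ok: bool, apid_lookup_ok: bool) -> IdKind:
    """Map the two lookup results onto the four rows of the table."""
    if pid_lookup_ok and apid_lookup_ok:
        return IdKind.AMBIGUOUS
    if pid_lookup_ok:
        return IdKind.PID
    if apid_lookup_ok:
        return IdKind.APID
    return IdKind.BOGUS


def terse_message(value: int, kind: IdKind) -> str:
    """One short line per outcome (hypothetical wording)."""
    if kind is IdKind.BOGUS:
        return (f"{value} is neither an aprun pid on this node "
                "nor a known apid")
    if kind is IdKind.AMBIGUOUS:
        return f"{value} matches both an aprun pid and an apid"
    return f"treating {value} as {kind.value}"


def resolve(value: int,
            pid_probe: Callable[[int], bool],
            apid_probe: Callable[[int], bool]) -> IdKind:
    """Run both probes on the same input and classify the result.

    pid_probe and apid_probe are assumed wrappers that hide the
    different return code conventions of the two ALPS routines.
    """
    return classify_id(pid_probe(value), apid_probe(value))
```

With stub probes, `resolve(210267, lambda v: False, lambda v: True)` classifies the input as an apid, so the attach could proceed instead of failing; the BOGUS row is where a single terse message would replace the current cascade.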

Bob

Messy error message example run:
--------------------------------
purie-p1 nwprod/jobs> STAT 210267
Attaching to job launcher and launching tool daemons...
<May 21 16:11:36> <LMON FE API> (ERROR): the engine reported parse errors with its connect-back
<May 21 16:11:36> <LMON FE API> (ERROR): LMON_fe_acceptEngine failed
<May 21 16:11:36> <STAT_FrontEnd.C: 374> STAT returned error type STAT_LMON_ERROR: Failed to attach to job launcher and spawn daemons
<May 21 16:11:36> <STAT_FrontEnd.C: 241> STAT returned error type STAT_LMON_ERROR: Failed to attach and spawn daemons
<May 21 16:11:36> <STAT.C: 104> STAT returned error type STAT_LMON_ERROR: Failed to launch MRNet tree()
<May 21 16:11:36> <STAT_FrontEnd.C: 2432> STAT returned error type STAT_FILE_ERROR: Output directory not created. Performance results not written.
<May 21 16:11:36> <STAT_FrontEnd.C: 2539> STAT returned error type STAT_FILE_ERROR: Failed to dump performance results
<May 21 16:11:36> <STAT_FrontEnd.C: 2432> STAT returned error type STAT_FILE_ERROR: <May 21 16:11:36> Launchmon (ERROR): the engine deteced parsing errors.
Output directory not created. Performance results not written.
<May 21 16:11:36> <STAT_FrontEnd.C: 186> STAT returned error type STAT_FILE_ERROR: Failed to dump performance results
