Re: [Scalablecr-discuss] Question about using SCR with SLURM
From: Adam T. M. <mo...@ll...> - 2014-04-18 21:27:05
Hi Keita,

If the application calls SCR_Finalize, the library records this fact in the "halt" file, which is stored in the SCR_PREFIX directory. If the halt file specifies an active reason for halting, the SCR library will not restart the run. This message from your output indicates that's what is happening in this case:

    SCR: rank 0 on chama648: Job exiting: Reason: SCR_FINALIZE_CALLED.

This feature is by design, as it serves as a way for the user to specify that SCR should *not* restart the job. For example, if the application reaches the end of the simulation and exits, you don't want SCR to restart the job. Calling SCR_Finalize is one way for the user to signal to SCR that it should not restart the job.

This behavior is desirable when running real applications, but it gets in the way of testing. To work around this, you can delete the halt file using the scr_halt command, like so:

    scr_halt --remove <prefix_dir>

If you execute scr_halt from the prefix directory, you don't need to specify the directory as an argument. Also, you can shorten "--remove" to just "-r".

Let me know if that helps,
-Adam

Teranishi, Keita wrote:
>Hi,
>
>I have been trying to run an OpenMPI program with SCR on a PC cluster. This
>machine uses SLURM for batch scheduler. In the interactive session, I
>have been trying to run an example program test_api 3 times in the same
>batch session (I allocated 2 extra nodes). However, the second and third
>execution tries to pick up the same nodes that may have hanging processes
>(processes terminated without MPI_Finalize). I'd like to know if I have
>to put any environment variables or something I have to do with SLURM to
>pick up spare nodes. Please let me know.
>
>
>
>Here is the error message for the second and third execution.
>
>SCR ERROR: rank 0 on chama648: Failed to read username or jobname from
>environment, disabling logging @ scr.c:6451
>SCR: rank 0 on chama648: Job exiting: Reason: SCR_FINALIZE_CALLED.
>--------------------------------------------------------------------------
>mpirun has exited due to process rank 122 with PID 171448 on
>node chama1046 exiting without calling "finalize". This may
>have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>--------------------------------------------------------------------------
>
>
>SCR ERROR: rank 0 on chama648: Failed to read username or jobname from
>environment, disabling logging @ scr.c:6451
>SCR: rank 0 on chama648: Job exiting: Reason: SCR_FINALIZE_CALLED.
>--------------------------------------------------------------------------
>mpirun has exited due to process rank 11 with PID 148920 on
>node chama648 exiting without calling "finalize". This may
>have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>--------------------------------------------------------------------------
>
>--------------------------------------------------------------------------
>Keita Teranishi
>Principal Member of Technical Staff
>Scalable Modeling and Analysis Systems
>Sandia National Laboratories
>Livermore, CA 94551
>+1 (925) 294-3738
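
For repeated test runs inside one allocation, the workaround from Adam's reply (clear the halt file, then launch) could be wrapped in a small shell helper. This is only a sketch under assumptions: the function name run_with_cleared_halt and the variables HALT_CMD and PREFIX_DIR are hypothetical names chosen for illustration, and only "scr_halt --remove <prefix_dir>" comes from the message itself. The exit status of scr_halt is ignored because its behavior when no halt file exists is not stated in the thread.

```shell
# Hypothetical helper: clear the SCR halt file, then run the given command.
# HALT_CMD and PREFIX_DIR are illustrative names, not SCR settings.
HALT_CMD="${HALT_CMD:-scr_halt}"      # command named in the reply above
PREFIX_DIR="${PREFIX_DIR:-$PWD}"      # assumed to be the SCR prefix directory

run_with_cleared_halt() {
    # Remove any halt condition left by an earlier run, for example
    # SCR_FINALIZE_CALLED. The status is ignored on purpose (assumption:
    # a missing halt file should not block the run).
    "$HALT_CMD" --remove "$PREFIX_DIR" || true
    # Launch whatever command was passed as arguments.
    "$@"
}

# Possible usage (the launch line is an assumption; adjust to your site):
#   for i in 1 2 3; do run_with_cleared_halt mpirun ./test_api; done
```

Passing the launch command as arguments keeps the helper independent of the launcher, so the same function would work with mpirun or srun.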