Hi Arjun,
I heard through the grapevine that the MVAPICH team is helping you out. Let me know if they get stuck and you need more help from our end.
Kathryn
On Jan 27, 2014, at 8:56 PM, Arjun J Rao <rec...@gm...> wrote:
> I have write permissions on the location of the control directory. I set the control directory to /home/username/Control, and I own that directory. The same directory is set in the scr.conf file as CNTLDIR, as well as in the SCR_CNTL_BASE variable. I also have a /tmp and have write permissions on it (in fact, I have made my user the owner of that directory).
> My system runs Scientific Linux 6.4.
>
> On Tue, Jan 14, 2014 at 6:23 AM, Kathryn Mohror <ka...@ll...> wrote:
> Hi Arjun,
>
> Sorry for the late reply -- I took some vacation.
>
> Just as a first guess, did you make sure that you have write permissions on the location of the control directory? I believe (and the error message supports this) that SCR_CNTL_BASE can't be set in the environment, but has to be set in your SCR configuration file. It defaults to /tmp. Do you have a /tmp?
>
> Kathryn
>
> On Dec 25, 2013, at 8:59 PM, Arjun J Rao <rec...@gm...> wrote:
>
>> I understand that SCR was built to be used with custom application codes that
>> write their own checkpoints from within the application. However, MVAPICH2
>> claims to have integrated SCR such that the checkpoints written by BLCR can
>> be written to the parallel file system in a scalable manner later. However, I
>> am not currently writing out the checkpoints to a central location, but to a
>> local disk for testing.
>>
>> I first installed BLCR and SLURM. Then I installed MVAPICH2 with the
>> following options:
>> ./configure --enable-ckpt --with-scr --with-pm=no --with-pmi=slurm
>>
>> However, compiling a simple MPI program with mpicc and then running it
>> with srun yields the following errors:
>>
>> srun -N2 -n12 MPIExecutable
>> SCR v1.1-8 ABORT : rank 1 on machine2: Failed to create store descriptor
>> for control directory @
>> src/mpid/ch3/channels/common/src/scr/scr_storedesc.c:299
>> In: PMI_Abort(0,application called MPI_Abort(MPI_COMM_WORLD,0) - process 1)
>> SCR v1.1-8 ABORT : rank 0 on machine1: Failed to create store descriptor
>> for control directory @
>> src/mpid/ch3/channels/common/src/scr/scr_storedesc.c:299
>> In: PMI_Abort(0,application called MPI_Abort(MPI_COMM_WORLD,0) - process 0)
>> .
>> .
>> .
>> SCR v1.1-8 ERROR: rank 3 on machine1: SCR_CNTL_BASE cannot be set in the
>> environment or user configuration file, ignoring setting
>> SCR v1.1-8 ERROR: rank 1 on machine1: SCR_CNTL_BASE cannot be set in the
>> environment or user configuration file, ignoring setting
>> .
>> .
>> .
>> slurmd[machine1]: ***STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL
>> 9***
>> srun: Job step aborted: Waiting upto 2 seconds for job step to finish
>>
>> slurmd[machine1]: *** STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL
>> 9****
>> slurmd[machine2]: *** STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL
>> 9****
>> .
>> .
>> .
>>
>> There was actually a lot more output, but I've included only one instance
>> of each message type. I have set SCR_CNTL_BASE, SCR_RUNS,
>> SCR_CACHE_BASE, SCR_PREFIX and SCR_FLUSH. What could be wrong with the
>> configuration or the environment of SCR to yield such errors?
>> _______________________________________________
>> Scalablecr-discuss mailing list
>> Sca...@li...
>> https://lists.sourceforge.net/lists/listinfo/scalablecr-discuss
>
> _________________________________________________________________
> Kathryn Mohror, ka...@ll..., http://scalability.llnl.gov/
> Scalability Team @ Lawrence Livermore National Laboratory, Livermore, CA, USA
_________________________________________________________________
Kathryn Mohror, ka...@ll..., http://scalability.llnl.gov/
Scalability Team @ Lawrence Livermore National Laboratory, Livermore, CA, USA