Re: [Scalablecr-discuss] [EXTERNAL] Re: Question about an error message when writing large checkpoint files

Hi Adam,

After a careful investigation, I found that the test program can write two
8 Mbyte checkpoint files per process, but with 16 Mbyte files it fails at
the second checkpoint.

I just ran a 16-process MPI job to check the space in /tmp.  Each node has
512 Mbytes, of which only 66% is available, which translates to roughly 21
Mbytes per process. Is this why the 16 Mbyte checkpoint fails on the
second attempt?

Regards,
Keita

(Here is the output from df)
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/ram                512000    172280    339720  34% /ram
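
The arithmetic behind that estimate can be sketched as follows. This is only a rough check, not part of SCR or test_api: the function names are made up for illustration, it assumes all 16 processes share one node's RAM disk, that every checkpoint written so far stays in node-local storage, and it ignores any redundancy data SCR may store alongside the checkpoint files.

```python
import shutil


def free_kib(path: str) -> int:
    """Free space on the file system holding `path`, in KiB (like `df path`)."""
    return shutil.disk_usage(path).free // 1024


def per_process_budget(avail_kib: int, procs_per_node: int) -> float:
    """Node-local space available to each process, in MiB."""
    return avail_kib / procs_per_node / 1024


def checkpoints_that_fit(avail_kib: int, procs_per_node: int, ckpt_mib: float) -> int:
    """Number of checkpoint files of `ckpt_mib` MiB that fit in one process's share."""
    return int(per_process_budget(avail_kib, procs_per_node) // ckpt_mib)


if __name__ == "__main__":
    # "Available" column from the df output above, in 1K-blocks.
    avail_kib = 339720
    # Assumption: all 16 MPI processes run on the same node.
    procs = 16

    print(f"{per_process_budget(avail_kib, procs):.1f} MiB per process")  # 20.7
    for size_mib in (8, 16):
        n = checkpoints_that_fit(avail_kib, procs, size_mib)
        print(f"{size_mib} MiB files: {n} fit per process")  # 8 -> 2, 16 -> 1
```

Under these assumptions two 8 Mbyte files fit in each process's share but only one 16 Mbyte file does, which is consistent with the failure at the second checkpoint. On a compute node, `free_kib("/tmp")` can replace the hard-coded value.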

On 4/24/14, 11:16 AM, "Adam T. Moody" <mo...@ll...> wrote:

>Hi Keita,
>I agree.  It's most likely running out of space in node-local storage.
>In fact, I think the error is coming from the test_api code which only
>writes to node local storage.
>
>Is it failing on the first checkpoint?
>
>Do you have SCR configured to write to /tmp, and if so, can you check
>the size of /tmp on your compute nodes?  For example, log in to a
>compute node and run "df /tmp".
>-Adam
>
>
>Kathryn Mohror wrote:
>
>>Hi Keita,
>>
>>Sorry for the delay. I am not sure why that would be happening. Adam, do
>>you have an idea?
>>
>>It might be that you are running short of node-local storage. SCR will
>>first write to node-local storage and then to the PFS. However, 8 MB
>>seems a bit small to be causing that problem. Hopefully Adam will have
>>an idea.
>>
>>Kathryn
>>
>>On Apr 21, 2014, at 4:18 PM, Teranishi, Keita <kn...@sa...> wrote:
>>
>>  
>>
>>>Hi,
>>>
>>>I am still playing with test_api in the example directory and found that
>>>the code throws an error when I set the file size larger than 8 Mbytes.
>>>I'd like to know (1) what the root cause of this error is and (2) any
>>>possible ways to mitigate this problem.
>>>
>>>I set SCR_PREFIX to a scratch space (1 Pbyte) in the Lustre file
>>>system connected to the PC cluster. The rest of the parameters should be
>>>set to their defaults.
>>>
>>>1 on chama33: ERROR: Error writing: write(12, 0x2aaaba61000a, 13443078)
>>>errno=28 No space left on device @ test_common.c:86
>>>
>>>Thanks,
>>>---------------------------------------------------------------------------
>>>Keita Teranishi
>>>Principal Member of Technical Staff
>>>Scalable Modeling and Analysis Systems
>>>Sandia National Laboratories
>>>Livermore, CA 94551
>>>+1 (925) 294-3738
>>>
>>>_______________________________________________
>>>Scalablecr-discuss mailing list
>>>Sca...@li...
>>>https://lists.sourceforge.net/lists/listinfo/scalablecr-discuss
>>>    
>>>
>>
>>_________________________________________________________________
>>Kathryn Mohror, ka...@ll..., http://scalability.llnl.gov/
>>Scalability Team @ Lawrence Livermore National Laboratory, Livermore,
>>CA, USA
>>
>
