From: Cyrill B. <cyr...@ma...> - 2022-09-26 16:39:15
Hi,

First of all, thank you Maksym. I've dedicated some more time to this: I ran perf and looked into the memory allocation via the numa_maps under /proc/<PID>/numa_maps.

First, perf. I've recorded instructions as well as page faults. Recording instructions revealed that after restart considerably more time was spent in an MPI call, but most likely this is due to synchronization between the MPI ranks; the reason some MPI ranks are slower probably lies in the different memory allocation. In the picture below, the initial run before restart is on the left and the run after restart on the right.

[Image: perf record -e instructions output]

The idea behind looking at page faults was to perhaps find some irregularities after restart, since the "restart penalty" seems to scale with the speed of the filesystem used for restarting (not counting the restart itself, obviously) and with the number of tasks used. That is, execution is faster when I restart from a RAM disk than from the local SSD: not only the restart itself, which is obvious, but also the execution of the threads afterwards (less "restart penalty"). However, I could not find anything of interest. The high number of page faults within the mtcp_restart process was to be expected and should not influence the performance of the execution, since it should simply terminate after restoring the threads and memory, if I'm not mistaken. I've included the output anyway; again, the initial run is on the left and the restarted one on the right.

[Image: perf record page faults output]
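For anyone who wants to reproduce this, a minimal sketch of the kind of perf recording I mean; attaching to a single rank via its PID is just one way to do it, and the output file names are only placeholders, not my exact command lines:

    # record instructions (and, in a second run, page faults) of one rank;
    # stop each recording with Ctrl-C after the iterations of interest
    perf record -e instructions -p <PID> -o perf_instructions.data
    perf record -e page-faults  -p <PID> -o perf_pagefaults.data

    # inspect the recordings afterwards
    perf report -i perf_instructions.data
    perf report -i perf_pagefaults.data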
Last, I've looked at the memory allocation of the processes. I first pinned eight processes to one NUMA domain and eight to the other, and when checking the numa_maps of each process it was obvious that all of them had most of their memory mapped in their own NUMA domain. After restart I checked the numa_maps again, and all of the processes had most of their memory mapped in NUMA domain 0, which is probably the domain where the restart process started. This could be an explanation for the "restart penalty", and it could also explain why our nodes with AMD EPYC Rome CPUs seemed to be more affected than the older Haswell nodes, since the AMD CPUs have more NUMA domains and therefore a higher latency between the two domains that are furthest apart. However, this still does not explain why the penalty also seems to scale with the speed of the filesystem used for restarting. The picture below shows for each process how many of its pages are mapped in NUMA domain 0 and how many in NUMA domain 1; the output on the left is from before the restart and the one on the right from after the restart.

[Image: analysis of numa_maps]
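In case someone wants to reproduce the per-domain page counts: the N<node>=<pages> fields in /proc/<PID>/numa_maps can simply be summed per node, for example with a small shell snippet like the one below (the pgrep pattern 'bt\.' is only a placeholder for the benchmark binary):

    for pid in $(pgrep -f 'bt\.'); do    # 'bt\.' = placeholder for the benchmark binary
        echo "PID $pid"
        # sum the N<node>=<pages> fields of numa_maps per NUMA node
        awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^N[0-9]+=/) { split($i, kv, "="); pages[kv[1]] += kv[2] } }
             END { for (n in pages) print "  " n ": " pages[n] " pages" }' /proc/$pid/numa_maps
    done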
I wanted to publish my findings here in the hope that they might help someone else in the future. Sadly, I won't be able to dedicate any further time to this in the near future. However, I would still love to hear if someone finds a reason why the "restart penalty" seems to scale with the speed (or possibly latency?) of the filesystem used for restarting, or has anything else to add or new ideas.

Best regards,
Cyrill

On 01.09.22 19:18, Maksym Planeta wrote:
> Hi Cyrill,
>
> I remember having a similar-looking problem with CRIU [1]. Try to
> take a look at performance counters.
>
> [1] https://github.com/checkpoint-restore/criu/issues/1171
>
> On 8/29/22 09:47, Cyrill Burth wrote:
>> Hi,
>>
>> I was working the last few weeks with DMTCP and made some
>> performance benchmarks. For this I used the NPB 3.4.2 BT - MPI
>> benchmark [1] on the Taurus supercomputer at TU Dresden, always
>> with 16 MPI ranks and gzip disabled.
>>
>> I realized that if I restart an application from its checkpoint, it
>> (drastically) slows down compared to before the checkpoint. I will
>> refer to this phenomenon as the "restart penalty".
>>
>> Briefly, my methodology: I performed a checkpoint in the 20th
>> iteration. The time from the 21st to the last iteration of the
>> benchmark, measured before the restart, was between 25% and 45%
>> less than the same measurement taken after restarting from the
>> checkpoint in the 20th iteration. I verified this with the MPI
>> benchmark (25%-45% "restart penalty") as well as with the OpenMP
>> benchmark (a consistent 15% restart penalty), which is also
>> provided by NPB under [1]. I ran all tests multiple times on
>> multiple nodes and all of them yielded the same results. To compile
>> and run the benchmark I used the intel/2019b toolchain, since I had
>> some compatibility issues with newer versions.
>> I repeated the tests with application-initiated checkpointing as
>> well as with the "-i" option, without modifying the benchmark's
>> source code. Both yielded the same results.
>>
>> However, the reason I am contacting you is that I have not only
>> observed the behavior described above but also that the "restart
>> penalty" seems to scale with the speed of the used filesystem, at
>> least when using MPI. If I restart from our relatively slow local
>> SSDs, I see a "restart penalty" of roughly 45%; if I restart the
>> same checkpoint from a RAM disk, I only see a "restart penalty" of
>> 25%. This could only be seen when using the MPI version of the
>> benchmark; for the OpenMP version there was a "restart penalty" of
>> 15%, but it did not scale with the used filesystem.
>>
>> I was wondering if anyone could give me any insights that could
>> explain this behavior.
>>
>> The restart times themselves obviously go up when the slower
>> filesystem is used, but this was to be expected. However, it
>> appears rather odd that the performance after restart depends on
>> the filesystem used for the restart. Some further investigation
>> showed that every single iteration of the benchmark gets slowed
>> down; it is *not* the case that some iterations take significantly
>> longer than others. There were no further checkpoints taken except
>> for the very first one in the 20th iteration, from which I
>> restarted and which was excluded from the time measurements.
>>
>> Thank you very much in advance.
>>
>> Best regards,
>>
>> C. Burth
>>
>> [1] https://www.nas.nasa.gov/software/npb.html
>>
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmt...@li...
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum