From: Cyrill B. <cyr...@ma...> - 2022-09-26 16:39:15
Hi,

First of all, thank you Maksym. I've dedicated some more time to this: I ran perf and looked into the memory allocation via the numa_maps under /proc/<PID>/numa_maps.

First, perf. I've recorded instructions as well as page faults. Recording instructions revealed that after restart considerably more time was spent in an MPI call, but most likely this is due to synchronization between the MPI ranks; the reason some MPI ranks are slower probably lies in the different memory allocation. In the picture below, the initial run before restart is on the left and the run after restart on the right.

[Image: perf record -e instructions output]

The idea behind looking at page faults was to perhaps find some irregularities after restart, since the "restart penalty" seems to scale with the speed of the filesystem used for restarting (not counting the restart itself, obviously) and with the number of tasks used. That is, execution is faster when I restart from a RAM disk than from the local SSD: not only the restart itself, which is obvious, but also the execution of the threads afterwards (less "restart penalty"). However, I could not find anything of interest. The high number of page faults within the mtcp_restart process was to be expected and should not influence the performance of the execution, since it should simply terminate after restoring the threads and memory, if I'm not mistaken. I've included the output anyway; again, the initial run is on the left and the restarted one on the right.

[Image: perf record page faults output]
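For anyone who wants to reproduce this, a minimal sketch of the kind of perf recording I mean; attaching to a single rank via its PID is just one way to do it, and the output file names are only placeholders, not my exact command lines:

    # record instructions (and, in a second run, page faults) of one rank;
    # stop each recording with Ctrl-C after the iterations of interest
    perf record -e instructions -p <PID> -o perf_instructions.data
    perf record -e page-faults  -p <PID> -o perf_pagefaults.data

    # inspect the recordings afterwards
    perf report -i perf_instructions.data
    perf report -i perf_pagefaults.data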
Last, I've looked at the memory allocation of the processes. I first pinned eight processes to one NUMA domain and eight to the other, and when checking the numa_maps of each process it was obvious that all of them had most of their memory mapped in their own NUMA domain. After restart I checked the numa_maps again, and all of the processes had most of their memory mapped in NUMA domain 0, which is probably the domain where the restart process started. This could be an explanation for the "restart penalty", and it could also explain why our nodes with AMD EPYC Rome CPUs seemed to be more affected than the older Haswell nodes, since the AMD CPUs have more NUMA domains and therefore a higher latency between the two domains that are furthest apart. However, this still does not explain why the penalty also seems to scale with the speed of the filesystem used for restarting. The picture below shows for each process how many of its pages are mapped in NUMA domain 0 and how many in NUMA domain 1; the output on the left is from before the restart and the one on the right from after the restart.

[Image: analysis of numa_maps]
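In case someone wants to reproduce the per-domain page counts: the N<node>=<pages> fields in /proc/<PID>/numa_maps can simply be summed per node, for example with a small shell snippet like the one below (the pgrep pattern 'bt\.' is only a placeholder for the benchmark binary):

    for pid in $(pgrep -f 'bt\.'); do    # 'bt\.' = placeholder for the benchmark binary
        echo "PID $pid"
        # sum the N<node>=<pages> fields of numa_maps per NUMA node
        awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^N[0-9]+=/) { split($i, kv, "="); pages[kv[1]] += kv[2] } }
             END { for (n in pages) print "  " n ": " pages[n] " pages" }' /proc/$pid/numa_maps
    done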
I wanted to publish my findings here in the hope that they might help someone else in the future. Sadly, I won't be able to dedicate any further time to this in the near future. However, I would still love to hear if someone finds a reason why the "restart penalty" seems to scale with the speed (or possibly latency?) of the filesystem used for restarting, or has anything else to add or new ideas.

Best regards,
Cyrill

On 01.09.22 19:18, Maksym Planeta wrote:
> Hi Cyrill,
>
> I remember having a similar-looking problem with CRIU [1]. Try to
> take a look at performance counters.
>
> [1] https://github.com/checkpoint-restore/criu/issues/1171
>
> On 8/29/22 09:47, Cyrill Burth wrote:
>> Hi,
>>
>> I was working the last few weeks with DMTCP and made some
>> performance benchmarks. For this I used the NPB 3.4.2 BT - MPI
>> benchmark [1] on the Taurus supercomputer at TU Dresden, always
>> with 16 MPI ranks and gzip disabled.
>>
>> I realized that if I restart an application from its checkpoint, it
>> (drastically) slows down compared to before the checkpoint. I will
>> refer to this phenomenon as the "restart penalty".
>>
>> Briefly, my methodology: I performed a checkpoint in the 20th
>> iteration. The time from the 21st to the last iteration of the
>> benchmark, measured before the restart, was between 25% and 45%
>> less than the same measurement taken after restarting from the
>> checkpoint in the 20th iteration. I verified this with the MPI
>> benchmark (25%-45% "restart penalty") as well as with the OpenMP
>> benchmark (a consistent 15% restart penalty), which is also
>> provided by NPB under [1]. I ran all tests multiple times on
>> multiple nodes and all of them yielded the same results. To compile
>> and run the benchmark I used the intel/2019b toolchain, since I had
>> some compatibility issues with newer versions.
>> I repeated the tests with application-initiated checkpointing as
>> well as with the "-i" option, without modifying the benchmark's
>> source code. Both yielded the same results.
>>
>> However, the reason I am contacting you is that I have not only
>> observed the behavior described above but also that the "restart
>> penalty" seems to scale with the speed of the used filesystem, at
>> least when using MPI. If I restart from our relatively slow local
>> SSDs, I see a "restart penalty" of roughly 45%; if I restart the
>> same checkpoint from a RAM disk, I only see a "restart penalty" of
>> 25%. This could only be seen when using the MPI version of the
>> benchmark; for the OpenMP version there was a "restart penalty" of
>> 15%, but it did not scale with the used filesystem.
>>
>> I was wondering if anyone could give me any insights that could
>> explain this behavior.
>>
>> The restart times themselves obviously go up when the slower
>> filesystem is used, but this was to be expected. However, it
>> appears rather odd that the performance after restart depends on
>> the filesystem used for the restart. Some further investigation
>> showed that every single iteration of the benchmark gets slowed
>> down; it is *not* the case that some iterations take significantly
>> longer than others. There were no further checkpoints taken except
>> for the very first one in the 20th iteration, from which I
>> restarted and which was excluded from the time measurements.
>>
>> Thank you very much in advance.
>>
>> Best regards,
>>
>> C. Burth
>>
>> [1] https://www.nas.nasa.gov/software/npb.html
>>
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmt...@li...
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum