From: Rick K. <rk...@nc...> - 2007-09-04 21:30:28
Daniel, SourceForge email control again ungraciously bounced your message, so I
will forward it on to the list, as again there's a lot of interesting stuff here.
It sounds as though you've done some good investigative work and modifications -
thanks very much.

The results you report in memory consumption savings are indeed impressive: by my
calculations, you report a reduction in incremental memory usage for profiling of
over 95% for your test case. Apart from the specific results of the profiling with
your patch, it certainly motivates investigating a more memory-efficient approach.

As an aside, I'll comment that the flat address space model for the profile
buffers used in PerfSuite has some historical background from its development.
For the profil() and PAPI_sprofil() implementations, the underlying library
routines expect a flat address space that they manage/modify, and those were the
original mechanisms used. For the itimer and IA-64/Perfmon implementations added
later, there is no assumption of flat profiling buffers, but reusing the existing
infrastructure simplified things. As you demonstrate, though, nothing prevents a
different approach. I should also mention that PAPI provides a way for the caller
to install an overflow handler, but that has never been used in PerfSuite.

I think you're right that your experiments demonstrate several different issues.
The issue regarding unallocated thread-specific data is surprising; however, I'm a
bit suspicious of the behavior of both the setitimer() and profil() approaches in
a multithreaded situation - the problem being that POSIX does not guarantee that
signals are delivered to any specific thread. I wonder if that might have
something to do with it, although it could be any of a number of things.

I'll take a look at the modified source you've sent and try a few things out with
it on this end (it may take some time, unfortunately, but I'm very interested),
and get back to you directly. I'm not familiar with Ansys; it sounds as though
it's hybrid MPI/threads - is that correct? Or is that related to the MPI
implementation?

Thanks again for the experiments, work, and info - very nice stuff.

Rick

On Tue, 4 Sep 2007, Daniel Thomas wrote:

> Hi Rick,
>
> Attached are the modified files that implement a Btree mapping for
> profiling with PerfSuite.
>
> With a 4-CPU MPI Ansys job:
>   No profiling, total RSS memory : 1.746 GB
>   With current PerfSuite         : 2.569 GB
>   With new PerfSuite             : 1.768 GB
>
> BTW, Ansys exposes a serious psrun weakness in handling Ansys's
> complicated thread launch/exit: _ps_hwpc_profhandler is entered while
> there is no TSD data. I guess Ansys works not only with the MPI main
> process but also with some ghost Pthreads, and an itimer interrupt
> armed by the main process is actually delivered to a ghost thread
> before it has set up its thread-specific data. I fixed it by just
> ignoring the interrupt (return instead of exit); later the ghost
> thread finally gets its TSD data. This allows the run to complete and
> reports to be produced. Unfortunately, for some reason I haven't
> found, the reports miss the interesting data. So to profile Ansys I
> just wrote a "light_psrun" that runs ps_hwpc_init and ps_hwpc_start at
> MPI_Init and calls ps_hwpc_stop at MPI_Finalize. Then I got what I
> need. But, even though serious, this is another issue, so back to
> memory usage.
> I made the dynamic mapping the default, as "as is" PerfSuite cannot
> simply be used with the kind of jobs we run at SGI. To return to
> direct mapping, things must be compiled with
> -DPS_DIRECT_SAMPLES_ACCESS (so there is no Makefile change to get the
> dynamic mapping).
>
> ----------- hwpc.h:
> add some structures to handle the Btree mapping
>
> ----------- hwpc.c:
> malloc only space for a tree head instead of the full mapping space in
> values->samples[map] (if non-PAPI profiling). Note that
> ps_hwpc_sample_t is not changed and casts are made when it is used in
> the new context, so this stays compatible with PAPI profiling, which
> still uses the full mapping.
>
> ----------- profile.c:
> defines an inc_samples routine that does the Btree update; in
> _ps_hwpc_profhandler and _ps_perfmon_handler, call
>   inc_samples((ps_hwpc_short_sample_head_t *) samples[map], pcoff);
> instead of
>   samples[map][pcoff]++;
>
> ----------- hpc-xml.c and hpc-text.c:
> add routines and call them (if non-PAPI profiling) in order to walk
> the samples in address order instead of reading the mapping
> sequentially, and to retain non-zero samples only.
>
> Hope this helps. Please come back to me if you want further
> explanations.
>
> Thanks
> Daniel
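
To illustrate the kind of dynamic mapping described above, here is a minimal
sketch of the general idea behind an inc_samples()-style update: keep the
histogram in a binary tree keyed by PC offset and allocate a counter node only
the first time an offset is hit, instead of reserving one counter per possible
offset up front. The type and function names below (sample_node_t,
sample_tree_t, walk_samples) are illustrative assumptions, not the structures
actually added to hwpc.h.

    #include <stdlib.h>

    typedef struct sample_node {
        unsigned long       pcoff;   /* program-counter offset (tree key) */
        unsigned long       count;   /* number of samples at this offset  */
        struct sample_node *left, *right;
    } sample_node_t;

    typedef struct {
        sample_node_t *root;         /* the "tree head" kept per map      */
    } sample_tree_t;

    /* Increment the counter for pcoff, creating its node on first use.
     * Note: a version called from a signal handler should draw nodes
     * from a preallocated pool rather than calling calloc() here.      */
    static void inc_samples(sample_tree_t *tree, unsigned long pcoff)
    {
        sample_node_t **link = &tree->root;
        while (*link != NULL) {
            if (pcoff == (*link)->pcoff) {
                (*link)->count++;
                return;
            }
            link = (pcoff < (*link)->pcoff) ? &(*link)->left
                                            : &(*link)->right;
        }
        *link = calloc(1, sizeof(**link));
        if (*link != NULL) {
            (*link)->pcoff = pcoff;
            (*link)->count = 1;
        }
    }

    /* In-order walk: visits offsets in address order, and only non-zero
     * counters exist, which is what the report writers need.           */
    static void walk_samples(const sample_node_t *n,
                             void (*emit)(unsigned long pcoff,
                                          unsigned long count))
    {
        if (n == NULL) return;
        walk_samples(n->left, emit);
        emit(n->pcoff, n->count);
        walk_samples(n->right, emit);
    }

An in-order walk like walk_samples() is what lets report writers emit samples
in address order while touching only non-zero entries, matching the change
described for hpc-xml.c and hpc-text.c.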
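
And here is a minimal sketch of a "light_psrun"-style wrapper of the kind
Daniel describes, done here via the MPI profiling (PMPI) interface so the
application needs no source changes. The PMPI approach, the header name
"pshwpc.h", and the output-prefix argument to ps_hwpc_stop() are assumptions
for illustration; only ps_hwpc_init, ps_hwpc_start, and ps_hwpc_stop are taken
from the message, so check the installed PerfSuite headers for the exact
declarations.

    /* light_psrun.c - start PerfSuite measurement at MPI_Init and write
     * the report at MPI_Finalize, by wrapping the two MPI calls and
     * forwarding to the PMPI_ entry points.                            */
    #include <mpi.h>
    #include <stdio.h>
    #include "pshwpc.h"   /* assumed PerfSuite header name */

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);  /* real MPI initialization */
        if (rc == MPI_SUCCESS) {
            ps_hwpc_init();              /* set up counting/profiling */
            ps_hwpc_start();             /* start measurement */
        }
        return rc;
    }

    int MPI_Finalize(void)
    {
        int  rank = 0;
        char prefix[64];

        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* assumed: ps_hwpc_stop() takes an output-file prefix */
        snprintf(prefix, sizeof(prefix), "lightpsrun.%d", rank);
        ps_hwpc_stop(prefix);            /* one report per MPI rank */
        return PMPI_Finalize();
    }

Built as a shared object, something like this could be linked ahead of (or
LD_PRELOADed over) the MPI library, so the full psrun is not needed for this
case.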