From: Rick K. <rk...@nc...> - 2007-09-04 21:30:28
Daniel, SourceForge email control again ungraciously bounced your message, so I
will forward it on to the list, as again there's a lot of interesting stuff here.
It sounds as though you've done some good investigative work and modifications -
thanks very much.

The results you report in memory consumption savings are indeed impressive: by my
calculations, you report a reduction in incremental memory usage for profiling of
over 95% for your test case. Apart from the specific results of the profiling with
your patch, it certainly motivates investigating a more memory-efficient approach.

As an aside, I'll comment that the flat address space model for the profile
buffers used in PerfSuite has some historical background from its development.
For the profil() and PAPI_sprofil() implementations, the underlying library
routines expect a flat address space that they manage/modify, and those were the
original mechanisms used. For the itimer and IA-64/Perfmon implementations added
later, there is no assumption of flat profiling buffers, but reusing the existing
infrastructure simplified things. As you demonstrate, though, nothing prevents a
different approach. I should also mention that PAPI provides a way for the caller
to install an overflow handler, but that has never been used in PerfSuite.

I think you're right that your experiments demonstrate several different issues.
The issue regarding unallocated thread-specific data is surprising; however, I'm a
bit suspicious of the behavior of both the setitimer() and profil() approaches in
a multithreaded situation - the problem being that POSIX does not guarantee that
signals are delivered to any specific thread. I wonder if that might have
something to do with it, although it could be any of a number of things.

I'll take a look at the modified source you've sent and try a few things out with
it on this end (it may take some time, unfortunately, but I'm very interested),
and get back to you directly. I'm not familiar with Ansys; it sounds as though
it's hybrid MPI/threads - is that correct? Or is that related to the MPI
implementation?

Thanks again for the experiments, work, and info - very nice stuff.

Rick

On Tue, 4 Sep 2007, Daniel Thomas wrote:

> Hi Rick,
>
> Attached are the modified files that implement a Btree mapping for
> profiling with PerfSuite.
>
> With a 4-CPU MPI Ansys job:
>   No profiling, total RSS memory : 1.746 GB
>   With current PerfSuite         : 2.569 GB
>   With new PerfSuite             : 1.768 GB
>
> BTW, Ansys exposes a serious psrun weakness in handling Ansys's
> complicated thread launch/exit: _ps_hwpc_profhandler is entered while
> there is no TSD data. I guess Ansys works not only with the MPI main
> process but also with some ghost Pthreads, and an itimer interrupt
> armed by the main process is actually delivered to a ghost thread
> before it has set up its thread-specific data. I fixed it by just
> ignoring the interrupt (return instead of exit); later the ghost
> thread finally gets its TSD data. This allows the run to complete and
> reports to be produced. Unfortunately, for some reason I haven't
> found, the reports miss the interesting data. So to profile Ansys I
> just wrote a "light_psrun" that runs ps_hwpc_init and ps_hwpc_start at
> MPI_Init and calls ps_hwpc_stop at MPI_Finalize. Then I got what I
> need. But, even though serious, this is another issue, so back to
> memory usage.
> I made the dynamic mapping the default, as "as is" PerfSuite cannot
> simply be used with the kind of jobs we run at SGI. To return to
> direct mapping, things must be compiled with
> -DPS_DIRECT_SAMPLES_ACCESS (so there is no Makefile change to get the
> dynamic mapping).
>
> ----------- hwpc.h:
> add some structures to handle the Btree mapping
>
> ----------- hwpc.c:
> malloc only space for a tree head instead of the full mapping space in
> values->samples[map] (if non-PAPI profiling). Note that
> ps_hwpc_sample_t is not changed and casts are made when it is used in
> the new context, so this stays compatible with PAPI profiling, which
> still uses the full mapping.
>
> ----------- profile.c:
> defines an inc_samples routine that does the Btree update; in
> _ps_hwpc_profhandler and _ps_perfmon_handler, call
>   inc_samples((ps_hwpc_short_sample_head_t *) samples[map], pcoff);
> instead of
>   samples[map][pcoff]++;
>
> ----------- hpc-xml.c and hpc-text.c:
> add routines and call them (if non-PAPI profiling) in order to walk
> the samples in address order instead of reading the mapping
> sequentially, and to retain non-zero samples only.
>
> Hope this helps. Please come back to me if you want further
> explanations.
>
> Thanks
> Daniel
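
To illustrate the kind of dynamic mapping described above, here is a minimal
sketch of the general idea behind an inc_samples()-style update: keep the
histogram in a binary tree keyed by PC offset and allocate a counter node only
the first time an offset is hit, instead of reserving one counter per possible
offset up front. The type and function names below (sample_node_t,
sample_tree_t, walk_samples) are illustrative assumptions, not the structures
actually added to hwpc.h.

    #include <stdlib.h>

    typedef struct sample_node {
        unsigned long       pcoff;   /* program-counter offset (tree key) */
        unsigned long       count;   /* number of samples at this offset  */
        struct sample_node *left, *right;
    } sample_node_t;

    typedef struct {
        sample_node_t *root;         /* the "tree head" kept per map      */
    } sample_tree_t;

    /* Increment the counter for pcoff, creating its node on first use.
     * Note: a version called from a signal handler should draw nodes
     * from a preallocated pool rather than calling calloc() here.      */
    static void inc_samples(sample_tree_t *tree, unsigned long pcoff)
    {
        sample_node_t **link = &tree->root;
        while (*link != NULL) {
            if (pcoff == (*link)->pcoff) {
                (*link)->count++;
                return;
            }
            link = (pcoff < (*link)->pcoff) ? &(*link)->left
                                            : &(*link)->right;
        }
        *link = calloc(1, sizeof(**link));
        if (*link != NULL) {
            (*link)->pcoff = pcoff;
            (*link)->count = 1;
        }
    }

    /* In-order walk: visits offsets in address order, and only non-zero
     * counters exist, which is what the report writers need.           */
    static void walk_samples(const sample_node_t *n,
                             void (*emit)(unsigned long pcoff,
                                          unsigned long count))
    {
        if (n == NULL) return;
        walk_samples(n->left, emit);
        emit(n->pcoff, n->count);
        walk_samples(n->right, emit);
    }

An in-order walk like walk_samples() is what lets report writers emit samples
in address order while touching only non-zero entries, matching the change
described for hpc-xml.c and hpc-text.c.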
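
And here is a minimal sketch of a "light_psrun"-style wrapper of the kind
Daniel describes, done here via the MPI profiling (PMPI) interface so the
application needs no source changes. The PMPI approach, the header name
"pshwpc.h", and the output-prefix argument to ps_hwpc_stop() are assumptions
for illustration; only ps_hwpc_init, ps_hwpc_start, and ps_hwpc_stop are taken
from the message, so check the installed PerfSuite headers for the exact
declarations.

    /* light_psrun.c - start PerfSuite measurement at MPI_Init and write
     * the report at MPI_Finalize, by wrapping the two MPI calls and
     * forwarding to the PMPI_ entry points.                            */
    #include <mpi.h>
    #include <stdio.h>
    #include "pshwpc.h"   /* assumed PerfSuite header name */

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);  /* real MPI initialization */
        if (rc == MPI_SUCCESS) {
            ps_hwpc_init();              /* set up counting/profiling */
            ps_hwpc_start();             /* start measurement */
        }
        return rc;
    }

    int MPI_Finalize(void)
    {
        int  rank = 0;
        char prefix[64];

        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* assumed: ps_hwpc_stop() takes an output-file prefix */
        snprintf(prefix, sizeof(prefix), "lightpsrun.%d", rank);
        ps_hwpc_stop(prefix);            /* one report per MPI rank */
        return PMPI_Finalize();
    }

Built as a shared object, something like this could be linked ahead of (or
LD_PRELOADed over) the MPI library, so the full psrun is not needed for this
case.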