From: Ashley P. <as...@qu...> - 2005-07-21 15:08:18
Attachments:
qsw_v_read
Hi,
I've just been testing valgrind against our software (parallel MPI
libraries). The valgrind 3 branch just works(TM), which is a good
thing :) I have seen it running before but never quite as easily.
I do however have a couple of requests. One of the problems with
debugging parallel applications is that you tend to get drowned in
information, as everything is repeated N times (N is often in the
hundreds and may be several thousand).
I'm trying to write a Perl script to post-process the valgrind log files
to compress the output somewhat (there is precedent for this; I've got
some hierarchical stack trace generation code which is very similar).
A working cut of the code is attached. Basically it splits each error
report out from the log files and makes a record of which processes saw
that error. It then prints out the process list and the error for every
error encountered anywhere during the job. Because most errors are seen
on multiple processes you should see a big win in the amount of data the
user has to sift through. That's the plan anyway, and it does appear to
be working.
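A minimal sketch of that grouping step, written in Python rather than Perl and not taken from the attached script. It assumes each log line carries the usual ==<pid>== prefix and that reports are separated by lines that are empty once the prefix is stripped; the names split_reports and group_by_report are made up for illustration.

```python
import re

# Leading "==<pid>== " marker; stripped so reports from different pids compare equal.
PREFIX = re.compile(r"^==\d+== ?")

def split_reports(log_text):
    """Split one log into reports: runs of non-empty lines after prefix removal."""
    reports, current = [], []
    for line in log_text.splitlines():
        body = PREFIX.sub("", line).rstrip()
        if body:
            current.append(body)
        elif current:
            reports.append("\n".join(current))
            current = []
    if current:
        reports.append("\n".join(current))
    return reports

def group_by_report(logs):
    """Map each distinct report text to the sorted list of ranks that saw it.

    logs: dict of rank (int) -> full text of that rank's log.
    """
    seen = {}
    for rank in sorted(logs):
        for report in split_reports(logs[rank]):
            seen.setdefault(report, set()).add(rank)
    return {report: sorted(ranks) for report, ranks in seen.items()}
```

Because the report text is the dictionary key, a report that appears twice in one process is recorded only once, which is the same limitation described in the next paragraph.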
The downside to this approach is that you tend to lose the context of what
order errors occur in, and currently if you see the same error twice it only
gets reported once. There isn't much I can do about the first but the
second is something for me to work on. Current thinking is that it's going
to report errors in order for all errors that occurred in process 0,
then all errors that occurred in process 1 but not 0, and so on.
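One way to read that ordering, as a Python sketch under stated assumptions rather than what the script does today: walk the ranks in ascending order and emit each report the first time it is seen. The function name first_seen_order and the input shape (rank mapped to a list of report strings in log order) are hypothetical.

```python
def first_seen_order(reports_by_rank):
    """Return each distinct report once, ordered by lowest rank, then log order.

    reports_by_rank: dict of rank (int) -> list of report strings in the
    order they appeared in that rank's log.
    """
    ordered, seen = [], set()
    for rank in sorted(reports_by_rank):
        for report in reports_by_rank[rank]:
            if report not in seen:
                seen.add(report)
                ordered.append(report)
    return ordered
```

This keeps process 0's errors in their original order and appends only the errors that later ranks add, so ordering context is preserved for the lowest rank that saw each error.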
Typical (truncated) output from the script is:
----------------
0
----------------
ERROR SUMMARY: 52 errors from 17 contexts (suppressed: 25 from 1)
malloc/free: in use at exit: 1698968 bytes in 287 blocks.
malloc/free: 419 allocs, 132 frees, 1885866 bytes allocated.
For counts of detected errors, rerun with: -v
searching for pointers to 287 not-freed blocks.
checked 3165336 bytes.
----------------
[0-1,5,7]
----------------
Invalid read of size 4
at 0x1B9B42BA: elan_createBalancedTree (common/groupUtil.c:1228)
by 0x1B984463: _elan_groupInit (elan4/group.c:663)
by 0x1B9871A0: elan_groupInit (elan4/group.c:1320)
by 0x1B99E7DF: base_allGroupInit (common/base.c:474)
by 0x1B99E98F: base_allGroupAlloc (common/base.c:552)
by 0x1B99EEEA: elan_baseInit (common/base.c:778)
by 0x1B9109B3: MPID_Init (adi2init.c:317)
by 0x1B92A44F: MPIR_Init (initutil.c:170)
by 0x1B929632: MPI_Init (init.c:163)
by 0x8048B81: main (in /usr/lib/mpi/mpi_gnu/bin/mping)
Address 0x2CBD5128 is 0 bytes after a block of size 16 alloc'd
at 0x1B8FC9B9: malloc (vg_replace_malloc.c:149)
by 0x1B9B2D64: _elan_gscCreate (common/groupUtil.c:461)
by 0x1B9B27A4: _elan_groupGlCreate (common/groupUtil.c:130)
by 0x1B983C8C: _elan_groupInit (elan4/group.c:454)
by 0x1B9871A0: elan_groupInit (elan4/group.c:1320)
by 0x1B99E7DF: base_allGroupInit (common/base.c:474)
by 0x1B99E98F: base_allGroupAlloc (common/base.c:552)
by 0x1B99EEEA: elan_baseInit (common/base.c:778)
by 0x1B9109B3: MPID_Init (adi2init.c:317)
by 0x1B92A44F: MPIR_Init (initutil.c:170)
by 0x1B929632: MPI_Init (init.c:163)
by 0x8048B81: main (in /usr/lib/mpi/mpi_gnu/bin/mping)
----------------
[0-7]
----------------
Invalid read of size 4
at 0x1B9B4365: elan_createBalancedTree (common/groupUtil.c:1243)
by 0x1B984463: _elan_groupInit (elan4/group.c:663)
by 0x1B9871A0: elan_groupInit (elan4/group.c:1320)
by 0x1B99E7DF: base_allGroupInit (common/base.c:474)
by 0x1B99E98F: base_allGroupAlloc (common/base.c:552)
by 0x1B99EEEA: elan_baseInit (common/base.c:778)
by 0x1B9109B3: MPID_Init (adi2init.c:317)
by 0x1B92A44F: MPIR_Init (initutil.c:170)
by 0x1B929632: MPI_Init (init.c:163)
by 0x8048B81: main (in /usr/lib/mpi/mpi_gnu/bin/mping)
Address 0x2CBD5128 is 0 bytes after a block of size 16 alloc'd
at 0x1B8FC9B9: malloc (vg_replace_malloc.c:149)
by 0x1B9B2D64: _elan_gscCreate (common/groupUtil.c:461)
by 0x1B9B27A4: _elan_groupGlCreate (common/groupUtil.c:130)
by 0x1B983C8C: _elan_groupInit (elan4/group.c:454)
by 0x1B9871A0: elan_groupInit (elan4/group.c:1320)
by 0x1B99E7DF: base_allGroupInit (common/base.c:474)
by 0x1B99E98F: base_allGroupAlloc (common/base.c:552)
by 0x1B99EEEA: elan_baseInit (common/base.c:778)
by 0x1B9109B3: MPID_Init (adi2init.c:317)
by 0x1B92A44F: MPIR_Init (initutil.c:170)
by 0x1B929632: MPI_Init (init.c:163)
by 0x8048B81: main (in /usr/lib/mpi/mpi_gnu/bin/mping)
In my sample case (eight-process ping-pong), once passed through this
script the output is reduced from 5102 lines to 1070, and if I remove
the multiple headers/footers (leaving one of each) this is reduced
further to 899 lines.
stratumi:V> wc -l *
693 valgrind.out.0.pid21666
631 valgrind.out.1.pid21667
636 valgrind.out.2.pid23039
617 valgrind.out.3.pid23040
650 valgrind.out.4.pid26916
618 valgrind.out.5.pid26917
639 valgrind.out.6.pid21641
618 valgrind.out.7.pid21642
5102 total
stratumi:V> qsw_v_read valgri* | wc -l
1070
So far so good, it's work in progress but is showing promise.
I run my programs as:
prun -n8 valgrind --log-file=valgrind.out --log-file-qualifier=RMS_RANK mping
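With that command the log files come out named valgrind.out.<rank>.pid<pid>, as in the wc listing above. A small Python sketch of recovering the rank and pid from such a name follows. The function name parse_log_name is made up, and the pattern is inferred only from the eight filenames shown here, so treat it as an assumption rather than a documented naming rule.

```python
import re

# Pattern inferred from names like "valgrind.out.0.pid21666" in this thread.
LOG_NAME = re.compile(r"^valgrind\.out\.(\d+)\.pid(\d+)$")

def parse_log_name(name):
    """Return (rank, pid) for a matching log file name, or None otherwise."""
    match = LOG_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```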
Now for my requests:
1) Would it be possible to drop the .pid<pid> suffix if
--log-file-qualifier is set? There seems little point in having two
qualifiers, and when automating this process, having as little ambiguity
over filenames as possible seems a good thing.
2) Can you put the qualifier in the logfile somewhere, maybe not in
place of ==<pid>== but possibly in the header, something like:
==21666== My PID = 21666, parent PID = 21664, qualifier = 0.
==21666== Prog and args are:
==21666== /usr/lib/mpi/mpi_gnu/bin/mping
==21666== For more details, rerun with: -v
Whilst this isn't strictly necessary for what I've described (I
currently get it from the filenames), at some point I intend to try using
a listener process instead of files.
The Perl used is attached. Ignore the first three functions; they are
lifted from dshbak and just do the process list to spec conversion.
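For readers without the attachment, the list-to-spec conversion can be sketched in Python as below. This is not the dshbak code. The function name ranks_to_spec is hypothetical, and the bracket convention (brackets for more than one rank, a bare number for a single rank) is inferred from the sample output above.

```python
def ranks_to_spec(ranks):
    """Compress a collection of ranks into a spec such as "[0-1,5,7]"."""
    ranks = sorted(set(ranks))
    if not ranks:
        return ""
    parts = []
    start = prev = ranks[0]
    for rank in ranks[1:]:
        if rank == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = rank
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = rank
    # Flush the final run.
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    spec = ",".join(parts)
    return spec if len(ranks) == 1 else f"[{spec}]"
```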
Ashley,
From: Julian S. <js...@ac...> - 2005-07-23 09:26:45
Ashley
> 1) would it be possible to drop the .pid<pid> suffix if
> --log-file-qualifier is set, there seems little point in having two
> qualifiers and if automating this process having as little ambiguity
> over filenames as possible seems a good thing.
Yeh, I wondered about this myself. Done (r4228).
> 2) Can you put the qualifier in the logfile somewhere, maybe not in
> place of ==<pid>== but possibly in the header, something like:
>
> ==21666== My PID = 21666, parent PID = 21664, qualifier = 0.
> ==21666== Prog and args are:
> ==21666== /usr/lib/mpi/mpi_gnu/bin/mping
> ==21666== For more details, rerun with: -v
>
> Whilst this isn't strictly necessary for what I've described (I
> currently get it from the filenames) at some point I intend to try using
> a listener process instead of files.
Hmm, not enthusiastic about changing the output format.
J
From: Ashley P. <as...@qu...> - 2005-07-26 11:02:13
On Sat, 2005-07-23 at 10:28 +0100, Julian Seward wrote:
> Ashley
>
> > 1) would it be possible to drop the .pid<pid> suffix if
> > --log-file-qualifier is set, there seems little point in having two
> > qualifiers and if automating this process having as little ambiguity
> > over filenames as possible seems a good thing.
>
> Yeh, I wondered about this myself. Done (r4228).
Thank you. I have just noticed the --log-file-exactly option, which would
appear to do the same thing, although in practice it ignores the
--log-file-qualifier flag.
> > 2) Can you put the qualifier in the logfile somewhere, maybe not in
> > place of ==<pid>== but possibly in the header, something like:
> >
> > ==21666== My PID = 21666, parent PID = 21664, qualifier = 0.
> > ==21666== Prog and args are:
> > ==21666== /usr/lib/mpi/mpi_gnu/bin/mping
> > ==21666== For more details, rerun with: -v
> >
> > Whilst this isn't strictly necessary for what I've described (I
> > currently get it from the filenames) at some point I intend to try using
> > a listener process instead of files.
>
> Hmm, not enthusiastic about changing the output format.
That's an understandable position. However, given that this option is new,
changing the output to include it if specified isn't going to catch
anyone by surprise; the chances are that if you specify the option then
you care about what its value is. I can't think of a way of using a
valgrind-listener process without it. Bear in mind that in parallel jobs
it's not that uncommon for pids to be non-unique, and it's the process
rank that is universally used to identify specific processes across a job.
Alternative ways of achieving the same result are most welcome.
Ashley,