|
From: Julian S. <js...@ac...> - 2006-04-17 13:21:11
|
Hi Josef, I have been using kcachegrind to profile memcheck starting koffice (!) and this seems to work well. However, callgrind generates snapshot files of up to 75M, and kcachegrind takes > 30 mins of CPU time and > 450M of memory to load the file. Have you seen this kind of thing before? Can you have a look at it if I send you the file? J |
|
From: Josef W. <Jos...@gm...> - 2006-04-17 23:05:31
|
[I only saw that this went to the mailing list after replying in private...]
On Monday 17 April 2006 15:20, you wrote:
> I have been using kcachegrind to profile memcheck starting koffice (!)
Wow.
> and this seems to work well. However, callgrind generates
> snapshot files of up to 75M long
So that is really _huge_ for profile data. You need a huge amount of
executed code with profile data at instruction level. So I suspect you
are using --dump-instr=yes?
Still, I have some trouble understanding this. 75 MB can mean a lot of
memory consumption for KCachegrind. However, if the data is only about
annotating source or assembly, KCachegrind does some lazy construction
of data structures... which means: memory consumption depends more or
less on the function count, and not on the absolute number of cost
counters in the profile.
So I still do not understand it. You would need a lot of functions
(> 500 000), which is very unusual. I remember konqueror with around
25 000 unique functions in the profile... But you are "only" profiling
valgrind (but see below).
> , and kcachegrind takes > 30 mins
> CPU time and > 450M memory to load the file. Have you seen this
> kind of thing before?
No, never such large loading times. With some options, you can "force"
callgrind to create profiles with function counts in the million range
(by producing a different "function" for every call trace, eg. with
--separate-callers=30). This can lead to this huge amount of memory
consumption; but such profiles are really insane...
> Can you have a look at it if I send you
> the file?
Sure. If the function count is really huge, I can understand at least
the memory consumption. Can you check the number of functions in this
75 MB profile with "grep ^fn callgrind.out.XXX | wc" ?
> (Note: I just thought you might like to see this. I have been
> using callgrind/kcachegrind for self-hosting
Again: wow.
I never thought of using callgrind with self-hosting because there is
still this design problem that profile counters are never freed when
discarding code. You made me aware of this with PPC32, where dynamic
linking is done with self-modifying code. I suspect that valgrind
triggers this bug really badly when the translation cache is flushed
and reused for new instrumented code a few times (similarly for other
VMs like Java). Callgrind will never get rid of any counters, but will
present you some kind of superposition of counters for all the code
versions in the translation cache. Maybe this way you get the huge
function count?
BTW: how much memory is the callgrind process using itself?
I suspect that the best way to cope with this would be to handle calls
into anonymous memory (eg. the translation cache?) as ending at the same
(artificial) function, and to not do any cost counting for such code.
At least, KCachegrind cannot even do assembly annotation, as for
dynamically generated code there is no binary to run objdump against.
So there is no point in storing costs at instruction granularity for
this. Josef |
|
From: Julian S. <js...@ac...> - 2006-04-17 23:48:25
|
> So that is really _huge_ for profile data.
> You need a huge amount of executed code with profile data at instruction
> level. So I suspect you are using --dump-instr=yes?
No, only --simulate-cache=yes.
> So I still do not understand
> it. You would need a lot of functions (> 500 000), which is very
> unusual.
Ah .. but I'm profiling memcheck, and memcheck produces
code with many calls to MC_(helperc_{LOAD,STORE}V*). I know
that memcheck (in the inner valgrind) produces about 80000
translations, and so if each translation has eg 5 such calls
then there will be 400000 call points to these functions.
I suspect this is why the profile data is very large and why
kcachegrind is taking a long time.
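[Editor's sketch: the estimate above, as a trivially checkable calculation. The 80000 translations figure is Julian's observation; the 5 calls per translation is only his rough guess.]

```python
# Back-of-the-envelope check of the call-point estimate above.
translations = 80_000        # translations produced by the inner memcheck
calls_per_translation = 5    # rough guess at LOADV/STOREV helper calls each
call_points = translations * calls_per_translation
print(call_points)           # 400000
```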
> > kind of thing before?
>
> No, never such large loading times. With some options, you can "force"
> callgrind to create profiles with function counts in the million range
> (by producing a different "function" for every call trace, eg. with
> --separate-callers=30). This can lead to this huge amount of memory
> consumption; but such profiles are really insane...
>
> > Can you have a look at it if I send you
> > the file?
>
> Sure. If the function count is really huge, I can understand at least the
> memory consumption. Can you check the number of functions in this 75 MB
> profile with "grep ^fn callgrind.out.XXX | wc" ?
sewardj@suse10:~/VgTRUNK$ grep ^fn koffice-started.1 | wc
80660 160583 1891572
> Again: wow. I never thought of using callgrind with self-hosting
> because there is still this design problem that profile counters are never
> freed when discarding code.
Well, self-hosting does not work when the inner valgrind starts to
discard code, so I don't think this is a problem.
> BTW: how much memory is the callgrind process using itself?
After about 10 CPU hours it was at 1600M virtual and about 670M
resident, but my machine only has 1G of memory, so I had to stop it
(though first I did 'callgrind_control -d ...' (very cool btw)).
To stop it I did control-C for the application (kpresenter) and so
the outer valgrind (callgrind) finished normally, and produced a final
dump which is very small compared to the two intermediate ones I did:
-rw------- 1 sewardj users 55936636 2006-04-17 02:40 callgrind.out.24371.1
-rw------- 1 sewardj users 63330227 2006-04-17 13:32 callgrind.out.24371.2
-rw------- 1 sewardj users  1127604 2006-04-17 15:35 callgrind.out.24371
$ for f in callgrind.out.24371* ; do echo $f ; grep ^fn $f | wc ; done
callgrind.out.24371
1736 3070 36549
callgrind.out.24371.1
63941 127124 1483254
callgrind.out.24371.2
61833 123046 1467148
This confused me. I was even more confused to find that kcachegrind
still takes just as long to load the 1M file as the 63M file.
Anyway, after all this, (1) from the intermediate dumps I found that
MC_(realloc) might be a memcheck slow point in realloc-intensive code,
and (2) I restarted the run using cachegrind instead of callgrind, to
see if it can successfully complete the profiling run with the memory
I have available. So far it has been running 6.6 hours.
J
|
|
From: Josef W. <Jos...@gm...> - 2006-04-18 11:47:24
|
> > So I still do not understand
> > it. You would need a lot of functions (> 500 000), which is very
> > unusual.
>
> Ah .. but I'm profiling memcheck, and memcheck is producing
> code with many calls to MC_(helperc_{LOAD,STORE}V*. I know
> that memcheck (in the inner valgrind) produces about 80000
> translations and so if each translation has eg 5 such calls
> then there will be 400000 call points to these functions.
> I suspect this is why the profile data is very large and why
> kcachegrind is taking a long time.
>
> > Sure. If the function count is really huge, I can understand at least the
> > memory consumption. Can you check the number of functions in this 75 MB
> > profile with "grep ^fn callgrind.out.XXX | wc" ?
>
> sewardj@suse10:~/VgTRUNK$ grep ^fn koffice-started.1 | wc
> 80660 160583 1891572
OK, so there are around 80 thousand functions "detected" by callgrind (ie. targets
of "call" instructions). That is quite a lot, but I thought KCachegrind should
be able to cope with such a number, at least with 1 GB of physical memory...
> > Again: wow. I never thought of using callgrind with self-hosting
> > because there is still this design problem that profile counters are never
> > freed when discarding code.
>
> Well, self-hosting does not work when the inner valgrind starts to
> discard code, so I don't think this is a problem.
Ah, OK. So I suspect I should give this self-hosting a try myself.
> > BTW: how much memory is the callgrind process using itself?
>
> After about 10 CPU hours it was at 1600M virtual and about 670M
> resident, but my machine only has 1G memory I had to stop it
Memcheck seems to push the limits of callgrind ;-)
Perhaps there should be a way to limit the code areas
where callgrind stores full counters per client instruction.
Currently, even with no debug info, the counters are stored at
instruction granularity and only aggregated at dump time.
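[Editor's sketch of the dump-time aggregation described above: per-instruction counters rolled up into per-function totals. The data layout, function names, and addresses here are invented for illustration; callgrind's real data structures differ.]

```python
from collections import defaultdict

# Hypothetical per-instruction counters: (function, address) -> cost.
instr_costs = {
    ("MC_(helperc_LOADV4)", 0x8048A00): 120,
    ("MC_(helperc_LOADV4)", 0x8048A04): 120,
    ("main", 0x8048000): 7,
}

# At dump time, aggregate instruction-granularity counters per function.
fn_totals = defaultdict(int)
for (fn, _addr), cost in instr_costs.items():
    fn_totals[fn] += cost

print(dict(fn_totals))  # {'MC_(helperc_LOADV4)': 240, 'main': 7}
```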
> (so I
> first did 'callgrind_control -d ...' (very cool btw)).
I really would like to have a general way to interact with a
running valgrind tool. My current approach, which checks for the existence
of a command file every time the valgrind scheduler runs, is really
screwed (have you ever done "strace callgrind ..." ?).
Some mini-telnet server taking text commands would be cool.
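[Editor's sketch of the mini-telnet idea: a background thread answering one-line text commands over a socket. This is a hypothetical mock-up; the command names and replies are invented, and callgrind has no such server.]

```python
import socket
import threading

def handle_command(cmd: str) -> str:
    # Illustrative commands, loosely modeled on callgrind_control -z / -d.
    commands = {"zero": "counters zeroed", "dump": "dump written"}
    return commands.get(cmd, "unknown command")

def serve_once(server: socket.socket) -> None:
    # Accept one connection, answer one command, then close.
    conn, _ = server.accept()
    with conn:
        cmd = conn.makefile().readline().strip()
        conn.sendall((handle_command(cmd) + "\n").encode())

server = socket.socket()
server.bind(("127.0.0.1", 0))   # ephemeral port
server.listen(1)
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
client.sendall(b"zero\n")
print(client.makefile().readline().strip())  # counters zeroed
```

The point over the command-file approach is that the tool only does work when a command actually arrives, instead of polling the filesystem on every scheduler run.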
> To stop it I did control-C for the application (kpresenter) and so
> the outer valgrind (callgrind) finished normally, and produced a final
> dump which is very small compared to the two intermediate ones I did:
>
> -rw------- 1 sewardj users 55936636 2006-04-17 02:40 callgrind.out.24371.1
> -rw------- 1 sewardj users 63330227 2006-04-17 13:32 callgrind.out.24371.2
> -rw------- 1 sewardj users 1127604 2006-04-17 15:35 callgrind.out.24371
Looks really heavy...
> $ for f in callgrind.out.24371* ; do echo $f ; grep ^fn $f | wc ; done
> callgrind.out.24371
> 1736 3070 36549
> callgrind.out.24371.1
> 63941 127124 1483254
> callgrind.out.24371.2
> 61833 123046 1467148
The number of unique functions across all profile files loaded into
KCachegrind is what matters most for KCachegrind's memory consumption. Still,
there is a hit. It would be good if there were a merge command (another todo).
> This confused me. I was even more confused to find that kcachegrind
> still takes just as long to load the 1M file as the 63M file.
That is simple. KCachegrind more or less automatically appends a "*" wildcard
when loading a profile file. So when you do "kcachegrind callgrind.out.24371",
KCachegrind is always loading all 3 files. If you have a look at the status
bar, you should see which file is currently being loaded.
Perhaps I should get rid of this unexpected behavior. I did this to make
"trigger a new dump & reload the data with the new dump" simpler. There is a
toolbar button in KCachegrind to trigger this: if you press the "Force Dump"
button while callgrind is running in the current working directory, you get
this behavior, and reloading all profile parts is part of it.
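[Editor's sketch: assuming that wildcard behavior, one way to inspect only a single part is to copy it to a name with no matching siblings. The file names are the ones from this thread; the files here are empty stand-ins.]

```shell
# Recreate the three dump file names from this thread in a scratch directory.
demo=$(mktemp -d)
cd "$demo"
touch callgrind.out.24371 callgrind.out.24371.1 callgrind.out.24371.2

# KCachegrind effectively appends "*" to the name you pass, so
# "kcachegrind callgrind.out.24371" would load all three parts:
ls callgrind.out.24371*      # lists all three files

# To look at only the small final dump, copy it to a name that has
# no sibling files matching the wildcard:
cp callgrind.out.24371 final-only.out
ls final-only.out*           # exactly one file matches now
```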
> Anyway, after all this, (1) from the intermediate dumps I found that
> MC_(realloc) might be a memcheck slow point in realloc-intensive code,
> and (2) I restarted the run using cachegrind instead of callgrind, to
> see if it can successfully complete the profiling run with the memory
> I have available. So far it has been running 6.6 hours.
I would suspect that the intermediate dump shows similar behavior
to the final result. If you think that this changes over time instead,
you could do some kind of manual "sampling":
Make a script which does something like this in a loop:
while true; do
  # fast-forward for some time...
  callgrind_control -i off
  sleep 60
  # switch instrumentation for cache simulation back on
  callgrind_control -i on
  # wait some time to warm up the simulated cache
  sleep 1
  # get rid of artificial cold misses from the warm-up phase
  callgrind_control -z
  # run some time with full cache simulation
  sleep 59
  callgrind_control -d
done
The net effect is that you do cache simulation only in small intervals,
and run in valgrind's "none" mode in between:
the above does 60 seconds at "none" speed (let's say a slowdown of 3),
and 60 seconds with full simulation (perhaps a slowdown of 60),
which gives you an overall slowdown of around 6, with full simulation
covering only about 1/20 of the runtime.
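[Editor's sketch: checking that estimate numerically, under the slowdown factors of 3 and 60 assumed above.]

```python
# Per cycle: 60 s wall clock in "none" mode, 60 s with full simulation.
none_wall, sim_wall = 60, 60
none_slowdown, sim_slowdown = 3, 60   # assumed slowdown factors

# Native program time covered in one 120 s cycle: 20 s + 1 s = 21 s.
native_covered = none_wall / none_slowdown + sim_wall / sim_slowdown
overall_slowdown = (none_wall + sim_wall) / native_covered
sim_fraction = (sim_wall / sim_slowdown) / native_covered

print(round(overall_slowdown, 1))  # 5.7, i.e. "around 6"
print(round(1 / sim_fraction))     # 21, i.e. roughly 1/20 of the runtime
```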
Josef
|
|
From: Julian S. <js...@ac...> - 2006-04-18 00:23:08
|
On Monday 17 April 2006 23:55, you wrote:
> On Monday 17 April 2006 22:24, you wrote:
> > In fact this one is even better; a ~ 1MB file which takes 59 CPU
> > minutes on a P4 1.7GHz to load. kcachegrind --version says:
> >
> > sewardj@suse10:~/VgTRUNK$ kcachegrind --version
> > Qt: 3.3.4
> > KDE: 3.4.2 Level "b"
> > KCachegrind: 0.4.6kde
>
> That is really strange. On my Pentium M, I get
>
> weidendo@linux:~/tmp/julian> time kcachegrind callgrind.out.24371
>
> real 0m3.383s
> user 0m0.864s
> sys 0m0.116s
Obviously I should upgrade my P4 to a Pentium M :-)
Er, I don't know. That really is very strange. kcachegrind's loading
progress bar (bottom RHS) goes from 0% to 81% in about ~ 15 seconds, but
then it stays at 81% for 15 minutes and then at 92% for another 15 mins.
My guess is that it is somehow running slowly in some algorithm which
analyses the call graph, and which goes really crazy due to the huge
number of call points mentioned in my previous message.
I tried also on a PIII running the same SuSE 10.0, and it is also very
slow. So it's not caused by the Pentium-4 denormalised-FP-number
performance disaster phenomenon.
But if we are running the same kcachegrind, then the difference can only
be in Qt (3.3.4 vs 3.3.5, bug-fix level only) and KDE (3.4.2 Level "b"
vs 3.5.2 Level "a"). Assuming the Qt difference is irrelevant, are you
using KDE(libs) in some way to do graph-theory stuff?
> Can you check with OProfile what is taking your time?
Too difficult .. I might try
'valgrind --tool=none --profile-flags=10000000 kcachegrind ...'
to find the top 100 basic blocks. J |
|
From: Josef W. <Jos...@gm...> - 2006-04-18 11:57:48
|
On Tuesday 18 April 2006 02:22, Julian Seward wrote:
> Obviously I should upgrade my P4 to a Pentium M :-)
Currently, I would recommend a Core Duo notebook ;-)
> Er, I don't know. That really is very strange.
Probably KCachegrind is simply loading all parts, and not only the 1 MB
you sent me (see my other mail).
> kcachegrind's loading progress bar (bottom RHS) goes from
> 0% to 81% in about ~ 15 seconds, but then it stays at 81% for
> 15 minutes and then 92% for another 15 mins. My guess is that
> it is somehow running slowly with some algorithm which analyses
> the call graph, and which goes really crazy due to the huge number
> of call points as mentioned in my previous message.
Hmm, no. The progress bar is only visible while it is really loading the
data. Which means allocating lots of small C++ objects for KCachegrind's
internal representation. I would have to recheck this, but for now I use
a lot of linked lists for linking the internal data, and this is really
bad with respect to the number of objects needed. I would suspect a net
effect of around 10 mallocs for every function. The cycle detection
phase is done after the loading.
> I tried also on a PIII running the same SuSE 10.0, and it is
> also very slow. So it's not caused by the Pentium-4 denormalised
> FP number performance disaster phenomenon.
Yeah. It is loading more than this 1MB file...
> Too difficult ..
Why is that? Is it because of the root rights you need? OProfile really
works out of the box on Suse nowadays.
Josef

> I might try 'valgrind --tool=none --profile-flags=10000000 kcachegrind ...'
> to find the top 100 basic blocks.
>
> J
|
|
From: Julian S. <js...@ac...> - 2006-04-18 01:34:37
|
> But if we are running the same kcachegrind, then the difference can
> only be in Qt (3.3.4 vs 3.3.5, bug-fix level only) and KDE (3.4.2
> Level "b" vs 3.5.2 Level "a"). Assuming the Qt difference is
> irrelevant, are you using KDE(libs) in some way to do graph-theory
> stuff?
>
> > Can you check with OProfile what is taking your time?
>
> Too difficult ..
>
> I might try 'valgrind --tool=none --profile-flags=10000000 kcachegrind ...'
> to find the top 100 basic blocks.
This gets stranger all the time. On 'none' it dies at 7% loaded with
==31091== Process terminating with default action of signal 14 (SIGALRM)
==31091==    at 0x50E1B89: __lll_mutex_lock_wait (in /lib/tls/libc-2.3.5.so)
==31091==    by 0x4803A48: _kde_IceTransFreeConnInfo (in /opt/kde3/lib/libDCOP.so.4.2.0)
==31091==    by 0x4803F0E: _kde_IceTransClose (in /opt/kde3/lib/libDCOP.so.4.2.0)
==31091==    by 0x4802CB0: _kde_IceFreeConnection (in /opt/kde3/lib/libDCOP.so.4.2.0)
==31091==    by 0x4802F79: KDE_IceCloseConnection (in /opt/kde3/lib/libDCOP.so.4.2.0)
==31091==    by 0x47E8A96: DCOPClient::emergencyClose() (in /opt/kde3/lib/libDCOP.so.4.2.0)
==31091==    by 0x470C3CE: KCrash::defaultCrashHandler(int) (in /opt/kde3/lib/libkdecore.so.4.2.0)
==31091==    by 0x503E6C7: (within /lib/tls/libc-2.3.5.so)
==31091==    by 0x507BA49: _int_malloc (in /lib/tls/libc-2.3.5.so)
==31091==    by 0x507D5F2: malloc (in /lib/tls/libc-2.3.5.so)
==31091==    by 0x4FBA3C6: operator new(unsigned) (in /usr/lib/libstdc++.so.6.0.6)
==31091==    by 0x80778A6: (within /opt/kde3/bin/kcachegrind)
On memcheck it doesn't die at that point, and no errors are reported.
Me mystified.
J |
|
From: Julian S. <js...@ac...> - 2006-04-18 11:12:47
|
> > > Can you check with OProfile what is taking your time?
> >
> > I might try 'valgrind --tool=none --profile-flags=10000000 kcachegrind
> > ...' to find the top 100 basic blocks.
This gets stranger all the time. On 'none' it dies at 7% loaded with

==31091== Process terminating with default action of signal 14 (SIGALRM)

It dies like that with all the tools that do not replace malloc (none,
lackey, cachegrind, callgrind). Runs ok on massif and memcheck.
Eventually I got some info using massif. Took about 5 cpu hours.
Just 10 bbs account for 93% of all executed bbs. They are:

Total score = 29991666790
0: (2797737062   9.32%)  2797737062  9.32%  0x4D3FAEF  QGList::next()+15
1: (5595474124  18.65%)  2797737062  9.32%  0x4D3FAE0  QGList::next()
2: (8392034273  27.98%)  2796560149  9.32%  0x4D3FAF6  QGList::next()+22
3: (11187229411 37.30%)  2795195138  9.31%  0x806670C
4: (13982424549 46.62%)  2795195138  9.31%  0x8066716
5: (16777619687 55.94%)  2795195138  9.31%  0x806673A
6: (19572238018 65.25%)  2794618331  9.31%  0x8076CBE
7: (22366361173 74.57%)  2794123155  9.31%  0x8076CCA
8: (25160484328 83.89%)  2794123155  9.31%  0x8076CC2
9: (27954312092 93.20%)  2793827764  9.31%  0x8076CB0

The 0x80xxxxx ones are from /opt/kde3/bin/kcachegrind itself, which is
unfortunately stripped, so no symbols. I used objdump to get the bbs
shown below for them. So maybe something in kcachegrind itself is doing
many traversals of a very long QGList? Does this mean anything to you?
J

806670c: 8b 91 e0 00 00 00  mov    0xe0(%ecx),%edx
8066712: 85 d2              test   %edx,%edx
8066714: 74 24              je     806673a <_ZN8QPainter4saveEv@plt+0x7572>
8066716: 8b 82 48 01 00 00  mov    0x148(%edx),%eax
806671c: 85 c0              test   %eax,%eax
806671e: 74 1a              je     806673a <_ZN8QPainter4saveEv@plt+0x7572>

806673a: 5d                 pop    %ebp
806673b: 89 d0              mov    %edx,%eax
806673d: c3                 ret

8076cbe: 39 c7              cmp    %eax,%edi
8076cc0: 74 52              je     8076d14 <_ZN8QPainter4saveEv@plt+0x17b4c>

8076cca: 85 c0              test   %eax,%eax
8076ccc: 89 c3              mov    %eax,%ebx
8076cce: 75 e0              jne    8076cb0 <_ZN8QPainter4saveEv@plt+0x17ae8>

8076cc2: 89 34 24           mov    %esi,(%esp)
8076cc5: e8 6e 5f fe ff     call   805cc38 <_ZN6QGList4nextEv@plt>

8076cb0: 31 c0              xor    %eax,%eax
8076cb2: 89 44 24 04        mov    %eax,0x4(%esp)
8076cb6: 89 1c 24           mov    %ebx,(%esp)
8076cb9: e8 42 fa fe ff     call   8066700 <_ZN8QPainter4saveEv@plt+0x7538> |
|
From: Josef W. <Jos...@gm...> - 2006-04-18 12:14:46
|
On Tuesday 18 April 2006 13:12, Julian Seward wrote:
> Just 10 bbs account for 93% of all executed bbs. They are:
>
> Total score = 29991666790
>
> 0: (2797737062 9.32%) 2797737062 9.32% 0x4D3FAEF QGList::next()+15
> 1: (5595474124 18.65%) 2797737062 9.32% 0x4D3FAE0 QGList::next()
> 2: (8392034273 27.98%) 2796560149 9.32% 0x4D3FAF6 QGList::next()+22
Ah, thanks. I should get rid of these linked lists...
> 3: (11187229411 37.30%) 2795195138 9.31% 0x806670C
> 4: (13982424549 46.62%) 2795195138 9.31% 0x8066716
> 5: (16777619687 55.94%) 2795195138 9.31% 0x806673A
> 6: (19572238018 65.25%) 2794618331 9.31% 0x8076CBE
> 7: (22366361173 74.57%) 2794123155 9.31% 0x8076CCA
> 8: (25160484328 83.89%) 2794123155 9.31% 0x8076CC2
> 9: (27954312092 93.20%) 2793827764 9.31% 0x8076CB0
>
> The 0x80xxxxx ones are from /opt/kde3/bin/kcachegrind itself, which
> is unfortunately stripped, so no symbols. I used objdump to get the
> bbs shown below for them. So maybe something in kcachegrind
> itself is doing many traversals of a very long QGList? Does this
> mean anything to you?
I think I check the list before inserting an object, so that there is no
double insertion. Deep down in building the internal data model while
loading...
> 806670c: 8b 91 e0 00 00 00 mov  0xe0(%ecx),%edx
> 8066712: 85 d2             test %edx,%edx
> 8066714: 74 24             je   806673a <_ZN8QPainter4saveEv@plt+0x7572>
Now that is about painting... Probably the progress bar. As far as I can
remember, I trigger a repaint every second, which is not really friendly
for running under valgrind, but should not be relevant in reality.
Josef |
|
From: Josef W. <Jos...@gm...> - 2006-04-18 12:02:42
|
On Tuesday 18 April 2006 03:34, Julian Seward wrote:
> This gets stranger all the time. On 'none' it dies at 7% loaded with
>
> ==31091== Process terminating with default action of signal 14 (SIGALRM)
> ...
> ==31091== by 0x503E6C7: (within /lib/tls/libc-2.3.5.so)
> ==31091== by 0x507BA49: _int_malloc (in /lib/tls/libc-2.3.5.so)
> ==31091== by 0x507D5F2: malloc (in /lib/tls/libc-2.3.5.so)
> ==31091== by 0x4FBA3C6: operator new(unsigned)
> (in /usr/lib/libstdc++.so.6.0.6)
> ==31091== by 0x80778A6: (within /opt/kde3/bin/kcachegrind)
Now that is really strange. Perhaps some KDE guru here has an idea why a
SIGALRM is raised?
> On memcheck it doesn't die at that point, and no errors are reported.
So this looks like some bug in combination with the native malloc
implementation (or a KDE replacement for it)?
> Me mystified.
Me too.
Josef |