From: Phillip E. <ez...@pe...> - 2002-03-04 21:53:45
Hi All,

I started to play with oprofile, and I must say that I am pretty impressed. I've worked on Alpha performance tools (such as Iprobe and DCPI) for about five years and ported a few to Alpha/Linux. I've also built a GUI for DCPI and tools to visualize performance data. In any event, oprofile is much like DCPI, and I am impressed by how much DCPI-like functionality you reproduced in an open-source tool. Let me tell you some of the things that I've learned that might help you in your project.

0) DCPI's dcpilabel function might be a helpful feature to add to oprofile. Here's a definition:

   "- dcpilabel - new command
    - It runs a specified program and labels all collected profiles with a
      user-defined label. This label can be passed to analysis tools to
      focus their attention on just the profiles with the specified label.
      This functionality can be used for example to compare two runs of a
      program within the same epoch, or to find the samples that fall within
      the kernel during the run of a program."

1) A per-CPU breakdown of performance statistics is very important. I don't have oprofile working on an SMP box yet, but this feature (which is lacking in DCPI) is asked for again and again on DCPI/Tru64. As Linux grows to solve bigger and bigger problems, this will become more important. (It could be the killer app for oprofile in kernel development.)

2) Could oprofile adopt something similar to DCPI's concept of epochs? Here's the explanation of epochs from the DCPI page:

   "All samples are organized into non-overlapping epochs, each of which
    contains samples for some time interval. A new epoch is started (and
    the previous epoch terminated) using the dcpiepoch command."

   I thought that oprofile had this feature and called them "sessions" (a much more sensible name, IMHO). Unfortunately, I can't figure out any way to specify a new session.

3) How does oprofile deal with modules on ramdisks? When I run my stock Red Hat 7.2, oprofile thinks that my SCSI drivers are in /lib/. "/proc/ksyms" says that my SCSI drivers are in /lib/, because that's where they exist on the ramdisk. "/lib" disappears when the system finishes booting, and oprofile can't find the "/lib/" files.

4) Can sample totals be stored at the beginning of the sample files? Performance is VERY slow when mapping and unmapping all of the images on my 128M machine. (Everything freezes while op_time is working.) I believe that this is because it is mmapping all of the files in the sample directory. This isn't necessary for op_time; it only needs to know totals. The overhead of tracking this should be pretty small, but the performance benefit when running op_time would be huge. (Especially for the really big sample files.)

5) Can the sample files be compressed? Your sample files compress enormously:

   -rw-r--r--    1 ezolt    csdpg    23296560 Mar  4 16:36 red-carpet
   -rw-r--r--    1 ezolt    csdpg       37487 Mar  4 16:37 red-carpet.gz

   By using zlib to read and write them, you could dramatically reduce the amount of I/O necessary to total all of the samples.

6) Please keep the columns consistent/spreadsheet-friendly. The column layout of the output from op_time and oprofpp differs. If one were to write a GUI that sits on top of oprofile, it would be easier to parse the output if it was identical in all cases (from image to function to source line to assembly line).

   op_time:
   /usr/lib/mozilla/components/libgklayout.so  46796   4.40626%
   /usr/lib/libgdk-1.2.so.0.9.1                61809   5.81987%
   /usr/lib/mozilla/components/libgfx_gtk.so   63646   5.99284%
   /lib/i686/libc-2.2.4.so                     90628   8.53344%
   /usr/src/linux-2.4.7-10/vmlinux            256315  24.1343%

   oprofpp:
   memmove[0x00088500]:      6.2965% (5650 samples)
   free[0x00080b40]:         6.7578% (6064 samples)
   __malloc[0x0007ff60]:     7.2448% (6501 samples)
   chunk_alloc[0x000801d0]: 10.4867% (9410 samples)
   chunk_free[0x00080c40]:  10.7151% (9615 samples)
   memcpy[0x00088c10]:      12.9261% (11599 samples)

7) Any plan on supporting the Pentium IV counters? Self-explanatory.

I only mention #4 and #5 because my system swaps madly when I run op_time. I would really like to see a full-featured profiling/performance tool for x86. I have some experience in the performance tool area, so if I can help, just drop me a line.

--Phil

Compaq: High Performance Server Systems
Quality & Performance Engineering
---------------------------------------------------------------------------
Phi...@co...                        Performance Tools/Analysis
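Point 4 above (keeping a running total where op_time can read it without touching the sample data) could look something like the following C sketch. The struct layout, field names, and magic value here are purely illustrative assumptions, not oprofile's actual on-disk sample format:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sample-file layout: a small header carries the running
   total, so a tool like op_time could read sizeof(struct sample_header)
   bytes per file instead of mmapping the whole sample buffer. */
struct sample_header {
    uint32_t magic;   /* identifies the (hypothetical) format */
    uint32_t total;   /* running total, updated on every sample */
};

struct sample_file {
    struct sample_header hdr;
    uint32_t counts[4096];    /* per-EIP-slot sample counters */
};

void sf_init(struct sample_file *sf)
{
    memset(sf, 0, sizeof *sf);
    sf->hdr.magic = 0x0b5e55ed;  /* arbitrary illustrative magic */
}

void sf_add_sample(struct sample_file *sf, uint32_t eip_slot)
{
    sf->counts[eip_slot % 4096]++;
    sf->hdr.total++;   /* the only extra cost: one increment per sample */
}
```

The overhead per sample is a single extra increment, which matches the "pretty small" cost estimated above.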
From: John L. <le...@mo...> - 2002-03-04 22:34:19
On Mon, Mar 04, 2002 at 04:52:22PM -0500, Phillip Ezolt wrote:
> Hi All,

Hi Phillip.

> 0) DCPI's dcpilabel function might be a helpful feature to add to oprofile.
>
> Here's a definition:
>
> "- dcpilabel - new command
>  - It runs a specified program and labels all collected profiles with a
>    user-defined label. This label can be passed to analysis tools to
>    focus their attention on just the profiles with the specified label.
>    This functionality can be used for example to compare two runs of a
>    program within the same epoch, or to find the samples that fall within
>    the kernel during the run of a program."

This sounds like it might indeed be a useful addition. We were planning to do comparisons with a slightly different (more automated) scheme.

> 1) A per cpu breakdown of performance statistics is very important.

This would actually be really trivial to do: just have the first word of the data read from the userspace daemon be the CPU number. Should go on the TODO list.

> I thought that oprofile had this feature and called them "sessions" (a
> much more sensible name, IMHO.) Unfortunately, I can't figure out any way
> to specify a new session.

Currently a new session is started when some parameters change, between individual runs of the daemon. Adding an epoch feature would be pretty easy I think, and another good idea.

> 3) How does oprofile deal with modules on ramdisks?
>
> When I run my stock redhat 7.2, oprofile thinks that my SCSI drivers are in
> /lib/.
>
> "/proc/ksyms" says that my SCSI drivers are in /lib/, because that's where
> they exist on the ramdisk.

We (quickly) ignore any such samples - there is no (automatic) way to find the binary. But I don't think there's any real requirement for us to have access to the binary whilst the daemon is running; currently we just map a sample file based on the size of the binary, when in fact we could be truncating the sample file to a larger size when we get an out-of-bounds sample (I think - Phil E.?) So it's another feature request really, and hasn't been done yet simply because it's not a common requirement.

> 4) Can sample totals be stored at the beginning of the sample files?

We've been planning something similar for a while; the post-profile tools are WAY too slow. In the future, we'd like to be able to freeze a profile state into a processed form, for quick analysis without all the thrashing of pages.

> I believe that this is because it is mmaping all of the files in the
> sample directory. This isn't necessary for op_time. It only needs
> to know totals.

Indeed.

> 5) Can the sample files be compressed?
>
> Your sample files compress enormously.
>
> -rw-r--r--    1 ezolt    csdpg    23296560 Mar  4 16:36 red-carpet
> -rw-r--r--    1 ezolt    csdpg       37487 Mar  4 16:37 red-carpet.gz

Try a stat :) Unless you're storing your sample files on vfat or something, they should be sparse files and take up very little actual disk space.

> By using zlib to read and write them, you could dramatically reduce
> the amount of I/O necessary to total all of the samples.

This is pretty difficult compared to the current method of just incrementing a counter in a file-backed mmap page anyway :)

> 6) Please keep the columns consistent/spread-sheet friendly.

Phil has recently fixed this (good catch Phil ;)

> 7) Any plan on supporting the Pentium IV counters?
> Self Explanatory.

Lack of test machines :/ It wouldn't be too hard ... maybe one day I will write some support for the basic mechanism "blind", as at least a starting point for an interested developer with a P4 machine. PEBS support is a bigger task.

> I would really like to see a full featured profiling/performance tool
> for X86. I have some experience in the performance tool area, so if I
> can help, just drop me a line.

Well, you've made some very good suggestions - we can always do with more of those! And we definitely take patches :)

regards
john

--
I am a complete moron for forgetting about endianness. May I be forever marked as such.
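The "try a stat" point is easy to demonstrate: a file full of holes reports a large apparent size (st_size, what ls -l shows) while allocating almost no blocks (st_blocks, what du reports). A minimal POSIX C sketch, assuming an ordinary sparse-capable filesystem; the path and hole size are arbitrary:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Make a sparse file by seeking past a large hole and writing one byte,
   then report apparent size vs. bytes actually allocated on disk. */
int sparse_sizes(const char *path, off_t hole,
                 long long *apparent, long long *on_disk)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    if (lseek(fd, hole, SEEK_SET) < 0 || write(fd, "x", 1) != 1) {
        close(fd);
        return -1;
    }
    struct stat st;
    fstat(fd, &st);
    close(fd);
    *apparent = (long long)st.st_size;          /* what ls -l shows */
    *on_disk  = (long long)st.st_blocks * 512;  /* what du reports  */
    return 0;
}
```

On vfat or a filesystem without hole support, the two numbers come out (roughly) equal, which is exactly the pathological case mentioned above.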
From: Philippe E. <ph...@cl...> - 2002-03-05 05:40:29
From: "John Levon" <le...@mo...>
Sent: Tuesday, March 05, 2002 1:33 AM

> On Mon, Mar 04, 2002 at 04:08:19PM -0800, Osiris Pedroso wrote:
>
> > How can I help ? I may need some help myself in getting it started (have
> > been away from Linux for the last 6 months or so).
> ...
> 2) fix the module/daemon to be able to use up to 18 counters instead of
> the current 4. This is mostly a simple change of OP_MAX_COUNTERS, but we
> need to see if taking 5 bits out of the sample entry for the counter
> number isn't causing performance degradation (I doubt it actually)

It's not a problem; I've tested it in the past with a simulated counter.

> 3) define and use a CPU_P4 type
>
> 4) test, fix and verify the APIC initialisation code

It should already be OK, except that the number of LVT vectors is 5 for the P4; the added vector is the LVT thermal monitor vector. To simplify the work you can start with a kernel where the APIC is up at startup, so you don't rely on the APIC bring-up code in oprofile. John, does this require at least 2.4.10?

...

> Get the Intel manual (vol. 3 in particular) if you haven't already, and
> flick through the perf ctr + APIC sections.

See also 24896604.pdf, "Intel Pentium 4 and Intel Xeon processor optimization", appendix B. Another useful source is Brink, which implements a P4, perf-ctr-based profiler:

http://www.eg.bucknell.edu/~bsprunt/emon/brink/brink.shtm

> Feel free to ask any questions you might have

Can I become rich, beautiful, clever and young by using oprofile ?

regards,
Phil
From: John L. <le...@mo...> - 2002-03-05 16:18:18
On Tue, Mar 05, 2002 at 06:28:23AM +0100, Philippe Elie wrote:
> it's not a problem, I've tested it in the past through simulated counter.

ok, good.

> is up at startup so you don't rely on the apic up code in
> oprofile. John is this require a 2.4.10 at least ?

Yep, with the APIC options obviously enough :)

> > Feel free to ask any questions you might have
>
> Can I become rich, beautiful, clever and young by using oprofile ?

I didn't say I would answer them :)

john

--
I am a complete moron for forgetting about endianness. May I be forever marked as such.
From: Osiris P. <ope...@sw...> - 2002-03-11 21:17:17
Ok, I have ordered the books from Intel (they are also available in PDF form).

I also installed RedHat 7.2 on my P4 machine. It comes with the 2.4.7-10 kernel.

What version of the kernel do I need?

Thanks,
Osiris

----- Original Message -----
From: "Philippe Elie" <ph...@cl...>
To: "John Levon" <le...@mo...>; "Osiris Pedroso" <ope...@sw...>
Cc: <opr...@li...>
Sent: Monday, March 04, 2002 9:28 PM
Subject: Re: Nice Work.
From: Philippe E. <ph...@cl...> - 2002-03-11 22:24:18
From: "Osiris Pedroso" <ope...@sw...>
Sent: Monday, March 11, 2002 10:20 PM

> Ok, I have ordered the books from Intel (they are also available in PDF
> form).
>
> I also installed RedHat 7.2 on my P4 machine. It comes with the 2.4.7-10
> kernel.
>
> What version of kernel do I need ?

A UP kernel >= 2.4.11 with CONFIG_X86_LOCAL_APIC on (perhaps the option is called CONFIG_X86_LOCAL_APICUP).

regards,
Phil
From: John L. <le...@mo...> - 2002-03-11 23:14:02
On Mon, Mar 11, 2002 at 01:20:33PM -0800, Osiris Pedroso wrote:
> I also installed RedHat 7.2 on my P4 machine.
> It comes with 2.4.7-10 kernel.

That version is fine - oprofile works on most versions.

regards
john

--
I am a complete moron for forgetting about endianness. May I be forever marked as such.
From: Philippe E. <ph...@cl...> - 2002-03-05 05:40:31
From: "John Levon" <le...@mo...>
Sent: Monday, March 04, 2002 11:31 PM

> On Mon, Mar 04, 2002 at 04:52:22PM -0500, Phillip Ezolt wrote:
>
> > Hi All,
>
> Hi Phillip.

Hi,

[DCPI label - Epoch]

> > I thought that oprofile had this feature and called them "sessions" (a
> > much more sensible name, IMHO.) Unfortunately, I can't figure out
> > any way to specify a new session.
>
> Currently a new session is started when some parameters change, between
> individual runs of the daemon.
>
> Adding in an epoch feature would be pretty easy I think, and another
> good idea.

We just need to add the notion of a named session. I don't see exactly the difference between labels and epochs. Are labels just a trick to avoid restarting a different epoch?

> > 3) How does oprofile deal with modules on ramdisks?
>
> We (quickly) ignore any such samples - there is no (automatic) way to
> find the binary. But I don't think there's any real requirement for us
> to have access to the binary whilst the daemon is running; currently we
> just map a sample file based on the size of the binary, when in fact we
> could be truncating the sample file to a larger size when we get an out
> of bounds sample (I think - Phil E. ?)

A) Growing and remapping the sample files would work. There is also a related problem in the post-profile code (the nr samples calculation).

> > 4) Can sample totals be stored at the beginning of the sample files?
>
> We've been planning something similar for a while, the post-profile
> tools are WAY too slow.
>
> In the future, we'd like to able to freeze a profile state into a
> processed form, for quick analysis without all the trashing of pages.
>
> > I believe that this is because it is mmaping all of the files in the
> > sample directory. This isn't necessary for op_time. It only needs
> > to know totals.

B) In my eyes that's a problem in the Linux kernel: when we mmap or read a sparse file, the holes are stored in the cache and so clobber all of memory with zeroed blocks. Try it with a "little session":

$ du -h /var/opd/samples
$ free
$ op_time
$ free

and look at the cached field...

> > 5) Can the sample files be compressed?
> >
> > Your sample files compress enormously.
> >
> > -rw-r--r--    1 ezolt    csdpg    23296560 Mar  4 16:36 red-carpet
> > -rw-r--r--    1 ezolt    csdpg       37487 Mar  4 16:37 red-carpet.gz
>
> try a stat :)
>
> Unless you're storing your sample files on vfat or something, they
> should be sparse files and take up very little actual disk space.

C) Even when comparing it to the du -h result, the compression ratio is impressive. That reminds me that we don't *greatly* discourage the use of a filesystem without sparse file support (vfat and network filesystems?) in the documentation.

> > By using zlib to read and write them, you could dramatically reduce
> > the amount of I/O necessary to total all of the samples.
>
> This is pretty difficult compared to the current method of just
> incrementing a counter in a file-backed mmap page anyway :)

D) I think Phillip means: using the compressed files for post-profile would give a great improvement and, at least from a memory-use point of view, it is a real improvement. I've thought in the past about compressing files when creating a new session, but I often use the post-profile tools on a running session...

John, we have discussed the sparse file format in the past. I think we must (see points A to D) eject it, even if this increases (moderately) the overhead of oprofile. The most promising data structure, in my eyes, to store samples is an in-memory B-tree built inside a growable mmapped file (one tree per image) with a small page size (probably fewer than 16 entries per page).

Phillip, you have worked with DCPI: what was roughly the overhead of DCPI?

> > 6) Please keep the columns consistent/spread-sheet friendly.

I'm amazed that nobody, myself included, has ever complained about the output format.

> > The column layout of the output from op_time and oprofpp differ. If one
> > were to write a GUI that sits on top of oprofile, it would be easier to
> > parse the output if it was identical in all cases.

Plain op_time (i.e. without the -l option) does not yet have a columned output.

I started an oprofpp-like GUI tool a few days ago, but I don't rely on the output of the post-profile tools; I build it on top of the containers used to store samples in the post-profile tools. For now I've planned only two views: one grouping the oprofpp -l/-L capabilities with a tree view of samples, and another to see hotspots in code graphically. I'm interested in ideas/help on what to implement in GUI post-profile tools. Because we are near a release period, I prefer not to open a branch in CVS and to wait a few days before committing the GUI.

> > 7) Any plan on supporting the Pentium IV counters?
> > Self Explanatory.
>
> lack of test machines :/
>
> It wouldn't be too hard ... maybe one day I will write some support for
> the basic mechanism "blind", as at least a starting point for an
> interested developer with a P4 machine.
>
> PEBS support is a bigger task.

We can perhaps take a more precise eip value from the PEBS. I hope this eip doesn't suffer from the irq latency problem. It would also be nice to port oprofile to other architectures.

regards,
Phil
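For reference, the "incrementing a counter in a file-backed mmap page" scheme that this thread debates replacing can be sketched in a few lines of POSIX C. The slot count, the path handling, and the O_TRUNC are simplifications for illustration, not the daemon's real code; note how ftruncate() extends the file with holes, which is exactly where the sparse-file behaviour discussed above comes from:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NR_SLOTS 4096

/* Map a sample file as a flat array of counters.  Pages never written
   remain holes in the file, so the file is sparse on disk. */
uint32_t *map_sample_file(const char *path)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0)
        return NULL;
    /* extending with ftruncate leaves unwritten regions as holes */
    if (ftruncate(fd, NR_SLOTS * sizeof(uint32_t)) < 0) {
        close(fd);
        return NULL;
    }
    uint32_t *counts = mmap(NULL, NR_SLOTS * sizeof(uint32_t),
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping keeps the file contents reachable */
    return counts == MAP_FAILED ? NULL : counts;
}

/* Recording a sample is just an increment into a file-backed page. */
void record_sample(uint32_t *counts, uint32_t eip_slot)
{
    counts[eip_slot % NR_SLOTS]++;
}
```

The appeal of this scheme is that the sample-recording fast path has no explicit I/O at all; the cost, as argued above, shows up later when post-profile tools fault in every zeroed page.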
From: <Phi...@co...> - 2002-03-05 16:04:28
Philippe,

> We just need to add the notion of named session. I don't see exactly
> the difference between labels and epochs. Is labels just a trick to avoid
> restart a different epochs ?

An epoch is a period of time, while a dcpi-label marks everything associated with the running of a particular process. Let me give some examples.

Epochs
------

When I start profiling, a new epoch is automatically opened. I start the program that I want to profile. When it is finished with initialization, I start a new epoch. I run my benchmark or whatever, and then start a new epoch. I then stop my program and start a new epoch.

The "latest" epoch contains samples for the currently running system. The "latest-1" epoch contains samples for the shut-down of the program I was interested in. The "latest-2" epoch contains samples for the running of the program I was interested in. The "latest-3" epoch contains samples for the initialization of the program I was interested in.

So, basically, epochs allow you to profile different periods of time.

Labels
------

This allows me to tag everything related to the program I am running as relevant to that label. For example, I would start "dcpilabel game_devel /usr/local/bin/quake". Later, the post-processing tools would accept the label and ONLY show me samples that were attributed to that label. That way, if I have two CPU hogs running simultaneously and they both call libc, the samples won't be merged. By specifying the label, I can see which samples in libc were caused by "quake" and which came from other applications on the system.

> Philipp you have work with DCPI: what was roughly the overhead
> of DCPI ?

DCPI definitely uses some sort of compression, and its overhead hovers at around 5% of the CPU. Not much at all for a profiling tool. Here's a rough comparison between the "uncompressed" and "compressed" DCPI sample files:

-rw-r--r--    1 ezolt    system    363332 Mar  5 10:45 vi
-rw-r--r--    1 ezolt    system      6208 Mar  4 23:40 vi_20000825012804fe49bb

> For now I've planned only two view, one grouping oprofpp -l/-L
> capabilities with a tree view of samples and one another to see
> graphically hotspot in code. I'm interested by idea/help on what
> to implement in GUI post-profile tools. Because we are near a
> release period I prefer to not open a branch in cvs and wait a few
> day before committing the GUI.

Our internal tools can start at the system level and drill all the way down to an assembly instruction using ctrees. (Open the system, open the image, open the function, open the source line.) Ideally, one would want some sort of "hints" at the assembly level to suggest what needs to be changed.

Much of the time people ask about differences between runs of an application. Unfortunately, I don't have a good way of showing differences between runs. (Think about it from a developer's point of view: I changed my code... it is 25% slower... what the heck is different?)

> It will be nice also to make port of oprofile to other architecture.

Yes. I was briefly looking into that. There are definitely architecture-specific aspects of profiling, but as long as you have a processor that can interrupt on a counter overflow, much of the code is the same. The details are a little different, but the abstraction remains the same. I think that IA64 would probably be the best place to take this. (Unfortunately, nobody has any hardware....)

I don't know how possible it is, but it would be good if OS events could use the same infrastructure as oprofile to give information about a program's execution. For example, I would like to be able to profile "page-faults" or system calls. I ran into a situation last year where a huge program was generating a lot of page faults. It was unclear WHERE they were coming from. To be able to trace the faults to a particular source line would be heaven. With a little kernel tinkering, you could profile ANY kernel system call. This would be very helpful with big codes that spend a lot of time in the kernel. Sometimes it is unclear which function is doing the "read" or "write" or whatever.

--Phil

Compaq: High Performance Server Systems
Quality & Performance Engineering
---------------------------------------------------------------------------
Phi...@co...                        Performance Tools/Analysis
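The epoch semantics described above (non-overlapping time intervals, each sample belonging to exactly one) amount to simple bucketing by timestamp. A hypothetical C sketch of just that lookup; the names and types are illustrative, not DCPI's or oprofile's:

```c
#include <assert.h>
#include <stddef.h>

/* Given ascending epoch start times, return the index of the epoch a
   sample's timestamp falls into.  Epoch i covers [starts[i], starts[i+1]);
   the last epoch is open-ended ("latest"). */
size_t epoch_of(const unsigned long *starts, size_t n_epochs,
                unsigned long stamp)
{
    size_t e = 0;
    while (e + 1 < n_epochs && stamp >= starts[e + 1])
        e++;
    return e;
}
```

Starting a new epoch is then nothing more than appending a boundary timestamp; no existing samples need to be touched, which is what makes the feature cheap to add.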
From: John L. <le...@mo...> - 2002-03-05 16:39:50
On Tue, Mar 05, 2002 at 11:03:00AM -0500, Phi...@co... wrote:
> > Philipp you have work with DCPI: what was roughly the overhead
> > of DCPI ?
>
> DCPI definately uses some sort of compression, and it's overhead
> hovers at around 5% of the CPU. Not much at all for a profiling tool.

Sounds in the same range as oprofile now (hard to say, though, since it's dependent on frequency).

> Much of the time people ask about differences between runs of an
> application. Unfortunately, I don't have a good way of showing
> differences between runs.
>
> (Think about it from a developers point of view. I changed my
> code... It is 25% slower... What the heck is different? )

Indeed, it's a definitely handy feature. There are two different ways to represent it: firstly via source annotation, and secondly a symbol-based approach that doesn't require -g. You just go through symbol by symbol, comparing them and showing the differences. If the developer has only changed, say, one function, then that would show up with the difference (of course there will be "line noise" and the like).

> There are definately architecture specific of profiling, but as long as you
> have a processor that can interrupt on a counter overflow, much of the code
> is the same.

Indeed, the approach is extensible to a number of processors. Originally my M.Sc. thesis deadline was really the bounding factor in implementing the code nicely. We've improved a lot recently due to various things, but there are undoubtedly quite a lot of x86-specific things still there.

> The details are a little different but the abstraction remains the
> same. I think that IA64 would probably the best place to take this.
> (Unfortunately, nobody has any hardware....)

The HP guys do: I don't know if they are actually planning a port. Bob?

> For example, I would like to be able to profile "page-faults" or
> system calls. I ran into a situation last year where a huge program
> was generating alot of page faults. It was unclear WHERE they were
> coming from. To be able to trace the faults to a particular source
> line would be heaven. With a little kernel tinkering, you could
> profile ANY kernel system call. This would be very helpful with big
> codes that spend alot of time in the kernel. Sometimes it is unclear
> which function is doing the "read" or "write" or whatever.

This is stepping into dprobes/LTT territory; I think we should investigate whether we can leverage them. This would be pretty far in oprofile's future anyway, I think. It also loses one of the distinct advantages of oprofile: no patches needed. So it would have to be an option at least ;)

regards
john

--
I am a complete moron for forgetting about endianness. May I be forever marked as such.
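The symbol-by-symbol comparison outlined above could start from something as small as this sketch. The table layout is a hypothetical stand-in for per-symbol sample counts, and a real diff tool would also have to walk both tables to catch added or removed symbols:

```c
#include <assert.h>
#include <string.h>

/* One row of a per-symbol sample table (no -g needed: symbol names and
   counts come straight from the symbol table plus the sample file). */
struct sym_count {
    const char *name;
    unsigned long samples;
};

/* Return samples(after) - samples(before) for one symbol, treating a
   symbol missing from either run as having zero samples there. */
long sym_delta(const struct sym_count *before, size_t n_before,
               const struct sym_count *after, size_t n_after,
               const char *name)
{
    unsigned long b = 0, a = 0;
    for (size_t i = 0; i < n_before; i++)
        if (strcmp(before[i].name, name) == 0)
            b = before[i].samples;
    for (size_t i = 0; i < n_after; i++)
        if (strcmp(after[i].name, name) == 0)
            a = after[i].samples;
    return (long)a - (long)b;
}
```

A single changed function would then stand out as the one symbol with a large delta, with the remaining symbols showing only sampling "line noise".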
From: Philippe E. <ph...@cl...> - 2002-03-06 02:53:27
From: "John Levon" <le...@mo...>
Sent: Tuesday, March 05, 2002 5:36 PM

> On Tue, Mar 05, 2002 at 11:03:00AM -0500, Phi...@co... wrote:
> > Much of the time people ask about differences between runs of an
> > application. Unfortunately, I don't have a good way of showing
> > differences between runs.
> >
> > (Think about it from a developers point of view. I changed my
> > code... It is 25% slower... What the heck is different? )
>
> Indeed, it's a definitely handy feature. There are two different ways to
> represent it, firstly via source annotation, and secondly a symbol-based
> approach that doesn't require -g.

The two are probably complementary; the symbol-based diff is easier to implement. I started something like the first in the past, based on annotated-source diffing, but it's not easy to implement.

> > There are definately architecture specific of profiling, but as long as you
> > have a processor that can interrupt on a counter overflow, much of the code
> > is the same.
>
> Indeed, the approach is extendible to a number of processors. Originally
> my M.Sc. thesis deadline was really the bounding factor in implementing
> the code nicely. We've improved a lot recently due to various things,
> but there are undoubtedly quite a lot of x86-specific things still
> there.

The TODO entry "replace our u32 etc. with <stdint.h>" is also meant as a review of oprofile's portability problems. oprofile is actually ... hum ... weird in the types used to record samples, e.g. post-profile uses sometimes size_t, sometimes u32, sometimes uint ...

> > The details are a little different but the abstraction remains the
> > same. I think that IA64 would probably the best place to take this.
> > (Unfortunately, nobody has any hardware....)

[on intercepting syscalls to profile page faults and other kernel events]

> This is stepping into dprobes/LTT territory, I think we should
> investigate if we can leverage them.
>
> This would be pretty far in oprofile's future anyway I think. It also
> loses one of the distinct advantages of oprofile: no patches needed. So
> it would have to be an option at least ;)

I agree, we must *strongly* avoid patching the kernel. It's unmaintainable. At least three other things come directly from your M.Sc. thesis:

- no need to instrument code
- independent of the compiled language used to create the binary
- low overhead

The first two are important in my eyes; the third also, but a little less than the others.

regards,
Philippe Elie
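The <stdint.h> cleanup mentioned above, combined with the "5 bits out of the sample entry" change discussed earlier for the P4's 18 counters, might look roughly like this. The exact bit split and names are illustrative assumptions, not oprofile's actual sample format:

```c
#include <assert.h>
#include <stdint.h>

/* A fixed-width sample entry: the low 27 bits hold the sample count,
   the top 5 bits hold the counter number (5 bits covers 0..31, enough
   for the P4's 18 counters).  uint32_t from <stdint.h> replaces the
   mix of u32/uint/size_t. */
typedef uint32_t sample_entry_t;

#define COUNTER_BITS 5u
#define COUNT_BITS   (32u - COUNTER_BITS)
#define COUNT_MASK   ((UINT32_C(1) << COUNT_BITS) - 1)

static inline sample_entry_t pack_entry(uint32_t count, uint32_t ctr)
{
    return (count & COUNT_MASK) | (ctr << COUNT_BITS);
}

static inline uint32_t entry_count(sample_entry_t e)
{
    return e & COUNT_MASK;
}

static inline uint32_t entry_counter(sample_entry_t e)
{
    return e >> COUNT_BITS;
}
```

Since packing and unpacking are a mask and a shift on a fixed-width type, the per-sample cost is negligible, which matches the "I doubt it actually" verdict on performance degradation earlier in the thread.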
From: John L. <le...@mo...> - 2002-03-06 06:44:41
On Wed, Mar 06, 2002 at 03:19:46AM +0100, Philippe Elie wrote:
> > Indeed, the approach is extendible to a number of processors. Originally
> > my M.Sc. thesis deadline was really the bounding factor in implementing
> > the code nicely. We've improved a lot recently due to various things,
> > but there are undoubtedly quite a lot of x86-specific things still
> > there.
>
> the TODO entry replace our u32 etc with <stdint.h> is also to make
> a review of oprofile on the portability problem. Oprofile actually is
> ... hum ... weird in the type used to record samples. eg post-profile
> use sometimes size_t sometimes u32 sometimes uint ...

Oh, these are piddling little things. We need to worry about things like the atomicity of certain checks in oprofile.c that work on x86 and don't elsewhere...

> I agree, we must *strongly* avoid to patch the kernel. It's un-maintainable.
> At least three other things comes directly from your M.Sc. thesis:
>
> - no need to instrument code
> - independant from the compiled language used to create binary.
> - low overhead
>
> the two first are important at my eyes, the third also but a little what
> less than the other.

All three come under the banner of "convenience". Developers' time is a precious commodity, and we don't want to waste it.

regards
john

--
I am a complete moron for forgetting about endianness. May I be forever marked as such.
From: Philippe E. <ph...@cl...> - 2002-03-06 02:53:23
From: <Phi...@co...>
To: "Philippe Elie" <ph...@cl...>

Philip,

[epochs vs label]

OK, thanks for the explanation. Epoch is likely to be implemented through sessions; label seems more problematic.

> > Philipp you have work with DCPI: what was roughly the overhead
> > of DCPI ?
>
> DCPI definately uses some sort of compression, and it's overhead
> hovers at around 5% of the CPU. Not much at all for a profiling tool.

Is that for profiling all running applications in the system?

> > For now I've planned only two view, one grouping oprofpp -l/-L
> > capabilities with a tree view of samples and one another to see
> > graphically hotspot in code. I'm interested by idea/help on what
> > to implement in GUI post-profile tools. Because we are near a
> > release period I prefer to not open a branch in cvs and wait a few
> > day before committing the GUI.
>
> Our internal tools can start at the system level, and drill all the
> way down to an assembly instruction using ctree's. (Open the system,
> open the image, open the function, open the source line.)

That's planned too; I think to just use a contextual menu on an item to select the right sub-view needed by the user. Your help would be appreciated in this area (the GUI is Qt-based).

Philippe Elie
From: <Phi...@co...> - 2002-03-06 22:03:17
> > Is it for profiling all running applications in the system ?

Yes. Kernel and user space.

> It's planned too, I think to just use a contextual menu on item to select
> the right sub-view needed by the user. Your help will be appreciate
> on this area (the gui is QT based)

No problem.

--Phil

Compaq: High Performance Server Systems
Quality & Performance Engineering
---------------------------------------------------------------------------
Phi...@co...                        Performance Tools/Analysis
From: John L. <le...@mo...> - 2002-03-05 16:28:33
|
On Tue, Mar 05, 2002 at 06:31:06AM +0100, Philippe Elie wrote:

> > to know totals.
>
> B) To my eyes that's a problem in the Linux kernel: when we mmap
> or read a sparse file, the holes are stored in the cache and so
> clobber all of memory with zeroed blocks. Try it with a "little
> session":

is this really the case ? Why aren't zero page copies faulted in on
demand ? I'm a bit dubious whether page size granularity is good enough
anyway.

> C) Even when comparing it to the du -h result, the compression ratio
> is impressive. That reminds me that we don't *greatly* discourage
> the use of a filesystem w/o sparse-file support (vfat and
> network fs ?) in the documentation.

yes, fix that :)

> D) I think Phillip means using the compressed files for post-profile

for post-profile, fine yes...

> to compress files when creating a new session, but I often
> use the post-profile tools on a running session...

not really an issue I think - just a "don't do that"

> John, we have discussed the sparse file format in the past.
> I think we must (see points A to D) eject it, even if this increases
> (moderately) the overhead of oprofile.
>
> The most promising data structure, to my eyes, for storing samples
> is an in-memory B-tree built inside a growable mmapped file
> (one tree per image) with a small page size (probably fewer than 16
> entries per page).

This sounds very much like what HP have. We need to get some serious
computer science going here to determine the best structure. So: I'm in
favour of such a change.

> plain op_time (ie w/o the -l option) does not yet have columned
> output. I have

ah ok. you see how little I've tested it ;)

> release period I prefer not to open a branch in cvs and will wait a
> few days before committing the GUI.

Hmm, Phil, getting these GUIs working nicely is a /big/ job ... I'd
really prefer a new module/branch for the GUI right now. Aren't there
more important things like op_diff we need first ? I'd really like to
get a 1.0 version out this year too, if possible.

> we can perhaps take a more precise eip value from the PEBS. I hope
> this eip doesn't suffer from the irq latency problem.

From the sound of it, it looks like it doesn't. But let's take one step
at a time: the basic facility is similar to the P6 one; PEBS will
probably require some more radical changes.

> It would be nice also to port oprofile to other architectures.

Just you wait !

john
--
I am a complete moron for forgetting about endianness. May I be forever
marked as such.
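The sparse-file scheme being debated above — the daemon keeps one mostly-hole histogram file per image and bumps a counter at the sampled offset, so untouched slots cost no disk blocks — can be sketched as follows. This is a hypothetical flat layout in Python for illustration only, not oprofile's actual sample-file format:

```python
import mmap
import os
import struct
import tempfile

SLOT = struct.calcsize("<I")  # one little-endian 32-bit counter per sample slot

def bump(path, slot_index, nslots):
    """Increment one counter in a file-backed histogram.

    The file's logical size covers every possible slot, but on a
    filesystem with sparse-file support only the pages actually touched
    get allocated - which is why such files look huge to ls/stat yet
    occupy few disk blocks.
    """
    with open(path, "r+b") as f:
        f.truncate(nslots * SLOT)  # extend logically; holes stay holes
        with mmap.mmap(f.fileno(), nslots * SLOT) as m:
            off = slot_index * SLOT
            (count,) = struct.unpack_from("<I", m, off)
            struct.pack_into("<I", m, off, count + 1)

def read_slot(path, slot_index):
    """Read one counter back; holes read as zero."""
    with open(path, "rb") as f:
        f.seek(slot_index * SLOT)
        data = f.read(SLOT)
    return struct.unpack("<I", data)[0] if len(data) == SLOT else 0
```

The cheap part is the daemon side: a bump is a single in-memory increment once the page is mapped. The cost Philippe objects to shows up on the reader side, which has to walk (and page in) the whole mostly-zero file.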
From: Philippe E. <ph...@cl...> - 2002-03-06 02:53:22
|
From: "John Levon" <le...@mo...>
Sent: Tuesday, March 05, 2002 5:25 PM

> On Tue, Mar 05, 2002 at 06:31:06AM +0100, Philippe Elie wrote:

[sparse files and the Linux kernel buffer cache]

> is this really the case ? Why aren't zero page copies faulted in on
> demand ?

Yes, they are, but they are *also* cached in the buffer cache, which
is, IMHO, wrong. Try what I suggested in my last mail and the kernel's
behaviour will become clear to you.

> I'm a bit dubious whether page size granularity is good enough anyway.

Even the pages touched by the daemon contain mainly zero counts (at
least 90% in the most common case, and more than 99% in many cases).

> > to compress files when creating a new session, but I often
> > use the post-profile tools on a running session...
>
> not really an issue I think - just a "don't do that"

ok, the doc says "oprofile is a continuous profiler"; must I add "but
never look at the results during profiling" ? :)

> > The most promising data structure, to my eyes, for storing samples
> > is an in-memory B-tree built inside a growable mmapped file
> > (one tree per image) with a small page size (probably fewer than 16
> > entries per page).
>
> This sounds very much like what HP have. We need to get some serious
> computer science going here to determine the best structure.

I prefer not to look at that work (it's copyrighted, if I remember
correctly). For the choice I've compared our needs against the most
common data structures (hash based, B-tree, 2-3 tree, SBB tree,
etc.). If someone has a better suggestion ...

> > release period I prefer not to open a branch in cvs and will wait a
> > few days before committing the GUI.
>
> Hmm, Phil, getting these GUIs working nicely is a /big/ job ...

Nope, it's not as much work as you think. A basic oprofpp -l/-L-like
view is already working nicely, and it's easy to enhance it step by
step.

> I'd really prefer a new module/branch for the GUI right now. Aren't
> there more important things like op_diff we need first ?

A GUI is needed, and it requires less work than diffing profiling
sessions. Other people can easily help on the GUI; helping on op_diff
seems to my eyes more problematic.

> I'd really like to get a 1.0 version out this year too, if possible.

ok.

> > It would be nice also to port oprofile to other architectures.
>
> Just you wait !

I'm waiting for hardware; I've tried to get an old Alpha workstation
but ...

Phil
From: John L. <le...@mo...> - 2002-03-06 06:47:59
|
On Wed, Mar 06, 2002 at 02:29:38AM +0100, Philippe Elie wrote:

> > > to compress files when creating a new session, but I often
> > > use the post-profile tools on a running session...
> >
> > not really an issue I think - just a "don't do that"
>
> ok, the doc says "oprofile is a continuous profiler"; must I add "but
> never look at the results during profiling" ? :)

No, I mean don't compress the current sample files :) We'll have an
"op_store" or whatever which stores the current session in a different
directory and compresses the files (epoch).

> > This sounds very much like what HP have. We need to get some serious
> > computer science going here to determine the best structure.
>
> I prefer not to look at that work (it's copyrighted, if I remember
> correctly)

Do you mean Judy ? I think Bob said they've replaced that now with
another scheme.

> > Hmm, Phil, getting these GUIs working nicely is a /big/ job ...
>
> Nope, it's not as much work as you think. A basic oprofpp -l/-L-like
> view is already working nicely, and it's easy to enhance it step by
> step.

ok :)

regards
john
--
I am a complete moron for forgetting about endianness. May I be forever
marked as such.
From: <Phi...@co...> - 2002-03-05 15:30:52
|
John,

> We (quickly) ignore any such samples - there is no (automatic) way to
> find the binary. But I don't think there's any real requirement for us
> to have access to the binary whilst the daemon is running; currently we
> just map a sample file based on the size of the binary, when in fact we
> could be truncating the sample file to a larger size when we get an out
> of bounds sample (I think - Phil E. ?)

The only reason I mention it is that "op_time -l" fails when it can't
find the modules that exist only on the ramdisk:

  "oprofpp: bfd_openr of /lib/ext3.o failed."

I guess this really isn't a profiling issue, but rather a
post-profiling tools issue. It would be nice to ignore un-openable
images, but the code just isn't written that way. (I saw no obvious way
to change it... when it can't open an image, it exits immediately.)

> In the future, we'd like to be able to freeze a profile state into a
> processed form, for quick analysis without all the thrashing of
> pages.

This can be done, but it really isn't necessary if you have a really
lightweight method of reading the samples.

> try a stat :)

[ezolt@scrffy tmp]$ stat red-carpet*
  File: "red-carpet"
  Size: 23296560   Blocks: 4824   IO Block: 4096   Regular File
Device: 802h/2050d   Inode: 216090   Links: 1
Access: (0644/-rw-r--r--)   Uid: ( 9336/ ezolt)   Gid: ( 1021/ csdpg)
Access: Mon Mar  4 16:39:36 2002
Modify: Mon Mar  4 16:36:58 2002
Change: Mon Mar  4 16:36:58 2002
  File: "red-carpet.gz"
  Size: 37487   Blocks: 80   IO Block: 4096   Regular File
Device: 802h/2050d   Inode: 216093   Links: 1
Access: (0644/-rw-r--r--)   Uid: ( 9336/ ezolt)   Gid: ( 1021/ csdpg)
Access: Mon Mar  4 16:37:06 2002
Modify: Mon Mar  4 16:37:06 2002
Change: Mon Mar  4 16:37:13 2002

> Unless you're storing your sample files on vfat or something, they
> should be sparse files and take up very little actual disk space.

This is on an ext3 file system. How does stat show a sparse file?
(Maybe I am missing it, but it looks like the uncompressed file really
IS bigger than the compressed one.)

> > By using zlib to read and write them, you could dramatically reduce
> > the amount of I/O necessary to total all of the samples.
>
> This is pretty difficult compared to the current method of just
> incrementing a counter in a file-backed mmap page anyway :)

There are probably better ways of doing it than zlib. I've used the
zlib library before, and it is pretty straightforward if files are
manipulated with file I/O instead of mmap. Things would need to be
redesigned a little bit.

> It wouldn't be too hard ... maybe one day I will write some support for
> the basic mechanism "blind", as at least a starting point for an
> interested developer with a P4 machine.

Hmpf. Unfortunately, I don't have one either. (Not yet, at least...)

> Well, you've made some very good suggestions, we can always do with
> more of those !
>
> And we definitely take patches :)

OK. I'm trawling through your source as we speak. I probably should be
using CVS head instead of the released kit. I'll probably just provide
suggestions/comments for the time being, at least until I can see clear
fixes or improvements.

--Phil

Compaq: High Performance Server Systems
        Quality & Performance Engineering
---------------------------------------------------------------------------
Phi...@co...                    Performance Tools/Analysis
From: John L. <le...@mo...> - 2002-03-05 16:44:50
|
On Tue, Mar 05, 2002 at 10:29:26AM -0500, Phi...@co... wrote:

> The only reason I mention it is that "op_time -l" fails when it can't
> find the modules that exist only on the ramdisk:
>
> "oprofpp: bfd_openr of /lib/ext3.o failed."

Yes, I've been meaning to fix that too (it's pretty stupid behaviour).

> tools issue. It would be nice to ignore un-openable images, but the
> code just isn't written that way. (I saw no obvious way to change
> it... when it can't open an image, it exits immediately.)

Phil, is it possible you can do this now (before 0.1) ?

> File: "red-carpet"
> Size: 23296560   Blocks: 4824   IO Block: 4096   Regular File

hmm, that's pretty heavily used ! Compare:

moz moz 108 /usr/bin/stat /var/opd/samples/\}lib\}libc-2.2.so#0
  File: "/var/opd/samples/}lib}libc-2.2.so#0"
  Size: 20289672   Blocks: 1664   Regular File

> (Maybe I am missing it, but it looks like the uncompressed file really
> IS bigger than the compressed one.)

I've no problem with bzip2ing sample files after they're stored in a
session. There is only so much space at the top of the todo list,
however :)

> There are probably better ways of doing it than zlib. I've used the
> zlib library before, and it is pretty straightforward if files are
> manipulated with file I/O instead of mmap. Things would need to be
> redesigned a little bit.

in fact oprofile used to have a zlib interface. I forget why ...

> I'll probably just provide suggestions/comments for the time being, at
> least until I can see clear fixes or improvements.

Thanks for your comments !

john
--
I am a complete moron for forgetting about endianness. May I be forever
marked as such.
From: Philippe E. <ph...@cl...> - 2002-03-06 02:53:25
|
From: "John Levon" <le...@mo...>
Sent: Tuesday, March 05, 2002 5:41 PM

> On Tue, Mar 05, 2002 at 10:29:26AM -0500, Phi...@co... wrote:

[op_time -l failure if some binary does not exist]

> Phil, is it possible you can do this now (before 0.1) ?

Yes, I'm looking at committing the patch, probably on the 6th.

[sample file size]

> in fact oprofile used to have a zlib interface. I forget why ...

I have a strong preference for dropping the sparse file format. It was
a good idea at the start of oprofile, but it has become somewhat
problematic.

regards,
Philippe Elie
From: John L. <le...@mo...> - 2002-05-01 20:21:30
|
On Mon, Mar 04, 2002 at 04:52:22PM -0500, Phillip Ezolt wrote:

> Let me tell you some of the things that I've learned, that
> might help you in your project.

... just clearing out my old mailboxes ...

> 1) A per cpu breakdown of performance statistics is very important.
>
> I don't have oprofile yet working on an SMP box, but this feature
> (which is lacking in DCPI), is asked for again and again on DCPI/Tru64.

Can you describe a little further how you see this working (both
implementation-wise and at the user interface level) ?

> 2) Could oprofile adopt something similar to DCPI's concept of Epochs?

See my other post.

regards
john
--
"Please let's not resume the argument with the usual whining about how this
feature will wipe out humanity or bring us to the promised land."
        - Charles Campbell on magic words in Subject: headers
From: <Phi...@co...> - 2002-05-01 20:26:33
|
John,

> Can you describe a little further how you see this working (both
> implementation-wise and at the user interface level) ?

I don't have strong ideas about the implementation, but there would
have to be a histogram for each CPU in the system. The analysis tools
must be able to quickly sum these when asked.

For the user interface, my ideas are a little more concrete. When the
user requests samples, he/she would be able to ask for a system-total
view of all of the samples, or for the samples that occurred only on a
particular CPU. (The default view would be the system-total view.)

This will give users the ability to see whether work is being balanced,
and whether CPU affinity is working as it is supposed to. (This becomes
VERY important with NUMA machines.)

--Phil

Compaq: High Performance Technical Computing/Visualization
---------------------------------------------------------------------------
Phi...@co...                    Performance/Development
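The two views Phil describes - per-CPU histograms with a summed system-total default - reduce to a small amount of bookkeeping on the analysis side. A minimal sketch, with per_cpu as a hypothetical mapping from CPU id to a list of per-slot sample counts (names invented here, not oprofile's):

```python
def system_total(per_cpu):
    """Collapse per-CPU histograms (cpu id -> list of per-slot counts)
    into the system-total view by summing slot-by-slot."""
    nslots = max((len(h) for h in per_cpu.values()), default=0)
    total = [0] * nslots
    for hist in per_cpu.values():
        for i, count in enumerate(hist):
            total[i] += count
    return total

def view(per_cpu, cpu=None):
    """Return one CPU's histogram, or the default system-total view.

    Comparing view(per_cpu, cpu=n) across CPUs is what exposes the
    load-imbalance and CPU-affinity problems mentioned above."""
    return per_cpu[cpu] if cpu is not None else system_total(per_cpu)
```

The storage cost is one extra histogram per CPU per image; the summation is cheap enough to do on demand, which matches the requirement that tools "quickly sum this when asked".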