From: George M. <ge...@ma...> - 2010-03-27 00:51:30

Dear all,

Thanks for your answers. I was mainly wondering whether this tool alone could profile both hardware counters and MPI communication, but you have answered my question; I was just confused by the documentation.

Thanks a lot.

Best regards,
George

From: Rick K. <rk...@il...> - 2010-03-26 19:15:24

George,

Jeff is right: there are a number of different ways of profiling MPI communication. I'm afraid the issue regarding PSPMPI, which was a development project within PerfSuite several years ago, is that the pages hosted at NCSA are quite out of date. PSPMPI was never released because other tools and libraries were developed elsewhere that fill the need quite nicely, so duplicating them and reinventing the wheel didn't make sense. Sorry for the confusion.

In addition to the tools Jeff mentions, I have found mpiP (LLNL/ORNL) to be quite useful, with very low overhead. There is information located here: http://mpip.sourceforge.net/

TAU and Scalasca are both participating in the active POINT project [ http://nic.uoregon.edu/point ], as is PerfSuite.

Rick

From: George M. <ge...@ma...> - 2010-03-26 18:58:02

Dear Jeff,

Thanks for the answer. My question is whether I can profile MPI communication, specifically point-to-point communication, with PerfSuite.

Best regards,
George

From: Jeff H. <jef...@gm...> - 2010-03-26 18:45:08

Communication of what kind? There are many tools for this type of thing. You might look at TAU, since it supports MPI, Pthreads, OpenMP, Fortran, C, C++, etc., as well as low-level instrumentation with PAPI. There is also Scalasca, but I haven't tried it yet.

If you mean communication below MPI, such as IB verbs, I don't know of any tool for that.

Jeff

--
Jeff Hammond
Argonne Leadership Computing Facility
jha...@mc... / (630) 252-5381
http://www.linkedin.com/in/jeffhammond

From: George M. <ge...@ma...> - 2010-03-26 18:38:15

Dear all,

I would like to ask whether it is possible to profile the communication of a Fortran program. I have compiled it with -lpspmpi, but I am not sure how to get the XML files about communication; I only get the ones about hardware counters. Is it possible to profile the communication of the application automatically? I would be grateful for any help.

Best regards,
George

From: Rick K. <rk...@il...> - 2010-02-03 03:35:47

Subscribers to this list may be interested in an upcoming full-day tutorial involving PerfSuite (as part of the NSF SDCI POINT project) that will be offered on Monday, March 8, 2010, at the 11th International Conference on High-Performance Clustered Computing (LCI '10). LCI '10 is hosted by the Pittsburgh Supercomputing Center, Pittsburgh, PA, USA.

Registration is now open for the conference and tutorial sessions. For more information, see: http://www.linuxclustersinstitute.org/conferences/

Rick

From: Rick K. <rk...@il...> - 2010-01-25 17:30:24

PerfSuite 1.0.0 alpha 4 is now available. Highlights of this release include:

* Updated for compatibility with PAPI version 4 (also known as Component PAPI, or PAPI-C).
* Numerous enhancements to the psprocess utility (Java version).
* A redesign of the Java metric calculation API for efficiency.
* Additional bug fixes and enhancements.

A complete listing of changes can be found in the CHANGES file.

URL: http://www.sf.net/projects/perfsuite

From: <jj...@nu...> - 2010-01-09 00:47:58

Rick,

Thanks for your suggestion!

Regards,
Jie

> From: Rick Kufrin <rk...@il...>
> To: jj...@nu...
> Subject: Re: [PerfSuite-users] On hacking perfsuite
> Date: Fri, 08 Jan 2010 14:53:40 -0600
>
> Jie,
>
> That sounds interesting and shouldn't be too difficult to do. Here is the way to go about it:
>
> - Additions to the data collected (your new syscall) belong in the library libpshwpc, not libperfsuite.
>
> - The C structures that hold the data are defined in the file hwpc.h. Assuming that the data you want to add is on a per-thread basis, it should be added to the structure called ps_hwpc_values_t.
>
> - The code that actually does the collection is contained in the file hwpc.c. This is basically the top-level entry point to the collection library. I am guessing you will want to call your new system call once, at the end of execution. If that is the case, then the appropriate function to do that in is ps_hwpc_stop(), where you would execute your syscall and store the results in the existing structure "values" (the type is as above).
>
> - Finally, you will want the new item written out to the XML output document. To do that, you will need to add code to the output routine in hwpc-xml.c. That file should be pretty straightforward to understand. You will probably want to define a new XML element tag (a name of your choice) for your data. It's up to you where you want it placed within the document. The result will no longer be a valid document (from the XML perspective), but that shouldn't matter, as validation is not done in post-processing.
>
> Next, if you want the psprocess utility to be able to parse out and display the new data you have added, you will need to modify its implementation as well. There are now two versions of psprocess (Tcl and Java), and the default is Tcl. The appropriate file to modify to add the new data is called hwpcreport.tcl. I do not know if you are familiar with Tcl, but basing your changes on the existing code in there is probably the way to go. If you intend to work with the Java version of psprocess, let us know and we can try to assist you with that.
>
> Hope that helps,
>
> Rick

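Rick's last point in the message above, that an extra element makes the output formally invalid but does no harm because post-processing does not validate, can be illustrated with a small sketch. This is not PerfSuite code: the actual changes would be C code in hwpc.h, hwpc.c and hwpc-xml.c, and the element names `report` and `kernelinfo` below are made up purely for illustration. The sketch only shows that a non-validating parser (here Python's standard `xml.etree.ElementTree`) reads an unknown element without complaint.

```python
import xml.etree.ElementTree as ET

# Hypothetical output document. "report" and "kernelinfo" are invented names
# used only for this illustration; they are not taken from PerfSuite's schema.
DOC = """\
<report>
  <kernelinfo units="count">12345</kernelinfo>
</report>
"""


def read_extra_element(xml_text, tag):
    """Return the text of the first element named `tag`, or None if absent."""
    # ElementTree does not validate against a DTD or schema, so an element
    # that the original document type never declared is parsed like any other.
    root = ET.fromstring(xml_text)
    node = root.find(f".//{tag}")
    return None if node is None else node.text


value = read_extra_element(DOC, "kernelinfo")
print(value)
```

A post-processor written this way simply ignores elements it does not look for, which is presumably why adding a custom tag is safe as long as every reader is non-validating.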
From: Jie J. <jj...@nu...> - 2010-01-08 14:31:56

Hi Rick,

I'd like to hack PerfSuite to record some special information about the target application along with the performance data collected by PerfSuite (such as profiles, hardware PMU data, etc.).

We have hacked the OS kernel to provide a new syscall that collects the desired information. I would like this syscall to be called in libperfsuite, so that it collects the information and writes the results to PerfSuite's output files, together with PerfSuite's performance data.

Where should I start?

Regards,
Jie Jiang

From: Rick K. <rk...@il...> - 2009-11-23 17:28:18

Jie -

I am glad the apparent discrepancy in elapsed time has been accounted for.

Regarding your other question: I cannot locate the metric calculation you refer to in the Intel manuals, so unfortunately I cannot comment or speculate.

Rick

Jie Jiang wrote:
> Hi Rick,
>
> Thanks for your reply.
>
> I checked my platform and found that the CPU frequency scales down automatically when it is idle.
>
> After using the "cpuspeed" command to set the CPU speed to 2.53 GHz, I got the expected, correct wall clock time. Thanks again.
>
> But I wonder why psrun picks up the scaled-down frequency. I need to check when psrun reads it from /proc/cpuinfo.
>
> Another question. When measuring cg.A with the event MEM_LOAD_RETIRED:LLC_UNSHARED_HIT, I got a counter value of 331743878.
>
> According to the Intel manual, Vol. 3B, the percentage of the total run time spent on load latency can be calculated as follows:
>
> (((MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) + (MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM * 74)) / CPU_CLK_UNHALTED.THREAD) * 100
>
> Here, the resulting percentage is about 158.827%. This is intuitively wrong, since any overhead should be smaller than the total run time. What's wrong?
>
> P.S. I tested it on my platform with the latest pfmon-3.9/perfmon2. It gives a similar count value.
>
> Any idea?
>
> Regards,
> Jie

From: Rick K. <rk...@il...> - 2009-11-17 16:34:19

(forgot to copy the list on this reply)...

----- Forwarded Message -----
From: "Rick Kufrin" <rk...@il...>
To: jj...@nu...
Sent: Tuesday, November 17, 2009 10:33:02 AM GMT -06:00 US/Canada Central
Subject: Re: [PerfSuite-users] Questions about "Wall clock time"

Jie -

It seems that the discrepancy in the reported elapsed time is due to differences in how your machine's clock speed is reported. I see from the content of the "brand string" element in your XML document that it is a Xeon E5540, 2.53 GHz. This information comes from the CPUID instruction. However, the "clockspeed" element reported in the document is 1.6 GHz; that information comes from /proc/cpuinfo. If you replace the clockspeed of 1600 with 2530, the numbers will be much closer.

I am guessing there is some variable clock speed going on with your platform, and that the discrepancy stems from that, not from overhead generated by PerfSuite.

Rick

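Rick's suggestion to replace the clock speed of 1600 with 2530 can be checked with a few lines of arithmetic. The sketch below assumes, based only on his description of a time "calculated from the elapsed clock ticks", that the reported wall clock time is the tick count divided by the recorded clock speed; the function name is made up for illustration, and the exact computation inside psprocess is not confirmed here. The inputs are the figures from Jie's report: 4.310 seconds at a recorded 1600 MHz on a 2.53 GHz processor.

```python
def rescale_wall_clock(reported_seconds, assumed_mhz, actual_mhz):
    """Re-derive a tick-based elapsed time using a different clock speed.

    Assumption: reported_seconds was computed as ticks / assumed_mhz.
    """
    ticks = reported_seconds * assumed_mhz * 1e6   # elapsed clock ticks
    return ticks / (actual_mhz * 1e6)              # seconds at the real speed


corrected = rescale_wall_clock(4.310, 1600.0, 2530.0)
print(f"{corrected:.3f} s")
```

Under that assumption the corrected figure comes to roughly 2.73 seconds, which is close to the 2.756 seconds of real time reported by the `time` command in the same run.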
From: Jie J. <jj...@nu...> - 2009-11-17 14:39:14

Hi Rick,

Enclosed is the original XML file. The wall clock time in the XML file is very different from the CPU time. Please check it.

Regards,
Jie

On Mon, 2009-11-16 at 11:43 -0600, rk...@il... wrote:
> Jie -
>
> My initial guess is that psprocess is miscalculating the wall clock time you mention. To be sure, I would like to see the original XML document that you got from the benchmark (the complete document). Can you please send me a copy?
>
> The wall clock you found in the XML document is not (I believe) displayed by psprocess, but in this instance it would seem to be more accurate; I am saying that based on its agreement with the "time" command.
>
> The difference between these two measures of time is that one is calculated from the elapsed clock ticks (this is the one that is off in your report); the other is obtained from information taken from the /proc filesystem related to the process/thread being measured. Another important difference is that the first is meant to be literally "wall clock time", while the second is "CPU time", the amount of time actually spent using the processor (these can be very different depending on the application).
>
> Rick
>
> ----- Original Message -----
> From: "Jie Jiang" <jj...@nu...>
> To: rk...@il...
> Cc: per...@li...
> Sent: Monday, November 16, 2009 8:03:13 AM GMT -06:00 US/Canada Central
> Subject: [PerfSuite-users] Questions about "Wall clock time"
>
> Hi Rick,
>
> When processing the collected data with "psprocess", it always shows the "Wall clock time" result.
> I have two questions about the "Wall clock time".
> First, it is much larger than the run time of the target program.
>
> [root@node2 bin]# time psrun -c test_config1.xml ./cg.A
> libpsrun.c:181 : SIGPROF ignored on startup. Handler=0x1, flags=14000000
> PerfSuite debugging enabled (debug level: PS_DEBUG_OFF) [PID 5562]
> Library version: threaded
> [PID 5562] Environment (entry of psrun_init)
> [PID 5562] PSRUN_DOFORK = (null)
> [PID 5562] LD_PRELOAD = libpsrun.so.0
> [PID 5562] PSRUN_PID = 5562
> [PID 5562] PS_HWPC_FILE = cg.A
>
> NAS Parallel Benchmarks (NPB3.2-SER) - CG Benchmark
>
> Size: 14000
> Iterations: 15
>
> Initialization time = 0.656 seconds
>
> iteration  ||r||                 zeta
>  1         0.25789587124191E-12  19.9997581277040
>  2         0.25434985977194E-14  17.1140495745506
>  3         0.25346577542259E-14  17.1296668946143
>  4         0.25342984287709E-14  17.1302113581192
>  5         0.25247550490803E-14  17.1302338856353
>  6         0.25375789728060E-14  17.1302349879482
>  7         0.25309911213776E-14  17.1302350498916
>  8         0.24971158788969E-14  17.1302350537510
>  9         0.24662516791025E-14  17.1302350540101
> 10         0.25086578290790E-14  17.1302350540284
> 11         0.24878397192172E-14  17.1302350540298
> 12         0.24359141964394E-14  17.1302350540299
> 13         0.24247346800617E-14  17.1302350540299
> 14         0.24157219672237E-14  17.1302350540299
> 15         0.24243304908282E-14  17.1302350540299
> Benchmark completed
> VERIFICATION SUCCESSFUL
> Zeta is 0.171302350540E+02
> Error is 0.526781606656E-13
>
> CG Benchmark Completed.
> Class           = A
> Size            = 14000
> Iterations      = 15
> Time in seconds = 2.06
> Mop/s total     = 724.79
> Operation type  = floating point
> Verification    = SUCCESSFUL
> Version         = 3.2.1
> Compile date    = 09 Nov 2009
>
> Compile options:
> F77        = ifort
> FLINK      = $(F77)
> F_LIB      = (none)
> F_INC      = (none)
> FFLAGS     = -O -g
> FLINKFLAGS = -O
> RAND       = randi8
>
> Please send all errors/feedbacks to:
>
> NPB Development Team
> np...@na...
>
> real 0m2.756s
> user 0m2.711s
> sys 0m0.022s
>
> [root@node2 bin]# psprocess -m test_metric.xml cg.A.5562.node2.xml
> PerfSuite Hardware Performance Summary Report
>
> Version    : 1.0
> Created    : Mon Nov 16 20:46:23 CST 2009
> Generator  : psprocess 0.5
> XML Source : cg.A.5562.node2.xml
>
> Execution Information
> ============================================================================================
> Collector  : libpshwpc
> Date       : Mon Nov 16 20:45:34 2009
> Host       : node2
> Process ID : 5562
> Thread     : 0
> User       : root
> Command    : cg.A
>
> Processor and System Information
> ============================================================================================
> Node CPUs     : 8
> Vendor        : Intel
> Family        : Pentium Pro (P6)
> Brand         : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
> CPU Revision  : 5
> Clock (MHz)   : 1600.000
> Memory (MB)   : 16078.69
> Pagesize (KB) : 4
>
> Cache Information
> ============================================================================================
> Cache levels : 3
> --------------------------------
> Level 1
> Type         : instruction
> Size (KB)    : 32
> Linesize (B) : 64
> Assoc        : 4
> Type         : data
> Size (KB)    : 32
> Linesize (B) : 64
> Assoc        : 8
> --------------------------------
> Level 2
> Type         : unified
> Size (KB)    : 256
> Linesize (B) : 64
> Assoc        : 8
> --------------------------------
> Level 3
> Type         : unified
> Size (KB)    : 8192
> Linesize (B) : 64
> Assoc        : 16
>
> Index Description                                                      Counter Value
> ============================================================================================
> 1 MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available)....    338818848
> 2 MEM_LOAD_RETIRED:LLC_MISS (description not available)............    3219718
> 3 UNHALTED_CORE_CYCLES (description not available).................    7312056865
>
> Event Index
> ============================================================================================
> 1: MEM_LOAD_RETIRED:LLC_UNSHARED_HIT 2: MEM_LOAD_RETIRED:LLC_MISS
> 3: UNHALTED_CORE_CYCLES
>
> Statistics
> ============================================================================================
> Counting domain........................................................ user
> Multiplexed............................................................ no
> Wall clock time (seconds).............................................. 4.310
> ----------------------------------------------
> Here we can see that the "Wall clock time" reported by psprocess (4.31 s) is considerably larger than the run time of cg.A, whether measured by cg.A's own output (2.06 s) or by the time command (about 2.7 s).
> Where does the rest of the time go? What causes the overhead?
> And what is the real meaning of the "Wall clock time" here?
>
> Second, in the XML file output by psrun, there is a record of the CPU time:
> <cputime units="seconds">
> <usertime>2.002680</usertime>
> <systemtime>0.000010</systemtime>
> </cputime>
>
> We can see that this is quite close to the real run time of cg.A.
> Why does psprocess not show these values?
> Will you add this function in the upcoming ps-1.0?
>
> Regards,
> Jie

From: <rk...@il...> - 2009-11-16 17:44:19
|
Jie -

My initial guess is that psprocess is miscalculating the wall clock time you mention. To be sure, I would like to see the original XML document that you got from the benchmark (the complete document). Can you please send me a copy?

The CPU time you found in the XML document is not (I believe) displayed by psprocess, but in this instance it would seem to be more accurate; I am saying that based on its agreement with the "time" command.

The difference between these two measures of time is that one is calculated from the elapsed clock ticks (this is the one that is off in your report); the other is obtained from information taken from the /proc filesystem related to the process/thread being measured. Another important difference is that the first is meant to be literally "wall clock time", while the second is "CPU time", the amount of time actually spent using the processor (these can be very different depending on the application).

Rick
|
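The distinction Rick draws between "wall clock time" and "CPU time" can be illustrated with a minimal sketch. This is not PerfSuite code; it uses Python's standard `time` module, and the helper names `measure`, `busy`, and `idle` are made up for the illustration. A compute-bound function should show the two times close together, while a sleeping function accumulates wall clock time but almost no CPU time.

```python
import time


def measure(fn):
    """Run fn() and return (wall_seconds, cpu_seconds) for the call."""
    wall_start = time.perf_counter()   # elapsed real ("wall clock") time
    cpu_start = time.process_time()    # user + system CPU time of this process
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def busy():
    """Compute-bound work: the processor is in use for the whole call."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def idle():
    """Sleeping: wall clock time passes, but hardly any CPU time is used."""
    time.sleep(0.2)


busy_wall, busy_cpu = measure(busy)
idle_wall, idle_cpu = measure(idle)
print("busy: wall=%.3fs cpu=%.3fs" % (busy_wall, busy_cpu))
print("idle: wall=%.3fs cpu=%.3fs" % (idle_wall, idle_cpu))
```

An application that waits on I/O or communication behaves more like `idle`, which is why the two measures "can be very different depending on the application".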
From: Jie J. <jj...@nu...> - 2009-11-16 14:03:40
|
Hi Rick,

When processing the collected data with "psprocess", it always shows the "Wall clock time" result. I have two questions about the "Wall clock time".

First, it is much larger than the run time of the target program.

[root@node2 bin]# time psrun -c test_config1.xml ./cg.A
libpsrun.c:181 : SIGPROF ignored on startup. Handler=0x1, flags=14000000
PerfSuite debugging enabled (debug level: PS_DEBUG_OFF)
[PID 5562] Library version: threaded
[PID 5562] Environment (entry of psrun_init)
[PID 5562] PSRUN_DOFORK = (null)
[PID 5562] LD_PRELOAD = libpsrun.so.0
[PID 5562] PSRUN_PID = 5562
[PID 5562] PS_HWPC_FILE = cg.A

NAS Parallel Benchmarks (NPB3.2-SER) - CG Benchmark

Size: 14000
Iterations: 15

Initialization time = 0.656 seconds

iteration  ||r||                 zeta
 1         0.25789587124191E-12  19.9997581277040
 2         0.25434985977194E-14  17.1140495745506
 3         0.25346577542259E-14  17.1296668946143
 4         0.25342984287709E-14  17.1302113581192
 5         0.25247550490803E-14  17.1302338856353
 6         0.25375789728060E-14  17.1302349879482
 7         0.25309911213776E-14  17.1302350498916
 8         0.24971158788969E-14  17.1302350537510
 9         0.24662516791025E-14  17.1302350540101
10         0.25086578290790E-14  17.1302350540284
11         0.24878397192172E-14  17.1302350540298
12         0.24359141964394E-14  17.1302350540299
13         0.24247346800617E-14  17.1302350540299
14         0.24157219672237E-14  17.1302350540299
15         0.24243304908282E-14  17.1302350540299
Benchmark completed
VERIFICATION SUCCESSFUL
Zeta is 0.171302350540E+02
Error is 0.526781606656E-13

CG Benchmark Completed.
Class = A
Size = 14000
Iterations = 15
Time in seconds = 2.06
Mop/s total = 724.79
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.2.1
Compile date = 09 Nov 2009

Compile options:
F77 = ifort
FLINK = $(F77)
F_LIB = (none)
F_INC = (none)
FFLAGS = -O -g
FLINKFLAGS = -O
RAND = randi8

Please send all errors/feedbacks to:
NPB Development Team
np...@na...

real 0m2.756s
user 0m2.711s
sys 0m0.022s

[root@node2 bin]# psprocess -m test_metric.xml cg.A.5562.node2.xml
PerfSuite Hardware Performance Summary Report

Version : 1.0
Created : Mon Nov 16 20:46:23 CST 2009
Generator : psprocess 0.5
XML Source : cg.A.5562.node2.xml

Execution Information
============================================================================================
Collector : libpshwpc
Date : Mon Nov 16 20:45:34 2009
Host : node2
Process ID : 5562
Thread : 0
User : root
Command : cg.A

Processor and System Information
============================================================================================
Node CPUs : 8
Vendor : Intel
Family : Pentium Pro (P6)
Brand : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
CPU Revision : 5
Clock (MHz) : 1600.000
Memory (MB) : 16078.69
Pagesize (KB) : 4

Cache Information
============================================================================================
Cache levels : 3
--------------------------------
Level 1
Type : instruction
Size (KB) : 32
Linesize (B) : 64
Assoc : 4
Type : data
Size (KB) : 32
Linesize (B) : 64
Assoc : 8
--------------------------------
Level 2
Type : unified
Size (KB) : 256
Linesize (B) : 64
Assoc : 8
--------------------------------
Level 3
Type : unified
Size (KB) : 8192
Linesize (B) : 64
Assoc : 16

Index Description                                                        Counter Value
============================================================================================
1 MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 338818848
2 MEM_LOAD_RETIRED:LLC_MISS (description not available)............ 3219718
3 UNHALTED_CORE_CYCLES (description not available)................. 7312056865

Event Index
============================================================================================
1: MEM_LOAD_RETIRED:LLC_UNSHARED_HIT 2: MEM_LOAD_RETIRED:LLC_MISS
3: UNHALTED_CORE_CYCLES

Statistics
============================================================================================
Counting domain........................................................ user
Multiplexed............................................................ no
Wall clock time (seconds).............................................. 4.310
----------------------------------------------

Here we can see that the "Wall clock time" reported by psprocess (4.31 s) is considerably larger than the runtime of cg.A, both as reported by cg.A itself (2.06 s) and by the time command (about 2.7 s).
Where does the rest of the time go? What causes the overhead?
And what is the real meaning of the "Wall clock time" here?

Second, the XML output file of psrun contains the CPU time:

<cputime units="seconds">
<usertime>2.002680</usertime>
<systemtime>0.000010</systemtime>
</cputime>

We can see that this is quite close to the real run time of cg.A.
Why does psprocess not show these values?
Will you add this feature in the upcoming ps-1.0?

Regards,
Jie
|
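The numbers in this report allow one quick consistency check. What follows is only a hypothesis and is not confirmed anywhere in the thread: the report lists "Clock (MHz) : 1600.000" while the brand string says 2.53GHz, so if a tick count taken at the nominal rate were converted to seconds using the reported 1600 MHz, the result would be inflated by the ratio of the two frequencies. The sketch below rescales the reported 4.310 s under that assumption; `ticks_to_seconds` and the 2530 MHz nominal value are introduced here for illustration.

```python
def ticks_to_seconds(ticks, clock_mhz):
    """Convert a clock tick count to seconds at the given clock rate."""
    return ticks / (clock_mhz * 1e6)


reported_wall_s = 4.310   # "Wall clock time (seconds)" from the psprocess report
reported_mhz = 1600.0     # "Clock (MHz)" from the same report
nominal_mhz = 2530.0      # assumption: read off the "2.53GHz" brand string

# Tick count implied by the report if it divided by the reported clock rate.
implied_ticks = reported_wall_s * reported_mhz * 1e6

# The same tick count converted at the nominal rate instead.
rescaled_wall_s = ticks_to_seconds(implied_ticks, nominal_mhz)
print("rescaled wall clock: %.3f s" % rescaled_wall_s)
```

Under this assumption the rescaled value comes out near 2.73 s, which is close to the 2.756 s "real" time printed by the time command. That agreement is suggestive at best; the complete XML document Rick asks for would be needed to tell what actually happened.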
From: Rick K. <rk...@il...> - 2009-11-10 22:00:56
|
Jie Jiang wrote:
> Hi Rick,
>
> There is a problem with perfsuite-1.0.0a3.
> In file $PS_INSTALL_DIR/share/perfsuite/tcllib/pkgIndex.tcl, line 8, a
> version number should be added right after the "psbfd" field.
>
> [root@node2] psprocess cg.A.xml
> .....
> error reading package index
> file /usr/local/yhps/perfsuite-1.0.0/share/perfsuite/tcllib/pkgIndex.tcl:expected version number but got "load /usr/local/yhps/perfsuite-1.0.0/lib/psbfd/libpsbfd.so"
> ....
>
> After adding a version number (such as 0.1), it works well.
>
> But this should be corrected in perfsuite source code, not the installed
> package. Right?
>
> Regards,
> Jie Jiang

Jie - thank you for reporting this, it is an issue I have not seen before. You are correct, you can fix this problem by hand-editing the pkgIndex.tcl file, but indeed it should be done in PerfSuite's configure/build step. We will fix this before the next release. Again, thank you for your report.

Rick
|
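The hand edit Jie describes can be sketched as a small script. This is only an illustration: the thread does not show the actual contents of line 8, so the `package ifneeded psbfd [list load ...]` form used below is an assumption inferred from the error message, and the helper `add_missing_version` and the example path are hypothetical. The sketch inserts a version number after the package name only when the word that follows does not already look like one.

```python
import re

# "package ifneeded <name> <rest>", where <rest> should start with a version.
_IFNEEDED = re.compile(r'^(\s*package\s+ifneeded\s+)(\S+)(\s+)(.*)$')
# Something that looks like a Tcl package version, e.g. "0.1" or "1.0.0a3".
_VERSION = re.compile(r'^\d+([.ab]\d+)*$')


def add_missing_version(line, version="0.1"):
    """Insert `version` after the package name if the line lacks one."""
    match = _IFNEEDED.match(line)
    if not match:
        return line                      # not a "package ifneeded" line
    prefix, name, space, rest = match.groups()
    first_word = rest.split(None, 1)[0] if rest.strip() else ""
    if _VERSION.match(first_word):
        return line                      # a version is already present
    return "%s%s%s%s %s" % (prefix, name, space, version, rest)


# Hypothetical broken line; the real file contents are not shown in the thread.
broken = "package ifneeded psbfd [list load /opt/perfsuite/lib/psbfd/libpsbfd.so]"
fixed = add_missing_version(broken)
print(fixed)
```

As both Jie and Rick note, patching the installed file is a workaround; the proper fix belongs in the configure/build step that generates pkgIndex.tcl.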
From: Rick K. <rk...@il...> - 2009-11-10 21:47:49
|
Jie,

You have come across some "legacy" issues that remain in PerfSuite (but will change in the future).

- the "Model(Type) Unknown" output from psinv is due to the psinv implementation lagging in terms of newer processors. This type of string-oriented output will be going away (to make maintenance easier), and instead will show numeric family, model info similar to /proc/cpuinfo. Since all recent Intel and AMD processors have the "brand string" capability ("model name" in /proc/cpuinfo), that should be a better description source.

- you are correct about the "-e" option to psinv, again this is an area that has not been updated. Part of the reason is the current kernel support situation (perf_events). papi_native_avail will provide similar info, as you point out, and can be used to learn native event names.

- you are also correct about the current behavior of supplying native event names as "preset"s. While this will work today (the parsing source code is more lenient than it might be), I would not recommend relying on this behavior, as it may go away in the future. Native events should be specified with type="native" and with the XML element content being the event name.

Rick

JiangJie wrote:
> Hi Rick,
>
> Recently, I'm trying perfsuite-1.0.0a3 on Nehalem/Linux platform. All
> required software components (perfctr patch, papi-3.7.0, tcl, tk,
> tdom, expat) work well and perfsuite-1.0.0a3 can be built successfully.
>
> However, there are some problems confusing me.
>
> 1. Nehalem (Core i7) processor has family 6, model 26, stepping 5 as
> shown by /proc/cpuinfo.
> But the psinv command shows "Model(Type) Unknown".
> I'm not clear about the "Model(Type)" in perfsuite. Should it be
> Corei7 for this processor?
> Perhaps psinv should be changed to support Core i7.
>
> [root@node2]cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 26
> model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
> stepping : 5
> cpu MHz : 1600.000
> cache size : 8192 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor
> ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 sse4_2 popcnt lahf_lm ida
> bogomips : 5066.82
> clflush size : 64
> cache_alignment : 64
> address sizes : 40 bits physical, 48 bits virtual
> power management:
> ..........
>
> [root@node2]psinv -e
> System Information -
> Node Name: node2
> OS Name: Linux
> OS Release: 2.6.27perfctr
> OS Build/Version: #1 SMP Fri Nov 6 11:38:00 CST 2009
> OS Machine: x86_64
> Processors: 8
> Total Memory (MB): 16078.68
> System Page Size (KB): 4.00
>
> Processor Information -
> Vendor: Intel
> Processor family: Pentium Pro (P6)
> Brand: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
> Model (Type): (unknown)
> Revision: 5
> Clock Speed: 1600.00 MHz
>
> Cache and TLB Information -
> Cache levels: 3
>
> Cache Details -
> Level 1:
> Type: Instruction
> Size: 32 KB
> Line size: 64 bytes
> Associativity: 4-way set associative
>
> Type: Data
> Size: 32 KB
> Line size: 64 bytes
> Associativity: 8-way set associative
>
> Level 2:
> Type: Unified
> Size: 256 KB
> Line size: 64 bytes
> Associativity: 8-way set associative
>
> Level 3:
> Type: Unified
> Size: 8.00 MB
> Line size: 64 bytes
> Associativity: 16-way set associative
>
> TLB Details -
> Level 1:
> Type: Unified
> Entries: 512
> Pagesize (KB): 4
> Associativity: 4-way set associative
>
> Type: Instruction
> Entries: 128
> Pagesize (KB): 4
> Associativity: 4-way set associative
>
> Type: Instruction
> Entries: 7
> Pagesize (KB): 2048 4096
> Associativity: Fully associative
>
> Type: Data
> Entries: 64
> Pagesize (KB): 4
> Associativity: 4-way set associative
>
> Type: Data
> Entries: 32
> Pagesize (KB): 2048 4096
> Associativity: 4-way set associative
>
> The "-e" (or "--events") option is not supported on this system
>
> 2. From the above psinv output, we can see that the "-e" (or "--events")
> option only works if PERFMON is installed.
> This will confuse users because in fact the native events list can be
> obtained by papi_native_avail on x86/x86_64 platforms.
> Do you plan to change psinv to show the native event list for
> x86/x86_64 processors?
> Or I can help to do this.
>
> 3. The third problem is about the psrun config files.
> As we know, there are preset events and native events in PAPI.
> Does the "event type" in the config file determine whether preset or native
> events are collected?
> However, when combining type=preset with native_event_name, psrun works
> well.
> Should a native event name be used with the native type?
>
> So confusing.
>
> Regards,
> Jie Jiang
|
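Rick's first point, describing a processor by its numeric family and model plus the brand string rather than by a hand-maintained name table, can be sketched as follows. This is not psinv code; `parse_cpuinfo` and `describe_cpu` are hypothetical helpers, and the sample text is the subset of Jie's /proc/cpuinfo output needed for the illustration.

```python
def parse_cpuinfo(text):
    """Return the "key : value" fields of the first processor entry."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break        # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_cpu(fields):
    """Numeric family/model/stepping, with the brand string as description."""
    return "family %s, model %s, stepping %s (%s)" % (
        fields["cpu family"], fields["model"],
        fields["stepping"], fields["model name"])


# Subset of the /proc/cpuinfo output quoted in the message above.
sample = """\
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
stepping : 5
"""

description = describe_cpu(parse_cpuinfo(sample))
print(description)
```

On a Linux system the same function could be given the contents of /proc/cpuinfo directly; the point of the design is that nothing has to be updated when a new processor model appears.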
From: Rick K. <rk...@il...> - 2009-11-03 22:56:18
|
Excellent news, Sherry - I am glad it works for you. I hope the user finds ParaProf's display of PerfSuite profiles helpful. I am copying the perfsuite-user mailing list to archive the workaround for this issue.

Rick

----- Original Message -----
From: "Sherry Chang" <she...@na...>
To: rk...@il...
Cc: "Sherry Chang" <She...@na...>, "Henry Jin" <hj...@na...>
Sent: Tuesday, November 3, 2009 4:40:51 PM GMT -06:00 US/Canada Central
Subject: Re: How to get combined profile results

Hi Rick,

Thank you very much for your quick response and suggestion. I did try using ParaProf and it seems to work with my simple MPI pi program. I have informed the user to try using ParaProf to view the multiple profile xml files.

Regarding our Altix IA64 systems, I am not sure when we will migrate from PP5.x to PP6.x. The current focus of our division is on the newer Altix ICE systems and the system group does not have time to worry about Altix IA64 right now.

Regards,
Sherry
|
From: <rk...@il...> - 2009-11-03 15:51:36
|
Members of this list may be interested in the availability of a bootable LiveDVD that contains pre-installed copies of performance software and tools that have been developed by the POINT and VI-HPS projects (PerfSuite is participating in POINT). The LiveDVD is available as an ISO image that can be written to DVD or USB, which provides a handy way to try out software without having to go through full installations. More information has been posted here: http://www.ncsa.illinois.edu/News/09/1102POINTVIHPS.html The LiveDVD will be used as a hands-on training tool at an upcoming full-day tutorial to be held at Supercomputing 2009 in Portland, OR. Rick |
From: <rk...@il...> - 2009-11-03 00:40:02
|
Sherry,

Your email to the PerfSuite user mailing list was bounced by SourceForge (I think because you are not a list subscriber), but I found a copy in my filtered email box.

Regarding displaying combined profile results with psprocess: I am not sure I would say that is a "problem", but Henry is correct in that there is no support for displaying profiling results from parallel programs (i.e., multiple XML profiling documents) from psprocess. Primarily, this is due to the practical problem of how to summarize that type of information in the text-based output that psprocess concentrates on. There is only so much "screen space" to work with when writing to the "console".

There isn't a PerfSuite-only solution to this situation, either in 0.x or 1.0. However, recent versions of the TAU performance system from the University of Oregon do support graphical display of profiling results through TAU's visualizer "ParaProf". ParaProf understands the PerfSuite file formats (you still have to translate the raw samples to source code locations through "psprocess -x").

Your user's questions are entirely reasonable. There was/is support for examining parallel profiles through the VProf package from Sandia, but that package seems to no longer be active, so it is deprecated in PerfSuite (also, you would need to have VProf available in the first place for display).

Recent versions of PerfSuite (1.0) provide support for generating files that can be displayed by the Cube visualizer from the Scalasca project in Europe (http://www.scalasca.org), and I find that very useful and compact. However, before you consider upgrading, I would like to know what the target platform is, primarily because I am aware that Altix platforms with ProPack 6+ can have difficulties with the psrun command.

If we can help you get things moving in a better way, please let us know. At present, those are my comments on the current capabilities and options.

Rick

----------------
Sent By: "Sherry Chang" <she...@na...>
On: November 2, 2009 12:42 PM
To: per...@li...
Cc: She...@na...

Hi,

Our site (NASA Ames) is currently using PerfSuite version 0.6.2b1. One of our users would like to get combined profile results from the individual profile *.xml files but was not able to do so. Henry Jin mentioned that this is a known problem with version 0. Is this changed in version 1?

Thank you,
Sherry Chang
User Services
NASA Advanced Supercomputing Division

>Sherry,
>The problem of not reporting combined profile results is known
>in the 0.x versions of PerfSuite. I'm not sure if anything has been
>changed in the latest version (1.x). I don't really know a solution.
>It's probably better to post an inquiry to
><per...@li...>
>Rick Kufrin is very responsive in answering questions.
>-Henry

On 10/31/09 4:49 PM, Sherry Chang wrote:

Hi Henry,

Using the counting mode, one can combine results from individual *.xml files and get an overall report of the whole code instead of each individual process. For example,

mpirun -np 4 psrun -f ./new_pi_g > kkk
psprocess -c new_pi_g.120*.cfe1.xml > all_counting.xml
psprocess all_counting.xml > all_counting.psprocess.out

In the all_counting.psprocess.out, one sees:

Minimum and Maximum                                    Min             Max
============================================================================================
% CPU utilization................................... 98.06 [cfe1]    99.54 [cfe1]
% cycles stalled on any resource.................... 56.66 [cfe1]    56.96 [cfe1]
...

Aggregate Statistics                        Median    Mean    StdDev     Sum
============================================================================================
% CPU utilization.......................     99.46   99.13      0.71  396.51
% cycles stalled on any resource........     56.73   56.77      0.13  227.08
Bandwidth used to level 1 cache (MB/s)..    814.28  821.81     17.04 3287.26

Instead of the counting mode, a user tried to use PerfSuite to profile his MPI code and get overall statistics for his whole code instead of each process. He would like to know where the code is taking the longest time to run. But he was not able to do so. Here is part of his email:

psrun generating a bunch of XML files, but I am unclear how to combine them to get overall statistics. First I tried,

psprocess *.xml > out.txt

That produced a report in plain text, but it looks like it just processed the first XML file. Then I tried,

psprocess -c *.xml > all.out

That produced an XML file, I think. It is much longer, but I am unclear what to do with it. I ran,

psprocess all.out

to process combined "out" file, but I got this error message:

document contains profiling data (only vmon output is currently supported)

I added --vmon and ran,

psprocess --vmon all.out

and got this error message:

[stack]: cannot open `[stack]' (No such file or directory)

I did similar experiments like he did and got the same behavior. The experiments I tried (papi_profile_cycles.xml and itimer.xml) both showed this behavior.

Do you know any way that will aggregate the info from the profiling results of each process?

Sherry
|
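The "Minimum and Maximum" and "Aggregate Statistics" sections of the counting-mode report can be reproduced in outline with the sketch below. It is not psprocess code. The four per-process values are hypothetical: only the minimum, maximum, and sum are known from Sherry's report, and the two middle values were chosen here so that the list is consistent with those figures. Using the sample standard deviation (divisor n - 1) is likewise an assumption.

```python
import statistics

# Hypothetical "% CPU utilization" for the 4 MPI processes of the example run.
# Only min (98.06), max (99.54) and sum (396.51) come from the report above.
cpu_utilization = [98.06, 99.45, 99.46, 99.54]


def aggregate(values):
    """Summarize one metric across all processes of a parallel run."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),   # sample form, divisor n - 1
        "sum": sum(values),
    }


summary = aggregate(cpu_utilization)
for name, value in summary.items():
    print("%6s: %.2f" % (name, value))
```

A text table like this works for a handful of scalar metrics per process; a profile carries a sample count for every source location in every process, which is part of why Rick points to graphical tools such as ParaProf or Cube for that case.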
From: Rick K. <rk...@nc...> - 2009-10-23 18:21:31
|
Jie,

For some reason, I believe SourceForge (and/or NCSA) bounced your message and marked it as spam, but fortunately I did find a copy in my filtered email box. I just want to mention to you (and all subscribers to this list) that we do try to reply to all emails on a relatively prompt basis, so if anyone sends a note and does not receive a reply, please do not assume we are ignoring you. We may just not have received it... so please feel free to resend.

Thank you for your kind comments on PerfSuite, they are much appreciated. We are always gratified to learn that someone may have been assisted in their work by using PerfSuite and welcome comments, good or bad.

Regarding PerfSuite working with pfmon2: I assume you mean Perfmon2, the performance subsystem kernel work and user library led by Stephane Eranian. Actually, PerfSuite already contains support for Perfmon, but only an older version, one running on Itanium platforms under kernel version 2.4.x. That implementation leveraged PerfSuite's design to use other performance software beyond PAPI (which has been the workhorse so far). However, it has not been updated for subsequent versions of Perfmon, which grew to include many more CPUs than the IA-64 processors.

We had been intending to update the Perfmon support to work directly with Perfmon2, but recent developments in the mainline kernel have altered the picture substantially. Some users on this list may be aware of the "Performance Counter Library" (PCL), now known as "perf events", which has been accepted into the mainline kernel. There has been much discussion of this on the kernel mailing lists, the Perfmon list, and the PAPI list. While all this settles out, we are not actively developing towards these other layers, but once things become more generally available, we would very much like to provide an alternate route to the counters in addition to PAPI (which, in turn, relies on perfctr/perfmon/pcl for kernel support).

I hope this answers your question - the situation is pretty fluid at this point, but the short answer is: no, at present PerfSuite does not work directly with Perfmon2, but we do have plans to expand beyond PAPI.

Rick

> Hi Rick,
>
> I have been working with perfsuite for some time.
> It helps me a lot in my work and thanks for your excellent work.
>
> Perfsuite depends on perfctr to access CPU performance counters.
> However, there are some disadvantages with perfctr implementation,
> compared with pfmon2, which is also widely used and supported in
> performance tools.
>
> So, I'd like to know if the latest perfsuite can work with pfmon2
> package. If not, do you have any plan to do it?
>
> Regards,
> Jie
|
From: Rick K. <rk...@il...> - 2009-10-16 16:24:23
|
Eugene,

This is a behavior that some other users and we, too, have observed on some of our systems with more recent software stacks (glibc, dl, ld.so, etc). We are in the process of testing a change to PerfSuite that may avoid this problem. We hope to make this available with our next release (in the 1.0 series, not 0.6 which you say you have), targeted for the end of October, so just a couple of weeks away.

Thanks for reporting this issue,
Rick

Eugéne Suter wrote:
> Hi,
>
> I'm trying to use Perfsuite to obtain a PAPI profile of my
> OpenMP-enabled program (using the papi3_core configuration file), but
> I get this message from psrun just before it finishes:
>
> "Inconsistency detected by ld.so: dl-close.c: 719: _dl_close:
> Assertion `map->l_init_called' failed!"
>
> What can I do to solve this?
> My environment is Slackware64 with: Perfsuite 0.6.2, PAPI 3.6.2, GCC 4.3.3.
>
> Cheers,
> Eugéne
|
From: Eugéne S. <ea...@gm...> - 2009-10-14 16:38:01
|
Hi,

I'm trying to use Perfsuite to obtain a PAPI profile of my OpenMP-enabled program (using the papi3_core configuration file), but I get this message from psrun just before it finishes:

"Inconsistency detected by ld.so: dl-close.c: 719: _dl_close: Assertion `map->l_init_called' failed!"

What can I do to solve this?
My environment is Slackware64 with: Perfsuite 0.6.2, PAPI 3.6.2, GCC 4.3.3.

Cheers,
Eugéne
|
From: Rick K. <rk...@il...> - 2009-09-01 16:06:16
|
The included email announcement of the new release of PerfSuite version 1.0.0a2 was sent to the mailing list "perfsuite-announce" yesterday afternoon, but does not seem to have propagated out to the list membership. Therefore, I am also sending to the perfsuite-users mailing list - apologies if you receive multiple copies. Details about the release are available on the PerfSuite websites as well as in the CHANGES file in the distribution. Rick |
From: Rick K. <rk...@il...> - 2009-08-27 15:54:48
|
Jie,

Thanks again for the further updates and of course I am glad to learn that you had better luck with subsequent runs. Of course, the situation you described earlier as well as our own experiences on our platforms are still troubling, so we will continue to investigate. But I am indeed happy that you were able to achieve at least partial success.

Rick

----- Original Message -----
From: "Robbie" <jj...@nu...>
To: rk...@il...
Cc: per...@li...
Sent: Thursday, August 27, 2009 9:25:07 AM GMT -06:00 US/Canada Central
Subject: Re: [PerfSuite-users] Problems about using Perfsuite to monitor OpenMP program (NPB-3.2.1)

Rick,

Today I have tested perfsuite with NPB-OMP benchmarks on an Xeon/Linux machine with icc/ifort compiler. Luckily, all OpenMP benchmarks finish successfully and performance data files are generated as expected.

I'm not clear about the reason behind this problem. Maybe the difference between icc/ifort and gcc/gfortran can be a useful indication.

Jie

2009-08-26 at 08:12 -0500, rk...@il... wrote:
> Jie,
>
> Thanks for reporting on your further experiments. With the issue still present when using PAPI, it seems similar to an issue we have seen on our Altix. Unfortunately, this remains an unresolved issue that may be related to the way that psrun operates internally. I'm afraid I do not have a solution at present, but I have found that using the PerfSuite API directly produces the proper results. If you are able and willing to do so, using the API involves inserting a call to the following routines:
>
> call psf_hwpc_init() - from the main thread
> call psf_hwpc_start() - from within a parallel region
> call psf_hwpc_stop(filename) - from within a parallel region
>
> There is an example (in C) in the PerfSuite distribution of calling the API from an OpenMP program. You will find it in:
>
> $PREFIX/share/perfsuite/examples/cpi/cpi-omp.c
>
> I'm afraid I do not have a better solution at this time, but it is an issue we are following up on.
>
> Rick
>
>
> ----- Original Message -----
> From: "Robbie" <jj...@nu...>
> To: "Rick Kufrin" <rk...@il...>
> Cc: per...@li...
> Sent: Wednesday, August 26, 2009 7:27:05 AM GMT -06:00 US/Canada Central
> Subject: Re: [PerfSuite-users] Problems about using Perfsuite to monitor OpenMP program (NPB-3.2.1)
>
> Rick,
>
> Thanks for your suggestion.
> However, when I tried to monitor an OpenMP program with the default
> configuration file, I still got some errors and the data files are not
> created as expected.
>
> The following is the platform information and how I measure the NPB-OMP
> program with psrun.
> Note here the target OpenMP program is compiled by gcc/gfortran with
> -fopenmp option.
>
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ uname -a
> Linux UT43 2.6.27-perfctr #2 SMP Tue Apr 28 20:29:12 CST 2009 i686
> GNU/Linux
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ perfex -i
> PerfCtr Info:
> abi_version 0x05020501
> driver_version 2.6.37 DEBUG
> cpu_type 14 (Intel Pentium M)
> cpu_features 0x7 (rdpmc,rdtsc,pcint)
> cpu_khz 798049
> tsc_to_cpu_mult 1
> cpu_nrctrs 2
> cpus [0], total: 1
> cpus_forbidden [], total: 0
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ ls
> bt.A ep.A is.A is.B
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ export OMP_NUM_THREADS=2
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ psrun -p ./is.A
>
>
> NAS Parallel Benchmarks (NPB3.2-OMP) - IS Benchmark
>
> Size: 8388608 (class A)
> Iterations: 10
> Number of available threads: 2
>
>
> iteration
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> 10
>
>
> IS Benchmark Completed
> Class = A
> Size = 8388608
> Iterations = 10
> Time in seconds = 1.46
> Total threads = 2
> Avail threads = 2
> Mop/s total = 57.29
> Mop/s/thread = 28.64
> Operation type = keys ranked
> Verification = SUCCESSFUL
> Version = 3.2.1
> Compile date = 25 Aug 2009
>
> Compile options:
> CC = gcc
> CLINK = $(CC)
> C_LIB = -lm
> C_INC = (none)
> CFLAGS = -O -g -fopenmp
> CLINKFLAGS = -O -fopenmp
>
>
> Please send all errors/feedbacks to:
>
> NPB Development Team
> np...@na...
>
> Inconsistency detected by ld.so: dl-close.c: 719: _dl_close: Assertion
> `map->l_init_called' failed!
>
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ ls
> bt.A ep.A is.A is.A.0.30613.UT43.xml is.B
>
>
> The program execution finishes with the error message in the last line
> and there is only ONE xml output file, not two as expected.
> This also happens with the papi_profile_cycles.xml configuration file.
>
> What's wrong?
>
> Regards,
> Jie Jiang
>
>
>
> Rick Kufrin wrote:
> > Jie,
> >
> > My guess is that what is happening here is related to the use of the "itimer.xml" configuration file. The problem is that signal delivery is not defined with POSIX threads, and the results are unpredictable. POSIX threads enter the picture when you are using OpenMP.
> >
> > Does your system happen to have kernel support for hardware counters? If so, you may have better luck by profiling with performance counters such as total cycles rather than itimers.
> >
> > Rick
|
From: Robbie <jj...@nu...> - 2009-08-27 14:25:41
|
Rick,

Today I have tested perfsuite with NPB-OMP benchmarks on an Xeon/Linux machine with icc/ifort compiler. Luckily, all OpenMP benchmarks finish successfully and performance data files are generated as expected.

I'm not clear about the reason behind this problem. Maybe the difference between icc/ifort and gcc/gfortran can be a useful indication.

Jie

2009-08-26 at 08:12 -0500, rk...@il... wrote:
> Jie,
>
> Thanks for reporting on your further experiments. With the issue still present when using PAPI, it seems similar to an issue we have seen on our Altix. Unfortunately, this remains an unresolved issue that may be related to the way that psrun operates internally. I'm afraid I do not have a solution at present, but I have found that using the PerfSuite API directly produces the proper results. If you are able and willing to do so, using the API involves inserting a call to the following routines:
>
> call psf_hwpc_init() - from the main thread
> call psf_hwpc_start() - from within a parallel region
> call psf_hwpc_stop(filename) - from within a parallel region
>
> There is an example (in C) in the PerfSuite distribution of calling the API from an OpenMP program. You will find it in:
>
> $PREFIX/share/perfsuite/examples/cpi/cpi-omp.c
>
> I'm afraid I do not have a better solution at this time, but it is an issue we are following up on.
>
> Rick
>
>
> ----- Original Message -----
> From: "Robbie" <jj...@nu...>
> To: "Rick Kufrin" <rk...@il...>
> Cc: per...@li...
> Sent: Wednesday, August 26, 2009 7:27:05 AM GMT -06:00 US/Canada Central
> Subject: Re: [PerfSuite-users] Problems about using Perfsuite to monitor OpenMP program (NPB-3.2.1)
>
> Rick,
>
> Thanks for your suggestion.
> However, when I tried to monitor an OpenMP program with the default
> configuration file, I still got some errors and the data files are not
> created as expected.
>
> Followings are the platform information and how I measure the NPB-OMP
> program with psrun.
> Note here the target OpenMP program is compiled by gcc/gfortran with
> -fopenmp option.
>
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ uname -a
> Linux UT43 2.6.27-perfctr #2 SMP Tue Apr 28 20:29:12 CST 2009 i686
> GNU/Linux
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ perfex -i
> PerfCtr Info:
> abi_version      0x05020501
> driver_version   2.6.37 DEBUG
> cpu_type         14 (Intel Pentium M)
> cpu_features     0x7 (rdpmc,rdtsc,pcint)
> cpu_khz          798049
> tsc_to_cpu_mult  1
> cpu_nrctrs       2
> cpus             [0], total: 1
> cpus_forbidden   [], total: 0
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ ls
> bt.A ep.A is.A is.B
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ export OMP_NUM_THREADS=2
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ psrun -p ./is.A
>
>
> NAS Parallel Benchmarks (NPB3.2-OMP) - IS Benchmark
>
> Size: 8388608 (class A)
> Iterations: 10
> Number of available threads: 2
>
>
> iteration
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> 10
>
>
> IS Benchmark Completed
> Class = A
> Size = 8388608
> Iterations = 10
> Time in seconds = 1.46
> Total threads = 2
> Avail threads = 2
> Mop/s total = 57.29
> Mop/s/thread = 28.64
> Operation type = keys ranked
> Verification = SUCCESSFUL
> Version = 3.2.1
> Compile date = 25 Aug 2009
>
> Compile options:
> CC = gcc
> CLINK = $(CC)
> C_LIB = -lm
> C_INC = (none)
> CFLAGS = -O -g -fopenmp
> CLINKFLAGS = -O -fopenmp
>
>
> Please send all errors/feedbacks to:
>
> NPB Development Team
> np...@na...
>
> Inconsistency detected by ld.so: dl-close.c: 719: _dl_close: Assertion
> `map->l_init_called' failed!
>
>
> jiejiang@UT43:~/NPB3.2.1/NPB3.2-OMP/bin$ ls
> bt.A ep.A is.A is.A.0.30613.UT43.xml is.B
>
>
> The program execution finishes with the error message in the last line
> and there is only ONE xml output file, not two as expected.
> This also happens to papi_profile_cycles.xml configuration file.
>
> What's wrong?
>
> Regards,
> Jie Jiang
>
>
>
> Rick Kufrin wrote:
> > Jie,
> >
> > My guess is that what is happening here is related to the use of the "itimer.xml" configuration file. The problem is that signal delivery is not defined with POSIX threads, and the results are unpredictable. POSIX threads enter the picture when you are using OpenMP.
> >
> > Does your system happen to have kernel support for hardware counters? If so, you may have better luck by profiling with performance counters such as total cycles rather than itimers.
> >
> > Rick
> >
|