From: Mala A. <ma...@us...> - 2002-08-22 17:23:08
> On Wed, Aug 21, 2002 at 01:07:09PM -0500, Mala Anand wrote:
> > > On Wed, Aug 21, 2002 at 11:59:44AM -0500, Mala Anand wrote:
> > > > The patch reduces the number of cycles by 25%
> >
> > > The data you are reporting is flawed: where are the average cycle
> > > times spent in __kfree_skb with the patch?
> >
> > I measured the cycles for only the initialization code in alloc_skb
> > and __kfree_skb. Since the init code is removed from __kfree_skb,
> > no cycles are spent there.
>
> Then the testing technique is flawed. You should include all of the
> operations included in an alloc_skb/kfree_skb pair in order to see
> the overall effect of the change, otherwise your change could have a
> net negative effect which would not be noticed.

Cycles for the whole routines alloc_skb and __kfree_skb are as follows:

Baseline 2.5.25
---------------
alloc/free average cycles
-------------------------
Runs:      1st        2nd        3rd
CPU0:   337/1163   336/1132   304/1100
CPU1:   318/1164   309/1153   311/1127

2.5.25 + skbinit patch
----------------------
alloc/free average cycles
-------------------------
Runs:      1st        2nd        3rd
CPU0:   447/1015   580/846    402/905
CPU1:   419/1003   383/915    547/856

The above figures indicate that the cycles spent in alloc_skb plus
__kfree_skb improved by about 5% in the patched case. If you take the
absolute cycles and average them over the three runs, the saving comes
to around 145 cycles, which is close to what I posted earlier when
measuring just the changed code.

As the scope of the measured code widens, the percentage improvement
comes down. So the first two scopes, (1) measuring the cycles spent in
the changed code and (2) measuring the cycles spent in all of
alloc_skb and __kfree_skb, give consistent results. The third scope
would be measuring this patch in a workload environment. We measured
it in a web-serving workload and found a 0.7% improvement.

I would like to stress again that this patch helps only when the
allocations and frees occur on two different CPUs. I measured it on a
uniprocessor system and saw no impact.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail: ma...@us...
http://www-124.ibm.com/developerworks/opensource/linuxperf
http://www-124.ibm.com/developerworks/projects/linuxperf
Phone: 838-8088; Tie-line: 678-8088
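[Editor's note: one way to reconcile the ~5% figure with the ~145-cycle
figure above. This is my own back-of-envelope reading, not from the
original mail; it assumes the 145 cycles comes from summing the
per-pair saving of the two CPUs. A minimal check in C, using only the
numbers in the tables:

#include <stdio.h>

/* alloc+free cycle totals per run, taken from the tables above */
static const double base[6]  = { 337+1163, 336+1132, 304+1100,   /* CPU0 */
                                 318+1164, 309+1153, 311+1127 }; /* CPU1 */
static const double patch[6] = { 447+1015, 580+846,  402+905,
                                 419+1003, 383+915,  547+856 };

static double avg(const double *v, int n)
{
	double s = 0;
	int i;

	for (i = 0; i < n; i++)
		s += v[i];
	return s / n;
}

int main(void)
{
	double b = avg(base, 6), p = avg(patch, 6);

	/* per alloc/free pair: ~1459 vs ~1386 cycles, about 5% */
	printf("saving per pair: %.1f cycles (%.1f%%)\n",
	       b - p, 100 * (b - p) / b);
	/* summed over the two CPUs this is ~145 cycles */
	printf("summed over 2 CPUs: %.1f cycles\n", 2 * (b - p));
	return 0;
}

This prints a saving of roughly 73 cycles per pair (5.0%), or about
145 cycles summed across both CPUs, consistent with both of Mala's
numbers.]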
From: Mala A. <ma...@us...> - 2002-08-23 14:45:36
Dave Hansen wrote:
> Mala Anand wrote:
> > The third scope would be measuring this patch in a workload
> > environment. We measured it in a web serving workload and found
> > that we get 0.7% improvement.
>
> First of all, the patch doesn't apply at all against the current
> bitkeeper tree. I can post the exact one I used if you like.
>
> I tried this under our Specweb99 setup. Here's a snippet of
> readprofile with, then without the patch:
>
> alloc:free ratio: 1.226
> (__kfree_skb+alloc_skb)/total = 3.14%
>
> alloc:free ratio: 0.348
> (__kfree_skb+alloc_skb)/total = 2.79%
>
> You can see the entire readprofile here:
> http://www.sr71.net/~specweb99/run-specweb-100sec-2400-2.5.31-bk+4-kmap-08-22-2002-11.20.17/
> http://www.sr71.net/~specweb99/run-specweb-100sec-2400-2.5.31-bk+4-kmap-mala-08-22-2002-11.44.25/
>
> No, I don't know why I have so much idle time.

Readprofile ticks are not as accurate as the cycles I measured.
Moreover, readprofile can give misleading information because it
samples on timer interrupts: alloc_skb and __kfree_skb call memory
management routines, and interrupts are disabled in many parts of that
code. So I don't trust the readprofile data.

Regards,
Mala
From: Dave H. <hav...@us...> - 2002-08-23 16:39:50
Mala Anand wrote:
> Readprofile ticks are not as accurate as the cycles I measured.
> Moreover readprofile can give misleading information as it profiles
> on timer interrupts. The alloc_skb and __kfree_skb call memory
> management routines and interrupts are disabled in many parts of
> that code. So I don't trust the readprofile data.

I don't believe your results to be accurate. They may be _precise_
for a small case, but you couldn't have been measuring them for very
long. A claim of accuracy requires a large number of samples, which
you apparently did not do.

I can't use oprofile or other NMI-based profilers on my hardware, so
we'll just have to guess. Is there any chance that you have access to
a large Specweb setup on hardware that is close to mine and can run
oprofile?

Where are interrupts disabled? I just went through a set of kernprof
data and traced up the call graph. In the most common __kfree_skb
case, I do not believe that it has interrupts disabled. I could be
wrong, but I didn't see it.

http://www.sr71.net/~specweb99/old/run-specweb-2300-nodynpost-2.5.31-bk+profilers-08-14-2002-02.19.22/callgraph

The end result, as I see it, is that your patches hurt Specweb
performance. They moved the profile around, but there was an overall
decline in performance. They partly address the symptom, but not the
real problem. We don't need to _tune_ it, we need to fix it. The
e1000s need to allocate/free fewer skbs. NAPI's polling mode _should_
help this, or at least make it possible to batch them up.

-- 
Dave Hansen
hav...@us...
From: Bill H. <ha...@au...> - 2002-08-23 20:14:56
Dave Hansen wrote:
> Mala Anand wrote:
> > Readprofile ticks are not as accurate as the cycles I measured.
> > Moreover readprofile can give misleading information as it
> > profiles on timer interrupts. The alloc_skb and __kfree_skb call
> > memory management routines and interrupts are disabled in many
> > parts of that code. So I don't trust the readprofile data.
>
> I don't believe your results to be accurate. They may be _precise_
> for a small case, but you couldn't have been measuring them for very
> long. A claim of accuracy requires a large number of samples, which
> you apparently did not do.

Dave,

What is your definition of a "very long time"?

Read the 1st email. There were 2.4 million samples.

How many do you think is sufficient?

> I can't use oprofile or other NMI-based profilers on my hardware, so
> we'll just have to guess. Is there any chance that you have access
> to a large Specweb setup on hardware that is close to mine and can
> run oprofile?

Why do you think oprofile is a better way to measure this?

BTW, Mala works with Troy Wilson, who is running SPECweb99 on an 8-way
system using Apache. Troy has run with Mala's patch and that data will
be posted.

> Where are interrupts disabled? I just went through a set of kernprof
> data and traced up the call graph. In the most common __kfree_skb
> case, I do not believe that it has interrupts disabled. I could be
> wrong, but I didn't see it.

What is the relevance of the above?

> The end result, as I can see it, is that your patches hurt Specweb
> performance.

Based on what? A callgraph? A profile?

Bill
From: Dave H. <hav...@us...> - 2002-08-23 20:30:45
Bill Hartner wrote:
> Dave Hansen wrote:
>> I don't believe your results to be accurate. They may be _precise_
>> for a small case, but you couldn't have been measuring them for
>> very long. A claim of accuracy requires a large number of samples,
>> which you apparently did not do.
>
> What is your definition of a "very long time"?
>
> Read the 1st email. There were 2.4 million samples.
>
> How many do you think is sufficient?

I must have misunderstood the data from the first email. I was under
the impression that it was much smaller than that number.

>> I can't use oprofile or other NMI-based profilers on my hardware,
>> so we'll just have to guess. Is there any chance that you have
>> access to a large Specweb setup on hardware that is close to mine
>> and can run oprofile?
>
> Why do you think oprofile is a better way to measure this?

Mala's main complaint about readprofile is that it cannot profile
while interrupts are disabled. oprofile's NMI-based ticks cannot be
masked; they _always_ occur.

> BTW, Mala works with Troy Wilson, who is running SPECweb99 on an
> 8-way system using Apache. Troy has run with Mala's patch and that
> data will be posted.

I look forward to seeing it.

>> Where are interrupts disabled? I just went through a set of
>> kernprof data and traced up the call graph. In the most common
>> __kfree_skb case, I do not believe that it has interrupts disabled.
>> I could be wrong, but I didn't see it.
>
> What is the relevance of the above?

Mala's main complaint about readprofile is that it cannot profile
while interrupts are disabled. I didn't see the case where it was
being called with interrupts disabled. I was hoping that you could
point it out to me.

-- 
Dave Hansen
hav...@us...
From: Troy W. <tc...@te...> - 2002-08-23 23:36:51
I found the number of simultaneous connections under SPECweb99 * to be
improved by ~1% when using Mala's SKB patch.

2.5.25 baseline    = 2656 simultaneous connections
2.5.25 + SKB patch = 2688 simultaneous connections

* SPEC(tm) and the benchmark name SPECweb(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only and is
non-compliant, with the following deviations from the rules:

1 - It was run on hardware that does not meet the SPEC
    availability-to-the-public criteria. The machine is an
    engineering sample.
2 - access_log wasn't kept for full accounting. It was being written,
    but deleted every 200 seconds.
From: David S. M. <da...@re...> - 2002-08-23 22:56:54
From: Dave Hansen <hav...@us...>
Date: Fri, 23 Aug 2002 09:39:13 -0700

   Where are interrupts disabled? I just went through a set of
   kernprof data and traced up the call graph. In the most common
   __kfree_skb case, I do not believe that it has interrupts
   disabled. I could be wrong, but I didn't see it.

That's completely right. Interrupts should never be disabled when
__kfree_skb is executed. It used to be possible when we allowed it to
be invoked from interrupt handlers, but that is illegal now, and we
have kfree_skb_irq, which just reschedules the actual __kfree_skb to a
software interrupt.

So I agree with you; Mala's claims seem totally bogus and not well
founded at all.
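[Editor's note: for readers unfamiliar with the deferral Dave
describes, here is a sketch, from memory, of how kernels of that
vintage pushed frees out of hard-IRQ context into the NET_TX softirq.
The identifiers (dev_kfree_skb_irq, softnet_data, completion_queue,
cpu_raise_softirq) are my recollection of the 2.5-era net code and may
not match the exact tree under discussion:

/* Sketch only: a driver in hard-IRQ context must not call __kfree_skb
 * directly; it queues the skb on a per-CPU list and raises a softirq.
 * The actual __kfree_skb then runs later, with interrupts enabled,
 * in softirq context.
 */
static inline void dev_kfree_skb_irq(struct sk_buff *skb)
{
	if (atomic_dec_and_test(&skb->users)) {
		int cpu = smp_processor_id();
		unsigned long flags;

		local_irq_save(flags);
		skb->next = softnet_data[cpu].completion_queue;
		softnet_data[cpu].completion_queue = skb;
		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ); /* free happens later */
		local_irq_restore(flags);
	}
}
]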
From: Bill H. <ha...@au...> - 2002-08-24 03:43:35
Dave Hansen wrote:
> I can't use oprofile or other NMI-based profilers on my hardware, so
> we'll just have to guess. Is there any chance that you have access
> to a large Specweb setup on hardware that is close to mine and can
> run oprofile?

Troy,

Can you post the latest profile from your SPECweb99 run to the list.

Bill
From: Rick L. <ric...@us...> - 2002-08-23 15:48:59
    Readprofile ticks are not as accurate as the cycles I measured.
    Moreover readprofile can give misleading information as it
    profiles on timer interrupts. The alloc_skb and __kfree_skb call
    memory management routines and interrupts are disabled in many
    parts of that code. So I don't trust the readprofile data.

So how did you obtain these measurements then, and what makes them
more accurate?

Rick
From: Bill H. <ha...@au...> - 2002-08-23 19:32:20
Rick Lindsley wrote:
> So how did you obtain these measurements then, and what makes them
> more accurate?

On IA32, on a per-CPU basis:

	pushf
	cli
	count += 1;
	cycles -= get_cycles();
	.
	.
	.
	cycles += get_cycles();
	popf;

	cycles / count = avg cpu cycles

Bill
From: Rick L. <ric...@us...> - 2002-08-23 20:29:19
    pushf
    cli
    count += 1;
    cycles -= get_cycles();
    .
    .
    .
    cycles += get_cycles();
    popf;

    cycles / count = avg cpu cycles

Thanks. So it appears you disable interrupts during the functions now,
when before you did not? Do you think that not having the cache get
filled with interrupt-related data might itself have a positive effect
on the code?

(On a side note, is this done by hand-editing assembly code, or is
there a tool you use to do this?)

Reducing cycles is nice, but it is quite possible to optimize small
sections of code at the expense of others. Looking at the larger
picture ensures that any improvement is real. So far, the data
presented has not been presented in a larger context, except through
Dave's quick Specweb run. That run suggested there was little or no
gain with this change.

Rick
From: Bill H. <ha...@au...> - 2002-08-24 00:12:45
Rick Lindsley wrote:
>     pushf
>     cli
>     count += 1;
>     cycles -= get_cycles();
>     ...
>     cycles += get_cycles();
>     popf;
>
>     cycles / count = avg cpu cycles
>
> Thanks. So it appears you disable interrupts during the functions
> now, when before you did not? Do you think that not having the cache
> get filled with interrupt-related data might itself have a positive
> effect on the code?

Maybe. Doubtful. Interrupts were also disabled on the baseline, and
for more cycles.

> (On a side note, is this done by hand-editing assembly code, or is
> there a tool you use to do this?)

#include <linux/smp.h>      /* smp_processor_id(), NR_CPUS */
#include <asm/system.h>     /* local_irq_save/local_irq_restore */
#include <asm/timex.h>      /* get_cycles() */

struct _stats {
	unsigned long cnt;
	signed long long cycles;
	unsigned long pad[5];   /* keep each CPU's stats on its own cacheline */
} mystats[NR_CPUS] = {{0}};

void foo(void)
{
	unsigned long flags;
	int this_cpu = smp_processor_id();

	local_irq_save(flags);  /* keep interrupts out of the timed window */
	mystats[this_cpu].cnt++;
	mystats[this_cpu].cycles -= get_cycles();

	/* ... something you want to measure ... */

	mystats[this_cpu].cycles += get_cycles();
	local_irq_restore(flags);
}

Use the symbol map and read /dev/kmem to retrieve the data from
mystats.

> Reducing cycles is nice, but it is quite possible to optimize small
> sections of code at the expense of others. Looking at the larger
> picture ensures that any improvement is real. So far, the data
> presented has not been presented in a larger context, except through
> Dave's quick Specweb run. That run suggested there was little or no
> gain with this change.

Dave did not present his data. Did I miss it?

If I am not mistaken, Dave's SPECweb99 results showed a 3% improvement
on a NUMA system. Correct?

Mala will post SPECweb99 data from a conforming run, with the
exceptions that logging is disabled and the hardware is pre-release.
This run also shows an improvement.

Bill
From: Rick L. <ric...@us...> - 2002-08-23 20:51:20
    Read the 1st email. There were 2.4 million samples.

    How many do you think is sufficient?

I looked at my hand 2.4 million times and it was not wet each time.
Therefore, it is not raining. Of course, if I am inside a roofed
structure, the sampling is faulty. And (correct me if I'm wrong here,
Dave) I think that's what we're asking about. Are the samples you're
getting pertinent and significant? If, as you suggested in another
email, you disable interrupts in the functions to take these
measurements, you may be significantly altering the very environment
you hope to measure.

    Why do you think oprofile is a better way to measure this?

    BTW, Mala works with Troy Wilson, who is running SPECweb99 on an
    8-way system using Apache. Troy has run with Mala's patch and
    that data will be posted.

That will be helpful. Microbenchmarks which measure cycles are far
less interesting to the community than the end results of actual
workloads. Note that Mala said "I measured the cycles for only the
initialization code in alloc_skb and __kfree_skb", which could mean
that even other parts of alloc_skb() or __kfree_skb() may have gotten
worse and you would not have known. Later she admits, "As the scope of
the code measured widens the percentage improvement comes down" and
finally observes, "We measured it in a web serving workload and found
that we get 0.7% improvement", which is practically in the noise.
Dave's observation was that it was slightly worse (0.35%). Either
could be statistical noise. If the patch only creates statistical
noise, the community won't be interested.

Also, it is well known and widely recognized that more CPUs add
increasing complexity with cache and code interactions. Have you
tested this on an 8-way machine, rather than a 2-way, and do the
results still hold? Things which look very good on 2-proc can start to
lose their lustre on 8-proc or bigger.

I'm unfamiliar with netperf -- does it yield "results" which can be
compared? If so, since it was used to generate the load, how did the
results of the two runs compare?

Rick
From: Mala A. <ma...@us...> - 2002-08-23 23:14:38
Rick wrote:
> Note that Mala said "I measured the cycles for only the
> initialization code in alloc_skb and __kfree_skb", which could mean
> that even other parts of alloc_skb() or __kfree_skb() may have
> gotten worse and you would not have known.

Please look at my reply to Ben LaHaise, which has the cycles for all
of alloc_skb() and __kfree_skb(). You don't have to guess that.

> Later she admits, "As the scope of the code measured widens the
> percentage improvement comes down" and finally observes, "We
> measured it in a web serving workload and found that we get 0.7%
> improvement", which is practically in the noise.

Those were initial results, from a version that included more than the
posted patch. We are still working on getting numbers.

> Dave's observation was that it was slightly worse (0.35%).

Are you basing this 0.35% degradation on your profile? According to
Dave's SPECweb99 results there is a 2.97% improvement in simultaneous
connections with my patch. Is that right, Dave?

Regards,
Mala
From: Mala A. <ma...@us...> - 2002-08-23 23:40:01
David S. Miller wrote:
>    Where are interrupts disabled? I just went through a set of
>    kernprof data and traced up the call graph. In the most common
>    __kfree_skb case, I do not believe that it has interrupts
>    disabled. I could be wrong, but I didn't see it.
>
> That's completely right. Interrupts should never be disabled when
> __kfree_skb is executed. It used to be possible when we allowed it
> to be invoked from interrupt handlers, but that is illegal now, and
> we have kfree_skb_irq, which just reschedules the actual __kfree_skb
> to a software interrupt.
>
> So I agree with you; Mala's claims seem totally bogus and not well
> founded at all.

To name a few: interrupts are disabled when skbs are put back on the
hot list, and when the cache list is accessed in the slab allocator.
Am I missing something? Please help me to understand.

Regards,
Mala
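[Editor's note: for context on the hot list Mala refers to, here is a
sketch, from memory, of the per-CPU skb-head pooling in 2.4/2.5-era
net/core/skbuff.c. The identifiers (skb_head_to_pool, skb_head_pool,
sysctl_hot_list_len) are my recollection and may not match the exact
tree, but the local_irq_save() around the pool access is the point
being made:

/* Approximate shape of the hot-list free path of that era */
static inline void skb_head_to_pool(struct sk_buff *skb)
{
	struct sk_buff_head *list = &skb_head_pool[smp_processor_id()].list;

	if (skb_queue_len(list) < sysctl_hot_list_len) {
		unsigned long flags;

		/* interrupts are disabled while the per-CPU pool is touched */
		local_irq_save(flags);
		__skb_queue_head(list, skb);
		local_irq_restore(flags);
		return;
	}
	/* pool full: fall back to the slab allocator, which also
	 * disables interrupts around its per-CPU cache lists */
	kmem_cache_free(skbuff_head_cache, skb);
}
]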
From: David S. M. <da...@re...> - 2002-08-24 00:11:23
From: "Mala Anand" <ma...@us...>
Date: Fri, 23 Aug 2002 18:38:59 -0500

   To name a few: interrupts are disabled when skbs are put back on
   the hot list

A few cycles, at best; it should not be enough to skew the profiling.

   and when the cache list is accessed in the slab allocator. Am I
   missing something? Please help me to understand.

Which means that if this were enough to skew the profiling, the
program counter of the interrupt-enable instruction in the SLAB code
would show up clearly in the profiles.
From: Rick L. <ric...@us...> - 2002-08-27 20:16:35
    Thanks. So it appears you disable interrupts during the functions
    now, when before you did not? Do you think that not having the
    cache get filled with interrupt-related data might itself have a
    positive effect on the code?

    Maybe.

Do you mean "maybe" you disable interrupts when before you did not?
If so, I think we need to be more certain about this before we'll
convince the community at large of its validity.

    Doubtful. Interrupts were also disabled on the baseline, and for
    more cycles.

But here's the problem: both runs, then, were made on code that will
behave differently when doing anything but measuring cycles. They will
never get hit by an interrupt -- an environment not present in any
uninstrumented run. Any gain you see may be accentuated in this
cache-warm and cache-stable situation, but minimized or even
eliminated when that cycle gain is obscured by cache misses or even
just the context switches to/from interrupts. Interrupts can be the
most common cause of caches getting filled with "useless" data, so
eliminating them can cause a significant skew between "experimental"
and "real-world" results.

Why not try a run without blocking interrupts? The cycles counted will
not, of course, represent the cycles utilized solely by the routine,
but it should provide some more data to tell you whether the overall
performance of the routine has improved enough to make the patch
worthwhile. Or perhaps the real patch here is to disable interrupts
during a time when you don't want the cache to be disturbed, for
performance reasons.

    Reducing cycles is nice, but it is quite possible to optimize
    small sections of code at the expense of others. Looking at the
    larger picture ensures that any improvement is real. So far, the
    data presented has not been presented in a larger context, except
    through Dave's quick Specweb run. That run suggested there was
    little or no gain with this change.

    Dave did not present his data. Did I miss it?

I'm referring to the profiling run. In the run with the patch, it
revealed:

    (__kfree_skb+alloc_skb)/total = 3.14%

and the run without the patch:

    (__kfree_skb+alloc_skb)/total = 2.79%

That is to say, profiling found __kfree_skb+alloc_skb to be greater
with the patch than without. The sum of cycles for these routines is
what Mala found to have improved in her narrower measurements, so it
was the result I was referring to above.

    If I am not mistaken, Dave's SPECweb99 results showed a 3%
    improvement on a NUMA system. Correct?

That I don't know -- I was responding to the profiling post.

    Mala will post SPECweb99 data from a conforming run, with the
    exceptions that logging is disabled and the hardware is
    pre-release. This run also shows an improvement.

I haven't seen this come through yet. Since it sounds like you already
have it in hand, will this be posted soon?

Rick
From: Mala A. <ma...@us...> - 2002-08-28 14:50:38
Rick Lindsley wrote:
> I'm referring to the profiling run. In the run with the patch, it
> revealed:
>     (__kfree_skb+alloc_skb)/total = 3.14%
> and the run without the patch:
>     (__kfree_skb+alloc_skb)/total = 2.79%
> That is to say, profiling found __kfree_skb+alloc_skb to be greater
> with the patch than without.

I looked at Dave Hansen's profile and results. The following are my
calculations:

                   Baseline:   With Patch:
                   ---------   -----------
totalticks:          628695       623160
poll_idle:           225023       225838
alloc_skb:             4535        10778
__kfree_skb:          13001         8788
(total-poll_idle):   403672       397322

Baseline:  (__kfree_skb+alloc_skb)/(total-poll_idle) = 0.043
WithPatch: (__kfree_skb+alloc_skb)/(total-poll_idle) = 0.049

The above data does indicate that in the patched case we spent 0.006,
or 0.6%, more ticks in these routines. But it does not end there; you
also need to look at the results. The SPECweb99 results indicate that
2.97% more concurrent connections are achieved. The higher the
throughput, the more skbs are used, and that accounts for the extra
ticks spent in these routines. Remember that the readprofile data does
not tell you how many times these routines are called. This is how I
read his data.

I also understand that Dave's SPECweb99 runs are not compliant, so he
has high variance and I cannot rely on his results.

> > Mala will post SPECweb99 data from a conforming run, with the
> > exceptions that logging is disabled and the hardware is
> > pre-release. This run also shows an improvement.
>
> I haven't seen this come through yet. Since it sounds like you
> already have it in hand, will this be posted soon?

It was posted to lkml last week by Troy Wilson under the SKB
Initialization thread.

Regards,
Mala
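[Editor's note: a quick check of the arithmetic above; a sketch of
mine using only the tick counts quoted from Dave's profiles:

#include <stdio.h>

int main(void)
{
	/* tick counts quoted above from Dave Hansen's readprofile runs */
	double base_total = 628695, base_idle = 225023;
	double base_alloc = 4535,   base_free = 13001;
	double pat_total  = 623160, pat_idle  = 225838;
	double pat_alloc  = 10778,  pat_free  = 8788;

	printf("baseline:   %.3f\n",
	       (base_alloc + base_free) / (base_total - base_idle)); /* ~0.043 */
	printf("with patch: %.3f\n",
	       (pat_alloc + pat_free) / (pat_total - pat_idle));     /* ~0.049 */
	return 0;
}

Both ratios reproduce Mala's figures, so the disagreement below is
about whether subtracting poll_idle is a valid interpretation, not
about the arithmetic.]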
From: Rick L. <ric...@us...> - 2002-08-28 18:33:12
    I looked at Dave Hansen's profile and results. The following are
    my calculations:

                       Baseline:   With Patch:
    totalticks:          628695       623160
    poll_idle:           225023       225838
    alloc_skb:             4535        10778
    __kfree_skb:          13001         8788
    (total-poll_idle):   403672       397322

    Baseline:  (__kfree_skb+alloc_skb)/(total-poll_idle) = 0.043
    WithPatch: (__kfree_skb+alloc_skb)/(total-poll_idle) = 0.049

I don't understand why you are subtracting out the idle time -- that's
all part of the sampling too. It doesn't make much statistical sense,
and it's things like this that cause people to question other data and
conclusions. Still ...

    I also understand that Dave's SPECweb99 runs are not compliant,
    so he has high variance and I cannot rely on his results.

I agree, in part. It's not a valid SPECweb run, and I don't think Dave
ever claimed it was. Still, it is a "workload" in the sense that it
exercises the code, and while it may not be valid as a SPECweb run, it
does serve to raise some of the questions we are addressing.

    > Mala will post SPECweb99 data from a conforming run, with the
    > exceptions that logging is disabled and the hardware is
    > pre-release. This run also shows an improvement.
    >
    > I haven't seen this come through yet. Since it sounds like you
    > already have it in hand, will this be posted soon?

    It was posted to lkml last week by Troy Wilson under the SKB
    Initialization thread.

Ah, sorry, OK. The above said it would be coming from you, so I wasn't
sure if that was what we should look at. So ultimately, then, SPECweb
measured a 1.2% improvement, and netperf3 was 0.7% (was that the "web
serving workload" you mentioned, or was that something else?) I
presume Troy's SPECweb was on an 8-way, while the netperf3 was on the
2-way you cited for cycle measurements?

Rick
From: Mala A. <ma...@us...> - 2002-08-28 20:48:20
Rick Lindsley wrote:
> I don't understand why you are subtracting out the idle time --
> that's all part of the sampling too. It doesn't make much
> statistical sense, and it's things like this that cause people to
> question other data and conclusions. Still ...

Because there is a gain in idle time with the patch. A performance
gain is reflected not only in throughput but also in CPU utilization.

> > I also understand that Dave's SPECweb99 runs are not compliant,
> > so he has high variance and I cannot rely on his results.
>
> I agree, in part. It's not a valid SPECweb run, and I don't think
> Dave ever claimed it was. Still, it is a "workload" in the sense
> that it exercises the code, and while it may not be valid as a
> SPECweb run, it does serve to raise some of the questions we are
> addressing.

There is variance between runs, according to Martin, so it does not
even make sense to compare two runs. Before making any comparisons
like this, you have to make sure that the results are repeatable.

Regards,
Mala
From: Rick L. <ric...@us...> - 2002-08-28 21:22:02
    Because there is a gain in idle time with the patch. A
    performance gain is reflected not only in throughput but also in
    CPU utilization.

But it is an unnecessarily misleading and unusual interpretation of
the data, which you didn't explain until challenged. Claiming that
something ran X% of the non-idle time is rather like claiming that
this year it rained X% of the days on which the sky was not sunny.
It's not what people expect, and it makes it look like you're twisting
data in mathematical ways to offer more impressive numbers.

Since you consider idle time important, what was the idle time in the
SPECweb numbers you *did* use? That wasn't included in Troy's brief
note. Did the idle-time gain hold in a validated run?

    There is variance between runs, according to Martin, so it does
    not even make sense to compare two runs. Before making any
    comparisons like this, you have to make sure that the results are
    repeatable.

I would suggest that even with the variance, this data would have been
more interesting than the "cycle measurements" you chose to use. That
is an unfamiliar measurement tool to most of the community.

So what do you attribute the lower cycles to? "Warmer" cache, because
the skb is already being accessed by the CPU that frees it?

Rick
From: Mala A. <ma...@us...> - 2002-08-28 21:44:12
> Since you consider idle time important, what was the idle time in
> the SPECweb numbers you *did* use? That wasn't included in Troy's
> brief note. Did the idle-time gain hold in a validated run?

Troy's runs are 100% compliant. That means 100% of the CPU is utilized
at a conforming throughput rate.

> I would suggest that even with the variance, this data would have
> been more interesting than the "cycle measurements" you chose to
> use. That is an unfamiliar measurement tool to most of the
> community.

I guess that is just your opinion. After providing the information Ben
LaHaise asked for, I have not seen any complaint from the open source
community. You all didn't even look at Dave's data properly and jumped
to conclusions.

> So what do you attribute the lower cycles to? "Warmer" cache,
> because the skb is already being accessed by the CPU that frees it?

It doesn't look like you have followed the discussion on lkml.
Basically, a freed skb that was also initialized on CPU0 (as a result,
nearly 5 dirty cachelines are left in CPU0's L2 cache) gets allocated
on CPU1 and ends up wasting cycles there.

Regards,
Mala
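[Editor's note: to make the cross-CPU effect concrete, here is a
sketch of the idea as I read this thread, not the actual patch. The
assumption (mine) is that the baseline cleared skb fields at free
time, so the freeing CPU dirtied the cachelines, while the patch
defers all initialization to alloc_skb so that the allocating CPU,
which is about to use those fields anyway, takes the misses once:

/* Baseline (sketch): the free path scrubs fields for reuse, dirtying
 * several cachelines on the freeing CPU. If another CPU allocates
 * this skb next, it must pull those modified lines across the bus.
 */
void __kfree_skb_baseline(struct sk_buff *skb)
{
	skb->list = NULL;          /* these writes dirty the lines ... */
	skb->sk = NULL;
	skb->destructor = NULL;    /* ... on the *freeing* CPU */
	skb_head_to_pool(skb);
}

/* Patched (sketch): free does no field initialization; alloc_skb
 * initializes everything, so the dirtying happens on the CPU that
 * will actually use the skb.
 */
void __kfree_skb_patched(struct sk_buff *skb)
{
	skb_head_to_pool(skb);     /* just recycle the head */
}

On a uniprocessor the two schemes touch the same lines on the same
CPU, which would explain why Mala saw no impact there.]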
From: Dave H. <hav...@us...> - 2002-08-28 23:52:50
Mala Anand wrote:
> > Since you consider idle time important, what was the idle time in
> > the SPECweb numbers you *did* use? That wasn't included in Troy's
> > brief note. Did the idle-time gain hold in a validated run?
>
> Troy's runs are 100% compliant. That means 100% of the CPU is
> utilized at a conforming throughput rate.

There is absolutely no direct relationship between conformance and CPU
utilization. A 100% compliant load with 10 connections is going to use
quite a bit less CPU than a 100% compliant load with 2300 connections.
Yes, as you add more connections you use more CPU, but there is no
direct relationship.

In a perfect world, where the OS was never waiting on I/O, never
making scheduling mistakes, where the benchmark is calibrated
perfectly, and where there are no random variances in your testing
conditions, you will get 100% conforming connections, with every
single last tick of every single CPU in the system used in a
productive manner. I know for a fact that at most one of these
conditions is met in my testing environment. Troy's is no different.

One of the tenets of Specweb99 is that a compliant connection is
served at a rate of at least 320 kbytes/sec. For argument's sake,
let's pretend that we have a perfect world:

- I have a 32 Mbyte/sec pipe coming out of my server.
- There is no routing or transport-layer overhead.
- I have exactly enough CPU to satisfy 100 connections.
- My CPUs are 100% utilized.
- The OS is completely fair to each client.
- Each connection is served at 320 Kbytes/sec.
- There is 100% conformance when tested.

Now, I go to 101 connections. The bandwidth awarded to each one is the
same portion as before: total_bandwidth/clients = bandwidth_per_client,
or 316.831 Kbytes/sec. Since the OS is fair, each client gets this
much share, no more, no less. Since the aggregate bandwidth hasn't
changed, the CPU time hasn't changed either: still 100%. But,
according to Specweb, the compliance is 0.00000%, because all
connections are served at less than 320 kbytes/sec. Tricky, huh?

Conformance percentage across runs means basically zilch. You can
_never_ say that something is X% better because it increased
conformance by Y%. You may be able to make a case that RunA was better
than RunB, but be very, very careful.

When performance testing Linux, the area of maximum throughput is
rarely the most interesting place to analyze. We don't care when it is
working well; we want to know where it breaks! Take your test and
triple the requested connections. Now, that's interesting! See how
your patch holds up under severe torture. (The wli method.)

This is how I understand Specweb. I would be delighted if someone more
knowledgeable than me can teach me more about it.

> > I would suggest that even with the variance, this data would have
> > been more interesting than the "cycle measurements" you chose to
> > use. That is an unfamiliar measurement tool to most of the
> > community.
>
> I guess that is just your opinion. After providing the information
> Ben LaHaise asked for, I have not seen any complaint from the open
> source community. You all didn't even look at Dave's data properly
> and jumped to conclusions.

I've found silence in the open source community to be more damning
than open criticism.

-- 
Dave Hansen
hav...@us...
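[Editor's note: the conformance cliff Dave describes is easy to verify
numerically. A tiny sketch of mine, using the numbers from his thought
experiment (note his original mail said "32 Mbit/sec"; the arithmetic
of 100 connections at 320 Kbytes/sec requires 32 Mbytes/sec, which is
used here):

#include <stdio.h>

int main(void)
{
	double pipe_kbytes = 32000.0;  /* 32 Mbyte/sec pipe, as above */
	double floor_kbps  = 320.0;    /* Specweb99 conformance floor */
	int clients;

	for (clients = 100; clients <= 101; clients++) {
		double per_client = pipe_kbytes / clients;

		printf("%d clients: %.3f Kbytes/sec each -> %s\n",
		       clients, per_client,
		       per_client >= floor_kbps ? "100%% conform" : "0%% conform");
	}
	return 0;
}

At 100 clients each gets exactly 320.000 Kbytes/sec (100% conformance);
at 101 clients each gets 316.832 Kbytes/sec, and conformance drops to
0% even though CPU utilization and aggregate throughput are unchanged.]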
From: Andrew T. <hab...@us...> - 2002-08-29 02:24:13
> This is how I understand Specweb. I would be delighted if someone
> more knowledgeable than me can teach me more about it.

Dave, I agree the runs do not need to be compliant. I don't think
anyone here really cares if we follow SPEC's rules, because we are not
publishing; we are trying to make the kernel better. However, we
really do care that the results are repeatable. If we cannot have
that, there is no point at all. If the results are not consistent,
that implies the workload is not consistent, and if the workload is
not consistent, there's no way the profiling data will be either.

I know Specweb can be a PITA; 1.5+ hours for one datapoint is not what
I call fun. Anyone who runs it naturally wants to find a way to cut
down the times. But this is tricky! I'd bet SPEC designed their
workload in such a manner primarily to maintain consistency, and
modifying the rules tends to screw this up. I am very curious how much
time your Specweb runs take and how consistent your results are. This
info should be easy to get, since Specweb makes 3 runs/results for
every datapoint. What is the consistency between the 3 runs?

Maybe before we debate the merits of Mala's patch, we should agree on
how to run Specweb, or any benchmark that we use. In my case, running
NetBench, I try to maintain less than 0.5% variation between runs;
generally it's within 0.2%. If this window widens, I find out what is
wrong before I continue, because otherwise I just can't trust what I
am getting. I was able to reduce the test time from ~3 hours to 45
minutes without compromising this repeatability. Any shorter, and it
could not be trusted.

Anyway, I guess the point of my email is that we need to agree on a
valid way to measure and analyze performance. There's no point in
discussing the merits of any patch unless we get this out of the way
first.

-Andrew
From: Rick L. <ric...@us...> - 2002-08-28 21:51:18
    It doesn't look like you have followed the discussion on lkml.

No, I have been; I just wanted to make sure I understood before I
asked the next question.

    Basically, a freed skb that was also initialized on CPU0 (as a
    result, nearly 5 dirty cachelines are left in CPU0's L2 cache)
    gets allocated on CPU1 and ends up wasting cycles there.

So initializing these fields at allocation causes us to pull them in
because they are not in our cache, and moving the initialization to
the free causes us to access warm cache entries for them (presumably
they just got done being used for something). If this structure is
being allocated because the calling function wants to use it, doesn't
that just move the dirty-cache issue to another routine (which would
have had the cache warmed by the allocator before)? Stated another
way, how do you know you haven't just moved cache misses rather than
eliminating them?

Rick