From: <rw...@ea...> - 2002-07-13 16:31:43
> 6. Randy Hron - Trying to determine which patch causes an
> improvement in pipe throughput.  Believe it is due to
> irq balancing.

The 2.4.19-pre10-jam2 patches are identified:

irqrate lowers pipe latency measured by LMbench on SMP.
irqbalance lowers AF_UNIX latency.

> However on a networking benchmark Andrew
> Theurer saw a 10% decrease with the irq balancing patch.
> Randy will try it on different workloads if he determines
> it is the patch that causes the improvement.

I'm running my usual set of tests on 2.4.19pre10aa2 +
irqrate and irqbalance.  (pre10aa2 was the basis for jam2)

netperf is very configurable and is likely to be a valuable
addition to the things I run.  It would be helpful to know more
about the testing configuration and method that had a netperf
regression.

I.E.
how many processors in box?
how many samples?
networked or loopback?

If there was a script that ran the tests, that would help a lot.

--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
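For reference, the pipe-latency figure discussed here is LMbench's lat_pipe, a
ping-pong measurement: two processes bounce one byte back and forth through a
pair of pipes and the average round-trip time is reported.  The sketch below
only illustrates that style of test; it is not the LMbench source, and the
iteration count is an arbitrary example.

/*
 * Illustrative lat_pipe-style ping-pong: parent and child exchange one
 * byte through two pipes; the round-trip time is averaged.  Not the
 * LMbench code - a minimal sketch of what the metric measures.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

int main(void)
{
        int p2c[2], c2p[2];            /* parent->child and child->parent pipes */
        char byte = 'x';
        int i, iterations = 100000;    /* arbitrary example count */
        struct timeval start, end;
        double usecs;

        if (pipe(p2c) < 0 || pipe(c2p) < 0) {
                perror("pipe");
                exit(1);
        }
        if (fork() == 0) {             /* child: echo every byte back */
                for (i = 0; i < iterations; i++) {
                        read(p2c[0], &byte, 1);
                        write(c2p[1], &byte, 1);
                }
                exit(0);
        }
        gettimeofday(&start, NULL);
        for (i = 0; i < iterations; i++) {   /* parent: send, wait for echo */
                write(p2c[1], &byte, 1);
                read(c2p[0], &byte, 1);
        }
        gettimeofday(&end, NULL);
        wait(NULL);

        usecs = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec);
        printf("pipe round-trip latency: %.3f usec\n", usecs / iterations);
        return 0;
}

Each round trip costs two context switches, which is presumably why IRQ
placement changes on SMP show up so clearly in this number.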
From: <rw...@ea...> - 2002-07-14 17:07:40
> I have recently been experimenting with 2.4.19-rc1-aa2, and it is my
> understanding that Andrea has his own IRQbalance in this kernel (I am not
> sure exactly when/what version it appeared).

Andrea's irq-balance appeared in 2.4.19pre10aa3.

> At some point I will try netbench with that version with IRQbalance on/off.

That would be great.  BTW, was your test netperf or netbench?

> Any chance you could try it with your benchmarks?

Let's see what my benchmarks show with just using the irqrate and
irqbalance from pre10-jam2 on pre10aa2.  Based on that, it will be
clearer whether my benchmarks are useful for measuring the effect.

> Anyway, I still don't understand how Ingo's IRQbalance is making a difference.

It's apparently subtle.  I selected those latency metric differences
in jam and aa because they are large, and aa/jam share mostly the same
codebase.

>> networked or loopback?
> network

Excellent.  That's definitely a better environment for measuring IRQs.
Was the test on a switch?

BTW, 2.4.19-pre10-mjc1 had a few metrics that stood apart from most
other pre10 kernels.  pre10-mjc1 didn't have irqrate or irqbalance.
The mjc tree is based off the Alan Cox tree.

The differences between mjc and ac are greater than the differences
between jam and aa, so the cause may be harder to find.  This is an
SMP or hardware phenomenon as well; my older k6/2 uniprocessor box
doesn't have these differences.  Here are a few of the interesting ones:

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel                   Pipe AF_UNIX     UDP RPC/UDP     TCP RPC/TCP TCPconn
--------------------- ------- ------- ------- ------- ------- ------- -------
2.4.19-pre10-aa2       34.208  62.732 56.0924 58.9890 54.5218 80.1824  86.815
2.4.19-pre10-aa4       33.941  70.216 55.9494 59.7014 50.4220 83.0785  88.732
2.4.19-pre10-ac2       32.765  50.555 51.6682 60.9304 52.1370 84.1731  87.001
2.4.19-pre10-jam2       7.877  16.699 54.5461 61.2108 52.5340 80.4152  90.124
2.4.19-pre10-jam3      33.133  66.825 53.4208 60.0692 52.1661 83.4825  85.912
2.4.19-pre10-mjc1       8.727  14.756 22.8093 58.6576 34.0537 78.3135 105.548

So, pre10-mjc1 has the low latency in pipe/af_unix that Ingo's
irqrate/irqbalance give, but doesn't include those patches.  Also,
tcp and udp latency is unusually low.  tcp connect latency was higher.

On the bandwidth front, mjc1 has unusually high af_unix and tcp numbers.

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
kernel                   Pipe AF_UNIX     TCP
--------------------- ------- ------- -------
2.4.19-pre10           467.59  273.12  160.23
2.4.19-pre10-aa2       528.91  243.77  161.29
2.4.19-pre10-aa4       540.58  250.66  163.23
2.4.19-pre10-ac2        91.66  232.21  177.10
2.4.19-pre10-jam2      525.91  252.05  167.32
2.4.19-pre10-jam3      542.29  226.26  163.53
2.4.19-pre10-mjc1      540.09  538.72  373.39

The mjc1 differences could be compiler related too.  mjc1 used gcc-3.1
with -march=pentium3 -Os -mmmx -msse.  Perhaps running mjc1 with
gcc-2.96 and -march=i686 -O2 is the place to start with it.  Michael
J. Cohen said gcc-3.1 does checksumming better than the older gcc's.

--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
From: Andrew T. <hab...@us...> - 2002-07-15 15:07:45
On Sunday 14 July 2002 12:06, rw...@ea... wrote:
> > I have recently been experimenting with 2.4.19-rc1-aa2, and it is my
> > understanding that Andrea has his own IRQbalance in this kernel (I am not
> > sure exactly when/what version it appeared).
>
> Andrea's irq-balance appeared in 2.4.19pre10aa3.
>
> > At some point I will try netbench with that version with IRQbalance
> > on/off.
>
> That would be great.  BTW, was your test netperf or netbench?

Netbench only.  Netbench is a CIFS benchmark which happens to generate
a reasonable amount of network traffic.

> > Any chance you could try it with your benchmarks?
>
> Let's see what my benchmarks show with just using the irqrate and
> irqbalance from pre10-jam2 on pre10aa2.  Based on that, it will be
> clearer whether my benchmarks are useful for measuring the effect.
>
> > Anyway, I still don't understand how Ingo's IRQbalance is making a
> > difference.
>
> It's apparently subtle.  I selected those latency metric differences
> in jam and aa because they are large, and aa/jam share mostly the same
> codebase.

OK, sounds good.  I was just thinking, if Andrea's version of irqbalance
doesn't hurt my performance and helps your performance, then maybe we
have a winner.  But you are right, in your case the first step is to
identify the responsible patch, then we can move to step 2.

> >> networked or loopback?
> >
> > network
>
> Excellent.  That's definitely a better environment for measuring IRQs.
> Was the test on a switch?

For Netperf, there is no switch, just point to point to clients.

For Netbench, I do have two switches, vlan'd into 4 network subnets.
Each subnet has a Gbps link to the server and 12 100Mbps links to 12
clients, for a total of 4 Gbps on the server and 48 clients.

> BTW, 2.4.19-pre10-mjc1 had a few metrics that stood apart from most
> other pre10 kernels.  pre10-mjc1 didn't have irqrate or irqbalance.
> The mjc tree is based off the Alan Cox tree.
>
> The differences between mjc and ac are greater than the differences
> between jam and aa, so the cause may be harder to find.  This is an
> SMP or hardware phenomenon as well; my older k6/2 uniprocessor box
> doesn't have these differences.  Here are a few of the interesting ones:

Sorry if you mentioned this already, but is this being run on the 4-way
xeon system with the SMP kernel config option?  I have a 2-way PIII here,
maybe I can get together some results here and we can compare.

-Andrew Theurer
From: Mala A. <ma...@us...> - 2002-07-15 17:41:10
>> netperf is very configurable and is likely to be a valuable
>> addition to the things I run.  It would be helpful to know more
>> about the testing configuration and method that had a netperf
>> regression.
>>
>> I.E.
>> how many processors in box?
> 4 or 8

Use more cpus on the client to drive your server.
Run tcp_stream or tcp_rr.

>> how many samples?

Use the script file that is included in netperf 2.2; you will get
3 different socket buffer sizes using 5 or 6 message sizes.

>> networked or loopback?
> network

You can use both.

> Mala Anand (of IBM LTC) can help you with details.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail: ma...@us...
http://www-124.ibm.com/developerworks/opensource/linuxperf
http://www-124.ibm.com/developerworks/projects/linuxperf
Phone: 838-8088; Tie-line: 678-8088
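For anyone reproducing the netperf sweep by hand rather than with the
bundled script, the socket buffer and message sizes it varies correspond
to ordinary setsockopt() and write() parameters.  The sketch below shows
the sender side of a tcp_stream-style test; the port number and sizes are
hypothetical placeholders, not the values the netperf script uses.

/*
 * Sender side of a tcp_stream-style test: request an explicit socket
 * buffer size, then stream fixed-size messages to the server.
 * Hypothetical sketch - port and sizes are placeholders, not the
 * values netperf's script uses.
 */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int send_stream(const char *server_ip, int sockbuf_bytes, int msg_bytes, int msgs)
{
        static char buf[65536];        /* messages are sent from this zeroed buffer */
        struct sockaddr_in addr;
        int i, s = socket(AF_INET, SOCK_STREAM, 0);

        if (s < 0)
                return -1;
        if (msg_bytes > (int)sizeof(buf))
                msg_bytes = sizeof(buf);

        /* one of the "3 different socket buffer sizes" mentioned above */
        setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   &sockbuf_bytes, sizeof(sockbuf_bytes));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5001);            /* placeholder port */
        addr.sin_addr.s_addr = inet_addr(server_ip);
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                close(s);
                return -1;
        }

        for (i = 0; i < msgs; i++)              /* one of the "5 or 6 message sizes" */
                write(s, buf, msg_bytes);

        close(s);
        return 0;
}

A receiver that counts bytes against elapsed time gives the MB/s figure;
varying sockbuf_bytes and msg_bytes over a small grid reproduces the
sweep the script automates.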
From: <rw...@ea...> - 2002-07-16 01:41:18
>> Let's see what my benchmarks show with just using the irqrate and
>> irqbalance from pre10-jam2 on pre10aa2.  Based on that, it will be
>> clearer whether my benchmarks are useful for measuring the effect.

The box locked up after completing dbench on reiserfs.  This is about
halfway through the entire set of things I run.  So perhaps irqrate
and irqbalance were depending on something else in pre10-jam2, or the
patches needed some manual help (they applied with offsets).  pre10aa2
and pre10-jam2 by themselves were stable.

The kernel build test, which uses pipes a bit, was about 1% better
than other kernels running that test.  But I don't have the number
for pre10aa2.

> For Netbench, I do have two switches, vlan'd into 4 network subnets.
> Each subnet has a Gbps link to the server and 12 100Mbps links to 12
> clients, for a total of 4 Gbps on the server and 48 clients.

Very nice.  Sounds like a great environment for simulating more
realistic tests.

> Sorry if you mentioned this already, but is this being run on the 4-way
> xeon system with the SMP kernel config option?  I have a 2-way PIII here,
> maybe I can get together some results here and we can compare.

Yes.  4-way xeon config'd with SMP.  In the process of narrowing down
the patches, I ran lat_pipe and lat_unix 100 times per kernel.

>> BTW, 2.4.19-pre10-mjc1 had a few metrics that stood apart from most
>> other pre10 kernels.  pre10-mjc1 didn't have irqrate or irqbalance.

I'm running pre10-mjc1 with the standard compiler instead of gcc-3.1
to get a datapoint on what that changes on the quad xeon.

mjc1 also had HZ=1000, which may have helped the latency numbers.
The ~doubled tcp bandwidth in mjc would be nice to track down.
Looking at the individual runs, the min and max between mjc1 and
ac2 are close, but mjc1 got a lot more "good" runs.  I ran mjc1
a second time with preempt, and it had a similar number of "good"
runs.  ac tcp bandwidth is pretty consistent through the several
releases I've tested.

25 tcp_bw runs on 2.4.19-pre10-mjc1 (sorted)
318.35 MB/sec
315.23 MB/sec
314.11 MB/sec
312.78 MB/sec
312.64 MB/sec
310.19 MB/sec
307.38 MB/sec
306.02 MB/sec
297.75 MB/sec
293.60 MB/sec
292.40 MB/sec
292.20 MB/sec
289.76 MB/sec
288.82 MB/sec
283.44 MB/sec
279.03 MB/sec
278.45 MB/sec
271.68 MB/sec
154.95 MB/sec
154.81 MB/sec
154.60 MB/sec
154.28 MB/sec
153.50 MB/sec
151.04 MB/sec
148.61 MB/sec

25 tcp_bw runs on 2.4.19-pre10-ac2 (sorted)
308.27 MB/sec
289.02 MB/sec
276.38 MB/sec
260.48 MB/sec
259.65 MB/sec
246.61 MB/sec
223.86 MB/sec
221.59 MB/sec
221.20 MB/sec
156.19 MB/sec
155.98 MB/sec
155.70 MB/sec
155.65 MB/sec
155.61 MB/sec
155.22 MB/sec
154.67 MB/sec
154.60 MB/sec
154.06 MB/sec
153.99 MB/sec
153.96 MB/sec
153.66 MB/sec
153.54 MB/sec
153.41 MB/sec
151.60 MB/sec
151.34 MB/sec

Two things I'd like to know here:
1) Does this difference show up in the real world?
2) What causes it?

--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
From: Andrew T. <hab...@us...> - 2002-07-16 18:17:02
I admit I do not know the details of this test, but could the doubling
difference be an effect of both the sender and receiver process running
on the same CPU vs different CPUs?  Could you set maxcpus=1 and see if
all the results are in the 300 MB range?

-Andrew

> mjc1 also had HZ=1000, which may have helped the latency numbers.
> The ~doubled tcp bandwidth in mjc would be nice to track down.
> Looking at the individual runs, the min and max between mjc1 and
> ac2 are close, but mjc1 got a lot more "good" runs.  I ran mjc1
> a second time with preempt, and it had a similar number of "good"
> runs.  ac tcp bandwidth is pretty consistent through the several
> releases I've tested.
>
> 25 tcp_bw runs on 2.4.19-pre10-mjc1 (sorted)
> 318.35 MB/sec
> 315.23 MB/sec
> 314.11 MB/sec
> 312.78 MB/sec
> 312.64 MB/sec
> 310.19 MB/sec
> 307.38 MB/sec
> 306.02 MB/sec
> 297.75 MB/sec
> 293.60 MB/sec
> 292.40 MB/sec
> 292.20 MB/sec
> 289.76 MB/sec
> 288.82 MB/sec
> 283.44 MB/sec
> 279.03 MB/sec
> 278.45 MB/sec
> 271.68 MB/sec
> 154.95 MB/sec
> 154.81 MB/sec
> 154.60 MB/sec
> 154.28 MB/sec
> 153.50 MB/sec
> 151.04 MB/sec
> 148.61 MB/sec
>
> 25 tcp_bw runs on 2.4.19-pre10-ac2 (sorted)
> 308.27 MB/sec
> 289.02 MB/sec
> 276.38 MB/sec
> 260.48 MB/sec
> 259.65 MB/sec
> 246.61 MB/sec
> 223.86 MB/sec
> 221.59 MB/sec
> 221.20 MB/sec
> 156.19 MB/sec
> 155.98 MB/sec
> 155.70 MB/sec
> 155.65 MB/sec
> 155.61 MB/sec
> 155.22 MB/sec
> 154.67 MB/sec
> 154.60 MB/sec
> 154.06 MB/sec
> 153.99 MB/sec
> 153.96 MB/sec
> 153.66 MB/sec
> 153.54 MB/sec
> 153.41 MB/sec
> 151.60 MB/sec
> 151.34 MB/sec
>
> Two things I'd like to know here:
> 1) Does this difference show up in the real world?
> 2) What causes it?
From: <rw...@ea...> - 2002-07-16 22:32:54
>> The mjc1 differences could be compiler related too.  mjc1 used gcc-3.1
>> with -march=pentium3 -Os -mmmx -msse.  Perhaps running mjc1 with gcc-2.96
>> and -march=i686 -O2 is the place to start with it.  Michael J. Cohen said
>> gcc-3.1 does checksumming better than the older gcc's.

> The mjc1 results are impressive.  Where do you change the -march compiler
> flag?  What makefile has this?  Thanks.

It was done by the patchset.  It happens in linux/arch/i386/Makefile.
The specific patch in mjc1 is 40_gcc31-compile-opts.

My current run with gcc-2.96 and -march=i686 should be finishing up
some time tonight.

--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
From: <rw...@ea...> - 2002-07-16 22:42:19
> I admit I do not know the details of this test, but could the doubling
> difference be an effect of both the sender and receiver process running
> on the same CPU vs different CPUs?

That makes sense.

> Could you set maxcpus=1 and see if all the
> results are in the 300 MB range?

Good idea.  I'll try maxcpus=2 as well.

--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
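Besides booting with maxcpus=1, the same-CPU vs cross-CPU question could
be tested by pinning the two processes explicitly.  The sketch below
assumes a kernel and glibc that provide sched_setaffinity(), which stock
2.4.19 does not, so it is only an illustration of the idea rather than
something usable on these kernels as-is.

/*
 * Pin the calling process to a single CPU so the tcp_bw sender and
 * receiver can be forced onto the same or different CPUs.  Assumes a
 * kernel/glibc with sched_setaffinity() support (not in stock 2.4.19);
 * booting with maxcpus=1 is the equivalent test without it.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int pin_to_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {  /* 0 = this process */
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}

The sender would call pin_to_cpu(0) and the receiver either pin_to_cpu(0)
or pin_to_cpu(1) before the transfer; if the ~155 MB/sec runs only appear
in the cross-CPU case, that would support the scheduling explanation.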
From: <rw...@ea...> - 2002-07-17 10:59:47
> The mjc1 differences could be compiler related too.  mjc1 used gcc-3.1
> with -march=pentium3 -Os -mmmx -msse.  Perhaps running mjc1 with gcc-2.96
> and -march=i686 -O2 is the place to start with it.

The 2.4.19-pre10-mjc1 run with gcc-2.96 is complete.  I didn't notice
any metric varying by more than a few percentage points between
gcc-2.96 and gcc-3.1.  For the most part, the gcc-3.1 numbers were a
little better, which is encouraging for the long term.

The results are incorporated here:
http://home.earthlink.net/~rwhron/kernel/bigbox.html

--
Randy Hron
From: Andrew T. <hab...@us...> - 2002-07-13 20:09:52
> > 6. Randy Hron - Trying to determine which patch causes an
> > improvement in pipe throughput.  Believe it is due to
> > irq balancing.
>
> The 2.4.19-pre10-jam2 patches are identified:
>
> irqrate lowers pipe latency measured by LMbench on SMP.
> irqbalance lowers AF_UNIX latency.
>
> > However on a networking benchmark Andrew
> > Theurer saw a 10% decrease with the irq balancing patch.
> > Randy will try it on different workloads if he determines
> > it is the patch that causes the improvement.
>
> I'm running my usual set of tests on 2.4.19pre10aa2 +
> irqrate and irqbalance.  (pre10aa2 was the basis for jam2)

I have recently been experimenting with 2.4.19-rc1-aa2, and it is my
understanding that Andrea has his own IRQbalance in this kernel (I am
not sure exactly when/what version it appeared).  A quick monitoring
of IRQs shows that it does not move the destination CPU around as
much, but I have not looked at the patch yet, so this is just guess
work at best.  At some point I will try netbench with that version
with IRQbalance on/off.  Any chance you could try it with your
benchmarks?

Anyway, I still don't understand how Ingo's IRQbalance is making a
difference.

> netperf is very configurable and is likely to be a valuable
> addition to the things I run.  It would be helpful to know more
> about the testing configuration and method that had a netperf
> regression.
>
> I.E.
> how many processors in box?

4 or 8

> how many samples?
> networked or loopback?

network

Mala Anand (of IBM LTC) can help you with details.

-Andrew Theurer