From: Martin J. B. <mb...@ar...> - 2003-02-03 23:13:16
|
People keep extolling the virtues of gcc 3.2 to me, which I'm reluctant to
switch to, since it compiles so much slower. But it supposedly generates
better code, so I thought I'd compile the kernel with both and compare the
results. This is gcc 2.95 and 3.2.1 from debian unstable on a 16-way NUMA-Q.
The kernbench tests still use 2.95 for the compile-time stuff.

The results below leave me distinctly unconvinced by the supposed merits of
modern gcc's. Not really better or worse, within experimental error. But much
slower to compile things with.

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                        Elapsed        User      System         CPU
              2.5.59      46.08      563.88      118.38     1480.00
       2.5.59-gcc3.2      45.86      563.63      119.58     1489.33

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                        Elapsed        User      System         CPU
              2.5.59      47.45      568.02      143.17     1498.17
       2.5.59-gcc3.2      47.15      567.41      143.72     1507.50

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered trademarks
of the Standard Performance Evaluation Corporation. This benchmarking was
performed for research purposes only, and the run results are non-compliant
and not-comparable with any published results. Results are shown as
percentages of the first set displayed.

SDET 1  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        0.8%
       2.5.59-gcc3.2      95.3%        5.2%

SDET 2  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        0.6%
       2.5.59-gcc3.2      91.9%        7.1%

SDET 4  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        5.7%
       2.5.59-gcc3.2      98.8%        5.3%

SDET 8  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        1.4%
       2.5.59-gcc3.2     105.3%        4.7%

SDET 16  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        1.7%
       2.5.59-gcc3.2     103.1%        1.8%

SDET 32  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        1.5%
       2.5.59-gcc3.2     101.0%        1.6%

SDET 64  (see disclaimer)
                     Throughput    Std. Dev
              2.5.59     100.0%        0.7%
       2.5.59-gcc3.2     103.1%        1.1%

SDET 128  (see disclaimer)
                     Throughput    Std. Dev

NUMA schedbench 4:
                     AvgUser    Elapsed  TotalUser   TotalSys
              2.5.59    0.00      38.88      82.78       0.65
       2.5.59-gcc3.2    0.00      41.80     107.76       0.73

NUMA schedbench 8:
                     AvgUser    Elapsed  TotalUser   TotalSys
              2.5.59    0.00      49.30     247.80       1.93
       2.5.59-gcc3.2    0.00      38.00     229.83       2.11

NUMA schedbench 16:
                     AvgUser    Elapsed  TotalUser   TotalSys
              2.5.59    0.00      57.37     843.12       3.77
       2.5.59-gcc3.2    0.00      57.28     839.21       2.85

NUMA schedbench 32:
                     AvgUser    Elapsed  TotalUser   TotalSys
              2.5.59    0.00     116.99    1805.79       6.05
       2.5.59-gcc3.2    0.00     118.44    1788.09       6.25

NUMA schedbench 64:
                     AvgUser    Elapsed  TotalUser   TotalSys
              2.5.59    0.00     235.18    3632.73      15.45
       2.5.59-gcc3.2    0.00     234.55    3633.76      15.02

------------------------------------------------------------------------------

And with the same kernel, comparing the compile times for gcc 2.95 to 3.2:

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                        Elapsed        User      System         CPU
             gcc2.95      46.08      563.88      118.38     1480.00
             gcc3.21      69.93      923.17      114.36     1483.17

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                        Elapsed        User      System         CPU
             gcc2.95      47.45      568.02      143.17     1498.17
             gcc3.21      71.44      926.45      134.89     1485.33

pft.
From: Andi K. <ak...@su...> - 2003-02-03 23:23:06
|
On Mon, Feb 03, 2003 at 03:05:06PM -0800, Martin J. Bligh wrote:
> The results below leaves me distinctly unconvinced by the supposed
> merits of modern gcc's. Not really better or worse, within experimental
> error. But much slower to compile things with.

Curious - could you compare it with a gcc 3.3 snapshot too? It should be even
slower at compiling, but generate better code.

-Andi
From: Richard B. J. <ro...@ch...> - 2003-02-03 23:28:41
|
On Mon, 3 Feb 2003, Martin J. Bligh wrote:
> People keep extolling the virtues of gcc 3.2 to me, which I'm
> reluctant to switch to, since it compiles so much slower. But
> it supposedly generates better code, so I thought I'd compile
> the kernel with both and compare the results. This is gcc 2.95
> and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> tests still use 2.95 for the compile-time stuff.
>
> [SNIPPED tests...]

Don't let this get out, but egcs-2.91.66 compiled FFT code works at
about 50 percent of the speed of whatever M$ uses for Visual C++
Version 6.0. I was awfully disheartened when I found that identical
code executed twice as fast on M$ as it does on Linux. I tried to
isolate what was causing the difference. So I replaced 'hypot()' with
some 'C' code that does sqrt(x^2 + y^2) just to see if it was the 'C'
library. It didn't help. When I find out what type (section) of code
is running slower, I'll report. In the meantime, it's fast enough,
but I don't like being beat by M$.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
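For reference, a minimal sketch of the substitution Dick describes - a
plain-C magnitude computation in place of the libm hypot() call. The
function and variable names below are made up for illustration; the
original FFT source is not shown in the post.

#include <math.h>

/* libm version */
static double mag_libm(double re, double im)
{
    return hypot(re, im);
}

/* hand-rolled version: sqrt(x^2 + y^2); usually faster, but without
 * hypot()'s protection against intermediate overflow/underflow */
static double mag_plain(double re, double im)
{
    return sqrt(re * re + im * im);
}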
From: J.A. M. <jam...@ab...> - 2003-02-04 00:43:29
|
On 2003.02.04 Richard B. Johnson wrote:
> On Mon, 3 Feb 2003, Martin J. Bligh wrote:
>
> > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > reluctant to switch to, since it compiles so much slower. But
> > it supposedly generates better code, so I thought I'd compile
> > the kernel with both and compare the results. This is gcc 2.95
> > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > tests still use 2.95 for the compile-time stuff.
> >
> > [SNIPPED tests...]
>
> Don't let this get out, but egcs-2.91.66 compiled FFT code
> works about 50 percent of the speed of whatever M$ uses for
> Visual C++ Version 6.0 I was awfully disheartened when I
> found that identical code executed twice as fast on M$ than
> it does on Linux. I tried to isolate what was causing the
> difference. So I replaced 'hypot()' with some 'C' code that
> does sqrt(x^2 + y^2) just to see if it was the 'C' library.
> It didn't help. When I find out what type (section) of code
> is running slower, I'll report. In the meantime, it's fast
> enough, but I don't like being beat by M$.
>

I face a similar problem. As everybody says that SSE is so marvelous,
we are trying to put some SSE code in our render engine, to speed up
this. But look at the results of the code below (box is a P4@1.8,
Xeon with ht):

annwn:~/sse> ss-g
Proc std:           5020 kticks
Proc std inline:    4320 kticks
Proc sse:           4290 kticks
Proc sse inline:    3890 kticks

So what? Just around 500 kticks for updating to SSE? As the Computer
Architecture people at the school say, it is something called 'spill
code' (did I write it ok?). In short, too much SSE but too few
registers, so Intel ia32 turns into crap when you need some indexes,
run out of registers and have to copy to and from the stack.

#include <stdlib.h>
#include <time.h>
#include <stdio.h>

#if defined(__INTEL_COMPILER)
#include <xmmintrin.h>
#endif

#define LOOPS 1000
#define SZ    100000

#if defined(__GNUC__) && defined(__SSE__)
typedef void __ve_reg __attribute__((__mode__(V4SF)));
#endif

typedef struct point point;
struct point {
    float v[4];
};

void mulp_std(const point* a, const point* b, point* r)
{
    int i;
    for (i = 0; i < 4; i++)
        r->v[i] = a->v[i] * b->v[i];
}

inline void mulpi_std(const point* a, const point* b, point* r)
{
    int i;
    for (i = 0; i < 4; i++)
        r->v[i] = a->v[i] * b->v[i];
}

void mulp_sse(const point* a, const point* b, point* r)
{
#if defined(__GNUC__) && defined(__SSE__)
    __ve_reg xmm0, xmm1, xmm2;
    xmm0 = __builtin_ia32_loadups((float*)a->v);
    xmm1 = __builtin_ia32_loadups((float*)b->v);
    xmm2 = __builtin_ia32_mulps(xmm0, xmm1);
    __builtin_ia32_storeups(r->v, xmm2);
#endif
#if defined(__INTEL_COMPILER)
    __m128 xmm0, xmm1, xmm2;
    xmm0 = _mm_loadu_ps((float*)a->v);
    xmm1 = _mm_loadu_ps((float*)b->v);
    xmm2 = _mm_mul_ps(xmm0, xmm1);
    _mm_storeu_ps(r->v, xmm2);
#endif
}

inline void mulpi_sse(const point* a, const point* b, point* r)
{
#if defined(__GNUC__) && defined(__SSE__)
    __ve_reg xmm0, xmm1, xmm2;
    xmm0 = __builtin_ia32_loadups((float*)a->v);
    xmm1 = __builtin_ia32_loadups((float*)b->v);
    xmm2 = __builtin_ia32_mulps(xmm0, xmm1);
    __builtin_ia32_storeups(r->v, xmm2);
#endif
#if defined(__INTEL_COMPILER)
    __m128 xmm0, xmm1, xmm2;
    xmm0 = _mm_loadu_ps((float*)a->v);
    xmm1 = _mm_loadu_ps((float*)b->v);
    xmm2 = _mm_mul_ps(xmm0, xmm1);
    _mm_storeu_ps(r->v, xmm2);
#endif
}

int main(int argc, char** argv)
{
    point *a;
    point *b;
    point *c;
    int i, j;
    unsigned long t0, t1;

    a = malloc(SZ * sizeof(point));
    b = malloc(SZ * sizeof(point));
    c = malloc(SZ * sizeof(point));

    printf("Proc std:\n");
    t0 = clock();
    for (i = 0; i < LOOPS; i++) {
        for (j = 0; j < SZ; j++)
            mulp_std(&a[j], &b[j], &c[j]);
        for (j = 0; j < SZ; j++)
            mulp_std(&b[j], &b[j], &a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n", (t1 - t0) / 1000);

    printf("Proc std inline:\n");
    t0 = clock();
    for (i = 0; i < LOOPS; i++) {
        for (j = 0; j < SZ; j++)
            mulpi_std(&a[j], &b[j], &c[j]);
        for (j = 0; j < SZ; j++)
            mulpi_std(&b[j], &b[j], &a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n", (t1 - t0) / 1000);

    printf("Proc sse:\n");
    t0 = clock();
    for (i = 0; i < LOOPS; i++) {
        for (j = 0; j < SZ; j++)
            mulp_sse(&a[j], &b[j], &c[j]);
        for (j = 0; j < SZ; j++)
            mulp_sse(&b[j], &b[j], &a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n", (t1 - t0) / 1000);

    printf("Proc sse inline:\n");
    t0 = clock();
    for (i = 0; i < LOOPS; i++) {
        for (j = 0; j < SZ; j++)
            mulpi_sse(&a[j], &b[j], &c[j]);
        for (j = 0; j < SZ; j++)
            mulpi_sse(&b[j], &b[j], &a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n", (t1 - t0) / 1000);

    free(c);
    free(b);
    free(a);
    return 0;
}

--
J.A. Magallon <jam...@ab...>        \   Software is like sex:
werewolf.able.es                        \     It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.21-pre4-jam1 (gcc 3.2.1 (Mandrake Linux 9.1 3.2.1-5mdk))
From: Richard B. J. <ro...@ch...> - 2003-02-04 13:41:00
|
On Tue, 4 Feb 2003, J.A. Magallon wrote:
>
> On 2003.02.04 Richard B. Johnson wrote:
> > On Mon, 3 Feb 2003, Martin J. Bligh wrote:
> >
> > > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > > reluctant to switch to, since it compiles so much slower. But
> > > it supposedly generates better code, so I thought I'd compile
> > > the kernel with both and compare the results. This is gcc 2.95
> > > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > > tests still use 2.95 for the compile-time stuff.
> > >
> > > [SNIPPED tests...]
> >
> > Don't let this get out, but egcs-2.91.66 compiled FFT code
> > works about 50 percent of the speed of whatever M$ uses for
> > Visual C++ Version 6.0 I was awfully disheartened when I
> > found that identical code executed twice as fast on M$ than
> > it does on Linux. I tried to isolate what was causing the
> > difference. So I replaced 'hypot()' with some 'C' code that
> > does sqrt(x^2 + y^2) just to see if it was the 'C' library.
> > It didn't help. When I find out what type (section) of code
> > is running slower, I'll report. In the meantime, it's fast
> > enough, but I don't like being beat by M$.
> >
> I face a simliar problem. As everybody says that SSE is so marvelous,
> we are trying to put some SSE code in our render engine, to speed up this.
> But look at the results of the code below (box is a P4@1.8, Xeon with ht):

[SNIPPED good demo code]

I'm going to answer all the comments on this topic with just one
observation. Sorry that I don't have the time to answer all who
responded personally, but I have to take a "work break" today and
tomorrow (design review).

gcc is a marvelous compiler because it was designed to be readily
ported to different architectures. However, it is not an optimum
compiler for ix86 machines and probably is not optimum for any one
kind of machine.

I often hear complaints about the ix86 processors as being "register
starved", etc. This could not be further from fact. There are enough
registers. However, various registers were designed to do various
things. Once you decide that you know more than the processor
developers, and start using registers for things they were not
designed for, you start to have excellent test benchmarks, but awful
overall performance.

For example, the ECX register was designed to be used as a counter.
It can be told to decrement and perform a conditional jump with the
'loop' instruction. The loop instruction comes in various flavors,
also, like loopz, loopnz. Somebody decided that 'dec ecx; jnz' was
faster. They measured this to "prove" that it's faster. In the
meantime, other code suffers (stumbles) because there was really no
spare time to be grabbed. Data needs to be fetched to and from
memory. The instruction unit ends up being starved while data are
acquired. This would not normally hurt anything because the RAM
bandwidth ends up being the dominant pole in the transfer function,
but you end up with something I call the "accordion problem". I will
first demonstrate the accordion problem and then explain where it
comes from.

Note a smooth flow of traffic on a highway. All the cars are traveling
at the same speed. Their speed increases until they don't dare go any
faster. They are now "bandwidth limited". Somebody sees a traffic cop.
Somebody slows down, it takes a few hundred milliseconds for the next
car to slow down, and this transient moves backwards through the line
of cars until cars several miles back actually have to perform
emergency braking to stay off the bumper ahead. Then, the cars start
accelerating again. This acceleration/deceleration ripple moves
through the line of cars like the bellows of an accordion. The average
speed of the line of traffic is now reduced even though there are
oscillatory accelerations above the speed-limit.

Now, visualize a CPU and RAM combination running in lock-step. The
speed of the execution unit is matched to the speed of the processor
I/O so the instructions are fetched and executed in a more-or-less
synchronized manner. This is like the high-speed line of cars before
somebody sees the traffic cop. Now, perturb this execution by throwing
in some faster-than-normal program sequences. You may start the
accordion effect. The problem is that both instructions and data come
through the same hole-in-the-wall, regardless of caching. When the
prefetch unit needs more data (instructions) it must contend with the
data I/O. This may cause an oscillatory condition, actually reducing
throughput.

Anybody who uses CPUs in laboratories with sensitive receiving
equipment knows that, regardless of the FCC rules, these machines
generate great gobs of radio frequency interference. That's why they
need to be in shielded boxes. If you want to "hear" the stumble I'm
talking about, just listen to the AM audio output using a
field-intensity meter. When you have a fast, smoothly-running machine,
the interference sounds like noise. When you have the accordion
effect, the interference has a repetitive pattern to it, a tone,
usually low-frequency. If you capture enough data in a logic analyzer,
you will see the pattern and can see actual pauses in bus I/O where
the CPU just isn't doing a damn thing at all!

FYI, there is a difference in the power supply current required to
write 0xffffffff to RAM versus 0x00000000 (honest!). If you are doing
a memory-test, writing such a pattern that the load on the power
supply changes at a rate that will disturb the power supply
servo-loop, you can make the voltage bounce! This has nothing to do
with slow CPU execution speed, but just demonstrates that there are a
lot of interactions that should be considered when designing or
proving-out a system. It's not just a local bench-mark that counts.

The Intel Compiler(s) I have used generate code that uses the
registers just like Intel specified. It uses EBX, ESI, EDI as index
registers just like the 16-bit BX, SI, DI. I have never seen code
output from an Intel 'C' compiler that uses EAX as an index register,
even though it's available and "faster". They seem to stick with the
"un-optimized" string instructions like rep movsb, repnz cmpsb, etc.,
and they use 'loop'. Maybe, just maybe, Intel knows something about
their processor that shouldn't be second-guessed by clever
programmers.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
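To make the 'dec ecx; jnz' versus 'loop' comparison above concrete, here is a
minimal sketch of exactly the kind of isolated micro-benchmark being
criticized: it times nothing but the empty loop itself, the sort of local
measurement Dick argues can mislead. The iteration count and the use of
clock() are arbitrary choices for illustration; GCC inline assembly, x86 only.

#include <stdio.h>
#include <time.h>

#define ITERS 200000000UL

static unsigned long spin_loop(unsigned long n)
{
    /* counts down in ECX with the 'loop' instruction, as the
     * architecture manual intends */
    __asm__ __volatile__(
        "1: loop 1b"
        : "+c" (n)
        :
        : "cc");
    return n;
}

static unsigned long spin_decjnz(unsigned long n)
{
    /* the hand-"optimized" variant: explicit dec + jnz */
    __asm__ __volatile__(
        "1: dec %0\n\t"
        "jnz 1b"
        : "+r" (n)
        :
        : "cc");
    return n;
}

int main(void)
{
    clock_t t0, t1;

    t0 = clock();
    spin_loop(ITERS);
    t1 = clock();
    printf("loop:    %ld ticks\n", (long)(t1 - t0));

    t0 = clock();
    spin_decjnz(ITERS);
    t1 = clock();
    printf("dec/jnz: %ld ticks\n", (long)(t1 - t0));
    return 0;
}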
From: John B. <jo...@gr...> - 2003-02-04 14:19:20
|
There is some discussion about compiler optimisations in this Linux Journal
article:

http://www.linuxjournal.com/article.php?sid=4885

John.
From: Momchil V. <ve...@fa...> - 2003-02-10 14:18:20
|
J.A. Magallon (jam...@ab...) writes:

>I face a simliar problem. As everybody says that SSE is so marvelous,
>we are trying to put some SSE code in our render engine, to speed up
>this. But look at the results of the code below (box is a P4@1.8,
>Xeon with ht):

[summary: SSE rocks :]

Here are my results on a PIII Xeon, 1.26 GHz, 512K cache:

$ gcc --version
gcc (GCC) 3.2.1 20021104 (prerelease)
$ gcc -std=c99 -march=pentium3 -O2 -fomit-frame-pointer -funroll-loops -falign-loops kick.c
$ ./a.out
Proc std:         12990 kticks
Proc std inline:  12770 kticks
Proc sse:         12830 kticks
Proc sse inline:  12620 kticks

IOW, close to nothing, ~1.2% SSE speedup. Even the inlining speedup is
larger - ~1.6%. OTOH, look at ``mulp_sse'':

        .globl mulp_sse
        .type   mulp_sse,@function
mulp_sse:
        movl    4(%esp), %eax
        movups  (%eax), %xmm1
        movl    8(%esp), %eax
        movups  (%eax), %xmm0
        movl    12(%esp), %eax
        mulps   %xmm0, %xmm1
        movups  %xmm1, (%eax)
        ret

I seriously doubt any compiler would compile it much better :)

With the Intel compiler:

$ icc -V
Intel(R) C++ Compiler for 32-bit applications, Version 7.0 Build 20021021Z
$ icc -restrict -tpp6 -ip kick.c
$ ./a.out
Proc std:         12950 kticks
Proc std inline:  12840 kticks
Proc sse:         12850 kticks
Proc sse inline:  12660 kticks

Similar lack of difference, <2% speedup.

So, if the hypothesis is that the code generator did its best, we have
to search for the bottleneck somewhere else. Of course, the usual
suspect is the cache usage. Let's see if we can optimize the cache
usage of the benchmark by decreasing the size and increasing the loops.

old:
#define LOOPS 1000
#define SZ    100000

new:
#define LOOPS 10000
#define SZ    10000

$ gcc -std=c99 -march=pentium3 -O2 -fomit-frame-pointer -funroll-loops kick.c
$ ./a.out
Proc std:          3350 kticks
Proc std inline:   2030 kticks
Proc sse:          2950 kticks
Proc sse inline:   1270 kticks

$ icc -restrict -tpp6 -ip kick.c
$ ./a.out
Proc std:          2650 kticks
Proc std inline:   1630 kticks
Proc sse:          2660 kticks
Proc sse inline:   1430 kticks

First, we see between 400% and 1000% speedup for each test, which
confirms our hypothesis that the working set / cache footprint is the
bottleneck.

Having eliminated to some extent the memory bottleneck, let's see the
effect of SSE and inlining.

SSE speedup is between 14% and 60% (3350/2950 and 2030/1270) for GCC
and between -1% and 14% (2650/2660 and 1630/1430) for ICC.

Inlining speedup is between 65% and 132% (3350/2030 and 2950/1270) for
GCC and between 63% and 86% (2650/1630 and 2660/1430) for ICC.

Obviously, a large share of the running time is occupied by the
function calls and related bookkeeping, which is evident from the
large inlining speedup. Therefore, if we are interested in SSE
(dis)advantages, we have to compare inlined x87 and SSE versions,
which are the above 60% (GCC) and 14% (ICC) numbers.

~velco

PS. FWIW, I have had cases where the same executable performed some
100% faster with SSE on PIII and gave almost no speedup on PIV.
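A back-of-the-envelope check of the working-set argument above, using the
array sizes from the posted benchmark and the 512K L2 figure; the little
program itself is only illustrative.

#include <stdio.h>

struct point { float v[4]; };            /* 16 bytes, as in the benchmark */

int main(void)
{
    unsigned long old_sz = 100000;       /* original SZ */
    unsigned long new_sz = 10000;        /* reduced SZ */
    unsigned long cache  = 512 * 1024;   /* PIII Xeon L2 */

    /* three point arrays (a, b, c) make up the working set */
    printf("old working set: %lu KB\n",
           (unsigned long)(3 * old_sz * sizeof(struct point) / 1024));
    printf("new working set: %lu KB\n",
           (unsigned long)(3 * new_sz * sizeof(struct point) / 1024));
    printf("L2 cache:        %lu KB\n", cache / 1024);
    return 0;
}

The old working set (~4.6 MB) blows out the 512 KB cache many times over,
while the reduced one (~469 KB) just fits, which matches the 4x-10x speedup
seen above.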
From: Denis V. <vd...@po...> - 2003-02-04 07:05:58
|
On 4 February 2003 01:31, Richard B. Johnson wrote:
> On Mon, 3 Feb 2003, Martin J. Bligh wrote:
> > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > reluctant to switch to, since it compiles so much slower. But
> > it supposedly generates better code, so I thought I'd compile
> > the kernel with both and compare the results. This is gcc 2.95
> > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > tests still use 2.95 for the compile-time stuff.
> > [SNIPPED tests...]

What was the size of the uncompressed kernel binaries? This is a simple
(and somewhat inaccurate) measure of compiler improvement ;)

> Don't let this get out, but egcs-2.91.66 compiled FFT code
> works about 50 percent of the speed of whatever M$ uses for
> Visual C++ Version 6.0 I was awfully disheartened when I

Yes. M$ (and some other compilers) beat GCC badly.

> found that identical code executed twice as fast on M$ than
> it does on Linux. I tried to isolate what was causing the
> difference. So I replaced 'hypot()' with some 'C' code that
> does sqrt(x^2 + y^2) just to see if it was the 'C' library.
> It didn't help. When I find out what type (section) of code
> is running slower, I'll report. In the meantime, it's fast
> enough, but I don't like being beat by M$.

I'm afraid it's the code generation engine. It is just worse than M$'s
or Intel's. It is not easily fixable; the GCC folks have a tremendous
task at hand.

I wonder whether some big companies supposedly supporting Linux
(e.g. Intel) can help the GCC team (for example by giving away some
code and/or developer time).
--
vda
From: Martin J. B. <mb...@ar...> - 2003-02-04 07:13:44
|
> I'm afraid it's code generation engine. It is just worse than
> M$ or Intel's one. It is not easily fixable,
> GCC folks have tremendous task at hand.
>
> I wonder whether some big companies supposedly supporting
> Linux (e.g. Intel) can help GCC team (for example by giving
> away some code and/or developer time).

Comparing Intel's compiler vs GCC on Linux would be more interesting.
Anyone got a copy and some time to burn?

M.
From: Adrian B. <bu...@fs...> - 2003-02-04 12:28:46
|
On Mon, Feb 03, 2003 at 11:13:31PM -0800, Martin J. Bligh wrote:
> > I'm afraid it's code generation engine. It is just worse than
> > M$ or Intel's one. It is not easily fixable,
> > GCC folks have tremendous task at hand.
> >
> > I wonder whether some big companies supposedly supporting
> > Linux (e.g. Intel) can help GCC team (for example by giving
> > away some code and/or developer time).
>
> Comparing Intel's compiler vs GCC on Linux would be more interesting.
> Anyone got a copy and some time to burn?

There are already people who have done this, e.g.

http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html

compares g++ and Intel's C++ compiler with C++ code.

> M.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness.
 There had been need of rain for many days.
"Only a promise," Lao Er said.
                                  Pearl S. Buck - Dragon Seed
From: Martin J. B. <mb...@ar...> - 2003-02-04 15:55:15
|
>> Comparing Intel's compiler vs GCC on Linux would be more interesting.
>> Anyone got a copy and some time to burn?
>
> There are already people who have done this, e.g.
>
> http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html
>
> compares g++ and Intel's C++ compiler with C++ code.

C would be infinitely more interesting ;-)

M.
From: Martin J. B. <mb...@ar...> - 2003-02-04 16:30:46
|
>>> Comparing Intel's compiler vs GCC on Linux would be more interesting.
>>> Anyone got a copy and some time to burn?
>>
>> There are already people who have done this, e.g.
>>
>> http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html
>>
>> compares g++ and Intel's C++ compiler with C++ code.
>
> C would be infinitely more interesting ;-)

Speaking of which, has anyone ever compiled the ia32 Linux kernel with
the Intel compiler? I thought I saw some patches floating around to
make it compile the ia64 kernel .... that'd be an interesting test
case ... might give us some ideas about what could be tweaked in GCC
(or code rejiggled in the kernel).

M.
From: Timothy D. W. <wo...@os...> - 2003-02-04 19:13:14
|
On Mon, 2003-02-03 at 22:54, Denis Vlasenko wrote:
snip
>
> I'm afraid it's code generation engine. It is just worse than
> M$ or Intel's one. It is not easily fixable,
> GCC folks have tremendous task at hand.
>
> I wonder whether some big companies supposedly supporting
> Linux (e.g. Intel) can help GCC team (for example by giving
> away some code and/or developer time).
> --

I'm hesitant to enter into this. But from my own experience the issue
with big companies supporting these sorts of changes in gcc has more
to do with the acceptance process of changes into gcc than a lack of
desire on the large companies' part.

Tim

> vda
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to maj...@vg...
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Timothy D. Witham - Lab Director - wo...@os...
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office)    (503)-702-2871 (cell)    (503)-626-2436 (fax)
From: John B. <jo...@gr...> - 2003-02-04 19:34:47
|
> I'm hesitant to enter into this. But from my own experience
> the issue with big companies supporting these sort of changes
> in gcc have more to do with the acceptance process of changes
> into gcc than a lack of desire on the large companies part.

Maybe we should create a KGCC fork, optimise it for kernel
compilations, then try to get our changes merged back in to GCC
mainline at a later date.

John.
From: Dave J. <da...@co...> - 2003-02-04 19:49:29
|
On Tue, Feb 04, 2003 at 07:35:06PM +0000, John Bradford wrote:
> Maybe we should create a KGCC fork, optimise it for kernel
> complilations, then try to get our changes merged back in to GCC
> mainline at a later date.

What exactly do you mean by "optimise for kernel compilations" ?

Dave

--
| Dave Jones.  http://www.codemonkey.org.uk
| SuSE Labs
From: John B. <jo...@gr...> - 2003-02-04 20:11:28
|
> > Maybe we should create a KGCC fork, optimise it for kernel
> > complilations, then try to get our changes merged back in to GCC
> > mainline at a later date.
>
> What exactly do you mean by "optimise for kernel compilations" ?

I don't, that was a bad way of phrasing it - I didn't mean fork GCC
just to create one which compiles the kernel so it runs faster, as the
expense of other code.

What I was thinking was that if we forked GCC, we could try out all of
these ideas that have been floating around in this thread, and if, as
was hinted at earlier in this thread, $bigcompanies[] have not offered
contributions because of reluctance to accept them by the GCC team, we
would be more in a position to try them out, because we only need to
concern ourselves with breaking the compilation of the kernel, not
every single program that currently compiles with GCC.

The way I see it, the development series would be optimised for KGCC,
and when we start to think about stabilising that development series,
we try to get our KGCC changes merged back in to GCC mainline. If
they are not accepted, either KGCC becomes the recommended kernel
compiler, which should cause no great difficulties, (having one
compiler for kernels, and one for userland applications), or we start
making sure that we haven't broken compilation with GCC, (and since
there would probably always be people compiling with GCC anyway, even
if there was a KGCC, we would effectively always know if we broke
compilation with GCC), and then the recommended compiler is just not
the optimal one, and it would be up to the various distributions to
decide which one they are going to use.

John.
From: John B. <jo...@gr...> - 2003-02-04 20:23:21
|
Sorry, that last post didn't make sense, please apply this diff:

- just to create one which compiles the kernel so it runs faster, as the
+ just to create one which compiles the kernel so it runs faster, at the
  expense of other code.

John.
From: Herman O. <Herman@WirelessNetworksInc.com> - 2003-02-04 20:41:49
|
Hi there,

From my experience, the speed issue is caused by misaligned memory
accesses, causing inefficient SDRAM to Cache movement of data and
instructions.

I don't think that you necessarily need a modification to the compiler.
What you can do is carefully place the ALIGN switch in a few critical
places in the kernel code, to ensure that the code and data will be
properly aligned for whatever processor it is compiled for, be that a
Pentium, an ARM, a MIPS or whatever.

It would be nice if GCC can be suitably improved to do this correctly
for all architectures, but a little bit of human help can do wonders,
without having to fork the GCC project.

Cheers,

--
------------------------------------------------------------------------
Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: Herman@WirelessNetworksInc.com
Phone: 1.403.569-5687, Fax: 1.403.235-3965
------------------------------------------------------------------------

John Bradford wrote:
>> > Maybe we should create a KGCC fork, optimise it for kernel
>> > complilations, then try to get our changes merged back in to GCC
>> > mainline at a later date.
>>
>>What exactly do you mean by "optimise for kernel compilations" ?
>
> I don't, that was a bad way of phrasing it - I didn't mean fork GCC
> just to create one which compiles the kernel so it runs faster, as the
> expense of other code.
>
> What I was thinking was that if we forked GCC, we could try out all of
> these ideas that have been floating around in this thread, and if, as
> was hinted at earlier in this thread, $bigcompanies[] have not offered
> contributions because of reluctance to accept them by the GCC team, we
> would be more in a position to try them out, because we only need to
> concern ourselves with breaking the compilation of the kernel, not
> every single program that currently compiles with GCC.
>
> The way I see it, the development series would be optimised for KGCC,
> and when we start to think about stabilising that development series,
> we try to get our KGCC changes merged back in to GCC mainline. If
> they are not accepted, either KGCC becomes the recommended kernel
> compiler, which should cause no great difficulties, (having one
> compiler for kernels, and one for userland applications), or we start
> making sure that we haven't broken compilation with GCC, (and since a
> there would probably always be people compiling with GCC anyway, even
> if there was a KGCC, we would effectively always know if we broke
> compilation with GCC), and then the recommended compiler is just not
> the optimal one, and it would be up to the various distributions to
> decide which one they are going to use.
>
> John.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to maj...@vg...
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
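For illustration, a minimal sketch of the manual-alignment idea Herman
describes, assuming GCC's aligned attribute; the structure, the buffer, and
the 32-byte figure are hypothetical examples, not taken from any kernel
source.

/* hypothetical hot data, kept on its own (assumed 32-byte) cache line */
struct hot_counters {
    unsigned long rx_packets;
    unsigned long tx_packets;
} __attribute__((__aligned__(32)));

/* hypothetical buffer with an explicit alignment requirement,
 * e.g. for DMA or cache-line-sensitive access patterns */
static unsigned char ring_buffer[4096] __attribute__((__aligned__(32)));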
From: Timothy D. W. <wo...@os...> - 2003-02-04 21:48:54
|
On Tue, 2003-02-04 at 12:45, Herman Oosthuysen wrote:
> Hi there,
>
> From my experience, the speed issue is caused by misalligned memory
> accesses, causing inefficient SDRAM to Cache movement of data and
> instructions.
>
> I don't think that you necessarily need a modification to the compiler.
> What you can do is carefully place the ALLIGN switch in a few critical
> places in the kernel code, to ensure that the code and data will be
> properly alligned for whatever processor it is compiled for, be that a
> Pentium, an ARM, a MIPS or whatever.
>

I guess I would like the compiler to do that without having to go in
and futz the code.

> It would be nice if GCC can be suitably improved to do this correcly for
> all architectures, but a little bit of human help can do wonders,
> without having to fork the GCC project.
>
> Cheers,

--
Timothy D. Witham - Lab Director - wo...@os...
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office)    (503)-702-2871 (cell)    (503)-626-2436 (fax)
From: Denis V. <vd...@po...> - 2003-02-05 07:30:41
|
On 4 February 2003 22:45, Herman Oosthuysen wrote:
> Hi there,
>
> From my experience, the speed issue is caused by misalligned memory
> accesses, causing inefficient SDRAM to Cache movement of data and
> instructions.
>
> I don't think that you necessarily need a modification to the
> compiler. What you can do is carefully place the ALLIGN switch in a
> few critical places in the kernel code, to ensure that the code and
> data will be properly alligned for whatever processor it is compiled
> for, be that a Pentium, an ARM, a MIPS or whatever.
>
> It would be nice if GCC can be suitably improved to do this correcly
> for all architectures, but a little bit of human help can do wonders,
> without having to fork the GCC project.

NO. GCC already went this way, i.e. it aligns functions and loops by
ridiculous (IMHO) amounts like 16 bytes. That's 7,5 bytes per alignment
on average. Now count lk functions and loops and mourn for lost icache.
Or just disassemble any .o module and read the damn code.

This is the primary reason why people report larger kernels for GCC 3.x

I am damn sure that if you compile with less sadistic alignment
you will get smaller *and* faster kernel.
--
vda
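A quick worked version of the padding arithmetic above. The 16-byte boundary
and the 7.5-byte average come from the post; the count of functions and loops
in a built kernel is a made-up round number, purely for illustration.

#include <stdio.h>

int main(void)
{
    unsigned align = 16;           /* alignment boundary used by the compiler */
    unsigned long sites = 20000;   /* hypothetical count of aligned functions/loops */

    /* a randomly placed site needs 0..align-1 padding bytes,
     * i.e. (align-1)/2 = 7.5 bytes on average */
    double avg_pad = (align - 1) / 2.0;

    printf("average padding per site: %.1f bytes\n", avg_pad);
    printf("total padding for %lu sites: ~%.0f KB of icache\n",
           sites, sites * avg_pad / 1024.0);
    return 0;
}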
From: Andreas S. <sc...@su...> - 2003-02-05 10:36:51
|
Denis Vlasenko <vd...@po...> writes:

|> I am damn sure that if you compile with less sadistic alignment
|> you will get smaller *and* faster kernel.

So why don't you try it out? GCC offers everything you need for this
experiment.

Andreas.

--
Andreas Schwab, SuSE Labs, sc...@su...
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
From: Denis V. <vd...@po...> - 2003-02-05 11:52:11
|
On 5 February 2003 12:36, Andreas Schwab wrote:
> Denis Vlasenko <vd...@po...> writes:
> |> I am damn sure that if you compile with less sadistic alignment
> |> you will get smaller *and* faster kernel.
>
> So why don't you try it out? GCC offers everything you need for this
> experiment.

I did. Others did it too on occasion.

My argument was against overusing optimization techniques.
You cannot speed up kernel by aligning *everything* to 32 bytes,
or by unrolling all loops, or by aggressive inlining.
That's too easy to work. You get kernel which is bigger
*and* slower.
--
vda
From: Dave J. <da...@co...> - 2003-02-05 12:24:40
|
On Wed, Feb 05, 2003 at 01:41:34PM +0200, Denis Vlasenko wrote:
> > So why don't you try it out? GCC offers everything you need for this
> > experiment.
>
> I did. Others did it too on occasion.

You seem to have forgotten to attach the numbers to your mail.

Dave

--
| Dave Jones.  http://www.codemonkey.org.uk
| SuSE Labs
From: Dipankar S. <dip...@in...> - 2003-02-05 13:07:30
|
On Wed, Feb 05, 2003 at 01:41:34PM +0200, Denis Vlasenko wrote:
> My argument was against overusing optimization techniques.
> You cannot speed up kernel by aligning *everything* to 32 bytes,
> or by unrolling all loops, or by aggressive inlining.
> That's too easy to work. You get kernel which is bigger
> *and* slower.

I am not getting into this debate, just wanted to point out that the
effect of compiler optimization on UNIX kernels has been studied
before. One paper I recall is -

http://www.usenix.org/publications/library/proceedings/sf94/full_papers/partridge.ps

They used profile-guided optimization, so that is a whole other angle
altogether.

Thanks
Dipankar
From: Martin J. B. <mb...@ar...> - 2003-02-05 15:30:16
|
> GCC already went this way, i.e. it aligns functions and loops by
> ridiculous (IMHO) amounts like 16 bytes. That's 7,5 bytes per alignment
> on average. Now count lk functions and loops and mourn for lost icache.
> Or just disassemble any .o module and read the damn code.
>
> This is the primary reason why people report larger kernels for GCC 3.x
>
> I am damn sure that if you compile with less sadistic alignment
> you will get smaller *and* faster kernel.

There's only one real way to know that. Do it, test it.

M.