From: Ray B. <ra...@en...> - 2003-05-21 03:51:44
|
William Lee Irwin III wrote: > > A brief discussion of the issue with Janis Johnson and Zack Weinberg > helped clarify my thoughts on this subject (though none of this is > quoting them so don't blame them for my errors). > > On Tue, May 20, 2003 at 05:58:24PM -0500, Ray Bryant wrote: > > What happens on ia64 is that gcc loads the structure into registers on > > the caller side and the unloads it in the prologue code. We have found > > it more efficient to define a cpumask_t as > > This is the copying semantics of passing by value. The way to avoid it > in the case where the things happen to be large is to pass them by > reference, not necessarily to remove the structure wrapper from the > type (though that certainly implies it will be passed by reference). Yes, that is the issue, I think. Of course, if things explicitly change from pass by value to pass by reference in C, then the resulting type is a pointer type, and is not directly assignable without a dereference; I thought the rationale for putting the array inside the structure was to make it an assignable type in order to minimize the number of small irritating changes in the kernel source that have to be dealt with here. So I don't think the issue is "put the array in a structure" or not, the issues are: (1) Is cpumask_t a directly assignable type (without a dereference)? If it is, this reduces the kernel changes for the multiword bitmasks and that is a good thing. The only way I am aware of to make this an assignable type is to wrapper the bitmask in a structure. (2) If we accept cpumask_t as an assignable type, does the code generated for procedure call and prologue for procedures that accept a cpumask_t have a significant impact on kernel performance? This is a harder question to answer. (3) If we don't like the answer to (2), then we need to make the cpumask_t variables pass by reference. But this is contrary to issue #1, since now we are back in the lots of small kernel changes, all of which are alike, issue. > > On Tue, May 20, 2003 at 05:58:24PM -0500, Ray Bryant wrote: > > typedef unsigned long cpumask_t[CPU_ARRAY_SIZE]; > > Of course, the downside of this is that this type is not an assignable > > type, (c.f. your version, where cpumask_t is an assignable type), so the > > change this implies ricochets through the kernel source -- an admittedly > > unappetizing prospect. Nonetheless, we have found this generates better > > code. > > > On Tue, May 20, 2003 at 05:58:24PM -0500, Ray Bryant wrote: > > Attached are some examples and the resulting .s files, compiled with > > "gcc -O2 -S -c". Look in particular at the number of instrucions > > generated between the procedure entry "test" and the first label of the > > procedure .L24, as well as the number of instructions in the calling > > code for procedure test. Of course, 512 CPU's is a bit more than Linux > > can handle at the moment. But the impact of this change is proprotional > > to the number of CPU's and even at 128 CPU's it is noticeable, I would > > think. (The versions test128.c and test128_1.c are the same code for > > 128 CPU case.) > > They actually weren't identical modulo the structure wrapper. arg2 was > not present in test128_1.c. Some of the prototypes etc. differed too. > And the whole call by value vs. call by reference difference. Of course, but that was the intent of these examples, to compare the code generated by using the call by value case and the assignable cpumask_t type versus the code generated by using a call by reference implementation. The "algorithm" inside of the test function was put there merely to keep the compiler from optimizing away the entire function. > > Here are what I consider true equivalents, shored up, as MIME > attachments. Assembly diff included in-line. I'll take a look at these in the morning; my mind is getting a little foggy now. :-) > > I believe this difference in code generation is acceptable; now the > question in my mind is, is there a way to get pure functions of the > things ambiguous with respect to call by reference and call by value? > > I'm not sure. Call by reference uniformly would be suboptimal for > smaller systems. It would at least require additional hooks to make > things understand what's going on and change behavior with the size > of the bitmaps. > > Would you regard this as a blocking issue with respect to the > implementation I've presented? I'm not sure yet. Looking at these differences in isolation doesn't tell the whole story. Tomorrow, I'm going to look at how many procedures there are (or would be) in the kernel that would have a cpumask_t as an argument. Then depending on whether any of these are on a hot path through the kernel we can make an estimate as to whether or not are performance critical. Then we can make an assessment. (A lot of them are "static inline". I'll have to look at what kind of code is generated in that case as well.) The other issue is how far we (SGI) want to push this. At an 8-word bitmask, the calling sequence and prologue code is pretty ugly, but I am not sure whether (or if) SGI will want to push SSI all the way to 512 CPUs. We are already pretty interested in looking at performance of 128P systems, but who knows if will want to go any further than that. It depends in part what our customers want, I guess. I imagine the two different approaches are equivalent at 128P (i.e., the extra code doensn't matter that much). We just need to look at it all a little bit more to make sure. At any rate, I appreciate your attention to detail on this and the work you have done to get the patch this far. We'll get a solution worked out here shortly, I imagine, or even come to an understanding that your approach is fine for us now. > > -- wli > > ia64: > > --- /tmp/test128_1.s 2003-05-20 19:38:32.000000000 -0700 > +++ /tmp/test128_2.s 2003-05-20 19:38:32.000000000 -0700 > @@ -1,4 +1,4 @@ > - .file "test128_1.c" > + .file "test128_2.c" > .pred.safe_across_calls p1-p5,p16-p63 > .text > .align 16 > @@ -37,18 +37,18 @@ > .save rp, r33 > mov r33 = b0 > .body > - mov r16 = r0 > + mov r15 = r0 > addl r14 = @ltoff(arg1#), gp > ;; > - ld8 r15 = [r14] > + ld8 r16 = [r14] > ;; > .L27: > - shladd r14 = r16, 3, r15 > + shladd r14 = r15, 3, r16 > ;; > - st8 [r14] = r16 > - adds r16 = 1, r16 > + st8 [r14] = r15 > + adds r15 = 1, r15 > ;; > - cmp.geu p6, p7 = 1, r16 > + cmp.geu p6, p7 = 1, r15 > (p6) br.cond.dptk .L27 > addl r35 = @ltoff(arg1#), gp > ;; > @@ -65,31 +65,23 @@ > br.call.sptk.many b0 = printf# > ;; > mov r1 = r32 > - mov r16 = r0 > ;; > - addl r14 = @ltoff(arg2#), gp > + addl r16 = @ltoff(arg2#), gp > ;; > - ld8 r18 = [r14] > + ld8 r16 = [r16] > addl r14 = @ltoff(arg1#), gp > ;; > - ld8 r17 = [r14] > -.L32: > - shladd r14 = r16, 3, r0 > + ld8 r14 = [r14] > ;; > - add r15 = r14, r18 > - add r14 = r14, r17 > + ld8 r15 = [r14], 8 > ;; > + st8 [r16] = r15, 8 > ld8 r14 = [r14] > ;; > - st8 [r15] = r14 > - adds r16 = 1, r16 > - ;; > - cmp.geu p6, p7 = 1, r16 > - (p6) br.cond.dptk .L32 > - addl r35 = @ltoff(arg2#), gp > + st8 [r16] = r14 > + addl r35 = @ltoff(arg1#), gp > ;; > ld8 r35 = [r35] > - mov r32 = r1 > br.call.sptk.many b0 = test# > ;; > mov r1 = r32 > > i386: > > --- test128_1.s 2003-05-20 19:27:21.000000000 -0700 > +++ test128_2.s 2003-05-20 19:27:22.000000000 -0700 > @@ -1,4 +1,4 @@ > - .file "test128_1.c" > + .file "test128_2.c" > .text > .globl test > .type test,@function > @@ -13,10 +13,9 @@ > jbe .L5 > jmp .L3 > .L5: > + movl 8(%ebp), %edx > movl -4(%ebp), %eax > - leal 0(,%eax,4), %edx > - movl 8(%ebp), %eax > - movl (%eax,%edx), %edx > + movl (%edx,%eax,4), %edx > leal -8(%ebp), %eax > addl %edx, (%eax) > leal -4(%ebp), %eax > @@ -64,22 +63,12 @@ > pushl $.LC0 > call printf > addl $16, %esp > - movl $0, -4(%ebp) > -.L11: > - cmpl $1, -4(%ebp) > - jbe .L14 > - jmp .L12 > -.L14: > - movl -4(%ebp), %edx > - movl -4(%ebp), %eax > - movl arg1(,%eax,4), %eax > - movl %eax, arg2(,%edx,4) > - leal -4(%ebp), %eax > - incl (%eax) > - jmp .L11 > -.L12: > + movl arg1, %eax > + movl arg1+4, %edx > + movl %eax, arg2 > + movl %edx, arg2+4 > subl $12, %esp > - pushl $arg2 > + pushl $arg1 > call test > addl $16, %esp > movl %eax, -8(%ebp) > > ------------------------------------------------------------------------ > Name: test128_1.c > test128_1.c Type: text/x-csrc > Description: test128_1.c > > Name: test128_2.c > test128_2.c Type: text/x-csrc > Description: test128_2.c > > Name: test128_1.s > test128_1.s Type: Plain Text (text/plain) > Description: test128_1.s > > Name: test128_2.s > test128_2.s Type: Plain Text (text/plain) > Description: test128_2.s > > Name: test128_1.s > test128_1.s Type: Plain Text (text/plain) > Description: test128_1_ia64.s > > Name: test128_2.s > test128_2.s Type: Plain Text (text/plain) > Description: test128_2_ia64.s -- Best Regards, Ray ----------------------------------------------- Ray Bryant 512-453-9679 (work) 512-507-7807 (cell) ra...@sg... ra...@au... The box said: "Requires Windows 98 or better", so I installed Linux. ----------------------------------------------- |