From: Denys R. <rt...@ma...> - 2007-04-28 06:37:13
|
Hi, I'm looking for the most high-performance access to 1D and 2D arrays in SBCL. The operations I am interested in are: arbitrary access to any cell, copying of fragments (lines or rectangles) from one array to another and transferring the data to a foreign functions (which, as I expect, can be made by allocating a foreign array and copying the data into it). The arrays contain bytes/dwords/floating-point numbers/structures. So I suppose that in each case SBCL would be able to allocate a single chunk of memory with all the objects in-lined. I created a simple test first and found that it does not work as good as I'd like it to be. The test allocated an array of bytes and fills it three times with different values. Here is the code: (declaim (optimize speed (safety 0) (debug 0))) (defun bench () (declare (optimize speed (safety 0) (debug 0))) (let* ((a (make-array 10 :element-type '(unsigned-byte 8)))) (loop for i from 0 below (array-dimension a 0) do (setf (aref a i) 7)) (loop for i from 0 below (array-dimension a 0) do (setf (aref a i) 8)) (loop for i from 0 below (array-dimension a 0) do (setf (aref a i) 9)))) However, when I look at the assembler source I see this (/the fragment with the second loop/): ; C09: 31C9 XOR ECX, ECX ; C0B: EB11 JMP L6 ; C0D: L5: 8BD9 MOV EBX, ECX ; C0F: C1FB02 SAR EBX, 2 ; C12: B808000000 MOV EAX, 8 ; C17: 88441A01 MOV [EDX+EBX+1], AL ; C1B: 83C104 ADD ECX, 4 ; C1E: L6: 83F928 CMP ECX, 40 ; C21: 7CEA JL L5 The issues I noticed are: 1. Instead of allocating 1 byte for the element SBCL has allocated 4(?). This is seen from the line C1B. How can I make it allocate only 1 byte? I have already told it that I want an array with unsigned bytes, each is 8 bits long, didn't I? 2. The "MOV" command on line C17 seems to be too heavyweight, doesn't it? I do not see why three values should be added just to get the address on the next entry. 3. On line C12 it is clearly seen that EAX is filled instantly with the value "8" which is a constant. It would be great if I can make SBCL put it out of the loop, between the lines C09 and C0B. 4. The "SAR EBX, 2" instruction seems to be quite mysterious for me. Maybe this one can be evaded, but I am not sure about that yet. -------- So, I'd be thankful if someone could suggest me what else can I do to optimize the code except for the (declare) statements I have used. Also, It would be great if someone could point me out where can I take a look for the SBCL's optimization of the assembly. Perhaps I could tune it a bit? Thank you, Denys Rtveliashvili |
From: Christophe R. <cs...@ca...> - 2007-04-28 08:48:37
|
Denys Rtveliashvili <rt...@ma...> writes: > ; C09: 31C9 XOR ECX, ECX > ; C0B: EB11 JMP L6 > ; C0D: L5: 8BD9 MOV EBX, ECX > ; C0F: C1FB02 SAR EBX, 2 > ; C12: B808000000 MOV EAX, 8 > ; C17: 88441A01 MOV [EDX+EBX+1], AL > ; C1B: 83C104 ADD ECX, 4 > ; C1E: L6: 83F928 CMP ECX, 40 > ; C21: 7CEA JL L5 > > The issues I noticed are: You are misreading the assembly in several places, I'm afraid. > 1. Instead of allocating 1 byte for the element SBCL has allocated 4(?). > This is seen from the line C1B. No, it isn't. ECX holds the lisp variable i in your loop, which is a live variable. The actual array access is done byte-by-byte, which is achieved by your "mysterious" point four. > 2. The "MOV" command on line C17 seems to be too heavyweight, doesn't > it? I do not see why three values should be added just to get the > address on the next entry. > > 3. On line C12 it is clearly seen that EAX is filled instantly with the > value "8" which is a constant. It would be great if I can make SBCL put > it out of the loop, between the lines C09 and C0B. These two issues fall under the general class of loop optimizations: strength reduction and invariant hoisting. There is some apparatus in SBCL for performing this kind of optimization (for instance, loop analysis is used in the register allocator) but not very much is done. If you are interested in this kind of thing, that is where you should start. > So, I'd be thankful if someone could suggest me what else can I do to > optimize the code except for the (declare) statements I have used. If you're filling a vector with a constant value, call FILL instead. > Also, It would be great if someone could point me out where can I take a > look for the SBCL's optimization of the assembly. Perhaps I could tune > it a bit? There is no one place where SBCL generates assembly in the sense that you mean. Instead, the compiler converts into a high-level internal representation, performs various optimizations on that representation, converts to a lower level representation, performs some more optimizations, and then emits assembly to implement what remains of that lower-level internal representation. (To address another of your questions which I have snipped: it is possible to pass vectors of numbers to foreign code without copying.) Cheers, Christophe |
From: Denys R. <rt...@ma...> - 2007-04-28 14:28:52
|
Hi Christophe, >> ; C09: 31C9 XOR ECX, ECX >> ; C0B: EB11 JMP L6 >> ; C0D: L5: 8BD9 MOV EBX, ECX >> ; C0F: C1FB02 SAR EBX, 2 >> ; C12: B808000000 MOV EAX, 8 >> ; C17: 88441A01 MOV [EDX+EBX+1], AL >> ; C1B: 83C104 ADD ECX, 4 >> ; C1E: L6: 83F928 CMP ECX, 40 >> ; C21: 7CEA JL L5 >> > No, it isn't. ECX holds the lisp variable i in your loop, which is a > live variable. The actual array access is done byte-by-byte, which is > achieved by your "mysterious" point four. > Ah.. I think I got the idea. "SAR EBX, 2" shifts "EBX" by two bits to the right and thus divides the EBX. However, the code seems to be slightly overcomplicated. You see, the ECX does not hold the value of "i". Instead, it holds "i" multiplied by 4. That's why "CMP ECX, 40" compares it to 40 instead of 10. And when an index is necessary, ECX is copied into EBX (so that ECX does not get changed), divided by 4 and only then used to specify the index. I suppose that the magic number "4" appeared because usually the array elements in SBCL are 4 bytes long. Anyway, a better (?) code would look like this: XOR ECX, ECX MOV EAX, 8 JMP L6 L5: MOV [EDX+ECX+1], AL INC ECX L6: CMP ECX, 10 JL L5 In this way the EAX is loaded only once and then re-used. Also, the body of the loop is now only 4 instructions instead of 7. This might be not a big win, but that's good anyway. The better results would be achieved with the "LOOP" instruction, but it is clear that SBCL may not convert the (loop ...) with the counter raising from 0 to a positive value to an assembly with that instruction. >> 2. The "MOV" command on line C17 seems to be too heavyweight, doesn't >> it? I do not see why three values should be added just to get the >> address on the next entry. >> >> 3. On line C12 it is clearly seen that EAX is filled instantly with the >> value "8" which is a constant. It would be great if I can make SBCL put >> it out of the loop, between the lines C09 and C0B. >> > > These two issues fall under the general class of loop optimizations: > strength reduction and invariant hoisting. There is some apparatus in > SBCL for performing this kind of optimization (for instance, loop > analysis is used in the register allocator) but not very much is done. > If you are interested in this kind of thing, that is where you should > start. > It would be great to take a look at that. Could you suggest where should I begin? > If you're filling a vector with a constant value, call FILL instead > Thank you very much for the hint! By the way, I suppose FILL does not work for 2D arrays. Could you also suggest some functions for copying/filling specific regions of arrays? Or maybe there is some way to interpret a 2D array as a 1D one? >> Also, It would be great if someone could point me out where can I take a >> look for the SBCL's optimization of the assembly. Perhaps I could tune >> it a bit? >> > > There is no one place where SBCL generates assembly in the sense that > you mean. Instead, the compiler converts into a high-level internal > representation, performs various optimizations on that representation, > converts to a lower level representation, performs some more > optimizations, and then emits assembly to implement what remains of > that lower-level internal representation. > I have almost forget the x86 assembly and I am quite new to LISP, but I'd like to take a look if I can figure out how does it work and perhaps tune it a bit. It looks like this goes out of the sbcl-help scope, but could you suggest a place to start? > (To address another of your questions which I have snipped: it is > possible to pass vectors of numbers to foreign code without copying.) > Oh, that's great :-) Thank you. With best wishes, Denys |