## [Sbcl-help] how to do fast vector math in sbcl?

From: thomas weidner <3.14159@gm...> - 2007-08-18 18:40:20

```
Hi,

I want to do heavy 4D vector calculations in SBCL (yes, for 3D graphics)
and am currently investigating how to get the best asm code out of SBCL.

Unfortunately, the direct approach using a (simple-array single-float (4))
gives bad results: SBCL seems unable to optimize several array accesses
into register operations and falls back to memory->register->memory for
every operation. SBCL also does not optimize out temporary arrays (maybe
as a result of the lack of array optimizations at all).

Is the "values" way of doing vector math the best currently possible?
Will there be advantages for the SBCL optimizer with using arrays in the
near future? It should be able to generate code similar to the values
version.

Thanks in advance,
thomas

Here is the code I tested:

(declaim (optimize speed (safety 0)))

(deftype vec () '(simple-array single-float (4)))

(declaim (inline make-vec copy-vec make-vec-uninitialized
                 +v-into +v -v-into -v *sv *sv-into dot vec->values))

(defun make-vec-uninitialized ()
  (make-array 4 :element-type 'single-float))

(defun make-vec (x y z w)
  (let ((result (make-vec-uninitialized)))
    (setf (aref result 0) x)
    (setf (aref result 1) y)
    (setf (aref result 2) z)
    (setf (aref result 3) w)
    result))

(defun copy-vec (v)
  (declare (vec v))
  (make-vec (aref v 0) (aref v 1) (aref v 2) (aref v 3)))

(defun +v-into (d a b)
  (declare (vec d a b))
  (setf (aref d 0) (+ (aref a 0) (aref b 0)))
  (setf (aref d 1) (+ (aref a 1) (aref b 1)))
  (setf (aref d 2) (+ (aref a 2) (aref b 2)))
  (setf (aref d 3) (+ (aref a 3) (aref b 3)))
  d)

(defun +v (a b)
  (declare (vec a b))
  (+v-into (make-vec-uninitialized) a b))

(defun -v-into (d a b)
  (declare (vec d a b))
  (setf (aref d 0) (- (aref a 0) (aref b 0)))
  (setf (aref d 1) (- (aref a 1) (aref b 1)))
  (setf (aref d 2) (- (aref a 2) (aref b 2)))
  (setf (aref d 3) (- (aref a 3) (aref b 3)))
  d)

(defun -v (a b)
  (declare (vec a b))
  (-v-into (make-vec-uninitialized) a b))

(defun *sv-into (d s v)
  (declare (vec d v) (single-float s))
  (setf (aref d 0) (* s (aref v 0)))
  (setf (aref d 1) (* s (aref v 1)))
  (setf (aref d 2) (* s (aref v 2)))
  (setf (aref d 3) (* s (aref v 3)))
  d)

(defun *sv (s v)
  (declare (vec v) (single-float s))
  (*sv-into (make-vec-uninitialized) s v))

(defun dot (a b)
  (declare (vec a b))
  (+ (* (aref a 0) (aref b 0))
     (* (aref a 1) (aref b 1))
     (* (aref a 2) (aref b 2))
     (* (aref a 3) (aref b 3))))

;;;;; values

(defmacro +v/values (a b)
  (let ((c1 (loop repeat 4 collect (gensym)))
        (c2 (loop repeat 4 collect (gensym))))
    `(multiple-value-bind ,c1 ,a
       (declare (single-float ,@c1))
       (multiple-value-bind ,c2 ,b
         (declare (single-float ,@c2))
         (values ,@(loop for i in c1
                         for j in c2
                         collect `(+ ,i ,j)))))))

(defmacro -v/values (a b)
  (let ((c1 (loop repeat 4 collect (gensym)))
        (c2 (loop repeat 4 collect (gensym))))
    `(multiple-value-bind ,c1 ,a
       (declare (single-float ,@c1))
       (multiple-value-bind ,c2 ,b
         (declare (single-float ,@c2))
         (values ,@(loop for i in c1
                         for j in c2
                         collect `(- ,i ,j)))))))

(defmacro dot/values (a b)
  (let ((c1 (loop repeat 4 collect (gensym)))
        (c2 (loop repeat 4 collect (gensym))))
    `(multiple-value-bind ,c1 ,a
       (declare (single-float ,@c1))
       (multiple-value-bind ,c2 ,b
         (declare (single-float ,@c2))
         (+ ,@(loop for i in c1
                    for j in c2
                    collect `(* ,i ,j)))))))

(defmacro *sv/values (s v)
  (let ((c1 (loop repeat 4 collect (gensym)))
        (s1 (gensym)))
    `(let ((,s1 ,s))
       (declare (single-float ,s1))
       (multiple-value-bind ,c1 ,v
         (declare (single-float ,@c1))
         (values ,@(loop for i in c1 collect `(* ,s1 ,i)))))))

(defmacro vec-let ((&rest bindings) &body forms)
  (if bindings
      (let ((b (car bindings))
            (c (loop repeat 4 collect (gensym))))
        `(multiple-value-bind ,c ,(second b)
           (declare (single-float ,@c))
           (symbol-macrolet ((,(first b) (values ,@c)))
             (vec-let ,(cdr bindings) ,@forms))))
      `(progn ,@forms)))

(defun vec->values (v)
  (declare (vec v))
  (values (aref v 0) (aref v 1) (aref v 2) (aref v 3)))

(defmacro values->vec (form)
  (let ((c (loop repeat 4 collect (gensym))))
    `(multiple-value-bind ,c ,form
       (declare (single-float ,@c))
       (make-vec ,@c))))

;;; Test code

(defun reflect-1 (L N)
  (declare (vec L N))
  (-v (*sv (* 2.0 (dot L N)) N) L))

(defun reflect-1/inplace (L N)
  (let ((result (copy-vec N)))
    (*sv-into result (* 2.0 (dot L N)) N)
    (-v-into result result L)))

(defun reflect-2 (L N)
  (declare (vec L N))
  (vec-let ((Lv (vec->values L))
            (Nv (vec->values N)))
    (values->vec
     (-v/values (*sv/values (* 2.0 (dot/values Lv Nv)) Nv) Lv))))

(defun reflect-2/expanded (L-x L-y L-z L-w N-x N-y N-z N-w)
  (declare (single-float L-x L-y L-z L-w N-x N-y N-z N-w))
  (vec-let ((Lv (values L-x L-y L-z L-w))
            (Nv (values N-x N-y N-z N-w)))
    (-v/values (*sv/values (* 2.0 (dot/values Lv Nv)) Nv) Lv)))

and here are the disassemblies (sbcl 1.0.7 amd64):

CL-USER> (disassemble #'reflect-1)
; 03A6C31F:     F30F104A01          MOVSS XMM1, [RDX+1]  ; no-arg-parsing entry point
;      324:     F30F105701          MOVSS XMM2, [RDI+1]
;      329:     F30F10D9            MOVSS XMM3, XMM1
;      32D:     F30F59DA            MULSS XMM3, XMM2
;      331:     F30F104A05          MOVSS XMM1, [RDX+5]
;      336:     F30F105705          MOVSS XMM2, [RDI+5]
;      33B:     F30F59CA            MULSS XMM1, XMM2
;      33F:     F30F58D9            ADDSS XMM3, XMM1
;      343:     F30F104A09          MOVSS XMM1, [RDX+9]
;      348:     F30F105709          MOVSS XMM2, [RDI+9]
;      34D:     F30F59CA            MULSS XMM1, XMM2
;      351:     F30F58D9            ADDSS XMM3, XMM1
;      355:     F30F104A0D          MOVSS XMM1, [RDX+13]
;      35A:     F30F10570D          MOVSS XMM2, [RDI+13]
;      35F:     F30F59CA            MULSS XMM1, XMM2
;      363:     F30F58CB            ADDSS XMM1, XMM3
;      367:     488B0D62FFFFFF      MOV RCX, [RIP-158]  ; 2.0
;      36E:     488BC1              MOV RAX, RCX
;      371:     48C1E820            SHR RAX, 32
;      375:     66480F6ED0          MOVD XMM2, RAX
;      37A:     F30F59CA            MULSS XMM1, XMM2
;      37E:     B9D6000000          MOV ECX, 214
;      383:     BB20000000          MOV EBX, 32
;      388:     BE10000000          MOV ESI, 16
;      38D:     488D461F            LEA RAX, [RSI+31]
;      391:     4883E0F0            AND RAX, -16
;      395:     41808C249000000008  OR BYTE PTR [R12+144], 8
;      39E:     4D8B5C2440          MOV R11, [R12+64]
;      3A3:     4C01D8              ADD RAX, R11
;      3A6:     4939442448          CMP [R12+72], RAX
;      3AB:     0F860F010000        JBE L4
;      3B1:     4989442440          MOV [R12+64], RAX
;      3B6:     498BC3              MOV RAX, R11
;      3B9: L0: 488D400F            LEA RAX, [RAX+15]
;      3BD:     488948F1            MOV [RAX-15], RCX
;      3C1:     488958F9            MOV [RAX-7], RBX
;      3C5:     4180B4249000000008  XOR BYTE PTR [R12+144], 8
;      3CE:     7402                JEQ L1
;      3D0:     CC09                BREAK 9  ; pending interrupt trap
;      3D2: L1: F30F105701          MOVSS XMM2, [RDI+1]
;      3D7:     F30F59D1            MULSS XMM2, XMM1
;      3DB:     F30F115001          MOVSS [RAX+1], XMM2
;      3E0:     F30F105705          MOVSS XMM2, [RDI+5]
;      3E5:     F30F59D1            MULSS XMM2, XMM1
;      3E9:     F30F115005          MOVSS [RAX+5], XMM2
;      3EE:     F30F105709          MOVSS XMM2, [RDI+9]
;      3F3:     F30F59D1            MULSS XMM2, XMM1
;      3F7:     F30F115009          MOVSS [RAX+9], XMM2
;      3FC:     F30F10570D          MOVSS XMM2, [RDI+13]
;      401:     F30F59CA            MULSS XMM1, XMM2
;      405:     F30F11480D          MOVSS [RAX+13], XMM1
;      40A:     BBD6000000          MOV EBX, 214
;      40F:     BE20000000          MOV ESI, 32
;      414:     BF10000000          MOV EDI, 16
;      419:     488D4F1F            LEA RCX, [RDI+31]
;      41D:     4883E1F0            AND RCX, -16
;      421:     41808C249000000008  OR BYTE PTR [R12+144], 8
;      42A:     4D8B5C2440          MOV R11, [R12+64]
;      42F:     4C01D9              ADD RCX, R11
;      432:     49394C2448          CMP [R12+72], RCX
;      437:     0F869A000000        JBE L5
;      43D:     49894C2440          MOV [R12+64], RCX
;      442:     498BCB              MOV RCX, R11
;      445: L2: 488D490F            LEA RCX, [RCX+15]
;      449:     488959F1            MOV [RCX-15], RBX
;      44D:     488971F9            MOV [RCX-7], RSI
;      451:     4180B4249000000008  XOR BYTE PTR [R12+144], 8
;      45A:     7402                JEQ L3
;      45C:     CC09                BREAK 9  ; pending interrupt trap
;      45E: L3: F30F104801          MOVSS XMM1, [RAX+1]
;      463:     F30F105201          MOVSS XMM2, [RDX+1]
;      468:     F30F5CCA            SUBSS XMM1, XMM2
;      46C:     F30F114901          MOVSS [RCX+1], XMM1
;      471:     F30F104805          MOVSS XMM1, [RAX+5]
;      476:     F30F105205          MOVSS XMM2, [RDX+5]
;      47B:     F30F5CCA            SUBSS XMM1, XMM2
;      47F:     F30F114905          MOVSS [RCX+5], XMM1
;      484:     F30F104809          MOVSS XMM1, [RAX+9]
;      489:     F30F105209          MOVSS XMM2, [RDX+9]
;      48E:     F30F5CCA            SUBSS XMM1, XMM2
;      492:     F30F114909          MOVSS [RCX+9], XMM1
;      497:     F30F10480D          MOVSS XMM1, [RAX+13]
;      49C:     F30F10520D          MOVSS XMM2, [RDX+13]
;      4A1:     F30F5CCA            SUBSS XMM1, XMM2
;      4A5:     F30F11490D          MOVSS [RCX+13], XMM1
;      4AA:     488BD1              MOV RDX, RCX
;      4AD:     488D65F0            LEA RSP, [RBP-16]
;      4B1:     F8                  CLC
;      4B2:     488B6DF8            MOV RBP, [RBP-8]
;      4B6:     C20800              RET 8
;      4B9:     90                  NOP
;      4BA:     90                  NOP
;      4BB:     90                  NOP
;      4BC:     90                  NOP
;      4BD:     90                  NOP
;      4BE:     90                  NOP
;      4BF:     90                  NOP
;      4C0: L4: 492B442440          SUB RAX, [R12+64]
;      4C5:     50                  PUSH RAX
;      4C6:     4C8D1C2510D74100    LEA R11, [#x41D710]  ; alloc_tramp
;      4CE:     41FFD3              CALL R11
;      4D1:     58                  POP RAX
;      4D2:     E9E2FEFFFF          JMP L0
;      4D7: L5: 492B4C2440          SUB RCX, [R12+64]
;      4DC:     51                  PUSH RCX
;      4DD:     4C8D1C2510D74100    LEA R11, [#x41D710]  ; alloc_tramp
;      4E5:     41FFD3              CALL R11
;      4E8:     59                  POP RCX
;      4E9:     E957FFFFFF          JMP L2
;
NIL

CL-USER> (disassemble #'reflect-1/inplace)
; 03A6C58F:     F30F104701          MOVSS XMM0, [RDI+1]  ; no-arg-parsing entry point
;      594:     F30F104F05          MOVSS XMM1, [RDI+5]
;      599:     F30F105709          MOVSS XMM2, [RDI+9]
;      59E:     F30F105F0D          MOVSS XMM3, [RDI+13]
;      5A3:     B8D6000000          MOV EAX, 214
;      5A8:     BB20000000          MOV EBX, 32
;      5AD:     BE10000000          MOV ESI, 16
;      5B2:     488D4E1F            LEA RCX, [RSI+31]
;      5B6:     4883E1F0            AND RCX, -16
;      5BA:     41808C249000000008  OR BYTE PTR [R12+144], 8
;      5C3:     4D8B5C2440          MOV R11, [R12+64]
;      5C8:     4C01D9              ADD RCX, R11
;      5CB:     49394C2448          CMP [R12+72], RCX
;      5D0:     0F862A010000        JBE L2
;      5D6:     49894C2440          MOV [R12+64], RCX
;      5DB:     498BCB              MOV RCX, R11
;      5DE: L0: 488D490F            LEA RCX, [RCX+15]
;      5E2:     488941F1            MOV [RCX-15], RAX
;      5E6:     488959F9            MOV [RCX-7], RBX
;      5EA:     4180B4249000000008  XOR BYTE PTR [R12+144], 8
;      5F3:     7402                JEQ L1
;      5F5:     CC09                BREAK 9  ; pending interrupt trap
;      5F7: L1: F30F114101          MOVSS [RCX+1], XMM0
;      5FC:     F30F114905          MOVSS [RCX+5], XMM1
;      601:     F30F115109          MOVSS [RCX+9], XMM2
;      606:     F30F11590D          MOVSS [RCX+13], XMM3
;      60B:     F30F104A01          MOVSS XMM1, [RDX+1]
;      610:     F30F105701          MOVSS XMM2, [RDI+1]
;      615:     F30F10D9            MOVSS XMM3, XMM1
;      619:     F30F59DA            MULSS XMM3, XMM2
;      61D:     F30F104A05          MOVSS XMM1, [RDX+5]
;      622:     F30F105705          MOVSS XMM2, [RDI+5]
;      627:     F30F59CA            MULSS XMM1, XMM2
;      62B:     F30F58D9            ADDSS XMM3, XMM1
;      62F:     F30F104A09          MOVSS XMM1, [RDX+9]
;      634:     F30F105709          MOVSS XMM2, [RDI+9]
;      639:     F30F59CA            MULSS XMM1, XMM2
;      63D:     F30F58D9            ADDSS XMM3, XMM1
;      641:     F30F104A0D          MOVSS XMM1, [RDX+13]
;      646:     F30F10570D          MOVSS XMM2, [RDI+13]
;      64B:     F30F59CA            MULSS XMM1, XMM2
;      64F:     F30F58CB            ADDSS XMM1, XMM3
;      653:     488B1DE6FEFFFF      MOV RBX, [RIP-282]  ; 2.0
;      65A:     488BC3              MOV RAX, RBX
;      65D:     48C1E820            SHR RAX, 32
;      661:     66480F6ED0          MOVD XMM2, RAX
;      666:     F30F59CA            MULSS XMM1, XMM2
;      66A:     F30F105701          MOVSS XMM2, [RDI+1]
;      66F:     F30F59D1            MULSS XMM2, XMM1
;      673:     F30F115101          MOVSS [RCX+1], XMM2
;      678:     F30F105705          MOVSS XMM2, [RDI+5]
;      67D:     F30F59D1            MULSS XMM2, XMM1
;      681:     F30F115105          MOVSS [RCX+5], XMM2
;      686:     F30F105709          MOVSS XMM2, [RDI+9]
;      68B:     F30F59D1            MULSS XMM2, XMM1
;      68F:     F30F115109          MOVSS [RCX+9], XMM2
;      694:     F30F10570D          MOVSS XMM2, [RDI+13]
;      699:     F30F59CA            MULSS XMM1, XMM2
;      69D:     F30F11490D          MOVSS [RCX+13], XMM1
;      6A2:     F30F104901          MOVSS XMM1, [RCX+1]
;      6A7:     F30F105201          MOVSS XMM2, [RDX+1]
;      6AC:     F30F5CCA            SUBSS XMM1, XMM2
;      6B0:     F30F114901          MOVSS [RCX+1], XMM1
;      6B5:     F30F104905          MOVSS XMM1, [RCX+5]
;      6BA:     F30F105205          MOVSS XMM2, [RDX+5]
;      6BF:     F30F5CCA            SUBSS XMM1, XMM2
;      6C3:     F30F114905          MOVSS [RCX+5], XMM1
;      6C8:     F30F104909          MOVSS XMM1, [RCX+9]
;      6CD:     F30F105209          MOVSS XMM2, [RDX+9]
;      6D2:     F30F5CCA            SUBSS XMM1, XMM2
;      6D6:     F30F114909          MOVSS [RCX+9], XMM1
;      6DB:     F30F10490D          MOVSS XMM1, [RCX+13]
;      6E0:     F30F10520D          MOVSS XMM2, [RDX+13]
;      6E5:     F30F5CCA            SUBSS XMM1, XMM2
;      6E9:     F30F11490D          MOVSS [RCX+13], XMM1
;      6EE:     488BD1              MOV RDX, RCX
;      6F1:     488D65F0            LEA RSP, [RBP-16]
;      6F5:     F8                  CLC
;      6F6:     488B6DF8            MOV RBP, [RBP-8]
;      6FA:     C20800              RET 8
;      6FD:     90                  NOP
;      6FE:     90                  NOP
;      6FF:     90                  NOP
;      700: L2: 492B4C2440          SUB RCX, [R12+64]
;      705:     51                  PUSH RCX
;      706:     4C8D1C2510D74100    LEA R11, [#x41D710]  ; alloc_tramp
;      70E:     41FFD3              CALL R11
;      711:     59                  POP RCX
;      712:     E9C7FEFFFF          JMP L0
;
NIL

CL-USER> (disassemble #'reflect-2)
; 03A6C7BF:     F30F105201          MOVSS XMM2, [RDX+1]  ; no-arg-parsing entry point
;      7C4:     F30F105A05          MOVSS XMM3, [RDX+5]
;      7C9:     F30F106209          MOVSS XMM4, [RDX+9]
;      7CE:     F30F106A0D          MOVSS XMM5, [RDX+13]
;      7D3:     F30F107701          MOVSS XMM6, [RDI+1]
;      7D8:     F30F107F05          MOVSS XMM7, [RDI+5]
;      7DD:     F3440F104709        MOVSS XMM8, [RDI+9]
;      7E3:     F3440F104F0D        MOVSS XMM9, [RDI+13]
;      7E9:     F30F10CA            MOVSS XMM1, XMM2
;      7ED:     F30F59CE            MULSS XMM1, XMM6
;      7F1:     F3440F10D3          MOVSS XMM10, XMM3
;      7F6:     F3440F59D7          MULSS XMM10, XMM7
;      7FB:     F3410F58CA          ADDSS XMM1, XMM10
;      800:     F3440F10D4          MOVSS XMM10, XMM4
;      805:     F3450F59D0          MULSS XMM10, XMM8
;      80A:     F3410F58CA          ADDSS XMM1, XMM10
;      80F:     F3440F10D5          MOVSS XMM10, XMM5
;      814:     F3450F59D1          MULSS XMM10, XMM9
;      819:     F3410F58CA          ADDSS XMM1, XMM10
;      81E:     488B0D4BFFFFFF      MOV RCX, [RIP-181]  ; 2.0
;      825:     488BC1              MOV RAX, RCX
;      828:     48C1E820            SHR RAX, 32
;      82C:     664C0F6ED0          MOVD XMM10, RAX
;      831:     F3410F59CA          MULSS XMM1, XMM10
;      836:     F30F59F1            MULSS XMM6, XMM1
;      83A:     F30F59F9            MULSS XMM7, XMM1
;      83E:     F3440F59C1          MULSS XMM8, XMM1
;      843:     F3410F59C9          MULSS XMM1, XMM9
;      848:     F30F5CF2            SUBSS XMM6, XMM2
;      84C:     F30F5CFB            SUBSS XMM7, XMM3
;      850:     F3440F5CC4          SUBSS XMM8, XMM4
;      855:     F30F5CCD            SUBSS XMM1, XMM5
;      859:     B8D6000000          MOV EAX, 214
;      85E:     BA20000000          MOV EDX, 32
;      863:     BB10000000          MOV EBX, 16
;      868:     488D4B1F            LEA RCX, [RBX+31]
;      86C:     4883E1F0            AND RCX, -16
;      870:     41808C249000000008  OR BYTE PTR [R12+144], 8
;      879:     4D8B5C2440          MOV R11, [R12+64]
;      87E:     4C01D9              ADD RCX, R11
;      881:     49394C2448          CMP [R12+72], RCX
;      886:     7648                JBE L2
;      888:     49894C2440          MOV [R12+64], RCX
;      88D:     498BCB              MOV RCX, R11
;      890: L0: 488D490F            LEA RCX, [RCX+15]
;      894:     488941F1            MOV [RCX-15], RAX
;      898:     488951F9            MOV [RCX-7], RDX
;      89C:     4180B4249000000008  XOR BYTE PTR [R12+144], 8
;      8A5:     7402                JEQ L1
;      8A7:     CC09                BREAK 9  ; pending interrupt trap
;      8A9: L1: F30F117101          MOVSS [RCX+1], XMM6
;      8AE:     F30F117905          MOVSS [RCX+5], XMM7
;      8B3:     F3440F114109        MOVSS [RCX+9], XMM8
;      8B9:     F30F11490D          MOVSS [RCX+13], XMM1
;      8BE:     488BD1              MOV RDX, RCX
;      8C1:     488D65F0            LEA RSP, [RBP-16]
;      8C5:     F8                  CLC
;      8C6:     488B6DF8            MOV RBP, [RBP-8]
;      8CA:     C20800              RET 8
;      8CD:     90                  NOP
;      8CE:     90                  NOP
;      8CF:     90                  NOP
;      8D0: L2: 492B4C2440          SUB RCX, [R12+64]
;      8D5:     51                  PUSH RCX
;      8D6:     4C8D1C2510D74100    LEA R11, [#x41D710]  ; alloc_tramp
;      8DE:     41FFD3              CALL R11
;      8E1:     59                  POP RCX
;      8E2:     EBAC                JMP L0
;
NIL

CL-USER> (disassemble #'reflect-2/expanded)
; 03A6CB1C:     F30F10CA            MOVSS XMM1, XMM2  ; no-arg-parsing entry point
;       20:     F30F59CE            MULSS XMM1, XMM6
;       24:     F3440F10D3          MOVSS XMM10, XMM3
;       29:     F3440F59D7          MULSS XMM10, XMM7
;       2E:     F3410F58CA          ADDSS XMM1, XMM10
;       33:     F3440F10D4          MOVSS XMM10, XMM4
;       38:     F3450F59D0          MULSS XMM10, XMM8
;       3D:     F3410F58CA          ADDSS XMM1, XMM10
;       42:     F3440F10D5          MOVSS XMM10, XMM5
;       47:     F3450F59D1          MULSS XMM10, XMM9
;       4C:     F3410F58CA          ADDSS XMM1, XMM10
;       51:     488B0DF8FEFFFF      MOV RCX, [RIP-264]  ; 2.0
;       58:     488BC1              MOV RAX, RCX
;       5B:     48C1E820            SHR RAX, 32
;       5F:     664C0F6ED0          MOVD XMM10, RAX
;       64:     F3410F59CA          MULSS XMM1, XMM10
;       69:     F30F59F1            MULSS XMM6, XMM1
;       6D:     F30F59F9            MULSS XMM7, XMM1
;       71:     F3440F59C1          MULSS XMM8, XMM1
;       76:     F3410F59C9          MULSS XMM1, XMM9
;       7B:     F30F5CF2            SUBSS XMM6, XMM2
;       7F:     F30F5CFB            SUBSS XMM7, XMM3
;       83:     F3440F5CC4          SUBSS XMM8, XMM4
;       88:     F30F5CCD            SUBSS XMM1, XMM5
;       8C:     66480F7EF2          MOVD RDX, XMM6
;       91:     48C1E220            SHL RDX, 32
;       95:     4883CA1A            OR RDX, 26
;       99:     66480F7EFF          MOVD RDI, XMM7
;       9E:     48C1E720            SHL RDI, 32
;       A2:     4883CF1A            OR RDI, 26
;       A6:     664C0F7EC6          MOVD RSI, XMM8
;       AB:     48C1E620            SHL RSI, 32
;       AF:     4883CE1A            OR RSI, 26
;       B3:     66480F7EC8          MOVD RAX, XMM1
;       B8:     48C1E020            SHL RAX, 32
;       BC:     4883C81A            OR RAX, 26
;       C0:     488945E0            MOV [RBP-32], RAX
;       C4:     488BDD              MOV RBX, RBP
;       C7:     B920000000          MOV ECX, 32
;       CC:     488B6DF8            MOV RBP, [RBP-8]
;       D0:     488D63E0            LEA RSP, [RBX-32]
;       D4:     F9                  STC
;       D5:     FF63F0              JMP QWORD PTR [RBX-16]
;       D8:     90                  NOP
;       D9:     90                  NOP
;       DA:     90                  NOP
;       DB:     90                  NOP
;       DC:     90                  NOP
;       DD:     90                  NOP
;       DE:     90                  NOP
;       DF:     90                  NOP
;
NIL
CL-USER>
```
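
For readers who want to try the comparison without the macro layer, the two styles discussed in the post can be condensed into a pair of plain functions. This is only a sketch, not the poster's code: the names `vec4-sketch`, `make-vec4-sketch`, `reflect-array-sketch`, and `reflect-values-sketch` are hypothetical, and both functions are meant to compute the same reflection, 2(L·N)N - L, as `reflect-1` and `reflect-2/expanded`.

```lisp
;; Hypothetical names throughout; a condensed illustration, not the original code.
(deftype vec4-sketch () '(simple-array single-float (4)))

(defun make-vec4-sketch (x y z w)
  "Allocate a fresh 4-element single-float vector holding X Y Z W."
  (let ((v (make-array 4 :element-type 'single-float)))
    (setf (aref v 0) x
          (aref v 1) y
          (aref v 2) z
          (aref v 3) w)
    v))

;; Array style: arguments and result are heap-allocated vectors.
(defun reflect-array-sketch (l n)
  (declare (type vec4-sketch l n))
  (let ((s (* 2.0 (+ (* (aref l 0) (aref n 0))
                     (* (aref l 1) (aref n 1))
                     (* (aref l 2) (aref n 2))
                     (* (aref l 3) (aref n 3)))))
        (result (make-array 4 :element-type 'single-float)))
    (dotimes (i 4 result)
      (setf (aref result i) (- (* s (aref n i)) (aref l i))))))

;; Values style: components travel as eight single-float arguments and
;; four return values, so the function itself creates no array.
(defun reflect-values-sketch (lx ly lz lw nx ny nz nw)
  (declare (type single-float lx ly lz lw nx ny nz nw))
  (let ((s (* 2.0 (+ (* lx nx) (* ly ny) (* lz nz) (* lw nw)))))
    (values (- (* s nx) lx)
            (- (* s ny) ly)
            (- (* s nz) lz)
            (- (* s nw) lw))))
```

Wrapping repeated calls to each function in the standard `time` macro is one way to check whether the array version allocates on a given SBCL build; the figures will depend on the SBCL version and platform.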
