"...on a sidenote, it is also not quite obvious to me that the inlin=
e

expansion of FILL is a r= eal win these days:"

=

expansion of FILL is a r= eal win these days:"

Instead of inlining FILL with a loop, we should either emit REPE STOS for Intel, or just call a function. The function call is nearly never the overhead on a fill. Last I checked, the Go compiler people seemed to think that Intel string instructions are only "worth it" at 1KB or more, but I on the other hand think it's almost never not worth it.
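
For illustration, here is a minimal C sketch of the three options being weighed: a word-at-a-time loop, a single string instruction, and an out-of-line call. It is not the attached test program. The names `fill_loop`, `fill_rep_stosq`, and `fill_call` are hypothetical, and the inline assembly assumes GCC or Clang on x86-64; on any other target the sketch falls back to the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Option 1: roughly what an inlined FILL amounts to, one store per word. */
static void fill_loop(uint64_t *dst, uint64_t value, size_t nwords)
{
    assert(dst != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++)
        dst[i] = value;
}

/* Option 2: a single string instruction. RDI = destination,
   RCX = word count, RAX = value. The direction flag is assumed clear,
   as the x86-64 System V ABI requires on function entry. */
static void fill_rep_stosq(uint64_t *dst, uint64_t value, size_t nwords)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(nwords)
                     : "a"(value)
                     : "memory");
#else
    fill_loop(dst, value, nwords);   /* portable fallback (assumption) */
#endif
}

/* Option 3: an out-of-line call, so call overhead is paid once per fill. */
#if defined(__GNUC__) || defined(__clang__)
__attribute__((noinline))
#endif
static void fill_call(uint64_t *dst, uint64_t value, size_t nwords)
{
    fill_loop(dst, value, nwords);
}
```

All three leave the buffer in the same state; only the cost per word and the fixed startup cost differ, which is what the measurements further down try to separate.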

The real inefficiency, imo, of our stack-allocation is that we do "initialization" twice: once with 'REPE STOS' and then *again* with a FILL of either the inlined or not variety. The first time is because we don't want stray pointers above the stack bottom, I suppose.
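
To put a number on the double initialization, here is a small C sketch that counts word stores for the two strategies. It is a simplification under stated assumptions: `init_twice`, `init_once`, and the `stores` counter are hypothetical names, and the vector header and any GC-safety constraints on partially initialized stack objects are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts word stores so the two strategies can be compared. */
static size_t stores;

static void store_words(uint64_t *dst, uint64_t value, size_t nwords)
{
    assert(dst != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = value;
        stores++;
    }
}

/* Behaviour as described above: zero the block, then fill it. */
static void init_twice(uint64_t *block, uint64_t value, size_t nwords)
{
    store_words(block, 0, nwords);      /* pass 1: clear stale contents */
    store_words(block, value, nwords);  /* pass 2: the actual FILL */
}

/* Alternative: write the fill value directly, in one pass. */
static void init_once(uint64_t *block, uint64_t value, size_t nwords)
{
    store_words(block, value, nwords);
}
```

For an n-word block the first variant performs 2n stores and the second n, with identical final contents.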

;      BA3:       488953F9         MOV [RBX-7], RDX
;      BA7:       31C0             XOR EAX, EAX
;      BA9:       F348AB           REPE STOSQ
;      BAC:       488BD3           MOV RDX, RBX
;      BAF:       488D5C24F0       LEA RBX, [RSP-16]
;      BB4:       4883EC20         SUB RSP, 32
;      BB8:       488BFE           MOV RDI, RSI
;      BBB:       31F6             XOR ESI, ESI
;      BBD:       48C743F017001020 MOV QWORD PTR [RBX-16], 537919511
;      BC5:       488B054CFFFFFF   MOV RAX, [RIP-180]        ; #<FDEFINITION object for VECTOR-FILL*>

Fwiw, have fun with the attached test program.

For the first test, the left column is the array size (in words), followed by results for 3 ways of filling.

The next test is 5 ways of moving bytes. You can't outpace 'memcpy' in any case but 5 words or less.

* (time-fill-simple-vector)

Columns: FILL, 'REPNE STOSQ', memset_pattern

       0 (2000000 trials):   0.0106   0.0088 [  -17.0%]   0.0074 [  -30.3%]
       1 (2000000 trials):   0.0108   0.0091 [  -16.2%]   0.0076 [  -30.3%]
       2 (1000000 trials):   0.0215   0.0181 [  -16.2%]   0.0150 [  -30.3%]
       5 ( 400000 trials):   0.0515   0.0448 [  -13.0%]   0.0369 [  -28.3%]
      10 ( 200000 trials):   0.0949   0.0895 [   -5.6%]   0.0724 [  -23.7%]
      20 ( 100000 trials):   0.1553   0.1674 [   +7.8%]   0.1392 [  -10.4%]
      30 (  66666 trials):   0.1951   0.2314 [  +18.6%]   0.1960 [   +0.5%]
      40 (  50000 trials):   0.2185   0.2919 [  +33.6%]   0.2652 [  +21.4%]
      60 (  33333 trials):   0.2913   0.4059 [  +39.4%]   0.3870 [  +32.9%]
      80 (  25000 trials):   0.3459   0.5225 [  +51.1%]   0.5097 [  +47.3%]
     100 (  20000 trials):   0.3689   0.6199 [  +68.1%]   0.5996 [  +62.5%]
     200 (  10000 trials):   0.4718   0.9779 [ +107.3%]   0.9721 [ +106.0%]
     400 (   5000 trials):   0.5812   1.4980 [ +157.8%]   1.3905 [ +139.3%]
     600 (   3333 trials):   0.5704   1.6310 [ +185.9%]   1.5688 [ +175.0%]
     800 (   2500 trials):   0.6148   1.8978 [ +208.7%]   1.7754 [ +188.8%]
    1000 (   2000 trials):   0.6189   1.9677 [ +217.9%]   1.9330 [ +212.3%]
    2000 (   1000 trials):   0.6187   2.1903 [ +254.0%]   2.1394 [ +245.8%]
    3000 (    666 trials):   0.6274   2.3055 [ +267.5%]   2.2732 [ +262.3%]
    4000 (    500 trials):   0.6295   2.2727 [ +261.0%]   2.0852 [ +231.2%]
    5000 (    400 trials):   0.6329   1.5362 [ +142.7%]   1.4895 [ +135.3%]
   10000 (    200 trials):   0.6548   1.6051 [ +145.1%]   1.5240 [ +132.7%]
   20000 (    100 trials):   0.6340   1.5598 [ +146.0%]   1.5447 [ +143.7%]
   30000 (     66 trials):   0.6267   1.5743 [ +151.2%]   1.5087 [ +140.7%]
   50000 (     40 trials):   0.6127   1.5651 [ +155.5%]   1.2212 [  +99.3%]
  100000 (     20 trials):   0.5629   1.5191 [ +169.9%]   1.1547 [ +105.1%]
  200000 (     10 trials):   0.4781   1.5016 [ +214.1%]   0.5834 [  +22.0%]
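
The bracketed figures appear to be each column's difference from the first column of the same row, in percent. A minimal C sketch of that arithmetic follows; the helper name `relative_percent` is hypothetical and not taken from the attached program.

```c
#include <assert.h>

/* Percentage difference of a measurement from the first-column baseline. */
static double relative_percent(double baseline, double measured)
{
    assert(baseline != 0.0);
    return (measured / baseline - 1.0) * 100.0;
}
```

For example, the 200-word row has 0.4718 in the first column and 0.9779 in the second, and `relative_percent(0.4718, 0.9779)` comes out at roughly +107.3, matching the bracket printed there.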

* (time-copy-dst-aligned/src-aligned)

Columns: byte-copy-loop, REPLACE, 'REPNE MOVSQ', memmove, memcpy

       0 (2000000 trials):   0.0109   0.0105 [   -4.1%]   0.0106 [   -2.8%]   0.0095 [  -13.2%]   0.0095 [  -12.7%]
       1 (2000000 trials):   0.0106   0.0101 [   -4.4%]   0.0097 [   -7.8%]   0.0095 [  -10.2%]   0.0095 [   -9.9%]
       2 (1000000 trials):   0.0211   0.0202 [   -4.1%]   0.0185 [  -12.1%]   0.0190 [   -9.8%]   0.0191 [   -9.3%]
       5 ( 400000 trials):   0.0488   0.0508 [   +4.1%]   0.0369 [  -24.4%]   0.0472 [   -3.3%]   0.0476 [   -2.5%]
      10 ( 200000 trials):   0.0837   0.0935 [  +11.7%]   0.0738 [  -11.8%]   0.0894 [   +6.9%]   0.0881 [   +5.2%]
      20 ( 100000 trials):   0.1462   0.1874 [  +28.2%]   0.1347 [   -7.8%]   0.1759 [  +20.3%]   0.1770 [  +21.1%]
      30 (  66666 trials):   0.1841   0.2699 [  +46.6%]   0.1767 [   -4.1%]   0.2471 [  +34.2%]   0.2516 [  +36.6%]
      40 (  50000 trials):   0.1958   0.3655 [  +86.7%]   0.3238 [  +65.4%]   0.3199 [  +63.4%]   0.3236 [  +65.3%]
      60 (  33333 trials):   0.2531   0.5322 [ +110.3%]   0.4072 [  +60.9%]   0.4788 [  +89.2%]   0.4798 [  +89.5%]
      80 (  25000 trials):   0.2769   0.6853 [ +147.5%]   0.6507 [ +135.0%]   0.5822 [ +110.3%]   0.5819 [ +110.1%]
     100 (  20000 trials):   0.2836   0.7596 [ +167.8%]   0.6223 [ +119.4%]   0.7421 [ +161.6%]   0.7329 [ +158.4%]
     200 (  10000 trials):   0.3614   1.3499 [ +273.6%]   1.4721 [ +307.4%]   1.6335 [ +352.0%]   1.6787 [ +364.6%]
     400 (   5000 trials):   0.4198   1.9705 [ +369.4%]   2.7673 [ +559.2%]   2.9597 [ +605.0%]   3.1549 [ +651.5%]
     600 (   3333 trials):   0.4436   2.3436 [ +428.3%]   3.7633 [ +748.3%]   4.4014 [ +892.1%]   4.4035 [ +892.6%]
     800 (   2500 trials):   0.4470   2.6060 [ +483.0%]   4.6477 [ +939.7%]   5.1759 [+1057.8%]   5.4848 [+1127.0%]
    1000 (   2000 trials):   0.4252   2.6758 [ +529.4%]   5.4204 [+1174.9%]   6.0197 [+1315.9%]   5.8429 [+1274.3%]
    2000 (   1000 trials):   0.4554   3.1504 [ +591.8%]   8.7826 [+1828.6%]   9.8783 [+2069.2%]  10.0350 [+2103.6%]
    3000 (    666 trials):   0.4545   3.2838 [ +622.6%]  11.1122 [+2345.2%]  11.0177 [+2324.4%]  11.0976 [+2342.0%]
    4000 (    500 trials):   0.4271   3.1543 [ +638.5%]  11.7782 [+2657.7%]  11.6083 [+2618.0%]  11.9908 [+2707.5%]
    5000 (    400 trials):   0.4275   3.1793 [ +643.7%]  12.5323 [+2831.6%]  13.2266 [+2994.0%]  13.1570 [+2977.7%]
   10000 (    200 trials):   0.4333   3.4169 [ +688.6%]  16.2177 [+3643.0%]  14.5996 [+3269.6%]  16.6041 [+3732.2%]
   20000 (    100 trials):   0.4544   3.6239 [ +697.5%]  16.0893 [+3440.8%]  10.2115 [+2147.3%]  10.1817 [+2140.7%]
   30000 (     66 trials):   0.4670   3.7075 [ +693.9%]  13.8502 [+2865.9%]   9.9616 [+2033.1%]   9.8806 [+2015.8%]
   50000 (     40 trials):   0.4614   3.7564 [ +714.1%]  12.6669 [+2645.1%]  10.3781 [+2149.1%]  10.4932 [+2174.1%]
  100000 (     20 trials):   0.4597   3.7228 [ +709.8%]  11.9788 [+2505.7%]  10.9012 [+2271.3%]  10.9030 [+2271.7%]
  200000 (     10 trials):   0.4816   3.8436 [ +698.2%]   9.9410 [+1964.4%]   7.4553 [+1448.2%]   6.9573 [+1344.8%]