I would like to see an AVX mode variable which when set converts every suitable SSE command to the corresponding AVX variant. It is almost always simply prepending the letter "v" to these commands, e.g. "movaps xmm1,xmm2" => "vmovaps xmm1,xmm2".
While prepending "V" (and replicating operands as needed)
seemingly does the trick, you really are looking at 2 different
operations: the 128-bit legacy SSE ops retain upper bits, but
the 128-bit VEX ops clear them.
That said, the assembler itself should implement the ISA. By
contrast, convenience features like the one you prefer should
be handled by optional macros.
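The difference described above can be illustrated with a short sketch. It is not taken from the thread; the labels legacy_sse and vex_128 are hypothetical and only serve to separate the two variants.

```nasm
; Sketch only: the two labels are hypothetical names.
bits 64
section .text

legacy_sse:
    movaps  xmm1, xmm2        ; writes bits 127:0 of ymm1, bits 255:128 are kept
    addps   xmm0, xmm1        ; two operands: xmm0 = xmm0 + xmm1
    ret

vex_128:
    vmovaps xmm1, xmm2        ; writes bits 127:0 of ymm1, bits 255:128 are zeroed
    vaddps  xmm0, xmm0, xmm1  ; three operands: destination repeated as first source
    ret
```

The second routine shows what "replicating operands as needed" means: a two-operand SSE instruction becomes a three-operand VEX instruction whose destination is also its first source.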
Yes, the AVX commands clear the upper bits. However this is irrelevant if only the lower 128 bits are used.
I have some procedures coded as macros which can optionally create AVX commands and use the AVX features (essentially the extra target operand) only at some places.
If I have understood Agner Fog's comments about AVX states (http://www.agner.org/optimize/optimizing_assembly.pdf, chapter 13.6) correctly it makes sense to encode a routine completely in AVX or completely in SSE in order to avoid potentially costly state changes (even if called from state B).
Using the ymm registers (the high bits) is currently only meaningful for floats; I end all such procedures with vzeroupper (Linux / Win32 / Win64) or vzeroall (Linux / Win32) in order to avoid state C.
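A minimal sketch of such a procedure, assuming the System V AMD64 calling convention and using the VZEROUPPER instruction to clear the upper halves of the ymm registers before returning. The routine name add8f and its signature are hypothetical.

```nasm
; Sketch only: add8f is a hypothetical routine.
; void add8f(float *dst, const float *src)   rdi = dst, rsi = src
bits 64
section .text
global add8f

add8f:
    vmovups ymm0, [rdi]        ; load 8 floats from dst
    vaddps  ymm0, ymm0, [rsi]  ; add 8 floats from src
    vmovups [rdi], ymm0        ; store the 8 sums back to dst
    vzeroupper                 ; zero bits 255:128 of all ymm registers
    ret
```

VZEROALL would also zero the low halves, including registers that the Win64 convention treats as callee-saved (xmm6 to xmm15), which is presumably why it is listed for Linux and Win32 only.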
In the meantime I have helped myself by prepending a define "avx" just before every SSE command. This is defined empty (for SSE) or "v %+" (for AVX). This kludge works but needs "unnecessary" work and I simply would like to get rid of this.
The proposed mode variable could take the values strict SSE, strict AVX and "best of SSE and AVX" (shortest command).
I would agree this is probably best implemented as a macro package. If someone, like yourself, develops (and maintains!) one I would be willing to consider adding it as one of the builtin macro packages.
Well, I have collected all the affected commands and prepared a macro.
It's usable but I'm not too happy with it; I still would prefer to simply write movdqu instead of ?movdqu.
Just overriding the built-in movdqu with a define might also be possible for most cases but then we have problems with mmx-commands and e.g. movsd which can also be a string-move-command. So this approach is sure to fail sooner or later.
Here is the macro:
%macro assign_avx 1
  ; assign is_avx and the macros avx and ?xxx
  ; avx can be used to prepend sse commands such as "avx movaps xmm1,xmm2"
  ; is_avx can be used to check the avx "state"
  %if %1<>0
    %define avx v %+
    %assign is_avx 1
  %else
    %define avx
    %assign is_avx 0
  %endif
  %idefine ?addpd avx addpd
  ; ... (analogous line for each of the other commands, see below) <<<
%endmacro
Intended usage: Prepend the code (i.e. everything or one proc) with
assign_avx x
where x is 0 or 1 and for safety append it with
assign_avx 0
For all sse commands use the variants with the prepended "?".
It makes sense to pre-initialize the avx mode to 0 or with e.g.
%ifdef use_avx
assign_avx 1
%else
assign_avx 0
%endif
where use_avx can be set e.g. with the nasm command line.
Example usage (produces the 2 procs avgb$sse2 and avgb$avx):
%macro __avgb 2
  assign_avx %1 ; use avx
  ; %2 ; name decorator
  global avgb %+ %2
avgb %+ %2:
  ?movdqu xmm0,[eax]
  ?movdqu xmm1,[edx]
  ?pavgb xmm0,xmm1
  ?movdqu [eax],xmm0
  ret
  assign_avx 0 ; reset for safety
%endmacro
;
__avgb 0,$sse2
__avgb 1,$avx
The program which uses the thereby created external routines can then select the appropriate one by checking the name decorators (e.g. "$avx").
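One way the calling program might make that selection at run time is sketched below. This is an assumption about the caller, not code from the thread: the names avgb_ptr and init_avgb are hypothetical, and the check relies on CPUID leaf 1 (ECX bit 27 = OSXSAVE, bit 28 = AVX) plus XGETBV to confirm that the operating system saves the ymm state.

```nasm
; Sketch only: avgb_ptr and init_avgb are hypothetical names.
bits 32
extern avgb$sse2
extern avgb$avx

section .data
avgb_ptr:   dd avgb$sse2            ; default to the SSE2 variant

section .text
global init_avgb
init_avgb:
    push ebx                        ; cpuid clobbers ebx
    mov  eax, 1
    cpuid
    and  ecx, (1<<27)|(1<<28)       ; OSXSAVE and AVX feature bits
    cmp  ecx, (1<<27)|(1<<28)
    jne  .done
    xor  ecx, ecx
    xgetbv                          ; edx:eax = XCR0
    and  eax, 6                     ; bit 1 = SSE state, bit 2 = AVX state
    cmp  eax, 6
    jne  .done
    mov  dword [avgb_ptr], avgb$avx ; switch to the AVX variant
.done:
    pop  ebx
    ret
```

After init_avgb has run once, the caller would load eax and edx as the routines expect and use "call [avgb_ptr]".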
Here are the affected commands:
ADDPD
ADDPS
ADDSD
ADDSS
ADDSUBPD
ADDSUBPS
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
ANDNPD
ANDNPS
ANDPD
ANDPS
BLENDPD
BLENDPS
BLENDVPD
BLENDVPS
CMPEQPD
CMPEQPS
CMPEQSD
CMPEQSS
CMPLEPD
CMPLEPS
CMPLESD
CMPLESS
CMPLTPD
CMPLTPS
CMPLTSD
CMPLTSS
CMPNEQPD
CMPNEQPS
CMPNEQSD
CMPNEQSS
CMPNLEPD
CMPNLEPS
CMPNLESD
CMPNLESS
CMPNLTPD
CMPNLTPS
CMPNLTSD
CMPNLTSS
CMPORDPD
CMPORDPS
CMPORDSD
CMPORDSS
CMPPD
CMPPS
CMPSD
CMPSS
CMPUNORDPD
CMPUNORDPS
CMPUNORDSD
CMPUNORDSS
COMISD
COMISS
CVTDQ2PD
CVTDQ2PS
CVTPD2DQ
CVTPD2PS
CVTPS2DQ
CVTPS2PD
CVTSD2SI
CVTSD2SS
CVTSI2SD
CVTSI2SS
CVTSS2SD
CVTSS2SI
CVTTPD2DQ
CVTTPS2DQ
CVTTSD2SI
CVTTSS2SI
DIVPD
DIVPS
DIVSD
DIVSS
DPPD
DPPS
EXTRACTPS
HADDPD
HADDPS
HSUBPD
HSUBPS
INSERTPS
LDDQU
LDMXCSR
MASKMOVDQU
MAXPD
MAXPS
MAXSD
MAXSS
MINPD
MINPS
MINSD
MINSS
MOVAPD
MOVAPS
MOVD
MOVDDUP
MOVDQA
MOVDQU
MOVHLPS
MOVHPD
MOVHPS
MOVLHPS
MOVLPD
MOVLPS
MOVMSKPD
MOVMSKPS
MOVNTDQ
MOVNTDQA
MOVNTPD
MOVNTPS
MOVQ
MOVSD
MOVSHDUP
MOVSLDUP
MOVSS
MOVUPD
MOVUPS
MPSADBW
MULPD
MULPS
MULSD
MULSS
ORPD
ORPS
PABSB
PABSD
PABSW
PACKSSDW
PACKSSWB
PACKUSDW
PACKUSWB
PADDB
PADDD
PADDQ
PADDSB
PADDSW
PADDUSB
PADDUSW
PADDW
PALIGNR
PAND
PANDN
PAVGB
PAVGW
PBLENDVB
PBLENDW
PCLMULHQHQDQ
PCLMULHQLQDQ
PCLMULLQHQDQ
PCLMULLQLQDQ
PCLMULQDQ
PCMPEQB
PCMPEQD
PCMPEQQ
PCMPEQW
PCMPESTRI
PCMPESTRM
PCMPGTB
PCMPGTD
PCMPGTQ
PCMPGTW
PCMPISTRI
PCMPISTRM
PEXTRB
PEXTRD
PEXTRQ
PEXTRW
PHADDD
PHADDSW
PHADDW
PHMINPOSUW
PHSUBD
PHSUBSW
PHSUBW
PINSRB
PINSRD
PINSRQ
PINSRW
PMADDUBSW
PMADDWD
PMAXSB
PMAXSD
PMAXSW
PMAXUB
PMAXUD
PMAXUW
PMINSB
PMINSD
PMINSW
PMINUB
PMINUD
PMINUW
PMOVMSKB
PMOVSXBD
PMOVSXBQ
PMOVSXBW
PMOVSXDQ
PMOVSXWD
PMOVSXWQ
PMOVZXBD
PMOVZXBQ
PMOVZXBW
PMOVZXDQ
PMOVZXWD
PMOVZXWQ
PMULDQ
PMULHRSW
PMULHUW
PMULHW
PMULLD
PMULLW
PMULUDQ
POR
PSADBW
PSHUFB
PSHUFD
PSHUFHW
PSHUFLW
PSIGNB
PSIGND
PSIGNW
PSLLD
PSLLDQ
PSLLQ
PSLLW
PSRAD
PSRAW
PSRLD
PSRLDQ
PSRLQ
PSRLW
PSUBB
PSUBD
PSUBQ
PSUBSB
PSUBSW
PSUBUSB
PSUBUSW
PSUBW
PTEST
PUNPCKHBW
PUNPCKHDQ
PUNPCKHQDQ
PUNPCKHWD
PUNPCKLBW
PUNPCKLDQ
PUNPCKLQDQ
PUNPCKLWD
PXOR
RCPPS
RCPSS
ROUNDPD
ROUNDPS
ROUNDSD
ROUNDSS
RSQRTPS
RSQRTSS
SHUFPD
SHUFPS
SQRTPD
SQRTPS
SQRTSD
SQRTSS
STMXCSR
SUBPD
SUBPS
SUBSD
SUBSS
UCOMISD
UCOMISS
UNPCKHPD
UNPCKHPS
UNPCKLPD
UNPCKLPS
XORPD
XORPS
; -- end of list --
Some further notes:
The "?xxx" macros are simply abbreviations of "avx xxx" and could be eliminated.
A mode which uses the shortest possible command, i.e. SSE or VEX coded, is quite complicated (if implementable at all) when done with macros, whereas the assembler already knows which command is the shortest.
The user must always pay attention to not forget the ? or avx prefix.
> I still would prefer to simply write movdqu instead of ?movdqu.
Attached find source code for how to do that.
Since NASM doesn't support %REPTOK (see SF #1842438), I had
to use nested macros, with a particular trick -- see code comment.
Your solution (2011-01-11) works most of the time. Thanks a lot!
I have appended the lists for the different argument counts (without the optional extra parameter for avx for some but not all commands).
I still have problems with the command movsd, which obviously needs another macro without parameters since movsd is also a string command. If I define it, "movsd" works but e.g. "rep movsd" does not.
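The ambiguity can be seen in a short sketch with hypothetical labels: the same mnemonic is a string instruction when it has no operands and a scalar double-precision move when it has two.

```nasm
; Sketch only: the labels are hypothetical names.
bits 32
section .text

copy_dwords:
    cld
    rep movsd              ; string move: copies ecx dwords from [esi] to [edi]
    ret

load_double:
    movsd  xmm0, [esi]     ; SSE2: load one double into the low half of xmm0
    vmovsd xmm0, [esi]     ; AVX counterpart of the line above
    ret
```

Only the two-operand form has an AVX counterpart, so a macro that overrides movsd has to tell the forms apart by operand count and must leave the string form untouched.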
In avx mode the following commands will not compile if used on mmx registers:
movd
movq
pabsb
pabsd
pabsw
packssdw
packsswb
packuswb
paddb
paddd
paddq
paddsb
paddsw
paddusb
paddusw
paddw
palignr
pand
pandn
pavgb
pavgw
pcmpeqb
pcmpeqd
pcmpeqw
pcmpgtb
pcmpgtd
pcmpgtw
pextrw
phaddd
phaddsw
phaddw
phsubd
phsubsw
phsubw
pinsrw
pmaddubsw
pmaddwd
pmaxsw
pmaxub
pminsw
pminub
pmovmskb
pmulhrsw
pmulhuw
pmulhw
pmullw
pmuludq
por
psadbw
pshufb
psignb
psignd
psignw
pslld
psllq
psllw
psrad
psraw
psrld
psrlq
psrlw
psubb
psubd
psubq
psubsb
psubsw
psubusb
psubusw
psubw
punpckhbw
punpckhdq
punpckhwd
punpcklbw
punpckldq
punpcklwd
pxor
; end of list
Here are 1-op commands:
ldmxcsr
stmxcsr
; end of list
Here are 2-op commands:
addpd
addps
addsd
addss
addsubpd
addsubps
aesdec
aesdeclast
aesenc
aesenclast
aesimc
andnpd
andnps
andpd
andps
cmpeqpd
cmpeqps
cmpeqsd
cmpeqss
cmplepd
cmpleps
cmplesd
cmpless
cmpltpd
cmpltps
cmpltsd
cmpltss
cmpneqpd
cmpneqps
cmpneqsd
cmpneqss
cmpnlepd
cmpnleps
cmpnlesd
cmpnless
cmpnltpd
cmpnltps
cmpnltsd
cmpnltss
cmpordpd
cmpordps
cmpordsd
cmpordss
cmpunordpd
cmpunordps
cmpunordsd
cmpunordss
comisd
comiss
cvtdq2pd
cvtdq2ps
cvtpd2dq
cvtpd2ps
cvtps2dq
cvtps2pd
cvtsd2si
cvtsd2ss
cvtsi2sd
cvtsi2ss
cvtss2sd
cvtss2si
cvttpd2dq
cvttps2dq
cvttsd2si
cvttss2si
divpd
divps
divsd
divss
haddpd
haddps
hsubpd
hsubps
lddqu
maskmovdqu
maxpd
maxps
maxsd
maxss
minpd
minps
minsd
minss
movapd
movaps
movd
movddup
movdqa
movdqu
movhlps
movhpd
movhps
movlhps
movlpd
movlps
movmskpd
movmskps
movntdq
movntdqa
movntpd
movntps
movq
movsd
movshdup
movsldup
movss
movupd
movups
mulpd
mulps
mulsd
mulss
orpd
orps
pabsb
pabsd
pabsw
packssdw
packsswb
packusdw
packuswb
paddb
paddd
paddq
paddsb
paddsw
paddusb
paddusw
paddw
pand
pandn
pavgb
pavgw
pclmulhqhqdq
pclmulhqlqdq
pclmullqhqdq
pclmullqlqdq
pcmpeqb
pcmpeqd
pcmpeqq
pcmpeqw
pcmpgtb
pcmpgtd
pcmpgtq
pcmpgtw
phaddd
phaddsw
phaddw
phminposuw
phsubd
phsubsw
phsubw
pmaddubsw
pmaddwd
pmaxsb
pmaxsd
pmaxsw
pmaxub
pmaxud
pmaxuw
pminsb
pminsd
pminsw
pminub
pminud
pminuw
pmovmskb
pmovsxbd
pmovsxbq
pmovsxbw
pmovsxdq
pmovsxwd
pmovsxwq
pmovzxbd
pmovzxbq
pmovzxbw
pmovzxdq
pmovzxwd
pmovzxwq
pmuldq
pmulhrsw
pmulhuw
pmulhw
pmulld
pmullw
pmuludq
por
psadbw
pshufb
psignb
psignd
psignw
pslld
pslldq
psllq
psllw
psrad
psraw
psrld
psrldq
psrlq
psrlw
psubb
psubd
psubq
psubsb
psubsw
psubusb
psubusw
psubw
ptest
punpckhbw
punpckhdq
punpckhqdq
punpckhwd
punpcklbw
punpckldq
punpcklqdq
punpcklwd
pxor
rcpps
rcpss
rsqrtps
rsqrtss
sqrtpd
sqrtps
sqrtsd
sqrtss
subpd
subps
subsd
subss
ucomisd
ucomiss
unpckhpd
unpckhps
unpcklpd
unpcklps
xorpd
xorps
; end of list
Here are 3-op commands:
aeskeygenassist
blendpd
blendps
blendvpd
blendvps
cmppd
cmpps
cmpsd
cmpss
dppd
dpps
extractps
insertps
mpsadbw
palignr
pblendvb
pblendw
pclmulqdq
pcmpestri
pcmpestrm
pcmpistri
pcmpistrm
pextrb
pextrd
pextrq
pextrw
pinsrb
pinsrd
pinsrq
pinsrw
pshufd
pshufhw
pshuflw
roundpd
roundps
roundsd
roundss
shufpd
shufps
; end of list
> movsd
It should be possible to handle this with a 0-n arg mmac.
> In avx mode the following commands will not compile if used on mmx
Only until x86 introduces VEX encoding and 256-bit support for MMX. :)
>> movsd
> It should be possible to handle this with a 0-n arg mmac.
As I wrote, yes. But "rep movsd" still does not work.
>> In avx mode the following commands will not compile if used on mmx
> Only until x86 introduces VEX encoding and 256-bit support for MMX. :)
VEX encoding would be enough...
The problem arises e.g. when using MMX registers for temporaries (such as esp on Win32).
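A short sketch of why the MMX forms fail in AVX mode: the mnemonics in the list above have a VEX-encoded form for xmm registers only, so a blindly prefixed "v" yields an instruction that does not exist. The label mmx_vs_avx is hypothetical.

```nasm
; Sketch only: mmx_vs_avx is a hypothetical label.
bits 32
section .text

mmx_vs_avx:
    paddb   mm0, mm1           ; MMX form, legacy encoding only
    paddb   xmm0, xmm1         ; SSE2 form of the same mnemonic
    vpaddb  xmm0, xmm0, xmm1   ; AVX form, VEX encoded
    ; vpaddb mm0, mm1          ; no VEX form exists for MMX registers,
    ;                          ; so the assembler rejects this line
    emms                       ; leave MMX state before any x87 code runs
    ret
```

A macro package therefore has to inspect the operands and emit the unprefixed instruction whenever an MMX register is involved.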
Although the macro stuff is almost what is needed, I still think it might be a good idea to embed this in the assembler, at least to get the size optimized variant (vex or sse coding). This would also solve the problems mentioned above.