|
From: Jojo R <rj...@li...> - 2023-04-21 10:06:23
|
Hi,
We are considering adding the RVV/Vector [1] feature to Valgrind, and
there are some challenges.
RVV follows a programming model like ARM's SVE [2]: it is scalable/VLA,
meaning code is vector-length agnostic.
ARM's SVE is not supported in valgrind :(
There are three major issues in implementing the RVV instruction set in
Valgrind:
1. Scalable vector register width VLENB
2. Runtime changing property of LMUL and SEW
3. Lack of proper VEX IR to represent all vector operations
We propose workable methods to solve 1 and 2. For 3, we explore several
possible, though perhaps imperfect, approaches to handle the different cases.
We start with 1. Since every guest register must be described in the
VEXGuestState struct, the vector registers, whose width scales with VLENB,
can be added to VEXGuestState as arrays of an allowed maximum length such
as 2048 or 4096 bits.
The actually accessible range can then be determined at Valgrind startup,
by querying the CPU for its vector capability or through some suitable
setup step.
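For illustration, such a layout could look like the sketch below; the field
names, the 2048-bit cap and the exact placement are assumptions for this
example, not a finalized design:

/* Sketch: max-sized backing store for the scalable vector state.
   All names and the 2048-bit cap are illustrative assumptions. */
#define RVV_VLEN_MAX   2048                 /* bits; build-time upper bound */
#define RVV_VLENB_MAX  (RVV_VLEN_MAX / 8)   /* bytes per vector register */

typedef struct {
   /* ... existing x0..x31, f0..f31, pc, CSRs ... */
   UChar guest_v[32][RVV_VLENB_MAX];  /* v0..v31; only the first VLENB
                                         bytes are live on a given CPU */
   ULong guest_vl;                    /* vl CSR */
   ULong guest_vtype;                 /* vtype CSR (holds SEW/LMUL) */
   ULong guest_vstart;                /* vstart CSR */
} VexGuestRISCV64StateWithV;          /* hypothetical extended guest state */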
To solve problem 2, we take inspiration from proven techniques in QEMU,
where translation blocks are broken up when certain critical CSRs are set.
Because the guest-code-to-IR translation relies on the precise values of
LMUL/SEW, and these may change within a basic block, we can break up the
basic block each time a vsetvl{i} instruction is encountered and return to
the scheduler to execute the translated code and update LMUL/SEW.
Accordingly, translation cache management should be refactored to detect
changes of LMUL/SEW and invalidate outdated cached code. Without loss of
generality, LMUL/SEW should be encoded into a ULong flag, so that other
architectures can reuse this flag to store their own arch-dependent
information. The TTentry struct should also take the flag into account on
both insertion and deletion. The flag then carries the current LMUL/SEW
throughout the simulation and can be passed to the disassembly functions
via the VEXArchInfo struct, giving the translation the real, up-to-date
values of LMUL and SEW.
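A minimal sketch of such a flag, assuming the vsew/vlmul fields are taken
directly from vtype (the names and bit layout are invented for
illustration):

/* Sketch: pack the SEW/LMUL that a translation was generated under into a
   single ULong, so the translation table can key on (guest IP, flag). */
#define TRANSFLAG_VSEW_SHIFT   0   /* 3 bits: vsew field of vtype */
#define TRANSFLAG_VLMUL_SHIFT  3   /* 3 bits: vlmul field of vtype */

static inline ULong mk_trans_flag ( UInt vsew, UInt vlmul )
{
   return ((ULong)(vsew  & 0x7) << TRANSFLAG_VSEW_SHIFT)
        | ((ULong)(vlmul & 0x7) << TRANSFLAG_VLMUL_SHIFT);
}

/* A cached translation may only be reused when both the guest address and
   the flag match; otherwise it was generated under a different SEW/LMUL
   and must be re-translated. */
static inline Bool trans_usable ( Addr ip, ULong flag,
                                  Addr entry_ip, ULong entry_flag )
{
   return ip == entry_ip && flag == entry_flag;
}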
Some architecture-specific code also needs care. In the m_dispatch part,
for example, the disp_cp_xindir function looks up the code cache in
hand-written assembly, checking only the requested guest IP against the
translation cache entry address with no further constraints. Many other
modules should be checked to ensure that an update of LMUL/SEW becomes
immediately visible to all essential parts of Valgrind.
The last remaining big issue is 3, for which we introduce some ad-hoc
approaches. We group them into three types:
1. Break down a vector instruction to scalar VEX IR ops.
2. Break down a vector instruction to fixed-length VEX IR ops.
3. Use dirty helpers to realize vector instructions.
The first method exists in theory but is probably not applicable, as the
number of IR ops explodes when a large VLENB is adopted. Imagine a
configuration of VLENB=512, SEW=8, LMUL=8: the VL is 512 * 8 / 8 = 512,
meaning a single vector instruction turns into 512 scalar operations, and
each scalar operation is further expanded into multiple IR statements. To
make things worse, tool instrumentation inserts yet more IR between
adjacent scalar ops. As a result, performance is likely to degrade by a
factor of thousands when running a real-world application with many vector
instructions. The other two methods are therefore more promising, and we
discuss them below.
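To make the blow-up concrete, here is a sketch of what method 1 amounts to
for vadd.vv at SEW=8, written against the public IR constructors of
libvex_ir.h; OFFSET_V8, giving the guest-state offset of one byte lane, is
an assumed helper:

/* Sketch: naive scalar expansion of vadd.vv vd, vs2, vs1 at SEW=8.
   With vl=512 this emits over a thousand IR statements for one guest
   instruction, before any tool instrumentation is added on top. */
static void expand_vadd_scalar ( IRSB* irsb, Int vd, Int vs2, Int vs1,
                                 Int vl )
{
   Int i;
   for (i = 0; i < vl; i++) {
      IRTemp t = newIRTemp(irsb->tyenv, Ity_I8);
      addStmtToIRSB(irsb, IRStmt_WrTmp(t,
         IRExpr_Binop(Iop_Add8,
                      IRExpr_Get(OFFSET_V8(vs2, i), Ity_I8),
                      IRExpr_Get(OFFSET_V8(vs1, i), Ity_I8))));
      addStmtToIRSB(irsb, IRStmt_Put(OFFSET_V8(vd, i), IRExpr_RdTmp(t)));
   }
}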
Methods 2 and 3 are not mutually exclusive: we may pick whichever suits a
given vector instruction, based on its concrete behavior. To explain the
methods in detail, we present some instances illustrating their pros and
cons.
For method 2, we have the real values of VLENB/LMUL/SEW. The simple case
is VLENB <= 256 and LMUL=1, where many existing SIMD IR ops are available
and can be applied directly to represent vector operations. However, even
when VLENB is restricted to 128, the effective width still exceeds the
maximum SIMD width of 256 supported by VEX IR whenever LMUL>2. Hence, two
variants of method 2 for dealing with long vectors:
*2.1* Add more SIMD IR ops, such as 1024/2048/4096-bit variants, and
translate vector instructions at the granularity of VLENB. For example,
VLENB=4096 with LMUL=2 is fulfilled by two 4096-bit SIMD VEX IR ops.
* *pros*: it encourages the VEX backend to generate more compact and
  efficient SIMD code (maybe). In particular, it accommodates mask and
  gather/scatter (indexed) instructions by carrying more information in
  the IR itself.
* *cons*: too many new IR ops would need to be introduced into VEX, since
  each op at each length needs its own add/sub/mul variants. New data
  types for long vectors would be necessary too, causing difficulties in
  both VEX backend register allocation and tool instrumentation.
*2.2* Break long vectors down into multiple repeated SIMD ops. For
instance, a vadd.vv instruction with VLENB=256/LMUL=2/SEW=8 is composed of
four Iop_Add8x16 operations (see the sketch after the list below).
* *pros:* less effort is required in register allocation and tool
  instrumentation. The VEX frontend can prompt the backend to generate
  efficient vector instructions through existing Iops. It better trades
  off the complexity of adding many long-vector IR ops against the
  benefit of generating high-efficiency host code.
* *cons:* it is hard to describe a mask operation, given that the mask is
  quite flexible (the least significant bit of each segment of v0).
  Gather/scatter instructions pose similar problems in dividing index
  registers appropriately. Various corner cases remain, such as widening
  arithmetic operations (the current widening SIMD IR ops are not
  compatible) and the vstart CSR. When composing a vector instruction
  from fixed-length IR ops, we inevitably have to tell each IR op the
  position, encoded in vstart, at which it may start processing data. We
  could treat vstart as a normal guest state register and compute each
  op's start position as a guard IRExpr, or obtain the value of vstart
  the way we do for LMUL/SEW. Either way, it is non-trivial to decompose
  a vector instruction concisely.
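For reference, the sketch promised above: the 2.2 decomposition of vadd.vv
at SEW=8, ignoring mask and vstart for clarity. OFFSET_V128, the
guest-state offset of one 128-bit chunk of the LMUL-grouped register, is
an assumed helper:

/* Sketch: method 2.2 for vadd.vv, cutting the LMUL-grouped operand into
   128-bit pieces handled by the existing Iop_Add8x16. With VLENB=256 and
   LMUL=2 this yields exactly the four Iop_Add8x16 ops of the example. */
static void expand_vadd_v128 ( IRSB* irsb, Int vd, Int vs2, Int vs1,
                               Int vlen_bits, Int lmul )
{
   Int c, nChunks = (vlen_bits * lmul) / 128;
   for (c = 0; c < nChunks; c++) {
      IRTemp t = newIRTemp(irsb->tyenv, Ity_V128);
      addStmtToIRSB(irsb, IRStmt_WrTmp(t,
         IRExpr_Binop(Iop_Add8x16,
                      IRExpr_Get(OFFSET_V128(vs2, c), Ity_V128),
                      IRExpr_Get(OFFSET_V128(vs1, c), Ity_V128))));
      addStmtToIRSB(irsb, IRStmt_Put(OFFSET_V128(vd, c), IRExpr_RdTmp(t)));
   }
}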
In short, both 2.1 and 2.2 face a dilemma between keeping the effort of
refactoring Valgrind manageable and implementing the vector instruction
set efficiently. The same obstacles arise for ARM SVE, whose instructions
are likewise scalable and flexible in many ways.
The final option is the dirty helper. It is undoubtedly practical and
probably requires the least engineering effort, given the many details
involved in Valgrind. In this design, each instruction is implemented by
inline assembly that runs the same instruction on the host. Moreover, tool
instrumentation already handles IRDirty, except that new fields would be
needed in the _IRDirty struct to indicate strided/indexed/masked memory
accesses and arithmetic operations.
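A sketch of one such helper for vadd.vv follows. Everything here is an
assumption: the guest-state field names (from the earlier guest-state
sketch), restoring vl/vtype via vsetvl, and the use of v8/v16 as scratch
registers; a whole-register load/store form might be preferable in
practice. It also assumes the compiler accepts RVV inline assembly with
vector-register clobbers:

/* Sketch: dirty helper that executes vadd.vv on the host's own vector
   unit, for the SEW=8 case. st is the saved guest state; vd/vs2/vs1 are
   guest vector register numbers. */
static void rvv_dirtyhelper_vadd_vv ( VexGuestRISCV64StateWithV* st,
                                      ULong vd, ULong vs2, ULong vs1 )
{
   __asm__ __volatile__(
      "vsetvl  zero, %0, %1 \n\t"   /* restore the guest's vl/vtype */
      "vle8.v  v8,  (%2)    \n\t"   /* load vs2 bytes from guest state */
      "vle8.v  v16, (%3)    \n\t"   /* load vs1 bytes from guest state */
      "vadd.vv v8, v8, v16  \n\t"   /* the real instruction */
      "vse8.v  v8,  (%4)    \n\t"   /* store the result back into vd */
      : /* no outputs */
      : "r"(st->guest_vl), "r"(st->guest_vtype),
        "r"(&st->guest_v[vs2][0]), "r"(&st->guest_v[vs1][0]),
        "r"(&st->guest_v[vd][0])
      : "memory", "v8", "v16" );
}

In the frontend, such a helper would be wired up through something like
unsafeIRDirty_0_N(), with fxState entries describing which guest vector
bytes are read and written, so tools at least observe the guest-state
effects.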
* *pros:* it supports all instructions without the trouble of building
  complicated IR expressions and statements. It executes vector
  instructions on the host CPU, which gives some acceleration. Besides,
  no VEX backend work is needed to translate new IR ops into vector
  instructions.
* *cons:* a dirty helper keeps its operations in a black box, so tools
  can never see what happens inside it. In Memcheck, for example, the
  bit-precision merit is lost at a dirty helper, because the V-bit
  propagation chain falls back to a rather coarse strategy. It is also
  inelegant to implement an entire ISA extension in dirty helpers.
In summary, we are still far from a truly applicable solution for adding
vector extensions to Valgrind. We need detailed and comprehensive
evaluation across the different vector instruction categories.
Any feedback is welcome, on GitHub [3] as well.
[1] https://github.com/riscv/riscv-v-spec
[2]
https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
[3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
Thanks.
Jojo
|
|
From: Jojo R <rj...@gm...> - 2023-05-22 11:46:42
|
Hi,
Any feedback or suggestions about this RFC?
|
|
From: Petr P. <pet...@da...> - 2023-05-27 17:25:57
|
On 21. Apr 23 17:25, Jojo R wrote:
> We consider to add RVV/Vector [1] feature in valgrind, there are some
> challenges.
> RVV like ARM's SVE [2] programming model, it's scalable/VLA, that means the
> vector length is agnostic.
> ARM's SVE is not supported in valgrind :(
>
> There are three major issues in implementing RVV instruction set in Valgrind
> as following:
>
> 1. Scalable vector register width VLENB
> 2. Runtime changing property of LMUL and SEW
> 3. Lack of proper VEX IR to represent all vector operations
>
> We propose applicable methods to solve 1 and 2. As for 3, we explore several
> possible but maybe imperfect approaches to handle different cases.
>
> We start from 1. As each guest register should be described in VEXGuestState
> struct, the vector registers with scalable width of VLENB can be added into
> VEXGuestState as arrays using an allowable maximum length like 2048/4096.
Size of VexGuestRISCV64State is currently 592 bytes. Adding these large
vector registers will bump it by 32*2048/8=8192 bytes.
The baseblock layout in VEX is: the guest state, two equal sized areas
for shadow state and then a spill area. The RISC-V port accesses the
baseblock in generated code via x8/s0. The register is set to the
address of the baseblock+2048 (file
coregrind/m_dispatch/dispatch-riscv64-linux.S). The extra offset is
a small optimization to utilize the fact that load/store instructions in
RVI have a signed offset in range [-2048,2047]. The end result is that
it is possible to access the baseblock data using only a single
instruction.
Adding the new vector registers will mean that more instructions become
necessary. For instance, accessing any shadow guest state would naively
require a LUI+ADDI+LOAD/STORE sequence.
I suspect this could affect performance quite a bit and might need some
optimizing.
>
> The actual available access range can be determined at Valgrind startup time
> by querying the CPU for its vector capability or some suitable setup steps.
Something to consider is that the virtual CPU provided by Valgrind does
not necessarily need to match the host CPU. For instance, VEX could
hardcode that its vector registers are only 128 bits in size.
I was originally hoping that this is how support for the V extension
could be added, but the LMUL grouping looks to break this model.
>
>
> To solve problem 2, we are inspired by already-proven techniques in QEMU,
> where translation blocks are broken up when certain critical CSRs are set.
> Because the guest code to IR translation relies on the precise value of
> LMUL/SEW and they may change within a basic block, we can break up the basic
> block each time encountering a vsetvl{i} instruction and return to the
> scheduler to execute the translated code and update LMUL/SEW. Accordingly,
> translation cache management should be refactored to detect the changing of
> LMUL/SEW to invalidate outdated code cache. Without losing the generality,
> the LMUL/SEW should be encoded into an ULong flag such that other
> architectures can leverage this flag to store their arch-dependent
> information. The TTentry struct should also take the flag into account no
> matter insertion or deletion. By doing this, the flag carries the newest
> LMUL/SEW throughout the simulation and can be passed to disassemble
> functions using the VEXArchInfo struct such that we can get the real and
> newest value of LMUL and SEW to facilitate our translation.
>
> Also, some architecture-related code should be taken care of. Like
> m_dispatch part, disp_cp_xindir function looks up code cache using hardcoded
> assembly by checking the requested guest state IP and translation cache
> entry address with no more constraints. Many other modules should be checked
> to ensure the in-time update of LMUL/SEW is instantly visible to essential
> parts in Valgrind.
>
>
> The last remaining big issue is 3, which we introduce some ad-hoc approaches
> to deal with. We summarize these approaches into three types as following:
>
> 1. Break down a vector instruction to scalar VEX IR ops.
> 2. Break down a vector instruction to fixed-length VEX IR ops.
> 3. Use dirty helpers to realize vector instructions.
I would also look at adding new VEX IR ops for scalable vector
instructions. In particular, if it could be shown that RVV and SVE can
use the same new ops, that would make a good argument for adding them.
Perhaps it would also be interesting if such new scalable vector ops
could represent fixed-width operations on other architectures, but that
is just me thinking out loud.
> [...]
> In summary, it is far to reach a truly applicable solution in adding vector
> extensions in Valgrind. We need to do detailed and comprehensive estimations
> on different vector instruction categories.
>
> Any feedback is welcome in github [3] also.
>
>
> [1] https://github.com/riscv/riscv-v-spec
>
> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>
> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
Sorry for not being more helpful at this point. As mentioned in the
GitHub issue, I still need to get myself more familiar with RVV and how
Valgrind handles vector instructions.
Thanks,
Petr
|
|
From: Wu, F. <fe...@in...> - 2023-05-29 03:29:55
|
On 5/28/2023 1:06 AM, Petr Pavlu wrote:
> On 21. Apr 23 17:25, Jojo R wrote:
>> We consider to add RVV/Vector [1] feature in valgrind, there are some
>> challenges.
>> RVV like ARM's SVE [2] programming model, it's scalable/VLA, that means the
>> vector length is agnostic.
>> ARM's SVE is not supported in valgrind :(
>>
>> There are three major issues in implementing RVV instruction set in Valgrind
>> as following:
>>
>> 1. Scalable vector register width VLENB
>> 2. Runtime changing property of LMUL and SEW
>> 3. Lack of proper VEX IR to represent all vector operations
>>
>> We propose applicable methods to solve 1 and 2. As for 3, we explore several
>> possible but maybe imperfect approaches to handle different cases.
>>
>> We start from 1. As each guest register should be described in VEXGuestState
>> struct, the vector registers with scalable width of VLENB can be added into
>> VEXGuestState as arrays using an allowable maximum length like 2048/4096.
>
> Size of VexGuestRISCV64State is currently 592 bytes. Adding these large
> vector registers will bump it by 32*2048/8=8192 bytes.
>
Yes, that's the reason the vlen is set to 128 in my RFC patches; that's
the largest room for vectors in the current design.
> The baseblock layout in VEX is: the guest state, two equal sized areas
> for shadow state and then a spill area. The RISC-V port accesses the
> baseblock in generated code via x8/s0. The register is set to the
> address of the baseblock+2048 (file
> coregrind/m_dispatch/dispatch-riscv64-linux.S). The extra offset is
> a small optimization to utilize the fact that load/store instructions in
> RVI have a signed offset in range [-2048,2047]. The end result is that
> it is possible to access the baseblock data using only a single
> instruction.
>
Nice design.
> Adding the new vector registers will cause that more instructions will
> be necessary. For instance, accessing any shadow guest state would
> naively require a sequence of LUI+ADDI+LOAD/STORE.
>
> I suspect this could affect performance quite a bit and might need some
> optimizing.
>
Yes. Could we separate the vector registers from the other ones, i.e. is
it possible to use two baseblocks? Or we could do some experiments to
measure the overhead.
>>
>> The actual available access range can be determined at Valgrind startup time
>> by querying the CPU for its vector capability or some suitable setup steps.
>
> Something to consider is that the virtual CPU provided by Valgrind does
> not necessarily need to match the host CPU. For instance, VEX could
> hardcode that its vector registers are only 128 bits in size.
>
> I was originally hoping that this is how support for the V extension
> could be added, but the LMUL grouping looks to break this model.
>
Originally I had the same idea, but 128-vlen hardware cannot run software
built for a larger vlen; e.g. clang has the option
-riscv-v-vector-bits-min, and if it is set to 256, it assumes the
underlying hardware has at least 256 vlen.
>>
>>
>> To solve problem 2, we are inspired by already-proven techniques in QEMU,
>> where translation blocks are broken up when certain critical CSRs are set.
>> Because the guest code to IR translation relies on the precise value of
>> LMUL/SEW and they may change within a basic block, we can break up the basic
>> block each time encountering a vsetvl{i} instruction and return to the
>> scheduler to execute the translated code and update LMUL/SEW. Accordingly,
>> translation cache management should be refactored to detect the changing of
>> LMUL/SEW to invalidate outdated code cache. Without losing the generality,
>> the LMUL/SEW should be encoded into an ULong flag such that other
>> architectures can leverage this flag to store their arch-dependent
>> information. The TTentry struct should also take the flag into account no
>> matter insertion or deletion. By doing this, the flag carries the newest
>> LMUL/SEW throughout the simulation and can be passed to disassemble
>> functions using the VEXArchInfo struct such that we can get the real and
>> newest value of LMUL and SEW to facilitate our translation.
>>
>> Also, some architecture-related code should be taken care of. Like
>> m_dispatch part, disp_cp_xindir function looks up code cache using hardcoded
>> assembly by checking the requested guest state IP and translation cache
>> entry address with no more constraints. Many other modules should be checked
>> to ensure the in-time update of LMUL/SEW is instantly visible to essential
>> parts in Valgrind.
>>
>>
>> The last remaining big issue is 3, which we introduce some ad-hoc approaches
>> to deal with. We summarize these approaches into three types as following:
>>
>> 1. Break down a vector instruction to scalar VEX IR ops.
>> 2. Break down a vector instruction to fixed-length VEX IR ops.
>> 3. Use dirty helpers to realize vector instructions.
>
> I would also look at adding new VEX IR ops for scalable vector
> instructions. In particular, if it could be shown that RVV and SVE can
> use same new ops then it could make a good argument for adding them.
>
> Perhaps interesting is if such new scalable vector ops could also
> represent fixed operations on other architectures, but that is just me
> thinking out loud.
>
It's a good idea to consolidate all vector/SIMD handling together; the
challenge is to verify its feasibility and to speed up the adoption, as
it is expected to take more effort and a longer time. Is there anyone
with knowledge or experience of other ISAs such as AVX/SVE on Valgrind
who can share the pain and gain? Or we could do a quick prototype.
Thanks,
Fei.
>> [...]
>> In summary, it is far to reach a truly applicable solution in adding vector
>> extensions in Valgrind. We need to do detailed and comprehensive estimations
>> on different vector instruction categories.
>>
>> Any feedback is welcome in github [3] also.
>>
>>
>> [1] https://github.com/riscv/riscv-v-spec
>>
>> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>>
>> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>
> Sorry for not being more helpful at this point. As mentioned in the
> GitHub issue, I still need to get myself more familiar with RVV and how
> Valgrind handles vector instructions.
>
> Thanks,
> Petr
|
|
From: LATHUILIERE B. <bru...@ed...> - 2023-06-01 11:29:19
|
-------- Original message --------
Subject: Re: [Valgrind-developers] RFC: support scalable vector model / riscv vector
Date: 2023-05-29 05:29
From: "Wu, Fei" <fe...@in...>
To: Petr Pavlu <pet...@da...>, Jojo R <rj...@gm...>
Cc: pa...@so..., yun...@al..., val...@li..., val...@li..., zha...@al...

> On 5/28/2023 1:06 AM, Petr Pavlu wrote:
>> On 21. Apr 23 17:25, Jojo R wrote:
>>> The last remaining big issue is 3, which we introduce some ad-hoc
>>> approaches to deal with. We summarize these approaches into three
>>> types as following:
>>>
>>> 1. Break down a vector instruction to scalar VEX IR ops.
>>> 2. Break down a vector instruction to fixed-length VEX IR ops.
>>> 3. Use dirty helpers to realize vector instructions.
>>
>> I would also look at adding new VEX IR ops for scalable vector
>> instructions. In particular, if it could be shown that RVV and SVE can
>> use same new ops then it could make a good argument for adding them.
>>
>> Perhaps interesting is if such new scalable vector ops could also
>> represent fixed operations on other architectures, but that is just me
>> thinking out loud.
>>
> It's a good idea to consolidate all vector/simd together, the challenge
> is to verify its feasibility and to speedup the adaption progress, as
> it's supposed to take more efforts and longer time. Is there anyone with
> knowledge or experience of other ISA such as avx/sve on valgrind can
> share the pain and gain, or we can do some quick prototype?
>
> Thanks,
> Fei.

Hi,

I don't know if my experience is the one you expect; nevertheless, I will
try to share it.

I'm the main developer of a Valgrind tool called Verrou
(https://github.com/edf-hpc/verrou), which currently only works on the
x86_64 architecture. From the user's point of view, Verrou estimates the
effect of floating-point rounding-error propagation (if you are interested
in the subject, there are documentation and publications).

From the tool developer's point of view, we need to replace all
floating-point operations (FPOs) with our own modified FPOs implemented as
C++ functions. Each C++ function takes 1, 2 or 3 floating-point input
values and produces one floating-point output value.

As we have to replace all VEX FPOs, the way SSE and AVX are handled has
consequences for us. For each kind of FPO, (add,sub,mul,div,sqrt) x
(float,double), we have to replace the VEX op for the following variants:
scalar, SSE low lane, SSE, and AVX. It is painful but possible via code
generation. Thanks to the multiple VEX ops, it is possible to select only
one type of instruction (which can be useful to 1. get a speed-up, and
2. know whether floating-point errors come from scalar or vector
instructions).

On the other hand, for the FMA operations, (madd,msub) x (float,double),
we have less work to do, as Valgrind does the un-vectorisation for us, but
it is then impossible to instrument scalar or vector ops selectively.

One might think that the multiple VEX ops would enable performance
improvements via vectorisation of the C++ calls, but that is not currently
possible (at least to my knowledge). Indeed, with the Valgrind API I don't
know how to get the floating-point values out of the register without
un-vectorising: to get the values in the AVX register, I do an awful
sequence of Iop_V256to64_0, Iop_V256to64_1, Iop_V256to64_2, Iop_V256to64_3
for the two arguments.
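For reference, the extraction looks roughly like this (a sketch; vTmp
stands for an Ity_V256-typed temp holding one of the arguments):

/* Sketch: pulling the four 64-bit lanes out of a V256 value so they can
   be handed to a dirty call one by one. */
IRExpr* lane0 = IRExpr_Unop(Iop_V256to64_0, IRExpr_RdTmp(vTmp));
IRExpr* lane1 = IRExpr_Unop(Iop_V256to64_1, IRExpr_RdTmp(vTmp));
IRExpr* lane2 = IRExpr_Unop(Iop_V256to64_2, IRExpr_RdTmp(vTmp));
IRExpr* lane3 = IRExpr_Unop(Iop_V256to64_3, IRExpr_RdTmp(vTmp));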
As it is not possible to do an IRStmt_Dirty call to a function with 9 args
(9 = 2*4+1: 2 arguments of 4 lanes each, plus 1 for the result), I do a
first call to copy the 4 values of the first argument somewhere, then a
second one to perform the 4 C++ calls.

Due to the algorithm inside the C++ calls it could be tricky to vectorise,
but I didn't even try because of the sequence of Iop_V256to64_*. In my
dreams I would like an Iop_ that converts a V256 or V128 type to an
aligned pointer to the floating-point args.

So, I don't know if my experience can be useful for you, but if someone
has a better solution to my needs it will be useful at least ... to me :)

Best regards,
Bruno Lathuilière
|
From: Wu, F. <fe...@in...> - 2023-06-05 01:24:04
|
On 6/1/2023 7:13 PM, LATHUILIERE Bruno via Valgrind-developers wrote:
> Hi,
>
> I don't know if my experience is the one you expect, nevertheless I will
> try to share it.

Hi Bruno,

Thank you for sharing this, it's definitely worth reading.

> I'm the main developer of a valgrind tool called verrou (url:
> https://github.com/edf-hpc/verrou ) which currently only works with
> x86_64 architecture.
> From user's point of view, verrou enables to estimate the effect of the
> floating-point rounding error propagation (If you are interested by the
> subject, there are documentation and publication).

It looks interesting, good job.

> From valgrind tool developer's point of view, we need to replace all
> floating-point operations (fpo) by our own modified fpo implemented with
> C++ functions. One C++ function has 1,2 or 3 floating point input values
> and one floating point output value.

Do you use libvex_BackEnd() to translate the insns to host code, e.g.
host_riscv64_isel.c to select the host insns? Is there any difference in
processing flow between verrou and memcheck?

> [...]
> On the other hand, for fma operations (madd,msub)x(float,double) we have
> less work to do, as valgrind do the un-vectorisation for us, but it is
> impossible to instrument selectively scalar or vector ops.

As these insns are un-vectorised, are there any other issues besides 1
(performance) and 2 (original type) mentioned above? I want to be sure
whether there is any risk in the un-vectorisation design, e.g. when the
vector length is large, such as 2k vlen on RVV.

> [...]
> Due to the algorithm inside the C++ calls it could be tricky to
> vectorise, but I even didn't try because of the sequence of
> Iop_V256to64_*.

For memcheck, the process is as follows, put simply:

  toIR -> instrumentation -> backend isel

If the vector insn is split into scalars at the toIR stage, just as I did
in this series, the advantage looks obvious, as I only need to deal with
this single stage and can leverage the existing code to handle the scalar
version. The disadvantage is that it might lose some opportunities to
optimize, e.g.:

* toIR - introduces extra temp variables for the generated scalars
* instrumentation - for memcheck, the key is to trace the V+A bits instead
  of the real results of the ops; the ideal case is that the V+A bits of
  the whole vector can be checked together without breaking it into
  scalars
* backend isel - the ideal case is to use vector insns on the host for
  guest vector insns, but I'm not sure how much effort it would take to
  achieve this.

> In my dreams I would like Iop_ to convert a V256 or V128 type to an
> aligned pointer on floating point args.
>
> So, I don't know if my experience can be useful for you, but if someone
> has a better solution to my needs it will be useful at least ... to me :)

Thank you again for sharing this. I hope the discussion can help both of
us, and others.

Best regards,
Fei.
|
From: LATHUILIERE B. <bru...@ed...> - 2023-06-05 18:07:30
|
-----Original message-----
From: fe...@in... <fe...@in...>
Sent: Monday, 5 June 2023 03:24
To: LATHUILIERE Bruno <bru...@ed...>; val...@li...; val...@li...;
Petr Pavlu <pet...@da...>; Jojo R <rj...@gm...>; pa...@so...;
yun...@al...; zha...@al...
Subject: Re: [Valgrind-developers] RFC: support scalable vector model / riscv vector

> On 6/1/2023 7:13 PM, LATHUILIERE Bruno via Valgrind-developers wrote:
>> From valgrind tool developer's point of view, we need to replace all
>> floating-point operations (fpo) by our own modified fpo implemented
>> with C++ functions. One C++ function has 1,2 or 3 floating point input
>> values and one floating point output value.
>
> Do you use libvex_BackEnd() to translate the insn to host, e.g.
> host_riscv64_isel.c to select the host insn, Is there any difference of
> processing flow between verrou and memcheck?

I do not use (at least directly) the functions defined in host_*_isel.c.
The fact that Verrou is not yet portable to other architectures comes from
two reasons:
- In the C++ calls we use intrinsics for the FMA.
- We need to compile with --enable-only64bit. As I do not need to use
  Verrou on 32-bit architectures, I postpone the problem.

> [...]
> As these insns are un-vectorised, are there any other issues besides the
> 1 (performance) & 2 (original type) mentioned above? I want to make sure
> if there is any risk of the un-vectorisation design, e.g. when the
> vector length is large such as 2k vlen on rvv.

As a user of the Valgrind framework (i.e. a tool developer), I've no idea
about this kind of limitation. Being able to develop a Valgrind tool
without strong architecture knowledge is a strength of the framework.

> For memcheck, the process is as follows if we put it simple:
>   toIR -> instrumentation -> Backend isel

To my understanding, the memcheck tool does only the instrumentation
stage; the toIR and backend isel stages are done by the Valgrind
framework.

> If the vector insn is split into scalar at the stage of toIR just as I
> did in this series, [...] the disadvantage is that it might lose some
> opportunities to optimize, e.g.
> * instrumentation - for memcheck, the key is to trace the V+A bits
>   instead of the real results of the ops, the ideal case is V+A of the
>   whole vector can be checked together w/o breaking it to scalars

You pinpoint the main difference between Verrou and memcheck: the Verrou
instrumentation cannot be seen as trace generation, since we actually
modify the floating-point behaviour.

> Thank you again for this sharing. I hope the discussion can help both of
> us, and others.

I hope so.

Best regards,
Bruno Lathuilière
|
From: Floyd, P. <pj...@wa...> - 2023-06-12 09:26:16
|
On 01/06/2023 13:13, LATHUILIERE Bruno via Valgrind-developers wrote:
> I don't know if my experience is the one you expect, nevertheless I will
> try to share it.
> I'm the main developer of a valgrind tool called verrou (url:
> https://github.com/edf-hpc/verrou ) which currently only works with
> x86_64 architecture.
> From user's point of view, verrou enables to estimate the effect of the
> floating-point rounding error propagation (If you are interested by the
> subject, there are documentation and publication).

[snip]

Interesting, I don't remember having seen anything on verrou. I need to
look more at the doc and publications.

I'll add a link to https://valgrind.org/downloads/variants.html (which is
a bit out of date).

A+
Paul
|
From: LATHUILIERE B. <bru...@ed...> - 2023-06-12 12:52:34
|
Hi,

I like the idea of adding verrou to the variants list. You can get the
source and documentation from GitHub: https://github.com/edf-hpc/verrou/

The direct link to the documentation of the latest version:
http://edf-hpc.github.io/verrou/vr-manual.html
(Sooner or later I will change the link, to keep the documentation of old
versions.)

The main references about verrou are:

- François Févotte and Bruno Lathuilière. Debugging and optimization of
  HPC programs with the Verrou tool. In International Workshop on Software
  Correctness for HPC Applications (Correctness), Denver, CO, USA,
  Nov. 2019. DOI: 10.1109/Correctness49594.2019.00006
  https://hal.science/hal-02044101/
- François Févotte and Bruno Lathuilière. Studying the numerical quality
  of an industrial computing code: A case study on code_aster. In 10th
  International Workshop on Numerical Software Verification (NSV), pages
  61-80, Heidelberg, Germany, July 2017. DOI: 10.1007/978-3-319-63501-9_5
  https://www.fevotte.net/publications/fevotte2017a.pdf
- François Févotte and Bruno Lathuilière. VERROU: a CESTAC evaluation
  without recompilation. In International Symposium on Scientific
  Computing, Computer Arithmetics and Verified Numerics (SCAN), Uppsala,
  Sweden, September 2016.
  https://www.fevotte.net/publications/fevotte2016.pdf

And if you are interested in the required number of samples, you should
read the following paper (not specific to verrou):

- Devan Sohier, Pablo De Oliveira Castro, François Févotte, Bruno
  Lathuilière, Eric Petit, and Olivier Jamond. Confidence intervals for
  stochastic arithmetic. ACM Transactions on Mathematical Software, 47(2),
  2021. https://hal.science/hal-01827319

++
Bruno Lathuilière
|
From: Paul F. <pj...@wa...> - 2023-07-04 05:29:46
|
Hi,

I just pushed a change to the web pages that adds this info.

A+
Paul
|
From: Wu, F. <fe...@in...> - 2023-07-06 12:40:15
|
On 5/29/2023 11:29 AM, Wu, Fei wrote:
> On 5/28/2023 1:06 AM, Petr Pavlu wrote:
>> On 21. Apr 23 17:25, Jojo R wrote:
>>> We consider to add RVV/Vector [1] feature in valgrind, there are some
>>> challenges.
>>> RVV like ARM's SVE [2] programming model, it's scalable/VLA, that means the
>>> vector length is agnostic.
>>> ARM's SVE is not supported in valgrind :(
>>>
>>> There are three major issues in implementing RVV instruction set in Valgrind
>>> as following:
>>>
>>> 1. Scalable vector register width VLENB
>>> 2. Runtime changing property of LMUL and SEW
>>> 3. Lack of proper VEX IR to represent all vector operations
>>>
>>> We propose applicable methods to solve 1 and 2. As for 3, we explore several
>>> possible but maybe imperfect approaches to handle different cases.
>>>
I did a very basic prototype of vlen Vector IR, specifically for the
RISC-V Vector extension (RVV):
* Define new IR ops such as Iop_VAdd8/16/32/64; the difference from the
existing SIMD versions is that no element count is specified, unlike
Iop_Add8x32
* Define a new IR type Ity_VLen alongside existing types such as Ity_I64
and Ity_V256
* Define a new class HRcVecVLen in HRegClass for vlen vector registers
The real length is embedded in both the IROp and IRType of vlen ops/types;
it is decided at runtime and already known when handling an insn such as
vadd. This gives more flexibility, e.g. the backend can issue an extra
vsetvl if necessary.
With the above, an RVV instruction in the guest can be passed from the
frontend, through memcheck, to the backend, which generates the final RVV
insn during host isel; a very basic testcase has been run successfully.
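Illustratively, the additions look like the following; the names mirror
this message and none of them exist in upstream VEX:

/* Sketch: scalable-vector extensions to the IR as prototyped. The element
   count is deliberately absent from the op names: the effective length
   comes from the runtime-known VLEN, not from the op itself. */
typedef enum {
   /* ... existing IROps ... */
   Iop_VAdd8, Iop_VAdd16, Iop_VAdd32, Iop_VAdd64
} IROpVLenExt;

typedef enum {
   /* ... existing IRTypes: Ity_I64, Ity_V128, Ity_V256, ... */
   Ity_VLen      /* scalable vector; width fixed once at startup */
} IRTypeVLenExt;

/* Host side: a new register class for vlen vector registers, so the
   register allocator can track them separately from fixed-size ones. */
typedef enum {
   /* ... existing HRegClass values ... */
   HRcVecVLen
} HRegClassVLenExt;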
Now come the complexities:
1. RVV has the concept of LMUL, which groups multiple (or partial) vector
registers; e.g. when LMUL==2, v2 means the real v2+v3. This complicates
register allocation.
2. RVV uses the "implicit" v0 for the mask: its content must be loaded
into the exact "v0" register, not any other one, if host isel wants to
leverage RVV insns. This implicitness in the ISA requires more
explicitness in the Valgrind implementation.
For #1, LMUL, a new register allocation algorithm could be added; it would
be great if someone is willing to try that, though I'm not sure how much
effort it would take. The other way is to split the op into multiple ops
that each take only one vector register: taking vadd as an example, one
vadd with LMUL=2 runs as two vadds with LMUL=1. This still works for the
widening insns, so most arithmetic insns can be covered this way. The
exception could be the register gather insn vrgather, for which we can
resort to other means, e.g. scalars or a helper.
For #2, the v0 mask, one way is to handle the mask at the very beginning,
in guest_riscv64_toIR.c, similar to what the AVX port does:
a) Read the whole dest register without the mask
b) Generate the unmasked result by running the op without the mask
c) Apply the mask to a) and b) to generate the final dest
By doing this, an insn with a mask is converted into non-masked ones; more
insns are generated, but the performance should be acceptable. There are
still exceptions, e.g. vadc (Add-with-Carry), where v0 is used not as a
mask but as a carry; but, as mentioned above, it's okay to use other means
for a few insns. Eventually, we could pass the v0 mask down to the backend
if that proves to be a better solution.
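The per-element semantics of steps a)-c), written as a plain reference
model rather than IR; SEW=8 and the "mask undisturbed" policy are assumed:

/* Sketch: merge the unmasked result with the old destination under the
   v0 mask (one bit per element), SEW=8 case. */
static void mask_merge_sew8 ( UChar* dst_final, const UChar* dst_old,
                              const UChar* unmasked, const UChar* v0,
                              ULong vl )
{
   ULong i;
   for (i = 0; i < vl; i++) {
      Bool active = (v0[i / 8] >> (i % 8)) & 1;
      dst_final[i] = active ? unmasked[i] : dst_old[i];
   }
}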
This approach introduces a batch of new vlen vector IR ops, especially
arithmetic ops such as vadd. My goal is a good solution that takes
reasonable time to reach usable status, yet can still evolve and is
generic enough for other vector ISAs. Any comments?
Best Regards,
Fei.
>>> We start from 1. As each guest register should be described in VEXGuestState
>>> struct, the vector registers with scalable width of VLENB can be added into
>>> VEXGuestState as arrays using an allowable maximum length like 2048/4096.
>>
>> Size of VexGuestRISCV64State is currently 592 bytes. Adding these large
>> vector registers will bump it by 32*2048/8=8192 bytes.
>>
> Yes, that's the reason in my RFC patches the vlen is set to 128, that's
> the largest room for vector in current design.
>
>> The baseblock layout in VEX is: the guest state, two equal sized areas
>> for shadow state and then a spill area. The RISC-V port accesses the
>> baseblock in generated code via x8/s0. The register is set to the
>> address of the baseblock+2048 (file
>> coregrind/m_dispatch/dispatch-riscv64-linux.S). The extra offset is
>> a small optimization to utilize the fact that load/store instructions in
>> RVI have a signed offset in range [-2048,2047]. The end result is that
>> it is possible to access the baseblock data using only a single
>> instruction.
>>
> Nice design.
>
>> Adding the new vector registers will cause that more instructions will
>> be necessary. For instance, accessing any shadow guest state would
>> naively require a sequence of LUI+ADDI+LOAD/STORE.
>>
>> I suspect this could affect performance quite a bit and might need some
>> optimizing.
>>
> Yes, can we separate the vector registers from the other ones, is it
> able to use two baseblocks? Or we can do some experiments to measure the
> overhead.
>
>>>
>>> The actual available access range can be determined at Valgrind startup time
>>> by querying the CPU for its vector capability or some suitable setup steps.
>>
>> Something to consider is that the virtual CPU provided by Valgrind does
>> not necessarily need to match the host CPU. For instance, VEX could
>> hardcode that its vector registers are only 128 bits in size.
>>
>> I was originally hoping that this is how support for the V extension
>> could be added, but the LMUL grouping looks to break this model.
>>
> Originally I had the same idea, but 128 vlen hardware cannot run the
> software built for larger vlen, e.g. clang has option
> -riscv-v-vector-bits-min, if it's set to 256, then it assumes the
> underlying hardware has at least 256 vlen.
>
>>> [...]
>>>
>>>
>>> The last remaining big issue is 3, which we introduce some ad-hoc approaches
>>> to deal with. We summarize these approaches into three types as following:
>>>
>>> 1. Break down a vector instruction to scalar VEX IR ops.
>>> 2. Break down a vector instruction to fixed-length VEX IR ops.
>>> 3. Use dirty helpers to realize vector instructions.
>>
>> I would also look at adding new VEX IR ops for scalable vector
>> instructions. In particular, if it could be shown that RVV and SVE can
>> use same new ops then it could make a good argument for adding them.
>>
>> Perhaps interesting is if such new scalable vector ops could also
>> represent fixed operations on other architectures, but that is just me
>> thinking out loud.
>>
> It's a good idea to consolidate all vector/simd together, the challenge
> is to verify its feasibility and to speedup the adaption progress, as
> it's supposed to take more efforts and longer time. Is there anyone with
> knowledge or experience of other ISA such as avx/sve on valgrind can
> share the pain and gain, or we can do some quick prototype?
>
> Thanks,
> Fei.
>
>>> [...]
>>> In summary, it is far to reach a truly applicable solution in adding vector
>>> extensions in Valgrind. We need to do detailed and comprehensive estimations
>>> on different vector instruction categories.
>>>
>>> Any feedback is welcome in github [3] also.
>>>
>>>
>>> [1] https://github.com/riscv/riscv-v-spec
>>>
>>> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>>>
>>> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>>
>> Sorry for not being more helpful at this point. As mentioned in the
>> GitHub issue, I still need to get myself more familiar with RVV and how
>> Valgrind handles vector instructions.
>>
>> Thanks,
>> Petr
|
|
From: Petr P. <pet...@da...> - 2023-07-10 21:06:01
|
On 6. Jul 23 20:39, Wu, Fei wrote:
> [...]
>
> This approach will introduce a bunch of new vlen vector IRs, especially
> arithmetic IRs such as vadd. My goal is a good solution that takes
> reasonable time to reach usable status, yet is still able to evolve and
> is generic enough for other vector ISAs. Any comments?

Could you please share a repository with your changes or send them to me
as patches? I have a few questions but I think it might be easier for me
first to see the actual code.

Thanks,
Petr |
|
From: Wu, F. <fe...@in...> - 2023-07-11 11:29:25
Attachments:
rvv.tar.bz2
|
On 7/11/2023 4:50 AM, Petr Pavlu wrote:
> [...]
>
> Could you please share a repository with your changes or send them to me
> as patches? I have a few questions but I think it might be easier for me
> first to see the actual code.
>
Please see attachment. It's a very raw version to just verify the idea,
mask is not added but expected to be done as mentioned above, it's based
on commit 71272b2529 on your branch, patch 0013 is the key.

btw, I will setup a repository but it takes a few days to pass the
internal process.

Thanks,
Fei. |
|
From: Wu, F. <fe...@in...> - 2023-07-18 01:44:56
|
On 7/11/2023 7:28 PM, Wu, Fei wrote:
> On 7/11/2023 4:50 AM, Petr Pavlu wrote:
>> [...]
>>
>> Could you please share a repository with your changes or send them to me
>> as patches? I have a few questions but I think it might be easier for me
>> first to see the actual code.
>>
> Please see attachment. It's a very raw version to just verify the idea,
> mask is not added but expected to be done as mentioned above, it's based
> on commit 71272b2529 on your branch, patch 0013 is the key.

Hi Petr,

Have you taken a look? Any comments?

Thanks,
Fei. |
|
From: Jojo R <rj...@li...> - 2023-07-17 07:06:22
Attachments:
Valgrind-RVV-T-HEAD.pdf
|
Hi,
Sorry for the late reply;
I have been pushing progress on the Valgrind RVV implementation 😄
We finished the first version and tested it against the full RVV
intrinsics spec.
For real projects and developers, we implemented the first usable, fully
functional RVV Valgrind using the dirty-call method,
and we will experiment with and optimize the RVV implementation on an
ideal RVV design.
Back to the RVV RFC, we are happy to share our thinking on the design;
see the attachment for more details :)
Regards
--Jojo
On 2023/4/21 17:25, Jojo R wrote:
> [...] |
|
From: Jojo R <rj...@li...> - 2023-08-04 06:04:16
|
Hi,

We are glad to open source the RVV implementation here:

https://github.com/rjiejie/valgrind-riscv64

Three extra ISA extensions were added in this repo:

RV64Zfh : Half-precision floating-point
RV64Xthead [1] : T-HEAD vendor extension for RV64G
RV64V0p7 [2] : Vector 0.7.1
RV64V : Vector 1.x, coming soon :)

[1] https://github.com/T-head-Semi/thead-extension-spec
[2] https://github.com/riscv/riscv-v-spec/releases/tag/0.7.1

Regards

--Jojo

On 2023/7/17 15:05, Jojo R wrote:
> [...] |
|
From: Jojo R <rj...@gm...> - 2023-08-29 07:47:14
|
Hi,

We are glad to open source the RVV implementation here again:

https://github.com/rjiejie/valgrind-riscv64

Four extra ISA extensions were added in this repo:

RV64Zfh : Half-precision floating-point
RV64Xthead [1] : T-HEAD vendor extension for RV64G
RV64V0p7 [2] : Vector 0.7.1
RV64V [3] : Vector 1.0

[1] https://github.com/T-head-Semi/thead-extension-spec
[2] https://github.com/riscv/riscv-v-spec/releases/tag/0.7.1
[3] https://github.com/riscv/riscv-v-spec/releases/tag/v1.0

Regards

--Jojo

On 2023/7/17 15:05, Jojo R wrote:
> [...] |
|
From: Petr P. <pet...@da...> - 2023-07-18 19:26:03
|
On 11. Jul 23 19:28, Wu, Fei wrote:
> On 7/11/2023 4:50 AM, Petr Pavlu wrote:
> > On 6. Jul 23 20:39, Wu, Fei wrote:
> >> [...]
> >>
> >> This approach will introduce a bunch of new vlen vector IRs, especially
> >> arithmetic IRs such as vadd. My goal is a good solution that takes
> >> reasonable time to reach usable status, yet is still able to evolve and
> >> is generic enough for other vector ISAs. Any comments?
This personally looks to me like the right direction. Supporting scalable
vector extensions in Valgrind as a first-class citizen would be my
preferred choice. I think it is something that will be needed to handle
Arm SVE and RISC-V RVV well. On the other hand, it is likely the most
complex approach and could take time to iron out.
> > Could you please share a repository with your changes or send them to me
> > as patches? I have a few questions but I think it might be easier for me
> > first to see the actual code.
> >
> Please see attachment. It's a very raw version to just verify the idea,
> mask is not added but expected to be done as mentioned above, it's based
> on commit 71272b2529 on your branch, patch 0013 is the key.
Thanks for sharing this code. The previous discussions and this series
introduce a new concept of translating client code per some CPU state.
That is what I spent the most time thinking about.
I can see it is indeed necessary for RVV. In particular, this
"versioning" of translations allows that Valgrind IR can statically
express an element type of each vector operation, i.e. that it is an
operation on I32, F64, ... An alternative would be to try to express the
type dynamically in IR. That should be still somewhat manageable in the
toIR frontend but I have a hard time seeing how it would work for the
instrumentation and codegen.
The versioning should work well for RVV translations because my
expectation is that most RVV loops will consist of a call to vsetvli
(with a static vtype), followed by some actual vector operations. Such
a block then requires only one translation.
This is however true only if translations are versioned just per vtype,
without vl. If I understood correctly, the patches version them per vl
too, but it isn't clear to me conceptually whether this is really
necessary. For instance, I think VAdd8 could look as follows:
VAdd8(<len>, <in1>, <in2>, <flags?>), where <len> is something like
IRExpr_Get(OFFB_VL, Ity_I64).
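A sketch of how such an op could be built, with the length read
dynamically from the guest state (OFFB_VL as the guest-state offset of
vl, and a three-operand VAdd8 without <flags?>, are assumptions here):

    /* The element count is an IR expression, so a single translation
       can serve any vl; only vtype needs to version the translation. */
    IRExpr* len = IRExpr_Get(OFFB_VL, Ity_I64);
    IRExpr* sum = IRExpr_Triop(Iop_VAdd8, len, in1, in2);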
Another problem which I noticed is that blocks containing no RVV
instructions are also versioned. Consider the following:
while (true) {
    // (1) some RVV code which can set vtype to different values
    // (2) a large chunk of non-RVV code
}
The code in (2) will currently get multiple identical translations, one
for each residue left in vtype by (1).
In general, I think the concept of allowing translations per some CPU
state could be useful in other cases and for other architectures too.
For RISC-V, it could be beneficial for floating-point operations. My
expectation is that regular RISC-V FP code will have instructions encoded
with rm=DYN and always executed with frm=RNE. The current approach is
that the toIR frontend generates an IR which reads the rounding mode
from frm and remaps it to the Valgrind's representation. The codegen
then does the opposite. The idea here is that the frontend would know
the actual rounding mode and could create IR which has directly this
mode, for instance, AddF64(Irrm_NEAREST, <in1>, <in2>). The codegen then
doesn't need to know how to handle any dynamic rounding modes as they
become static.
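As an illustration, a sketch of the statically specialized form
(Iop_AddF64 and Irrm_NEAREST are existing VEX names; in1/in2 stand for
the operand expressions):

    /* frm is known to be RNE at translation time, so the rounding mode
       is baked into the IR instead of being read and remapped from frm
       at runtime. */
    IRExpr* r = IRExpr_Triop(Iop_AddF64,
                             IRExpr_Const(IRConst_U32(Irrm_NEAREST)),
                             in1, in2);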
I plan to look further into this series. Specifically, I'd like to have
a stab at adding some basic support for Arm SVE to get a better
understanding if this is generic enough.
Thanks,
Petr
|
|
From: Wu, F. <fe...@in...> - 2023-07-19 01:25:15
|
On 7/19/2023 3:08 AM, Petr Pavlu wrote:
> On 11. Jul 23 19:28, Wu, Fei wrote:
>> On 7/11/2023 4:50 AM, Petr Pavlu wrote:
>>> On 6. Jul 23 20:39, Wu, Fei wrote:
>>>> [...]
>>>>
>>>> This approach will introduce a bunch of new vlen vector IRs, especially
>>>> arithmetic IRs such as vadd. My goal is a good solution that takes
>>>> reasonable time to reach usable status, yet is still able to evolve and
>>>> is generic enough for other vector ISAs. Any comments?
>
> [...]
>
> This is however true only if translations are versioned just per vtype,
> without vl. If I understood correctly, the patches version them per vl
> too but it isn't clear to me conceptually if this is really necessary.
>
Yes, this series does version per vl. It helps in situations such as
the last patch, where a large vl can be broken into multiple smaller-vl
operations, in case the backend doesn't have a register-allocation
algorithm for LMUL>1.
> For instance, I think VAdd8 could look as follows:
> VAdd8(<len>, <in1>, <in2>, <flags?>) where <len> is something as
> IRExpr_Get(OFFB_VL, Ity_I64).
>
> Another problem which I noticed is that blocks containing no RVV
> instructions are also versioned. Consider the following:
> while (true) {
> // (1) some RVV code which can set vtype to different values
> // (2) a large chunk of non-RVV code
> }
>
> The code in (2) will currently have multiple same translations for each
> residue left in vtype by (1).
>
Yes, indeed. This is one place to optimize.
> [...]
>
> I plan to look further into this series. Specifically, I'd like to have
> a stab at adding some basic support for Arm SVE to get a better
> understanding if this is generic enough.
>
Great, I will add more RVV support if this proves to be the right
direction, and thank you for the review.
Thanks,
Fei.
> Thanks,
> Petr
|
|
From: Petr P. <pet...@da...> - 2023-07-25 19:55:29
|
On 17. Jul 23 15:05, Jojo R wrote:
> [...]
> Back to the RVV RFC, we are happy to share our thinking on the design;
> see the attachment for more details :)

This is a good summary.

As mentioned in another part of the thread, I think that in the long run
it will indeed be needed to implement the approach described as "RVV to
variable-length IR". I hope to help with making sure it can work for Arm
SVE too.

I guess if initial experiments show that this option is hard and will
take time to implement, then it could make sense in the short term for
the RISC-V port to go with the "RVV to dirty helper" implementation.

Thanks,
Petr |
|
From: Jojo R <rj...@gm...> - 2023-08-04 05:45:19
|
On 2023/7/26 03:55, Petr Pavlu wrote:
> [...]
> I guess if initial experiments show that this option is hard and will
> take time to implement, then it could make sense in the short term for
> the RISC-V port to go with the "RVV to dirty helper" implementation.

Ok, experiments are helpful; also, we will open source our RVV
implementation soon :)

Regards

-- Jojo |