Re: [Valgrind-developers] RFC: support scalable vector model / riscv vector

Hi,

Sorry for the late reply.

I have been pushing the Valgrind RVV implementation forward 😄
We have finished the first version and tested it against the full RVV 
intrinsics spec.

To serve real projects and developers, we implemented the first usable, 
fully functional RVV support in Valgrind using the dirty-call method. 
We will go on to experiment with and optimize the implementation toward 
the ideal RVV design.

Back to the RVV RFC: we are happy to share our design thinking; see the 
attachment for more details :)

Regards

--Jojo

On 2023/4/21 17:25, Jojo R wrote:
>
> Hi,
>
> We are considering adding RVV (the RISC-V vector extension) [1] 
> support to Valgrind, and there are some challenges.
> Like ARM's SVE [2], RVV has a scalable, vector-length-agnostic (VLA) 
> programming model: code does not assume a fixed vector length.
> ARM's SVE is not supported in valgrind :(
>
> There are three major issues in implementing the RVV instruction set 
> in Valgrind:
>
>  1. The scalable vector register width, VLENB
>  2. LMUL and SEW, which can change at run time
>  3. The lack of suitable VEX IR to represent all vector operations
>
> We propose practical methods to solve 1 and 2. For 3, we explore 
> several possible, though perhaps imperfect, approaches for different 
> cases.
>
> We start with 1. Since each guest register must be described in the 
> VEXGuestState struct, the vector registers, whose width VLENB is 
> scalable, can be added to VEXGuestState as arrays sized for an 
> allowable maximum length such as 2048/4096.
>
> The range that is actually accessible can be determined at Valgrind 
> startup by querying the CPU for its vector capability, or by some 
> other suitable setup step.
>
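A minimal sketch of the layout described above, under stated assumptions: the names `GuestVecState`, `MAX_VLEN_BITS`, `guest_vec_init` and `vreg_offset` are hypothetical and are not taken from Valgrind. The registers are sized for the maximum, so their offsets stay fixed, and the usable width is recorded once at startup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the register width in bits (the mail
   suggests 2048 or 4096). */
#define MAX_VLEN_BITS  4096u
#define MAX_VLEN_BYTES (MAX_VLEN_BITS / 8u)
#define NUM_VREGS      32u

/* Illustrative stand-in for the vector part of a guest state struct. */
typedef struct {
    uint8_t  vreg[NUM_VREGS][MAX_VLEN_BYTES]; /* v0..v31, sized for the maximum */
    uint64_t vlenb;                           /* bytes actually usable per register */
} GuestVecState;

/* Record the width found at startup. Returns 0 on success, -1 if the
   width is zero, not a power of two, or larger than the compiled-in bound. */
static int guest_vec_init(GuestVecState *st, uint64_t host_vlenb)
{
    if (host_vlenb == 0 || host_vlenb > MAX_VLEN_BYTES ||
        (host_vlenb & (host_vlenb - 1)) != 0)
        return -1;
    st->vlenb = host_vlenb;
    return 0;
}

/* Byte offset of register r inside the struct. It depends only on the
   maximum width, never on the width detected at run time. */
static size_t vreg_offset(unsigned r)
{
    assert(r < NUM_VREGS);
    return offsetof(GuestVecState, vreg) + (size_t)r * MAX_VLEN_BYTES;
}
```

The cost of this choice is memory: with a 4096-bit bound the register file alone takes 16 KiB of guest state, whatever the real hardware width is.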
>
> To solve problem 2, we are inspired by a proven technique in QEMU, 
> where translation blocks are broken up when certain critical CSRs are 
> set. Because guest-code-to-IR translation relies on the precise values 
> of LMUL/SEW, and they may change within a basic block, we can end the 
> basic block each time a vsetvl{i} instruction is encountered and 
> return to the scheduler, which executes the translated code and 
> updates LMUL/SEW. Accordingly, translation cache management should be 
> refactored to detect changes of LMUL/SEW and invalidate outdated 
> cached code. To keep this general, LMUL/SEW should be encoded into a 
> ULong flag that other architectures can also use to store their 
> arch-dependent information. The TTentry struct should take the flag 
> into account on both insertion and deletion. This way, the flag 
> carries the latest LMUL/SEW throughout the simulation and can be 
> passed to the disassembly functions through the VEXArchInfo struct, so 
> the translation always sees the current values of LMUL and SEW.
>
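One possible reading of the flag idea, as a sketch only: the translation table is keyed on the pair (guest address, flag), so code translated under one LMUL/SEW setting is never reused under another. The bit positions follow the vtype layout of the RVV 1.0 spec (vlmul in bits 2:0, vsew in bits 5:3). `TransEntry`, `TransTable` and the `tt_*` functions are hypothetical and much simpler than Valgrind's real translation table.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the vlmul (bits 2:0) and vsew (bits 5:3) fields of vtype. */
static uint64_t make_arch_flag(uint64_t vtype) { return vtype & 0x3Fu; }

/* SEW in bits encoded in the flag: vsew 0,1,2,3 means 8,16,32,64. */
static unsigned flag_sew_bits(uint64_t flag)
{
    return 8u << ((flag >> 3) & 0x7u);
}

/* Hypothetical translation table entry keyed on address AND flag. */
typedef struct {
    uint64_t guest_addr;
    uint64_t arch_flag;
    int      in_use;
    int      host_code_id;   /* placeholder for a pointer to host code */
} TransEntry;

#define TT_SIZE 64
typedef struct { TransEntry e[TT_SIZE]; } TransTable;

/* Insert a translation; returns 0 on success, -1 if the table is full. */
static int tt_insert(TransTable *tt, uint64_t addr, uint64_t flag, int id)
{
    for (int i = 0; i < TT_SIZE; i++) {
        if (!tt->e[i].in_use) {
            tt->e[i] = (TransEntry){ addr, flag, 1, id };
            return 0;
        }
    }
    return -1;
}

/* Look up by (address, flag); returns the code id, or -1 on a miss. */
static int tt_lookup(const TransTable *tt, uint64_t addr, uint64_t flag)
{
    for (int i = 0; i < TT_SIZE; i++) {
        const TransEntry *t = &tt->e[i];
        if (t->in_use && t->guest_addr == addr && t->arch_flag == flag)
            return t->host_code_id;
    }
    return -1;
}
```

With this keying, the same guest address can hold one translation per LMUL/SEW setting, and a lookup under a setting that has not been translated yet simply misses and triggers a new translation.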
> Some architecture-specific code also needs attention. For example, in 
> the m_dispatch part, the disp_cp_xindir function looks up the code 
> cache in hardcoded assembly, checking only the requested guest state 
> IP against the translation cache entry address, with no further 
> constraints. Many other modules should be checked to ensure that an 
> update of LMUL/SEW is immediately visible to the essential parts of 
> Valgrind.
>
>
> The last big issue is 3, for which we introduce some ad-hoc 
> approaches. We group them into three types:
>
>  1. Break down a vector instruction to scalar VEX IR ops.
>  2. Break down a vector instruction to fixed-length VEX IR ops.
>  3. Use dirty helpers to realize vector instructions.
>
> The first method is possible in theory but probably not practical, as 
> the number of IR ops explodes when a large VLENB is adopted. Imagine a 
> configuration of VLENB=512, SEW=8, LMUL=8: the VL is 512 * 8 / 8 = 
> 512, meaning that a single vector instruction turns into 512 scalar 
> operations, each of which expands to multiple IR statements. To make 
> things worse, tool instrumentation inserts more IR between adjacent 
> scalar IR ops. As a result, a real-world application with many vector 
> instructions is likely to run on the order of a thousand times slower. 
> The other two methods are therefore more promising, and we discuss 
> them below.
>
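The arithmetic above can be checked with a small sketch. It reads the register width as a bit count (VLEN in the RVV spec's terms, where the vlenb CSR holds VLEN/8) and handles only integer LMUL. The per-element IR cost passed to `scalar_ir_count` is a made-up parameter for illustration, not a measured figure.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum element count of one register group: VLEN * LMUL / SEW,
   with VLEN and SEW in bits and LMUL an integer (1, 2, 4 or 8). */
static uint64_t vlmax(uint64_t vlen_bits, uint64_t lmul, uint64_t sew_bits)
{
    assert(sew_bits != 0);
    return vlen_bits * lmul / sew_bits;
}

/* Rough IR statement count when every element is translated on its own.
   irs_per_elem is a hypothetical cost, e.g. 3 for load, op, store. */
static uint64_t scalar_ir_count(uint64_t vl, uint64_t irs_per_elem)
{
    return vl * irs_per_elem;
}
```

For the configuration in the mail, `vlmax(512, 8, 8)` gives 512 elements; at an assumed three IR statements per element, that is already 1536 statements for one guest instruction before any tool instrumentation is added.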
> Methods 2 and 3 are not mutually exclusive: for each vector 
> instruction we may choose whichever suits its concrete behavior. To 
> explain them in detail, we present some examples that illustrate their 
> pros and cons.
>
> For method 2, we have the actual values of VLENB/LMUL/SEW. The simple 
> case is VLENB <= 256 and LMUL=1, where many SIMD IR ops are available 
> and can be applied directly to represent vector operations. However, 
> even when VLENB is restricted to 128, a register group still exceeds 
> the maximum SIMD width of 256 supported by VEX IR if LMUL>2. Hence 
> there are two variants of method 2 for long vectors:
>
>
> *2.1* Add more SIMD IR ops, for widths such as 1024/2048/4096, and 
> translate vector instructions at the granularity of VLENB. For 
> example, VLENB=4096 with LMUL=2 is covered by two 4096-wide SIMD VEX 
> IR ops.
>
>   * *pros*: it encourages the VEX backend to generate more compact and
>     efficient SIMD code (maybe). In particular, it accommodates mask
>     and gather/scatter (indexed) instructions by carrying more
>     information in the IR itself.
>   * *cons*: too many new IR ops need to be introduced in VEX, as each
>     op of each length needs its own add/sub/mul variants. New data
>     types to denote long vectors are necessary too, causing
>     difficulties in both VEX backend register allocation and tool
>     instrumentation.
>
> *2.2* Break long vectors down into multiple repeated SIMD ops. For 
> instance, a vadd.vv vector instruction with VLENB=256/LMUL=2/SEW=8 is 
> composed of four ops of type Iop_Add8x16.
>
>   * *pros:* less effort is required in register allocation and tool
>     instrumentation. The VEX frontend can tell the backend to generate
>     efficient vector instructions through existing Iops. It strikes a
>     better balance between the complexity of adding many long-vector
>     IR ops and the benefit of generating efficient host code.
>   * *cons:* it is hard to describe a mask operation, given that the
>     mask is pretty flexible (the least significant bit of each segment
>     of v0). Gather/scatter instructions may have similar problems in
>     dividing index registers appropriately. Various corner cases also
>     remain, such as widening arithmetic operations (the current
>     widening SIMD IR ops are not compatible) and the vstart CSR. When
>     fixed-length IR ops make up a vector instruction, each IR op must
>     be told the position, encoded in vstart, from which it can start
>     processing data. We can treat vstart as a normal guest state
>     virtual register and calculate each op's start position as a guard
>     IRExpr, or obtain the value of vstart the way we do for LMUL/SEW.
>     Either way, it is non-trivial to decompose a vector instruction
>     concisely.
>
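To make the 2.2 decomposition and the vstart guard concrete, here is a sketch that assumes 128-bit chunks (the width of an op such as Iop_Add8x16) and reads the register width as a bit count. `num_chunks`, `chunk_active` and `ActiveRange` are hypothetical names; the sketch only computes which elements of each chunk lie inside [vstart, vl) and does not build any IR or handle masks.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BITS 128u   /* width of one V128 op such as Iop_Add8x16 */

/* Elements [first, last) of a chunk that the op may touch.
   first == last means the whole chunk is skipped. */
typedef struct { uint32_t first; uint32_t last; } ActiveRange;

/* Number of 128-bit ops needed to cover one register group. */
static uint32_t num_chunks(uint32_t vlen_bits, uint32_t lmul)
{
    return vlen_bits * lmul / CHUNK_BITS;
}

/* Clamp the body range [vstart, vl) to the elements held by one chunk. */
static ActiveRange chunk_active(uint32_t chunk, uint32_t sew_bits,
                                uint32_t vstart, uint32_t vl)
{
    assert(sew_bits != 0 && sew_bits <= CHUNK_BITS);
    uint32_t per_chunk = CHUNK_BITS / sew_bits;   /* elements per chunk */
    uint32_t lo = chunk * per_chunk;              /* first element index */
    uint32_t hi = lo + per_chunk;                 /* one past the last */
    uint32_t a = vstart > lo ? vstart : lo;
    uint32_t b = vl < hi ? vl : hi;
    ActiveRange r = { 0, 0 };
    if (a < b) {
        r.first = a - lo;
        r.last  = b - lo;
    }
    return r;
}
```

For the vadd.vv example, `num_chunks(256, 2)` is 4. With SEW=8, vstart=5 and vl=40, the first chunk processes elements 5..15, the second all 16, the third only its first 8, and the fourth none, which is the kind of per-op guard the frontend would have to emit.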
> In short, both 2.1 and 2.2 face a dilemma between limiting the 
> engineering effort needed to refactor Valgrind cleanly and 
> implementing the vector instruction set efficiently. The same 
> obstacles exist for ARM SVE, whose instructions are also scalable and 
> flexible in many ways.
>
> The final solution is the dirty helper. It is undoubtedly practical 
> and probably requires the least engineering effort, given how many 
> details in Valgrind must be dealt with. In this design, each 
> instruction is carried out by inline assembly that runs the same 
> instruction on the host. Moreover, tool instrumentation already 
> handles IRDirty, except that new fields should be added to the 
> _IRDirty struct to indicate strided/indexed/masked memory accesses and 
> arithmetic operations.
>
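A sketch of why extra fields would be needed: IRDirty currently describes a helper's memory effect with a single address and size (mAddr, mSize), which fits a contiguous access but not a strided one. The `StridedAccess` struct below is purely hypothetical and is not a proposal for the actual _IRDirty layout; it only shows the information a tool would need in order to check each element separately.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a strided vector load/store. */
typedef struct {
    uint64_t base;        /* address of element 0 */
    int64_t  stride;      /* byte distance between elements, may be negative */
    uint32_t elem_bytes;  /* SEW / 8 */
    uint32_t count;       /* number of elements, i.e. vl */
} StridedAccess;

/* Address of the i-th element touched by the access. */
static uint64_t strided_elem_addr(const StridedAccess *a, uint32_t i)
{
    assert(i < a->count);
    return a->base + (uint64_t)((int64_t)i * a->stride);
}

/* Bytes the instruction really touches. */
static uint64_t strided_bytes_touched(const StridedAccess *a)
{
    return (uint64_t)a->count * a->elem_bytes;
}

/* Bytes a single (address, size) pair would have to cover instead. */
static uint64_t strided_span_bytes(const StridedAccess *a)
{
    uint64_t step = a->stride < 0 ? (uint64_t)0 - (uint64_t)a->stride
                                  : (uint64_t)a->stride;
    if (a->count == 0)
        return 0;
    return (uint64_t)(a->count - 1) * step + a->elem_bytes;
}
```

For 8 four-byte elements at a stride of 64 bytes, the instruction touches 32 bytes but a single range would span 452, so a tool told only about the range would check 420 bytes the program never accessed.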
>   * *pros:* it supports all instructions without building complicated
>     IR expressions and statements. It executes vector instructions on
>     the host CPU, which gives some acceleration. Besides, we do not
>     need to extend the VEX backend to translate new IR into vector
>     instructions.
>   * *cons:* a dirty helper keeps its operations in a black box, so
>     tools can never see what happens inside it. For memcheck, bit-level
>     precision is lost once it meets a dirty helper, as the V-bit
>     propagation chain falls back to a pretty coarse strategy. It is
>     also not an elegant way to implement an entire ISA extension.
>
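The loss of precision can be illustrated with a toy model, assuming Memcheck's convention that a V bit of 1 means "undefined". `vbits_lanewise` models what a per-lane IR op allows (a lane is undefined only if one of its own inputs is), while `vbits_coarse` models the pessimistic treatment of a black-box helper, where any undefined input bit taints every output. Both functions are illustrative and are not Memcheck code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-lane propagation for an element-wise op on n byte lanes:
   an output lane is undefined iff either of its input lanes has
   any undefined bit. */
static void vbits_lanewise(const uint8_t *va, const uint8_t *vb,
                           uint8_t *vout, size_t n)
{
    for (size_t i = 0; i < n; i++)
        vout[i] = (va[i] | vb[i]) ? 0xFF : 0x00;
}

/* Coarse propagation for a black-box helper: if any input bit is
   undefined, every output lane is marked undefined. */
static void vbits_coarse(const uint8_t *va, const uint8_t *vb,
                         uint8_t *vout, size_t n)
{
    uint8_t any = 0;
    for (size_t i = 0; i < n; i++)
        any |= (uint8_t)(va[i] | vb[i]);
    for (size_t i = 0; i < n; i++)
        vout[i] = any ? 0xFF : 0x00;
}
```

With a single undefined bit in lane 2 of one input, the lane-wise model marks only lane 2 undefined, whereas the coarse model marks all lanes, which can surface as false positives far from the real source of the undefined value.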
> In summary, a truly practical solution for adding vector extensions to 
> Valgrind is still far off. We need detailed and comprehensive 
> evaluations of the different vector instruction categories.
>
> Any feedback is also welcome on GitHub [3].
>
>
> [1] https://github.com/riscv/riscv-v-spec
>
> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>
> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>
>
> Thanks.
>
> Jojo
>
>
>
> _______________________________________________
> Valgrind-developers mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-developers