Re: [Valgrind-developers] RFC: support scalable vector model / riscv vector

Hi,

Any feedback or suggestions on this RFC?

On 2023/4/21 17:25, Jojo R wrote:
>
> Hi,
>
> We are considering adding support for the RVV/Vector [1] extension to
> Valgrind, and there are some challenges.
> Like ARM's SVE [2], RVV uses a scalable (VLA) programming model, which
> means code is vector-length agnostic.
> ARM's SVE is not supported in Valgrind either :(
>
> There are three major issues in implementing the RVV instruction set in
> Valgrind:
>
>  1. Scalable vector register width VLENB
>  2. LMUL and SEW can change at run time
>  3. Lack of proper VEX IR to represent all vector operations
>
> We propose workable methods for 1 and 2. For 3, we explore several
> possible, though perhaps imperfect, approaches for different cases.
>
> We start with 1. Since every guest register must be described in the
> VEXGuestState struct, the vector registers, whose width VLENB is
> scalable, can be added to VEXGuestState as arrays sized for an
> allowable maximum length such as 2048/4096.
>
> The range that is actually accessible can be determined at Valgrind
> startup by querying the CPU for its vector capability, or through some
> other suitable setup step.
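
The layout described above can be modelled roughly as follows. This is only a sketch in Python, not Valgrind code: `GuestVectorState`, `MAX_VLEN_BITS`, and the accessor names are hypothetical, and the maximum of 4096 is read as bits. Storage is always reserved for the maximum length, while reads and writes are limited to the length detected at startup.

```python
MAX_VLEN_BITS = 4096   # assumed compile-time maximum (hypothetical value)
NUM_VREGS = 32         # RVV defines v0..v31


class GuestVectorState:
    """Toy model: max-sized register arrays, runtime-limited access."""

    def __init__(self, vlen_bits):
        # vlen_bits stands in for the value probed from the CPU at startup.
        if vlen_bits % 8 or not 0 < vlen_bits <= MAX_VLEN_BITS:
            raise ValueError("unsupported vector length")
        self.vlenb = vlen_bits // 8  # bytes actually in use per register
        # Storage is always sized for the maximum, like a fixed struct field.
        self.vregs = [bytearray(MAX_VLEN_BITS // 8) for _ in range(NUM_VREGS)]

    def write_reg(self, idx, data):
        if len(data) != self.vlenb:
            raise ValueError("data must be exactly vlenb bytes")
        self.vregs[idx][:self.vlenb] = data

    def read_reg(self, idx):
        # Only the first vlenb bytes are architecturally visible.
        return bytes(self.vregs[idx][:self.vlenb])
```

The trade-off is that the guest state grows with the maximum rather than with the actual hardware length: 32 registers of 4096 bits take 16 KiB regardless of the host.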
>
>
> To solve problem 2, we are inspired by a proven technique in
> QEMU, where translation blocks are broken up when certain critical
> CSRs are set. Because guest-code-to-IR translation relies on the
> precise values of LMUL/SEW, and they may change within a basic block,
> we can end the basic block each time a vsetvl{i} instruction is
> encountered and return to the scheduler, which executes the translated
> code and updates LMUL/SEW. Accordingly, translation cache management
> should be refactored to detect changes of LMUL/SEW and invalidate
> outdated cached code. Without loss of generality, LMUL/SEW
> should be encoded into a ULong flag so that other architectures can
> use the same flag to store their arch-dependent information. The
> TTentry struct should also take the flag into account on both
> insertion and deletion. This way, the flag carries the latest
> LMUL/SEW throughout the simulation and can be passed to the disassembly
> functions through the VEXArchInfo struct, so that translation always
> sees the real, current values of LMUL and SEW.
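
A minimal sketch of this scheme, in Python rather than Valgrind's C, and under stated assumptions: the flag layout in `encode_flag`, the `TranslationCache` class, and the string-based instruction format are all hypothetical and only illustrate the idea. Blocks end right after a vsetvl-family instruction, and cached translations are keyed by the guest address together with the flag.

```python
VSET_MNEMONICS = ("vsetvl", "vsetvli", "vsetivli")


def encode_flag(sew_bits, vlmul_code):
    """Pack SEW and a 3-bit LMUL code into one integer (hypothetical layout)."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported SEW")
    if not 0 <= vlmul_code <= 7:
        raise ValueError("vlmul_code must fit in 3 bits")
    sew_code = sew_bits.bit_length() - 4  # 8 -> 0, 16 -> 1, 32 -> 2, 64 -> 3
    return (sew_code << 3) | vlmul_code


def split_at_vsetvl(instrs):
    """End a block right after every vsetvl-family instruction."""
    blocks, cur = [], []
    for ins in instrs:
        cur.append(ins)
        if ins.split()[0] in VSET_MNEMONICS:
            blocks.append(cur)
            cur = []
    if cur:
        blocks.append(cur)
    return blocks


class TranslationCache:
    """Translations are keyed by (guest address, flag), not address alone."""

    def __init__(self):
        self._entries = {}

    def insert(self, guest_addr, flag, host_code):
        self._entries[(guest_addr, flag)] = host_code

    def lookup(self, guest_addr, flag):
        return self._entries.get((guest_addr, flag))

    def discard(self, guest_addr, flag):
        self._entries.pop((guest_addr, flag), None)
```

With this keying, the same guest address reached under a different LMUL/SEW simply misses in the cache and is retranslated, rather than reusing code generated for stale settings.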
>
> Some architecture-specific code also needs attention. For example, in
> the m_dispatch part, the disp_cp_xindir function looks up the code
> cache using hardcoded assembly that checks only the requested guest
> state IP against the translation cache entry address, with no further
> constraints. Many other modules should be checked to ensure that an
> update of LMUL/SEW is immediately visible to the essential parts of
> Valgrind.
>
>
> The last remaining big issue is 3, for which we introduce some ad-hoc
> approaches. We group these approaches into three types:
>
>  1. Break down a vector instruction to scalar VEX IR ops.
>  2. Break down a vector instruction to fixed-length VEX IR ops.
>  3. Use dirty helpers to realize vector instructions.
>
> The first method is possible in theory but probably not practical,
> because the number of IR ops explodes when a large VLENB is
> used. Imagine a configuration of VLENB=512, SEW=8, LMUL=8: the VL
> is 512 * 8 / 8 = 512, meaning that a single vector instruction turns
> into 512 scalar instructions, each of which would be
> expanded into multiple IRs. To make things worse, tool
> instrumentation will insert more IRs between adjacent scalar IR ops.
> As a result, a real-world application with many vector instructions
> is likely to run a thousand times slower. Therefore, the other two
> methods are more promising, and we discuss them below.
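
The arithmetic above matches the usual maximum element count, VLMAX = LMUL * VLEN / SEW, provided the 512 is read as a register width in bits (in the RVV spec, the vlenb CSR itself holds VLEN/8, i.e. bytes). A small sketch under that assumption; the function name `vlmax` is ours:

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """Maximum number of elements one vector instruction can process.

    vlen_bits: register width in bits; sew_bits: element width in bits;
    lmul: register grouping factor (may be fractional, e.g. Fraction(1, 2)).
    """
    return int(Fraction(lmul) * vlen_bits / sew_bits)


# The configuration from the text: one instruction covers 512 elements,
# so a scalar expansion needs at least 512 element-level operations.
elements = vlmax(512, 8, 8)
```

Each of those element-level operations would then expand into several IR statements plus tool instrumentation, which is where the blow-up comes from.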
>
> Methods 2 and 3 are not mutually exclusive: for each vector
> instruction we may choose whichever suits its concrete
> behavior. To explain these methods in detail, we present some
> examples that illustrate their pros and cons.
>
> For method 2, we know the real values of VLENB/LMUL/SEW. The
> simple case is VLENB <= 256 and LMUL=1, where many existing SIMD IR
> ops can be applied directly to represent vector operations.
> However, even when VLENB is restricted to 128, a register group still
> exceeds the maximum SIMD width of 256 supported by VEX IR if LMUL>2.
> Hence, there are two variants of method 2 to deal with long vectors:
>
>
> *2.1* Add wider SIMD IR ops, such as 1024/2048/4096, and translate
> vector instructions at the granularity of VLENB. For example,
> VLENB=4096 with LMUL=2 is implemented by two 4096 SIMD VEX IR ops.
>
>   * *pros*: it encourages the VEX backend to generate more compact and
>     efficient SIMD code (maybe). In particular, it accommodates mask
>     and gather/scatter (indexed) instructions by carrying more
>     information in the IR itself.
>   * *cons*: too many new IR ops need to be introduced in VEX, as each
>     op length needs its own add/sub/mul variants.
>     New data types to denote long vectors are necessary too, causing
>     difficulties in both VEX backend register allocation and tool
>     instrumentation.
>
> *2.2* Break long vectors down into multiple repeated SIMD ops. For
> instance, a vadd.vv vector instruction with VLENB=256/LMUL=2/SEW=8 is
> composed of four operations of type Iop_Add8x16.
>
>   * *pros:* less effort is required in register allocation and tool
>     instrumentation. The VEX frontend can tell the backend to
>     generate efficient vector instructions using existing Iops. It
>     strikes a better balance between the complexity of adding many
>     long vector IR ops and the benefit of generating efficient host
>     code.
>   * *cons:* it is hard to describe a mask operation given that the mask
>     is pretty flexible (the least significant bit of each segment of
>     v0). Additionally, gather/scatter instructions may have similar
>     problems in dividing index registers appropriately. Various corner
>     cases remain, such as widening arithmetic operations (the widening
>     SIMD IR ops are currently not compatible) and the vstart CSR. When
>     composing a vector instruction from fixed-length IR ops, we
>     inevitably have to tell each IR op the position, encoded in
>     vstart, from which it may start processing data. We can treat
>     vstart as a normal guest state virtual register and compute each
>     op's start position as a guard IRExpr, or obtain the value of
>     vstart the same way we do for LMUL/SEW. Nevertheless, it is
>     non-trivial to decompose a vector instruction concisely.
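
To make the 2.2 decomposition and the vstart issue concrete, here is a rough Python sketch. It assumes integer LMUL, 128-bit chunks (the width of an op like Iop_Add8x16), widths given in bits, and no masking; the function name and return format are hypothetical.

```python
def split_into_chunks(vlen_bits, lmul, sew_bits, vstart=0, vl=None,
                      chunk_bits=128):
    """Split a register group into fixed-width chunks.

    Returns (chunk_index, first_active_elem, end_active_elem) tuples for
    every chunk that overlaps the body elements [vstart, vl). Chunks that
    lie entirely outside that range are omitted.
    """
    total_bits = vlen_bits * lmul
    elems_per_chunk = chunk_bits // sew_bits
    n_chunks = total_bits // chunk_bits
    if vl is None:
        vl = total_bits // sew_bits  # default to the maximum element count
    chunks = []
    for i in range(n_chunks):
        first = i * elems_per_chunk
        lo = max(first, vstart)               # clip to vstart
        hi = min(first + elems_per_chunk, vl) # clip to vl
        if lo < hi:
            chunks.append((i, lo, hi))
    return chunks
```

With VLENB=256, LMUL=2, SEW=8 and the default vstart/vl, this yields four full chunks, matching the four Iop_Add8x16 ops in the example. With vstart=20 and vl=40, only chunks 1 and 2 remain and both are partial, which is exactly the case that needs a guard or a masked merge per op.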
>
> In short, both 2.1 and 2.2 face a dilemma between keeping the
> engineering effort of refactoring Valgrind manageable and implementing
> the vector instruction set efficiently. The same obstacles exist for
> ARM SVE, as its instructions are also scalable and flexible in many ways.
>
> The final solution is the dirty helper. It is undoubtedly practical
> and probably requires the least engineering effort, given how many
> details in Valgrind must be dealt with. In this design, each
> instruction is implemented by inline assembly that runs the same
> instruction on the host. Moreover, tool instrumentation already
> handles IRDirty, except that new fields should be added to the
> _IRDirty struct to describe strided/indexed/masked memory accesses
> and arithmetic operations.
>
>   * *pros:* it supports all instructions without having to build
>     complicated IR expressions and statements. It executes vector
>     instructions on the host CPU, which gives some acceleration.
>     Besides, we do not need to extend the VEX backend to translate
>     new IRs to vector instructions.
>   * *cons:* a dirty helper keeps its operations in a black box, so
>     tools can never see what happens inside it. For memcheck, for
>     example, bit-level precision is lost once it meets a dirty
>     helper, because V-bit propagation then falls back to a pretty
>     coarse strategy. Moreover, implementing an entire ISA extension
>     in dirty helpers is not an elegant approach.
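
The precision loss can be illustrated with a toy model. This is a simplification, not memcheck's actual algorithm: definedness is tracked per lane instead of per bit, and both function names are hypothetical. For an element-wise operation expressed in IR, an undefined input lane taints only the matching output lane, whereas a pessimistic black-box treatment marks every output lane undefined as soon as any input is undefined.

```python
def shadow_lanewise(va, vb):
    """Per-lane propagation; True means the lane holds undefined data."""
    return [a or b for a, b in zip(va, vb)]


def shadow_dirty_helper(va, vb):
    """Black-box propagation: any undefined input taints all outputs."""
    any_undef = any(va) or any(vb)
    return [any_undef] * len(va)


# One undefined lane in the first operand of a 4-lane operation.
va = [False, True, False, False]
vb = [False, False, False, False]
precise = shadow_lanewise(va, vb)     # only lane 1 is flagged
coarse = shadow_dirty_helper(va, vb)  # all four lanes are flagged
```

In practice this kind of over-approximation would show up as extra uninitialised-value reports on lanes that are in fact well defined.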
>
> In summary, we are still far from a truly applicable solution for
> adding vector extensions to Valgrind. We need detailed and
> comprehensive evaluations of the different vector instruction
> categories.
>
> Any feedback is also welcome on GitHub [3].
>
>
> [1] https://github.com/riscv/riscv-v-spec
>
> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>
> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>
>
> Thanks.
>
> Jojo
>
>
>