From: Jojo R <rj...@li...> - 2023-08-04 06:04:16
Hi,

We are glad to open source our RVV implementation here:
https://github.com/rjiejie/valgrind-riscv64

Three extra ISA extensions were added in this repo:

  RV64Zfh        : Half-precision floating-point
  RV64Xthead [1] : T-Head vendor extension for RV64G
  RV64V0p7 [2]   : Vector 0.7.1
  RV64V          : Vector 1.x, coming soon :)

[1] https://github.com/T-head-Semi/thead-extension-spec
[2] https://github.com/riscv/riscv-v-spec/releases/tag/0.7.1

Regards

--Jojo

On 2023/7/17 15:05, Jojo R wrote:
>
> Hi,
>
> Sorry for the late reply; I have been busy pushing the Valgrind RVV
> implementation forward 😄
> We finished the first version and tested it against the full RVV
> intrinsics spec.
>
> For real projects and developers, we implemented the first usable,
> fully functional RVV Valgrind using the dirty-call method, and we
> will experiment with and optimize the RVV implementation toward an
> ideal RVV design.
>
> Back to the RVV RFC, we are happy to share our design thinking; see
> the attachment for more details :)
>
> Regards
>
> --Jojo
>
> On 2023/4/21 17:25, Jojo R wrote:
>>
>> Hi,
>>
>> We are considering adding the RVV/Vector [1] feature to Valgrind,
>> and there are some challenges. RVV has a programming model similar
>> to ARM's SVE [2]: it is scalable (VLA), meaning the code is vector
>> length agnostic. ARM's SVE is not supported in Valgrind either :(
>>
>> There are three major issues in implementing the RVV instruction
>> set in Valgrind:
>>
>> 1. Scalable vector register width VLENB
>> 2. Runtime-changing property of LMUL and SEW
>> 3. Lack of proper VEX IR to represent all vector operations
>>
>> We propose applicable methods to solve 1 and 2. As for 3, we
>> explore several possible but perhaps imperfect approaches that
>> handle different cases.
>>
>> We start with 1. As each guest register must be described in the
>> VEXGuestState struct, the vector registers with scalable width
>> VLENB can be added to VEXGuestState as arrays of an allowed
>> maximum length such as 2048/4096 (see the first sketch below). The
>> actually accessible range can be determined at Valgrind startup by
>> querying the CPU for its vector capability or through some other
>> suitable setup step.
>>
>> To solve problem 2, we take inspiration from already-proven
>> techniques in QEMU, where translation blocks are broken up when
>> certain critical CSRs are set. Because guest-code-to-IR
>> translation relies on the precise values of LMUL/SEW, and these
>> may change within a basic block, we can break up the basic block
>> each time a vsetvl{i} instruction is encountered and return to the
>> scheduler to execute the translated code and update LMUL/SEW.
>> Accordingly, translation cache management must be refactored to
>> detect changes of LMUL/SEW and invalidate outdated code cache.
>> Without loss of generality, LMUL/SEW should be encoded into a
>> ULong flag so that other architectures can reuse the flag to store
>> their own arch-dependent information (see the second sketch
>> below). The TTEntry struct must also take the flag into account on
>> both insertion and deletion. This way, the flag carries the newest
>> LMUL/SEW throughout the simulation and can be passed to the
>> disassembly functions via the VEXArchInfo struct, so translation
>> always sees the real, up-to-date values of LMUL and SEW.
>>
>> Some architecture-specific code needs attention as well. In the
>> m_dispatch part, for example, the disp_cp_xindir function looks up
>> the code cache in hard-coded assembly by checking only the
>> requested guest-state IP against the translation cache entry
>> address, with no further constraints. Many other modules must be
>> checked to ensure that an in-time update of LMUL/SEW is instantly
>> visible to all essential parts of Valgrind.
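To make the fixed-maximum-width idea (the first proposal above)
concrete, here is a minimal sketch of how the vector file could sit
in the guest state. All names below (HOST_VLEN_MAX_BITS, the guest_v*
fields, the struct name) are illustrative assumptions, not the actual
layout used in the repo:

    #include <stdint.h>

    /* Sketch only.  VEX would use its own UChar/ULong typedefs;
       stdint stands in here so the sketch is self-contained. */
    #define HOST_VLEN_MAX_BITS 4096                     /* allowed maximum */
    #define HOST_VLENB_MAX     (HOST_VLEN_MAX_BITS / 8)

    typedef struct {
       /* ... existing scalar guest state: x0..x31, f0..f31, pc ... */
       uint8_t  guest_v[32][HOST_VLENB_MAX]; /* v0..v31 at max width  */
       uint64_t guest_vl;                    /* current VL            */
       uint64_t guest_vtype;                 /* current SEW/LMUL etc. */
       uint64_t guest_vstart;
       uint64_t guest_vlenb;                 /* real VLENB, probed at
                                                startup (e.g. by
                                                reading the vlenb CSR) */
    } SketchGuestRISCV64State;

Only the first guest_vlenb bytes of each guest_v row would ever be
live; tools would see a fixed-size array regardless of host VLEN.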
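And a minimal sketch of the ULong flag idea (the second proposal);
the bit layout and helper names are assumptions for illustration
only:

    #include <stdint.h>

    /* Pack the translation-relevant dynamic state (the SEW and LMUL
       codes from vtype) into one 64-bit flag; the remaining bits stay
       free for other architectures' own information. */
    #define VEXFLAG_SEW_SHIFT  0  /* 3 bits: SEW = 8 << sew_code */
    #define VEXFLAG_LMUL_SHIFT 3  /* 3 bits: LMUL code           */

    static inline uint64_t vex_flag_make(uint64_t sew_code,
                                         uint64_t lmul_code) {
       return (sew_code << VEXFLAG_SEW_SHIFT) |
              (lmul_code << VEXFLAG_LMUL_SHIFT);
    }

    /* A cached translation is reusable only if both the guest address
       and the flag it was generated under match the request. */
    static inline int tt_entry_matches(uint64_t entry_addr,
                                       uint64_t entry_flag,
                                       uint64_t req_addr,
                                       uint64_t req_flag) {
       return entry_addr == req_addr && entry_flag == req_flag;
    }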
>>
>> The last remaining big issue is 3, for which we introduce some
>> ad-hoc approaches. We group them into three types:
>>
>> 1. Break a vector instruction down into scalar VEX IR ops.
>> 2. Break a vector instruction down into fixed-length VEX IR ops.
>> 3. Use dirty helpers to realize vector instructions.
>>
>> The first method theoretically works but is probably not
>> practical, as the number of IR ops explodes when a large VLENB is
>> adopted. Imagine a configuration of VLENB=512, SEW=8, LMUL=8: the
>> VL is 512 * 8 / 8 = 512, meaning a single vector instruction turns
>> into 512 scalar operations, and each scalar operation is in turn
>> expanded into multiple IR statements. To make things worse, tool
>> instrumentation inserts additional IR between adjacent scalar IR
>> ops. As a result, a real-world application with many vector
>> instructions could easily run thousands of times slower. The other
>> two methods are therefore more promising, and we discuss them
>> below.
>>
>> Methods 2 and 3 are not mutually exclusive: we may choose the more
>> suitable of the two for each vector instruction according to its
>> concrete behavior. To explain the methods in detail, we present
>> some instances illustrating their pros and cons.
>>
>> For method 2, we have the real values of VLENB/LMUL/SEW. The
>> simple case is VLENB <= 256 and LMUL=1, where many SIMD IR ops are
>> already available and can directly represent the vector
>> operations. However, even when VLENB is restricted to 128, the
>> total width exceeds the maximum SIMD width of 256 supported by VEX
>> IR whenever LMUL > 2. Hence, here are two variants of method 2 for
>> long vectors:
>>
>> 2.1 Add more SIMD IR ops of width 1024/2048/4096 and translate
>> vector instructions at the granularity of VLENB. For example,
>> VLENB=4096 with LMUL=2 is fulfilled by two 4096-bit SIMD VEX IR
>> ops.
>>
>> * pros: it encourages the VEX backend to generate more compact and
>>   efficient SIMD code (maybe). In particular, it accommodates mask
>>   and gather/scatter (indexed) instructions by carrying more
>>   information in the IR itself.
>> * cons: too many new IR ops would have to be introduced in VEX, as
>>   each op width needs its own add/sub/mul variants. New data types
>>   for long vectors are also necessary, causing difficulties in
>>   both VEX backend register allocation and tool instrumentation.
>>
>> 2.2 Break long vectors down into multiple repeated SIMD ops. For
>> instance, a vadd.vv instruction with VLENB=256/LMUL=2/SEW=8 is
>> composed of four Iop_Add8x16 operations (see the sketch after this
>> list).
>>
>> * pros: less effort is required in register allocation and tool
>>   instrumentation. The VEX frontend can still prompt the backend
>>   to generate efficient vector instructions through existing IR
>>   ops. It better trades off the complexity of adding many
>>   long-vector IR ops against the benefit of generating
>>   high-efficiency host code.
>> * cons: it is hard to describe a mask operation, given that the
>>   mask is quite flexible (the least significant bit of each
>>   segment of v0). Additionally, gather/scatter instructions pose
>>   similar problems in appropriately dividing the index registers.
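The chunk-wise lowering of 2.2 could look roughly like the following
sketch. IRExpr_Get/IRExpr_Binop/IRStmt_Put, addStmtToIRSB and
Iop_Add8x16 are real VEX IR constructors; offsetVReg128() is a
hypothetical helper that maps (vector register, 128-bit chunk index)
to a guest-state offset and is assumed to handle LMUL register
grouping (with LMUL=2, chunks 2..3 fall into register vd+1):

    #include "libvex_ir.h"  /* IRSB, IRExpr_*, IRStmt_*, Iop_Add8x16 */

    extern Int offsetVReg128(UInt vreg, UInt chunk); /* assumed helper */

    /* Lower `vadd.vv vd, vs2, vs1` with VLENB=256, LMUL=2, SEW=8
       into four 128-bit adds over the register group. */
    static void emit_vadd_vv_fixed(IRSB* irsb, UInt vd, UInt vs2,
                                   UInt vs1)
    {
       UInt totalBits = 256 /* VLENB */ * 2 /* LMUL */;
       UInt nChunks   = totalBits / 128;               /* = 4 */
       for (UInt i = 0; i < nChunks; i++) {
          IRExpr* a   = IRExpr_Get(offsetVReg128(vs2, i), Ity_V128);
          IRExpr* b   = IRExpr_Get(offsetVReg128(vs1, i), Ity_V128);
          IRExpr* sum = IRExpr_Binop(Iop_Add8x16, a, b);
          addStmtToIRSB(irsb, IRStmt_Put(offsetVReg128(vd, i), sum));
       }
    }

Note that masking and vstart handling (discussed below) are exactly
what this simple loop does not capture.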
>> Various corner cases remain, such as widening arithmetic
>> operations (the existing widening SIMD IR ops are not compatible)
>> and the vstart CSR. When using fixed-length IR ops to compose a
>> vector instruction, we inevitably have to tell each IR op the
>> position, encoded in vstart, from which it may start processing
>> data. We can treat vstart as a normal guest-state virtual register
>> and use it to compute each op's start position as a guard IRExpr,
>> or obtain the value of vstart the same way we do for LMUL/SEW.
>> Either way, it is non-trivial to decompose a vector instruction
>> concisely.
>>
>> In short, both 2.1 and 2.2 face a dilemma between reducing the
>> engineering effort of refactoring Valgrind elegantly and
>> implementing the vector instruction set efficiently. The same
>> obstacles exist for ARM SVE, since SVE instructions are likewise
>> scalable vectors and flexible in many ways.
>>
>> The final option is the dirty helper. It is undoubtedly practical
>> and probably requires the least engineering effort, sidestepping
>> many details inside Valgrind. In this design, each instruction is
>> completed by inline assembly running the same instruction on the
>> host (a sketch appears at the end of this message). Moreover, tool
>> instrumentation already handles IRDirty, except that new fields
>> would be needed in the _IRDirty struct to describe
>> strided/indexed/masked memory accesses and arithmetic operations.
>>
>> * pros: it supports all instructions without the trouble of
>>   building complicated IR expressions and statements. It executes
>>   vector instructions on the host CPU, gaining some acceleration.
>>   Besides, we do not need to extend the VEX backend to translate
>>   new IR ops into vector instructions.
>> * cons: a dirty helper keeps its operations in a black box, so
>>   tools can never see what happens inside it. In Memcheck, for
>>   example, the bit-precision merit is lost once a dirty helper is
>>   involved, because the V-bit propagation chain then falls back to
>>   a rather coarse strategy. It is also simply not an elegant way
>>   to implement an entire ISA extension.
>>
>> In summary, we are still far from a truly applicable solution for
>> adding scalable vector extensions to Valgrind. We need detailed
>> and comprehensive evaluation across the different vector
>> instruction categories.
>>
>> Any feedback is welcome, on GitHub [3] as well.
>>
>> [1] https://github.com/riscv/riscv-v-spec
>>
>> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>>
>> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>>
>> Thanks.
>>
>> Jojo
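To make the dirty-helper proposal concrete, here is a minimal sketch
for `vadd.vv` at SEW=8, assuming the host CPU itself implements RVV
1.0. The helper name and calling convention are invented for
illustration, and a real implementation would additionally have to
save and restore the host's vector context (v8..v23 and vl/vtype are
clobbered here):

    #include <stdint.h>

    /* Called via an IRDirty statement; the pointers reference the
       guest-state copies of the vector register group. */
    void rvv_dirty_vadd_vv_e8(uint8_t* vd, const uint8_t* vs2,
                              const uint8_t* vs1, uint64_t vl)
    {
       uint64_t done = 0;
       while (done < vl) {            /* strip-mine over the host VL */
          uint64_t n;
          __asm__ volatile(
             "vsetvli %0, %1, e8, m8, ta, ma \n\t"
             "vle8.v  v8,  (%2)              \n\t"
             "vle8.v  v16, (%3)              \n\t"
             "vadd.vv v8, v8, v16            \n\t"
             "vse8.v  v8,  (%4)              \n\t"
             : "=&r"(n)
             : "r"(vl - done), "r"(vs2 + done),
               "r"(vs1 + done), "r"(vd + done)
             : "memory");
          done += n;
       }
    }

Strip-mining keeps the helper correct even when the guest VL exceeds
what the host's own VLEN can process in one go.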