From: Jojo R <rj...@li...> - 2023-08-04 06:04:16
Hi,

We are glad to open source our RVV implementation here:
https://github.com/rjiejie/valgrind-riscv64

Three extra ISA extensions were added in this repo:

  RV64Zfh        : Half-precision floating-point
  RV64Xthead [1] : T-Head vendor extension for RV64G
  RV64V0p7 [2]   : Vector 0.7.1
  RV64V          : Vector 1.x, coming soon :)

[1] https://github.com/T-head-Semi/thead-extension-spec
[2] https://github.com/riscv/riscv-v-spec/releases/tag/0.7.1

Regards

--Jojo

On 2023/7/17 15:05, Jojo R wrote:
>
> Hi,
>
> Sorry for the late reply; I have been busy pushing the Valgrind RVV
> implementation forward 😄
> We finished the first version and tested it against the full RVV
> intrinsics spec.
>
> For real projects and developers, we implemented the first usable,
> fully functional RVV Valgrind using the dirty-call method, and we
> will experiment with and optimize the RVV implementation toward an
> ideal RVV design.
>
> Back to the RVV RFC, we are happy to share our design thinking; see
> the attachment for more details :)
>
> Regards
>
> --Jojo
>
> On 2023/4/21 17:25, Jojo R wrote:
>>
>> Hi,
>>
>> We are considering adding the RVV/Vector [1] feature to Valgrind,
>> and there are some challenges. RVV has a programming model similar
>> to ARM's SVE [2]: it is scalable (VLA), meaning the code is vector
>> length agnostic. ARM's SVE is not supported in Valgrind either :(
>>
>> There are three major issues in implementing the RVV instruction
>> set in Valgrind:
>>
>> 1. Scalable vector register width VLENB
>> 2. Runtime-changing property of LMUL and SEW
>> 3. Lack of proper VEX IR to represent all vector operations
>>
>> We propose applicable methods to solve 1 and 2. As for 3, we
>> explore several possible but perhaps imperfect approaches that
>> handle different cases.
>>
>> We start with 1. As each guest register must be described in the
>> VEXGuestState struct, the vector registers with scalable width
>> VLENB can be added to VEXGuestState as arrays of an allowed
>> maximum length such as 2048/4096 (see the first sketch below). The
>> actually accessible range can be determined at Valgrind startup by
>> querying the CPU for its vector capability or through some other
>> suitable setup step.
>>
>> To solve problem 2, we take inspiration from already-proven
>> techniques in QEMU, where translation blocks are broken up when
>> certain critical CSRs are set. Because guest-code-to-IR
>> translation relies on the precise values of LMUL/SEW, and these
>> may change within a basic block, we can break up the basic block
>> each time a vsetvl{i} instruction is encountered and return to the
>> scheduler to execute the translated code and update LMUL/SEW.
>> Accordingly, translation cache management must be refactored to
>> detect changes of LMUL/SEW and invalidate outdated code cache.
>> Without loss of generality, LMUL/SEW should be encoded into a
>> ULong flag so that other architectures can reuse the flag to store
>> their own arch-dependent information (see the second sketch
>> below). The TTEntry struct must also take the flag into account on
>> both insertion and deletion. This way, the flag carries the newest
>> LMUL/SEW throughout the simulation and can be passed to the
>> disassembly functions via the VEXArchInfo struct, so translation
>> always sees the real, up-to-date values of LMUL and SEW.
>>
>> Some architecture-specific code needs attention as well. In the
>> m_dispatch part, for example, the disp_cp_xindir function looks up
>> the code cache in hard-coded assembly by checking only the
>> requested guest-state IP against the translation cache entry
>> address, with no further constraints. Many other modules must be
>> checked to ensure that an in-time update of LMUL/SEW is instantly
>> visible to all essential parts of Valgrind.
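To make the fixed-maximum-width idea (the first proposal above)
concrete, here is a minimal sketch of how the vector file could sit
in the guest state. All names below (HOST_VLEN_MAX_BITS, the guest_v*
fields, the struct name) are illustrative assumptions, not the actual
layout used in the repo:

    #include <stdint.h>

    /* Sketch only.  VEX would use its own UChar/ULong typedefs;
       stdint stands in here so the sketch is self-contained. */
    #define HOST_VLEN_MAX_BITS 4096                     /* allowed maximum */
    #define HOST_VLENB_MAX     (HOST_VLEN_MAX_BITS / 8)

    typedef struct {
       /* ... existing scalar guest state: x0..x31, f0..f31, pc ... */
       uint8_t  guest_v[32][HOST_VLENB_MAX]; /* v0..v31 at max width  */
       uint64_t guest_vl;                    /* current VL            */
       uint64_t guest_vtype;                 /* current SEW/LMUL etc. */
       uint64_t guest_vstart;
       uint64_t guest_vlenb;                 /* real VLENB, probed at
                                                startup (e.g. by
                                                reading the vlenb CSR) */
    } SketchGuestRISCV64State;

Only the first guest_vlenb bytes of each guest_v row would ever be
live; tools would see a fixed-size array regardless of host VLEN.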
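And a minimal sketch of the ULong flag idea (the second proposal);
the bit layout and helper names are assumptions for illustration
only:

    #include <stdint.h>

    /* Pack the translation-relevant dynamic state (the SEW and LMUL
       codes from vtype) into one 64-bit flag; the remaining bits stay
       free for other architectures' own information. */
    #define VEXFLAG_SEW_SHIFT  0  /* 3 bits: SEW = 8 << sew_code */
    #define VEXFLAG_LMUL_SHIFT 3  /* 3 bits: LMUL code           */

    static inline uint64_t vex_flag_make(uint64_t sew_code,
                                         uint64_t lmul_code) {
       return (sew_code << VEXFLAG_SEW_SHIFT) |
              (lmul_code << VEXFLAG_LMUL_SHIFT);
    }

    /* A cached translation is reusable only if both the guest address
       and the flag it was generated under match the request. */
    static inline int tt_entry_matches(uint64_t entry_addr,
                                       uint64_t entry_flag,
                                       uint64_t req_addr,
                                       uint64_t req_flag) {
       return entry_addr == req_addr && entry_flag == req_flag;
    }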
>>
>> The last remaining big issue is 3, for which we introduce some
>> ad-hoc approaches. We group them into three types:
>>
>> 1. Break a vector instruction down into scalar VEX IR ops.
>> 2. Break a vector instruction down into fixed-length VEX IR ops.
>> 3. Use dirty helpers to realize vector instructions.
>>
>> The first method theoretically works but is probably not
>> practical, as the number of IR ops explodes when a large VLENB is
>> adopted. Imagine a configuration of VLENB=512, SEW=8, LMUL=8: the
>> VL is 512 * 8 / 8 = 512, meaning a single vector instruction turns
>> into 512 scalar operations, and each scalar operation is in turn
>> expanded into multiple IR statements. To make things worse, tool
>> instrumentation inserts additional IR between adjacent scalar IR
>> ops. As a result, a real-world application with many vector
>> instructions could easily run thousands of times slower. The other
>> two methods are therefore more promising, and we discuss them
>> below.
>>
>> Methods 2 and 3 are not mutually exclusive: we may choose the more
>> suitable of the two for each vector instruction according to its
>> concrete behavior. To explain the methods in detail, we present
>> some instances illustrating their pros and cons.
>>
>> For method 2, we have the real values of VLENB/LMUL/SEW. The
>> simple case is VLENB <= 256 and LMUL=1, where many SIMD IR ops are
>> already available and can directly represent the vector
>> operations. However, even when VLENB is restricted to 128, the
>> total width exceeds the maximum SIMD width of 256 supported by VEX
>> IR whenever LMUL > 2. Hence, here are two variants of method 2 for
>> long vectors:
>>
>> 2.1 Add more SIMD IR ops of width 1024/2048/4096 and translate
>> vector instructions at the granularity of VLENB. For example,
>> VLENB=4096 with LMUL=2 is fulfilled by two 4096-bit SIMD VEX IR
>> ops.
>>
>> * pros: it encourages the VEX backend to generate more compact and
>>   efficient SIMD code (maybe). In particular, it accommodates mask
>>   and gather/scatter (indexed) instructions by carrying more
>>   information in the IR itself.
>> * cons: too many new IR ops would have to be introduced in VEX, as
>>   each op width needs its own add/sub/mul variants. New data types
>>   for long vectors are also necessary, causing difficulties in
>>   both VEX backend register allocation and tool instrumentation.
>>
>> 2.2 Break long vectors down into multiple repeated SIMD ops. For
>> instance, a vadd.vv instruction with VLENB=256/LMUL=2/SEW=8 is
>> composed of four Iop_Add8x16 operations (see the sketch after this
>> list).
>>
>> * pros: less effort is required in register allocation and tool
>>   instrumentation. The VEX frontend can still prompt the backend
>>   to generate efficient vector instructions through existing IR
>>   ops. It better trades off the complexity of adding many
>>   long-vector IR ops against the benefit of generating
>>   high-efficiency host code.
>> * cons: it is hard to describe a mask operation, given that the
>>   mask is quite flexible (the least significant bit of each
>>   segment of v0). Additionally, gather/scatter instructions pose
>>   similar problems in appropriately dividing the index registers.
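The chunk-wise lowering of 2.2 could look roughly like the following
sketch. IRExpr_Get/IRExpr_Binop/IRStmt_Put, addStmtToIRSB and
Iop_Add8x16 are real VEX IR constructors; offsetVReg128() is a
hypothetical helper that maps (vector register, 128-bit chunk index)
to a guest-state offset and is assumed to handle LMUL register
grouping (with LMUL=2, chunks 2..3 fall into register vd+1):

    #include "libvex_ir.h"  /* IRSB, IRExpr_*, IRStmt_*, Iop_Add8x16 */

    extern Int offsetVReg128(UInt vreg, UInt chunk); /* assumed helper */

    /* Lower `vadd.vv vd, vs2, vs1` with VLENB=256, LMUL=2, SEW=8
       into four 128-bit adds over the register group. */
    static void emit_vadd_vv_fixed(IRSB* irsb, UInt vd, UInt vs2,
                                   UInt vs1)
    {
       UInt totalBits = 256 /* VLENB */ * 2 /* LMUL */;
       UInt nChunks   = totalBits / 128;               /* = 4 */
       for (UInt i = 0; i < nChunks; i++) {
          IRExpr* a   = IRExpr_Get(offsetVReg128(vs2, i), Ity_V128);
          IRExpr* b   = IRExpr_Get(offsetVReg128(vs1, i), Ity_V128);
          IRExpr* sum = IRExpr_Binop(Iop_Add8x16, a, b);
          addStmtToIRSB(irsb, IRStmt_Put(offsetVReg128(vd, i), sum));
       }
    }

Note that masking and vstart handling (discussed below) are exactly
what this simple loop does not capture.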
>> Various corner cases remain, such as widening arithmetic
>> operations (the existing widening SIMD IR ops are not compatible)
>> and the vstart CSR. When using fixed-length IR ops to compose a
>> vector instruction, we inevitably have to tell each IR op the
>> position, encoded in vstart, from which it may start processing
>> data. We can treat vstart as a normal guest-state virtual register
>> and use it to compute each op's start position as a guard IRExpr,
>> or obtain the value of vstart the same way we do for LMUL/SEW.
>> Either way, it is non-trivial to decompose a vector instruction
>> concisely.
>>
>> In short, both 2.1 and 2.2 face a dilemma between reducing the
>> engineering effort of refactoring Valgrind elegantly and
>> implementing the vector instruction set efficiently. The same
>> obstacles exist for ARM SVE, since SVE instructions are likewise
>> scalable vectors and flexible in many ways.
>>
>> The final option is the dirty helper. It is undoubtedly practical
>> and probably requires the least engineering effort, sidestepping
>> many details inside Valgrind. In this design, each instruction is
>> completed by inline assembly running the same instruction on the
>> host (a sketch appears at the end of this message). Moreover, tool
>> instrumentation already handles IRDirty, except that new fields
>> would be needed in the _IRDirty struct to describe
>> strided/indexed/masked memory accesses and arithmetic operations.
>>
>> * pros: it supports all instructions without the trouble of
>>   building complicated IR expressions and statements. It executes
>>   vector instructions on the host CPU, gaining some acceleration.
>>   Besides, we do not need to extend the VEX backend to translate
>>   new IR ops into vector instructions.
>> * cons: a dirty helper keeps its operations in a black box, so
>>   tools can never see what happens inside it. In Memcheck, for
>>   example, the bit-precision merit is lost once a dirty helper is
>>   involved, because the V-bit propagation chain then falls back to
>>   a rather coarse strategy. It is also simply not an elegant way
>>   to implement an entire ISA extension.
>>
>> In summary, we are still far from a truly applicable solution for
>> adding scalable vector extensions to Valgrind. We need detailed
>> and comprehensive evaluation across the different vector
>> instruction categories.
>>
>> Any feedback is welcome, on GitHub [3] as well.
>>
>> [1] https://github.com/riscv/riscv-v-spec
>>
>> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>>
>> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>>
>> Thanks.
>>
>> Jojo
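To make the dirty-helper proposal concrete, here is a minimal sketch
for `vadd.vv` at SEW=8, assuming the host CPU itself implements RVV
1.0. The helper name and calling convention are invented for
illustration, and a real implementation would additionally have to
save and restore the host's vector context (v8..v23 and vl/vtype are
clobbered here):

    #include <stdint.h>

    /* Called via an IRDirty statement; the pointers reference the
       guest-state copies of the vector register group. */
    void rvv_dirty_vadd_vv_e8(uint8_t* vd, const uint8_t* vs2,
                              const uint8_t* vs1, uint64_t vl)
    {
       uint64_t done = 0;
       while (done < vl) {            /* strip-mine over the host VL */
          uint64_t n;
          __asm__ volatile(
             "vsetvli %0, %1, e8, m8, ta, ma \n\t"
             "vle8.v  v8,  (%2)              \n\t"
             "vle8.v  v16, (%3)              \n\t"
             "vadd.vv v8, v8, v16            \n\t"
             "vse8.v  v8,  (%4)              \n\t"
             : "=&r"(n)
             : "r"(vl - done), "r"(vs2 + done),
               "r"(vs1 + done), "r"(vd + done)
             : "memory");
          done += n;
       }
    }

Strip-mining keeps the helper correct even when the guest VL exceeds
what the host's own VLEN can process in one go.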