From: Jojo R <rj...@li...> - 2023-04-21 10:06:23
Hi,

We are considering adding the RVV/Vector [1] feature to Valgrind, and there are some challenges. RVV follows the same programming model as ARM's SVE [2]: it is scalable/VLA, meaning the vector length is agnostic. ARM's SVE is not supported in Valgrind either :(

There are three major issues in implementing the RVV instruction set in Valgrind:

1. Scalable vector register width (VLENB)
2. The runtime-changing property of LMUL and SEW
3. Lack of proper VEX IR to represent all vector operations

We propose applicable methods to solve 1 and 2. As for 3, we explore several possible but perhaps imperfect approaches to handle different cases.

We start with 1. As each guest register should be described in the VEXGuestState struct, the vector registers with a scalable width of VLENB can be added to VEXGuestState as arrays with an allowable maximum length such as 2048/4096. The actually accessible range can be determined at Valgrind startup by querying the CPU for its vector capability, or through some other suitable setup step.

To solve problem 2, we take inspiration from proven techniques in QEMU, where translation blocks are broken up when certain critical CSRs are set. Because the guest-code-to-IR translation relies on the precise values of LMUL/SEW, and they may change within a basic block, we can break up the basic block each time a vsetvl{i} instruction is encountered and return to the scheduler to execute the translated code and update LMUL/SEW. Accordingly, translation cache management should be refactored to detect changes of LMUL/SEW and invalidate outdated code cache. To keep this general, LMUL/SEW should be encoded into a ULong flag, so that other architectures can leverage the same flag to store their own arch-dependent information. The TTEntry struct should also take the flag into account on both insertion and deletion. By doing this, the flag carries the newest LMUL/SEW throughout the simulation and can be passed to the disassembly functions via the VEXArchInfo struct, so that translation always sees the real, up-to-date values of LMUL and SEW. Some architecture-dependent code also needs care: in the m_dispatch part, the disp_cp_xindir function looks up the code cache in hand-written assembly, checking only the requested guest IP against the translation cache entry address with no further constraints. Many other modules should be checked to ensure that an in-time update of LMUL/SEW is instantly visible to all essential parts of Valgrind.
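For concreteness, here is a minimal sketch of the guest-state change for issue 1 (all names are invented for illustration; this is not existing Valgrind code). The registers are declared at the allowable maximum width, and only the first guest_vlenb bytes of each, probed at startup, are ever accessed:

    /* Sketch only -- hypothetical names, not existing Valgrind code. */
    #define VEX_RVV_VLEN_MAX_BITS  4096                         /* allowable maximum VLEN */
    #define VEX_RVV_VLENB_MAX      (VEX_RVV_VLEN_MAX_BITS / 8)  /* ... in bytes */

    typedef struct {
       /* ... existing scalar guest registers ... */

       /* v0..v31, each padded to the maximum width.  Only the first
          guest_vlenb bytes of each entry are used. */
       unsigned char      guest_v[32][VEX_RVV_VLENB_MAX];
       unsigned long long guest_vl;      /* current VL */
       unsigned long long guest_vtype;   /* current vtype (holds SEW/LMUL) */
       unsigned long long guest_vstart;
       unsigned long long guest_vlenb;   /* probed at startup, then fixed */
    } VexGuestRISCV64StateSketch;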
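Likewise for issue 2, a minimal sketch of packing SEW/LMUL into the ULong flag and keying the translation-table lookup on it. The field layout is purely our assumption; we simply reuse the 3-bit vsew/vlmul encodings from the vtype CSR:

    /* Sketch only -- the field layout is our assumption. */
    typedef unsigned long long ULong;

    #define TTFLAG_VSEW_SHIFT   0   /* bits 2:0 -- SEW = 8 << vsew   */
    #define TTFLAG_VLMUL_SHIFT  3   /* bits 5:3 -- LMUL as in vtype  */

    static inline ULong ttflag_make ( ULong vsew, ULong vlmul )
    {
       return (vsew << TTFLAG_VSEW_SHIFT) | (vlmul << TTFLAG_VLMUL_SHIFT);
    }

    /* The code cache must then key on (guest IP, flag) rather than the
       guest IP alone, so that code translated under one SEW/LMUL is never
       reused under another -- including the fast path in disp_cp_xindir. */
    static inline int tt_entry_matches ( ULong entry_ip, ULong entry_flag,
                                         ULong guest_ip, ULong cur_flag )
    {
       return entry_ip == guest_ip && entry_flag == cur_flag;
    }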
The last remaining big issue is 3, for which we introduce some ad-hoc approaches. We summarize them into three types:

1. Break down a vector instruction into scalar VEX IR ops.
2. Break down a vector instruction into fixed-length VEX IR ops.
3. Use dirty helpers to realize vector instructions.

The first method theoretically exists but is probably not applicable, as the number of IR ops explodes when a large VLENB is adopted. Imagine a configuration of VLENB=512, SEW=8, LMUL=8: VL is 512 * 8 / 8 = 512, meaning that a single vector instruction turns into 512 scalar operations, and each scalar operation would in turn be expanded into multiple IRs. To make things worse, tool instrumentation inserts still more IRs between adjacent scalar IR ops. As a result, performance would likely drop by a factor of a thousand when running a real-world application with lots of vector instructions. Therefore, the other two methods are more promising, and we discuss them below.

Methods 2 and 3 are not mutually exclusive: for each vector instruction we may choose whichever suits its concrete behavior. To explain these methods in detail, we present some instances illustrating their pros and cons.

In terms of method 2, we have the real values of VLENB/LMUL/SEW. The simple case is VLENB <= 256 and LMUL=1, where many SIMD IR ops are available and can be directly applied to represent vector operations. However, even when VLENB is restricted to 128, the register group still exceeds the maximum SIMD width of 256 supported by VEX IR if LMUL > 2. Hence, here are two variants of method 2 to deal with long vectors:

2.1 Add more SIMD IR ops, such as 1024/2048/4096-bit ones, and translate vector instructions at the granularity of VLENB. Accordingly, VLENB=4096 with LMUL=2 is fulfilled by two 4096-bit SIMD VEX IR ops.

  * pros: it encourages the VEX backend to generate more compact and efficient SIMD code (maybe). In particular, it accommodates mask and gather/scatter (indexed) instructions by delivering more information in the IR itself.
  * cons: too many new IR ops need to be introduced in VEX, as each op of each different length must implement its own add/sub/mul variants. New data types to denote long vectors are necessary too, causing difficulties in both VEX backend register allocation and tool instrumentation.

2.2 Break down long vectors into multiple repeated SIMD ops. For instance, a vadd.vv vector instruction with VLENB=256/LMUL=2/SEW=8 is composed of four ops of type Iop_Add8x16 (see the sketch below).

  * pros: less effort is required in register allocation and tool instrumentation. The VEX frontend can direct the backend to generate efficient vector instructions using existing IR ops. It better trades off the complexity of adding many long-vector IR ops against the benefit of generating high-efficiency host code.
  * cons: it is hard to describe a mask operation, given that the mask is pretty flexible (the least significant bit of each segment of v0). Additionally, gather/scatter instructions may have similar problems in appropriately dividing index registers.

There are various corner cases left here, such as widening arithmetic operations (widening SIMD IR ops are currently not compatible) and the vstart CSR. When using fixed-length IR ops to compose a vector instruction, we inevitably have to tell each IR op at which position, encoded in vstart, it may start processing data. We could treat vstart as a normal guest-state virtual register and compute each op's start position as a guard IRExpr, or obtain the value of vstart the way we do for LMUL/SEW. Nevertheless, it is non-trivial to decompose a vector instruction concisely. In short, both 2.1 and 2.2 face a dilemma between reducing the engineering effort of refactoring Valgrind elegantly and implementing the vector instruction set efficiently. The same obstacles exist for ARM SVE, as its instructions are likewise scalable and flexible in many ways.

The final solution is the dirty helper. It is undoubtedly practical and probably requires the least engineering effort for dealing with so many details in Valgrind. In this design, each instruction is completed by inline assembly running the same instruction on the host. Moreover, tool instrumentation already handles IRDirty, except that new fields should be added to the _IRDirty struct to indicate strided/indexed/masked memory accesses and arithmetic operations.

  * pros: it supports all instructions without the trouble of building complicated IR expressions and statements. It executes vector instructions on the host CPU and thus gains some acceleration. Besides, we do not need to extend the VEX backend to translate new IRs into vector instructions.
  * cons: a dirty helper always keeps its operations in a black box, so tools can never see what happens inside it. In Memcheck, for example, the bit-precision merit is lost as soon as it meets a dirty helper, because the V-bit propagation chain then adopts a pretty coarse determination strategy. On the other hand, it is also not an elegant way to implement an entire ISA extension in dirty helpers.
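To compare the two promising directions concretely, first a minimal sketch of variant 2.2 for the vadd.vv example above, written in VEX frontend style. Only Iop_Add8x16, Ity_V128, IRExpr, UInt and the usual binop() shorthand for IRExpr_Binop are existing VEX names; getVRegSlice/putVRegSlice are hypothetical helpers we invented for illustration:

    /* Sketch: vadd.vv vd, vs2, vs1 with VLENB=256, LMUL=2, SEW=8.
       A 256-bit register with LMUL=2 forms a 512-bit group, i.e. four
       128-bit slices, so we emit four Iop_Add8x16 ops.
       getVRegSlice/putVRegSlice (hypothetical) read/write one Ity_V128
       slice of the guest vector register file. */
    static void gen_vadd_vv_vlenb256_lmul2_sew8 ( UInt vd, UInt vs2, UInt vs1 )
    {
       UInt i;
       for (i = 0; i < 4; i++) {
          IRExpr* lhs = getVRegSlice(vs2, i);   /* Ity_V128 */
          IRExpr* rhs = getVRegSlice(vs1, i);   /* Ity_V128 */
          putVRegSlice(vd, i, binop(Iop_Add8x16, lhs, rhs));
       }
    }

A guard computed from vstart, as discussed above, would additionally have to predicate each slice.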
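And for comparison, a rough sketch of the dirty-helper route for the same vadd.vv, assuming the host itself implements RVV (the function name and calling convention are invented; it needs a compiler with RVV support, e.g. -march=rv64gcv, and the handling of the clobbered vl/vtype CSRs is simplified):

    /* Sketch: dirty helper performing vadd.vv with SEW=8 by strip-mining
       the guest's VL on the host.  vd/vs2/vs1 point into the guest
       state's in-memory vector register file. */
    void rvv_dirtyhelper_vadd_vv_e8 ( unsigned char* vd,
                                      const unsigned char* vs2,
                                      const unsigned char* vs1,
                                      unsigned long vl )
    {
       while (vl > 0) {
          unsigned long n;
          __asm__ __volatile__(
             "vsetvli %0, %1, e8, m1, ta, ma \n\t"
             "vle8.v  v8, (%2)               \n\t"
             "vle8.v  v9, (%3)               \n\t"
             "vadd.vv v8, v8, v9             \n\t"
             "vse8.v  v8, (%4)               \n\t"
             : "=&r"(n)
             : "r"(vl), "r"(vs2), "r"(vs1), "r"(vd)
             : "v8", "v9", "memory");   /* vl/vtype CSRs are changed too */
          vl -= n; vd += n; vs2 += n; vs1 += n;
       }
    }

The frontend would then emit an IRDirty call passing pointers into the guest state plus the current VL, and, via the new _IRDirty fields mentioned above, describe the accesses so that tools at least know which guest state and memory were touched.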
In summary, we are still far from a truly applicable solution for adding vector extensions to Valgrind. We need to make detailed and comprehensive estimations for the different vector instruction categories. Any feedback is welcome, on GitHub [3] as well.

[1] https://github.com/riscv/riscv-v-spec
[2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
[3] https://github.com/petrpavlu/valgrind-riscv64/issues/17

Thanks.

Jojo