From: Jojo R <rj...@gm...> - 2023-05-22 11:46:42
Hi,

Any feedback or suggestions on this RFC?

On 2023/4/21 17:25, Jojo R wrote:
>
> Hi,
>
> We are considering adding the RVV/Vector [1] feature to Valgrind, and
> there are some challenges.
> RVV, like ARM's SVE [2], has a scalable/VLA programming model, which
> means the code is vector-length agnostic.
> ARM's SVE is not supported in Valgrind :(
>
> There are three major issues in implementing the RVV instruction set
> in Valgrind:
>
> 1. Scalable vector register width VLENB
> 2. LMUL and SEW can change at runtime
> 3. Lack of proper VEX IR to represent all vector operations
>
> We propose applicable methods to solve 1 and 2. As for 3, we explore
> several possible but maybe imperfect approaches to handle different
> cases.
>
> We start from 1. As each guest register should be described in the
> VEXGuestState struct, the vector registers with scalable width of
> VLENB can be added into VEXGuestState as arrays sized for an allowable
> maximum length such as 2048/4096.
>
> The range that is actually accessible can be determined at Valgrind
> startup time by querying the CPU for its vector capability or by some
> other suitable setup steps.
>
>
> To solve problem 2, we are inspired by already-proven techniques in
> QEMU, where translation blocks are broken up when certain critical
> CSRs are set. Because the guest code to IR translation relies on the
> precise value of LMUL/SEW, and they may change within a basic block, we
> can break up the basic block each time we encounter a vsetvl{i}
> instruction and return to the scheduler to execute the translated code
> and update LMUL/SEW. Accordingly, translation cache management should
> be refactored to detect changes of LMUL/SEW and invalidate outdated
> entries in the code cache. Without loss of generality, LMUL/SEW
> should be encoded into a ULong flag so that other architectures can
> use this flag to store their own arch-dependent information. The
> TTentry struct should also take the flag into account on both
> insertion and deletion.
> By doing this, the flag carries the newest LMUL/SEW throughout the
> simulation and can be passed to the disassemble functions through the
> VEXArchInfo struct, so that translation always sees the real, current
> values of LMUL and SEW.
>
> Also, some architecture-specific code needs care. For example, in the
> m_dispatch part, the disp_cp_xindir function looks up the code cache
> using hardcoded assembly that checks only the requested guest state IP
> against the translation cache entry address, with no further
> constraints. Many other modules should be checked to ensure that an
> update of LMUL/SEW is immediately visible to the essential parts of
> Valgrind.
>
>
> The last remaining big issue is 3, for which we introduce some ad-hoc
> approaches. We group these approaches into three types:
>
> 1. Break down a vector instruction into scalar VEX IR ops.
> 2. Break down a vector instruction into fixed-length VEX IR ops.
> 3. Use dirty helpers to realize vector instructions.
>
> The first method is possible in theory but probably not practical, as
> the number of IR ops explodes when a large VLENB is adopted. Imagine a
> configuration of VLENB=512, SEW=8, LMUL=8: the VL is 512 * 8 / 8 = 512,
> meaning that a single vector instruction turns into 512 scalar
> instructions, and each scalar instruction is expanded into multiple
> IRs. To make things worse, tool instrumentation inserts more IRs
> between adjacent scalar IR ops. As a result, a real-world application
> with lots of vector instructions is likely to run on the order of a
> thousand times slower. Therefore, the other two methods are more
> promising, and we discuss them below.
>
> Methods 2 and 3 are not mutually exclusive: for each vector
> instruction we may choose whichever suits its concrete behavior. To
> explain these methods in detail, we present some examples that
> illustrate their pros and cons.
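
To make the arithmetic for method 1 above concrete, here is a minimal C sketch. It assumes that the "VLENB" values quoted in this mail are register widths in bits and that LMUL is an integer (fractional LMUL is ignored). The helper names vlmax and fixed_width_ops are made up for illustration; they are not Valgrind or VEX APIs.

```c
#include <assert.h>

/* Hypothetical helper: maximum number of elements one vector
   instruction processes, i.e. register-group width / element width.
   vlen_bits is the register width in bits (an assumption about how
   the mail uses "VLENB"); lmul is assumed to be an integer. */
static unsigned vlmax(unsigned vlen_bits, unsigned lmul, unsigned sew_bits)
{
    assert(sew_bits > 0);
    return vlen_bits * lmul / sew_bits;
}

/* Hypothetical helper: number of fixed-width IR ops of ir_bits each
   that are needed to cover one register group, rounded up. */
static unsigned fixed_width_ops(unsigned vlen_bits, unsigned lmul,
                                unsigned ir_bits)
{
    assert(ir_bits > 0);
    unsigned total_bits = vlen_bits * lmul;
    return (total_bits + ir_bits - 1) / ir_bits;
}
```

With these helpers, vlmax(512, 8, 8) gives the 512 elements from the example above, while fixed_width_ops(128, 4, 256) gives 2, which hints at why fixed-length ops look far more tractable than per-element scalar ops.
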
>
> In terms of method 2, we have the real values of VLENB/LMUL/SEW. The
> simple case is VLENB <= 256 and LMUL=1, where many SIMD IR ops are
> available and can be applied directly to represent vector operations.
> However, even when VLENB is restricted to 128, a register group still
> exceeds the maximum SIMD width of 256 supported by VEX IR if LMUL>2.
> Hence, there are two variants of method 2 to deal with long vectors:
>
>
> *2.1* Add wider SIMD IR ops such as 1024/2048/4096, and translate
> vector instructions at the granularity of VLENB. For example,
> VLENB=4096 with LMUL=2 is covered by two 4096-bit SIMD VEX IR ops.
>
>   * *pros*: it encourages the VEX backend to generate more compact and
>     efficient SIMD code (maybe). In particular, it accommodates mask
>     and gather/scatter (indexed) instructions by carrying more
>     information in the IR itself.
>   * *cons*: too many new IR ops need to be introduced in VEX, as each
>     op of each length needs its own add/sub/mul variants. New data
>     types to denote long vectors are necessary too, causing
>     difficulties in both VEX backend register allocation and tool
>     instrumentation.
>
> *2.2* Break down long vectors into multiple repeated SIMD ops. For
> instance, a vadd.vv vector instruction with VLENB=256/LMUL=2/SEW=8 is
> composed of four operators of Iop_Add8x16 type.
>
>   * *pros:* less effort is required in register allocation and tool
>     instrumentation. The VEX frontend can tell the backend to generate
>     efficient vector instructions using existing Iops. It is a better
>     trade-off between the complexity of adding many long vector IR
>     ops and the benefit of generating high-efficiency host code.
>   * *cons:* it is hard to describe a mask operation, given that the
>     mask is pretty flexible (the least significant bit of each segment
>     of v0). Additionally, gather/scatter instructions may have similar
>     problems in dividing index registers appropriately.
>     There are various corner cases left here, such as widening
>     arithmetic operations (widening SIMD IR ops are currently not
>     compatible) and the vstart CSR. When using fixed-length IR ops to
>     compose a vector instruction, we inevitably have to tell each IR
>     op the position, encoded in vstart, from which it can start
>     processing data. We can treat vstart as a normal guest state
>     virtual register and compute each op's start position as a guard
>     IRExpr, or obtain the value of vstart the same way we do for
>     LMUL/SEW. Nevertheless, it is non-trivial to decompose a vector
>     instruction concisely.
>
> In short, both 2.1 and 2.2 face a dilemma between limiting the
> engineering effort of refactoring Valgrind elegantly and implementing
> the vector instruction set efficiently. The same obstacles exist for
> ARM SVE, as its vector instructions are also scalable and flexible in
> many ways.
>
> The final solution is the dirty helper. It is undoubtedly practical
> and possibly requires the least engineering effort in dealing with so
> many details in Valgrind. In this design, each instruction is
> completed using inline assembly that runs the same instruction on the
> host. Moreover, tool instrumentation already handles IRDirty, except
> that new fields should be added to the _IRDirty struct to indicate
> strided/indexed/masked memory accesses and arithmetic operations.
>
>   * *pros:* it supports all instructions without the need to build
>     complicated IR expressions and statements. It executes vector
>     instructions on the host CPU, which gives some acceleration.
>     Besides, we do not need to extend the VEX backend to translate
>     new IRs into vector instructions.
>   * *cons:* the dirty helper always keeps its operations in a black
>     box, so tools can never see what happens inside it. For Memcheck,
>     for example, bit-level precision is lost once it meets a dirty
>     helper, because V-bit propagation falls back to a pretty coarse
>     strategy.
>     On the other hand, implementing the entire ISA extension in
>     dirty helpers is also not an elegant approach.
>
> In summary, we are still far from a truly applicable solution for
> adding vector extensions to Valgrind. We need to do detailed and
> comprehensive estimations for the different vector instruction
> categories.
>
> Feedback is also welcome on GitHub [3].
>
>
> [1] https://github.com/riscv/riscv-v-spec
>
> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>
> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>
>
> Thanks.
>
> Jojo
>
>
>
> _______________________________________________
> Valgrind-developers mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-developers
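
Coming back to problem 2 in the RFC: below is a rough C sketch of keying translations on both the guest address and an LMUL/SEW flag, so that code translated under an outdated LMUL/SEW is never matched. Everything in it is hypothetical. TransEntry, pack_vtype_flag and lookup_translation are illustration-only names, not the real TTentry or translation table code, and the bit layout (vlmul in bits 2:0, vsew in bits 5:3, mirroring the low bits of the RVV vtype CSR) is only one possible choice. The linear search is there for brevity, not as a suggestion for the real cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packing of the LMUL/SEW encodings into a 64-bit flag.
   Bits 2:0 hold vlmul and bits 5:3 hold vsew (assumed layout). */
static uint64_t pack_vtype_flag(unsigned vsew, unsigned vlmul)
{
    return ((uint64_t)(vsew & 0x7u) << 3) | (uint64_t)(vlmul & 0x7u);
}

/* Hypothetical stand-in for a translation table entry: a translation
   is identified by the guest address AND the flag it was made under. */
typedef struct {
    uint64_t    guest_addr;
    uint64_t    flag;
    const void *host_code;
} TransEntry;

/* Return the host code for (addr, flag), or NULL if no translation
   was made under exactly this LMUL/SEW configuration. */
static const void *lookup_translation(const TransEntry *tab, size_t n,
                                      uint64_t addr, uint64_t flag)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tab[i].guest_addr == addr && tab[i].flag == flag)
            return tab[i].host_code;
    }
    return NULL;
}
```

A miss on the flag (same address, different LMUL/SEW) then looks like any other miss and would simply trigger a fresh translation, which is the behaviour the RFC asks for after a vsetvl{i}.
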