From: Jojo R <rj...@li...> - 2023-04-21 10:06:23
Hi,

We are considering adding the RVV/Vector [1] feature to Valgrind, and there are some challenges. RVV follows the same programming model as ARM's SVE [2]: it is scalable/VLA, meaning the vector length is agnostic. ARM's SVE is not supported in Valgrind either :(

There are three major issues in implementing the RVV instruction set in Valgrind:

1. Scalable vector register width (VLENB)
2. The runtime-changing property of LMUL and SEW
3. Lack of proper VEX IR to represent all vector operations

We propose applicable methods to solve 1 and 2. As for 3, we explore several possible but perhaps imperfect approaches to handle different cases.

We start with 1. As each guest register should be described in the VEXGuestState struct, the vector registers with a scalable width of VLENB can be added to VEXGuestState as arrays with an allowable maximum length such as 2048/4096. The actually accessible range can be determined at Valgrind startup by querying the CPU for its vector capability, or through some other suitable setup step.

To solve problem 2, we take inspiration from proven techniques in QEMU, where translation blocks are broken up when certain critical CSRs are set. Because the guest-code-to-IR translation relies on the precise values of LMUL/SEW, and they may change within a basic block, we can break up the basic block each time a vsetvl{i} instruction is encountered and return to the scheduler to execute the translated code and update LMUL/SEW. Accordingly, translation cache management should be refactored to detect changes of LMUL/SEW and invalidate outdated code cache. To keep this general, LMUL/SEW should be encoded into a ULong flag, so that other architectures can leverage the same flag to store their own arch-dependent information. The TTEntry struct should also take the flag into account on both insertion and deletion. By doing this, the flag carries the newest LMUL/SEW throughout the simulation and can be passed to the disassembly functions via the VEXArchInfo struct, so that translation always sees the real, up-to-date values of LMUL and SEW. Some architecture-dependent code also needs care: in the m_dispatch part, the disp_cp_xindir function looks up the code cache in hand-written assembly, checking only the requested guest IP against the translation cache entry address with no further constraints. Many other modules should be checked to ensure that an in-time update of LMUL/SEW is instantly visible to all essential parts of Valgrind.
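For concreteness, here is a minimal sketch of the guest-state change for issue 1 (all names are invented for illustration; this is not existing Valgrind code). The registers are declared at the allowable maximum width, and only the first guest_vlenb bytes of each, probed at startup, are ever accessed:

    /* Sketch only -- hypothetical names, not existing Valgrind code. */
    #define VEX_RVV_VLEN_MAX_BITS  4096                         /* allowable maximum VLEN */
    #define VEX_RVV_VLENB_MAX      (VEX_RVV_VLEN_MAX_BITS / 8)  /* ... in bytes */

    typedef struct {
       /* ... existing scalar guest registers ... */

       /* v0..v31, each padded to the maximum width.  Only the first
          guest_vlenb bytes of each entry are used. */
       unsigned char      guest_v[32][VEX_RVV_VLENB_MAX];
       unsigned long long guest_vl;      /* current VL */
       unsigned long long guest_vtype;   /* current vtype (holds SEW/LMUL) */
       unsigned long long guest_vstart;
       unsigned long long guest_vlenb;   /* probed at startup, then fixed */
    } VexGuestRISCV64StateSketch;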
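Likewise for issue 2, a minimal sketch of packing SEW/LMUL into the ULong flag and keying the translation-table lookup on it. The field layout is purely our assumption; we simply reuse the 3-bit vsew/vlmul encodings from the vtype CSR:

    /* Sketch only -- the field layout is our assumption. */
    typedef unsigned long long ULong;

    #define TTFLAG_VSEW_SHIFT   0   /* bits 2:0 -- SEW = 8 << vsew   */
    #define TTFLAG_VLMUL_SHIFT  3   /* bits 5:3 -- LMUL as in vtype  */

    static inline ULong ttflag_make ( ULong vsew, ULong vlmul )
    {
       return (vsew << TTFLAG_VSEW_SHIFT) | (vlmul << TTFLAG_VLMUL_SHIFT);
    }

    /* The code cache must then key on (guest IP, flag) rather than the
       guest IP alone, so that code translated under one SEW/LMUL is never
       reused under another -- including the fast path in disp_cp_xindir. */
    static inline int tt_entry_matches ( ULong entry_ip, ULong entry_flag,
                                         ULong guest_ip, ULong cur_flag )
    {
       return entry_ip == guest_ip && entry_flag == cur_flag;
    }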
The last remaining big issue is 3, for which we introduce some ad-hoc approaches. We summarize them into three types:

1. Break down a vector instruction into scalar VEX IR ops.
2. Break down a vector instruction into fixed-length VEX IR ops.
3. Use dirty helpers to realize vector instructions.

The first method theoretically exists but is probably not applicable, as the number of IR ops explodes when a large VLENB is adopted. Imagine a configuration of VLENB=512, SEW=8, LMUL=8: VL is 512 * 8 / 8 = 512, meaning that a single vector instruction turns into 512 scalar operations, and each scalar operation would in turn be expanded into multiple IRs. To make things worse, tool instrumentation inserts still more IRs between adjacent scalar IR ops. As a result, performance would likely drop by a factor of a thousand when running a real-world application with lots of vector instructions. Therefore, the other two methods are more promising, and we discuss them below.

Methods 2 and 3 are not mutually exclusive: for each vector instruction we may choose whichever suits its concrete behavior. To explain these methods in detail, we present some instances illustrating their pros and cons.

In terms of method 2, we have the real values of VLENB/LMUL/SEW. The simple case is VLENB <= 256 and LMUL=1, where many SIMD IR ops are available and can be directly applied to represent vector operations. However, even when VLENB is restricted to 128, the register group still exceeds the maximum SIMD width of 256 supported by VEX IR if LMUL > 2. Hence, here are two variants of method 2 to deal with long vectors:

2.1 Add more SIMD IR ops, such as 1024/2048/4096-bit ones, and translate vector instructions at the granularity of VLENB. Accordingly, VLENB=4096 with LMUL=2 is fulfilled by two 4096-bit SIMD VEX IR ops.

  * pros: it encourages the VEX backend to generate more compact and efficient SIMD code (maybe). In particular, it accommodates mask and gather/scatter (indexed) instructions by delivering more information in the IR itself.
  * cons: too many new IR ops need to be introduced in VEX, as each op of each different length must implement its own add/sub/mul variants. New data types to denote long vectors are necessary too, causing difficulties in both VEX backend register allocation and tool instrumentation.

2.2 Break down long vectors into multiple repeated SIMD ops. For instance, a vadd.vv vector instruction with VLENB=256/LMUL=2/SEW=8 is composed of four ops of type Iop_Add8x16 (see the sketch below).

  * pros: less effort is required in register allocation and tool instrumentation. The VEX frontend can direct the backend to generate efficient vector instructions using existing IR ops. It better trades off the complexity of adding many long-vector IR ops against the benefit of generating high-efficiency host code.
  * cons: it is hard to describe a mask operation, given that the mask is pretty flexible (the least significant bit of each segment of v0). Additionally, gather/scatter instructions may have similar problems in appropriately dividing index registers.

There are various corner cases left here, such as widening arithmetic operations (widening SIMD IR ops are currently not compatible) and the vstart CSR. When using fixed-length IR ops to compose a vector instruction, we inevitably have to tell each IR op at which position, encoded in vstart, it may start processing data. We could treat vstart as a normal guest-state virtual register and compute each op's start position as a guard IRExpr, or obtain the value of vstart the way we do for LMUL/SEW. Nevertheless, it is non-trivial to decompose a vector instruction concisely. In short, both 2.1 and 2.2 face a dilemma between reducing the engineering effort of refactoring Valgrind elegantly and implementing the vector instruction set efficiently. The same obstacles exist for ARM SVE, as its instructions are likewise scalable and flexible in many ways.

The final solution is the dirty helper. It is undoubtedly practical and probably requires the least engineering effort for dealing with so many details in Valgrind. In this design, each instruction is completed by inline assembly running the same instruction on the host. Moreover, tool instrumentation already handles IRDirty, except that new fields should be added to the _IRDirty struct to indicate strided/indexed/masked memory accesses and arithmetic operations.

  * pros: it supports all instructions without the trouble of building complicated IR expressions and statements. It executes vector instructions on the host CPU and thus gains some acceleration. Besides, we do not need to extend the VEX backend to translate new IRs into vector instructions.
  * cons: a dirty helper always keeps its operations in a black box, so tools can never see what happens inside it. In Memcheck, for example, the bit-precision merit is lost as soon as it meets a dirty helper, because the V-bit propagation chain then adopts a pretty coarse determination strategy. On the other hand, it is also not an elegant way to implement an entire ISA extension in dirty helpers.
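To compare the two promising directions concretely, first a minimal sketch of variant 2.2 for the vadd.vv example above, written in VEX frontend style. Only Iop_Add8x16, Ity_V128, IRExpr, UInt and the usual binop() shorthand for IRExpr_Binop are existing VEX names; getVRegSlice/putVRegSlice are hypothetical helpers we invented for illustration:

    /* Sketch: vadd.vv vd, vs2, vs1 with VLENB=256, LMUL=2, SEW=8.
       A 256-bit register with LMUL=2 forms a 512-bit group, i.e. four
       128-bit slices, so we emit four Iop_Add8x16 ops.
       getVRegSlice/putVRegSlice (hypothetical) read/write one Ity_V128
       slice of the guest vector register file. */
    static void gen_vadd_vv_vlenb256_lmul2_sew8 ( UInt vd, UInt vs2, UInt vs1 )
    {
       UInt i;
       for (i = 0; i < 4; i++) {
          IRExpr* lhs = getVRegSlice(vs2, i);   /* Ity_V128 */
          IRExpr* rhs = getVRegSlice(vs1, i);   /* Ity_V128 */
          putVRegSlice(vd, i, binop(Iop_Add8x16, lhs, rhs));
       }
    }

A guard computed from vstart, as discussed above, would additionally have to predicate each slice.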
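And for comparison, a rough sketch of the dirty-helper route for the same vadd.vv, assuming the host itself implements RVV (the function name and calling convention are invented; it needs a compiler with RVV support, e.g. -march=rv64gcv, and the handling of the clobbered vl/vtype CSRs is simplified):

    /* Sketch: dirty helper performing vadd.vv with SEW=8 by strip-mining
       the guest's VL on the host.  vd/vs2/vs1 point into the guest
       state's in-memory vector register file. */
    void rvv_dirtyhelper_vadd_vv_e8 ( unsigned char* vd,
                                      const unsigned char* vs2,
                                      const unsigned char* vs1,
                                      unsigned long vl )
    {
       while (vl > 0) {
          unsigned long n;
          __asm__ __volatile__(
             "vsetvli %0, %1, e8, m1, ta, ma \n\t"
             "vle8.v  v8, (%2)               \n\t"
             "vle8.v  v9, (%3)               \n\t"
             "vadd.vv v8, v8, v9             \n\t"
             "vse8.v  v8, (%4)               \n\t"
             : "=&r"(n)
             : "r"(vl), "r"(vs2), "r"(vs1), "r"(vd)
             : "v8", "v9", "memory");   /* vl/vtype CSRs are changed too */
          vl -= n; vd += n; vs2 += n; vs1 += n;
       }
    }

The frontend would then emit an IRDirty call passing pointers into the guest state plus the current VL, and, via the new _IRDirty fields mentioned above, describe the accesses so that tools at least know which guest state and memory were touched.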
In summary, we are still far from a truly applicable solution for adding vector extensions to Valgrind. We need to make detailed and comprehensive estimations for the different vector instruction categories. Any feedback is welcome, on GitHub [3] as well.

[1] https://github.com/riscv/riscv-v-spec
[2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
[3] https://github.com/petrpavlu/valgrind-riscv64/issues/17

Thanks.

Jojo