From: Petr P. <pet...@da...> - 2023-05-27 17:25:57
On 21. Apr 23 17:25, Jojo R wrote:
> We are considering adding the RVV/Vector [1] feature to Valgrind, and
> there are some challenges.
>
> RVV uses a programming model like ARM's SVE [2]: it is scalable/VLA,
> meaning the vector length is agnostic. ARM's SVE is not supported in
> Valgrind :(
>
> There are three major issues in implementing the RVV instruction set in
> Valgrind:
>
> 1. Scalable vector register width VLENB
> 2. Runtime changing property of LMUL and SEW
> 3. Lack of proper VEX IR to represent all vector operations
>
> We propose applicable methods to solve 1 and 2. As for 3, we explore
> several possible but maybe imperfect approaches to handle different
> cases.
>
> We start with 1. As each guest register should be described in the
> VEXGuestState struct, the vector registers with a scalable width of
> VLENB can be added to VEXGuestState as arrays using an allowable
> maximum length like 2048/4096.

The size of VexGuestRISCV64State is currently 592 bytes. Adding these
large vector registers will bump it by 32*2048/8=8192 bytes.

The baseblock layout in VEX is: the guest state, two equal-sized areas
for shadow state and then a spill area. The RISC-V port accesses the
baseblock in generated code via x8/s0. The register is set to the
address of baseblock+2048 (file
coregrind/m_dispatch/dispatch-riscv64-linux.S). The extra offset is a
small optimization to utilize the fact that load/store instructions in
RVI have a signed offset in the range [-2048,2047]. The end result is
that it is possible to access the baseblock data using only a single
instruction.

Adding the new vector registers will mean that more instructions become
necessary. For instance, accessing any shadow guest state would naively
require a LUI+ADDI+LOAD/STORE sequence. I suspect this could affect
performance quite a bit and might need some optimizing.
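To put rough numbers on it (a simplified model only; the define names
and the little program are made up for illustration, just the sizes and
the x8 bias match the port):

  #include <stdio.h>

  #define GUEST_STATE_NOW 592             /* sizeof(VexGuestRISCV64State) */
  #define VECTOR_STATE    (32 * 2048 / 8) /* 32 regs x 2048 bits = 8192 B */
  #define X8_BIAS         2048            /* x8 = baseblock + 2048 */

  int main(void)
  {
     /* Shadow state follows the guest state in the baseblock. */
     int shadow1 = GUEST_STATE_NOW + VECTOR_STATE;

     /* RVI loads/stores have a 12-bit signed immediate, so a single
        instruction reaches [x8-2048, x8+2047], i.e. baseblock offsets
        [0, 4095]. */
     int reach_end = X8_BIAS + 2047;

     printf("first shadow byte at offset %d, single-insn reach ends at %d\n",
            shadow1, reach_end);   /* prints 8784 vs. 4095 */
     return 0;
  }

With the enlarged guest state, even the first shadow byte already sits
well past what one biased load/store can address.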
> The actual available access range can be determined at Valgrind
> startup time by querying the CPU for its vector capability or by some
> suitable setup steps.

Something to consider is that the virtual CPU provided by Valgrind does
not necessarily need to match the host CPU. For instance, VEX could
hardcode that its vector registers are only 128 bits in size. I was
originally hoping that this is how support for the V extension could be
added, but the LMUL grouping looks to break this model.

> To solve problem 2, we take inspiration from already-proven techniques
> in QEMU, where translation blocks are broken up when certain critical
> CSRs are set. Because the guest-code-to-IR translation relies on the
> precise values of LMUL/SEW and they may change within a basic block,
> we can break up the basic block each time a vsetvl{i} instruction is
> encountered and return to the scheduler to execute the translated code
> and update LMUL/SEW. Accordingly, translation cache management should
> be refactored to detect changes of LMUL/SEW and invalidate the
> outdated code cache. Without loss of generality, LMUL/SEW should be
> encoded into a ULong flag such that other architectures can leverage
> this flag to store their arch-dependent information. The TTEntry
> struct should also take the flag into account on both insertion and
> deletion. By doing this, the flag carries the newest LMUL/SEW
> throughout the simulation and can be passed to the disassembly
> functions via the VEXArchInfo struct, so that we can get the real and
> newest values of LMUL and SEW to facilitate our translation.
>
> Also, some architecture-related code needs to be taken care of. For
> example, in the m_dispatch part, the disp_cp_xindir function looks up
> the code cache using hardcoded assembly by checking only the requested
> guest state IP and the translation cache entry address, with no
> further constraints. Many other modules should be checked to ensure
> that an in-time update of LMUL/SEW is instantly visible to the
> essential parts of Valgrind.
>
> The last remaining big issue is 3, for which we introduce some ad-hoc
> approaches. We summarize these approaches into three types:
>
> 1. Break down a vector instruction into scalar VEX IR ops.
> 2. Break down a vector instruction into fixed-length VEX IR ops.
> 3. Use dirty helpers to realize vector instructions.

I would also look at adding new VEX IR ops for scalable vector
instructions. In particular, if it could be shown that RVV and SVE can
use the same new ops then it would make a good argument for adding
them. Perhaps also interesting is whether such new scalable vector ops
could represent fixed-length operations on other architectures as well,
but that is just me thinking out loud. (A rough sketch of what such ops
might look like is appended at the end of this mail.)

> [...]
>
> In summary, we are still far from a truly applicable solution for
> adding vector extensions to Valgrind. We need to do detailed and
> comprehensive estimations on the different vector instruction
> categories.
>
> Any feedback is also welcome on GitHub [3].
>
> [1] https://github.com/riscv/riscv-v-spec
> [2] https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17

Sorry for not being more helpful at this point. As mentioned in the
GitHub issue, I still need to get myself more familiar with RVV and how
Valgrind handles vector instructions.

Thanks,
Petr
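P.S. The promised sketch of the scalable-op idea. This is purely
hypothetical: none of these names exist in libvex_ir.h today, and it is
only one possible shape. The element size could stay in the opcode, as
with the existing fixed-width Iop_Add64x2-style ops, while the element
count comes from a separate length operand.

  /* Hypothetical only: a length-agnostic flavour of vector add, where
     the active element count is supplied as an extra IR operand at
     runtime instead of being baked into the opcode. RVV's VL would map
     onto that operand directly; SVE predicates might first need to be
     lowered to a count or mask. */
  typedef enum {
     /* ... existing fixed-width ops (Iop_Add8x16, Iop_Add64x2, ...)
        would stay as they are ... */
     Iop_VAdd8xN,   /* add of  8-bit elements */
     Iop_VAdd16xN,  /* add of 16-bit elements */
     Iop_VAdd32xN,
     Iop_VAdd64xN   /* element count taken from a length operand */
  } ScalableIROpSketch;   /* name made up for this sketch */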