From: Petr P. <pet...@da...> - 2023-10-31 11:38:25
On 25. Sep 23 10:14, Wu, Fei wrote:
> On 8/21/2023 4:57 PM, Wu, Fei wrote:
> > On 5/26/2023 10:08 PM, Wu, Fei wrote:
> >> On 5/26/2023 9:59 PM, Fei Wu wrote:
> >>> I'm from the Intel RISC-V team, working on a RISC-V International
> >>> development partner project to add RISC-V vector (RVV) support to
> >>> Valgrind; the target tool is memcheck. My work is based on commit
> >>> 71272b252977 of Petr's riscv64-linux branch -- many thanks to Petr
> >>> for his great work.
> >>> https://github.com/petrpavlu/valgrind-riscv64
> >>>
> > A vector-IR version has been added to this repo (branch poc-rvv):
> > https://github.com/intel/valgrind-rvv/commits/poc-rvv
> >
> > A number of RVV instructions have been enabled, and I plan to enable
> > more in order to run some general benchmarks; coremark is the first one.
> >
> I have enabled all the RVV instructions except the floating-point and
> fixed-point ones, so I think it's ready for a try and an initial review.
> https://github.com/intel/valgrind-rvv/commits/poc-rvv
>
> The design of this patch series:
> * All the RVV instructions have a corresponding vector IR, except
>   load/store, which still uses the split-to-scalar pattern according to
>   the mask.
> * Most instructions where each element is calculated independently, such
>   as vadd, are implemented as (read the original value, compute the
>   result as in the unmasked version, apply the mask), so it is not
>   necessary to define both a masked and an unmasked version of the same
>   vector IR. Other instructions, such as vredsum, treat the masked and
>   unmasked forms differently.
> * Besides `vl', which is encoded in the vector op/type, some other
>   information such as vlmax needs to be passed from the frontend to the
>   backend; we have a basic method to do that.
>
> I have done only very limited testing on the code, usually a basic
> correctness test per instruction. The memcheck logic for some
> instructions still needs polishing, so it might not find all the bugs
> memcheck is supposed to identify, but that shouldn't stop us from trying
> it against RVV binaries and finding the bugs unrelated to RVV
> instructions.
>
> @Petr, could you please take a look?

As previously touched upon in another thread [1], I'm not entirely
convinced by the idea of keeping multiple translations of the same code,
one per active vector length (vl). The main issue is that such multiple
translations require spending more cycles on translation and put pressure
on memory/caches. Some chaining opportunities may also be lost when it is
not possible to statically determine which translation to select. Most
vector operations shouldn't need to be aware of the current vl setting
because they can operate on entire vectors. Loads/stores are one
exception; they need to avoid reading/writing memory past the limit
determined by the value of vl.

Overall, the approach looks somewhat RVV-centric to me. SVE, another
scalable vector architecture, is designed around predicate masks and in
general doesn't have a vl value.

The RVV prototype adds new IRTypes called Ity_VLen1 (for masks) and
Ity_VLen8, Ity_VLen16, Ity_VLen32, Ity_VLen64 (for data), and uses the
upper 16 bits of an IRType instance to record the current vl. New IROps
named Iop_V* are added, and similarly vl is stored in the upper 16 bits
of an IROp. For instance, some of the new addition ops are Iop_VAdd_vv_8,
Iop_VAdd_vv_16, Iop_VAdd_vv_32, Iop_VAdd_vv_64.
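As a side note, mostly to check my understanding of that encoding: packing
vl into the upper 16 bits presumably boils down to something like the
standalone sketch below. The helper names and the concrete Ity_VLen8 value
are made up here for illustration; only the general layout (base type in
the low 16 bits, current vl in the high 16 bits) is taken from the
description above.

  /* Standalone sketch, not actual poc-rvv code. */
  #include <stdio.h>

  typedef unsigned int UInt;     /* stands in for VEX's UInt              */
  typedef UInt IRType;           /* stock IRType values fit in 16 bits    */

  #define Ity_VLen8 0x1d00u      /* hypothetical base-type value          */

  /* Record the active vl in the upper 16 bits of a vector IRType. */
  static IRType vlenType(IRType base, UInt vl)
  {
     return (vl << 16) | (base & 0xffffu);
  }

  /* Recover the base type and the vl from an encoded IRType. */
  static IRType vlenTypeBase(IRType ty) { return ty & 0xffffu; }
  static UInt   vlenTypeVL(IRType ty)   { return ty >> 16; }

  int main(void)
  {
     IRType ty = vlenType(Ity_VLen8, 16);   /* e.g. vl = 16 elements */
     printf("base=0x%x vl=%u\n", vlenTypeBase(ty), vlenTypeVL(ty));
     return 0;
  }

The same packing would then apply to IROp values such as Iop_VAdd_vv_8.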
My SVE prototype takes an alternative approach, inspired by GCC and LLVM.
The design derives from the "< vscale x <#elements> x <elementtype> >"
LLVM syntax [2] and the corresponding "nxv<#elements><elementtype>"
machine value types [3]. The "vscale"/"n" in this compiler model is
constant and at runtime equal to VLEN/64. The prototype is available in
the valgrind-riscv64 repository on GitHub, branch poc-sve [4]. The code
introduces two new IRTypes: Ity_V8xN, representing "an 8-bit x N scalable
vector predicate", and Ity_V64xN, representing "a 64-bit x N scalable
vector data". New IROps are added too. For example, the addition ops are
Iop_Add8x8xN, Iop_Add16x4xN, Iop_Add32x2xN, Iop_Add64x1xN.

Note that Valgrind orders the notation of vector operations differently.
Valgrind uses "< <elementtype> x <#elements> >" for fixed vectors,
compared to the more common "< <#elements> x <elementtype> >". My
prototype follows the reversed notation used by Valgrind, meaning
"< <elementtype> x <#elements> x <vscale/N> >".

The "N" dimension is critical. In contrast to compilers, Valgrind has the
luxury of knowing its actual value. For SVE, the value would naively be
equal to VLEN/64 and constant throughout the runtime of a client program.
However, Armv9.2-A introduced the Scalable Matrix Extension [5] along
with the so-called Streaming SVE mode. This operating mode can have an
alternative vector size, and userspace programs may switch between the
two modes as needed. This requires tracking the current scalable vector
register size, and therefore the "N" value, at a finer granularity. It
makes me think "N" should be attached to, and be constant within, each
IRSB.

RVV with its LMUL grouping is similar in that it results in a different
effective vector size. Following the LLVM schema, LMUL could be modeled
through the "<#elements>" dimension. For instance, Iop_Add8x16xN would be
an 8-bit addition for LMUL=2. However, this would entail adding too many
new IROps, which is not desirable. A better option in the Valgrind case
looks to be for LMUL to affect "N" too. Its current value would be
tracked and kept constant per IRSB, the same as in the SVE case.

[1] https://sourceforge.net/p/valgrind/mailman/valgrind-developers/thread/zsjnggi3ngqkjmbsojjnvjcqtlr5crplev3inb55jlwxnr6tjr%40s5a65qvm3kjc/#msg37872645
[2] https://llvm.org/docs/LangRef.html#t-vector
[3] https://github.com/llvm/llvm-project/blob/llvmorg-17.0.0/llvm/lib/Target/RISCV/RISCVRegisterInfo.td#L266
[4] https://github.com/petrpavlu/valgrind-riscv64/tree/poc-sve
[5] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture

-- 
Cheers,
Petr