From: Wu, F. <fe...@in...> - 2023-11-13 07:58:16
On 10/31/2023 7:38 PM, Petr Pavlu wrote:
> On 25. Sep 23 10:14, Wu, Fei wrote:
>> On 8/21/2023 4:57 PM, Wu, Fei wrote:
>>> On 5/26/2023 10:08 PM, Wu, Fei wrote:
>>>> On 5/26/2023 9:59 PM, Fei Wu wrote:
>>>>> I'm from the Intel RISC-V team, working on a RISC-V International
>>>>> development partner project to add RISC-V vector (RVV) support to
>>>>> Valgrind; the target tool is memcheck. My work is based on commit
>>>>> 71272b252977 of Petr's riscv64-linux branch, many thanks to Petr
>>>>> for his great work.
>>>>> https://github.com/petrpavlu/valgrind-riscv64
>>>>>
>>> A vector IR version has been added to this repo (branch poc-rvv):
>>> https://github.com/intel/valgrind-rvv/commits/poc-rvv
>>>
>>> A number of RVV instructions have been enabled, and I plan to enable
>>> more so I can run some general benchmarks; coremark is the first one.
>>>
>> I have enabled all the RVV instructions except the floating-point and
>> fixed-point ones, so I think it's ready for a try and an initial
>> review.
>> https://github.com/intel/valgrind-rvv/commits/poc-rvv
>>
>> The design of this patch series:
>> * Every RVV instruction has a corresponding vector IR, except
>>   load/store, which still uses the split-to-scalar pattern according
>>   to the mask.
>> * Most instructions where each element is computed independently,
>>   such as vadd, are implemented as (read the original destination,
>>   compute the result as in the unmasked version, apply the mask), so
>>   it is not necessary to define both a masked and an unmasked version
>>   of the same vector IR. The other instructions, such as vredsum,
>>   treat the masked and unmasked forms differently.
>> * Besides the `vl' which is encoded in the vector op/type, some other
>>   information, such as vlmax, needs to be passed from the frontend to
>>   the backend; we have a basic method to do that.
>>
>> I have done very limited testing on the code, usually only a basic
>> correctness test per instruction.
>> The memcheck logic for some instructions still needs polishing; it
>> may not find every bug memcheck is supposed to identify, but at least
>> that shouldn't stop us from trying it against RVV binaries and
>> finding the bugs unrelated to RVV instructions.
>>
>> @Petr, could you please take a look?
>
> As previously touched upon in another thread [1], I'm not entirely
> convinced by the idea of multiple translations of the same code for
> the various active vector lengths (vl).
>
> The main issue is that having such multiple translations requires
> spending more cycles on translation and creates pressure on
> memory/caches. Some chaining opportunities may also be lost when it is
> not possible in certain cases to statically determine which
> translation to select.
>
> Most vector operations shouldn't need to be aware of the current vl
> setting as they can operate on entire vectors. Loads/stores are one
> exception; they need to avoid reading/writing memory past the limit
> determined by the value of vl.
>
> The approach overall looks to me somewhat RVV-centric. SVE, as
> another scalable vector architecture, is designed around predicate
> masks and so in general doesn't have a vl value.
>
From this perspective, RVV looks more generic than SVE? Something in
RVV such as vl is not expressible by SVE primitives, so RVV has to be
covered for a unified solution.

> The RVV prototype adds new IRTypes called Ity_VLen1 (for masks) and
> Ity_VLen8, Ity_VLen16, Ity_VLen32, Ity_VLen64 (for data), and uses
> the upper 16 bits of an IRType instance to record the current vl. New
> IROps named Iop_V* are added, and similarly vl is stored in the upper
> 16 bits of an IROp. For instance, some new addition ops are
> Iop_VAdd_vv_8, Iop_VAdd_vv_16, Iop_VAdd_vv_32, Iop_VAdd_vv_64.
>
> My SVE prototype takes an alternative approach inspired by GCC and
> LLVM.
> The design derives from the "< vscale x <#elements> x <elementtype> >"
> LLVM syntax [2] and the corresponding "nxv<#elements><elementtype>"
> machine value types [3]. The "vscale"/"n" in this compiler model is
> constant and equal at runtime to VLEN/64.
>
> The prototype is available in the valgrind-riscv64 repository on
> GitHub, branch poc-sve [4].
>
> The code introduces two new IRTypes: Ity_V8xN to represent "an 8-bit
> x N scalable vector predicate" and Ity_V64xN for "a 64-bit x N
> scalable vector data". New IROps are added too. For example, addition
> ops are Iop_Add8x8xN, Iop_Add16x4xN, Iop_Add32x2xN, Iop_Add64x1xN.
>
> Note that Valgrind has a different ordering in the notation of vector
> operations. Valgrind uses "< <elementtype> x <#elements> >" for fixed
> vectors, compared to the more common "< <#elements> x <elementtype> >".
> My prototype follows the reversed notation used by Valgrind, meaning
> "< <elementtype> x <#elements> x <vscale/N> >".
>
> The "N" dimension is critical. In contrast to compilers, Valgrind has
> the luxury of knowing its actual value.
>
> For SVE, the value would naively be equal to VLEN/64 and constant
> throughout the runtime of a client program. However, Armv9.2-A
> introduced the Scalable Matrix Extension [5] along with the so-called
> Streaming SVE mode. This operating mode can have an alternative
> vector size, and userspace programs may switch between the two modes
> as needed.
>
> This requires tracking the current scalable vector register size, and
> therefore the "N" value, at a higher granularity. It makes me think
> "N" should be attached to, and be constant per, an IRSB.
>
If "N" is attached, then your SVE prototype will be similar to my RVV
prototype. You have the following with a fixed 64-bit mini-block:

  Iop_Add8x8xN, Iop_Add16x4xN, Iop_Add32x2xN, Iop_Add64x1xN

and I have the following with 8/16/32/64-bit elements:

  Iop_VAdd_vv_8, Iop_VAdd_vv_16, Iop_VAdd_vv_32, Iop_VAdd_vv_64

Ignoring the naming convention, they are the same, except that you
group elements into a 64-bit mini-block. However, Iop_VAdd_vv_8 is
necessary because vl in RVV might result in non-64-bit-aligned
accesses.

I agree that attaching "N" to the IRSB is a good idea. The reason I
embed it in every vector IRType & IROp is that some logic requires the
precise size of the type, e.g. by calling and checking
sizeofIRType(type). I see you have a TODO in the SVE prototype too:

  Int sizeofIRType ( IRType ty )
    /* TODO Fix the size for scalable types. */
    case Ity_V8xN: return 2;
    case Ity_V64xN: return 16;

If this function does need to return the precise size of an IRType, we
need to address that, probably by passing some kind of env to this
function? It looks like a lot of code changes are needed. In the RVV
prototype I kept it simple by encoding the size in the IRType. Another
reason I do so is that it can handle the case where the instrumentation
code generates vector instructions with a different vl, but I'm not
sure if there is any real case.

> RVV with its LMUL grouping is similar in that it results in a
> different vector size. Following the LLVM schema, LMUL could be
> modeled through the "<#elements>" dimension. For instance,
> Iop_Add8x16xN would be an 8-bit addition for LMUL=2. However, this
> would entail adding too many new iops, which is not desirable.
>
> A better option in the Valgrind case looks to be for LMUL to affect
> "N" too. Its current value would be tracked and constant per IRSB,
> same as in the SVE case.
>
helper_vsetvl() in RVV takes LMUL into consideration when setting
guest_vl.

Thank you for all your comments, and please correct me if anything is
wrong.

Best Regards,
Fei.
> [1] https://sourceforge.net/p/valgrind/mailman/valgrind-developers/thread/zsjnggi3ngqkjmbsojjnvjcqtlr5crplev3inb55jlwxnr6tjr%40s5a65qvm3kjc/#msg37872645
> [2] https://llvm.org/docs/LangRef.html#t-vector
> [3] https://github.com/llvm/llvm-project/blob/llvmorg-17.0.0/llvm/lib/Target/RISCV/RISCVRegisterInfo.td#L266
> [4] https://github.com/petrpavlu/valgrind-riscv64/tree/poc-sve
> [5] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture