From: Wu, F. <fe...@in...> - 2023-06-05 01:24:04
On 6/1/2023 7:13 PM, LATHUILIERE Bruno via Valgrind-developers wrote:
>
> -------- Original message --------
> Subject: Re: [Valgrind-developers] RFC: support scalable vector model / riscv vector
> Date: 2023-05-29 05:29
> From: "Wu, Fei" <fe...@in...>
> To: Petr Pavlu <pet...@da...>, Jojo R <rj...@gm...>
> Cc: pa...@so..., yun...@al..., val...@li..., val...@li..., zha...@al...
>
>> On 5/28/2023 1:06 AM, Petr Pavlu wrote:
>>> On 21. Apr 23 17:25, Jojo R wrote:
>>>> The last remaining big issue is 3, for which we introduce some ad-hoc
>>>> approaches. We summarize these approaches into the three types below:
>>>>
>>>> 1. Break down a vector instruction to scalar VEX IR ops.
>>>> 2. Break down a vector instruction to fixed-length VEX IR ops.
>>>> 3. Use dirty helpers to realize vector instructions.
>>>
>>> I would also look at adding new VEX IR ops for scalable vector
>>> instructions. In particular, if it could be shown that RVV and SVE can
>>> use the same new ops, that would make a good argument for adding them.
>>>
>>> Perhaps also interesting is whether such new scalable vector ops could
>>> represent fixed-length operations on other architectures, but that is
>>> just me thinking out loud.
>>>
>> It's a good idea to consolidate all vector/simd handling together; the
>> challenge is to verify its feasibility and to speed up the adaptation,
>> as it is expected to take more effort and a longer time. Is there anyone
>> with knowledge or experience of other ISAs such as AVX/SVE on valgrind
>> who can share the pain and gain, or could we do a quick prototype?
>>
>> Thanks,
>> Fei.
>
> Hi,
>
> I don't know if my experience is the one you expect; nevertheless I will
> try to share it.

Hi Bruno,

Thank you for sharing this, it's definitely worth reading.

> I'm the main developer of a valgrind tool called verrou (url:
> https://github.com/edf-hpc/verrou ) which currently only works on the
> x86_64 architecture.
> From the user's point of view, verrou makes it possible to estimate the
> effect of floating-point rounding-error propagation (if you are
> interested in the subject, there are documentation and publications).

It looks interesting, good job.

> From the valgrind tool developer's point of view, we need to replace all
> floating-point operations (fpo) by our own modified fpo implemented with
> C++ functions. One C++ function has 1, 2 or 3 floating-point input values
> and one floating-point output value.

Do you use libvex_BackEnd() to translate the insns to host code, e.g.
host_riscv64_isel.c to select the host insns? Is there any difference in
the processing flow between verrou and memcheck?

> As we have to replace all VEX fpo, the way we handle SSE and AVX has
> consequences for us. For each kind of fpo (add,sub,mul,div,sqrt)x(float,double),
> we have to replace the VEX op for the following variants: scalar, SSE low
> lane, SSE, AVX. It is painful but possible via code generation. Thanks to
> the multiple VEX ops it is possible to select only one type of instruction
> (this can be useful to 1- get a speed-up, 2- know whether floating-point
> errors come from scalar or vector instructions).
>
> On the other hand, for fma operations (madd,msub)x(float,double) we have
> less work to do, as valgrind does the un-vectorisation for us, but it is
> then impossible to instrument scalar or vector ops selectively.

As these insns are un-vectorised, are there any other issues besides the
two mentioned above (1 - performance, 2 - knowing the original type)? I
want to make sure there is no risk in the un-vectorisation design, e.g.
when the vector length is large, such as a 2k vlen on rvv.
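To make sure I understand the replacement flow you describe, here is a
minimal sketch of how a scalar Iop_AddF64 could be routed to an
instrumented helper from a tool's instrumentation pass. The helper name
vr_add_f64 and its trivial body are hypothetical, not verrou's actual code;
only the VEX builder calls are the real API:

#include "pub_tool_basics.h"
#include "pub_tool_tooliface.h"
#include "pub_tool_machine.h"

/* Hypothetical helper: takes the operand bit patterns and returns the
   bit pattern of the (instrumented) result.  A real tool would apply
   its own rounding/perturbation here instead of a plain add. */
static ULong vr_add_f64 ( ULong a_bits, ULong b_bits )
{
   union { ULong i; double f; } a, b, r;
   a.i = a_bits; b.i = b_bits;
   r.f = a.f + b.f;
   return r.i;
}

/* Emit a dirty call replacing Iop_AddF64(a1, a2) and return an F64
   expression holding the helper's result. */
static IRExpr* vr_replace_addF64 ( IRSB* sb, IRExpr* a1, IRExpr* a2 )
{
   /* Dirty-helper args must be word sized, so pass the F64 bit patterns. */
   IRExpr* arg1 = IRExpr_Unop(Iop_ReinterpF64asI64, a1);
   IRExpr* arg2 = IRExpr_Unop(Iop_ReinterpF64asI64, a2);

   IRTemp   res = newIRTemp(sb->tyenv, Ity_I64);
   IRDirty* d   = unsafeIRDirty_1_N(res, 0/*regparms*/, "vr_add_f64",
                                    VG_(fnptr_to_fnentry)(&vr_add_f64),
                                    mkIRExprVec_2(arg1, arg2));
   addStmtToIRSB(sb, IRStmt_Dirty(d));

   /* Reinterpret the returned bits as an F64 for the rest of the block. */
   return IRExpr_Unop(Iop_ReinterpI64asF64, IRExpr_RdTmp(res));
}

The SSE/AVX variants would then presumably be generated the same way per
lane, which is where your code generation comes in.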
> We could think that the multiple VEX ops enable performance improvements
> via vectorisation of the C++ call, but it is currently not possible (at
> least to my knowledge). Indeed, with the valgrind API I don't know how I
> can get the floating-point values in the register without applying
> un-vectorisation: to get the values in the AVX register, I do an awful
> sequence of Iop_V256to64_0, Iop_V256to64_1, Iop_V256to64_2, Iop_V256to64_3
> for the 2 arguments. As it is not possible to do an IRStmt_Dirty call of a
> function with 9 args (9 = 2*4 + 1: 2 for a binary operation, 4 for the
> vector length and 1 for the result), I do a first call to copy the 4
> values of the first arg somewhere, then a second one to perform the 4 C++
> calls.
> Due to the algorithm inside the C++ calls it could be tricky to vectorise,
> but I didn't even try because of the sequence of Iop_V256to64_*.

For memcheck, the process is, put simply:

  toIR -> instrumentation -> Backend isel

If the vector insn is split into scalars at the toIR stage, just as I did
in this series (a minimal sketch is at the end of this mail), the advantage
looks obvious: I only need to deal with this single stage and can leverage
the existing code to handle the scalar version. The disadvantage is that it
might lose some opportunities to optimize, e.g.:

* toIR - introduces extra temp variables for the generated scalars
* instrumentation - for memcheck, the key is to track the V+A bits instead
  of the real results of the ops; the ideal case is that the V+A bits of
  the whole vector can be checked together w/o breaking it into scalars
* Backend isel - the ideal case is to use a vector insn on the host for a
  guest vector insn, but I'm not sure how much effort it would take to
  achieve this.

> In my dreams I would like an Iop_ that converts a V256 or V128 value to an
> aligned pointer to the floating-point args.
>
> So, I don't know if my experience can be useful for you, but if someone
> has a better solution to my needs it will be useful at least ... to me :)

Thank you again for sharing this. I hope the discussion can help both of
us, and others.

Best regards,
Fei.

> Best regards,
> Bruno Lathuilière
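For reference, my understanding of the two-call workaround Bruno describes
is roughly the following minimal sketch. The helper and buffer names
(vr_stash, vr_stash_arg1, vr_apply4) are hypothetical, not verrou's actual
code, and the second helper body is left empty:

#include "pub_tool_basics.h"
#include "pub_tool_tooliface.h"
#include "pub_tool_machine.h"

static ULong vr_stash[4];   /* hypothetical stash for the first operand */

static void vr_stash_arg1 ( ULong x0, ULong x1, ULong x2, ULong x3 )
{
   vr_stash[0] = x0; vr_stash[1] = x1; vr_stash[2] = x2; vr_stash[3] = x3;
}

static void vr_apply4 ( ULong y0, ULong y1, ULong y2, ULong y3 )
{
   /* combine vr_stash[i] with y_i, run the 4 instrumented ops, and store
      the results where the IR (or a further call) can pick them up */
}

/* Emit the two dirty calls for a V256 x V256 binary op. */
static void vr_emit_avx_binop ( IRSB* sb, IRExpr* argL, IRExpr* argR )
{
   /* First call: stash the 4 lanes of the first operand. */
   IRDirty* d1 = unsafeIRDirty_0_N(
      0/*regparms*/, "vr_stash_arg1",
      VG_(fnptr_to_fnentry)(&vr_stash_arg1),
      mkIRExprVec_4(IRExpr_Unop(Iop_V256to64_0, argL),
                    IRExpr_Unop(Iop_V256to64_1, argL),
                    IRExpr_Unop(Iop_V256to64_2, argL),
                    IRExpr_Unop(Iop_V256to64_3, argL)));
   addStmtToIRSB(sb, IRStmt_Dirty(d1));

   /* Second call: pass the 4 lanes of the second operand and do the work. */
   IRDirty* d2 = unsafeIRDirty_0_N(
      0/*regparms*/, "vr_apply4",
      VG_(fnptr_to_fnentry)(&vr_apply4),
      mkIRExprVec_4(IRExpr_Unop(Iop_V256to64_0, argR),
                    IRExpr_Unop(Iop_V256to64_1, argR),
                    IRExpr_Unop(Iop_V256to64_2, argR),
                    IRExpr_Unop(Iop_V256to64_3, argR)));
   addStmtToIRSB(sb, IRStmt_Dirty(d2));
   /* The result lanes then have to be rebuilt into a V256, e.g. via
      Iop_64x4toV256. */
}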
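And for the toIR splitting mentioned above, a hypothetical vadd.vv on
64-bit elements, expanded lane by lane into scalar Iop_Add64 ops over the
guest register file, would look roughly like this. The function name,
offsets and the translation-time element count are illustrative, not the
actual code in the series:

#include "libvex_ir.h"

/* Hypothetical expansion of vadd.vv (SEW=64) in the guest frontend:
   vd_off/vs1_off/vs2_off are guest-state offsets of the vector registers,
   n_elems stands in for the runtime vl/vtype handling (not shown; assumes
   a lane count known at translation time). */
static void gen_vadd_vv_64 ( IRSB* irsb, Int vd_off, Int vs1_off,
                             Int vs2_off, Int n_elems )
{
   Int i;
   for (i = 0; i < n_elems; i++) {
      Int lane = i * 8;   /* 8 bytes per 64-bit element */
      IRExpr* a = IRExpr_Get(vs1_off + lane, Ity_I64);
      IRExpr* b = IRExpr_Get(vs2_off + lane, Ity_I64);
      IRTemp  t = newIRTemp(irsb->tyenv, Ity_I64);   /* the extra temps */
      addStmtToIRSB(irsb, IRStmt_WrTmp(t, IRExpr_Binop(Iop_Add64, a, b)));
      addStmtToIRSB(irsb, IRStmt_Put(vd_off + lane, IRExpr_RdTmp(t)));
   }
}

This is where the three points above show up: n_elems temps and Puts per
instruction, per-lane V-bit tracking in memcheck, and per-lane host
instructions out of isel.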