From: Petr P. <pet...@da...> - 2024-05-09 20:31:32
Hi Fei,

Sorry, I've been moving houses and didn't have any time to look at this.
I'm only slowly getting back to it.

On 21. Feb 24 17:11, Wu, Fei wrote:
> On 2/12/2024 5:41 AM, Petr Pavlu wrote:
> > On 5. Feb 24 11:45, Wu, Fei2 wrote:
> >>>> Hi Petr,
> >>>>
> >>>> I have done some work to remove vl from the ir/op, please take a
> >>>> look at the recent commits here:
> >>>>
> >>>> https://github.com/intel/valgrind-rvv/commits/poc-rvv-remove-vl-from-ir/
> >>>>
> >>>> * The backend can get the exact vl from VexGuestRISCV64State instead
> >>>>   of the ir/op, so the generated code doesn't change due to different
> >>>>   vl values. Commit a532cd5708e changes it this way.
> >>>>
> >>>> * sizeofIRType() is called in many places; we may not change it but
> >>>>   rather add another function and call it only when necessary. In
> >>>>   this code it returns 1024 for any Ity_VLenX, which needs a fix.
> >>>>
> >>>> * The host isel now uses lmul & sew during codegen, see commit
> >>>>   b9d976b8, particularly vtype_to_nregs(). No code is generated
> >>>>   because of a different vl anymore.
> >>>>
> >>>> * Regarding Iop_Add8x8xN vs Iop_VAdd_vv_8, the latter is flexible
> >>>>   even for different LMUL, so I don't change it here.
> >>>>
> >>> This means "I haven't changed it in this prototype yet", not "I don't
> >>> want to change it", just in case of any ambiguity. If Iop_Add8x8xN is
> >>> better for all vector ISAs, I will surely change it.
> >>>
> >>> Thanks,
> >>> Fei.
> >>>
> >>>> Passed the test of coremark built with:
> >>>> clang -march=rv64gcv -mllvm -riscv-v-vector-bits-min=128 -O2
> >>>>
> >>>> Does this address the major concern?
> >
> > It is good that this version no longer has multiple translations of
> > the same code for different vl values. However, I don't think that
> > having vl used only by the backend and not expressed in the IR is the
> > right approach. The value needs to be clearly visible to Memcheck for
> > definedness tracking.
> > I also strongly suspect that the backend cannot make much use of vl
> > and it will need to generate all code for vlmax.
> >
> > Client code can contain the following sequence:
> >   vsetvli t0, a0, e32, m1, ta, ma
> >   vadd.vv v0, v1, v2
> >
> > The vsetvli instruction sets a new vl value and records that vector
> > operations should be tail-agnostic (the ta flag). From the
> > programmer's perspective, the vadd.vv instruction then operates on vl
> > elements and the rest of the result in v0 should be ignored. From the
> > hardware perspective, however, the instruction operates on the whole
> > vector register and some real value ends up in the tail of v0.
> >
> > Valgrind's IR should represent the latter, that is, the actual
> > hardware view. In this case, the resulting value of the tail elements
> > is "unknown" and the IR needs to be able to express that so Memcheck
> > can correctly track it. The IR therefore works with whole vector
> > registers and it naturally should result in the backend generating
> > code for vlmax.
>
> I agree with you that Valgrind should consider the tail elements as
> "unknown". But it's still possible to track the definedness of the
> tails correctly with the knowledge of vlmax in the backend, which can
> set the definedness of vl..vlmax to all-1s. The backend doesn't need to
> know whether the vector IR comes from guest code or from Memcheck; it's
> a coincidence, but a nice one, that both Memcheck and the RVV spec set
> them to 1s. The backend has no problem generating code up to vlmax;
> it's not necessary to stop at vl as it does right now.

This looks to me like a layering violation. IMO the backend shouldn't
have any implicit knowledge of what needs to end up undefined; it should
all be fully expressed in the IR.

> Memcheck doesn't need to know vl explicitly as long as it doesn't use
> another vl value.

I believe vl needs to be taken into account for definedness tracking.
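To make this concrete, here is a small Python sketch, my own model and
not Valgrind code, of how vl could enter a per-lane definedness
computation for a tail-agnostic vadd.vv. It uses the Memcheck convention
that a 1 bit in the shadow value means "undefined"; the lane width of 32
bits and list-of-lanes representation are assumptions for illustration:

```python
# Model: per-lane 32-bit shadow (definedness) result of a tail-agnostic
# vadd.vv, where s1 and s2 are the shadow values of the source lanes.
# Shadow convention: bit set = undefined (as in Memcheck).
MASK32 = 0xFFFFFFFF

def vadd_shadow(s1, s2, vl):
    out = []
    for i, (a, b) in enumerate(zip(s1, s2)):
        if i < vl:
            # Naive Memcheck rule for an add: the whole lane becomes
            # undefined if any input bit of the lane is undefined.
            out.append(MASK32 if (a | b) != 0 else 0)
        else:
            # Tail lane: the hardware may write anything there, so the
            # entire lane must be marked undefined.
            out.append(MASK32)
    return out
```

The point of the sketch is only that vl is an input of the shadow
computation itself, which is why it needs to be visible in the IR.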
If the value of vl is itself unknown then that should cause the result
to be correctly marked as unknown too. Additionally, for origin
tracking, the resulting value should have its origin set from vl. The
example that I showed has all of this explicitly expressed in the IR and
Memcheck should be able to instrument it accordingly. How would you
tackle it in your proposal?

> > For representing vl, as I touched upon in my previous email, I think
> > it is best to look at it as an implicit mask.
> >
> > SVE has explicit masks so it is easier to start with that. SVE code
> > can contain the following instruction:
> >   add z0.s, p0/m, z0.s, z1.s
> >
> > The instruction adds the 32-bit elements in z0 and z1 that are marked
> > as active by the predicate p0 and places the result in the
> > corresponding elements of the destination register z0, while keeping
> > any inactive elements unmodified.
> >
> > Note that the instruction has a limited encoding and so the
> > destination and the first source register are always the same.
> >
> > An IR for this operation could look as follows:
> >   t_mask = Expand1x2xNTo32x2xN(GET:V8xN(OFFSET_P0))
> >   t_maskn = Not32x2xN(t_mask)
> >   t_op1 = GET:V64xN(OFFSET_Z0)
> >   t_op2 = GET:V64xN(OFFSET_Z1)
> >   t_sum = Add32x2xN(t_op1, t_op2)
> >   t_sum_masked = And32x2xN(t_sum, t_mask)
> >   t_old_masked = And32x2xN(t_op1, t_maskn)
> >   t_res = Or32x2xN(t_sum_masked, t_old_masked)
> >   PUT(OFFSET_Z0) = t_res
> >
> > All temporaries are of type V64xN. Expand1x2xNTo32x2xN() takes single
> > mask bits and expands each of them to 32 bits.
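As an aside, the bitwise select that this IR performs can be modeled for
a single 32-bit lane with a short Python sketch. This is purely
illustrative and not Valgrind code; the function name and the boolean
`active` parameter are my own:

```python
# Model of one lane of the SVE merging add shown in the IR above:
# active lanes get op1 + op2, inactive lanes keep the old op1 value.
MASK32 = 0xFFFFFFFF

def masked_add_lane(op1, op2, active):
    mask = MASK32 if active else 0        # Expand1x2xNTo32x2xN
    maskn = ~mask & MASK32                # Not32x2xN
    t_sum = (op1 + op2) & MASK32          # Add32x2xN (modular, 32-bit)
    # And/And/Or select, exactly as in the IR sequence.
    return (t_sum & mask) | (op1 & maskn)
```

Note that the select is branch-free, which is what makes it easy to
express with plain vector And/Or/Not operations in the IR.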
> > Memcheck instrumentation would then look as follows:
> >   s_mask = Expand1x2xNTo32x2xN(GET:V8xN(OFFSET_P0_SHADOW))
> >   s_maskn = s_mask
> >   s_op1 = GET:V64xN(OFFSET_Z0_SHADOW)
> >   s_op2 = GET:V64xN(OFFSET_Z1_SHADOW)
> >   s_sum = CmpNEZ32x2xN(Or32x2xN(s_op1, s_op2))
> >   s_sum_masked = And32x2xN(Or32x2xN(s_sum, s_mask),
> >                    And32x2xN(Or32x2xN(t_sum, s_sum),
> >                              Or32x2xN(t_mask, s_mask)))
> >   s_old_masked = And32x2xN(Or32x2xN(s_op1, s_maskn),
> >                    And32x2xN(Or32x2xN(t_op1, s_op1),
> >                              Or32x2xN(t_maskn, s_maskn)))
> >   s_res = And32x2xN(Or32x2xN(s_sum_masked, s_old_masked),
> >             And32x2xN(Or32x2xN(Not32x2xN(t_sum_masked), s_sum_masked),
> >                       Or32x2xN(Not32x2xN(t_old_masked), s_old_masked)))
> >   PUT(OFFSET_Z0_SHADOW) = s_res
> >
> > In RVV, the same operation could be written as follows:
> >   vsetvli t0, a0, e32, m1, ta, ma
> >   vadd.vv v0, v1, v2
> >
> > The add instruction is similar to the one in the AArch64 case, with
> > the difference that it operates on the first vl elements (an implicit
> > mask) and the result for the inactive elements is unknown.
> >
> > The IR produced for vadd.vv would also look very similar:
> >   t_mask = Expand1x2xNTo32x2xN(PTrue1x2xN(GET:I64(OFFSET_VL)))
> >   t_maskn = Not32x2xN(t_mask)
> >   t_op1 = GET:V64xN(OFFSET_V1)
> >   t_op2 = GET:V64xN(OFFSET_V2)
> >   t_sum = Add32x2xN(t_op1, t_op2)
> >   t_sum_masked = And32x2xN(t_sum, t_mask)
> >   t_undef = GET:V64xN(OFFSET_V_UNDEF)
> >   t_undef_masked = And32x2xN(t_undef, t_maskn)
> >   t_res = Or32x2xN(t_sum_masked, t_undef_masked)
> >   PUT(OFFSET_V0) = t_res
> >
> > The mask cannot be obtained directly from a predicate register as in
> > the SVE case but is forged from the current vl using PTrue1x2xN().
> > The iop creates a mask where bits lower than the given value are set
> > to 1, and the rest to 0.
>
> How about the performance? It looks several times slower.

Possibly; I'd say it's the cost of correctness.
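For illustration, the combined effect of PTrue1x2xN() followed by the
32-bit expand can be modeled in Python as below. This is a hypothetical
model, not Valgrind code; it assumes SEW=32 and represents the vector as
a list of vlmax per-lane masks:

```python
# Model of PTrue1x2xN + Expand1x2xNTo32x2xN for SEW=32: lanes below vl
# get an all-ones 32-bit mask, the tail lanes get zero.
def ptrue_mask(vl, vlmax):
    return [0xFFFFFFFF if i < vl else 0x00000000 for i in range(vlmax)]
```

With this mask, the rest of the vadd.vv IR sequence is identical in
shape to the SVE one, which is the appeal of treating vl as an implicit
predicate.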
However, note that in the case of subsequent vector instructions, at
least some of the IR operations should end up the same and the
duplicates can be removed.

On 29. Feb 24 17:01, Wu, Fei wrote:
> This is a real case from the trace log of my code with the same vadd as
> above; I replaced the offset numbers with names for readability.
>
> * pre-instrumentation
>
>   t4 = GET:VLen32(OFFSET_V1)
>   t5 = GET:VLen32(OFFSET_V2)
>   t3 = VAdd_vv32(t4,t5)
>   PUT(OFFSET_V0) = t3
>
> * post-instrumentation (memcheck)
>
>   t24 = GET:VLen32(OFFSET_V1_SHADOW)
>   t4 = GET:VLen32(OFFSET_V1)
>   t25 = GET:VLen32(OFFSET_V2_SHADOW)
>   t5 = GET:VLen32(OFFSET_V2)
>   t27 = VOr_vv32(t24,t25)
>   t28 = VCmpNEZ32(t27)
>   t3 = VAdd_vv32(t4,t5)
>   PUT(OFFSET_V0_SHADOW) = t28
>   PUT(OFFSET_V0) = t3
>
> We can see that the key is t27: if the tail of t27 is all-1s in this
> case, then the definedness tracking records the correct value. As none
> of these IR operations change vl, the backend has no problem setting
> that.

As touched upon above, what if vl is itself undefined?

-- 
Thanks,
Petr