From: Wu, F. <fe...@in...> - 2024-02-29 09:02:11
On 2/21/2024 5:11 PM, Wu, Fei wrote:
> On 2/12/2024 5:41 AM, Petr Pavlu wrote:
>> It is good that this version no longer has multiple translations of
>> the same code for different vl values. However, I don't think that
>> having vl used only by the backend and not expressed in the IR is the
>> right thing. The value needs to be clearly visible to Memcheck for
>> definedness tracking. I also strongly suspect that the backend cannot
>> make much use of vl and it will need to generate all code for vlmax.
>>
>> Client code can contain the following sequence:
>>   vsetvli t0, a0, e32, m1, ta, ma
>>   vadd.vv v0, v1, v2
>>
>> The vsetvli instruction sets a new vl value and records that vector
>> operations should be tail-agnostic (the ta flag). From the
>> programmer's perspective, the vadd.vv instruction then operates on vl
>> elements and the rest of the result in v0 should be ignored. From the
>> hardware perspective, however, the instruction operates on the whole
>> vector register and some real value ends up in the tail of v0.
>>
>> Valgrind's IR should represent the latter, that is, the actual
>> hardware view. In this case, the resulting value of the tail elements
>> is "unknown" and the IR needs to be able to express it so Memcheck
>> can correctly track it. The IR therefore works with whole vector
>> registers and it naturally should result in the backend generating
>> code for vlmax.
>>
> I agree with you that Valgrind should consider the tail elements as
> "unknown". But it is still able to track the definedness of the tail
> correctly with the knowledge of vlmax in the backend, which can set
> the definedness bits of elements vl..vlmax to all-1s. The backend
> doesn't need to tell whether a vector IR comes from guest code or from
> Memcheck; it is a nice coincidence that both Memcheck and the RVV spec
> set these bits to 1s. The backend has no problem generating code up to
> vlmax; it is not necessary to stop at vl as it does right now.
>
> Memcheck doesn't need to know vl explicitly as long as it doesn't use
> another vl value.
>
Here is a real trace log from my code for the same vadd as above; I
replaced the offset numbers with names for readability.

* pre-instrumentation
  t4 = GET:VLen32(OFFSET_V1)
  t5 = GET:VLen32(OFFSET_V2)
  t3 = VAdd_vv32(t4,t5)
  PUT(OFFSET_V0) = t3

* post-instrumentation (memcheck)
  t24 = GET:VLen32(OFFSET_V1_SHADOW)
  t4 = GET:VLen32(OFFSET_V1)
  t25 = GET:VLen32(OFFSET_V2_SHADOW)
  t5 = GET:VLen32(OFFSET_V2)
  t27 = VOr_vv32(t24,t25)
  t28 = VCmpNEZ32(t27)
  t3 = VAdd_vv32(t4,t5)
  PUT(OFFSET_V0_SHADOW) = t28
  PUT(OFFSET_V0) = t3

We can see the key is t27: if the tail of t27 is all-1s in this case,
then the definedness tracking records the correct value. As none of
these IRs change vl, the backend has no problem setting that.

I do agree it is elegant to have the same IR for all the vector
instructions, but converting everything to mask-based operations
results in a performance degradation too.

Thanks,
Fei.

>> For representing vl, as I touched upon in my previous email, I think
>> it is best to look at it as an implicit mask.
>>
>> SVE has explicit masks so it is easier to start with that. SVE code
>> can contain the following instruction:
>>   add z0.s, p0/m, z0.s, z1.s
>>
>> The instruction adds 32-bit elements in z0 and z1 that are marked as
>> active by the predicate p0 and places the result in the corresponding
>> elements in the destination register z0, while keeping any inactive
>> elements unmodified.
>>
>> Note that the instruction has a limited encoding and so the
>> destination and the first source register are always the same.
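
To make the merging behaviour of this predicated add concrete, here is
a minimal scalar C model; the element count and types are illustrative
assumptions, not something taken from the patches under discussion.

  #include <stdbool.h>
  #include <stdint.h>

  #define NELEM 8   /* illustrative element count, really VL/32 */

  /* Model of "add z0.s, p0/m, z0.s, z1.s": active elements receive
     z0 + z1, inactive elements keep their old z0 value (merging
     predication). */
  static void sve_add_merge(uint32_t z0[NELEM], const uint32_t z1[NELEM],
                            const bool p0[NELEM])
  {
      for (int i = 0; i < NELEM; i++)
          if (p0[i])
              z0[i] = z0[i] + z1[i];
          /* else: z0[i] is left unmodified */
  }
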
>>
>> An IR for this operation could look as follows:
>>   t_mask = Expand1x2xNTo32x2xN(GET:V8xN(OFFSET_P0))
>>   t_maskn = Not32x2xN(t_mask)
>>   t_op1 = GET:V64xN(OFFSET_Z0)
>>   t_op2 = GET:V64xN(OFFSET_Z1)
>>   t_sum = Add32x2xN(t_op1, t_op2)
>>   t_sum_masked = And32x2xN(t_sum, t_mask)
>>   t_old_masked = And32x2xN(t_op1, t_maskn)
>>   t_res = Or32x2xN(t_sum_masked, t_old_masked)
>>   PUT(OFFSET_Z0) = t_res
>>
>> All temporaries are of type V64xN. Expand1x2xNTo32x2xN() takes single
>> mask bits and expands them to 32 bits.
>>
>> Memcheck instrumentation would then look like this:
>>   s_mask = Expand1x2xNTo32x2xN(GET:V8xN(OFFSET_P0_SHADOW))
>>   s_maskn = s_mask
>>   s_op1 = GET:V64xN(OFFSET_Z0_SHADOW)
>>   s_op2 = GET:V64xN(OFFSET_Z1_SHADOW)
>>   s_sum = CmpNEZ32x2xN(Or32x2xN(s_op1, s_op2))
>>   s_sum_masked = And32x2xN(Or32x2xN(s_sum, s_mask),
>>                            And32x2xN(Or32x2xN(t_sum, s_sum),
>>                                      Or32x2xN(t_mask, s_mask)))
>>   s_old_masked = And32x2xN(Or32x2xN(s_op1, s_maskn),
>>                            And32x2xN(Or32x2xN(t_op1, s_op1),
>>                                      Or32x2xN(t_maskn, s_maskn)))
>>   s_res = And32x2xN(Or32x2xN(s_sum_masked, s_old_masked),
>>                     And32x2xN(Or32x2xN(Not32x2xN(t_sum_masked), s_sum_masked),
>>                               Or32x2xN(Not32x2xN(t_old_masked), s_old_masked)))
>>   PUT(OFFSET_Z0_SHADOW) = s_res
>>
>> In RVV, the same operation could be written as follows:
>>   vsetvli t0, a0, e32, m1, ta, ma
>>   vadd.vv v0, v1, v2
>>
>> The add instruction is similar to the AArch64 case, with the
>> difference that it operates on the first vl elements (an implicit
>> mask) and the result for inactive elements is unknown.
>>
>> An IR produced for vadd.vv would also look very similar:
>>   t_mask = Expand1x2xNTo32x2xN(PTrue1x2xN(GET:I64(OFFSET_VL)))
>>   t_maskn = Not32x2xN(t_mask)
>>   t_op1 = GET:V64xN(OFFSET_V1)
>>   t_op2 = GET:V64xN(OFFSET_V2)
>>   t_sum = Add32x2xN(t_op1, t_op2)
>>   t_sum_masked = And32x2xN(t_sum, t_mask)
>>   t_undef = GET:V64xN(OFFSET_V_UNDEF)
>>   t_undef_masked = And32x2xN(t_undef, t_maskn)
>>   t_res = Or32x2xN(t_sum_masked, t_undef_masked)
>>   PUT(OFFSET_V0) = t_res
>>
>> The mask cannot be obtained directly from a predicate register as in
>> the SVE case but is forged from the current vl using PTrue1x2xN().
>> The iop creates a mask where bits lower than the given value are set
>> to 1, and the rest to 0.
>
> How about the performance? It looks several times slower.
>
> Thanks,
> Fei.
>
>> Note that SVE and RVV masks differ in how they are packed.
>> PTrue1x2xN() and Expand1x2xNTo32x2xN() might then need two variants
>> or some additional flag, I'm not immediately sure.
>>
>> To create an "unknown" value, the IR refers to V_UNDEF which is
>> supposed to be a read-only pseudo-register with all bits set to 1 (to
>> adhere to what the RVV spec allows) but tracked as undefined. This is
>> just an example, another approach to create an undefined value might
>> be better.
>>
>> Memcheck instrumentation for this RVV instruction should look very
>> similar to the SVE case. Importantly, Memcheck is able to fully see
>> how the result depends on the value of vl.
>>
>> I think these SVE and RVV examples show how the two extensions could
>> be supported in Valgrind in a more or less similar fashion.
>>
>> The RVV codegen from this IR is a bit tricky though. RVV seems to
>> have a limited set of operations to work with masks, so PTrue1x2xN()
>> and Expand1x2xNTo32x2xN() would be harder to generate, but it looks
>> doable.
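
To illustrate what the vl-as-implicit-mask view means for definedness,
here is a rough scalar C model of the vadd.vv case sketched above. The
names, the element count and the simplified shadow rule are assumptions
made for illustration; a 1 bit in a shadow word means "undefined", as
in Memcheck's V-bits.

  #include <stdint.h>

  #define VLMAX 8   /* illustrative */

  /* Model of "vadd.vv v0, v1, v2" under a given vl, together with the
     shadow (definedness) result Memcheck wants to see.  Active elements
     get the sum and a shadow derived from the operand shadows; tail
     elements get an arbitrary value (all-1s here, like V_UNDEF) and a
     fully undefined shadow. */
  static void vadd_vv_model(uint32_t v0[VLMAX], uint32_t v0_sh[VLMAX],
                            const uint32_t v1[VLMAX],
                            const uint32_t v1_sh[VLMAX],
                            const uint32_t v2[VLMAX],
                            const uint32_t v2_sh[VLMAX],
                            uint64_t vl)
  {
      for (uint64_t i = 0; i < VLMAX; i++) {
          if (i < vl) {                  /* implicit mask: PTrue1x2xN(vl) */
              v0[i]    = v1[i] + v2[i];
              /* cheap add rule: any undefined input bit taints the whole
                 output element, as in CmpNEZ32x2xN(Or32x2xN(..)) above */
              v0_sh[i] = (v1_sh[i] | v2_sh[i]) ? ~0u : 0u;
          } else {                       /* tail elements */
              v0[i]    = ~0u;            /* hardware writes "something" */
              v0_sh[i] = ~0u;            /* Memcheck: fully undefined */
          }
      }
  }

Whenever vl < VLMAX, the shadow of the tail stays all-1s regardless of
the operand values, which is the property the instrumentation above
makes visible to Memcheck.
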
>>
>> Another aspect is that the client code gets expanded quite a bit.
>> Loads/stores in particular could get quite large when all tracking
>> needs to be in place. Function dis_VMASKMOV() in
>> VEX/priv/guest_amd64_toIR.c provides an example of what needs to be
>> done. It loads/stores vector registers per lane using LoadG/StoreG.
>> Perhaps these statements could be extended to work directly on
>> vectors in some way.
>>
>> I hope this description makes sense. It is at least a direction I
>> would personally be looking at.
>>
>> Thanks,
>> Petr
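
As a footnote to the per-lane LoadG/StoreG point above, here is a
scalar C sketch of what a guarded per-lane vector load boils down to;
the lane count and names are assumptions for illustration only.

  #include <stddef.h>
  #include <stdint.h>

  #define NLANES 8   /* illustrative */

  /* Rough model of a masked vector load done lane by lane: each active
     lane is loaded from memory (one guarded load per element), each
     inactive lane receives a fallback value and memory is not touched. */
  static void masked_load(uint32_t dst[NLANES], const uint32_t *mem,
                          const uint32_t mask[NLANES], uint32_t fallback)
  {
      for (size_t i = 0; i < NLANES; i++)
          dst[i] = mask[i] ? mem[i] : fallback;
  }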