From: Wu, F. <fe...@in...> - 2024-02-29 09:02:11
On 2/21/2024 5:11 PM, Wu, Fei wrote:
> On 2/12/2024 5:41 AM, Petr Pavlu wrote:
>> It is good that this version no longer has multiple translations of
>> the same code for different vl values. However, I don't think that
>> having vl used only by the backend and not expressed in the IR is the
>> right thing. The value needs to be clearly visible to Memcheck for
>> definedness tracking. I also strongly suspect that the backend cannot
>> make much use of vl and it will need to generate all code for vlmax.
>>
>> Client code can contain the following sequence:
>>   vsetvli t0, a0, e32, m1, ta, ma
>>   vadd.vv v0, v1, v2
>>
>> The vsetvli instruction sets a new vl value and records that vector
>> operations should be tail-agnostic (the ta flag). From the
>> programmer's perspective, the vadd.vv instruction then operates on vl
>> elements and the rest of the result in v0 should be ignored. From the
>> hardware perspective, however, the instruction operates on the whole
>> vector register and some real value ends up in the tail of v0.
>>
>> Valgrind's IR should represent the latter, that is, the actual
>> hardware view. In this case, the resulting value of the tail elements
>> is "unknown" and the IR needs to be able to express it so Memcheck
>> can correctly track it. The IR therefore works with whole vector
>> registers and it naturally should result in the backend generating
>> code for vlmax.
>>
> I agree with you that Valgrind should consider the tail elements as
> "unknown". But it is still able to track the definedness of the tail
> correctly with the knowledge of vlmax in the backend, which can set
> the definedness bits of elements vl..vlmax to all-1s. The backend
> doesn't need to tell whether a vector IR comes from guest code or from
> Memcheck; it is a nice coincidence that both Memcheck and the RVV spec
> set these bits to 1s. The backend has no problem generating code up to
> vlmax; it is not necessary to stop at vl as it does right now.
>
> Memcheck doesn't need to know vl explicitly as long as it doesn't use
> another vl value.
>
Here is a real trace log from my code for the same vadd as above; I
replaced the offset numbers with names for readability.

* pre-instrumentation
  t4 = GET:VLen32(OFFSET_V1)
  t5 = GET:VLen32(OFFSET_V2)
  t3 = VAdd_vv32(t4,t5)
  PUT(OFFSET_V0) = t3

* post-instrumentation (memcheck)
  t24 = GET:VLen32(OFFSET_V1_SHADOW)
  t4 = GET:VLen32(OFFSET_V1)
  t25 = GET:VLen32(OFFSET_V2_SHADOW)
  t5 = GET:VLen32(OFFSET_V2)
  t27 = VOr_vv32(t24,t25)
  t28 = VCmpNEZ32(t27)
  t3 = VAdd_vv32(t4,t5)
  PUT(OFFSET_V0_SHADOW) = t28
  PUT(OFFSET_V0) = t3

We can see the key is t27: if the tail of t27 is all-1s in this case,
then the definedness tracking records the correct value. As none of
these IRs change vl, the backend has no problem setting that.

I do agree it is elegant to have the same IR for all the vector
instructions, but converting everything to mask-based operations
results in a performance degradation too.

Thanks,
Fei.

>> For representing vl, as I touched upon in my previous email, I think
>> it is best to look at it as an implicit mask.
>>
>> SVE has explicit masks so it is easier to start with that. SVE code
>> can contain the following instruction:
>>   add z0.s, p0/m, z0.s, z1.s
>>
>> The instruction adds 32-bit elements in z0 and z1 that are marked as
>> active by the predicate p0 and places the result in the corresponding
>> elements in the destination register z0, while keeping any inactive
>> elements unmodified.
>>
>> Note that the instruction has a limited encoding and so the
>> destination and the first source register are always the same.
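
To make the merging behaviour of this predicated add concrete, here is
a minimal scalar C model; the element count and types are illustrative
assumptions, not something taken from the patches under discussion.

  #include <stdbool.h>
  #include <stdint.h>

  #define NELEM 8   /* illustrative element count, really VL/32 */

  /* Model of "add z0.s, p0/m, z0.s, z1.s": active elements receive
     z0 + z1, inactive elements keep their old z0 value (merging
     predication). */
  static void sve_add_merge(uint32_t z0[NELEM], const uint32_t z1[NELEM],
                            const bool p0[NELEM])
  {
      for (int i = 0; i < NELEM; i++)
          if (p0[i])
              z0[i] = z0[i] + z1[i];
          /* else: z0[i] is left unmodified */
  }
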
>>
>> An IR for this operation could look as follows:
>>   t_mask = Expand1x2xNTo32x2xN(GET:V8xN(OFFSET_P0))
>>   t_maskn = Not32x2xN(t_mask)
>>   t_op1 = GET:V64xN(OFFSET_Z0)
>>   t_op2 = GET:V64xN(OFFSET_Z1)
>>   t_sum = Add32x2xN(t_op1, t_op2)
>>   t_sum_masked = And32x2xN(t_sum, t_mask)
>>   t_old_masked = And32x2xN(t_op1, t_maskn)
>>   t_res = Or32x2xN(t_sum_masked, t_old_masked)
>>   PUT(OFFSET_Z0) = t_res
>>
>> All temporaries are of type V64xN. Expand1x2xNTo32x2xN() takes single
>> mask bits and expands them to 32 bits.
>>
>> Memcheck instrumentation would then look like this:
>>   s_mask = Expand1x2xNTo32x2xN(GET:V8xN(OFFSET_P0_SHADOW))
>>   s_maskn = s_mask
>>   s_op1 = GET:V64xN(OFFSET_Z0_SHADOW)
>>   s_op2 = GET:V64xN(OFFSET_Z1_SHADOW)
>>   s_sum = CmpNEZ32x2xN(Or32x2xN(s_op1, s_op2))
>>   s_sum_masked = And32x2xN(Or32x2xN(s_sum, s_mask),
>>                            And32x2xN(Or32x2xN(t_sum, s_sum),
>>                                      Or32x2xN(t_mask, s_mask)))
>>   s_old_masked = And32x2xN(Or32x2xN(s_op1, s_maskn),
>>                            And32x2xN(Or32x2xN(t_op1, s_op1),
>>                                      Or32x2xN(t_maskn, s_maskn)))
>>   s_res = And32x2xN(Or32x2xN(s_sum_masked, s_old_masked),
>>                     And32x2xN(Or32x2xN(Not32x2xN(t_sum_masked), s_sum_masked),
>>                               Or32x2xN(Not32x2xN(t_old_masked), s_old_masked)))
>>   PUT(OFFSET_Z0_SHADOW) = s_res
>>
>> In RVV, the same operation could be written as follows:
>>   vsetvli t0, a0, e32, m1, ta, ma
>>   vadd.vv v0, v1, v2
>>
>> The add instruction is similar to the AArch64 case, with the
>> difference that it operates on the first vl elements (an implicit
>> mask) and the result for inactive elements is unknown.
>>
>> An IR produced for vadd.vv would also look very similar:
>>   t_mask = Expand1x2xNTo32x2xN(PTrue1x2xN(GET:I64(OFFSET_VL)))
>>   t_maskn = Not32x2xN(t_mask)
>>   t_op1 = GET:V64xN(OFFSET_V1)
>>   t_op2 = GET:V64xN(OFFSET_V2)
>>   t_sum = Add32x2xN(t_op1, t_op2)
>>   t_sum_masked = And32x2xN(t_sum, t_mask)
>>   t_undef = GET:V64xN(OFFSET_V_UNDEF)
>>   t_undef_masked = And32x2xN(t_undef, t_maskn)
>>   t_res = Or32x2xN(t_sum_masked, t_undef_masked)
>>   PUT(OFFSET_V0) = t_res
>>
>> The mask cannot be obtained directly from a predicate register as in
>> the SVE case but is forged from the current vl using PTrue1x2xN().
>> The iop creates a mask where bits lower than the given value are set
>> to 1, and the rest to 0.
>
> How about the performance? It looks several times slower.
>
> Thanks,
> Fei.
>
>> Note that SVE and RVV masks differ in how they are packed.
>> PTrue1x2xN() and Expand1x2xNTo32x2xN() might then need two variants
>> or some additional flag, I'm not immediately sure.
>>
>> To create an "unknown" value, the IR refers to V_UNDEF which is
>> supposed to be a read-only pseudo-register with all bits set to 1 (to
>> adhere to what the RVV spec allows) but tracked as undefined. This is
>> just an example, another approach to create an undefined value might
>> be better.
>>
>> Memcheck instrumentation for this RVV instruction should look very
>> similar to the SVE case. Importantly, Memcheck is able to fully see
>> how the result depends on the value of vl.
>>
>> I think these SVE and RVV examples show how the two extensions could
>> be supported in Valgrind in a more or less similar fashion.
>>
>> The RVV codegen from this IR is a bit tricky though. RVV seems to
>> have a limited set of operations to work with masks, so PTrue1x2xN()
>> and Expand1x2xNTo32x2xN() would be harder to generate, but it looks
>> doable.
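
To illustrate what the vl-as-implicit-mask view means for definedness,
here is a rough scalar C model of the vadd.vv case sketched above. The
names, the element count and the simplified shadow rule are assumptions
made for illustration; a 1 bit in a shadow word means "undefined", as
in Memcheck's V-bits.

  #include <stdint.h>

  #define VLMAX 8   /* illustrative */

  /* Model of "vadd.vv v0, v1, v2" under a given vl, together with the
     shadow (definedness) result Memcheck wants to see.  Active elements
     get the sum and a shadow derived from the operand shadows; tail
     elements get an arbitrary value (all-1s here, like V_UNDEF) and a
     fully undefined shadow. */
  static void vadd_vv_model(uint32_t v0[VLMAX], uint32_t v0_sh[VLMAX],
                            const uint32_t v1[VLMAX],
                            const uint32_t v1_sh[VLMAX],
                            const uint32_t v2[VLMAX],
                            const uint32_t v2_sh[VLMAX],
                            uint64_t vl)
  {
      for (uint64_t i = 0; i < VLMAX; i++) {
          if (i < vl) {                  /* implicit mask: PTrue1x2xN(vl) */
              v0[i]    = v1[i] + v2[i];
              /* cheap add rule: any undefined input bit taints the whole
                 output element, as in CmpNEZ32x2xN(Or32x2xN(..)) above */
              v0_sh[i] = (v1_sh[i] | v2_sh[i]) ? ~0u : 0u;
          } else {                       /* tail elements */
              v0[i]    = ~0u;            /* hardware writes "something" */
              v0_sh[i] = ~0u;            /* Memcheck: fully undefined */
          }
      }
  }

Whenever vl < VLMAX, the shadow of the tail stays all-1s regardless of
the operand values, which is the property the instrumentation above
makes visible to Memcheck.
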
>>
>> Another aspect is that the client code gets expanded quite a bit.
>> Loads/stores in particular could get quite large when all tracking
>> needs to be in place. Function dis_VMASKMOV() in
>> VEX/priv/guest_amd64_toIR.c provides an example of what needs to be
>> done. It loads/stores vector registers per lane using LoadG/StoreG.
>> Perhaps these statements could be extended to work directly on
>> vectors in some way.
>>
>> I hope this description makes sense. It is at least a direction I
>> would personally be looking at.
>>
>> Thanks,
>> Petr
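
As a footnote to the per-lane LoadG/StoreG point above, here is a
scalar C sketch of what a guarded per-lane vector load boils down to;
the lane count and names are assumptions for illustration only.

  #include <stddef.h>
  #include <stdint.h>

  #define NLANES 8   /* illustrative */

  /* Rough model of a masked vector load done lane by lane: each active
     lane is loaded from memory (one guarded load per element), each
     inactive lane receives a fallback value and memory is not touched. */
  static void masked_load(uint32_t dst[NLANES], const uint32_t *mem,
                          const uint32_t mask[NLANES], uint32_t fallback)
  {
      for (size_t i = 0; i < NLANES; i++)
          dst[i] = mask[i] ? mem[i] : fallback;
  }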