From: Wu, F. <fe...@in...> - 2023-11-13 07:58:16
On 10/31/2023 7:38 PM, Petr Pavlu wrote:
> On 25. Sep 23 10:14, Wu, Fei wrote:
>> On 8/21/2023 4:57 PM, Wu, Fei wrote:
>>> On 5/26/2023 10:08 PM, Wu, Fei wrote:
>>>> On 5/26/2023 9:59 PM, Fei Wu wrote:
>>>>> I'm from the Intel RISC-V team, working on a RISC-V International
>>>>> development partner project to add RISC-V vector (RVV) support to
>>>>> Valgrind; the target tool is memcheck. My work is based on commit
>>>>> 71272b252977 of Petr's riscv64-linux branch, many thanks to Petr
>>>>> for his great work.
>>>>> https://github.com/petrpavlu/valgrind-riscv64
>>>>>
>>> A vector IR version has been added to this repo (branch poc-rvv):
>>> https://github.com/intel/valgrind-rvv/commits/poc-rvv
>>>
>>> A number of RVV instructions have been enabled, and I plan to enable
>>> more so I can run some general benchmarks; coremark is the first one.
>>>
>> I have enabled all the RVV instructions except the floating-point and
>> fixed-point ones, so I think it's ready for a try and an initial
>> review.
>> https://github.com/intel/valgrind-rvv/commits/poc-rvv
>>
>> The design of this patch series:
>> * Every RVV instruction has a corresponding vector IR, except
>>   load/store, which still uses the split-to-scalar pattern according
>>   to the mask.
>> * Most instructions where each element is computed independently,
>>   such as vadd, are implemented as (read the original destination,
>>   compute the result as in the unmasked version, apply the mask), so
>>   it is not necessary to define both a masked and an unmasked version
>>   of the same vector IR. The other instructions, such as vredsum,
>>   treat the masked and unmasked forms differently.
>> * Besides the `vl' which is encoded in the vector op/type, some other
>>   information, such as vlmax, needs to be passed from the frontend to
>>   the backend; we have a basic method to do that.
>>
>> I have done very limited testing on the code, usually only a basic
>> correctness test per instruction.
>> The memcheck logic for some instructions still needs polishing; it
>> may not find every bug memcheck is supposed to identify, but at least
>> that shouldn't stop us from trying it against RVV binaries and
>> finding the bugs unrelated to RVV instructions.
>>
>> @Petr, could you please take a look?
>
> As previously touched upon in another thread [1], I'm not entirely
> convinced by the idea of multiple translations of the same code for
> the various active vector lengths (vl).
>
> The main issue is that having such multiple translations requires
> spending more cycles on translation and creates pressure on
> memory/caches. Some chaining opportunities may also be lost when it is
> not possible in certain cases to statically determine which
> translation to select.
>
> Most vector operations shouldn't need to be aware of the current vl
> setting as they can operate on entire vectors. Loads/stores are one
> exception; they need to avoid reading/writing memory past the limit
> determined by the value of vl.
>
> The approach overall looks to me somewhat RVV-centric. SVE, as
> another scalable vector architecture, is designed around predicate
> masks and so in general doesn't have a vl value.
>
From this perspective, RVV looks more generic than SVE? Something in
RVV such as vl is not expressible by SVE primitives, so RVV has to be
covered for a unified solution.

> The RVV prototype adds new IRTypes called Ity_VLen1 (for masks) and
> Ity_VLen8, Ity_VLen16, Ity_VLen32, Ity_VLen64 (for data), and uses
> the upper 16 bits of an IRType instance to record the current vl. New
> IROps named Iop_V* are added, and similarly vl is stored in the upper
> 16 bits of an IROp. For instance, some new addition ops are
> Iop_VAdd_vv_8, Iop_VAdd_vv_16, Iop_VAdd_vv_32, Iop_VAdd_vv_64.
>
> My SVE prototype takes an alternative approach inspired by GCC and
> LLVM.
> The design derives from the "< vscale x <#elements> x <elementtype> >"
> LLVM syntax [2] and the corresponding "nxv<#elements><elementtype>"
> machine value types [3]. The "vscale"/"n" in this compiler model is
> constant and equal at runtime to VLEN/64.
>
> The prototype is available in the valgrind-riscv64 repository on
> GitHub, branch poc-sve [4].
>
> The code introduces two new IRTypes: Ity_V8xN to represent "an 8-bit
> x N scalable vector predicate" and Ity_V64xN for "a 64-bit x N
> scalable vector data". New IROps are added too. For example, addition
> ops are Iop_Add8x8xN, Iop_Add16x4xN, Iop_Add32x2xN, Iop_Add64x1xN.
>
> Note that Valgrind has a different ordering in the notation of vector
> operations. Valgrind uses "< <elementtype> x <#elements> >" for fixed
> vectors, compared to the more common "< <#elements> x <elementtype> >".
> My prototype follows the reversed notation used by Valgrind, meaning
> "< <elementtype> x <#elements> x <vscale/N> >".
>
> The "N" dimension is critical. In contrast to compilers, Valgrind has
> the luxury of knowing its actual value.
>
> For SVE, the value would naively be equal to VLEN/64 and constant
> throughout the runtime of a client program. However, Armv9.2-A
> introduced the Scalable Matrix Extension [5] along with the so-called
> Streaming SVE mode. This operating mode can have an alternative
> vector size, and userspace programs may switch between the two modes
> as needed.
>
> This requires tracking the current scalable vector register size, and
> therefore the "N" value, at a higher granularity. It makes me think
> "N" should be attached to, and be constant per, an IRSB.
>
If "N" is attached, then your SVE prototype will be similar to my RVV
prototype. You have the following with a fixed 64-bit mini-block:

  Iop_Add8x8xN, Iop_Add16x4xN, Iop_Add32x2xN, Iop_Add64x1xN

and I have the following with 8/16/32/64-bit elements:

  Iop_VAdd_vv_8, Iop_VAdd_vv_16, Iop_VAdd_vv_32, Iop_VAdd_vv_64

Ignoring the naming convention, they are the same, except that you
group elements into a 64-bit mini-block. However, Iop_VAdd_vv_8 is
necessary because vl in RVV might result in non-64-bit-aligned
accesses.

I agree that attaching "N" to the IRSB is a good idea. The reason I
embed it in every vector IRType & IROp is that some logic requires the
precise size of the type, e.g. by calling and checking
sizeofIRType(type). I see you have a TODO in the SVE prototype too:

  Int sizeofIRType ( IRType ty )
    /* TODO Fix the size for scalable types. */
    case Ity_V8xN: return 2;
    case Ity_V64xN: return 16;

If this function does need to return the precise size of an IRType, we
need to address that, probably by passing some kind of env to this
function? It looks like a lot of code changes are needed. In the RVV
prototype I kept it simple by encoding the size in the IRType. Another
reason I do so is that it can handle the case where the instrumentation
code generates vector instructions with a different vl, but I'm not
sure if there is any real case.

> RVV with its LMUL grouping is similar in that it results in a
> different vector size. Following the LLVM schema, LMUL could be
> modeled through the "<#elements>" dimension. For instance,
> Iop_Add8x16xN would be an 8-bit addition for LMUL=2. However, this
> would entail adding too many new iops, which is not desirable.
>
> A better option in the Valgrind case looks to be for LMUL to affect
> "N" too. Its current value would be tracked and constant per IRSB,
> same as in the SVE case.
>
helper_vsetvl() in RVV takes LMUL into consideration when setting
guest_vl.

Thank you for all your comments, and please correct me if anything is
wrong.

Best Regards,
Fei.
> [1] https://sourceforge.net/p/valgrind/mailman/valgrind-developers/thread/zsjnggi3ngqkjmbsojjnvjcqtlr5crplev3inb55jlwxnr6tjr%40s5a65qvm3kjc/#msg37872645
> [2] https://llvm.org/docs/LangRef.html#t-vector
> [3] https://github.com/llvm/llvm-project/blob/llvmorg-17.0.0/llvm/lib/Target/RISCV/RISCVRegisterInfo.td#L266
> [4] https://github.com/petrpavlu/valgrind-riscv64/tree/poc-sve
> [5] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture