From: Fei Wu <fe...@in...> - 2023-05-26 13:57:45
I'm from the Intel RISC-V team, working on a RISC-V International development partner project to add RISC-V Vector (RVV) support to Valgrind; the target tool is memcheck. My work is based on commit 71272b252977 of Petr's riscv64-linux branch, so many thanks to Petr for his great work first:

https://github.com/petrpavlu/valgrind-riscv64

This RFC is a starting point for RVV support in Valgrind. It is far from complete and will take considerable time, but I think it is more effective to have some real code for discussion, so this series adds enough RVV support to run memcpy/strcmp/strcpy/strlen/strncpy from:

https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/master/examples

The whole idea is to split each vector instruction into scalar instructions, which are already well supported on Petr's branch. The correctness of the binary translation (tool=none) is simple to ensure, but the logic of tool=memcheck must not be broken either. One of the keys is handling instructions with a mask:

* For masked loads/stores, LoadG/StoreG are used, with the same semantics as on other architectures.
* For other instructions such as vadd, if the vector mask agnostic (vma) bit is set to undisturbed, the original value of each masked-off element is read first and then written back. Its V bits do not change across the write-back, so no additional guarded type like LoadG/StoreG is necessary.

Pros
----
* By leveraging Valgrind's existing scalar instruction support, adding a new instruction usually involves only the frontend in guest_riscv64_toIR; other parts are rarely touched, so the effort to enable new instructions is much reduced.
* As the backend only sees scalar IR and generates scalar instructions, it is possible to run valgrind ./vec-test on a non-RVV host.

Cons
----
* As this method splits RVV instructions in the frontend, there is less opportunity to optimize at later stages, e.g. the V-bits tracking.
* With a larger vlen such as 1K, a single RVV instruction can split into up to 1K ops. Besides the performance penalty, this also puts pressure on other components such as tmp space. Some of this can be relieved by grouping multiple elements together.

There are some alternatives, but none seems perfect:

* Helper functions. This makes tool=none much easier to get working, but how well can it handle the V+A-bit tracking and the other tools? Generally speaking, it should not be a general solution for too many instructions.
* Define RVV IR and pass it to the backend, instead of splitting it so early. This requires much more effort; we should evaluate what level of benefit it can attain.

Finally, if the performance is tolerable, is this the right way to go?

Fei Wu (12):
  riscv64: Starting Vector support, registers added
  riscv64: Pass riscv guest_state for translation
  riscv64: Add SyncupEnv & TooManyIR jump kinds
  riscv64: Add LoadG/StoreG support
  riscv64: Shift guest_state -2048 on calling helper
  riscv64: Add cpu_state to TB
  riscv64: Introduce dis_RV64V and add vsetvl
  riscv64: Add load/store
  riscv64: Add csrr vl
  riscv64: add vfirst
  riscv64: Add vmsgtu/vmseq/vmsne/vmsbf/vmsif/vmor/vmv/vid
  riscv64: Add vadd

 VEX/priv/guest_riscv64_toIR.c     | 974 +++++++++++++++++++++++++++++-
 VEX/priv/host_riscv64_defs.c      | 133 ++++
 VEX/priv/host_riscv64_defs.h      |  23 +
 VEX/priv/host_riscv64_isel.c      |  89 ++-
 VEX/priv/ir_defs.c                |   8 +
 VEX/priv/ir_opt.c                 |   4 +-
 VEX/pub/libvex.h                  |   4 +
 VEX/pub/libvex_guest_riscv64.h    |  47 +-
 VEX/pub/libvex_ir.h               |   9 +-
 coregrind/m_scheduler/scheduler.c |  17 +-
 coregrind/m_translate.c           |   5 +
 coregrind/m_transtab.c            |  26 +-
 coregrind/pub_core_transtab.h     |   5 +
 memcheck/mc_machine.c             |  35 ++
 memcheck/mc_translate.c           |   4 +
 15 files changed, 1368 insertions(+), 15 deletions(-)

-- 
2.25.1