From: Fei Wu <fe...@in...> - 2023-05-26 13:57:45
I'm from the Intel RISC-V team, working on a RISC-V International development partner project to add RISC-V Vector (RVV) support to Valgrind; the target tool is memcheck. My work is based on commit 71272b252977 of Petr's riscv64-linux branch, so many thanks to Petr for his great work first:

https://github.com/petrpavlu/valgrind-riscv64

This RFC is a starting point for RVV support in Valgrind. It is far from complete and will take considerable time, but I think it is more effective to have some real code for discussion, so this series adds enough RVV support to run memcpy/strcmp/strcpy/strlen/strncpy from:

https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/master/examples

The whole idea is to split each vector instruction into scalar instructions, which are already well supported on Petr's branch. The correctness of the binary translation (tool=none) is simple to ensure, but the logic of tool=memcheck must not be broken either. One of the keys is handling instructions with a mask:

* For masked loads/stores, LoadG/StoreG are used, with the same semantics as on other architectures.
* For other instructions such as vadd, if the vector mask agnostic (vma) bit is set to undisturbed, the original value of each masked-off element is read first and then written back. Its V bits do not change across the write-back, so no additional guarded type like LoadG/StoreG is necessary.

Pros
----
* By leveraging Valgrind's existing scalar instruction support, adding a new instruction usually involves only the frontend in guest_riscv64_toIR; other parts are rarely touched, so the effort to enable new instructions is much reduced.
* As the backend only sees scalar IR and generates scalar instructions, it is possible to run valgrind ./vec-test on a non-RVV host.

Cons
----
* As this method splits RVV instructions in the frontend, there is less opportunity to optimize at later stages, e.g. the V-bits tracking.
* With a larger vlen such as 1K, a single RVV instruction can split into up to 1K ops. Besides the performance penalty, this also puts pressure on other components such as tmp space. Some of this can be relieved by grouping multiple elements together.

There are some alternatives, but none seems perfect:

* Helper functions. This makes tool=none much easier to get working, but how well can it handle the V+A-bit tracking and the other tools? Generally speaking, it should not be a general solution for too many instructions.
* Define RVV IR and pass it to the backend, instead of splitting it so early. This requires much more effort; we should evaluate what level of benefit it can attain.

Finally, if the performance is tolerable, is this the right way to go?

Fei Wu (12):
  riscv64: Starting Vector support, registers added
  riscv64: Pass riscv guest_state for translation
  riscv64: Add SyncupEnv & TooManyIR jump kinds
  riscv64: Add LoadG/StoreG support
  riscv64: Shift guest_state -2048 on calling helper
  riscv64: Add cpu_state to TB
  riscv64: Introduce dis_RV64V and add vsetvl
  riscv64: Add load/store
  riscv64: Add csrr vl
  riscv64: add vfirst
  riscv64: Add vmsgtu/vmseq/vmsne/vmsbf/vmsif/vmor/vmv/vid
  riscv64: Add vadd

 VEX/priv/guest_riscv64_toIR.c     | 974 +++++++++++++++++++++++++++++-
 VEX/priv/host_riscv64_defs.c      | 133 ++++
 VEX/priv/host_riscv64_defs.h      |  23 +
 VEX/priv/host_riscv64_isel.c      |  89 ++-
 VEX/priv/ir_defs.c                |   8 +
 VEX/priv/ir_opt.c                 |   4 +-
 VEX/pub/libvex.h                  |   4 +
 VEX/pub/libvex_guest_riscv64.h    |  47 +-
 VEX/pub/libvex_ir.h               |   9 +-
 coregrind/m_scheduler/scheduler.c |  17 +-
 coregrind/m_translate.c           |   5 +
 coregrind/m_transtab.c            |  26 +-
 coregrind/pub_core_transtab.h     |   5 +
 memcheck/mc_machine.c             |  35 ++
 memcheck/mc_translate.c           |   4 +
 15 files changed, 1368 insertions(+), 15 deletions(-)

-- 
2.25.1