Re: [Valgrind-users] Problem with valgrind-3.17.0 and openmpi-4.0.5

I have tried valgrind 3.17.0 with openmpi 4.0.2, and that combination works.

Do you know of any reported bugs with that specific version (openmpi 4.0.5)?
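
A possible lead, offered as a guess rather than a confirmed diagnosis: the unhandled instruction bytes in the log below start with 0x62, which in 64-bit mode is the EVEX prefix used by AVX-512 encodings. Valgrind 3.17.0 does not appear to support AVX-512, so the 4.0.5 build of libopen-pal may have been compiled with AVX-512 enabled. The C sketch below only illustrates that first-byte check; starts_with_evex_prefix and reported_bytes are hypothetical names, not part of Valgrind or Open MPI.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Valgrind or Open MPI.
   Returns 1 if the sequence starts with 0x62, the EVEX prefix byte
   in 64-bit mode (used by AVX-512 encodings), and 0 otherwise.
   It inspects only the first byte; it is not a full decoder. */
static int starts_with_evex_prefix(const unsigned char *bytes, size_t len)
{
    return len > 0 && bytes[0] == 0x62;
}

/* The ten bytes that Valgrind reported as unhandled in the log below. */
static const unsigned char reported_bytes[] = {
    0x62, 0xF2, 0x7D, 0x08, 0x7C, 0xC5, 0xC5, 0xF9, 0xD6, 0x43
};
```

Calling starts_with_evex_prefix(reported_bytes, sizeof reported_bytes) returns 1 for these bytes. If that reading is right, comparing how the 4.0.2 and 4.0.5 builds were compiled would be a reasonable next step.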

Regards,

Federico Tesser

On Wed, 07 Jul 2021 10:25:52 +0200
  "TESSER FEDERICO" <fed...@po...> wrote:
> Good morning.
> 
> I have installed valgrind 3.17.0, having previously loaded the module for
> openmpi 4.0.5, so it found the "MPI2-compliant mpicc and mpi.h...".
> 
> However, trying to run just a simple program like this one:
> 
> 
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char** argv) {
> 
>     MPI_Init(NULL, NULL);
> 
>     int world_size;
>     int world_rank;
>     int name_len;
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
> 
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(processor_name, &name_len);
> 
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            processor_name, world_rank, world_size);
> 
>     MPI_Finalize();
> 
>     return 0;
> }
> 
> 
> 
> will produce the following errors:
> 
> 
> 
> ==113228== Memcheck, a memory error detector
> ==113228== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==113228== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
> ==113228== Command: ./pure_mpi_valgrind_try/a.out
> ==113228==
> valgrind MPI wrappers 113228: Active for pid 113228
> valgrind MPI wrappers 113228: Try MPIWRAP_DEBUG=help for possible options
> vex amd64->IR: unhandled instruction bytes: 0x62 0xF2 0x7D 0x8 0x7C 0xC5 0xC5 0xF9 0xD6 0x43
> vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
> vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
> vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
> ==113228== valgrind: Unrecognised instruction at address 0x5c79318.
> ==113228==    at 0x5C79318: opal_pointer_array_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
> ==113228==    by 0x5CA4BDB: mca_base_var_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
> ==113228==    by 0x5C82F11: opal_init_util (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
> ==113228==    by 0x5157FD9: ompi_mpi_init (ompi_mpi_init.c:428)
> ==113228==    by 0x50FB3A8: PMPI_Init (pinit.c:69)
> ==113228==    by 0x4E4BC26: PMPI_Init (libmpiwrap.c:2288)
> ==113228==    by 0x10893B: main (main.c:6)
> ==113228== Your program just tried to execute an instruction that Valgrind
> ==113228== did not recognise.  There are two possible reasons for this.
> ==113228== 1. Your program has a bug and erroneously jumped to a non-code
> ==113228==    location.  If you are running Memcheck and you just saw a
> ==113228==    warning about a bad jump, it's probably your program's fault.
> ==113228== 2. The instruction is legitimate but Valgrind doesn't handle it,
> ==113228==    i.e. it's Valgrind's fault.  If you think this is the case or
> ==113228==    you are not sure, please let us know and we'll try to fix it.
> ==113228== Either way, Valgrind will now raise a SIGILL signal which will
> ==113228== probably kill your program.
> ==113228==
> ==113228== Process terminating with default action of signal 4 (SIGILL): dumping core
> ==113228==  Illegal opcode at address 0x5C79318
> ==113228==    at 0x5C79318: opal_pointer_array_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
> ==113228==    by 0x5CA4BDB: mca_base_var_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
> ==113228==    by 0x5C82F11: opal_init_util (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
> ==113228==    by 0x5157FD9: ompi_mpi_init (ompi_mpi_init.c:428)
> ==113228==    by 0x50FB3A8: PMPI_Init (pinit.c:69)
> ==113228==    by 0x4E4BC26: PMPI_Init (libmpiwrap.c:2288)
> ==113228==    by 0x10893B: main (main.c:6)
> slurmstepd: error: *** JOB 159641 ON node01 CANCELLED AT 2021-07-07T10:21:29 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> slurmstepd: error: *** STEP 159641.0 ON node01 CANCELLED AT 2021-07-07T10:22:48 ***
> 
> 
> 
> What am I doing wrong?
> 
> Regards,
> 
> Federico Tesser