From: TESSER F. <fed...@po...> - 2021-07-07 08:53:25
Good morning.

I have installed valgrind 3.17.0, having previously loaded the module for openmpi 4.0.5, so that configure found the "MPI2-compliant mpicc and mpi.h...". However, running even a simple program like this one:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int world_size;
    int world_rank;
    int name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
}

produces the following errors:

==113228== Memcheck, a memory error detector
==113228== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==113228== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==113228== Command: ./pure_mpi_valgrind_try/a.out
==113228==
valgrind MPI wrappers 113228: Active for pid 113228
valgrind MPI wrappers 113228: Try MPIWRAP_DEBUG=help for possible options
vex amd64->IR: unhandled instruction bytes: 0x62 0xF2 0x7D 0x8 0x7C 0xC5 0xC5 0xF9 0xD6 0x43
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==113228== valgrind: Unrecognised instruction at address 0x5c79318.
==113228==    at 0x5C79318: opal_pointer_array_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228==    by 0x5CA4BDB: mca_base_var_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228==    by 0x5C82F11: opal_init_util (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228==    by 0x5157FD9: ompi_mpi_init (ompi_mpi_init.c:428)
==113228==    by 0x50FB3A8: PMPI_Init (pinit.c:69)
==113228==    by 0x4E4BC26: PMPI_Init (libmpiwrap.c:2288)
==113228==    by 0x10893B: main (main.c:6)
==113228== Your program just tried to execute an instruction that Valgrind
==113228== did not recognise. There are two possible reasons for this.
==113228== 1. Your program has a bug and erroneously jumped to a non-code
==113228==    location. If you are running Memcheck and you just saw a
==113228==    warning about a bad jump, it's probably your program's fault.
==113228== 2. The instruction is legitimate but Valgrind doesn't handle it,
==113228==    i.e. it's Valgrind's fault. If you think this is the case or
==113228==    you are not sure, please let us know and we'll try to fix it.
==113228== Either way, Valgrind will now raise a SIGILL signal which will
==113228== probably kill your program.
==113228==
==113228== Process terminating with default action of signal 4 (SIGILL): dumping core
==113228==  Illegal opcode at address 0x5C79318
==113228==    at 0x5C79318: opal_pointer_array_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228==    by 0x5CA4BDB: mca_base_var_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228==    by 0x5C82F11: opal_init_util (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228==    by 0x5157FD9: ompi_mpi_init (ompi_mpi_init.c:428)
==113228==    by 0x50FB3A8: PMPI_Init (pinit.c:69)
==113228==    by 0x4E4BC26: PMPI_Init (libmpiwrap.c:2288)
==113228==    by 0x10893B: main (main.c:6)
slurmstepd: error: *** JOB 159641 ON node01 CANCELLED AT 2021-07-07T10:21:29 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
slurmstepd: error: *** STEP 159641.0 ON node01 CANCELLED AT 2021-07-07T10:22:48 ***

What am I doing wrong?

Regards,
Federico Tesser