|
From: Patrick B. <Pat...@le...> - 2020-05-18 16:06:09
|
Hi,

I'm new to Valgrind. My goal is to investigate a possible memory problem in a large parallel MPI+OpenMP code.

I've cloned Valgrind from git and built it with GCC 7.3 and Open MPI 3.1 for mpicc (my application is built with the same environment). I'm using these 2 configure options:

--enable-only64bit
--with-mpicc=$(which mpicc)

"mpirun -np 8 my_application" works on my fat node (only a few processes for the test; it uses nearly 60 GB of the node's more than 1 TB of RAM). It fails after a few dozen iterations.

"mpirun -np 8 valgrind /bin/hostname" works too, so Valgrind seems to work with MPI 3.1 compiled with GCC 7.3.

But "mpirun -np 8 valgrind ./my_application" fails immediately with:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5 0x25 0xA8 0x18 0x0
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==377969== valgrind: Unrecognised instruction at address 0xabf9581.
==377969== at 0xABF9581: opal_pointer_array_construct (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
==377969== by 0xAC1BA78: mca_base_var_init (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
==377969== by 0xABFDE39: opal_init_util (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
==377969== by 0x911AD60: ompi_mpi_init (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libmpi.so.40.10.3)
==377969== by 0x914BB34: PMPI_Init_thread (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libmpi.so.40.10.3)
==377969== by 0x8E97C1F: MPI_INIT_THREAD (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libmpi_mpifh.so.40.11.2)
==377969== by 0x543066: __mpi_m_MOD_init_mpi (mpi_m.f90:140)
==377969== by 0x411447: __yales2_m_MOD_init_yales2_env (yales2_m.f90:511)
==377969== by 0x411595: __yales2_m_MOD_run_yales2 (yales2_m.f90:378)
==377969== by 0x40B9E0: MAIN__ (3D_cylinder.f90:20)
==377969== by 0x40B9E0: main (3D_cylinder.f90:8)
==377969== Your program just tried to execute an instruction that Valgrind
==377969== did not recognise. There are two possible reasons for this.
==377969== 1. Your program has a bug and erroneously jumped to a non-code
==377969== location. If you are running Memcheck and you just saw a
==377969== warning about a bad jump, it's probably your program's fault.
==377969== 2. The instruction is legitimate but Valgrind doesn't handle it,
==377969== i.e. it's Valgrind's fault. If you think this is the case or
==377969== you are not sure, please let us know and we'll try to fix it.
==377969== Either way, Valgrind will now raise a SIGILL signal which will
==377969== probably kill your program.

Maybe I've missed something? I'm using the master branch. The VALGRIND_3_16_BRANCH branch, which I also tested, does not build:

make: *** No rule to make target 'exp-sgcheck.supp', needed by 'default.supp'. Stop.

Thanks for your help,

Patrick
|
|
From: Julian S. <js...@ac...> - 2020-05-18 17:48:40
|
> Program received signal SIGILL: Illegal instruction.
>
> Backtrace for this error:
> vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5 0x25 0xA8 0x18 0x0
> vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
> vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
> vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
> ==377969== valgrind: Unrecognised instruction at address 0xabf9581.
> ==377969== at 0xABF9581: opal_pointer_array_construct (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)

It sounds like there's an instruction in libopen-pal.so.40.10.3 that Valgrind doesn't like. What CPU does the machine have?

J
|
|
From: Tom H. <to...@co...> - 2020-05-18 18:36:52
|
On 18/05/2020 18:48, Julian Seward wrote:
>
>> Program received signal SIGILL: Illegal instruction.
>>
>> Backtrace for this error:
>> vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5 0x25 0xA8 0x18 0x0
>> vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
>> vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
>> vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
>> ==377969== valgrind: Unrecognised instruction at address 0xabf9581.
>> ==377969== at 0xABF9581: opal_pointer_array_construct (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
>
> It sounds like there's an instruction in libopen-pal.so.40.10.3 that
> Valgrind doesn't like. What CPU does the machine have?

0x62 is an EVEX prefix from the AVX512 extensions, so the instruction isn't supported by Valgrind yet.

Tom

--
Tom Hughes (to...@co...)
http://compton.nu/
|
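For anyone who wants to check the EVEX diagnosis by hand, here is a minimal Python sketch that splits the reported instruction bytes into their EVEX fields. The function name decode_evex and the lookup tables are hypothetical, written for this illustration only; the bit layout follows my reading of the EVEX encoding in the Intel manuals and covers only the fields needed here.

```python
# Hypothetical helper: decode the main fields of an EVEX-prefixed instruction.
# In 64-bit mode a leading 0x62 byte introduces the EVEX prefix (P0, P1, P2).
PP_PREFIX = {0: "none", 1: "66", 2: "F3", 3: "F2"}
OPCODE_MAP = {1: "0F", 2: "0F38", 3: "0F3A"}
VECTOR_BITS = {0: 128, 1: 256, 2: 512}


def decode_evex(raw):
    """Return a dict with the EVEX fields of the instruction in `raw`."""
    if len(raw) < 6 or raw[0] != 0x62:
        raise ValueError("expected an EVEX-prefixed instruction (first byte 0x62)")
    p0, p1, p2, opcode, modrm = raw[1], raw[2], raw[3], raw[4], raw[5]
    fields = {
        "map": OPCODE_MAP.get(p0 & 0x07, "unknown"),   # opcode map select
        "W": (p1 >> 7) & 1,                            # operand-size bit
        "pp": PP_PREFIX[p1 & 0x03],                    # implied legacy prefix
        "vector_bits": VECTOR_BITS.get((p2 >> 5) & 0x03, "reserved"),
        "mask_reg": p2 & 0x07,                         # k0 means "no masking"
        "opcode": opcode,
    }
    # ModRM with mod=00 and rm=101 is RIP-relative with a 32-bit displacement.
    if (modrm >> 6) == 0 and (modrm & 0x07) == 5 and len(raw) >= 10:
        fields["rip_disp"] = int.from_bytes(raw[6:10], "little", signed=True)
    return fields


# Bytes copied from the "unhandled instruction bytes" line in the report.
INSN = bytes([0x62, 0xF1, 0xFD, 0x08, 0x6F, 0x05, 0x25, 0xA8, 0x18, 0x00])
info = decode_evex(INSN)
print(info)
```

The decoded fields (map 0F, prefix 66, W=1, opcode 0x6F, 128-bit vector length) match what the Intel manual lists as VMOVDQA64 with an xmm destination, if I read the encoding tables correctly. That is an AVX-512 (AVX512VL) load from a RIP-relative address, which fits Tom's explanation.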
|
From: Patrick B. <Pat...@le...> - 2020-05-18 19:11:27
|
On 18/05/2020 at 19:48, Julian Seward wrote:
>
>> Program received signal SIGILL: Illegal instruction.
>>
>> Backtrace for this error:
>> vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5 0x25 0xA8 0x18 0x0
>> vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
>> vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
>> vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
>> ==377969== valgrind: Unrecognised instruction at address 0xabf9581.
>> ==377969== at 0xABF9581: opal_pointer_array_construct (in /opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
>
> It sounds like there's an instruction in libopen-pal.so.40.10.3 that
> Valgrind doesn't like. What CPU does the machine have?
>
> J

Hi Julian,

This machine is a fat node with 4 Intel Xeon Gold 6148 CPUs (20 cores each):

vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
stepping : 4
microcode : 0x2000064
cpu MHz : 2400.000
cache size : 28160 KB
physical id : 3
siblings : 40
core id : 26
cpu cores : 20
apicid : 245
initial apicid : 245
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
bogomips : 4806.68
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
|
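The flags line above is long, so as a quick way to pick out the AVX-512 entries, here is a small Python sketch. The function name avx512_flags is hypothetical, and the sample string is a shortened copy of the flags line from this message, not the full /proc/cpuinfo output.

```python
# Hypothetical helper: list the AVX-512 feature flags found in cpuinfo text.
def avx512_flags(cpuinfo_text):
    """Return the sorted avx512* flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []


# Shortened sample taken from the flags line in this message.
SAMPLE = (
    "model name : Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz\n"
    "flags : fpu sse2 avx avx2 avx512f avx512dq rdseed adx "
    "avx512cd avx512bw avx512vl xsaveopt\n"
)
found = avx512_flags(SAMPLE)
print(found)
```

On a live Linux machine the same function could be given the contents of /proc/cpuinfo instead of the sample string. A non-empty result means the CPU advertises AVX-512, so libraries built to target it may contain EVEX-encoded instructions like the one reported above.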