From: Blondel, S. <sbl...@ut...> - 2024-02-29 16:49:41
|
Hi,

I am using a PETSc build with the Kokkos CUDA backend on Summit, but when I run my code with multiple MPI tasks I get the following error. Both ranks abort on the same PAMI assertion and print identical backtraces; their interleaved output is shown de-interleaved below, once, for pid 864557:

0 TS dt 1e-12 time 0.
errno 14 pid 864558
errno 14 pid 864557
xolotl: /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.
[e28n07:864557] *** Process received signal ***
[e28n07:864557] Signal: Aborted (6)
[e28n07:864557] Signal code: (-6)
[e28n07:864557] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[e28n07:864557] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so(gsignal+0xd8)[0x200005d796f8]
[e28n07:864557] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so(abort+0x164)[0x200005d53ff4]
[e28n07:864557] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so(+0x3d280)[0x200005d6d280]
[e28n07:864557] [ 4] /lib64/glibc-hwcaps/power9/libc-2.28.so(__assert_fail+0x64)[0x200005d6d324]
[e28n07:864557] [ 5] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8]
[e28n07:864557] [ 6] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10]
[e28n07:864557] [ 7] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c]
[e28n07:864557] [ 8] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(PAMI_Rget+0x18)[0x20007f2d94a8]
[e28n07:864557] [ 9] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(process_rndv_msg+0x46c)[0x2000a80159ac]
[e28n07:864557] [10] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c]
[e28n07:864557] [11] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac]
[e28n07:864557] [12] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(PAMI_Context_advancev+0x6b0)[0x20007f2da540]
[e28n07:864557] [13] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4]
[e28n07:864557] [14] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_progress+0x6c)[0x20003d60640c]
[e28n07:864557] [15] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_request_default_wait_all+0x144)[0x2000034c4b04]
[e28n07:864557] [16] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_Waitall+0x10c)[0x20000352790c]
[e28n07:864557] [17] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0]
[e28n07:864557] [18] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68]
[e28n07:864557] [19] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214]
[e28n07:864557] [20] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0]
[e28n07:864557] [21] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]
[e28n07:864557] [22] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0]
[e28n07:864557] [23] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0]
[e28n07:864557] [24] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5_+0xc4)[0x200005f710d4]
[e28n07:864557] [25] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x200005f7130c]
[e28n07:864557] [26] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc]
[e28n07:864557] [27] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8]
[e28n07:864557] [28] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0]
[e28n07:864557] [29] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088]
[e28n07:864557] *** End of error message ***

It seems to be pointing to https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/, so I wanted to check whether you had seen this type of error before and whether it could be related to how the code is compiled or run. Let me know if I can provide any additional information.

Best,

Sophie
|
From: Junchao Z. <jun...@gm...> - 2024-02-29 15:50:54
|
Hi Sophie,

PetscSFBcastEnd() was calling MPI_Waitall() to finish the communication in DMGlobalToLocal. I guess you used GPU-aware MPI, and the error you saw might be due to it. You can try without it with the PETSc option -use_gpu_aware_mpi 0. But we generally recommend GPU-aware MPI. You can also try other GPU machines to see if it is just an IBM Spectrum MPI problem.
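For reference, the call path in the trace is the standard PETSc halo-exchange idiom. Below is a minimal generic sketch, not Xolotl's actual rhsFunction; the function and variable names are placeholders:

    #include <petscts.h>

    /* Generic sketch of the pattern in the backtrace (not Xolotl's code):
     * DMGlobalToLocalEnd() finishes the halo exchange through
     * PetscSFBcastEnd(), which calls MPI_Waitall() on the pending requests. */
    static PetscErrorCode RHSFunction(TS ts, PetscReal t, Vec U, Vec F, void *ctx)
    {
      DM  dm;
      Vec localU;

      PetscFunctionBeginUser;
      PetscCall(TSGetDM(ts, &dm));
      PetscCall(DMGetLocalVector(dm, &localU));
      PetscCall(DMGlobalToLocalBegin(dm, U, INSERT_VALUES, localU)); /* posts the messages */
      PetscCall(DMGlobalToLocalEnd(dm, U, INSERT_VALUES, localU));   /* PetscSFBcastEnd -> MPI_Waitall */
      /* ... evaluate F from the ghosted localU ... */
      PetscCall(DMRestoreLocalVector(dm, &localU));
      PetscFunctionReturn(PETSC_SUCCESS);
    }

With GPU-aware MPI, the buffers handed to MPI in that exchange live in device memory, which the PAMI shared-memory (CMA) path in the failed assertion appears to be mishandling.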
Thanks.
--Junchao Zhang

On Thu, Feb 29, 2024 at 9:17 AM Blondel, Sophie via petsc-users <pet...@mc...> wrote:
> Hi, I am using a PETSc build with the Kokkos CUDA backend on Summit but when I run my code with multiple MPI tasks I get the following error: [...]
|
From: Blondel, S. <sbl...@ut...> - 2024-02-29 16:17:05
|
Thank you Junchao,

Yes, I am using gpu-aware MPI.

Is "-use_gpu_aware_mpi 0" a runtime option or a compile option?

Best,

Sophie

________________________________
From: Junchao Zhang <jun...@gm...>
Sent: Thursday, February 29, 2024 10:50
To: Blondel, Sophie <sbl...@ut...>
Cc: xol...@li...; pet...@mc...
Subject: Re: [petsc-users] PAMI error on Summit

Hi Sophie,
PetscSFBcastEnd() was calling MPI_Waitall() to finish the communication in DMGlobalToLocal. [...]
|
From: Matthew K. <kn...@gm...> - 2024-02-29 16:07:22
|
On Thu, Feb 29, 2024 at 11:03 AM Blondel, Sophie via petsc-users <pet...@mc...> wrote:
> Thank you Junchao,
>
> Yes, I am using gpu-aware MPI.
>
> Is "-use_gpu_aware_mpi 0" a runtime option or a compile option?

That is a configure option, so

cd $PETSC_DIR
./${PETSC_ARCH}/lib/petsc/conf/reconfigure-${PETSC_ARCH}.py -use_gpu_aware_mpi 0
make all

Thanks,

Matt

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
|
From: Junchao Z. <jun...@gm...> - 2024-02-29 16:14:00
|
Yes, it is a runtime option. No need to reconfigure petsc. Just add
"-use_gpu_aware_mpi 0" to your test's command line.

--Junchao Zhang

On Thu, Feb 29, 2024 at 10:07 AM Matthew Knepley <kn...@gm...> wrote:

> On Thu, Feb 29, 2024 at 11:03 AM Blondel, Sophie via petsc-users <
> pet...@mc...> wrote:
>
>> Thank you Junchao,
>>
>> Yes, I am using gpu-aware MPI.
>>
>> Is "-use_gpu_aware_mpi 0" a runtime option or a compile option?
>
> That is a configure option, so
>
> cd $PETSC_DIR
> ./${PETSC_ARCH}/lib/petsc/conf/reconfigure-${PETSC_ARCH}.py -use_gpu_aware_mpi 0
> make all
>
> Thanks,
>
> Matt
>
>> Best,
>>
>> Sophie
>> ------------------------------
>> From: Junchao Zhang <jun...@gm...>
>> Sent: Thursday, February 29, 2024 10:50
>> To: Blondel, Sophie <sbl...@ut...>
>> Cc: xol...@li...; pet...@mc...
>> Subject: Re: [petsc-users] PAMI error on Summit
>>
>> Hi Sophie,
>>   PetscSFBcastEnd() was calling MPI_Waitall() to finish the communication
>> in DMGlobalToLocal.
>>   I guess you used gpu-aware MPI. The error you saw might be due to it.
>> You can try without it with a petsc option -use_gpu_aware_mpi 0
>>   But we generally recommend gpu-aware mpi. You can try on other GPU
>> machines to see if it is just an IBM Spectrum MPI problem.
>>
>> Thanks.
>> --Junchao Zhang
>>
>> On Thu, Feb 29, 2024 at 9:17 AM Blondel, Sophie via petsc-users <
>> pet...@mc...> wrote:
>>
>> Hi,
>>
>> I am using PETSc built with the Kokkos CUDA backend on Summit but when I
>> run my code with multiple MPI tasks I get the following error:
>> 0 TS dt 1e-12 time 0.
>> [PAMI assertion failure and stack trace snipped]
>>
>> It seems to be pointing to
>> https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/
>> so I wanted to check if you had seen this type of error before and if it
>> could be related to how the code is compiled or run. Let me know if I can
>> provide any additional information.
>>
>> Best,
>>
>> Sophie
|
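To make the distinction concrete: a runtime option is simply appended to the
launch line of the already-built executable. A minimal sketch of what that
looks like in a Summit-style batch script follows; the jsrun resource flags,
the executable name, and the parameter file are placeholders for
illustration, not taken from this thread:

    # Hypothetical Summit launch: 2 MPI ranks, 1 GPU per rank.
    # Disabling GPU-aware MPI in PETSc needs only the trailing runtime
    # option; no reconfigure or recompile is involved.
    jsrun -n 2 -a 1 -c 1 -g 1 ./xolotl params.txt -use_gpu_aware_mpi 0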
From: Pierre J. <pi...@jo...> - 2024-02-29 16:30:48
|
> On 29 Feb 2024, at 5:06 PM, Matthew Knepley <kn...@gm...> wrote:
>
> On Thu, Feb 29, 2024 at 11:03 AM Blondel, Sophie via petsc-users
> <pet...@mc...> wrote:
>>
>> Thank you Junchao,
>>
>> Yes, I am using gpu-aware MPI.
>>
>> Is "-use_gpu_aware_mpi 0" a runtime option or a compile option?
>
> That is a configure option, so

No, it’s a runtime option, you don’t need to reconfigure, just add it to
your command line arguments.

Thanks,
Pierre

> cd $PETSC_DIR
> ./${PETSC_ARCH}/lib/petsc/conf/reconfigure-${PETSC_ARCH}.py -use_gpu_aware_mpi 0
> make all
>
> Thanks,
>
> Matt
>
> [rest of the quoted thread, including the stack trace, snipped]
|
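Since it is easy to second-guess whether a runtime option was actually
picked up, one lightweight check is PETSc's own options accounting. A
sketch, again with a placeholder launch line; -options_left asks PETSc to
report any option that was set but never queried, and the PETSC_OPTIONS
environment variable is an alternative to editing the command line:

    # Alternative to editing the command line: PETSc also reads options
    # from the PETSC_OPTIONS environment variable.
    export PETSC_OPTIONS="-use_gpu_aware_mpi 0"
    # -options_left makes PETSc print any option that was set but never
    # used, which confirms whether the flag reached the options database.
    jsrun -n 2 -a 1 -c 1 -g 1 ./xolotl params.txt -options_left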
From: Blondel, S. <sbl...@ut...> - 2024-02-29 21:37:44
|
I still get the same error when deactivating GPU-aware MPI.

I also tried unloading spectrum MPI and using openMPI instead (recompiling
everything) and I get a segfault in PETSc in that case (still using
GPU-aware MPI I think, at least not explicitly turning it off):

0 TS dt 1e-12 time 0.
[ERROR] [0]PETSC ERROR:
[ERROR] ------------------------------------------------------------------------
[ERROR] [0]PETSC ERROR:
[ERROR] Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[ERROR] [0]PETSC ERROR:
[ERROR] Try option -start_in_debugger or -on_error_attach_debugger
[ERROR] [0]PETSC ERROR:
[ERROR] or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[ERROR] [0]PETSC ERROR:
[ERROR] or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[ERROR] [0]PETSC ERROR:
[ERROR] configure using --with-debugging=yes, recompile, link, and run
[ERROR] [0]PETSC ERROR:
[ERROR] to get more information on the crash.
[ERROR] [0]PETSC ERROR:
[ERROR] Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------

Best,

Sophie

________________________________
From: Blondel, Sophie via Xolotl-psi-development <xol...@li...>
Sent: Thursday, February 29, 2024 10:17
To: xol...@li...; pet...@mc...
Subject: [Xolotl-psi-development] PAMI error on Summit

[original message and stack trace snipped]
|
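The suggestions in that banner can be followed mechanically. A sketch of the
usual triage sequence, with placeholder launch lines; note that on newer
CUDA toolkits cuda-memcheck has been folded into compute-sanitizer, so
whichever of the two the installed toolkit provides would be used:

    # 1. Rebuild PETSc with debug symbols so the eventual trace is readable.
    cd $PETSC_DIR
    ./${PETSC_ARCH}/lib/petsc/conf/reconfigure-${PETSC_ARCH}.py --with-debugging=yes
    make all

    # 2. Re-run with PETSc's host-side memory checking enabled.
    jsrun -n 2 -a 1 -c 1 -g 1 ./xolotl params.txt -malloc_debug

    # 3. Look for device-side memory errors (cuda-memcheck on older
    #    toolkits, compute-sanitizer on newer ones).
    jsrun -n 2 -a 1 -c 1 -g 1 compute-sanitizer ./xolotl params.txt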
From: Matthew K. <kn...@gm...> - 2024-02-29 21:40:53
|
On Thu, Feb 29, 2024 at 4:22 PM Blondel, Sophie via petsc-users <
pet...@mc...> wrote:

> I still get the same error when deactivating GPU-aware MPI.
>
> I also tried unloading spectrum MPI and using openMPI instead (recompiling
> everything) and I get a segfault in PETSc in that case (still using
> GPU-aware MPI I think, at least not explicitly turning it off):

For this case, can you get a stack trace?

Thanks,

Matt

> [PETSc error banner and quoted thread snipped]

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
|
From: Junchao Z. <jun...@gm...> - 2024-02-29 22:10:00
|
Could you try a petsc example to see if the environment is good? For example,

cd src/ksp/ksp/tutorials
make bench_kspsolve
mpirun -n 6 ./bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi {0 or 1}

BTW, I remember that to use GPU-aware MPI on Summit, one needs to pass --smpiargs "-gpu" to jsrun.

--Junchao Zhang

On Thu, Feb 29, 2024 at 3:22 PM Blondel, Sophie via petsc-users <pet...@mc...> wrote:

> I still get the same error when deactivating GPU-aware MPI.
>
> I also tried unloading Spectrum MPI and using Open MPI instead
> (recompiling everything), and in that case I get a segfault in PETSc
> (still using GPU-aware MPI, I think, since I am not explicitly turning
> it off):
>
> 0 TS dt 1e-12 time 0.
> [ERROR] [0]PETSC ERROR:
> [ERROR] ------------------------------------------------------------------------
> [ERROR] [0]PETSC ERROR:
> [ERROR] Caught signal number 11 SEGV: Segmentation Violation, probably
> memory access out of range
> [ERROR] [0]PETSC ERROR:
> [ERROR] Try option -start_in_debugger or -on_error_attach_debugger
> [ERROR] [0]PETSC ERROR:
> [ERROR] or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [ERROR] [0]PETSC ERROR:
> [ERROR] or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on
> NVIDIA CUDA systems to find memory corruption errors
> [ERROR] [0]PETSC ERROR:
> [ERROR] configure using --with-debugging=yes, recompile, link, and run
> [ERROR] [0]PETSC ERROR:
> [ERROR] to get more information on the crash.
> [ERROR] [0]PETSC ERROR:
> [ERROR] Run with -malloc_debug to check if memory corruption is causing
> the crash.
>
> Best,
>
> Sophie
> ------------------------------
> *From:* Blondel, Sophie via Xolotl-psi-development <xol...@li...>
> *Sent:* Thursday, February 29, 2024 10:17
> *To:* xol...@li...; pet...@mc...
> *Subject:* [Xolotl-psi-development] PAMI error on Summit
>
> [quoted original message and PAMI stack trace trimmed; identical to the
> first message in this thread]
|
From: Blondel, S. <sbl...@ut...> - 2024-03-01 23:07:39
|
I have been using --smpiargs "-gpu".

I tried the benchmark with "jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 0" and it seems to work:

Fri Mar 1 16:27:14 EST 2024
===========================================
Test: KSP performance - Poisson
Input matrix: 27-pt finite difference stencil
-n 100
DoFs = 1000000
Number of nonzeros = 26463592

Step1 - creating Vecs and Mat...
Step2 - running KSPSolve()...
Step3 - calculating error norm...

Error norm: 5.591e-02
KSP iters: 63
KSPSolve: 3.16646 seconds
FOM: 3.158e+05 DoFs/sec
===========================================

------------------------------------------------------------
Sender: LSF System <lsfadmin@batch3>
Subject: Job 3322694: <xolotlTest> in cluster <summit> Done

Job <xolotlTest> was submitted from host <login2> by user <bqo> in cluster <summit> at Fri Mar 1 16:26:58 2024
Job was executed on host(s) <1*batch3>, in queue <debug>, as user <bqo> in cluster <summit> at Fri Mar 1 16:27:00 2024 <42*a35n05>
</ccs/home/bqo> was used as the home directory.
</gpfs/alpine2/mat267/scratch/bqo/test> was used as the working directory.
Started at Fri Mar 1 16:27:00 2024
Terminated at Fri Mar 1 16:27:26 2024
Results reported at Fri Mar 1 16:27:26 2024

The output (if any) is above this job summary.

If I switch to "jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 1" it complains:

Fri Mar 1 16:25:02 EST 2024
===========================================
Test: KSP performance - Poisson
Input matrix: 27-pt finite difference stencil
-n 100
DoFs = 1000000
Number of nonzeros = 26463592

Step1 - creating Vecs and Mat...
[5]PETSC ERROR: PETSc is configured with GPU support, but your MPI is not GPU-aware. For better performance, please use a GPU-aware MPI.
[5]PETSC ERROR: If you do not care, add option -use_gpu_aware_mpi 0. To not see the message again, add the option to your .petscrc, OR add it to the env var PETSC_OPTIONS.
[5]PETSC ERROR: If you do care, for IBM Spectrum MPI on OLCF Summit, you may need jsrun --smpiargs=-gpu.
[5]PETSC ERROR: For Open MPI, you need to configure it --with-cuda (https://www.open-mpi.org/faq/?category=buildcuda)
[5]PETSC ERROR: For MVAPICH2-GDR, you need to set MV2_USE_CUDA=1 (http://mvapich.cse.ohio-state.edu/userguide/gdr/)
[5]PETSC ERROR: For Cray-MPICH, you need to set MPICH_GPU_SUPPORT_ENABLED=1 (man mpi to see manual of cray-mpich)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF
with errorcode 76.

Best,

Sophie
________________________________
From: Junchao Zhang <jun...@gm...>
Sent: Thursday, February 29, 2024 17:09
To: Blondel, Sophie <sbl...@ut...>
Cc: xol...@li... <xol...@li...>; pet...@mc... <pet...@mc...>
Subject: Re: [petsc-users] PAMI error on Summit

> [quoted reply and earlier messages trimmed; identical to the messages
> above]
|
From: Junchao Z. <jun...@gm...> - 2024-03-01 21:58:27
|
It is weird. With

jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 1

petsc tried to test whether the MPI is GPU-aware (by doing an MPI_Allreduce on device buffers). It tried, found it was not, and so threw out the complaint in the error message. From https://docs.olcf.ornl.gov/systems/summit_user_guide.html#cuda-aware-mpi, I think your flags were right.

I just got my Summit account reactivated today. I will give it a try.

--Junchao Zhang

On Fri, Mar 1, 2024 at 3:32 PM Blondel, Sophie <sbl...@ut...> wrote:

> [quoted benchmark results and earlier messages trimmed; identical to the
> message above]
|
From: Junchao Z. <jun...@gm...> - 2024-03-04 17:11:12
|
Hi, Sophie,
  I tried various modules and compilers on Summit and failed to find one that works with GPU-aware MPI. The one that could build petsc and kokkos was "module load cuda/11.7.1 gcc/9.3.0-compiler_only spectrum-mpi essl netlib-lapack", but it only worked with "-use_gpu_aware_mpi 0". Without that option, I saw the code crash. From what I can see, the GPU-aware MPI on Summit is in an unusable and unmaintained state.
--Junchao Zhang

On Fri, Mar 1, 2024 at 3:58 PM Junchao Zhang <jun...@gm...> wrote:
> It is weird: with
>
> jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 1
>
> petsc tried to test whether the MPI is GPU-aware (by doing an MPI_Allreduce on device buffers). It tried, found it was not, and so printed the complaint in the error message.
>
> From https://docs.olcf.ornl.gov/systems/summit_user_guide.html#cuda-aware-mpi, I think your flags were right.
>
> I just got my Summit account reactivated today. I will give it a try.
>
> --Junchao Zhang
>
> On Fri, Mar 1, 2024 at 3:32 PM Blondel, Sophie <sbl...@ut...> wrote:
>> I have been using --smpiargs "-gpu".
>>
>> I tried the benchmark with "jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 0" and it seems to work:
>> Fri Mar 1 16:27:14 EST 2024
>> ===========================================
>> Test: KSP performance - Poisson
>> Input matrix: 27-pt finite difference stencil
>> -n 100
>> DoFs = 1000000
>> Number of nonzeros = 26463592
>>
>> Step1 - creating Vecs and Mat...
>> Step2 - running KSPSolve()...
>> Step3 - calculating error norm...
>>
>> Error norm: 5.591e-02
>> KSP iters: 63
>> KSPSolve: 3.16646 seconds
>> FOM: 3.158e+05 DoFs/sec
>> ===========================================
>>
>> ------------------------------------------------------------
>> Sender: LSF System <lsfadmin@batch3>
>> Subject: Job 3322694: <xolotlTest> in cluster <summit> Done
>>
>> Job <xolotlTest> was submitted from host <login2> by user <bqo> in cluster <summit> at Fri Mar 1 16:26:58 2024
>> Job was executed on host(s) <1*batch3>, in queue <debug>, as user <bqo> in cluster <summit> at Fri Mar 1 16:27:00 2024 <42*a35n05>
>> </ccs/home/bqo> was used as the home directory.
>> </gpfs/alpine2/mat267/scratch/bqo/test> was used as the working directory.
>> Started at Fri Mar 1 16:27:00 2024
>> Terminated at Fri Mar 1 16:27:26 2024
>> Results reported at Fri Mar 1 16:27:26 2024
>>
>> The output (if any) is above this job summary.
>>
>> If I switch to "jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 1" it complains:
>> Fri Mar 1 16:25:02 EST 2024
>> ===========================================
>> Test: KSP performance - Poisson
>> Input matrix: 27-pt finite difference stencil
>> -n 100
>> DoFs = 1000000
>> Number of nonzeros = 26463592
>>
>> Step1 - creating Vecs and Mat...
>> [5]PETSC ERROR: PETSc is configured with GPU support, but your MPI is not GPU-aware. For better performance, please use a GPU-aware MPI.
>> [5]PETSC ERROR: If you do not care, add option -use_gpu_aware_mpi 0. To not see the message again, add the option to your .petscrc, OR add it to the env var PETSC_OPTIONS.
>> [5]PETSC ERROR: If you do care, for IBM Spectrum MPI on OLCF Summit, you may need jsrun --smpiargs=-gpu.
>> [5]PETSC ERROR: For Open MPI, you need to configure it --with-cuda (https://www.open-mpi.org/faq/?category=buildcuda)
>> [5]PETSC ERROR: For MVAPICH2-GDR, you need to set MV2_USE_CUDA=1 (http://mvapich.cse.ohio-state.edu/userguide/gdr/)
>> [5]PETSC ERROR: For Cray-MPICH, you need to set MPICH_GPU_SUPPORT_ENABLED=1 (man mpi to see manual of cray-mpich)
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF
>> with errorcode 76.
>>
>> Best,
>> Sophie
>> ------------------------------
>> *From:* Junchao Zhang <jun...@gm...>
>> *Sent:* Thursday, February 29, 2024 17:09
>> *To:* Blondel, Sophie <sbl...@ut...>
>> *Cc:* xol...@li... <xol...@li...>; pet...@mc... <pet...@mc...>
>> *Subject:* Re: [petsc-users] PAMI error on Summit
>>
>> Could you try a petsc example to see if the environment is good? For example,
>>
>> cd src/ksp/ksp/tutorials
>> make bench_kspsolve
>> mpirun -n 6 ./bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi {0 or 1}
>>
>> BTW, I remember that to use GPU-aware MPI on Summit, one needs to pass --smpiargs "-gpu" to jsrun.
>>
>> --Junchao Zhang
>>
>> On Thu, Feb 29, 2024 at 3:22 PM Blondel, Sophie via petsc-users <pet...@mc...> wrote:
>>
>> I still get the same error when deactivating GPU-aware MPI.
>>
>> I also tried unloading Spectrum MPI and using Open MPI instead (recompiling everything), and in that case I get a segfault in PETSc (still using GPU-aware MPI I think, at least not explicitly turning it off):
>>
>> 0 TS dt 1e-12 time 0.
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] ------------------------------------------------------------------------
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] Try option -start_in_debugger or -on_error_attach_debugger
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] configure using --with-debugging=yes, recompile, link, and run
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] to get more information on the crash.
>> [ERROR] [0]PETSC ERROR:
>> [ERROR] Run with -malloc_debug to check if memory corruption is causing the crash.
>> --------------------------------------------------------------------------
>>
>> Best,
>> Sophie
>> ------------------------------
>> *From:* Blondel, Sophie via Xolotl-psi-development <xol...@li...>
>> *Sent:* Thursday, February 29, 2024 10:17
>> *To:* xol...@li... <xol...@li...>; pet...@mc... <pet...@mc...>
>> *Subject:* [Xolotl-psi-development] PAMI error on Summit
>>
>> [quoted original report and PAMI assertion/stack trace elided; identical to the first message in this thread]
>>
>> It seems to be pointing to https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/ so I wanted to check if you had seen this type of error before and if it could be related to how the code is compiled or run. Let me know if I can provide any additional information.
>>
>> Best,
>> Sophie
|
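[The probe Junchao describes above (an MPI_Allreduce on device buffers) can be reproduced as a standalone check. The sketch below is not PETSc's actual test code, only a minimal illustration under the assumption of a CUDA toolchain and an MPI C compiler wrapper; the file name is hypothetical. Under a GPU-aware MPI it completes, while a non-GPU-aware build typically crashes when it dereferences the device pointer on the host, much like the failures reported in this thread.]

    /* gpu_aware_probe.c (illustrative): reduce a device buffer to probe
       whether the MPI library is GPU-aware. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int *dbuf = NULL;

      MPI_Init(&argc, &argv);
      /* Allocate the reduction buffer on the GPU, not the host. */
      cudaMalloc((void **)&dbuf, sizeof(int));
      cudaMemset(dbuf, 0, sizeof(int));
      /* A GPU-aware MPI can read/write device memory directly; a
         non-GPU-aware one treats dbuf as a host pointer and fails. */
      MPI_Allreduce(MPI_IN_PLACE, dbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      printf("Allreduce on a device buffer completed\n");
      cudaFree(dbuf);
      MPI_Finalize();
      return 0;
    }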
From: Blondel, S. <sbl...@ut...> - 2024-03-04 19:49:18
|
Thank you Junchao for looking into it. I managed to build a previous version of Xolotl that uses Kokkos (but with the default PETSc, without Kokkos) so that we have at least partial GPU support.

Best,
Sophie
________________________________
From: Junchao Zhang <jun...@gm...>
Sent: Monday, March 4, 2024 12:10
To: Blondel, Sophie <sbl...@ut...>
Cc: xol...@li... <xol...@li...>; pet...@mc... <pet...@mc...>
Subject: Re: [petsc-users] PAMI error on Summit

[quoted thread history elided; identical to the messages above]
If I switch to "jsrun --smpiargs "-gpu" -n 6 -a 1 -c 1 -g 1 /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos/src/ksp/ksp/tutorials/bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi 1" it complains: Fri Mar 1 16:25:02 EST 2024 =========================================== Test: KSP performance - Poisson Input matrix: 27-pt finite difference stencil -n 100 DoFs = 1000000 Number of nonzeros = 26463592 Step1 - creating Vecs and Mat... [5]PETSC ERROR: PETSc is configured with GPU support, but your MPI is not GPU-aware. For better performance, please use a GPU-aware MPI. [5]PETSC ERROR: If you do not care, add option -use_gpu_aware_mpi 0. To not see the message again, add the option to your .petscrc, OR add it to the env var PETSC_OPTIONS. [5]PETSC ERROR: If you do care, for IBM Spectrum MPI on OLCF Summit, you may need jsrun --smpiargs=-gpu. [5]PETSC ERROR: For Open MPI, you need to configure it --with-cuda (https://www.open-mpi.org/faq/?category=buildcuda) [5]PETSC ERROR: For MVAPICH2-GDR, you need to set MV2_USE_CUDA=1 (http://mvapich.cse.ohio-state.edu/userguide/gdr/) [5]PETSC ERROR: For Cray-MPICH, you need to set MPICH_GPU_SUPPORT_ENABLED=1 (man mpi to see manual of cray-mpich) -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF with errorcode 76. Best, Sophie ________________________________ From: Junchao Zhang <jun...@gm...<mailto:jun...@gm...>> Sent: Thursday, February 29, 2024 17:09 To: Blondel, Sophie <sbl...@ut...<mailto:sbl...@ut...>> Cc: xol...@li...<mailto:xol...@li...> <xol...@li...<mailto:xol...@li...>>; pet...@mc...<mailto:pet...@mc...> <pet...@mc...<mailto:pet...@mc...>> Subject: Re: [petsc-users] PAMI error on Summit You don't often get email from jun...@gm...<mailto:jun...@gm...>. Learn why this is important<https://aka.ms/LearnAboutSenderIdentification> Could you try a petsc example to see if the environment is good? For example, cd src/ksp/ksp/tutorials make bench_kspsolve mpirun -n 6 ./bench_kspsolve -mat_type aijkokkos -use_gpu_aware_mpi {0 or 1} BTW, I remember to use gpu-aware mpi on Summit, one needs to pass --smpiargs "-gpu" to jsrun --Junchao Zhang On Thu, Feb 29, 2024 at 3:22 PM Blondel, Sophie via petsc-users <pet...@mc...<mailto:pet...@mc...>> wrote: I still get the same error when deactivating GPU-aware MPI. I also tried unloading spectrum MPI and using openMPI instead (recompiling everything) and I get a segfault in PETSc in that case (still using GPU-aware MPI I think, at least not explicitly ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd I still get the same error when deactivating GPU-aware MPI. I also tried unloading spectrum MPI and using openMPI instead (recompiling everything) and I get a segfault in PETSc in that case (still using GPU-aware MPI I think, at least not explicitly turning it off): 0 TS dt 1e-12 time 0. 
[ERROR] [0]PETSC ERROR: [ERROR] ------------------------------------------------------------------------ [ERROR] [0]PETSC ERROR: [ERROR] Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range [ERROR] [0]PETSC ERROR: [ERROR] Try option -start_in_debugger or -on_error_attach_debugger [ERROR] [0]PETSC ERROR: [ERROR] or see https://petsc.org/release/faq/#valgrind<https://urldefense.us/v3/__https://urldefense.us/v2/url?u=https-3A__petsc.org_release_faq_-23valgrind&d=DwQGaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=SNsmM8pc4pmx4j-bqFq40w&m=1GLMwF9jewRd8MBil83VSwu-tVEn7Tkm_YfSAcgEMsZ9hDb2HvlnscmeqXsnzv5S&s=Loebf9sk4dgXGOOKPK3IHxp-C5SjGtr7Svr49LwaM4E&e=__;!!G_uCfscf7eWS!bhpq7UF4Rq9PhMMRRb_zeSflUb9Cs5My48ggt02OxSWxoM4eIU_MDt3H6e2YnrxJizIsA21q76YdORVhI0jsXekj$> and https://petsc.org/release/faq/<https://urldefense.us/v3/__https://urldefense.us/v2/url?u=https-3A__petsc.org_release_faq_&d=DwQGaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=SNsmM8pc4pmx4j-bqFq40w&m=1GLMwF9jewRd8MBil83VSwu-tVEn7Tkm_YfSAcgEMsZ9hDb2HvlnscmeqXsnzv5S&s=7e9oLVYLacda_1-8rSkzDEHL4Zy1BFnO4pnrfMNlgO4&e=__;!!G_uCfscf7eWS!bhpq7UF4Rq9PhMMRRb_zeSflUb9Cs5My48ggt02OxSWxoM4eIU_MDt3H6e2YnrxJizIsA21q76YdORVhI74qqyaL$> [ERROR] [0]PETSC ERROR: [ERROR] or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html<https://urldefense.us/v3/__https://urldefense.us/v2/url?u=https-3A__docs.nvidia.com_cuda_cuda-2Dmemcheck_index.html&d=DwQGaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=SNsmM8pc4pmx4j-bqFq40w&m=1GLMwF9jewRd8MBil83VSwu-tVEn7Tkm_YfSAcgEMsZ9hDb2HvlnscmeqXsnzv5S&s=2gHentsiEM2njpPim4k40mYA96k7v_ivjI3erSECebM&e=__;!!G_uCfscf7eWS!bhpq7UF4Rq9PhMMRRb_zeSflUb9Cs5My48ggt02OxSWxoM4eIU_MDt3H6e2YnrxJizIsA21q76YdORVhI3YGCBJ5$> on NVIDIA CUDA systems to find memory corruption errors [ERROR] [0]PETSC ERROR: [ERROR] configure using --with-debugging=yes, recompile, link, and run [ERROR] [0]PETSC ERROR: [ERROR] to get more information on the crash. [ERROR] [0]PETSC ERROR: [ERROR] Run with -malloc_debug to check if memory corruption is causing the crash. -------------------------------------------------------------------------- Best, Sophie ________________________________ From: Blondel, Sophie via Xolotl-psi-development <xol...@li...<mailto:xol...@li...>> Sent: Thursday, February 29, 2024 10:17 To: xol...@li...<mailto:xol...@li...> <xol...@li...<mailto:xol...@li...>>; pet...@mc...<mailto:pet...@mc...> <pet...@mc...<mailto:pet...@mc...>> Subject: [Xolotl-psi-development] PAMI error on Summit Hi, I am using PETSc build with the Kokkos CUDA backend on Summit but when I run my code with multiple MPI tasks I get the following error: 0 TS dt 1e-12 time 0. errno 14 pid 864558 xolotl: /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Dev ice::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed. errno 14 pid 864557 xolotl: /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Dev ice::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed. 
[e28n07:864557] *** Process received signal *** [e28n07:864557] Signal: Aborted (6) [e28n07:864557] Signal code: (-6) [e28n07:864557] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8] [e28n07:864557] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(gsignal+0xd8)[0x200005d796f8] [e28n07:864557] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(abort+0x164)[0x200005d53ff4] [e28n07:864557] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(+0x3d280)[0x200005d6d280] [e28n07:864557] [ 4] [e28n07:864558] *** Process received signal *** [e28n07:864558] Signal: Aborted (6) [e28n07:864558] Signal code: (-6) [e28n07:864558] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8] [e28n07:864558] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(gsignal+0xd8)[0x200005d796f8] [e28n07:864558] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(abort+0x164)[0x200005d53ff4] [e28n07:864558] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(+0x3d280)[0x200005d6d280] [e28n07:864558] [ 4] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(__assert_fail+0x64)[0x200005d6d324] [e28n07:864557] [ 5] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAt omicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8] [e28n07:864557] [ 6] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10] [e28n07:864557] [ 7] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c] [e28n07:864557] [ 8] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (PAMI_Rget+0x18)[0x20007f2d94a8] [e28n07:864557] [ 9] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_p ami.so(process_rndv_msg+0x46c)[0x2000a80159ac] [e28n07:864557] [10] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_p ami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c] [e28n07:864557] [11] /lib64/glibc-hwcaps/power9/libc-2.28.so<http://libc-2.28.so>(__assert_fail+0x64)[0x200005d6d324] [e28n07:864558] [ 5] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAt omicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8] [e28n07:864558] [ 6] 
/sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10] [e28n07:864558] [ 7] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c] [e28n07:864558] [ 8] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (PAMI_Rget+0x18)[0x20007f2d94a8] [e28n07:864558] [ 9] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_p ami.so(process_rndv_msg+0x46c)[0x2000a80159ac] [e28n07:864558] [10] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_p ami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c] [e28n07:864558] [11] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic 12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac] [e28n07:864557] [12] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (PAMI_Context_advancev+0x6b0)[0x20007f2da540] [e28n07:864557] [13] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_p ami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4] [e28n07:864557] [14] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_ progress+0x6c)[0x20003d60640c] [e28n07:864557] [15] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_r equest_default_wait_all+0x144)[0x2000034c4b04] [e28n07:864557] [16] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_W aitall+0x10c)[0x20000352790c] [e28n07:864557] [17] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic 12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac] [e28n07:864558] [12] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3 (PAMI_Context_advancev+0x6b0)[0x20007f2da540] [e28n07:864558] [13] 
/sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_p ami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4] [e28n07:864558] [14] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_ progress+0x6c)[0x20003d60640c] [e28n07:864558] [15] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_r equest_default_wait_all+0x144)[0x2000034c4b04] [e28n07:864558] [16] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_W aitall+0x10c)[0x20000352790c] [e28n07:864558] [17] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0] [e28n07:864557] [18] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0] [e28n07:864558] [18] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68] [e28n07:864557] [19] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68] [e28n07:864558] [19] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214] [e28n07:864557] [20] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214] [e28n07:864558] [20] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0] [e28n07:864557] [21] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0] [e28n07:864558] [21] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038] [e28n07:864558] [22] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038] [e28n07:864557] [22] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0] [e28n07:864558] [23] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0] [e28n07:864557] [23] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0] [e28n07:864557] [24] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5 _+0xc4)[0x200005f710d4] [e28n07:864557] [25] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x2 00005f7130c] [e28n07:864557] [26] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0] [e28n07:864558] [24] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5 _+0xc4)[0x200005f710d4] [e28n07:864558] [25] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x2 00005f7130c] 
[e28n07:864558] [26] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc] [e28n07:864557] [27] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc] [e28n07:864558] [27] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8] [e28n07:864557] [28] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8] [e28n07:864558] [28] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0] [e28n07:864557] [29] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0] [e28n07:864558] [29] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088] [e28n07:864557] *** End of error message *** /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088] [e28n07:864558] *** End of error message *** It seems to be pointing to https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/<https://urldefense.us/v3/__https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/__;!!G_uCfscf7eWS!bhpq7UF4Rq9PhMMRRb_zeSflUb9Cs5My48ggt02OxSWxoM4eIU_MDt3H6e2YnrxJizIsA21q76YdORVhI30Ylvr6$> so I wanted to check if you had seen this type of error before and if it could be related to how the code is compiled or run. Let me know if I can provide any additional information. Best, Sophie |
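[The thread's working resolution is -use_gpu_aware_mpi 0. Besides the command line, the .petscrc file, or the PETSC_OPTIONS environment variable that the PETSc error message lists, an application can also hard-wire the option. This is a minimal sketch, assuming a PETSc release in which PetscOptionsSetValue() may be called before PetscInitialize(), as recent manual pages describe; the file name and omitted solver code are placeholders.]

    /* main.c (illustrative): bake in -use_gpu_aware_mpi 0 so every run
       gets the workaround without extra command-line flags. */
    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      /* Set the option before PetscInitialize() so it is seen when PETSc
         configures its device and MPI support. */
      PetscErrorCode ierr = PetscOptionsSetValue(NULL, "-use_gpu_aware_mpi", "0");
      if (ierr) return (int)ierr;
      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      /* ... usual TS/KSP setup and solve ... */
      PetscCall(PetscFinalize());
      return 0;
    }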