From: Madan B K <mad...@wm...> - 2025-11-01 10:59:47
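I am encountering a reproducible crash in my simulation when running with the Kokkos GPU package enabled. The run aborts on step 1001, immediately after step 1000 completes. The last lines of the stats output, followed by the error messages, are: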
991 38.271389 47.475665 9.91e-05 1e-07 9001317 1030184 943593 502982
992 38.31183 47.516105 9.92e-05 1e-07 9001317 1031398 946377 503695
993 38.352661 47.556936 9.93e-05 1e-07 9001317 1032604 948972 504611
994 38.392964 47.597239 9.94e-05 1e-07 9001317 1033818 952089 506258
995 38.433504 47.63778 9.95e-05 1e-07 9001317 1035050 954883 507869
996 38.473899 47.678175 9.96e-05 1e-07 9001317 1036243 957689 509319
997 38.514491 47.718766 9.97e-05 1e-07 9001317 1037443 960349 510675
998 38.555133 47.759408 9.98e-05 1e-07 9001317 1038632 963126 511338
999 38.595603 47.799878 9.99e-05 1e-07 9001317 1039805 965803 513788
1000 67.705565 76.909841 0.0001 1e-07 9008016 1040996 968259 514734
ERROR on proc 3: Particle being sent to self proc on step 1001 (../update_kokkos.cpp:707)
ERROR on proc 1: Particle being sent to self proc on step 1001 (../update_kokkos.cpp:707)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
Proc: [[29498,1],3]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[nch-000:00000] *** An error occurred in Socket closed
[nch-000:00000] *** reported by process [1933180929,0]
[nch-000:00000] *** on a NULL communicator
[nch-000:00000] *** Unknown error
[nch-000:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nch-000:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 3 with PID 275509 on node nch-003 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
Regards,
Madan B K