From: helder v. <hel...@ho...> - 2022-11-21 12:37:22
Hi all!

Recently I opened a discussion on this mailing list about installing Scipion inside a Singularity container. That worked very well, but now I'm facing a problem that I'm not sure is related to that installation. When I try to run the RELION 3D classification protocol on the GPU, I get the error message below. It seems to be an MPI issue: with only 1 MPI process it works, and, interestingly, the 2D classification, which also calls the "relion_refine_mpi" program, runs without problems, so the issue seems specific to the 3D classification.

Does anyone have any clue as to the possible cause of this problem?

ps: sorry if this is not a Scipion-related issue.

Configuration tested:
- 3 MPI + 1 thread
- GPU: NVIDIA A100
- Ubuntu 20.04
- CUDA 11.7
- gcc 9.4
- mpirun (Open MPI) 4.0.3

Thank you!!
Best,
Helder

* stderr:
00027: [gpu01:3062501] 5 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
00028: [gpu01:3062501] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
00029: [gpu01:3062508] *** Process received signal ***
00030: [gpu01:3062508] Signal: Segmentation fault (11)
00031: [gpu01:3062508] Signal code: Address not mapped (1)
00032: [gpu01:3062508] Failing at address: 0x30
00033: [gpu01:3062508] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fffef157420]
00034: [gpu01:3062508] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0x1a2)[0x7fffecb041c2]
00035: [gpu01:3062508] [ 2] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fffeea71854]
00036: [gpu01:3062508] [ 3] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait_all+0xe5)[0x7fffef43ce25]
00037: [gpu01:3062508] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x4be)[0x7fffef491d4e]
00038: [gpu01:3062508] [ 5] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fffef492061]
00039: [gpu01:3062508] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7fffec0b4dae]
00040: [gpu01:3062508] [ 7] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7fffef454b10]
00041: [gpu01:3062508] [ 8] /opt/software/em/relion-4.0/bin/relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55555565de56]
00042: [gpu01:3062508] [ 9] /opt/software/em/relion-4.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x555555644d48]
00043: [gpu01:3062508] [10] /opt/software/em/relion-4.0/bin/relion_refine_mpi(main+0x71)[0x5555555fcf41]
00044: [gpu01:3062508] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fffeebd5083]
00045: [gpu01:3062508] [12] /opt/software/em/relion-4.0/bin/relion_refine_mpi(_start+0x2e)[0x55555560026e]
00046: [gpu01:3062508] *** End of error message ***

* stdout:
Logging configured. STDOUT --> Runs/003851_ProtRelionClassify3D/logs/run.stdout , STDERR --> Runs/003851_ProtRelionClassify3D/logs/run.stderr
RUNNING PROTOCOL -----------------
Protocol starts
Hostname: gpu01.cnpem.local
PID: 3062463
pyworkflow: 3.0.27
plugin: relion
plugin v: 4.0.11
currentDir: /home/helder.ribeiro/ScipionUserData/projects/scipion_teste
workingDir: Runs/003851_ProtRelionClassify3D
runMode: Restart
MPI: 3
threads: 1
Starting at step: 1
Running steps
STARTED: convertInputStep, step 1, time 2022-11-21 13:12:45.527978
Converting set from 'Runs/003330_ProtImportParticles/particles.sqlite' into 'Runs/003851_ProtRelionClassify3D/input_particles.star'
** Running command:
relion_image_handler --i Runs/002945_ProtImportVolumes/extra/import_output_volume.mrc --o Runs/003851_ProtRelionClassify3D/tmp/import_output_volume.00.mrc --angpix 1.10745 --new_box 220
FINISHED: convertInputStep, step 1, time 2022-11-21 13:12:45.850105
STARTED: runRelionStep, step 2, time 2022-11-21 13:12:45.875931
mpirun -np 3 -bynode `which relion_refine_mpi` --i Runs/003851_ProtRelionClassify3D/input_particles.star --particle_diameter 226 --zero_mask --K 3 --firstiter_cc --ini_high 60.0 --sym c1 --ref_angpix 1.10745 --ref Runs/003851_ProtRelionClassify3D/tmp/import_output_volume.00.mrc --norm --scale --o Runs/003851_ProtRelionClassify3D/extra/relion --oversampling 1 --flatten_solvent --tau2_fudge 4.0 --iter 25 --pad 2 --healpix_order 2 --offset_range 5.0 --offset_step 2.0 --dont_combine_weights_via_disc --pool 3 --gpu --j 1
RELION version: 4.0.0-commit-138b9c
Precision: BASE=double
=== RELION MPI setup ===
 + Number of MPI processes = 3
 + Leader (0) runs on host = gpu01
 + Follower 1 runs on host = gpu01
 + Follower 2 runs on host = gpu01
=================
uniqueHost gpu01 has 2 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 2 mapped to device 0
Device 0 on gpu01 is split between 2 followers
Running CPU instructions in double precision.
Estimating initial noise spectra from 1000 particles
FAILED: runRelionStep, step 2, time 2022-11-21 13:12:52.259808
*** Last status is failed
------------------- PROTOCOL FAILED (DONE 2/3)
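Two low-risk diagnostics follow from the log itself. This is only a sketch, assuming a stock Open MPI 4.x install; the MCA parameter names come from Open MPI's documentation, not from the report above. First, the log suggests setting "orte_base_help_aggregate" to 0 to see all the suppressed warnings. Second, the backtrace fails inside mca_mtl_ofi.so, so excluding the ofi MTL forces Open MPI onto a different transport (on a single node, shared memory), which would show whether the crash is tied to libfabric:

    #!/bin/sh
    # Show every help/error message instead of aggregating them
    # (the log explicitly suggests this parameter):
    export OMPI_MCA_orte_base_help_aggregate=0

    # Exclude the ofi MTL, the component that segfaults in the
    # backtrace; "^" means "everything except" in MCA syntax.
    export OMPI_MCA_mtl='^ofi'

    # Equivalent per-run form (illustrative, do not run as-is):
    #   mpirun --mca mtl ^ofi -np 3 `which relion_refine_mpi` ...

    echo "orte_base_help_aggregate=${OMPI_MCA_orte_base_help_aggregate}"
    echo "mtl=${OMPI_MCA_mtl}"

With these variables exported in the environment that launches Scipion, re-running the same protocol would indicate whether the segfault is specific to the libfabric transport.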