From: Grigory S. <sha...@gm...> - 2022-11-27 15:02:59
Hi,

Does the test "scipion3 tests relion.tests.test_protocols_3d.TestRelionInitialModel" fail as well? Have you verified that your MPI works (without scipion or relion)?

Best regards,
Grigory

--------------------------------------------------------------------------------
Grigory Sharov, Ph.D.
MRC Laboratory of Molecular Biology,
Francis Crick Avenue,
Cambridge Biomedical Campus,
Cambridge CB2 0QH, UK.
tel. +44 (0) 1223 267228
e-mail: gs...@mr...
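For reference, a standalone MPI check of the kind suggested above could look like the sketch below. It is illustrative only: the file name mpi_check.c is arbitrary, and it assumes the Open MPI compiler wrapper mpicc and the same mpirun 4.0.3 reported further down are on the PATH. The test deliberately exercises a broadcast across 3 ranks, since the crash in the quoted log happens inside MPI_Bcast.

# Write, build and run a minimal MPI program that performs a broadcast
# across 3 ranks, mirroring the 3 MPI + 1 thread configuration below.
cat > mpi_check.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        value = 42;
    /* Collective call; the Relion stack trace below fails inside MPI_Bcast. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d received %d\n", rank, size, value);
    MPI_Finalize();
    return 0;
}
EOF

mpicc mpi_check.c -o mpi_check
mpirun -np 3 ./mpi_check

If this small program also segfaults with 3 ranks, the problem lies in the MPI installation or its network transport rather than in Scipion or Relion.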
On Mon, Nov 21, 2022 at 12:37 PM helder veras <hel...@ho...> wrote:

> Hi all!
>
> Recently, I opened a discussion in this mailing list regarding the
> installation of Scipion inside a singularity container, which worked very
> well. Now I'm facing a problem that I'm not sure is related to that
> installation.
> I'm trying to run the relion 3D classification protocol on GPU, but I
> received the following error message. (It seems to be an MPI issue, and
> if I use only 1 MPI process it works. Interestingly, the 2D
> classification, which also calls the "relion_refine_mpi" program, runs
> without problems, so the issue seems specific to 3D classification.)
>
> Does anyone have any clues as to the possible cause of this problem?
>
> ps: sorry if this is not a scipion-related issue.
>
> Configuration tested:
> 3 MPI + 1 thread
> GPU nvidia A100
> UBUNTU 20.04
> cuda-11.7
> gcc version 9.4
> mpirun version 4.0.3
>
> Thank you!!
>
> Best,
>
> Helder
>
> - stderr:
>
> 00027: [gpu01:3062501] 5 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
> 00028: [gpu01:3062501] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> 00029: [gpu01:3062508] *** Process received signal ***
> 00030: [gpu01:3062508] Signal: Segmentation fault (11)
> 00031: [gpu01:3062508] Signal code: Address not mapped (1)
> 00032: [gpu01:3062508] Failing at address: 0x30
> 00033: [gpu01:3062508] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fffef157420]
> 00034: [gpu01:3062508] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0x1a2)[0x7fffecb041c2]
> 00035: [gpu01:3062508] [ 2] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fffeea71854]
> 00036: [gpu01:3062508] [ 3] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait_all+0xe5)[0x7fffef43ce25]
> 00037: [gpu01:3062508] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x4be)[0x7fffef491d4e]
> 00038: [gpu01:3062508] [ 5] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fffef492061]
> 00039: [gpu01:3062508] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7fffec0b4dae]
> 00040: [gpu01:3062508] [ 7] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7fffef454b10]
> 00041: [gpu01:3062508] [ 8] /opt/software/em/relion-4.0/bin/relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55555565de56]
> 00042: [gpu01:3062508] [ 9] /opt/software/em/relion-4.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x555555644d48]
> 00043: [gpu01:3062508] [10] /opt/software/em/relion-4.0/bin/relion_refine_mpi(main+0x71)[0x5555555fcf41]
> 00044: [gpu01:3062508] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fffeebd5083]
> 00045: [gpu01:3062508] [12] /opt/software/em/relion-4.0/bin/relion_refine_mpi(_start+0x2e)[0x55555560026e]
> 00046: [gpu01:3062508] *** End of error message ***
>
> - stdout:
>
> Logging configured. STDOUT --> Runs/003851_ProtRelionClassify3D/logs/run.stdout , STDERR --> Runs/003851_ProtRelionClassify3D/logs/run.stderr
> RUNNING PROTOCOL -----------------
> Protocol starts
> Hostname: gpu01.cnpem.local
> PID: 3062463
> pyworkflow: 3.0.27
> plugin: relion
> plugin v: 4.0.11
> currentDir: /home/helder.ribeiro/ScipionUserData/projects/scipion_teste
> workingDir: Runs/003851_ProtRelionClassify3D
> runMode: Restart
> MPI: 3
> threads: 1
> Starting at step: 1
> Running steps
> STARTED: convertInputStep, step 1, time 2022-11-21 13:12:45.527978
> Converting set from 'Runs/003330_ProtImportParticles/particles.sqlite' into 'Runs/003851_ProtRelionClassify3D/input_particles.star'
> ** Running command: relion_image_handler --i Runs/002945_ProtImportVolumes/extra/import_output_volume.mrc --o Runs/003851_ProtRelionClassify3D/tmp/import_output_volume.00.mrc --angpix 1.10745 --new_box 220
> 000/??? sec ~~(,_,"> [oo] 0/ 0 sec ............................................................~~(,_,">
> FINISHED: convertInputStep, step 1, time 2022-11-21 13:12:45.850105
> STARTED: runRelionStep, step 2, time 2022-11-21 13:12:45.875931
> mpirun -np 3 -bynode `which relion_refine_mpi` --i Runs/003851_ProtRelionClassify3D/input_particles.star --particle_diameter 226 --zero_mask --K 3 --firstiter_cc --ini_high 60.0 --sym c1 --ref_angpix 1.10745 --ref Runs/003851_ProtRelionClassify3D/tmp/import_output_volume.00.mrc --norm --scale --o Runs/003851_ProtRelionClassify3D/extra/relion --oversampling 1 --flatten_solvent --tau2_fudge 4.0 --iter 25 --pad 2 --healpix_order 2 --offset_range 5.0 --offset_step 2.0 --dont_combine_weights_via_disc --pool 3 --gpu --j 1
> RELION version: 4.0.0-commit-138b9c
> Precision: BASE=double
>
> === RELION MPI setup ===
> + Number of MPI processes = 3
> + Leader (0) runs on host = gpu01
> + Follower 1 runs on host = gpu01
> + Follower 2 runs on host = gpu01
> =================
> uniqueHost gpu01 has 2 ranks.
> GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
> Thread 0 on follower 1 mapped to device 0
> GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
> Thread 0 on follower 2 mapped to device 0
> Device 0 on gpu01 is split between 2 followers
> Running CPU instructions in double precision.
> Estimating initial noise spectra from 1000 particles
> 000/??? sec ~~(,_,">
>
> .....
> FAILED: runRelionStep, step 2, time 2022-11-21 13:12:52.259808
> *** Last status is failed
> ------------------- PROTOCOL FAILED (DONE 2/3)
>
> _______________________________________________
> scipion-users mailing list
> sci...@li...
> https://lists.sourceforge.net/lists/listinfo/scipion-users
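The stack trace above ends in Open MPI's ofi MTL (mca_mtl_ofi.so, frame [ 1]) while relion_refine_mpi is inside MPI_Bcast, and the stderr also shows the openib "no cpcs for port" warning. A hedged diagnostic sketch, assuming the Ubuntu-packaged Open MPI 4.0.x reported above: exclude those transport components via standard MCA parameters and re-test, to see whether the segfault follows the transport selection. This is not a fix confirmed in the thread, only one way to narrow the problem down.

# Diagnostic sketch: exclude the transport components implicated above.
# OMPI_MCA_* variables are read by Open MPI, so they should also take effect
# when mpirun is launched from inside Scipion, provided Scipion inherits this
# environment.
export OMPI_MCA_mtl="^ofi"      # skip the libfabric (ofi) MTL seen in frame [ 1] of the backtrace
export OMPI_MCA_btl="^openib"   # skip the openib BTL behind the "no cpcs for port" warning

# Same idea on the command line, reusing the minimal test program from the
# sketch further up:
mpirun -np 3 --mca mtl ^ofi --mca btl ^openib ./mpi_check

If the minimal program (or the 3D classification job) stops segfaulting with these settings, the issue sits in the ofi/openib transport stack of the system's Open MPI rather than in Relion or Scipion; if it still crashes with 3 ranks, the MPI installation itself is the next thing to check, as suggested at the top of the thread.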