From: Montserrat F. F. <mf...@ib...> - 2018-01-23 10:43:52
Hi,

The box size is 220x220. Could this be the reason why we are running out
of memory? We are running the process on a GPU cluster, and were using
4 threads and 4 MPI processes because the system admins told us to use
those values. We will try to run it with an odd number of MPI processes.

Thank you very much for your feedback,

Montserrat Fabrega
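A quick back-of-the-envelope check on the memory question, as a Python
sketch (the helper volume_mb and the padding arithmetic are illustrative
assumptions; RELION 2.x refines on Fourier-padded grids, by default twice
the box size, so each volume costs far more than its raw footprint):

def volume_mb(box, bytes_per_voxel=4):
    """MB occupied by one box^3 volume in single precision."""
    return box ** 3 * bytes_per_voxel / 1024.0 ** 2

box = 220
print("raw %d^3 volume:    %6.1f MB" % (box, volume_mb(box)))
# Assumed padding factor 2 (RELION's default) inflates each volume ~8x:
print("padded %d^3 volume: %6.1f MB" % (2 * box, volume_mb(2 * box)))

That puts a single padded 440^3 volume near 325 MB, and several such
buffers (two half-maps, weights, per-thread copies) coexist per rank, so
GPU memory fills quickly once two ranks land on the same card, as the
"split between 2 slaves" lines in the log below show.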
On Tue, 23 Jan 2018 at 0:03, Joshua Jude Lobo <jl...@um...> wrote:

> Hi Dr. Ferrer,
>
> It seems like you are running out of memory on your cards. What is
> the box size? Also, you might want to give an odd number of MPI
> processes, because one will always become the master and the rest
> will be slaves.
>
> Sincerely,
> Joshua Lobo
>
> On Mon, Jan 22, 2018 at 11:46 AM, Montserrat Fabrega Ferrer
> <mf...@ib...> wrote:
>
>> Hi,
>>
>> I am trying to run a Relion auto-refine in Scipion v1.1 (2017-06-14)
>> Balbino. The Relion version is 2.0.3. However, I get the error I
>> copy below. Does anybody have any suggestions that would help?
>>
>> Thank you very much in advance,
>>
>> Montserrat Fabrega
>>
>> 00001: RUNNING PROTOCOL -----------------
>> 00002: PID: 10237
>> 00003: Scipion: v1.1 (2017-06-14) Balbino
>> 00004: currentDir: /gpfs/projects/irb12/irb12336/ScipionUserData/projects/Titan
>> 00005: workingDir: Runs/011639_ProtRelionRefine3D
>> 00006: runMode: Continue
>> 00007: MPI: 4
>> 00008: threads: 4
>> 00009: len(steps) 3 len(prevSteps) 0
>> 00010: Starting at step: 1
>> 00011: Running steps
>> 00012: STARTED: convertInputStep, step 1
>> 00013: 2018-01-22 00:13:40.523885
>> 00014: Converting set from 'Runs/011589_ProtUserSubSet/particles.sqlite'
>>        into 'Runs/011639_ProtRelionRefine3D/input_particles.star'
>> 00015: FINISHED: convertInputStep, step 1
>> 00016: 2018-01-22 00:13:45.845200
>> 00017: STARTED: runRelionStep, step 2
>> 00018: 2018-01-22 00:13:45.860738
>> 00019: srun `which relion_refine_mpi` --gpu --low_resol_join_halves 40
>>        --pool 3 --auto_local_healpix_order 4 --angpix 1.04
>>        --dont_combine_weights_via_disc
>>        --ref Runs/011639_ProtRelionRefine3D/tmp/proposedVolume00003.mrc
>>        --scale --offset_range 5.0 --ini_high 60.0 --offset_step 2.0
>>        --healpix_order 2 --auto_refine --ctf --oversampling 1
>>        --split_random_halves --o Runs/011639_ProtRelionRefine3D/extra/relion
>>        --i Runs/011639_ProtRelionRefine3D/input_particles.star
>>        --zero_mask --norm --firstiter_cc --sym c12 --flatten_solvent
>>        --particle_diameter 228.8 --j 4
>> 00020: === RELION MPI setup ===
>> 00021: + Number of MPI processes = 4
>> 00022: + Number of threads per MPI process = 4
>> 00023: + Total number of threads therefore = 16
>> 00024: + Master (0) runs on host = nvb36
>> 00025: + Slave 1 runs on host = nvb36
>> 00026: + Slave 2 runs on host = nvb36
>> 00027: + Slave 3 runs on host = nvb36
>> 00028: =================
>> 00029: uniqueHost nvb36 has 3 ranks.
>> 00030: GPU-ids not specified for this rank, threads will automatically
>>        be mapped to available devices.
>> 00031: Thread 0 on slave 1 mapped to device 0
>> 00032: Thread 1 on slave 1 mapped to device 0
>> 00033: Thread 2 on slave 1 mapped to device 0
>> 00034: Thread 3 on slave 1 mapped to device 1
>> 00035: GPU-ids not specified for this rank, threads will automatically
>>        be mapped to available devices.
>> 00036: Thread 0 on slave 2 mapped to device 1
>> 00037: Thread 1 on slave 2 mapped to device 1
>> 00038: Thread 2 on slave 2 mapped to device 2
>> 00039: Thread 3 on slave 2 mapped to device 2
>> 00040: GPU-ids not specified for this rank, threads will automatically
>>        be mapped to available devices.
>> 00041: Thread 0 on slave 3 mapped to device 2
>> 00042: Thread 1 on slave 3 mapped to device 3
>> 00043: Thread 2 on slave 3 mapped to device 3
>> 00044: Thread 3 on slave 3 mapped to device 3
>> 00045: Device 1 on nvb36 is split between 2 slaves
>> 00046: Device 2 on nvb36 is split between 2 slaves
>> 00047: [nvb36:10305] *** Process received signal ***
>> 00048: [nvb36:10305] Signal: Segmentation fault (11)
>> 00049: [nvb36:10305] Signal code: Address not mapped (1)
>> 00050: [nvb36:10305] Failing at address: 0x2802b08
>> 00051: [nvb36:10305] [ 0] /lib64/libpthread.so.0() [0x358740f790]
>> 00052: [nvb36:10305] [ 1] /opt/mpi/bullxmpi/1.2.9.1/lib/libmpi.so.1(opal_memory_ptmalloc2_free+0x26) [0x2ac8aeb94046]
>> 00053: [nvb36:10305] [ 2] /apps/RELION/2.0.3/lib/librelion_lib.so(_ZN14MlOptimiserMpi10initialiseEv+0x115f) [0x2ac8a7491f0f]
>> 00054: [nvb36:10305] [ 3] /apps/RELION/2.0.3/bin/relion_refine_mpi(main+0x218) [0x4052c8]
>> 00055: [nvb36:10305] [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3586c1ed5d]
>> 00056: [nvb36:10305] [ 5] /apps/RELION/2.0.3/bin/relion_refine_mpi() [0x404fe9]
>> 00057: [nvb36:10305] *** End of error message ***
>> 00058: srun: error: nvb36: task 0: Segmentation fault
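On the odd-MPI advice quoted above: with --split_random_halves, rank 0
acts purely as the master, so 4 MPI processes leave 3 working slaves,
which cannot be spread evenly over 4 GPUs. The Python sketch below
reproduces the mapping printed in the log (an assumed even-split
heuristic, for illustration only; RELION's actual mapping code is C++
and may weigh other factors):

# 3 slave ranks x 4 threads each, distributed over 4 GPUs, as in the log.
N_SLAVES, N_THREADS, N_DEVICES = 3, 4, 4

shared = {}
for slave in range(1, N_SLAVES + 1):
    for thread in range(N_THREADS):
        g = (slave - 1) * N_THREADS + thread              # global thread index
        device = g * N_DEVICES // (N_SLAVES * N_THREADS)  # even split over cards
        shared.setdefault(device, set()).add(slave)
        print("Thread %d on slave %d mapped to device %d" % (thread, slave, device))

# Devices claimed by more than one rank hold that many separate working sets:
for device in sorted(shared):
    if len(shared[device]) > 1:
        print("Device %d is split between %d slaves" % (device, len(shared[device])))

With 5 MPI processes (1 master + 4 slaves) each slave would get a card to
itself; pinning devices explicitly, e.g. --gpu 0:1:2:3 (RELION 2.x takes
colon-separated device lists, one per slave rank), avoids the sharing
that multiplies per-card memory pressure.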