From: Manoel P. <Man...@un...> - 2018-06-21 15:15:50
|
Dear Users,

The answer came from Gregory Sharov. Indeed, I made a mistake: I started the classification asking for X CPUs and found out that it was too many. I stopped the job, reassigned fewer CPUs and restarted it. I think that, for some reason unknown to me, it kept the initial number of CPUs. Killing the job and restarting it solved everything!

Thanks very much to all of you for the help!

Cheers,

Manoël Prouteau, Ph.D.
Scientific Collaborator
Department of Molecular Biology
Sciences III - University of Geneva
Quai Ernest Ansermet, 30
1211 Geneve 04
Switzerland
(+41) 022 379 61 18
man...@un...
http://www.unige.ch

________________________________
From: Gregory Sharov <sha...@gm...>
Sent: Thursday, 21 June 2018 16:13:05
To: Mailing list for Scipion users
Subject: Re: [scipion-users] Problem with CL2D

Hello Manoel,

from the attached output it looks like you were running it first with 4 MPIs, but then the second step continued with 32 MPIs. It could be that you are running out of memory and your job is getting killed.

On Thu, Jun 21, 2018, 14:58 Bart Alewijnse <sca...@gm...> wrote:

In my experience, seemingly random kills are often the kernel's out-of-memory handler dealing with over-allocating processes. Maybe a window size or other parameter thing?

On Jun 21, 2018, at 15:40, Carlos Oscar Sorzano <co...@cn...> wrote:

Dear Manoel,

from the stdout there is no obvious reason why the process has finished. There is no error other than that it has been killed. On some machines there is a limit on the time a process can be running, and beyond this time, processes have to be submitted through a queue. I don't know if this could be the case here.

Kind regards, Carlos Oscar

On 21/06/2018 14:18, Manoel Prouteau wrote:

Dear users,

I am just starting to use Scipion for CL2D classification of a small set of manually picked objects. I get an error when the software starts the second step of the command.
Can you help me understand the problem? You can find the error in the run.stdout here:

RUNNING PROTOCOL -----------------
PID: 9060
Scipion: v1.1 (2017-06-14) Balbino
currentDir: /data/prouteau/Mano_newdata_frames2-16_DW/TOROID-Sides
workingDir: Runs/000400_XmippProtCL2D
runMode: Continue
MPI: 4
threads: 1
len(steps) 13 len(prevSteps) 0
Starting at step: 1
Running steps
STARTED: convertInputStep, step 1
2018-06-20 15:03:42.471496
FINISHED: convertInputStep, step 1
2018-06-20 15:03:43.256208
STARTED: runJob, step 2
2018-06-20 15:03:43.312482
mpirun -np 4 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

Command line options:
Deprecated: --bynode, -bynode
Replacement: --map-by node

Equivalent MCA parameter:
Deprecated: rmaps_base_bynode
Replacement: rmaps_base_mapping_policy=node

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
Input images: Runs/000400_XmippProtCL2D/tmp/input_particles.xmd
Output root: level
Output dir: Runs/000400_XmippProtCL2D/extra
Iterations: 10
CodesSel0:
Codes0: 4
Codes: 15
Neighbours: 4
Minimum node size: 20
Use Correlation: 1
Classical Multiref: 1
Classical Split: 0
Maximum shift: 10
Classify all images: 0
Normalize images: 1
Mirror images: 1
Align images: 1
Initializing ...
0/ 0 sec. ............................................................
Quantizing with 4 codes...
Iteration 1 ...
13/ 25 sec. ...............................RUNNING PROTOCOL -----------------
PID: 9099
Scipion: v1.1 (2017-06-14) Balbino
currentDir: /data/prouteau/Mano_newdata_frames2-16_DW/TOROID-Sides
workingDir: Runs/000400_XmippProtCL2D
runMode: Continue
MPI: 32
threads: 1
len(steps) 13 len(prevSteps) 13
Starting at step: 2
Running steps
STARTED: runJob, step 2
2018-06-20 15:04:06.958333
mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

Command line options:
Deprecated: --bynode, -bynode
Replacement: --map-by node

Equivalent MCA parameter:
Deprecated: rmaps_base_bynode
Replacement: rmaps_base_mapping_policy=node

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
Input images: Runs/000400_XmippProtCL2D/tmp/input_particles.xmd
Output root: level
Output dir: Runs/000400_XmippProtCL2D/extra
Iterations: 10
CodesSel0:
Codes0: 4
Codes: 15
Neighbours: 4
Minimum node size: 20
Use Correlation: 1
Classical Multiref: 1
Classical Split: 0
Maximum shift: 10
Classify all images: 0
Normalize images: 1
Mirror images: 1
Align images: 1
Initializing ...
0/ 0 sec. ............................................................
Quantizing with 4 codes...
Iteration 1 ...
10/ 10 sec. ............................................................

Average correlation with input vectors=0.0310552
Number of assignment changes=0
Iteration 2 ...
10/ 10 sec. ............................................................

Average correlation with input vectors=0.0882044
Number of assignment changes=324
Iteration 3 ...
10/ 10 sec. ............................................................

Average correlation with input vectors=0.107101
Number of assignment changes=378
Iteration 4 ...
9/ 9 sec. ............................................................

Average correlation with input vectors=0.122994
Number of assignment changes=225
Iteration 5 ...
10/ 10 sec. ............................................................

Average correlation with input vectors=0.119519
Number of assignment changes=290
Iteration 6 ...
9/ 9 sec. ............................................................

Average correlation with input vectors=0.127653
Number of assignment changes=233
Iteration 7 ...
10/ 10 sec. ............................................................

Average correlation with input vectors=0.127296
Number of assignment changes=223
Iteration 8 ...
9/ 9 sec. ............................................................

Average correlation with input vectors=0.129356
Number of assignment changes=236
Iteration 9 ...
10/ 10 sec. ............................................................

Average correlation with input vectors=0.143878
Number of assignment changes=126
Iteration 10 ...
9/ 9 sec. ............................................................

Average correlation with input vectors=0.138916
Number of assignment changes=187
Spliting nodes ...
Currently there are 5 nodes
Currently there are 6 nodes
Currently there are 7 nodes
Currently there are 8 nodes
Quantizing with 8 codes...
Iteration 1 ...
28/ 28 sec. ............................................................

Average correlation with input vectors=0.139535
Number of assignment changes=0
Iteration 2 ...
26/ 26 sec. ............................................................

Average correlation with input vectors=0.153304
Number of assignment changes=181
Iteration 3 ...
26/ 26 sec. ............................................................

Average correlation with input vectors=0.159167
Number of assignment changes=265
Iteration 4 ...
25/ 25 sec. ............................................................

Average correlation with input vectors=0.151184
Number of assignment changes=424
Iteration 5 ...
25/ 25 sec. ............................................................

Average correlation with input vectors=0.155143
Number of assignment changes=177
Iteration 6 ...
23/ 23 sec. ............................................................

Average correlation with input vectors=0.147184
Number of assignment changes=263
Iteration 7 ...
27/ 27 sec. ............................................................

Average correlation with input vectors=0.159538
Number of assignment changes=119
Iteration 8 ...
25/ 25 sec. ............................................................

Average correlation with input vectors=0.160486
Number of assignment changes=139
Iteration 9 ...
26/ 26 sec. ............................................................

Average correlation with input vectors=0.164716
Number of assignment changes=120
Iteration 10 ...
27/ 27 sec. ............................................................

Average correlation with input vectors=0.162771
Number of assignment changes=130
Spliting nodes ...
Currently there are 9 nodes
Currently there are 10 nodes
Currently there are 11 nodes
Currently there are 12 nodes
--------------------------------------------------------------------------
mpirun noticed that process rank 11 with PID 9147 on node smaug exited on signal 9 (Killed).
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/scipion/pyworkflow/protocol/protocol.py", line 182, in run
    self._run()
  File "/opt/scipion/pyworkflow/protocol/protocol.py", line 228, in _run
    resultFiles = self._runFunc()
  File "/opt/scipion/pyworkflow/protocol/protocol.py", line 224, in _runFunc
    return self._func(*self._args)
  File "/opt/scipion/pyworkflow/protocol/protocol.py", line 1077, in runJob
    self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
  File "/opt/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
    env=env, cwd=cwd)
  File "/opt/scipion/pyworkflow/utils/process.py", line 51, in runJob
    return runCommand(command, env, cwd)
  File "/opt/scipion/pyworkflow/utils/process.py", line 65, in runCommand
    check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
  File "/opt/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 137
Protocol failed: Command 'mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 137
FAILED: runJob, step 2
2018-06-20 15:31:45.758171
------------------- PROTOCOL FAILED (DONE 2/13)

Thanks in advance for your help,

Cheers,

Manoël Prouteau, Ph.D.
Scientific Collaborator
Department of Molecular Biology
Sciences III - University of Geneva
Quai Ernest Ansermet, 30
1211 Geneve 04
Switzerland
(+41) 022 379 61 18
man...@un...
http://www.unige.ch

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
scipion-users mailing list
sci...@li...
https://lists.sourceforge.net/lists/listinfo/scipion-users

--
------------------------------------------------------------------------
Carlos Oscar Sánchez Sorzano                e-mail: co...@cn...
Biocomputing unit                           http://i2pc.es/coss
National Center of Biotechnology (CSIC)
c/Darwin, 3
Campus Universidad Autónoma (Cantoblanco)   Tlf: 34-91-585 4510
28049 MADRID (SPAIN)                        Fax: 34-91-585 4506
------------------------------------------------------------------------
|
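
For anyone landing on this thread with a similar log: the two clues discussed in the replies (the MPI count changing from 4 to 32 between the two RUNNING PROTOCOL headers, and the exit status 137) can be pulled out of a run.stdout mechanically. What follows is only a minimal Python sketch and not part of Scipion. The function names are hypothetical, and it assumes nothing beyond the log containing `MPI: <n>` header lines and `returned non-zero exit status <n>` messages like the ones quoted in this thread. It relies on the usual shell convention that a process killed by signal N is reported as exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL), which agrees with the "exited on signal 9 (Killed)" message from mpirun.

```python
import re
import signal


def signal_from_exit_status(status):
    """Return the signal number implied by a shell-style exit status, or None.

    By convention, a process killed by signal N is reported as 128 + N.
    """
    if 128 < status < 256:
        return status - 128
    return None


def signal_name(signum):
    """Best-effort symbolic name for a signal; falls back to the bare number."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "signal %d" % signum


def summarize_run_log(text):
    """Collect MPI counts and exit statuses from a run.stdout-style dump.

    Hypothetical helper: it only looks for 'MPI: <n>' header lines and
    'non-zero exit status <n>' messages, as seen in the log in this thread.
    """
    mpi_counts = [int(n) for n in re.findall(r"\bMPI:\s*(\d+)", text)]
    statuses = [int(n) for n in re.findall(r"non-zero exit status (\d+)", text)]
    return {
        "mpi_counts": mpi_counts,
        # True when the protocol was (re)started with different MPI settings
        "mpi_changed": len(set(mpi_counts)) > 1,
        "signals": [signal_from_exit_status(s) for s in statuses],
    }


if __name__ == "__main__":
    # Tiny excerpt in the shape of the log above, not the full output.
    excerpt = (
        "MPI: 4\n"
        "MPI: 32\n"
        "CalledProcessError: Command 'mpirun ...' returned non-zero exit status 137\n"
    )
    summary = summarize_run_log(excerpt)
    print("MPI counts per run header:", summary["mpi_counts"])
    print("MPI count changed between runs:", summary["mpi_changed"])
    for signum in summary["signals"]:
        if signum is not None:
            print("Killed by", signal_name(signum))
```

A SIGKILL by itself does not say who sent it. As suggested in the replies, the kernel's out-of-memory handler and a run-time limit on the machine are both candidates, so the sketch only narrows down what to look for.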