From: Gregory S. <sha...@gm...> - 2018-06-21 14:13:22
Hello Manoel,

From the attached output, it looks like you first ran it with 4 MPI processes, but the second step then continued with 32 MPI processes. It could be that you are running out of memory and your job is getting killed.

On Thu, Jun 21, 2018, 14:58 Bart Alewijnse <sca...@gm...> wrote:
>
> In my experience, seemingly random kills are often the kernel's
> out-of-memory handler dealing with over-allocating processes.
> Maybe a window size or other parameter thing?
>
> On Jun 21, 2018, at 15:40, Carlos Oscar Sorzano <co...@cn...> wrote:
>>
>> Dear Manoel,
>>
>> From the stdout there is no obvious reason why the process has finished.
>> There is no error other than that it has been killed. On some machines there is
>> a limit on the time a process can run, and beyond this time,
>> processes have to be submitted through a queue. I don't know if this
>> applies here.
>>
>> Kind regards, Carlos Oscar
>>
>> On 21/06/2018 14:18, Manoel Prouteau wrote:
>>
>> Dear users,
>>
>> I am just starting to use Scipion for CL2D classification of a small set
>> of manually picked objects.
>>
>> I get an error when the software starts the second step of the command.
>> Can you help me understand the problem?
>>
>> You can find the error in the run.stdout here:
>>
>> 00001: RUNNING PROTOCOL -----------------
>> 00002: PID: 9060
>> 00003: Scipion: v1.1 (2017-06-14) Balbino
>> 00004: currentDir: /data/prouteau/Mano_newdata_frames2-16_DW/TOROID-Sides
>> 00005: workingDir: Runs/000400_XmippProtCL2D
>> 00006: runMode: Continue
>> 00007: MPI: 4
>> 00008: threads: 1
>> 00009: len(steps) 13 len(prevSteps) 0
>> 00010: Starting at step: 1
>> 00011: Running steps
>> 00012: STARTED: convertInputStep, step 1
>> 00013: 2018-06-20 15:03:42.471496
>> 00014: FINISHED: convertInputStep, step 1
>> 00015: 2018-06-20 15:03:43.256208
>> 00016: STARTED: runJob, step 2
>> 00017: 2018-06-20 15:03:43.312482
>> 00018: mpirun -np 4 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4
>> 00019: --------------------------------------------------------------------------
>> 00020: The following command line options and corresponding MCA parameter have
>> 00021: been deprecated and replaced as follows:
>> 00022:
>> 00023: Command line options:
>> 00024: Deprecated: --bynode, -bynode
>> 00025: Replacement: --map-by node
>> 00026:
>> 00027: Equivalent MCA parameter:
>> 00028: Deprecated: rmaps_base_bynode
>> 00029: Replacement: rmaps_base_mapping_policy=node
>> 00030:
>> 00031: The deprecated forms *will* disappear in a future version of Open MPI.
>> 00032: Please update to the new syntax.
>> 00033: --------------------------------------------------------------------------
>> 00034: Input images: Runs/000400_XmippProtCL2D/tmp/input_particles.xmd
>> 00035: Output root: level
>> 00036: Output dir: Runs/000400_XmippProtCL2D/extra
>> 00037: Iterations: 10
>> 00038: CodesSel0:
>> 00039: Codes0: 4
>> 00040: Codes: 15
>> 00041: Neighbours: 4
>> 00042: Minimum node size: 20
>> 00043: Use Correlation: 1
>> 00044: Classical Multiref: 1
>> 00045: Classical Split: 0
>> 00046: Maximum shift: 10
>> 00047: Classify all images: 0
>> 00048: Normalize images: 1
>> 00049: Mirror images: 1
>> 00050: Align images: 1
>> 00051: Initializing ...
>> 00052: 0/ 0 sec. ............................................................
>> 00053: Quantizing with 4 codes...
>> 00054: Iteration 1 ...
>> 00055: 13/ 25 sec. ...............................RUNNING PROTOCOL -----------------
>> 00056: PID: 9099
>> 00057: Scipion: v1.1 (2017-06-14) Balbino
>> 00058: currentDir: /data/prouteau/Mano_newdata_frames2-16_DW/TOROID-Sides
>> 00059: workingDir: Runs/000400_XmippProtCL2D
>> 00060: runMode: Continue
>> 00061: MPI: 32
>> 00062: threads: 1
>> 00063: len(steps) 13 len(prevSteps) 13
>> 00064: Starting at step: 2
>> 00065: Running steps
>> 00066: STARTED: runJob, step 2
>> 00067: 2018-06-20 15:04:06.958333
>> 00068: mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4
>> 00069: --------------------------------------------------------------------------
>> 00070: The following command line options and corresponding MCA parameter have
>> 00071: been deprecated and replaced as follows:
>> 00072:
>> 00073: Command line options:
>> 00074: Deprecated: --bynode, -bynode
>> 00075: Replacement: --map-by node
>> 00076:
>> 00077: Equivalent MCA parameter:
>> 00078: Deprecated: rmaps_base_bynode
>> 00079: Replacement: rmaps_base_mapping_policy=node
>> 00080:
>> 00081: The deprecated forms *will* disappear in a future version of Open MPI.
>> 00082: Please update to the new syntax.
>> 00083: --------------------------------------------------------------------------
>> 00084: Input images: Runs/000400_XmippProtCL2D/tmp/input_particles.xmd
>> 00085: Output root: level
>> 00086: Output dir: Runs/000400_XmippProtCL2D/extra
>> 00087: Iterations: 10
>> 00088: CodesSel0:
>> 00089: Codes0: 4
>> 00090: Codes: 15
>> 00091: Neighbours: 4
>> 00092: Minimum node size: 20
>> 00093: Use Correlation: 1
>> 00094: Classical Multiref: 1
>> 00095: Classical Split: 0
>> 00096: Maximum shift: 10
>> 00097: Classify all images: 0
>> 00098: Normalize images: 1
>> 00099: Mirror images: 1
>> 00100: Align images: 1
>> 00101: Initializing ...
>> 00102: 0/ 0 sec. ............................................................
>> 00103: Quantizing with 4 codes...
>> 00104: Iteration 1 ...
>> 00105: 10/ 10 sec. ............................................................
>> 00106:
>> 00107: Average correlation with input vectors=0.0310552
>> 00108: Number of assignment changes=0
>> 00109: Iteration 2 ...
>> 00110: 10/ 10 sec. ............................................................
>> 00111:
>> 00112: Average correlation with input vectors=0.0882044
>> 00113: Number of assignment changes=324
>> 00114: Iteration 3 ...
>> 00115: 10/ 10 sec. ............................................................
>> 00116:
>> 00117: Average correlation with input vectors=0.107101
>> 00118: Number of assignment changes=378
>> 00119: Iteration 4 ...
>> 00120: 9/ 9 sec. ............................................................
>> 00121:
>> 00122: Average correlation with input vectors=0.122994
>> 00123: Number of assignment changes=225
>> 00124: Iteration 5 ...
>> 00125: 10/ 10 sec. ............................................................
>> 00126:
>> 00127: Average correlation with input vectors=0.119519
>> 00128: Number of assignment changes=290
>> 00129: Iteration 6 ...
>> 00130: 9/ 9 sec. ............................................................
>> 00131:
>> 00132: Average correlation with input vectors=0.127653
>> 00133: Number of assignment changes=233
>> 00134: Iteration 7 ...
>> 00135: 10/ 10 sec. ............................................................
>> 00136:
>> 00137: Average correlation with input vectors=0.127296
>> 00138: Number of assignment changes=223
>> 00139: Iteration 8 ...
>> 00140: 9/ 9 sec. ............................................................
>> 00141:
>> 00142: Average correlation with input vectors=0.129356
>> 00143: Number of assignment changes=236
>> 00144: Iteration 9 ...
>> 00145: 10/ 10 sec. ............................................................
>> 00146:
>> 00147: Average correlation with input vectors=0.143878
>> 00148: Number of assignment changes=126
>> 00149: Iteration 10 ...
>> 00150: 9/ 9 sec. ............................................................
>> 00151:
>> 00152: Average correlation with input vectors=0.138916
>> 00153: Number of assignment changes=187
>> 00154: Spliting nodes ...
>> 00155: Currently there are 5 nodes
>> 00156: Currently there are 6 nodes
>> 00157: Currently there are 7 nodes
>> 00158: Currently there are 8 nodes
>> 00159: Quantizing with 8 codes...
>> 00160: Iteration 1 ...
>> 00161: 28/ 28 sec. ............................................................
>> 00162:
>> 00163: Average correlation with input vectors=0.139535
>> 00164: Number of assignment changes=0
>> 00165: Iteration 2 ...
>> 00166: 26/ 26 sec. ............................................................
>> 00167:
>> 00168: Average correlation with input vectors=0.153304
>> 00169: Number of assignment changes=181
>> 00170: Iteration 3 ...
>> 00171: 26/ 26 sec. ............................................................
>> 00172:
>> 00173: Average correlation with input vectors=0.159167
>> 00174: Number of assignment changes=265
>> 00175: Iteration 4 ...
>> 00176: 25/ 25 sec. ............................................................
>> 00177:
>> 00178: Average correlation with input vectors=0.151184
>> 00179: Number of assignment changes=424
>> 00180: Iteration 5 ...
>> 00181: 25/ 25 sec. ............................................................
>> 00182:
>> 00183: Average correlation with input vectors=0.155143
>> 00184: Number of assignment changes=177
>> 00185: Iteration 6 ...
>> 00186: 23/ 23 sec. ............................................................
>> 00187:
>> 00188: Average correlation with input vectors=0.147184
>> 00189: Number of assignment changes=263
>> 00190: Iteration 7 ...
>> 00191: 27/ 27 sec. ............................................................
>> 00192:
>> 00193: Average correlation with input vectors=0.159538
>> 00194: Number of assignment changes=119
>> 00195: Iteration 8 ...
>> 00196: 25/ 25 sec. ............................................................
>> 00197:
>> 00198: Average correlation with input vectors=0.160486
>> 00199: Number of assignment changes=139
>> 00200: Iteration 9 ...
>> 00201: 26/ 26 sec. ............................................................
>> 00202:
>> 00203: Average correlation with input vectors=0.164716
>> 00204: Number of assignment changes=120
>> 00205: Iteration 10 ...
>> 00206: 27/ 27 sec. ............................................................
>> 00207:
>> 00208: Average correlation with input vectors=0.162771
>> 00209: Number of assignment changes=130
>> 00210: Spliting nodes ...
>> 00211: Currently there are 9 nodes
>> 00212: Currently there are 10 nodes
>> 00213: Currently there are 11 nodes
>> 00214: Currently there are 12 nodes
>> 00215: --------------------------------------------------------------------------
>> 00216: mpirun noticed that process rank 11 with PID 9147 on node smaug exited on signal 9 (Killed).
>> 00217: --------------------------------------------------------------------------
>> 00218: Traceback (most recent call last):
>> 00219: File "/opt/scipion/pyworkflow/protocol/protocol.py", line 182, in run
>> 00220: self._run()
>> 00221: File "/opt/scipion/pyworkflow/protocol/protocol.py", line 228, in _run
>> 00222: resultFiles = self._runFunc()
>> 00223: File "/opt/scipion/pyworkflow/protocol/protocol.py", line 224, in _runFunc
>> 00224: return self._func(*self._args)
>> 00225: File "/opt/scipion/pyworkflow/protocol/protocol.py", line 1077, in runJob
>> 00226: self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
>> 00227: File "/opt/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
>> 00228: env=env, cwd=cwd)
>> 00229: File "/opt/scipion/pyworkflow/utils/process.py", line 51, in runJob
>> 00230: return runCommand(command, env, cwd)
>> 00231: File "/opt/scipion/pyworkflow/utils/process.py", line 65, in runCommand
>> 00232: check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
>> 00233: File "/opt/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
>> 00234: raise CalledProcessError(retcode, cmd)
>> 00235: CalledProcessError: Command 'mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 137
>> 00236: Protocol failed: Command 'mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 137
>> 00237: FAILED: runJob, step 2
>> 00238: 2018-06-20 15:31:45.758171
>> 00239: ------------------- PROTOCOL FAILED (DONE 2/13)
>>
>> Thanks in advance for your help,
>>
>> Cheers,
>>
>> *Manoël Prouteau, Ph.D.*
>> Scientific Collaborator
>> Department of Molecular Biology
>> Sciences III - University of Geneva
>> Quai Ernest Ansermet, 30
>> 1211 Geneve 04
>> Switzerland
>> (+41) 022 379 61 18
>> man...@un...
>> http://www.unige.ch
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> scipion-users mailing list
>> sci...@li...
>> https://lists.sourceforge.net/lists/listinfo/scipion-users
>>
>> --
>> ------------------------------------------------------------------------
>> Carlos Oscar Sánchez Sorzano e-mail: co...@cn...
>> Biocomputing unit http://i2pc.es/coss
>> National Center of Biotechnology (CSIC)
>> c/Darwin, 3
>> Campus Universidad Autónoma (Cantoblanco) Tlf: 34-91-585 4510
>> 28049 MADRID (SPAIN) Fax: 34-91-585 4506
>> ------------------------------------------------------------------------
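
A note on reading the failure in the log: the "exit status 137" follows the usual shell convention of 128 plus the signal number, so it matches the "exited on signal 9 (Killed)" line reported by mpirun. Below is a minimal sketch of that decoding, assuming a POSIX system and Python 3; `describe_exit_status` is a hypothetical helper written for illustration and is not part of Scipion or Xmipp.

```python
import signal


def describe_exit_status(status):
    """Describe a process exit status (hypothetical helper, POSIX only).

    Shells report 128 + N when a command is killed by signal N, while
    Python's subprocess module reports -N when it runs the child directly.
    """
    if status == 0:
        return "success"
    if status < 0:
        signum = -status          # subprocess convention
    elif status > 128:
        signum = status - 128     # shell convention
    else:
        return "exited with error code %d" % status
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return "killed by signal %d (%s)" % (signum, name)


# The status reported at the end of the log above.
print(describe_exit_status(137))
```

SIGKILL on its own does not prove an out-of-memory kill, since a queue system or an administrator can send it too. Checking the kernel log on the node (for example with `dmesg`) for out-of-memory messages around the time of the failure would help confirm it.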