From: Bart A. <sca...@gm...> - 2018-06-21 13:58:46
In my experience, seemingly random kills are often the kernel's out-of-memory (OOM) killer terminating processes that allocate too much memory. Could a window size or some other parameter be set too high?

On Jun 21, 2018, at 15:40, Carlos Oscar Sorzano <co...@cn...> wrote:
>Dear Manoel,
>
>from the stdout there is no obvious reason why the process has
>finished. There is no error other than that it has been killed. On
>some machines there is a limit on how long a process can run, and
>beyond this time processes have to be submitted through a queue. I
>don't know whether that applies here.
>
>Kind regards, Carlos Oscar
>
>On 21/06/2018 14:18, Manoel Prouteau wrote:
>>
>> Dear users,
>>
>> I am just starting to use Scipion for CL2D classification of a small
>> set of manually picked objects.
>>
>> I get an error when the software starts the second step of the
>> command. Can you help me understand the problem?
>>
>> You can find the error in the run.stdout here:
>>
>> 00001: RUNNING PROTOCOL -----------------
>> 00002: PID: 9060
>> 00003: Scipion: v1.1 (2017-06-14) Balbino
>> 00004: currentDir: /data/prouteau/Mano_newdata_frames2-16_DW/TOROID-Sides
>> 00005: workingDir: Runs/000400_XmippProtCL2D
>> 00006: runMode: Continue
>> 00007: MPI: 4
>> 00008: threads: 1
>> 00009: len(steps) 13 len(prevSteps) 0
>> 00010: Starting at step: 1
>> 00011: Running steps
>> 00012: STARTED: convertInputStep, step 1
>> 00013: 2018-06-20 15:03:42.471496
>> 00014: FINISHED: convertInputStep, step 1
>> 00015: 2018-06-20 15:03:43.256208
>> 00016: STARTED: runJob, step 2
>> 00017: 2018-06-20 15:03:43.312482
>> 00018: mpirun -np 4 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4
>> 00019: --------------------------------------------------------------------------
>> 00020: The following command line options and corresponding MCA parameter have
>> 00021: been deprecated and replaced as follows:
>> 00022:
>> 00023: Command line options:
>> 00024: Deprecated: --bynode, -bynode
>> 00025: Replacement: --map-by node
>> 00026:
>> 00027: Equivalent MCA parameter:
>> 00028: Deprecated: rmaps_base_bynode
>> 00029: Replacement: rmaps_base_mapping_policy=node
>> 00030:
>> 00031: The deprecated forms *will* disappear in a future version of Open MPI.
>> 00032: Please update to the new syntax.
>> 00033: --------------------------------------------------------------------------
>> 00034: Input images: Runs/000400_XmippProtCL2D/tmp/input_particles.xmd
>> 00035: Output root: level
>> 00036: Output dir: Runs/000400_XmippProtCL2D/extra
>> 00037: Iterations: 10
>> 00038: CodesSel0:
>> 00039: Codes0: 4
>> 00040: Codes: 15
>> 00041: Neighbours: 4
>> 00042: Minimum node size: 20
>> 00043: Use Correlation: 1
>> 00044: Classical Multiref: 1
>> 00045: Classical Split: 0
>> 00046: Maximum shift: 10
>> 00047: Classify all images: 0
>> 00048: Normalize images: 1
>> 00049: Mirror images: 1
>> 00050: Align images: 1
>> 00051: Initializing ...
>> 00052: 0/ 0 sec. ............................................................
>> 00053: Quantizing with 4 codes...
>> 00054: Iteration 1 ...
>> 00055: 13/ 25 sec. ...............................RUNNING PROTOCOL -----------------
>> 00056: PID: 9099
>> 00057: Scipion: v1.1 (2017-06-14) Balbino
>> 00058: currentDir: /data/prouteau/Mano_newdata_frames2-16_DW/TOROID-Sides
>> 00059: workingDir: Runs/000400_XmippProtCL2D
>> 00060: runMode: Continue
>> 00061: MPI: 32
>> 00062: threads: 1
>> 00063: len(steps) 13 len(prevSteps) 13
>> 00064: Starting at step: 2
>> 00065: Running steps
>> 00066: STARTED: runJob, step 2
>> 00067: 2018-06-20 15:04:06.958333
>> 00068: mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4
>> 00069: --------------------------------------------------------------------------
>> 00070: The following command line options and corresponding MCA parameter have
>> 00071: been deprecated and replaced as follows:
>> 00072:
>> 00073: Command line options:
>> 00074: Deprecated: --bynode, -bynode
>> 00075: Replacement: --map-by node
>> 00076:
>> 00077: Equivalent MCA parameter:
>> 00078: Deprecated: rmaps_base_bynode
>> 00079: Replacement: rmaps_base_mapping_policy=node
>> 00080:
>> 00081: The deprecated forms *will* disappear in a future version of Open MPI.
>> 00082: Please update to the new syntax.
>> 00083: --------------------------------------------------------------------------
>> 00084: Input images: Runs/000400_XmippProtCL2D/tmp/input_particles.xmd
>> 00085: Output root: level
>> 00086: Output dir: Runs/000400_XmippProtCL2D/extra
>> 00087: Iterations: 10
>> 00088: CodesSel0:
>> 00089: Codes0: 4
>> 00090: Codes: 15
>> 00091: Neighbours: 4
>> 00092: Minimum node size: 20
>> 00093: Use Correlation: 1
>> 00094: Classical Multiref: 1
>> 00095: Classical Split: 0
>> 00096: Maximum shift: 10
>> 00097: Classify all images: 0
>> 00098: Normalize images: 1
>> 00099: Mirror images: 1
>> 00100: Align images: 1
>> 00101: Initializing ...
>> 00102: 0/ 0 sec. ............................................................
>> 00103: Quantizing with 4 codes...
>> 00104: Iteration 1 ...
>> 00105: 10/ 10 sec. ............................................................
>> 00106:
>> 00107: Average correlation with input vectors=0.0310552
>> 00108: Number of assignment changes=0
>> 00109: Iteration 2 ...
>> 00110: 10/ 10 sec. ............................................................
>> 00111:
>> 00112: Average correlation with input vectors=0.0882044
>> 00113: Number of assignment changes=324
>> 00114: Iteration 3 ...
>> 00115: 10/ 10 sec. ............................................................
>> 00116:
>> 00117: Average correlation with input vectors=0.107101
>> 00118: Number of assignment changes=378
>> 00119: Iteration 4 ...
>> 00120: 9/ 9 sec. ............................................................
>> 00121:
>> 00122: Average correlation with input vectors=0.122994
>> 00123: Number of assignment changes=225
>> 00124: Iteration 5 ...
>> 00125: 10/ 10 sec. ............................................................
>> 00126:
>> 00127: Average correlation with input vectors=0.119519
>> 00128: Number of assignment changes=290
>> 00129: Iteration 6 ...
>> 00130: 9/ 9 sec. ............................................................
>> 00131:
>> 00132: Average correlation with input vectors=0.127653
>> 00133: Number of assignment changes=233
>> 00134: Iteration 7 ...
>> 00135: 10/ 10 sec. ............................................................
>> 00136:
>> 00137: Average correlation with input vectors=0.127296
>> 00138: Number of assignment changes=223
>> 00139: Iteration 8 ...
>> 00140: 9/ 9 sec. ............................................................
>> 00141:
>> 00142: Average correlation with input vectors=0.129356
>> 00143: Number of assignment changes=236
>> 00144: Iteration 9 ...
>> 00145: 10/ 10 sec. ............................................................
>> 00146:
>> 00147: Average correlation with input vectors=0.143878
>> 00148: Number of assignment changes=126
>> 00149: Iteration 10 ...
>> 00150: 9/ 9 sec. ............................................................
>> 00151:
>> 00152: Average correlation with input vectors=0.138916
>> 00153: Number of assignment changes=187
>> 00154: Spliting nodes ...
>> 00155: Currently there are 5 nodes
>> 00156: Currently there are 6 nodes
>> 00157: Currently there are 7 nodes
>> 00158: Currently there are 8 nodes
>> 00159: Quantizing with 8 codes...
>> 00160: Iteration 1 ...
>> 00161: 28/ 28 sec. ............................................................
>> 00162:
>> 00163: Average correlation with input vectors=0.139535
>> 00164: Number of assignment changes=0
>> 00165: Iteration 2 ...
>> 00166: 26/ 26 sec. ............................................................
>> 00167:
>> 00168: Average correlation with input vectors=0.153304
>> 00169: Number of assignment changes=181
>> 00170: Iteration 3 ...
>> 00171: 26/ 26 sec. ............................................................
>> 00172:
>> 00173: Average correlation with input vectors=0.159167
>> 00174: Number of assignment changes=265
>> 00175: Iteration 4 ...
>> 00176: 25/ 25 sec. ............................................................
>> 00177:
>> 00178: Average correlation with input vectors=0.151184
>> 00179: Number of assignment changes=424
>> 00180: Iteration 5 ...
>> 00181: 25/ 25 sec. ............................................................
>> 00182:
>> 00183: Average correlation with input vectors=0.155143
>> 00184: Number of assignment changes=177
>> 00185: Iteration 6 ...
>> 00186: 23/ 23 sec. ............................................................
>> 00187:
>> 00188: Average correlation with input vectors=0.147184
>> 00189: Number of assignment changes=263
>> 00190: Iteration 7 ...
>> 00191: 27/ 27 sec. ............................................................
>> 00192:
>> 00193: Average correlation with input vectors=0.159538
>> 00194: Number of assignment changes=119
>> 00195: Iteration 8 ...
>> 00196: 25/ 25 sec. ............................................................
>> 00197:
>> 00198: Average correlation with input vectors=0.160486
>> 00199: Number of assignment changes=139
>> 00200: Iteration 9 ...
>> 00201: 26/ 26 sec. ............................................................
>> 00202:
>> 00203: Average correlation with input vectors=0.164716
>> 00204: Number of assignment changes=120
>> 00205: Iteration 10 ...
>> 00206: 27/ 27 sec. ............................................................
>> 00207:
>> 00208: Average correlation with input vectors=0.162771
>> 00209: Number of assignment changes=130
>> 00210: Spliting nodes ...
>> 00211: Currently there are 9 nodes
>> 00212: Currently there are 10 nodes
>> 00213: Currently there are 11 nodes
>> 00214: Currently there are 12 nodes
>> 00215: --------------------------------------------------------------------------
>> 00216: mpirun noticed that process rank 11 with PID 9147 on node smaug exited on signal 9 (Killed).
>> 00217: --------------------------------------------------------------------------
>> 00218: Traceback (most recent call last):
>> 00219:   File "/opt/scipion/pyworkflow/protocol/protocol.py", line 182, in run
>> 00220:     self._run()
>> 00221:   File "/opt/scipion/pyworkflow/protocol/protocol.py", line 228, in _run
>> 00222:     resultFiles = self._runFunc()
>> 00223:   File "/opt/scipion/pyworkflow/protocol/protocol.py", line 224, in _runFunc
>> 00224:     return self._func(*self._args)
>> 00225:   File "/opt/scipion/pyworkflow/protocol/protocol.py", line 1077, in runJob
>> 00226:     self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
>> 00227:   File "/opt/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
>> 00228:     env=env, cwd=cwd)
>> 00229:   File "/opt/scipion/pyworkflow/utils/process.py", line 51, in runJob
>> 00230:     return runCommand(command, env, cwd)
>> 00231:   File "/opt/scipion/pyworkflow/utils/process.py", line 65, in runCommand
>> 00232:     check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
>> 00233:   File "/opt/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
>> 00234:     raise CalledProcessError(retcode, cmd)
>> 00235: CalledProcessError: Command 'mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 137
>> 00236: Protocol failed: Command 'mpirun -np 32 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/000400_XmippProtCL2D/tmp/input_particles.xmd --odir Runs/000400_XmippProtCL2D/extra --oroot level --nref 15 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 137
>> 00237: FAILED: runJob, step 2
>> 00238: 2018-06-20 15:31:45.758171
>> 00239: ------------------- PROTOCOL FAILED (DONE 2/13)
>>
>> Thanks in advance for your help,
>>
>> Cheers,
>>
>> *Manoël Prouteau, /Ph.D./*
>> Scientific Collaborator
>> Department of Molecular Biology
>> Sciences III - University of Geneva
>> Quai Ernest Ansermet, 30
>> 1211 Geneve 04
>> Switzerland
>> (+41) 022 379 61 18
>> man...@un...
>> http://www.unige.ch
>>
>> _______________________________________________
>> scipion-users mailing list
>> sci...@li...
>> https://lists.sourceforge.net/lists/listinfo/scipion-users
>
>--
>------------------------------------------------------------------------
>Carlos Oscar Sánchez Sorzano e-mail: co...@cn...
>Biocomputing unit http://i2pc.es/coss
>National Center of Biotechnology (CSIC)
>c/Darwin, 3
>Campus Universidad Autónoma (Cantoblanco) Tlf: 34-91-585 4510
>28049 MADRID (SPAIN) Fax: 34-91-585 4506
>------------------------------------------------------------------------
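
A side note on the quoted log: it ends with "exited on signal 9 (Killed)" and "non-zero exit status 137". By the usual shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 would correspond to signal 9, SIGKILL. That fits a kill from outside the program (the OOM killer or a time/resource limit) rather than an error raised by the program itself. Below is a minimal sketch of that arithmetic; the helper name `decode_exit_status` is made up for illustration and is not part of Scipion or Xmipp.

```python
import signal


def decode_exit_status(status):
    """Return the signal name encoded in a shell-style exit status, or None.

    Assumes the common convention that a process killed by signal N is
    reported with exit status 128 + N.
    """
    if status <= 128:
        return None  # ordinary exit code, no signal involved
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


if __name__ == "__main__":
    # 137 is the status reported at the end of the quoted run.stdout.
    print(decode_exit_status(137))
```

If the OOM killer was responsible, the kernel log (for example via `dmesg`) on the node would normally contain an "Out of memory" entry naming the killed PID around the time of the failure. That is worth checking before changing any parameters.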