From: ldelcano <lde...@cn...> - 2017-01-18 13:46:59
Hi Valerie,

It seems to be a problem with Open MPI. Can you run:

    ./scipion run mpirun -np 4 hostname

and just:

    mpirun hostname

See also the two sketches after the quoted log below.

Thanks,
Laura

On 18/01/17 11:06, Valerie Biou wrote:
> Dear all,
>
> I have installed the latest Scipion version on a Linux machine
> running Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-59-generic x86_64).
> The install tests ran OK, but I have a recurrent problem with CL2D:
>
> at first it was looking for libmpi.so.1, so I created a symbolic link:
> lrwxrwxrwx 1 root root 12 janv. 17 17:25 libmpi.so.1 -> libmpi.so.12
>
> Now it fails with the message below.
>
> Can you help me fix this, please?
>
> Best regards,
> Valerie
>
>
> 00001: RUNNING PROTOCOL -----------------
> 00002: Scipion: v1.0.1
> 00003: currentDir: /home/biou/ScipionUserData/projects/EX_LMNG
> 00004: workingDir: Runs/001509_XmippProtCL2D
> 00005: runMode: Restart
> 00006: MPI: 2
> 00007: threads: 1
> 00008: Starting at step: 1
> 00009: Running steps
> 00010: STARTED: convertInputStep, step 1
> 00011: 2017-01-18 10:28:06.919867
> 00012: FINISHED: convertInputStep, step 1
> 00013: 2017-01-18 10:28:14.324241
> 00014: STARTED: runJob, step 2
> 00015: 2017-01-18 10:28:14.436293
> 00016: mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4
> 00017: --------------------------------------------------------------------------
> 00018: The following command line options and corresponding MCA parameter have
> 00019: been deprecated and replaced as follows:
> 00020:
> 00021: Command line options:
> 00022: Deprecated: --bynode, -bynode
> 00023: Replacement: --map-by node
> 00024:
> 00025: Equivalent MCA parameter:
> 00026: Deprecated: rmaps_base_bynode
> 00027: Replacement: rmaps_base_mapping_policy=node
> 00028:
> 00029: The deprecated forms *will* disappear in a future version of Open MPI.
> 00030: Please update to the new syntax.
> 00031: --------------------------------------------------------------------------
> 00032: --------------------------------------------------------------------------
> 00033: A requested component was not found, or was unable to be opened. This
> 00034: means that this component is either not installed or is unable to be
> 00035: used on your system (e.g., sometimes this means that shared libraries
> 00036: that the component requires are unable to be found/loaded). Note that
> 00037: Open MPI stopped checking at the first component that it did not find.
> 00038:
> 00039: Host: vblinux
> 00040: Framework: ess
> 00041: Component: pmi
> 00042: --------------------------------------------------------------------------
> 00043: --------------------------------------------------------------------------
> 00044: A requested component was not found, or was unable to be opened. This
> 00045: means that this component is either not installed or is unable to be
> 00046: used on your system (e.g., sometimes this means that shared libraries
> 00047: that the component requires are unable to be found/loaded). Note that
> 00048: Open MPI stopped checking at the first component that it did not find.
> 00049:
> 00050: Host: vblinux
> 00051: Framework: ess
> 00052: Component: pmi
> 00053: --------------------------------------------------------------------------
> 00054: [vblinux:02124] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
> 00055: [vblinux:02123] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
> 00056: --------------------------------------------------------------------------
> 00057: It looks like orte_init failed for some reason; your parallel process is
> 00058: likely to abort. There are many reasons that a parallel process can
> 00059: fail during orte_init; some of which are due to configuration or
> 00060: environment problems. This failure appears to be an internal failure;
> 00061: here's some additional information (which may only be relevant to an
> 00062: Open MPI developer):
> 00063:
> 00064: orte_ess_base_open failed
> 00065: --> Returned value Not found (-13) instead of ORTE_SUCCESS
> 00066: --------------------------------------------------------------------------
> 00067: --------------------------------------------------------------------------
> 00068: It looks like orte_init failed for some reason; your parallel process is
> 00069: likely to abort. There are many reasons that a parallel process can
> 00070: fail during orte_init; some of which are due to configuration or
> 00071: environment problems. This failure appears to be an internal failure;
> 00072: here's some additional information (which may only be relevant to an
> 00073: Open MPI developer):
> 00074:
> 00075: orte_ess_base_open failed
> 00076: --> Returned value Not found (-13) instead of ORTE_SUCCESS
> 00077: --------------------------------------------------------------------------
> 00078: *** An error occurred in MPI_Init
> 00079: *** on a NULL communicator
> 00080: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> 00081: *** and potentially your MPI job)
> 00082: *** An error occurred in MPI_Init
> 00083: *** on a NULL communicator
> 00084: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> 00085: *** and potentially your MPI job)
> 00086: --------------------------------------------------------------------------
> 00087: It looks like MPI_INIT failed for some reason; your parallel process is
> 00088: likely to abort. There are many reasons that a parallel process can
> 00089: fail during MPI_INIT; some of which are due to configuration or environment
> 00090: problems. This failure appears to be an internal failure; here's some
> 00091: additional information (which may only be relevant to an Open MPI
> 00092: developer):
> 00093:
> 00094: ompi_mpi_init: ompi_rte_init failed
> 00095: --> Returned "Not found" (-13) instead of "Success" (0)
> 00096: --------------------------------------------------------------------------
> 00097: --------------------------------------------------------------------------
> 00098: It looks like MPI_INIT failed for some reason; your parallel process is
> 00099: likely to abort. There are many reasons that a parallel process can
> 00100: fail during MPI_INIT; some of which are due to configuration or environment
> 00101: problems. This failure appears to be an internal failure; here's some
> 00102: additional information (which may only be relevant to an Open MPI
> 00103: developer):
> 00104:
> 00105: ompi_mpi_init: ompi_rte_init failed
> 00106: --> Returned "Not found" (-13) instead of "Success" (0)
> 00107: --------------------------------------------------------------------------
> 00108: [vblinux:2124] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> 00109: [vblinux:2123] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> 00110: -------------------------------------------------------
> 00111: Primary job terminated normally, but 1 process returned
> 00112: a non-zero exit code.. Per user-direction, the job has been aborted.
> 00113: -------------------------------------------------------
> 00114: --------------------------------------------------------------------------
> 00115: mpirun detected that one or more processes exited with non-zero status, thus causing
> 00116: the job to be terminated. The first process to do so was:
> 00117:
> 00118: Process name: [[63772,1],0]
> 00119: Exit code: 1
> 00120: --------------------------------------------------------------------------
> 00121: Traceback (most recent call last):
> 00122: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 167, in run
> 00123: self._run()
> 00124: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 211, in _run
> 00125: resultFiles = self._runFunc()
> 00126: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 207, in _runFunc
> 00127: return self._func(*self._args)
> 00128: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 960, in runJob
> 00129: self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
> 00130: File "/usr/local/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
> 00131: env=env, cwd=cwd)
> 00132: File "/usr/local/scipion/pyworkflow/utils/process.py", line 51, in runJob
> 00133: return runCommand(command, env, cwd)
> 00134: File "/usr/local/scipion/pyworkflow/utils/process.py", line 65, in runCommand
> 00135: check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
> 00136: File "/usr/local/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
> 00137: raise CalledProcessError(retcode, cmd)
> 00138: CalledProcessError: Command 'mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 1
> 00139: Protocol failed: Command 'mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 1
> 00140: FAILED: runJob, step 2
> 00141: 2017-01-18 10:28:14.966673
> 00142: Cleaning temporarly files....
> 00143: ------------------- PROTOCOL FAILED (DONE 2/13)
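First sketch, as promised above. The symlink libmpi.so.1 -> libmpi.so.12
makes a binary built against an older Open MPI ABI load a newer library,
and that kind of mismatch is a plausible cause of the "ess/pmi component
not found" errors in your log. These commands are only a sketch: the
program name xmipp_mpi_classify_CL2D is taken from the failing command in
your log, and ldd/grep is just the standard way to see which libmpi a
binary resolves to, so adapt paths to your setup:

    # mpirun version inside and outside the Scipion environment
    ./scipion run mpirun --version
    mpirun --version

    # which MPI shared libraries the CL2D binary resolves to
    # (run it through ./scipion run if the binary is not on your PATH)
    ldd `which xmipp_mpi_classify_CL2D` | grep -i mpi

If the two mpirun versions differ, the job is being launched with one
Open MPI while the binary is linked against another, which would explain
the failure better than a missing library.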
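Second sketch: the deprecation warning at log lines 00018-00030 means the
-bynode flag still works on your Open MPI, so it is not what breaks the
run, but you can silence it once the MPI problem is solved. The mpirun
template Scipion uses comes from the host configuration file, hosts.conf
(look under config/ in your Scipion directory or in ~/.config/scipion/;
treat the exact location and the variable shown as assumptions and verify
against your own file). A sketch of the change:

    # hosts.conf: replace the deprecated -bynode with the new syntax
    PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d --map-by node %_(COMMAND)s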