From: Valerie B. <val...@ib...> - 2017-01-23 08:44:35
Hello,

Sorry to insist, but I haven't solved the problem. Here is the result of the commands that Laura asked me to type:

    biou@vblinux:~$ scipion run mpirun -np 4 hostname
    Scipion v1.0.1 (2016-06-30) Augusto
    >>>>> "mpirun" "-np" "4" "hostname"
    vblinux
    vblinux
    vblinux
    vblinux

    biou@vblinux:~$ mpirun hostname
    vblinux
    vblinux
    vblinux
    vblinux

Thanks!
Valérie
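Since both launches above succeed with a trivial program, mpirun itself looks fine and the failure seems specific to the Xmipp binary. A minimal way to check which MPI library that binary actually resolves, assuming a default layout (the example path below is only a guess based on the /usr/local/scipion install shown in the traceback further down):

    # Locate the binary inside Scipion's environment, where it is on the
    # PATH (as the protocol log's own `which` call shows):
    scipion run which xmipp_mpi_classify_CL2D

    # List the MPI libraries the dynamic linker resolves for it,
    # substituting the path printed above; this one is an assumption:
    ldd /usr/local/scipion/software/em/xmipp/bin/xmipp_mpi_classify_CL2D | grep -i mpi

    # For comparison, the Open MPI release that provides mpirun on the PATH:
    mpirun --version

If the libmpi reported by ldd is not the system libmpi.so.12 (or fails to resolve), the binary and the runtime come from different Open MPI installs.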
>> On 18 Jan 2017, at 14:46, ldelcano <lde...@cn...> wrote:
>>
>> Hi Valerie,
>>
>> It seems to be a problem with Open MPI. Can you run
>>
>> ./scipion run mpirun -np 4 hostname
>>
>> and just:
>>
>> mpirun hostname
>>
>> Thanks,
>>
>> Laura
>>
>>
>> On 18/01/17 11:06, Valerie Biou wrote:
>>> Dear all,
>>>
>>> I have installed the latest Scipion version on a Linux machine (Ubuntu 16.04.1 LTS, GNU/Linux 4.4.0-59-generic x86_64).
>>> The install tests ran OK, but I have a recurrent problem with CL2D:
>>>
>>> At first it was looking for libmpi.so.1, so I created a symbolic link:
>>>
>>> lrwxrwxrwx 1 root root 12 Jan 17 17:25 libmpi.so.1 -> libmpi.so.12
>>>
>>> Now it fails with the message below.
>>>
>>> Can you help me fix this, please?
>>>
>>> Best regards,
>>> Valerie
>>>
>>>
>>> 00001: RUNNING PROTOCOL -----------------
>>> 00002: Scipion: v1.0.1
>>> 00003: currentDir: /home/biou/ScipionUserData/projects/EX_LMNG
>>> 00004: workingDir: Runs/001509_XmippProtCL2D
>>> 00005: runMode: Restart
>>> 00006: MPI: 2
>>> 00007: threads: 1
>>> 00008: Starting at step: 1
>>> 00009: Running steps
>>> 00010: STARTED: convertInputStep, step 1
>>> 00011: 2017-01-18 10:28:06.919867
>>> 00012: FINISHED: convertInputStep, step 1
>>> 00013: 2017-01-18 10:28:14.324241
>>> 00014: STARTED: runJob, step 2
>>> 00015: 2017-01-18 10:28:14.436293
>>> 00016: mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4
>>> 00017: --------------------------------------------------------------------------
>>> 00018: The following command line options and corresponding MCA parameter have
>>> 00019: been deprecated and replaced as follows:
>>> 00020:
>>> 00021: Command line options:
>>> 00022: Deprecated: --bynode, -bynode
>>> 00023: Replacement: --map-by node
>>> 00024:
>>> 00025: Equivalent MCA parameter:
>>> 00026: Deprecated: rmaps_base_bynode
>>> 00027: Replacement: rmaps_base_mapping_policy=node
>>> 00028:
>>> 00029: The deprecated forms *will* disappear in a future version of Open MPI.
>>> 00030: Please update to the new syntax.
>>> 00031: --------------------------------------------------------------------------
>>> 00032: --------------------------------------------------------------------------
>>> 00033: A requested component was not found, or was unable to be opened. This
>>> 00034: means that this component is either not installed or is unable to be
>>> 00035: used on your system (e.g., sometimes this means that shared libraries
>>> 00036: that the component requires are unable to be found/loaded). Note that
>>> 00037: Open MPI stopped checking at the first component that it did not find.
>>> 00038:
>>> 00039: Host: vblinux
>>> 00040: Framework: ess
>>> 00041: Component: pmi
>>> 00042: --------------------------------------------------------------------------
>>> 00043: --------------------------------------------------------------------------
>>> 00044: A requested component was not found, or was unable to be opened. This
>>> 00045: means that this component is either not installed or is unable to be
>>> 00046: used on your system (e.g., sometimes this means that shared libraries
>>> 00047: that the component requires are unable to be found/loaded). Note that
>>> 00048: Open MPI stopped checking at the first component that it did not find.
>>> 00049:
>>> 00050: Host: vblinux
>>> 00051: Framework: ess
>>> 00052: Component: pmi
>>> 00053: --------------------------------------------------------------------------
>>> 00054: [vblinux:02124] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
>>> 00055: [vblinux:02123] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
>>> 00056: --------------------------------------------------------------------------
>>> 00057: It looks like orte_init failed for some reason; your parallel process is
>>> 00058: likely to abort. There are many reasons that a parallel process can
>>> 00059: fail during orte_init; some of which are due to configuration or
>>> 00060: environment problems. This failure appears to be an internal failure;
>>> 00061: here's some additional information (which may only be relevant to an
>>> 00062: Open MPI developer):
>>> 00063:
>>> 00064: orte_ess_base_open failed
>>> 00065: --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>> 00066: --------------------------------------------------------------------------
>>> 00067: --------------------------------------------------------------------------
>>> 00068: It looks like orte_init failed for some reason; your parallel process is
>>> 00069: likely to abort. There are many reasons that a parallel process can
>>> 00070: fail during orte_init; some of which are due to configuration or
>>> 00071: environment problems. This failure appears to be an internal failure;
>>> 00072: here's some additional information (which may only be relevant to an
>>> 00073: Open MPI developer):
>>> 00074:
>>> 00075: orte_ess_base_open failed
>>> 00076: --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>> 00077: --------------------------------------------------------------------------
>>> 00078: *** An error occurred in MPI_Init
>>> 00079: *** on a NULL communicator
>>> 00080: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> 00081: *** and potentially your MPI job)
>>> 00082: *** An error occurred in MPI_Init
>>> 00083: *** on a NULL communicator
>>> 00084: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> 00085: *** and potentially your MPI job)
>>> 00086: --------------------------------------------------------------------------
>>> 00087: It looks like MPI_INIT failed for some reason; your parallel process is
>>> 00088: likely to abort. There are many reasons that a parallel process can
>>> 00089: fail during MPI_INIT; some of which are due to configuration or environment
>>> 00090: problems. This failure appears to be an internal failure; here's some
>>> 00091: additional information (which may only be relevant to an Open MPI
>>> 00092: developer):
>>> 00093:
>>> 00094: ompi_mpi_init: ompi_rte_init failed
>>> 00095: --> Returned "Not found" (-13) instead of "Success" (0)
>>> 00096: --------------------------------------------------------------------------
>>> 00097: --------------------------------------------------------------------------
>>> 00098: It looks like MPI_INIT failed for some reason; your parallel process is
>>> 00099: likely to abort. There are many reasons that a parallel process can
>>> 00100: fail during MPI_INIT; some of which are due to configuration or environment
>>> 00101: problems. This failure appears to be an internal failure; here's some
>>> 00102: additional information (which may only be relevant to an Open MPI
>>> 00103: developer):
>>> 00104:
>>> 00105: ompi_mpi_init: ompi_rte_init failed
>>> 00106: --> Returned "Not found" (-13) instead of "Success" (0)
>>> 00107: --------------------------------------------------------------------------
>>> 00108: [vblinux:2124] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>> 00109: [vblinux:2123] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>> 00110: -------------------------------------------------------
>>> 00111: Primary job terminated normally, but 1 process returned
>>> 00112: a non-zero exit code.. Per user-direction, the job has been aborted.
>>> 00113: -------------------------------------------------------
>>> 00114: --------------------------------------------------------------------------
>>> 00115: mpirun detected that one or more processes exited with non-zero status, thus causing
>>> 00116: the job to be terminated. The first process to do so was:
>>> 00117:
>>> 00118: Process name: [[63772,1],0]
>>> 00119: Exit code: 1
>>> 00120: --------------------------------------------------------------------------
>>> 00121: Traceback (most recent call last):
>>> 00122: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 167, in run
>>> 00123: self._run()
>>> 00124: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 211, in _run
>>> 00125: resultFiles = self._runFunc()
>>> 00126: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 207, in _runFunc
>>> 00127: return self._func(*self._args)
>>> 00128: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 960, in runJob
>>> 00129: self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
>>> 00130: File "/usr/local/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
>>> 00131: env=env, cwd=cwd)
>>> 00132: File "/usr/local/scipion/pyworkflow/utils/process.py", line 51, in runJob
>>> 00133: return runCommand(command, env, cwd)
>>> 00134: File "/usr/local/scipion/pyworkflow/utils/process.py", line 65, in runCommand
>>> 00135: check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
>>> 00136: File "/usr/local/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
>>> 00137: raise CalledProcessError(retcode, cmd)
>>> 00138: CalledProcessError: Command 'mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 1
>>> 00139: Protocol failed: Command 'mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 1
>>> 00140: FAILED: runJob, step 2
>>> 00141: 2017-01-18 10:28:14.966673
>>> 00142: Cleaning temporarly files....
>>> 00143: ------------------- PROTOCOL FAILED (DONE 2/13)
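For reference, a symbolic link like the one described in the original message would have been created with something along these lines; the directory is an assumption based on where Ubuntu 16.04 keeps its Open MPI libraries:

    # Assumed library directory on Ubuntu 16.04; adjust to wherever
    # libmpi.so.12 actually lives on the machine.
    cd /usr/lib/x86_64-linux-gnu
    sudo ln -s libmpi.so.12 libmpi.so.1

Note, though, that the soname change from libmpi.so.1 to libmpi.so.12 signals an ABI break between Open MPI releases, so a binary built against one and forced to load the other can fail exactly as the log above shows: MPI_Init aborts while the ess framework looks for a component (pmi) that the mismatched runtime cannot open. The usual fix is to rebuild Xmipp/Scipion against the system Open MPI, or to install the Open MPI release the binary was actually linked with, rather than symlinking across versions.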
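Separately, the deprecation warning at the top of the log is harmless but easy to silence. With -bynode replaced by its modern spelling, the command from log line 00016 would read as follows; only the mpirun flag changes, the Xmipp arguments are copied verbatim from the log, and this does not address the MPI_Init failure itself:

    mpirun -np 2 --map-by node `which xmipp_mpi_classify_CL2D` \
        -i Runs/001509_XmippProtCL2D/extra/images.xmd \
        --odir Runs/001509_XmippProtCL2D/extra --oroot level \
        --nref 20 --iter 10 --distance correlation \
        --classicalMultiref --nref0 4

Since Scipion generates this command, the flag would have to be changed in the parallel-command template of Scipion's host configuration (config/hosts.conf in a default install, if memory serves) rather than on the command line.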