From: ldelcano <lde...@cn...> - 2017-01-23 09:13:13
Sorry Valerie, I was away last week and could not look at your problem again. Just a couple more questions for you: first, have you installed the Scipion binaries or compiled the source on your machine? Are you running on a cluster or on a single machine? I believe you are running CL2D in your own project, right? Could you run this test?

scipion test tests.em.protocols.test_protocols_xmipp_2d.TestXmippCL2D

thanks

Laura
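P.S. It may also help to check whether the mpirun that the Scipion environment resolves is the same one your login shell uses. A minimal check (these exact invocations are only a suggestion; `scipion run` is used the same way as in your commands below):

scipion run which mpirun
which mpirun
mpirun --version

If the two `which` results differ, Scipion and your shell are launching different Open MPI installations.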
On 23/01/17 09:44, Valerie Biou wrote:
> Hello
>
> sorry to insist, but I haven't solved the problem.
>
> Here is the result of the commands that Laura asked me to type:
>
> biou@vblinux:~$ scipion run mpirun -np 4 hostname
>
> Scipion v1.0.1 (2016-06-30) Augusto
>
> >>>>> "mpirun" "-np" "4" "hostname"
> vblinux
> vblinux
> vblinux
> vblinux
>
> biou@vblinux:~$ mpirun hostname
> vblinux
> vblinux
> vblinux
> vblinux
>
> Thanks!
>
> Valérie
>
>>> On 18 Jan 2017, at 14:46, ldelcano <lde...@cn...> wrote:
>>>
>>> Hi Valerie,
>>>
>>> it seems a problem with openmpi, can you run
>>>
>>> ./scipion run mpirun -np 4 hostname
>>>
>>> and just:
>>>
>>> mpirun hostname
>>>
>>> thanks
>>>
>>> Laura
>>>
>>> On 18/01/17 11:06, Valerie Biou wrote:
>>>> Dear all,
>>>>
>>>> I have installed the latest Scipion version on a Linux machine running Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-59-generic x86_64).
>>>> The install tests have run OK, but I have a recurrent problem with CL2D:
>>>>
>>>> At first it was looking for libmpi.so.1, so I created a symbolic link:
>>>>
>>>> lrwxrwxrwx 1 root root 12 janv. 17 17:25 libmpi.so.1 -> libmpi.so.12
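>>>>
>>>> (Aside: libmpi.so.1 and libmpi.so.12 belong to different Open MPI release series, so a rename like this can hide an ABI mismatch rather than fix it. A minimal check, assuming the Xmipp binaries are on the PATH, for example inside the Scipion environment, is to compare the library the failing program actually loads with the launcher's version:
>>>>
>>>> ldd $(which xmipp_mpi_classify_CL2D) | grep libmpi
>>>> mpirun --version
>>>>
>>>> If ldd reports a different soversion than the Open MPI release that mpirun belongs to, the two come from different installations.)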
>>>>
>>>> Now it fails with the message below.
>>>>
>>>> Can you help me fix this, please?
>>>>
>>>> Best regards,
>>>> Valerie
>>>>
>>>> 00001: RUNNING PROTOCOL -----------------
>>>> 00002: Scipion: v1.0.1
>>>> 00003: currentDir: /home/biou/ScipionUserData/projects/EX_LMNG
>>>> 00004: workingDir: Runs/001509_XmippProtCL2D
>>>> 00005: runMode: Restart
>>>> 00006: MPI: 2
>>>> 00007: threads: 1
>>>> 00008: Starting at step: 1
>>>> 00009: Running steps
>>>> 00010: STARTED: convertInputStep, step 1
>>>> 00011: 2017-01-18 10:28:06.919867
>>>> 00012: FINISHED: convertInputStep, step 1
>>>> 00013: 2017-01-18 10:28:14.324241
>>>> 00014: STARTED: runJob, step 2
>>>> 00015: 2017-01-18 10:28:14.436293
>>>> 00016: mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4
>>>> 00017: --------------------------------------------------------------------------
>>>> 00018: The following command line options and corresponding MCA parameter have
>>>> 00019: been deprecated and replaced as follows:
>>>> 00020:
>>>> 00021: Command line options:
>>>> 00022: Deprecated: --bynode, -bynode
>>>> 00023: Replacement: --map-by node
>>>> 00024:
>>>> 00025: Equivalent MCA parameter:
>>>> 00026: Deprecated: rmaps_base_bynode
>>>> 00027: Replacement: rmaps_base_mapping_policy=node
>>>> 00028:
>>>> 00029: The deprecated forms *will* disappear in a future version of Open MPI.
>>>> 00030: Please update to the new syntax.
>>>> 00031: --------------------------------------------------------------------------
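>>>>
>>>> (The deprecation notice above is only a warning, not the failure itself. Following the replacement it suggests, the launch line from 00016 would become, with all other arguments unchanged:
>>>>
>>>> mpirun -np 2 --map-by node `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4
>>>>
>>>> The actual error starts with the component messages below.)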
>>>> 00032: --------------------------------------------------------------------------
>>>> 00033: A requested component was not found, or was unable to be opened. This
>>>> 00034: means that this component is either not installed or is unable to be
>>>> 00035: used on your system (e.g., sometimes this means that shared libraries
>>>> 00036: that the component requires are unable to be found/loaded). Note that
>>>> 00037: Open MPI stopped checking at the first component that it did not find.
>>>> 00038:
>>>> 00039: Host: vblinux
>>>> 00040: Framework: ess
>>>> 00041: Component: pmi
>>>> 00042: --------------------------------------------------------------------------
>>>> 00043: --------------------------------------------------------------------------
>>>> 00044: A requested component was not found, or was unable to be opened. This
>>>> 00045: means that this component is either not installed or is unable to be
>>>> 00046: used on your system (e.g., sometimes this means that shared libraries
>>>> 00047: that the component requires are unable to be found/loaded). Note that
>>>> 00048: Open MPI stopped checking at the first component that it did not find.
>>>> 00049:
>>>> 00050: Host: vblinux
>>>> 00051: Framework: ess
>>>> 00052: Component: pmi
>>>> 00053: --------------------------------------------------------------------------
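>>>>
>>>> (ess is Open MPI's environment-specific startup framework; a "pmi" component that cannot be found at this stage often means the processes are loading libraries from a different Open MPI installation than the mpirun that launched them, or from a build without that component. To list the ess components an installation actually provides, a minimal check, assuming ompi_info comes from the same installation as mpirun:
>>>>
>>>> ompi_info | grep "MCA ess"
>>>> )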
>>>> 00054: [vblinux:02124] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
>>>> 00055: [vblinux:02123] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
>>>> 00056: --------------------------------------------------------------------------
>>>> 00057: It looks like orte_init failed for some reason; your parallel process is
>>>> 00058: likely to abort. There are many reasons that a parallel process can
>>>> 00059: fail during orte_init; some of which are due to configuration or
>>>> 00060: environment problems. This failure appears to be an internal failure;
>>>> 00061: here's some additional information (which may only be relevant to an
>>>> 00062: Open MPI developer):
>>>> 00063:
>>>> 00064: orte_ess_base_open failed
>>>> 00065: --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>>> 00066: --------------------------------------------------------------------------
>>>> 00067: --------------------------------------------------------------------------
>>>> 00068: It looks like orte_init failed for some reason; your parallel process is
>>>> 00069: likely to abort. There are many reasons that a parallel process can
>>>> 00070: fail during orte_init; some of which are due to configuration or
>>>> 00071: environment problems. This failure appears to be an internal failure;
>>>> 00072: here's some additional information (which may only be relevant to an
>>>> 00073: Open MPI developer):
>>>> 00074:
>>>> 00075: orte_ess_base_open failed
>>>> 00076: --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>>> 00077: --------------------------------------------------------------------------
>>>> 00078: *** An error occurred in MPI_Init
>>>> 00079: *** on a NULL communicator
>>>> 00080: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> 00081: *** and potentially your MPI job)
>>>> 00082: *** An error occurred in MPI_Init
>>>> 00083: *** on a NULL communicator
>>>> 00084: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> 00085: *** and potentially your MPI job)
>>>> 00086: --------------------------------------------------------------------------
>>>> 00087: It looks like MPI_INIT failed for some reason; your parallel process is
>>>> 00088: likely to abort. There are many reasons that a parallel process can
>>>> 00089: fail during MPI_INIT; some of which are due to configuration or environment
>>>> 00090: problems. This failure appears to be an internal failure; here's some
>>>> 00091: additional information (which may only be relevant to an Open MPI
>>>> 00092: developer):
>>>> 00093:
>>>> 00094: ompi_mpi_init: ompi_rte_init failed
>>>> 00095: --> Returned "Not found" (-13) instead of "Success" (0)
>>>> 00096: --------------------------------------------------------------------------
>>>> 00097: --------------------------------------------------------------------------
>>>> 00098: It looks like MPI_INIT failed for some reason; your parallel process is
>>>> 00099: likely to abort. There are many reasons that a parallel process can
>>>> 00100: fail during MPI_INIT; some of which are due to configuration or environment
>>>> 00101: problems. This failure appears to be an internal failure; here's some
>>>> 00102: additional information (which may only be relevant to an Open MPI
>>>> 00103: developer):
>>>> 00104:
>>>> 00105: ompi_mpi_init: ompi_rte_init failed
>>>> 00106: --> Returned "Not found" (-13) instead of "Success" (0)
>>>> 00107: --------------------------------------------------------------------------
>>>> 00108: [vblinux:2124] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>> 00109: [vblinux:2123] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>> 00110: -------------------------------------------------------
>>>> 00111: Primary job terminated normally, but 1 process returned
>>>> 00112: a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> 00113: -------------------------------------------------------
>>>> 00114: --------------------------------------------------------------------------
>>>> 00115: mpirun detected that one or more processes exited with non-zero status, thus causing
>>>> 00116: the job to be terminated. The first process to do so was:
>>>> 00117:
>>>> 00118: Process name: [[63772,1],0]
>>>> 00119: Exit code: 1
>>>> 00120: --------------------------------------------------------------------------
>>>> 00121: Traceback (most recent call last):
>>>> 00122: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 167, in run
>>>> 00123: self._run()
>>>> 00124: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 211, in _run
>>>> 00125: resultFiles = self._runFunc()
>>>> 00126: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 207, in _runFunc
>>>> 00127: return self._func(*self._args)
>>>> 00128: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 960, in runJob
>>>> 00129: self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
>>>> 00130: File "/usr/local/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
>>>> 00131: env=env, cwd=cwd)
>>>> 00132: File "/usr/local/scipion/pyworkflow/utils/process.py", line 51, in runJob
>>>> 00133: return runCommand(command, env, cwd)
>>>> 00134: File "/usr/local/scipion/pyworkflow/utils/process.py", line 65, in runCommand
>>>> 00135: check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
>>>> 00136: File "/usr/local/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
>>>> 00137: raise CalledProcessError(retcode, cmd)
>>>> 00138: CalledProcessError: Command 'mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 1
>>>> 00139: Protocol failed: Command 'mpirun -np 2 -bynode `which xmipp_mpi_classify_CL2D` -i Runs/001509_XmippProtCL2D/extra/images.xmd --odir Runs/001509_XmippProtCL2D/extra --oroot level --nref 20 --iter 10 --distance correlation --classicalMultiref --nref0 4' returned non-zero exit status 1
>>>> 00140: FAILED: runJob, step 2
>>>> 00141: 2017-01-18 10:28:14.966673
>>>> 00142: Cleaning temporarly files....
>>>> 00143: ------------------- PROTOCOL FAILED (DONE 2/13)
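For what it's worth, ORTE_ERROR_LOG "Not found" failures appearing right after a hand-made libmpi symlink usually point to two Open MPI installations being mixed, for example a system Open MPI 1.10 launcher with libraries from an older series. A minimal sketch to inventory what is installed (assuming a Debian/Ubuntu machine, as in the report above):

which -a mpirun
mpirun --version
ldconfig -p | grep libmpi
dpkg -l | grep -i openmpi

If the Scipion installation ships its own MPI, removing the manual symlink and letting Scipion resolve its bundled libraries is generally safer than linking across soversions.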