#1 Problem working with OpenMPI 1.3.3

Status: closed-fixed
Owner: nobody
Labels: None
Updated: 2014-08-13
Created: 2009-10-22
Creator: Anonymous
Private: No

I was testing DMTCP with an MPI application built against the OpenMPI library, and I have two observations.
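
For reference, this is roughly the workflow I use (a minimal sketch; the process count and the checkpoint-request step are illustrative of a standard DMTCP 1.x setup):

    # Start the DMTCP coordinator in one terminal (it listens on port 7779 by default).
    dmtcp_coordinator

    # In another terminal, launch the MPI job under DMTCP control.
    dmtcp_checkpoint mpirun -np 4 ./navcalcmpi

    # A checkpoint can be requested by typing 'c' in the coordinator console;
    # a restart then uses the script generated in the working directory.
    ./dmtcp_restart_script.sh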

First, when run under the control of DMTCP (with no checkpointing done in this case), the application terminates abnormally with signal 11 and the following stack trace:

Write 30 / 33
[hpc-t7500:10963] *** Process received signal ***
[hpc-t7500:10963] Signal: Segmentation fault (11)
[hpc-t7500:10963] Signal code: Address not mapped (1)
[hpc-t7500:10963] Failing at address: 0x2aaaadb5f288
[hpc-t7500:10963] [ 0] /lib64/libc.so.6 [0x3ac2c30280]
[hpc-t7500:10963] [ 1] /opt/openmpi/lib/libopen-pal.so.0 [0x2b8cb7ca1981]
[hpc-t7500:10963] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_event_del_i+0xd0) [0x2b8cb7ca1c50]
[hpc-t7500:10963] [ 3] /opt/openmpi/lib/openmpi/mca_oob_tcp.so [0x2aaaab2c24a4]
[hpc-t7500:10963] [ 4] /opt/openmpi/lib/openmpi/mca_oob_tcp.so [0x2aaaab2bf2ab]
[hpc-t7500:10963] [ 5] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_base_close+0x5f) [0x2b8cb7a6d90f]
[hpc-t7500:10963] [ 6] /opt/openmpi/lib/libopen-pal.so.0(mca_base_components_close+0x83) [0x2b8cb7ca6b23]
[hpc-t7500:10963] [ 7] /opt/openmpi/lib/libopen-rte.so.0(orte_rml_base_close+0x91) [0x2b8cb7a73321]
[hpc-t7500:10963] [ 8] /opt/openmpi/lib/libopen-rte.so.0(orte_ess_base_app_finalize+0x2a) [0x2b8cb7a62d8a]
[hpc-t7500:10963] [ 9] /opt/openmpi/lib/openmpi/mca_ess_env.so [0x2aaaaaeb4751]
[hpc-t7500:10963] [10] /opt/openmpi/lib/libopen-rte.so.0(orte_finalize+0x49) [0x2b8cb7a4be19]
[hpc-t7500:10963] [11] /opt/openmpi/lib/libmpi.so.0 [0x2b8cb77d3774]
[hpc-t7500:10963] [12] ./navcalcmpi(_ZN10NavierCalcD1Ev+0xa5) [0x419535]
[hpc-t7500:10963] [13] ./navcalcmpi(main+0x3e) [0x41ce0e]
[hpc-t7500:10963] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3ac2c1d974]
[hpc-t7500:10963] [15] ./navcalcmpi(__gxx_personality_v0+0x169) [0x40a159]
[hpc-t7500:10963] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10963 on node hpc-t7500 exited on signal 11 (Segmentation fault).

Second, when I checkpoint the OpenMPI application and then try to restart it from the checkpoint files, I get the following messages:

Message: socket type not yet [fully] supported
[8885] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
id() = 2ad5f1d1891cbb8-2113-4add7576(99085)
_sockDomain = 10
_sockType = 1
_sockProtocol = 0
Message: socket type not yet [fully] supported
[8894] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
i->second = 575
(strerror((*__errno_location ()))) = Bad file descriptor
[8894] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
i->second = 582
(strerror((*__errno_location ()))) = Bad file descriptor
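
For what it's worth, on Linux the values in the first warning decode to AF_INET6 (_sockDomain = 10) and SOCK_STREAM (_sockType = 1), so the connection DMTCP refuses to restore appears to be an IPv6 TCP socket, presumably opened by OpenMPI's TCP machinery. A diagnostic sketch for confirming this before checkpointing (the PID is taken from the warning and is illustrative):

    # List the socket descriptors held by the process.
    ls -l /proc/8885/fd | grep socket
    # Any IPv6 TCP sockets appear in the kernel's tcp6 table.
    cat /proc/net/tcp6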

Is OpenMPI not yet fully supported by DMTCP, or am I missing something?

Thanks in advance,

Balwant

Discussion

  • Kapil Arya - 2009-10-29

    Sorry for the delayed response.
    DMTCP 1.1.0 has been released, and OpenMPI should work with the latest release. Please try it out, and let us know if you run into any problems.

     
  • Gene Cooperman - 2009-10-31

    Please also note (as indicated in our news announcement) that DMTCP 1.1.0 has a bug that sometimes appears under heavy CPU load. A DMTCP 1.1.1 release that fixes this will follow in the next day or two.

    The history is that DMTCP supported OpenMPI in an earlier version. The introduction of DMTCP's pid virtualization then created an incompatibility with OpenMPI. The QUICK-START file of that earlier DMTCP noted this and recommended configuring with --disable-pid-virtualization when using OpenMPI. Further, OpenMPI 1.3.x makes use of additional Linux features not present in 1.2.x. As indicated, DMTCP 1.1.0 should fix most of these bugs, and DMTCP 1.1.1 should fix the remaining ones.
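
    For completeness, that earlier workaround amounted to rebuilding DMTCP without pid virtualization (a sketch, assuming the usual autoconf build flow):

        # Configure DMTCP with pid virtualization disabled, as the old
        # QUICK-START recommended for OpenMPI, then build and install.
        ./configure --disable-pid-virtualization
        make && make install

    With DMTCP 1.1.x this workaround should no longer be necessary.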

    Thank you for the helpful feedback.

     
  • Gene Cooperman - 2009-11-22

    As a final update for others who find this report: DMTCP 1.1.1 has now been released, and based on our internal testing we believe this version supports OpenMPI 1.3.3. If anyone finds additional bugs in our OpenMPI support, please let us know. Thank you. -- the developers

     
  • Gene Cooperman - 2010-02-02

    Balwant,
    We recently released DMTCP 1.1.3. As of this release, we believe DMTCP works well with both OpenMPI 1.3.x and OpenMPI 1.4.1. At this time it works primarily with the Ethernet transport (TCP/IP); we will look next at the InfiniBand transport. So I am going to close this report now. Thanks very much for writing to us and working with us to identify the bugs you found. Please feel free to write again at any time. Best wishes, - Gene
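
    For anyone reading later: when running under DMTCP, OpenMPI can be pinned to the TCP transport with its standard MCA parameters (a sketch; the process count and binary name are illustrative):

        # Restrict OpenMPI to the TCP byte-transfer layer (plus the self
        # loopback) so that the InfiniBand transport is not selected.
        dmtcp_checkpoint mpirun --mca btl tcp,self -np 4 ./navcalcmpi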

     
  • Gene Cooperman - 2010-02-02
    • status: open --> closed
  • Kapil Arya - 2010-09-09
    • status: closed --> closed-fixed
