Dmtcp-forum archive: messages per month

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2009 |     |     |     |     |     |     |     |     |     | 1   |     |     |
| 2010 |     | 18  |     | 4   | 2   | 9   | 11  | 7   | 13  | 6   | 8   |     |
| 2011 | 4   | 36  | 34  | 9   | 9   | 18  | 10  | 13  | 5   | 22  | 5   | 3   |
| 2012 | 5   | 24  | 22  | 24  | 4   | 8   | 14  | 20  | 14  | 20  | 5   | 1   |
| 2013 | 6   | 1   | 32  | 6   | 3   | 26  | 14  | 11  | 15  | 44  | 33  | 2   |
| 2014 | 9   | 19  | 12  | 5   | 2   | 4   | 3   | 2   | 8   | 32  | 30  | 8   |
| 2015 |     | 18  | 31  | 41  | 33  | 13  | 15  | 13  | 4   | 58  | 9   | 5   |
| 2016 | 9   | 21  | 6   | 35  | 50  | 16  | 11  | 7   | 13  | 20  | 2   | 8   |
| 2017 | 1   | 1   |     |     | 1   | 1   | 17  | 20  |     | 1   | 2   | 7   |
| 2018 | 1   | 2   |     | 5   | 11  | 3   | 1   |     | 5   | 3   | 2   | 1   |
| 2019 | 4   | 1   | 6   | 2   | 1   | 1   | 3   |     |     |     |     | 1   |
| 2020 | 8   | 3   | 3   |     | 4   |     | 2   | 3   |     | 4   | 1   |     |
| 2021 | 5   | 1   | 1   | 1   | 2   |     |     |     |     |     | 1   |     |
| 2022 | 1   |     | 1   |     | 1   |     |     | 1   | 2   |     |     |     |
| 2023 |     | 2   |     | 7   |     |     |     |     |     | 6   |     |     |
| 2024 | 1   |     |     |     |     | 2   |     | 1   |     |     |     |     |
From: Alex R. <ale...@gm...> - 2024-08-02 06:13:40

Hi DMTCP forum! I am considering this technology as a solution for checkpoint/restart (C/R) of long-running (up to 1 week) production jobs on Kubernetes, mostly Java and Python for offline data processing. Could anyone on this list share feedback on using it in a similar environment (non-HPC, non-academia, Java, Linux, Hadoop, data pipelines, ideally with k8s, but bare metal is fine too)?

Thanks,
Alex
From: Marcus D. <ma...@sn...> - 2024-06-27 23:30:25

To add another piece of information to this inquiry: the point of failure when running with --mpi (this is on Rocky 8) is in src/processinfo.cpp:

    void ProcessInfo::updateRestoreBufAddr(void* addr, uint64_t len)
    {
      if (_restoreBufAddr != 0) {
        JASSERT(munmap((void*) _restoreBufAddr, _restoreBufLen) == 0)
          (JASSERT_ERRNO);  // The munmap call fails in --mpi mode.
      }
      ...
    }

I'm using the latest code from the repository.

Cheers,
Marcus
From: Marcus D. <ma...@sn...> - 2024-06-26 23:27:16

Hi, I noticed that an option --mpi was introduced after dmtcp-3.0.0. I find that it causes crashes during mpich 4.1.2 launches on CentOS 7 and Rocky 8. Is this an essential launch flag? Without it, a multi-node launch completes on Rocky 8 and can be checkpointed, but restart hangs. Is the absence of --mpi likely the reason for the hanging restart? CentOS 7 launches don't complete unless I restrict the launch to a single node. Any ideas on tests to try?

Thanks!
Marcus
From: Madan S. T. <MTi...@lb...> - 2024-01-31 18:51:54

Hi, I am currently testing the checkpoint-restart feature using DMTCP with CP2K, a software package for quantum chemistry and solid-state physics (https://github.com/cp2k/cp2k). However, I encountered an error during the process. Do you have any suggestions on how to configure it properly? Your help would be much appreciated.

    [2024-01-29T22:56:28.264, 41000, 41003, Warning] at fileconnlist.cpp:457 in prepareShmList;
      REASON='JWARNING(false) failed'
      area.name = /dev/zero (deleted)
      Message: Ckpt/Restart of anonymous shared memory not supported.
    slurmstepd: error: *** JOB 21040868 ON nid005128 CANCELLED AT 2024-01-29T23:00:43 DUE TO TIME LIMIT ***
    [2024-01-29T23:05:28.446, 41000, 41003, Error] at fileconnection.cpp:340 in refill;
      REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
      _path = /tmp/.libxsmm.74748
      Message: File not found.
    cp2k.psmp: Terminating...
    Backtrace:
      1  _ZN16jassert_internal7JAssertD2Ev in /usr/local/lib/dmtcp/libdmtcp.so 0x7fe4f18738c4
      2  _ZN5dmtcp14FileConnection6refillEb in /usr/local/lib/dmtcp/libdmtcp_ipc.so 0x7fe4f1911c4e
      3  _ZN5dmtcp14ConnectionList6refillEb in /usr/local/lib/dmtcp/libdmtcp_ipc.so 0x7fe4f190377a
      4  _Z28dmtcp_FileConnList_EventHook11eDmtcpEventP17_DmtcpEventData_t in /usr/local/lib/dmtcp/libdmtcp_ipc.so 0x7fe4f191ea52
      5  _ZN5dmtcp13PluginManager9eventHookE11eDmtcpEventP17_DmtcpEventData_t in /usr/local/lib/dmtcp/libdmtcp.so 0x7fe4f183e45a
      6  _ZN5dmtcp11DmtcpWorker11postRestartEd in /usr/local/lib/dmtcp/libdmtcp.so 0x7fe4f1833568
      7  _ZN5dmtcp10ThreadList18waitForAllRestoredEP6Thread in /usr/local/lib/dmtcp/libdmtcp.so 0x7fe4f1844a7d
      8  in /usr/local/lib/dmtcp/libdmtcp.so 0x7fe4f18462e7
      9  in /usr/local/lib/dmtcp/libdmtcp.so 0x7fe4f18488de
      10 in /lib/x86_64-linux-gnu/libc.so.6 0x7fe4ecc6cb43
      11 in /lib/x86_64-linux-gnu/libc.so.6 0x7fe4eccfea00

Best regards,
Madan
--
Madan K Sharma Timalsina, PhD
NESAP Fellow (NERSC)
Lawrence Berkeley National Laboratory
1 Cyclotron Road, Mail Stop: 59R4010A, Berkeley, CA 94720 US
From: Kapil A. <ka...@cc...> - 2023-10-26 22:04:03

Thanks for the pointers. I'll try it out and report back.

_______________________________________________
Dmtcp-forum mailing list
Dmt...@li...
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
From: Prentice B. <pb...@pp...> - 2023-10-26 21:31:42

On 10/26/23 1:47 PM, Kapil Arya wrote:
> I couldn't find an RH9 equivalent distro to test it. If you have any
> pointers, I'd appreciate it.

What's wrong with Rocky Linux? They're at 9.2 right now: https://rockylinux.org/download

And there's Springdale Linux, which is also a RHEL rebuild: https://springdale.math.ias.edu. There's an issue with the Springdale webserver cert being expired, but that should be fixed in a few minutes. I know the maintainers.

--
Prentice
From: Andrew L. <dr...@ca...> - 2023-10-26 19:35:26

Thanks Kapil. What is the first version which supports RH9?

-Drew

Andrew T. Lynch | Software Architect
T: 408.914.6875 M: 408.832.1045 www.cadence.com
From: Kapil A. <ka...@cc...> - 2023-10-26 17:59:32

Hi Andrew,

DMTCP 3.0 is the current version. Also, we will be releasing version 3.1 towards the middle of November.

Kapil
From: Kapil A. <ka...@cc...> - 2023-10-26 17:59:31

3.0 should support RH9. However, I don't have a way to validate it. I couldn't find an RH9 equivalent distro to test it. If you have any pointers, I'd appreciate it.
From: Andrew L. <dr...@ca...> - 2023-10-26 17:26:47

Hi Folks,

The sourceforge page refers to 2.6.0 as the latest stable version (although there is a download for 2.6.1). This github page references a new 3.0 release and provides a tar.gz file: https://github.com/dmtcp/dmtcp/releases/tag/3.0.0

Which is correct? Is the 3.0 version stable, or is it still the tip of a changing development stream? Also, which versions of DMTCP support RH9?

Regards,
Drew

Andrew T. Lynch | Software Architect
T: 408.914.6875 M: 408.832.1045 www.cadence.com
From: Gunter, D. O <do...@la...> - 2023-04-28 19:17:10

Finally, from what I can tell, nothing about "make install" moves any of the mana executables to the prefix given in the configure stage.
From: Gunter, D. O <do...@la...> - 2023-04-28 16:53:51

Digging a little further, the build does build the mana* executables in the build/bin directory. The "make install" process is what is failing. In your Makefiles, you are referencing a variable that is never set, $DESTDIR, in many instances. The only place I see this being set is in the subdirectory for dmtcp:

    dmtcp/debian/rules:DESTDIR = $(CURDIR)/$(BUILDDIR)

Meaning only the dmtcp bits are being installed correctly.

-david
From: Gunter, D. O <do...@la...> - 2023-04-28 16:27:51

Although I was finally able to build mana/dmtcp for our Cray system, the how-to section on running seems misleading.

In my $MANA_ROOT/bin dir, all the commands begin with "dmtcp" and not "mana", i.e.

    -rwxr-xr-x 1 dog dog 3824384 Apr 27 16:25 dmtcp_command
    -rwxr-xr-x 1 dog dog 3808448 Apr 27 16:25 dmtcp_coordinator
    -rwxr-xr-x 1 dog dog 1604376 Apr 27 16:25 dmtcp_discover_rm
    -rwxr-xr-x 1 dog dog   20952 Apr 27 16:25 dmtcp_get_libc_offset
    -rwxr-xr-x 1 dog dog 3931984 Apr 27 16:25 dmtcp_launch
    -rwxr-xr-x 1 dog dog   15984 Apr 27 16:25 dmtcp_nocheckpoint
    -rwxr-xr-x 1 dog dog 4529840 Apr 27 16:25 dmtcp_restart
    -rwxr-xr-x 1 dog dog    5393 Apr 27 16:25 dmtcp_rm_loclaunch
    -rwxr-xr-x 1 dog dog   76448 Apr 27 16:25 dmtcp_srun_helper
    -rwxr-xr-x 1 dog dog   97808 Apr 27 16:25 dmtcp_ssh
    -rwxr-xr-x 1 dog dog  104008 Apr 27 16:25 dmtcp_sshd
    -rwxr-xr-x 1 dog dog  162432 Apr 27 16:25 mtcp_restart

The how-to guide says to do the following:

    3a. Launching an MPI application

    The MANA directory comes with many test MPI applications that can be found in
    mpi-proxy-plugin/test. Depending on the application, you may require more than one
    MPI process running -- for example, ping_pong.mana.exe requires two. To support
    this, change the argument after -np accordingly. For this tutorial, we'll use
    mpi_hello_world.mana.exe, which can run with one MPI process.

    $ mana_coordinator
    $ srun -n 1 mana_launch mpi-proxy-split/test/mpi_hello_world.mana.exe

Did I do something wrong in the build or is the documentation incorrect?

Thanks,
david
--
David Gunter
CCS-7
Los Alamos National Laboratory
From: Gunter, D. O <do...@la...> - 2023-04-25 21:08:05

It seems I need -llzma and -lz. liblzma is no longer packaged for SUSE SLES 15, so I will have to build it from scratch. Have you thought at all about moving on from LZMA Utils to XZ Utils instead? Also, a configure check for -llzma and -lz would be useful. Anyway, here is my build info and errors.

    $ module list
    Currently Loaded Modules:
      1) craype-x86-rome     4) perftools-base/22.09.0                  7) gcc/12.1.0        10) cray-mpich/8.1.21
      2) libfabric/1.15.0.0  5) xpmem/2.4.4-2.3_13.8__gff0e1d9.shasta   8) craype/2.7.19     11) cray-libsci/22.11.1.2
      3) craype-network-ofi  6) cray-pmi/6.1.7                          9) cray-dsmml/0.2.2  12) PrgEnv-gnu/8.3.3

    $ ./configure    (no troubles)
    $ make mana
    ...
    make[3]: Entering directory '/usr/projects/icapt/dog/dmtcp/mana/mpi-proxy-split/lower-half'
    if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
      rm -f tmp.sh; \
      mpicc -show -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget \
        -Wl,-Ttext-segment=E000000 -o lh_proxy -Wl,-start-group \
        lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o \
        -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group | \
        sed -e 's^-lunwind ^ ^' > tmp.sh; \
      sh tmp.sh; \
      rm -f tmp.sh; \
    elif true; then \
      mpicc -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget \
        -Wl,-Ttext-segment=E000000 -o lh_proxy -Wl,-start-group \
        lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o \
        -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 `cat static_libs.txt` -Wl,--end-group; \
    else \
      mpicc -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget \
        -Wl,-Ttext-segment=E000000 -o lh_proxy -Wl,-start-group \
        lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o \
        -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -ldl -Wl,-end-group; \
    fi
    /usr/bin/ld: cannot find -llzma
    /usr/bin/ld: cannot find -lz
    collect2: error: ld returned 1 exit status
    if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
      rm -f tmp.sh; \
      mpicc -show -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget \
        -o lh_proxy_da -Wl,-start-group \
        lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o \
        -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group | \
        sed -e 's^-lunwind ^ ^' > tmp.sh; \
      sh tmp.sh; \
      rm -f tmp.sh; \
    elif true; then \
      mpicc -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget \
        -o lh_proxy_da -Wl,-start-group \
        lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o \
        -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 `cat static_libs.txt` -Wl,--end-group; \
    else \
      mpicc -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget \
        -o lh_proxy_da -Wl,-start-group \
        lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o \
        -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -ldl -Wl,-end-group; \
    fi
    /usr/bin/ld: cannot find -llzma
    /usr/bin/ld: cannot find -lz
    collect2: error: ld returned 1 exit status
    cp -f lh_proxy lh_proxy_da gethostbyname-static/gethostbyname_static.o /usr/projects/icapt/dog/dmtcp/mana/bin/
    cp: cannot stat 'lh_proxy': No such file or directory
    cp: cannot stat 'lh_proxy_da': No such file or directory

--
David Gunter
CCS-7: Application Performance Team
CCS-7: Future Architectures Team
Los Alamos National Laboratory
From: Kapil A. <ka...@cc...> - 2023-04-21 13:09:28

Hi Vakho,

Can you please try out the master branch and see if that works for you? If you also share your test code and work environment with me, I can try to reproduce and diagnose it.

Kapil
From: Vakho T. <vts...@lb...> - 2023-04-21 06:04:30

Hello,

I'm running some tests with checkpoint-restarting a multithreaded application using DMTCP 2.6.0, and my tests reproducibly deadlock on recursive mutexes shortly after restart. After replacing recursive mutexes with regular mutexes, and slightly refactoring the code, the problem seems to be gone.

Is this a known issue that DMTCP cannot work with recursive mutexes? I can provide more details about my work environment (OS, compiler, etc.) if necessary.

Thank you,
-- vakho
From: Gunter, D. O <do...@la...> - 2023-04-19 19:24:32

Hello,

It has been quite some time since I last tried to get DMTCP/MANA to work anywhere here at Los Alamos. I thought I would check in and see if the code is still under development, as well as to check which branch I would use for testing with MPI.

Thanks,
david
--
David Gunter
CCS-7: Application Performance Team
CCS-7: Future Architectures Team
Los Alamos National Laboratory
From: Analabha R. <har...@gm...> - 2023-02-10 11:50:40

Hi,

I'm attempting to checkpoint the following on a VirtualBox VM running Ubuntu 20.04 LTS (because the dmtcp build appears to fail on Ubuntu 22.04 at the moment). DMTCP was compiled from upstream by cloning the repo at this version: https://github.com/dmtcp/dmtcp/tree/47746500dc2c2a5f5de0c984d102a40acb21f140

"./configure, make" works okay. The result of "make check" is that all 63 tests pass.

I compiled this MPI program: https://raw.githubusercontent.com/cornellcac/CR-demos/master/demos/MPI/mpi_count.c

It compiles fine with OpenMPI 4.0.3-0ubuntu; however, checkpointing fails:

    $ mpicc mpi_count.c -o mpi_count
    $ dmtcp_launch -i 5 --rm mpirun -np 2 ./mpi_count
    mpirun: symbol lookup error: /usr/local/lib/dmtcp/libdmtcp_batch-queue.so: undefined symbol: process_fd_event

There appears to be a similar problem reported on the GitHub issues page in 2020 (https://github.com/dmtcp/dmtcp/issues/843), but no resolution. Does the present dmtcp not work with MPI, or does there need to be some extra configuration? Please advise.

Thanks and Regards,
AR
--
Analabha Roy
Assistant Professor
Department of Physics, The University of Burdwan
Golapbag Campus, Barddhaman 713104, West Bengal, India
Emails: da...@ut..., ar...@ph..., har...@gm...
Webpage: http://www.ph.utexas.edu/~daneel/
From: Services d. <ser...@vv...> - 2023-02-03 10:09:25

Hi,

I am trying to run an MPI job with dmtcp_launch, as follows:

    dmtcp_launch mpirun -np 20 ./test.mpi

I am getting errors like this:

    warning at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
      filename = libze_loader.so
    You may see a message 'ERROR: ld.so from libdl.so'. Consider setting the environment
    variable 'DMTCP_DL_PLUGIN' to 0 before 'dmtcp_launch'.
    ERROR at sysvipcwrappers.cpp in shmctl; REASON='JASSERT(realShmid != -1) failed'
    test.mpi: Terminating

Please help.

Thanks,
Amit
From: Cyrill B. <cyr...@ma...> - 2022-09-26 16:39:15
|
Hi, First of all thank you Maksym. I've dedicated some more time to it and ran perf and looked into the memory allocation with the numa_maps under /proc/<PID>/numa_maps. First perf. I've recorded the instructions as well as page faults. Recording the instructions revealed that after restart a way longer time was spent in an MPI call, but most likely this is due to synchronization between the MPI ranks. The reason for some MPI ranks being slower probably lays within the different memory allocation. In the picture below is the initial run before restart on the left and the run after restart on the right. perf record -e instructions output The idea behind looking at page faults was to maybe find some irregularities after restart since the "restart penalty" seems to scale with the speed of the used filesystem for restarting (without the restart itself obviously) and the number of tasks used. Meaning its faster when I restart it from RAM disk then from the local SSD, not only the restart itself which is obvious but also the execution of threads afterwards (less "restart penalty"). However I could not find anything of interest. The high amount of page faults within the mtcp_restart process was to be expected and should not influence the performance of the execution since it should just terminate after restoring threads and memory, if I'm not mistaken. However I've included the output anyways, again on the left side is the initial run and on the right side the restarted one. perf record page faults output And last I've looked at the memory allocation of the processes. I've first pinned eight processes in one NUMA domain and eight in the other. And when checking the numa_maps for each process it was obvious that all of them had most of their memory mapped in their NUMA domain. After restart I've checked the numa_maps again and all of them had most of their memory mapped in NUMA domain 0, which is probably the domain where the restart process started. 
This could be an explanation for the "restart penalty" and this could also explain why our nodes with the AMD EPYC Rome CPUs seemed to be more affected than the older Haswell nodes, since the AMD CPUs have more NUMA domains and therefore also a higher latency between the two which are the furthest apart. However this does still not explain why it also seems to scale with the speed of the for restarting used filesystem. In the picture below is for each process shown how many pages they are mapped in NUMA domain 0 and how many in NUMA domain 1. The output on the left is before the restart and the one on the right after restart. analysis of numa_maps I wanted to publish my findings here in the hope it might help someone else in future. Sadly I won't have be able to dedicate any further time to it in near future. However I would still love to hear if someone finds a reason to why the "restart penalty" seems to scale with the speed (or possibly latency?) of the for restarting used filesystem or has in general anything to add or new ideas. Best regards, Cyrill On 01.09.22 19:18, Maksym Planeta wrote: > Hi Cyrill, > > I remember having similarly looking problem with CRIU [1]. Try to take > a look at performance counters. > > > https://github.com/checkpoint-restore/criu/issues/1171 > > On 8/29/22 09:47, Cyrill Burth wrote: >> Hi, >> >> I was working the last few weeks with DMTCP and made some performance >> benchmarks. Therefore I have used the NPB 3.4.2 BT - MPI benchmark >> [1] at the Taurus Supercomputer at the TU Dresden always with 16 MPI >> ranks and gzip disabled. >> >> I have realized that if I would restart an application from its >> checkpoint it would (drastically) slow down compared to before the >> checkpoint, I will refer to this as phenomena as "restart penalty". 
>> >> I will describe shortly my methodology: I have performed an >> checkpoint in the 20th iteration and if I took the time before >> restart from the 21st to last iteration of the benchmark it would be >> between 25% to 45% less then when I did the same after restarting >> from the checkpoint in the 20th iteration. I verified this with the >> MPI benchmark (25%-45% "restart penalty") as well as with the OpenMP >> benchmark (consistent 15% restart penalty) which is also provided by >> NPB under [1]. I ran all tests multiple times on multiple nodes and >> all of them yielded the same results. To compile and run the >> benchmark I have used the intel/2019b toolchain, since I had some >> compatibility issues with newer versions. >> I have repeated the tests with application initiated checkpointing as >> well as with the "-i" option, without modifying the benchmarks source >> code. Both yielded the same results. >> >> However the reason I am contacting you is since I have not only >> realized the behavior described above but also that the "restart >> penalty" seems to scale with the speed of the used filesystem at >> least when using MPI. If I would restart from our relatively slow >> local SSDs, I have seen a "restart penalty" of roughly 45%, however >> if I restarted the same checkpoint from a RAM disk, I would only see >> a "restart penalty" of 25%. This could only be seen when using the >> MPI version of the benchmark, for the OpenMP version there was seen a >> "restart penalty" of 15%, but it would not scale with the used >> filesystem. >> >> I was wondering if anyone could give me any insights that could >> explain this behavior. >> >> The restart times themselves obviously go up when the slower >> filesystem is used, but this was to be expected, however it appears >> rather odd that the performance after restart depends on the >> filesystem used for restart. Some further research showed that every >> single iteration of the benchmark gets slowed down. 
It is *not* the >> case that some iterations take significantly longer than others. >> There were no further checkpoints taken except for the very first one >> in the 20th iteration, from which I restarted and which was >> excluded from the time measurements. >> >> >> Thank you very much in advance. >> >> >> Best regards, >> >> C. Burth >> >> >> [1] https://www.nas.nasa.gov/software/npb.html >> >> >> >> _______________________________________________ >> Dmtcp-forum mailing list >> Dmt...@li... >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum > |
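For reference, the per-process, per-node page counts discussed in the message above can be gathered from `/proc/<pid>/numa_maps`, where each mapping line carries `N<node>=<pages>` fields. The exact tool Cyrill used is not stated; the following is only a minimal sketch based on the documented numa_maps format:

```python
# Sketch: sum the per-NUMA-node page counts from a /proc/<pid>/numa_maps dump.
# Each line looks roughly like:
#   7f2a4c000000 default anon=1024 dirty=1024 N0=768 N1=256 kernelpagesize_kB=4
# The N<node>=<pages> fields say how many pages of the mapping live on each node.
import re
from collections import Counter

def pages_per_node(numa_maps_text: str) -> Counter:
    """Return total mapped pages per NUMA node for one process."""
    counts = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            counts[int(node)] += int(pages)
    return counts

if __name__ == "__main__":
    # Synthetic example input; on a real system read /proc/<pid>/numa_maps.
    sample = (
        "7f2a4c000000 default anon=1024 dirty=1024 N0=768 N1=256 kernelpagesize_kB=4\n"
        "7f2a50000000 default file=/usr/lib/libc.so.6 mapped=300 N0=300 kernelpagesize_kB=4\n"
    )
    print(dict(pages_per_node(sample)))  # -> {0: 1068, 1: 256}
```

Running this once before the checkpoint and once after restart, for each rank, reproduces the kind of before/after comparison the figure describes.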
From: Maksym P. <mpl...@os...> - 2022-09-01 17:54:57
|
Hi Cyrill, I remember having a similar-looking problem with CRIU [1]. Try taking a look at the performance counters. https://github.com/checkpoint-restore/criu/issues/1171 On 8/29/22 09:47, Cyrill Burth wrote: > Hi, > > I was working with DMTCP for the last few weeks and made some performance benchmarks. For these I used the NPB 3.4.2 BT > - MPI benchmark [1] at the Taurus Supercomputer at TU Dresden, always with 16 MPI ranks and gzip disabled. > > I realized that if I restarted an application from its checkpoint, it would (drastically) slow down compared to > before the checkpoint; I will refer to this phenomenon as the "restart penalty". > > I will briefly describe my methodology: I performed a checkpoint in the 20th iteration, and the time > from the 21st to the last iteration of the benchmark, measured before the restart, was between 25% and 45% less than the > same measurement taken after restarting from the checkpoint in the 20th iteration. I verified this with the MPI benchmark (25%-45% > "restart penalty") as well as with the OpenMP benchmark (consistent 15% "restart penalty"), which is also provided by NPB > under [1]. I ran all tests multiple times on multiple nodes, and all of them yielded the same results. To compile and run > the benchmark I used the intel/2019b toolchain, since I had some compatibility issues with newer versions. > I repeated the tests with application-initiated checkpointing as well as with the "-i" option, without modifying > the benchmark's source code. Both yielded the same results. > > However, the reason I am contacting you is that I have not only observed the behavior described above but also that the > "restart penalty" seems to scale with the speed of the filesystem used, at least when using MPI. If I restarted from > our relatively slow local SSDs, I saw a "restart penalty" of roughly 45%; however, if I restarted the same > checkpoint from a RAM disk, I only saw a "restart penalty" of 25%. 
This could only be seen when using the MPI > version of the benchmark; the OpenMP version showed a "restart penalty" of 15%, but it did not scale with > the filesystem used. > > I was wondering if anyone could give me any insights that would explain this behavior. > > The restart times themselves obviously go up when the slower filesystem is used, but this was to be expected. However, it > appears rather odd that the performance after restart depends on the filesystem used for the restart. Some further research > showed that every single iteration of the benchmark gets slowed down. It is *not* the case that some iterations take > significantly longer than others. > There were no further checkpoints taken except for the very first one in the 20th iteration, from which I restarted > and which was excluded from the time measurements. > > > Thank you very much in advance. > > > Best regards, > > C. Burth > > > [1] https://www.nas.nasa.gov/software/npb.html -- Regards, Maksym Planeta |
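Short of full hardware performance counters (e.g. via `perf`), the kernel's per-process fault counters are one cheap place to start when checking whether a restarted process is paying for extra demand paging. A sketch, assuming a Linux `/proc` filesystem and the field layout documented in proc(5):

```python
# Sketch: read a process's cumulative page-fault counters from /proc/<pid>/stat.
# Per proc(5), fields 10 and 12 (1-based) are minflt and majflt; majflt counts
# faults that required I/O, minflt those served from memory. Comparing these
# between a fresh run and a restarted run can hint at demand-paging overhead.
import os

def fault_counters(pid: int = None) -> dict:
    """Return {'minflt': ..., 'majflt': ...} for the given pid (default: self)."""
    pid = pid if pid is not None else os.getpid()
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name (field 2) may contain spaces; split after the closing ')'.
    fields = data.rsplit(")", 1)[1].split()
    # Relative to the remainder (which starts at field 3, the state),
    # minflt is index 7 and majflt is index 9 (0-based).
    return {"minflt": int(fields[7]), "majflt": int(fields[9])}

if __name__ == "__main__":
    print(fault_counters())
```

This only covers software page faults; for cache, TLB, and NUMA-related events, `perf stat` on the restarted ranks would be the next step, as Maksym suggests.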
From: Cyrill B. <cyr...@ma...> - 2022-08-29 08:04:25
|
Hi, I was working with DMTCP for the last few weeks and made some performance benchmarks. For these I used the NPB 3.4.2 BT - MPI benchmark [1] at the Taurus Supercomputer at TU Dresden, always with 16 MPI ranks and gzip disabled. I realized that if I restarted an application from its checkpoint, it would (drastically) slow down compared to before the checkpoint; I will refer to this phenomenon as the "restart penalty". I will briefly describe my methodology: I performed a checkpoint in the 20th iteration, and the time from the 21st to the last iteration of the benchmark, measured before the restart, was between 25% and 45% less than the same measurement taken after restarting from the checkpoint in the 20th iteration. I verified this with the MPI benchmark (25%-45% "restart penalty") as well as with the OpenMP benchmark (consistent 15% "restart penalty"), which is also provided by NPB under [1]. I ran all tests multiple times on multiple nodes, and all of them yielded the same results. To compile and run the benchmark I used the intel/2019b toolchain, since I had some compatibility issues with newer versions. I repeated the tests with application-initiated checkpointing as well as with the "-i" option, without modifying the benchmark's source code. Both yielded the same results. However, the reason I am contacting you is that I have not only observed the behavior described above but also that the "restart penalty" seems to scale with the speed of the filesystem used, at least when using MPI. If I restarted from our relatively slow local SSDs, I saw a "restart penalty" of roughly 45%; however, if I restarted the same checkpoint from a RAM disk, I only saw a "restart penalty" of 25%. This could only be seen when using the MPI version of the benchmark; the OpenMP version showed a "restart penalty" of 15%, but it did not scale with the filesystem used. 
I was wondering if anyone could give me any insights that would explain this behavior. The restart times themselves obviously go up when the slower filesystem is used, but this was to be expected. However, it appears rather odd that the performance after restart depends on the filesystem used for the restart. Some further research showed that every single iteration of the benchmark gets slowed down. It is *not* the case that some iterations take significantly longer than others. There were no further checkpoints taken except for the very first one in the 20th iteration, from which I restarted and which was excluded from the time measurements. Thank you very much in advance. Best regards, C. Burth [1] https://www.nas.nasa.gov/software/npb.html |
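For clarity, the percentages quoted in the message above compare the wall time of iterations 21 through the last, before vs. after restart, with the pre-restart time being 25%-45% *less* than the post-restart time. A tiny hypothetical helper (the function name and figures are illustrative, not from the message) makes that definition explicit:

```python
# Sketch (hypothetical helper): the "restart penalty" as phrased in the message,
# i.e. by what fraction the pre-restart time undercuts the post-restart time
# for the same span of iterations.
def restart_penalty(t_before: float, t_after: float) -> float:
    """Fraction by which t_before is less than t_after."""
    return (t_after - t_before) / t_after

if __name__ == "__main__":
    # Illustrative numbers: 55 s before restart vs. 100 s after restart
    # corresponds to the reported worst case of a 45% "restart penalty".
    print(f"{restart_penalty(55.0, 100.0):.0%}")  # -> 45%
```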
From: Tobias v. E. <tob...@po...> - 2022-05-30 09:26:41
|
Hi, I am using dmtcp to checkpoint an MCMC program with long run times, because the cluster I use has a runtime limit of 48 h. After successfully checkpointing and restarting the program 21 times, I now get the following message during checkpointing of the 22nd run: [40000] WARNING at procselfmaps.cpp:101 in ~ProcSelfMaps; REASON='JWARNING(numAllocExpands == jalib::JAllocDispatcher::numExpands()) failed' numAllocExpands = 10 jalib::JAllocDispatcher::numExpands() = 11 Message: JAlloc: memory expanded through call to mmap(). Inconsistent JAlloc will be a problem on restart I don't completely understand what the problem is here. Is it perhaps related to the large amount of memory required to checkpoint an MCMC run that has already been running for about 40 days? Does anyone know how to fix the issue? Kind regards and thanks, Tobias |
From: guo g. <guo...@gm...> - 2022-03-09 04:55:37
|
Hello, I'm using the desktop version of Ubuntu. I want to use dmtcp to set checkpoints on the calculator application. Can dmtcp do that? What should I do? Thank you; I look forward to your reply! On Wed, Mar 9, 2022 at 12:47 PM guo guo <guo...@gm...> wrote: > Hello, I'm using the desktop version of Ubuntu. I want to use dmtcp to set > checkpoints on the calculator. Can you do that? What should I do? Thank you > and look forward to your reply! > |
From: Jo-To S. <joh...@gm...> - 2022-01-03 14:50:55
|
Hello, An application I use (in this case the Julia programming language runtime) uses the USR2 signal internally. What other signals could I try for DMTCP? I hope this is enough to make Julia work, but I might also need to address the dlopen warnings and errors already reported by Philippe Marion in this thread, which never got a reply: https://sourceforge.net/p/dmtcp/mailman/dmtcp-forum/thread/b30cc819-f579-bc58-16f7-ee2d65eb7880%40univ-littoral.fr/#msg36435580 Sincerely, Johann-Tobias Schäg |