From: Derek G. <fri...@gm...> - 2013-04-10 01:48:24
|
Hey guys,

I've got a fairly large job (>3500 procs) that is hanging while trying to set up the mesh. The procs are in 2 separate places. ~Half of them are here:

#35 0x00002b1746aea7f0 in libMesh::Parallel::Communicator::send_receive<Hilbert::HilbertIndices> () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#36 0x00002b1746afc17b in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::MeshTools::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> >&) const () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#37 0x00002b1746c78528 in libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#38 0x00002b1746c7b3e5 in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#39 0x00002b1746ad1d32 in libMesh::MeshBase::prepare_for_use(bool) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0

And the other ~half are here:

#6 0x00002ba5c95c8b90 in libMesh::Parallel::Communicator::send_receive<unsigned int> () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#7 0x00002ba5c95da2e2 in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::MeshTools::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> >&) const () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#8 0x00002ba5c9756528 in libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#9 0x00002ba5c97593e5 in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0

Obviously they are in slightly different spots...

Any ideas on what's going on here or where to start looking?

I was intermittently getting weird errors around this point from mvapich so I've tried to switch to OpenMPI.... and it's hanging up here.

The mesh itself isn't enormous... it's only about 1 million nodes or so. We've definitely done more than this before.

Thanks in advance for any advice!

Derek
|
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-04-10 02:16:27
|
Hmm - I'll look through that section of code tomorrow morning and see if there could possibly be any mismatched send/receives or anything.

-Ben
|
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-04-10 02:21:33
|
Serial or parallel mesh?
|
From: Derek G. <fri...@gm...> - 2013-04-10 02:21:51
|
serial
|
From: Derek G. <fri...@gm...> - 2013-04-10 02:26:49
|
Is there any way to disable the hilbert stuff for now? With serial mesh can we just take the numbering from the node numbering?
|
From: Derek G. <fri...@gm...> - 2013-04-10 02:27:49
|
Another data point... job starts fine on half the procs....

Derek
|
From: Derek G. <fri...@gm...> - 2013-04-10 17:58:02
|
Anyone sleep on this and come up with any ideas to try?

One thing to note is that we are actually reading the mesh on every processor still (because of the block / sideset naming stuff that Cody only recently fixed). Do you believe that could be part of the problem?

Currently I can't run over about 1,700 procs without hitting this hang. I'm recompiling with a new version of mvapich... and I'm hoping that that fixes it... but I'd like to know if there is anything else I can try...

Derek
|
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-04-10 18:18:35
|
On Apr 10, 2013, at 12:57 PM, Derek Gaston <fri...@gm...> wrote:

> Anyone sleep on this and come up with any ideas to try?

I'm reviewing the code now… Is there a restart file with this case, or is it a fresh start?

If the latter, you may be able to turn libMesh::MeshCommunication::find_global_indices() into a no-op - just with a naked 'return;' at the top.

The goal of a partition-agnostic node ordering is still something I'd like to achieve, but the current implementation has been the source of way too much woe. I'll revisit the discussion on the list from a few years ago and figure out what a proper path forward should be.

Is this mesh something I can get my hands on down the road for testing?

-Ben
|
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-04-10 18:43:44
|
On Apr 10, 2013, at 1:18 PM, "Kirk, Benjamin (JSC-EG311)" <ben...@na...> wrote:

>> Anyone sleep on this and come up with any ideas to try?
>
> I'm reviewing the code now… Is there a restart file with this case, or is it a fresh start?

I'm curious if we have a good old-fashioned race condition here. We are locking up at the same section of code called from two places, suggesting a synchronization problem. Now, there is some allgather() in there, which I would expect to force synchronization, so this is indeed curious…

One idea: do we need a simple barrier() at the end of that function to avoid synchronization issues? Perhaps one set of processors is racing ahead and accidentally participating in the next allgather(), which is not actually intended for it?

Barring that, my only other idea is that the sorting is somehow breaking down when a processor has no objects, but I'm pretty sure we've tested the heck out of that.

So I'd say first off try a simple barrier() before that function returns and report back…

-Ben
|
From: Derek G. <fri...@gm...> - 2013-04-10 18:42:50
|
On Wed, Apr 10, 2013 at 12:18 PM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:

> I'm reviewing the code now… Is there a restart file with this case, or is
> it a fresh start?

Fresh Start

> If the latter, you may be able to turn libMesh::MeshCommunication::find_global_indices()
> into a no-op - just with a naked 'return;' at the top.

OOoooooh! Really? That sounds perfect! I'll try it!

> The goal of a partition-agnostic node ordering is still something I'd like
> to achieve, but the current implementation has been the source of way too
> much woe. I'll revisit the discussion on the list from a few years ago and
> figure out what a proper path forward should be.

Yeah - we know it causes problems with restart with solid mechanics for instance.... because we might actually have some "penetration" of one part of the mesh into another part of the mesh in a restart file... which causes the global_indices to get assigned differently from how they were assigned at the beginning of the simulation. That's a nasty problem.

> Is this mesh something I can get my hands on down the road for testing?

Hmmm - I might be able to send it to you down the road. I'd have to double check on that. It's actually not all that interesting though. A pretty good approximation can be built from:

MeshTools::Generation::build_cube(mesh, 60, 256, 60, 0, 1, 0, 4, 0, 1);

It is being read in with the Exodus reader though.... and it does have quite a few subdomains (I don't know if that matters).

Thanks for looking at this!

Derek |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-04-10 18:45:21
|
On Apr 10, 2013, at 1:42 PM, Derek Gaston <fri...@gm...> wrote: > > OOoooooh! Really? That sounds perfect! I'll try it! Please try my barrier() first though, referring to the email that just crossed this one… -Ben |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-04-10 18:55:52
|
On Apr 10, 2013, at 1:44 PM, "Kirk, Benjamin (JSC-EG311)" <ben...@na...> wrote:

> On Apr 10, 2013, at 1:42 PM, Derek Gaston <fri...@gm...> wrote:
>
>> OOoooooh! Really? That sounds perfect! I'll try it!
>
> Please try my barrier() first though, referring to the email that just crossed this one…

And a no-op won't work, but something almost as simple - looping through the input range and assigning an incremental counter - should:

  index_map.clear();
  dof_id_type idx=0;
  for (ForwardIterator it=begin; it!=end; ++it)
    index_map.push_back(idx++);
  return;

would be the "right" way to turn this into a no-op…

-Ben |
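Editorial sketch: the fallback Ben describes can be tried in isolation as a small function template. This is a minimal standalone version, not the libMesh source: `assign_sequential_indices` is a hypothetical name, and `dof_id_type` is typedef'd here to `unsigned int` only as an assumption that matches the `std::vector<unsigned int>&` argument visible in the stack traces.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Assumption: stand-in for libMesh's dof_id_type.
typedef unsigned int dof_id_type;

// Fill index_map with 0, 1, 2, ... in iteration order, one entry per object
// in [begin, end). No communication is involved, so nothing here can hang,
// but the numbering follows local iteration order and is therefore not
// partition-agnostic.
template <typename ForwardIterator>
void assign_sequential_indices(ForwardIterator begin,
                               ForwardIterator end,
                               std::vector<dof_id_type>& index_map)
{
  index_map.clear();
  dof_id_type idx = 0;
  for (ForwardIterator it = begin; it != end; ++it)
    index_map.push_back(idx++);

  // Sanity check: exactly one index per object in the range.
  assert(index_map.size() ==
         static_cast<std::size_t>(std::distance(begin, end)));
}
```

A bare 'return;' would leave index_map empty, while a caller presumably expects one entry per object in the range; that is the likely reason the plain no-op does not work and this counter loop is the minimal substitute.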
From: Derek G. <fri...@gm...> - 2013-04-19 16:38:29
|
Just to put an end-cap on this.... I switched over to the newest mvapich (1.9b) and all of this stuff cleared up. It's still not clear to me what the issue is/was.... but it's working ;-)

Derek |
From: Roy S. <roy...@ic...> - 2013-04-19 16:48:48
|
On Fri, 19 Apr 2013, Derek Gaston wrote: > Just to put an end-cap on this.... I switched over to the newest > mvapich (1.9b) and all of this stuff cleared up. It's still not > clear to me what the issue is/was.... but it's working ;-) That's a relief! Thanks for keeping us updated! It'd be nice to know what caused the issue, but probably not worth the massive effort it'd take to debug. I generally like how we add workarounds for compile-time problems with old PETSc versions and compiler versions, because that's generally easy enough to do... but for run-time problems with old MPI versions, forget about it. I've also got a "breaks-with-old-mpich2, works-with-new" bug on my plate here that I'm never planning to really investigate. --- Roy |
From: Derek G. <fri...@gm...> - 2013-04-19 17:04:13
|
Well - of course I spoke too soon. It just crashed again... but this time I got a core dump.... check out the stack trace:

#0 variant_filter_iterator<libMesh::Predicates::multi_predicate, libMesh::Elem* const, libMesh::Elem* const&, libMesh::Elem* const*>::Pred<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*, std::allocator<libMesh::Elem*> > >, libMesh::Predicates::Local<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*, std::allocator<libMesh::Elem*> > > > >::operator() (this=0x36d9f950, in=<value optimized out>) at ./include/libmesh/variant_filter_iterator.h:298
298 }
(gdb) where
#0 variant_filter_iterator<libMesh::Predicates::multi_predicate, libMesh::Elem* const, libMesh::Elem* const&, libMesh::Elem* const*>::Pred<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*, std::allocator<libMesh::Elem*> > >, libMesh::Predicates::Local<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*, std::allocator<libMesh::Elem*> > > > >::operator() (this=0x36d9f950, in=<value optimized out>) at ./include/libmesh/variant_filter_iterator.h:298
#1 0x00002ae6c4810786 in variant_filter_iterator<libMesh::Predicates::multi_predicate, libMesh::Elem* const, libMesh::Elem* const&, libMesh::Elem* const*>::satisfy_predicate() () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
#2 0x00002ae6c48272c2 in variant_filter_iterator<libMesh::Predicates::Local<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*> > >, __gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*> > > (this=<value optimized out>) at ./include/libmesh/variant_filter_iterator.h:349
#3 const_element_iterator<libMesh::Predicates::Local<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*> > >, __gnu_cxx::__normal_iterator<libMesh::Elem* const*, std::vector<libMesh::Elem*> > > (this=<value optimized out>) at ./include/libmesh/mesh_base.h:961
#4 libMesh::SerialMesh::local_elements_begin (this=<value optimized out>) at src/mesh/serial_mesh_iterators.C:333
#5 0x00002ae6c47c76d2 in libMesh::MeshTools::n_local_levels (mesh=...) at src/mesh/mesh_tools.C:613
#6 0x00002ae6c47c7875 in libMesh::MeshTools::n_levels (mesh=...) at src/mesh/mesh_tools.C:628
#7 0x00002ae6c4849f48 in libMesh::UnstructuredMesh::find_neighbors (this=0x365fcee0, reset_remote_elements=<value optimized out>, reset_current_list=<value optimized out>) at src/mesh/unstructured_mesh.C:372
#8 0x00002ae6c4724438 in libMesh::MeshBase::prepare_for_use (this=0x365fcee0, skip_renumber_nodes_and_elements=<value optimized out>) at src/mesh/mesh_base.C:133
|
From: Roy S. <roy...@ic...> - 2013-04-19 17:10:40
|
On Fri, 19 Apr 2013, Derek Gaston wrote: > Well - of course I spoke too soon. It just crashed again... but this time I got a core dump.... > check out the stack trace: > #0 variant_filter_iterator<libMesh::Predicates::multi_predicate, libMesh::Elem* const, > libMesh::Elem* const&, libMesh::Elem* const*>::Pred<__gnu_cxx::__normal_iterator<libMesh::Elem* > const*, std::vector<libMesh::Elem*, std::allocator<libMesh::Elem*> > >, > libMesh::Predicates::Local<__gnu_cxx::__normal_iterator<libMesh::Elem* const*, > std::vector<libMesh::Elem*, std::allocator<libMesh::Elem*> > > > >::operator() (this=0x36d9f950, > in=<value optimized out>) at ./include/libmesh/variant_filter_iterator.h:298 > #4 libMesh::SerialMesh::local_elements_begin (this=<value optimized out>) at > src/mesh/serial_mesh_iterators.C:333 > #8 0x00002ae6c4724438 in libMesh::MeshBase::prepare_for_use (this=0x365fcee0, > skip_renumber_nodes_and_elements=<value optimized out>) at src/mesh/mesh_base.C:133 It's impossible to tell for sure without more debugging options, but I'd swear this looks like you've somehow got a stray Elem* (probably pointing to uninitialized or freed memory) in that mesh. This would be ugly, because if that's the case then you're not looking at the bug, you're looking at the fallout from some potentially-much-earlier bug. The only thing I can think to do debugging-wise is throw around loops over local element iterators and see where they first start triggering a segfault. --- Roy |
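Editorial sketch: one way to read Roy's suggestion is a small helper that walks a range of element pointers, reads a field through each one, and prints a labelled checkpoint, so that calls placed at successive points in the program show which stage first sees a bad pointer. The version below is a standalone illustration only: `FakeElem` is a hypothetical stand-in for libMesh::Elem, and `touch_range` is not a libMesh function. In a real run the range would presumably come from the mesh's local element iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <string>

// Hypothetical stand-in for libMesh::Elem, holding only what the sketch reads.
struct FakeElem
{
  unsigned int id;
  unsigned int processor_id;
};

// Dereference every pointer in [begin, end) and report how many were touched.
// A stray pointer to freed or uninitialized memory may crash on the read
// below (or only show up under a memory checker); the label of the last
// checkpoint printed before the crash brackets where the corruption appeared.
template <typename Iterator>
std::size_t touch_range(Iterator begin, Iterator end, const std::string& label)
{
  std::size_t visited = 0;
  std::size_t null_entries = 0;
  unsigned long id_sum = 0;

  for (Iterator it = begin; it != end; ++it)
  {
    const FakeElem* elem = *it;
    if (elem == nullptr)
    {
      ++null_entries;  // a null entry is already suspicious; count it and move on
      continue;
    }
    id_sum += elem->id;  // the actual memory read through the pointer
    ++visited;
  }

  assert(visited + null_entries ==
         static_cast<std::size_t>(std::distance(begin, end)));

  // std::endl flushes, so the checkpoint is visible even if a later call crashes.
  std::cerr << "[checkpoint] " << label << ": " << visited
            << " elements touched, " << null_entries
            << " null entries, id sum " << id_sum << std::endl;
  return visited;
}
```

Reading a dangling pointer is undefined behavior and does not always segfault, so a clean pass through this loop is weak evidence at best; running the same checkpoints under a memory checker would give a stronger signal.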