From: Boris B. <bor...@bu...> - 2017-09-18 16:20:09
Hello all,

I've run into an issue where the ParmetisPartitioner seems to occasionally hang during initialization on UB's CCR cluster. Unfortunately, this bug is a bit slippery and I haven't been able to reproduce it completely consistently. The test scenario is simply uniformly refining a grid (through a GRINS input file) a number of times to prepare for some later multigrid computations. I've seen the issue most commonly with square starting grids of 75^2 elements when running with 7 processors on one node, but more consistently with 25^2 elements on two nodes with four processors. Small perturbations to the processor count and number of starting elements don't seem to trigger the bug, so it's something quite specific, and, as odd as it sounds, I've had some runs go through fine under seemingly identical settings.

Some notes:

- I've seen the same issue both when configuring --with-metis=PETSc and without it. I've attached a sample config.log in case it's useful.

- I often attach a gdb session to the running program and see the same recurring stack (below). The hang always seems to be around the HilbertIndices parallel sort's communicate_bins(), with invalid-looking communicator ids in the PMPI_Allgather calls.

Other than these couple of hints, I'm at a loss as to what could cause such weird behaviour. I've never managed to reproduce this on my local development machines, so it seems like it could be some machine- or MPI-specific configuration issue, but after a lot of reconfiguring attempts I'm running out of things to try. Has anyone seen this behaviour before, or does anyone have ideas as to what I can investigate further to get some more useful info?

Thanks for any help!
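For reference, the relevant part of my input file looks roughly like this (reproduced from memory, so option names and the refinement count here are illustrative rather than exact):

```
[Mesh]
   [./Generation]
      dimension = '2'
      element_type = 'QUAD4'
      n_elems_x = '25'
      n_elems_y = '25'

   [../Refinement]
      # number of uniform refinements; the exact value varies between my runs
      uniformly_refine = '3'
[]
```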
- Boris

#0  poll_all_fboxes (cell=<optimized out>) at ../../src/mpid/ch3/channels/nemesis/include/mpid_nem_fbox.h:94
#1  MPID_nem_mpich_blocking_recv (cell=<optimized out>, in_fbox=<optimized out>, completions=<optimized out>) at ../../src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:1232
#2  PMPIDI_CH3I_Progress (progress_state=0x7ffcb5bff384, is_blocking=0) at ../../src/mpid/ch3/channels/nemesis/src/ch3_progress.c:589
#3  0x00007f1596f6ca73 in MPIC_Sendrecv (sendbuf=0x7ffcb5bff384, sendcount=0, sendtype=0, dest=1, sendtag=5, recvbuf=0x7, recvcount=7, recvtype=1275069445, source=2, recvtag=7, comm_ptr=0x7f159792e780 <MPID_Comm_builtin>, status=0x7ffcb5bff458, errflag=0x7ffcb5bff5f8) at ../../src/mpi/coll/helper_fns.c:268
#4  0x00007f1596d6297f in MPIR_Allgather_intra (sendbuf=0x7ffcb5bff384, sendcount=0, sendtype=0, recvbuf=0x1, recvcount=5, recvtype=7, comm_ptr=0x5265c, errflag=0x1) at ../../src/mpi/coll/allgather.c:257
#5  0x00007f1596d6548e in PMPI_Allgather (sendbuf=0x7ffcb5bff384, sendcount=0, sendtype=0, recvbuf=0x1, recvcount=5, recvtype=7, comm=-1245709088) at ../../src/mpi/coll/allgather.c:858
#6  0x00007f15a1491d17 in libMesh::Parallel::Sort<std::pair<Hilbert::HilbertIndices, unsigned long>, unsigned int>::communicate_bins() () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#7  0x00007f15a149d029 in libMesh::Parallel::Sort<std::pair<Hilbert::HilbertIndices, unsigned long>, unsigned int>::sort() () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#8  0x00007f15a1312d17 in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::const_element_iterator>(libMesh::Parallel::Communicator const&, libMesh::BoundingBox const&, libMesh::MeshBase::const_element_iterator const&, libMesh::MeshBase::const_element_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> >&) const () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#9  0x00007f15a14b2281 in libMesh::ParmetisPartitioner::initialize(libMesh::MeshBase const&, unsigned int) () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#10 0x00007f15a14b3d2d in libMesh::ParmetisPartitioner::_do_repartition(libMesh::MeshBase&, unsigned int) () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#11 0x00007f15a14bc2fe in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#12 0x00007f15a12f05c4 in libMesh::MeshBase::prepare_for_use(bool, bool) () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#13 0x00007f15a137c7e0 in libMesh::MeshRefinement::uniformly_refine(unsigned int) () from /projects/academic/pbauman/borisbou/software/planex/libmesh/install/lib/libmesh_opt.so.0
#14 0x00007f15a230390b in GRINS::MeshBuilder::do_mesh_refinement_from_input(GetPot const&, libMesh::Parallel::Communicator const&, libMesh::UnstructuredMesh&) const () from /projects/academic/pbauman/borisbou/software/planex/grins/install/opt/lib/libgrins.so.0
#15 0x00007f15a230614c in GRINS::MeshBuilder::build(GetPot const&, libMesh::Parallel::Communicator const&) () from /projects/academic/pbauman/borisbou/software/planex/grins/install/opt/lib/libgrins.so.0
#16 0x00007f15a231e86d in GRINS::SimulationBuilder::build_mesh(GetPot const&, libMesh::Parallel::Communicator const&) () from /projects/academic/pbauman/borisbou/software/planex/grins/install/opt/lib/libgrins.so.0
#17 0x00007f15a2311bfc in GRINS::Simulation::Simulation(GetPot const&, GetPot&, GRINS::SimulationBuilder&, libMesh::Parallel::Communicator const&) () from /projects/academic/pbauman/borisbou/software/planex/grins/install/opt/lib/libgrins.so.0
#18 0x0000000000407312 in main ()
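P.S. In case it helps anyone reproduce the diagnosis, this is roughly how I collect those backtraces on the compute nodes once the job hangs ('grins' is a placeholder for my actual executable name):

```
# Attach gdb briefly to every matching MPI rank on this node and dump
# backtraces for all threads. Requires ptrace permission on the node.
for pid in $(pgrep -f grins); do
    echo "==== pid $pid ===="
    gdb -batch -p "$pid" -ex "thread apply all bt"
done
```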