From: Boris B. <bor...@bu...> - 2017-11-09 15:34:09
Well, I eliminated PETSc and have been linking to MPI using --with-mpi=$MPI_DIR, and playing with the refinement example I mentioned earlier to try to eliminate ParMETIS due to the hang/crash issue. In these configs I attach either the LinearPartitioner or an SFC partitioner in prepare_for_use, right before calling partition(). This trips asserts in MeshCommunication::redistribute where elem.proc_id != proc_id while unpacking elems (stack below). These asserts trigger at slightly smaller square meshes than in the original issue: the SFC partitioner trips with 3^2 initial elems, the linear one with 4^2.

At this point I wasn't sure about the MPI-partitioner support. Is attaching a partitioner OK in prepare_for_use, or is there some setup stage I'm missing? If attaching there is fine, then very little touches the mesh before this point, so it seems something is already off at _refine_elements(), since all of this appears to be separate from the partitioner. I tried investigating some of the make_elems_parallel_consistent calls and the libmesh_assert_valid_parallel_ids() call right after, but so far no luck.

One nagging issue I have is with our use of PETSc's flags for MPI. In the PETSc build scripts I was originally using, we had to pass an extra -lpmi to the PETSc LDFLAGS on the local cluster. The recent gcc 7.2/MVAPICH2 upgrade came with pmi2, which I am also supposed to pass to Slurm, so in the PETSc builds I now supply -lpmi2. On the standalone MPI builds I tried exporting libmesh_LDFLAGS and libmesh_LIBS to link against this library, but I was not sure it was picked up, as -lpmi2 didn't show in libmesh_optional_LIBS in the configure summaries like it does when linking through PETSc. I'm quite unfamiliar with this pmi library in general, but I still have lingering fears that all of this could somehow stem from it.

Thanks for any info you can provide,
Boris

Stack Trace
=======
#0 __cxxabiv1::__cxa_throw (obj=obj@entry=0x9040e0, tinfo=0x407e68 <typeinfo for libMesh::LogicError>, tinfo@entry=0x7ffff76337b0 <typeinfo for libMesh::LogicError>, dest=0x403250 <libMesh::LogicError::~LogicError()>, dest@entry=0x7ffff6259370 <libMesh::LogicError::~LogicError()>) at ../../../../gcc/libstdc++-v3/libsupc++/eh_throw.cc:75
#1 0x00007ffff6a98f8a in libMesh::Parallel::Packing<libMesh::Elem*>::unpack<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, libMesh::MeshBase> (in=..., mesh=mesh@entry=0x64a0e0) at ../source/src/parallel/parallel_elem.C:474
#2 0x00007ffff6a995c5 in libMesh::Parallel::Packing<libMesh::Elem*>::unpack<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, libMesh::DistributedMesh> (in=..., in@entry=987654321, mesh=mesh@entry=0x64a0e0) at ../source/src/parallel/parallel_elem.C:814
#3 0x00007ffff688f8ab in unpack_range<libMesh::DistributedMesh, unsigned long, libMesh::mesh_inserter_iterator<libMesh::Elem>, libMesh::Elem*> (out_iter=..., context=<optimized out>, buffer=std::vector of length 195733, capacity 195733 = {...}) at ./include/libmesh/parallel_implementation.h:607
#4 libMesh::Parallel::Communicator::receive_packed_range<libMesh::DistributedMesh, libMesh::mesh_inserter_iterator<libMesh::Elem>, libMesh::Elem*> (this=0x649188, src_processor_id=src_processor_id@entry=4294967294, context=context@entry=0x64a0e0, out_iter=out_iter@entry=..., output_type=output_type@entry=0x0, tag=...) at ./include/libmesh/parallel_implementation.h:2761
#5 0x00007ffff687d70e in libMesh::MeshCommunication::redistribute (this=this@entry=0x7fffffff9cdf, mesh=..., newly_coarsened_only=newly_coarsened_only@entry=false) at ../source/src/mesh/mesh_communication.C:500
#6 0x00007ffff67ef2f2 in libMesh::DistributedMesh::redistribute (this=0x64a0e0) at ../source/src/mesh/distributed_mesh.C:835
#7 0x00007ffff6acb1e6 in libMesh::Partitioner::partition (this=<optimized out>, mesh=..., n=<optimized out>) at ../source/src/partitioning/partitioner.C:85
#8 0x00007ffff685aa6e in libMesh::MeshBase::partition (this=this@entry=0x64a0e0, n_parts=2) at ../source/src/mesh/mesh_base.C:485
#9 0x00007ffff685f8fb in partition (this=0x64a0e0) at ./include/libmesh/mesh_base.h:728
#10 libMesh::MeshBase::prepare_for_use (this=0x64a0e0, skip_renumber_nodes_and_elements=skip_renumber_nodes_and_elements@entry=false, skip_find_neighbors=skip_find_neighbors@entry=false) at ../source/src/mesh/mesh_base.C:273
#11 0x00007ffff6938b02 in libMesh::MeshRefinement::uniformly_refine (this=this@entry=0x7fffffffa9e0, n=5) at ../source/src/mesh/mesh_refinement.C:1723
#12 0x00007ffff7a8abc5 in GRINS::MeshBuilder::do_mesh_refinement_from_input (this=this@entry=0x646820, input=..., comm=..., mesh=...) at ../../source/src/solver/src/mesh_builder.C:393
#13 0x00007ffff7a8bc6d in GRINS::MeshBuilder::build (this=0x646820, input=..., comm=...) at ../../source/src/solver/src/mesh_builder.C:167
#14 0x00007ffff7aa20ed in GRINS::SimulationBuilder::build_mesh (this=this@entry=0x7fffffffb108, input=..., comm=...) at ../../source/src/solver/src/simulation_builder.C:68
#15 0x00007ffff7a947d2 in GRINS::Simulation::Simulation (this=0x89b910, input=..., sim_builder=..., comm=...) at ../../source/src/solver/src/simulation.C:123
#16 0x00007ffff7acdae2 in GRINS::Runner::init (this=this@entry=0x7fffffffb100) at ../../source/src/solver/src/runner.C:59
---Type <return> to continue, or q <return> to quit---
#17 0x0000000000402c9d in main (argc=<optimized out>, argv=<optimized out>) at ../../source/src/apps/grins.C:31

On Thu, Nov 9, 2017 at 9:07 AM, Roy Stogner <roy...@ic...> wrote:

>
> On Mon, 6 Nov 2017, Boris Boutkov wrote:
>
>> In some preliminary testing I encountered issues with the
>> LinearPartitioner,
>
> Could you be more specific? That partitioner is dead simple, so I
> wouldn't have expected to see many bugs, but it's also awful, so if
> there were many bugs there's probably been nobody to use it and
> encounter them for a decade.
>
>> and the SFCPartitioner complained it wasnt enabled despite me
>> configuring using --enable-everything. Any ideas if theres anything
>> simple I could have forgotten?
>
> Yeah: the SFC partitioner isn't under an LGPL-friendly license, so
> unless you add --disable-strict-lgpl to your configure line, it still
> gets dropped for that reason.
> ---
> Roy
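
For readers unfamiliar with what a "dead simple" linear partitioner does, the general idea can be sketched as below. This is a minimal standalone illustration and not libMesh's LinearPartitioner code: the function name linear_partition is hypothetical, and the way the remainder is spread over the first few processors is an assumption made for the example. It only shows the technique of handing out contiguous blocks of element ids.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of libMesh): assign element ids
// 0..n_elem-1 to n_parts processors in contiguous blocks.
// Assumption for this sketch: when n_elem is not divisible by n_parts,
// the first (n_elem % n_parts) processors each take one extra element.
std::vector<unsigned int> linear_partition(std::size_t n_elem,
                                           unsigned int n_parts)
{
  assert(n_parts > 0);  // precondition: at least one processor

  std::vector<unsigned int> owner(n_elem, 0);
  const std::size_t base  = n_elem / n_parts;   // minimum block size
  const std::size_t extra = n_elem % n_parts;   // leftover elements

  std::size_t e = 0;  // next element id to assign
  for (unsigned int p = 0; p < n_parts; ++p)
    {
      const std::size_t block = base + (p < extra ? 1 : 0);
      for (std::size_t i = 0; i < block; ++i)
        owner[e++] = p;
    }
  return owner;
}
```

With 16 elements (a 4^2 mesh) on 2 processors, this assigns ids 0-7 to processor 0 and ids 8-15 to processor 1. Element numbering alone decides ownership, which is why such a partition is cheap to compute but usually poor in terms of communication.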