From: Stogner, R. H <roy...@ic...> - 2019-01-30 17:07:56
On Wed, 30 Jan 2019, Li Luo wrote:

> I am using libMesh for large scale parallelization. To enable the usage of
> 65536 processor cores, the options
> --with-dof-id-bytes=8 --with-processor-id-bytes=4
> --with-subdomain-id-bytes=4
> are already used for configuration.
>
> However, the code 'sticks' in the following function:
> Parallel::Sort<Hilbert::HilbertIndices> sorter (communicator,
> sorted_hilbert_keys);
> sorter.sort();

So, this looks horribly suspicious. In parallel_sort.h:52, where it
defaults "IdxType=unsigned int", would you try "IdxType=dof_id_type"
instead?

That might be a red herring (the problem here would be sorting at
least 2^32 objects, not sorting them on at least 2^16 processors) but
it sure looks like a bug to me and there's at least a chance it's the
bug affecting you.

> in the routine MeshCommunication::find_global_indices (in
> file mesh_communication_global_indices.C), which is called from routine
> Partitioner::partition_unpartitioned_elem (in file partitioner.C).
> Since libMesh calls libHilbert for this sort function, is there anything
> should be noticed for the configuration of libHilbert when using large
> scale parallelization?

Quite possibly. We currently have libHilbert set to use 32-bit
integers internally. That should be fine in theory (coordinates get
identified by triples of integers, and if you're using unique_id then
that disambiguates any contiguous nodes). But cranking that up to 64
in contrib/libHilbert/include/Hilbert/FixBitVec.hpp would be what I'd
suggest as Plan B.

> Is that possible not to use libHilbert? If so, any efficiency
> degenerates?

You lose the ability to do N->M restarts with xdr/xda EquationSystems
output, and you lose compatibility of xdr/xda EquationSystems output
between with- and without-libHilbert libMesh compiles...

No efficiency loss, though, and I don't think either of those features
can scale up to your processor count anyway, so trying with libHilbert
disabled (it's a configure option) should be your Plan C.

There might also be an inadvertent libHilbert dependency somewhere
when using "slit meshes" or anything else that gives multiple
topologically distinct nodes the exact same geometric coordinates - I
don't think this is the case but it's a possible bug to watch out for.

Thanks for the bug report, and please keep us up to speed with what
works or fails to fix it!
---
Roy
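
For anyone reading this thread later, here is a minimal sketch of why a 32-bit index type only becomes a problem at 2^32 objects rather than at 2^16 processors. This is not libMesh code: global_count and index_overflows are hypothetical helpers invented for the illustration, and the per-processor counts are made-up numbers chosen so the total lands exactly on 2^32.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not part of libMesh): accumulate per-processor object
// counts into a counter of type IdxType, as a parallel sort must do when it
// computes global offsets.
template <typename IdxType>
IdxType global_count(const std::vector<std::uint64_t>& local_counts)
{
  IdxType total = 0;
  for (std::uint64_t n : local_counts)
    total += static_cast<IdxType>(n); // unsigned arithmetic wraps silently
  return total;
}

// Hypothetical helper: true if n_objects cannot be represented in IdxType.
template <typename IdxType>
bool index_overflows(std::uint64_t n_objects)
{
  return n_objects > std::numeric_limits<IdxType>::max();
}
```

With 65536 processors each holding 65536 objects, global_count<std::uint32_t> wraps around to 0 while global_count<std::uint64_t> returns 4294967296. With fewer objects per processor the 32-bit counter is still correct, which is why the processor count alone would not trigger this.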