Anyone sleep on this and come up with any ideas to try?

One thing to note is that we are actually reading the mesh on every processor still (because of the block / sideset naming stuff that Cody only recently fixed).  Do you believe that could be part of the problem?

Currently I can't run on more than about 1,700 procs without hitting this hang.  I'm recompiling with a new version of mvapich, and I'm hoping that fixes it, but I'd like to know if there is anything else I can try...

Derek



On Tue, Apr 9, 2013 at 8:27 PM, Derek Gaston <friedmud@gmail.com> wrote:
Another data point... the job starts fine on half as many procs....

Derek


On Tue, Apr 9, 2013 at 8:26 PM, Derek Gaston <friedmud@gmail.com> wrote:
Is there any way to disable the Hilbert stuff for now?  With a serial mesh, can we just take the global numbering from the node numbering?
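
To make the question concrete, here is a rough sketch of the kind of thing I mean.  It assumes every processor holds the same full list of unique ids (as on a serial mesh), so each one can compute the same contiguous numbering locally with no send_receive at all.  The helper name contiguous_indices_from_ids is made up for illustration and is not a libMesh function; whether the partitioner could actually use a numbering like this in place of the Hilbert ordering is an open question.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (NOT part of libMesh): map each entry of `ids` to its
// rank in sorted order, giving a contiguous 0..N-1 numbering.
// Assumes the ids are unique and that every processor holds the same list,
// as it would for a serial (fully replicated) mesh.
std::vector<unsigned int>
contiguous_indices_from_ids(const std::vector<unsigned int>& ids)
{
  // order[k] is the position in `ids` of the k-th smallest id.
  std::vector<std::size_t> order(ids.size());
  std::iota(order.begin(), order.end(), std::size_t(0));
  std::sort(order.begin(), order.end(),
            [&ids](std::size_t a, std::size_t b) { return ids[a] < ids[b]; });

  // Invert the permutation: index[i] is the rank of ids[i].
  std::vector<unsigned int> index(ids.size());
  for (std::size_t k = 0; k < order.size(); ++k)
    index[order[k]] = static_cast<unsigned int>(k);
  return index;
}
```

Since the result depends only on the ids, every processor would get the same answer independently, but it would lose whatever spatial locality the Hilbert ordering provides.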


On Tue, Apr 9, 2013 at 8:21 PM, Derek Gaston <friedmud@gmail.com> wrote:
serial


On Tue, Apr 9, 2013 at 8:21 PM, Kirk, Benjamin (JSC-EG311) <benjamin.kirk-1@nasa.gov> wrote:
Serial or parallel mesh?



On Apr 9, 2013, at 9:16 PM, "Kirk, Benjamin (JSC-EG311)" <benjamin.kirk-1@nasa.gov> wrote:

> Hmm - I'll look through that section of code tomorrow morning and see if there could possibly be any mismatched send/receives or anything.
>
> -Ben
>
> On Apr 9, 2013, at 8:48 PM, "Derek Gaston" <friedmud@gmail.com> wrote:
>
>> Hey guys,
>>
>> I've got a fairly large job (>3500 procs) that is hanging while trying to set up the mesh.  The procs are stuck in 2 separate places.  ~Half of them are here:
>>
>> #35 0x00002b1746aea7f0 in libMesh::Parallel::Communicator::send_receive<Hilbert::HilbertIndices> () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #36 0x00002b1746afc17b in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::MeshTools::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> >&) const () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #37 0x00002b1746c78528 in libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #38 0x00002b1746c7b3e5 in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #39 0x00002b1746ad1d32 in libMesh::MeshBase::prepare_for_use(bool) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>>
>>
>> And the other ~half are here:
>>
>> #6  0x00002ba5c95c8b90 in libMesh::Parallel::Communicator::send_receive<unsigned int> () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #7  0x00002ba5c95da2e2 in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::MeshTools::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> >&) const () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #8  0x00002ba5c9756528 in libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>> #9  0x00002ba5c97593e5 in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>>
>>
>>
>> Obviously they are in slightly different spots...
>>
>>
>> Any ideas on what's going on here or where to start looking?
>>
>> I was intermittently getting weird errors around this point from mvapich, so I tried switching to OpenMPI... and now it hangs here.
>>
>> The mesh itself isn't enormous... it's only about 1 million nodes or so.  We've definitely done more than this before.
>>
>> Thanks in advance for any advice!
>>
>> Derek
>> ------------------------------------------------------------------------------
>> Precog is a next-generation analytics platform capable of advanced
>> analytics on semi-structured data. The platform includes APIs for building
>> apps and a phenomenal toolset for data science. Developers can use
>> our toolset for easy data analysis & visualization. Get a free account!
>> http://www2.precog.com/precogplatform/slashdotnewsletter
>> _______________________________________________
>> Libmesh-devel mailing list
>> Libmesh-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/libmesh-devel