From: Roy S. <roy...@ic...> - 2012-05-30 23:33:34
While trying to track down a bug in my next ParallelMesh patch, I ran
into an apparent bug in our Parmetis support that dates back for I
don't know how far.

We already serialize and fall back on Metis when we have more
processors than elements, and according to our code comments N_e<N_p
is simply an unsupported configuration for Parmetis... but I'm seeing
crashes (segfaults and/or double-frees) in quite a few other cases.
Running adaptivity_ex1 with 1 through 20 elements on 1 through 15
processors gives failures whenever num_proc:num_elem is in

3:3
4:4
5:5,6,7,8
6:6,7,8
7:7,8,9
8:8,9,10,11
9:9,10,11,12
10:10,11,12,13,14,15,16,17
11:11,12,13,14,15,16,17
12:12,13,14,15,16,17
13:13,14,15,16,17,18,19
14:14,15,16,17,18,19,20
15:15,16,17,18,19,20

Not sure what the pattern is other than "1 or 2 procs always works,
otherwise 'N_e==N_p + {0,...n}' always fails for some n >= 0"

Since n seems to be N_p dependent we can't just increase the "fall
back on Metis" threshold.  Anyone know enough about Parmetis to say
what might be going on here?  It's not like we're giving it some
tricky topology, either; adaptivity_ex1 is just trying to
prepare_for_use the result of a build_line when it dies.
---
Roy
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-05-30 23:47:27
I do know enough to say our bundled parmetis is quite crusty, and it
may be worth upgrading that before getting too far inside it...

-Ben

On May 30, 2012, at 7:33 PM, "Roy Stogner" <roy...@ic...> wrote:

> While trying to track down a bug in my next ParallelMesh patch, I ran
> into an apparent bug in our Parmetis support that dates back for I
> don't know how far.
> ...
From: Roy S. <roy...@ic...> - 2012-05-31 01:01:39
On Wed, 30 May 2012, Kirk, Benjamin (JSC-EG311) wrote:

> I do know enough to say our bundled parmetis is quite crusty, and it
> may be worth upgrading that before getting too far inside it...

We're using 3.1 from 2003, the latest stable version is 4.0.2 from
2011... yeah, I'll try bumping us up to that first.

Thanks,
---
Roy
From: Roy S. <roy...@ic...> - 2012-06-03 05:31:36
On Wed, 30 May 2012, Roy Stogner wrote:

> While trying to track down a bug in my next ParallelMesh patch, I ran
> into an apparent bug in our Parmetis support that dates back for I
> don't know how far.
> ...

Sadly, we seem to be getting the very same failure pattern with the
current ParMETIS.
---
Roy
From: John P. <jwp...@gm...> - 2012-06-04 13:39:35
On Sat, Jun 2, 2012 at 11:31 PM, Roy Stogner <roy...@ic...> wrote:

> Sadly, we seem to be getting the very same failure pattern with the
> current ParMETIS.

Another sad thing: new Metis apparently doesn't build on Macs?  Maybe
it's just my Mac?

I'm looking into this, but before I waste too much time, has anyone
seen this error before?

--- Building Metis ---------------------------
Compiling C (in optimized mode) b64.c...
In file included from GKlib.h:66,
                 from b64.c:20:
./gk_externs.h:19: error: thread-local storage not supported for this target
./gk_externs.h:20: error: thread-local storage not supported for this target
./gk_externs.h:21: error: thread-local storage not supported for this target
make[1]: *** [b64.x86_64-apple-darwin10.8.0.opt.o] Error 1
make: *** [all] Error 2

--
John
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-06-04 14:09:12
> Another sad thing: new Metis apparently doesn't build on Macs?  Maybe
> it's just my Mac?
>
> ./gk_externs.h:19: error: thread-local storage not supported for this target
> ...

What do you get for LIBMESH_TLS in your include/base/libmesh_config.h?

We check for compiler support for __thread and define it as
LIBMESH_TLS - if your Mac compiler options don't support it then that
should be empty, in which case we should replace __thread with
LIBMESH_TLS in that header.

-Ben
From: John P. <jwp...@gm...> - 2012-06-04 14:12:21
On Mon, Jun 4, 2012 at 8:08 AM, Kirk, Benjamin (JSC-EG311)
<ben...@na...> wrote:

> What do you get for LIBMESH_TLS in your include/base/libmesh_config.h?

/* If the compiler supports a TLS storage class define it to that here */
/* #undef TLS */

> We check for compiler-support for __thread and define it as LIBMESH_TLS ...

OK, I'll see about making that change.  But does this imply Metis runs
threaded by default and/or requires TLS to work correctly?

--
John
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-06-04 14:15:29
> OK, I'll see about making that change.  But does this imply Metis runs
> threaded by default and/or requires TLS to work correctly?

I'll poke around - it is possible it is using OpenMP parallelism now,
but if so then I would expect to have __thread or a similar keyword
defined whenever OpenMP works.  Perhaps there is a clash of compiler
options or something...

-Ben
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-06-04 14:22:32
>> OK, I'll see about making that change.  But does this imply Metis runs
>> threaded by default and/or requires TLS to work correctly?
>
> I'll poke around - it is possible it is using OpenMP parallelism now...

Sure enough:

benkirk(18)$ grep pragma *.c
csr.c:  #pragma omp parallel private(i, j, ncand, rsum, tsum, cand)
csr.c:    #pragma omp for schedule(static)
csr.c:  #pragma omp parallel private(i, j, ncand, rsum, tsum, cand)
csr.c:    #pragma omp for schedule(static)
csr.c:  #pragma omp parallel if (n > 100)
csr.c:    #pragma omp single
csr.c:    #pragma omp for schedule(static)
csr.c:  #pragma omp parallel if (ptr[n] > OMPMINOPS)
csr.c:    #pragma omp for private(j,sum) schedule(static)
csr.c:  #pragma omp parallel if (ptr[n] > OMPMINOPS)
csr.c:    #pragma omp for private(j,sum) schedule(static)
csr.c:  #pragma omp parallel if (rowptr[nrows] > OMPMINOPS)
csr.c:    #pragma omp for private(j, maxtf) schedule(static)
...

Looks like Metis is using OpenMP solely in the compressed row storage
handling now.  No equivalent directives in Parmetis.

-Ben
From: John P. <jwp...@gm...> - 2012-06-04 14:51:16
On Mon, Jun 4, 2012 at 8:22 AM, Kirk, Benjamin (JSC-EG311)
<ben...@na...> wrote:

> Looks like Metis is using OpenMP solely in the compressed row storage
> handling now.  No equivalent directives in Parmetis.

I've attached a patch adding LIBMESH_TLS that gets libmesh to compile,
but I now get a runtime error from one of the examples:

Running: ./adaptivity_ex2-opt -read_solution -n_timesteps 25 -output_freq 10 -init_timestep 25
terminate called after throwing an instance of 'std::bad_cast'
  what():  std::bad_cast

Here's the stack trace:

0: libmesh.dylib 0x00000001006fdc1f libMesh::print_trace(std::ostream&) + 47
1: libmesh.dylib 0x00000001006ed39b libMesh::libmesh_terminate_handler() + 955
2: libstdc++.6.dylib 0x00007fff80f7cae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
3: libstdc++.6.dylib 0x00007fff80f7cb16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
4: libstdc++.6.dylib 0x00007fff80f7cbfc _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
5: libstdc++.6.dylib 0x00007fff80f7be6e __cxa_call_terminate + 0
6: libmesh.dylib 0x0000000100ae89f5 libMesh::Partitioner::set_node_processor_ids(libMesh::MeshBase&) + 4101
7: libmesh.dylib 0x0000000100aa6fe5 libMesh::XdrIO::read(std::string const&) + 1397
8: libmesh.dylib 0x0000000100a8d542 libMesh::UnstructuredMesh::read(std::string const&, libMesh::MeshData*, bool) + 4050
9: adaptivity_ex2-opt 0x00000001000055e0 main + 1792

Probably not a coincidence that it dies in a call to the Partitioner?

I'm compiling in DEBUG mode to get more information, but perhaps we
should just revert r5654 (or move it to an unstable branch) if it
didn't actually fix the problem of partitioning small meshes?

--
John
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-06-04 14:57:07
> Probably not a coincidence that it dies in a call to the Partitioner?
>
> I'm compiling in DEBUG mode to get more information, but perhaps we
> should just revert r5654 (or move it to an unstable branch) if it
> didn't actually fix the problem of partitioning small meshes?

I'm getting the same segfault without that patch and am investigating too...

Let's see if we can fix the issue on trunk relatively quickly before
reverting the long overdue update.

-Ben
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-06-04 16:36:11
> I'm getting the same segfault without that patch and am investigating too...
>
> Let's see if we can fix the issue on trunk relatively quickly before
> reverting the long overdue update.

Perhaps the bad cast is unrelated and may be because of some debugging
code that slipped through.  See line 409:

  ParallelMesh& pmesh = dynamic_cast<ParallelMesh&>(mesh);
  pmesh.libmesh_assert_valid_parallel_ids();

Of course that will fail if you're not running with a parallel mesh.
I'm guessing Roy tested it with a parallel mesh and all was happy.

I've changed the code to cast instead to a pointer.  If that fails it
returns a NULL pointer instead of throwing a runtime exception.

-Ben
From: Roy S. <roy...@ic...> - 2012-06-04 17:05:09
On Mon, 4 Jun 2012, Kirk, Benjamin (JSC-EG311) wrote:

>   ParallelMesh& pmesh = dynamic_cast<ParallelMesh&>(mesh);
>   pmesh.libmesh_assert_valid_parallel_ids();
>
> Of course that will fail if you're not running with a parallel mesh.
> I'm guessing Roy tested it with a parallel mesh and all was happy.

Oh hell, yes that's solely debugging code, yes it should fail on
SerialMesh, and yes I only tested it with ParallelMesh.

> I've changed the code to cast instead to a pointer.  If that fails it
> returns a NULL pointer instead of issuing a runtime exception.

No, just delete that code.  I actually fixed the bits being debugged
with it; it was just there to track down precisely where things went
wrong.

Thanks,
---
Roy
From: John P. <jwp...@gm...> - 2012-06-04 16:56:36
On Mon, Jun 4, 2012 at 10:36 AM, Kirk, Benjamin (JSC-EG311)
<ben...@na...> wrote:

> I've changed the code to cast instead to a pointer.  If that fails it
> returns a NULL pointer instead of issuing a runtime exception.

Indeed, I was configure'd without parallel mesh.

Your patch seems to have fixed the issue for me.

I can also check in the LIBMESH_TLS patch if you have a chance to test
it out on a linux box.....

--
John
From: John P. <jwp...@gm...> - 2012-06-04 17:22:12
On Mon, Jun 4, 2012 at 10:56 AM, John Peterson <jwp...@gm...> wrote:

> Indeed, I was configure'd without parallel mesh.
>
> Your patch seems to have fixed the issue for me.
>
> I can also check in the LIBMESH_TLS patch if you have a chance to test
> it out on a linux box.....

BTW, since Metis is using OpenMP (no raw pthreads that I can see) the
proper TLS mechanism is probably something like

  #pragma omp threadprivate(var)

I feel like my patch (which effectively makes those variables global)
will potentially break anyone actually running libmesh with
OMP_NUM_THREADS > 1.

Not sure how much work this would be to fix and test.  Should we just
notify the Metis developers?

--
John
From: Roy S. <roy...@ic...> - 2012-06-04 17:52:22
On Mon, 4 Jun 2012, John Peterson wrote:

> BTW, since Metis is using OpenMP (no raw pthreads that I can see) the
> proper TLS mechanism is probably something like
>
>   #pragma omp threadprivate(var)

OpenMP uses pthreads under the hood on POSIX systems, IIRC, so thread
local storage variables ought to work fine as well.  Probably safer for
the corner cases to use consistent idioms, though.

> I feel like my patch (which effectively makes those variables global)
> will potentially break anyone actually running libmesh with
> OMP_NUM_THREADS > 1.

Almost certainly.

> Not sure how much work this would be to fix and test.

To fix?  Your pragma (suitably replicated and made specific) will
probably do it.

To test?  We can't and shouldn't try to test that unmodified Metis is
thread-safe, but it might be easy enough to write fprintf debugging
code that makes sure the different threads are still getting
OpenMP-compatible TLS variables after our patch.

> Should we just notify the Metis developers?

Check and see if the unmodified parmetis (which includes the metis
files we upgraded to) works on Mac?  They might have some hack in
their build system, which I ignored and threw away.  If unmodified
parmetis doesn't build on Mac either, then definitely notify them.

Thanks,
---
Roy
From: John P. <jwp...@gm...> - 2012-06-04 19:00:09
On Mon, Jun 4, 2012 at 11:52 AM, Roy Stogner <roy...@ic...> wrote:

> Check and see if the unmodified parmetis (which includes the metis
> files we upgraded to) works on Mac?  They might have some hack in
> their build system, which I ignored and threw away.  If unmodified
> parmetis doesn't build on Mac either, then definitely notify them.

On a Mac, I followed their basic instructions (make config && make)
for compiling, and I see that they pass -D__thread= on the compile
line, effectively disabling the TLS stuff.  This would seem to be
equivalent to what my patch does.

Also, a cursory examination of the code suggests that the handful of
variables they marked __thread are not even used in the OpenMP parts
of csr.c.  Perhaps they were just trying to make the overall library a
bit more thread-safe by adding TLS?

Anyway, my opinion is we can probably go ahead with the patch removing
the __thread stuff as-is.

--
John
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-06-04 13:52:10
>> Sadly, we seem to be getting the very same failure pattern with the
>> current ParMETIS.

A telling comment from G.K. circa 2004: "How large is the graph?
ParMetis seems to have problems with small graphs"

http://www-users.cs.umn.edu/~karypis/.discus/messages/16/36.html

While that didn't end up being the cause for the particular problem in
that thread, I also came across some known issues reported by the
Zoltan developers:

http://www.tddft.org/trac/octopus/browser/trunk/external_libs/zoltan/Known_Problems?rev=7107

which suggest metis and parmetis are both fragile when any partition
winds up with 0 objects.

A more comprehensive, but still hackish, workaround may be to fall
back to metis or even something else when NE/NP < 10 or something
reasonable where we would expect to avoid this 0-objects-per-partition
problem.

-Ben