I like (C) but agree it might be tricker to get right and worry about some corner cases.
I haven't had a chance to reproduce the issue, but I wonder if we could reproduce this with two coordinates and libHilbert directly with 32bit ints? Then submit it to Chris and see if he has any suggestions.
I'd have thougt using the first 32bits would be safe and not cause this issue, and apparently that's the case the vast majority of the time. So is this just an unlucky coincidence or a bug in the hilbert code?
On May 19, 2012, at 11:03 PM, "Roy Stogner" <roystgnr@...> wrote:
> On Fri, 18 May 2012, Kirk, Benjamin (JSC-EG311) wrote:
>> Damn - this sounds subtle, and good catch. I'll have a look but it
>> may be mid week before I can do anything useful.
> So here's the bug:
> This appears to be the *same* libHilbert bug that was hitting us a
> couple years ago. As best as I can tell, the explanation here is that
> I'm utterly incompetent.
> It's been a few years, but my best guess at what happened years ago
> was that I tested upgrading libHilbert to 64-bit FixBitVec types, it
> corrected the problem, I reenabled libHilbert by default and committed
> configure* to trunk, and I *forgot* to commit the 32-bit->64-bit
> change to trunk. So my tests suggested things were fixed, this bug
> crops up so rarely that nobody else noticed it, and it just got left
> So do we just reenable the change now? Well, I'm feeling more
> paranoid and did more testing, and here's the problems:
> If we continue using double in our get_hilbert_coords() helper
> function, but we switch to 64-bit Hilbert indices, then floating point
> arithmetic is simply too fragile to reliably give us correct results.
> I'll upload some test cases if people are curious about the details,
> but basically I can get different coordinates (and in some cases
> grossly wrong coordinates or more duplicated coordinates) by switching
> compilers. Unless we want to risk solution output being corrupted by
> a bad compiler or solution input being corrupted by a changed
> compiler, this isn't an option.
> Long double seems to give reliable results, but that's on a platform
> with 80-bit long doubles - the C/C++ standards allow "long double" ==
> "double", and IIRC on many platforms that's all you get. Likewise
> 128-bit FP formats are pretty non-portable, so if we want reliable
> 64-bit Hilbert indices we're going to need to use MPFR or some other
> extended-precision software implementation.
> But here's the real kicker: even with long double, changing from
> 32-bit to 64-bit libHilbert renders old solution files *incompatible*
> with new ones. In theory we ought to just be getting a finer-grained
> approximation to the same space-filling curve; in practice I'm seeing
> regression test failures.
> So I see three solutions here.
> (A) we augment the 32-bit libHilbert sorting with some subsorting that
> distinguishes duplicate indices, via something a simple as a tacked-on
> lexicographical order sort. This would keep compatibility with old
> files and require no API changes.
> (B) we switch to 64-bit libHilbert bitvectors plus MPFR floating point
> in get_hilbert_coords(). This is a bit more efficient and a lot less
> ugly than (A), but it probably breaks all our current solution files.
> It would also fail for new solutions in the case of a mesh that's
> ridiculously h-refined (e.g. 64 levels) at the origin point. That's
> something we don't even support now, and it's (literally) a corner
> case, but I've seen such a mesh used with another code for an EM
> problem with a ridiculously strong singularity.
> (C) we stop trying to enforce *any* ordering on our solution files, and
> we just assume that the solution file and corresponding mesh file were
> ordered consistently. This might be trickier to implement (Do we just
> hold off on *any* initial renumbering of any mesh? We'll need to fix
> some MeshCommunications stuff to handle a ParallelMesh with
> discontiguous DofObjects) but I think it's the right way to go in the
> long run. We'd want to keep writing space-filling-curve sorted files
> (using either method (A) or (B)) by default, to improve the initial
> partitioning in N-to-M restarts, but this way we wouldn't have to do
> any sorting at input time and we'd have total compatibility with old
> libMesh files, regardless of whether libHilbert was 32-bit, 64-bit, or
> completely disabled.
> I'm obviously leaning toward implementing (A) right away, then adding
> (C) if I ever find the time, but other opinions would be welcome.