I've got a problem with 1000+ variables in it (meaning 1000+ DoFs per node with Lagrange).  If I do normal domain decomposition, I end up with only about 50 nodes per processor in order to get down to ~50000 DoFs per processor.  Unfortunately, that means that for any given variable there are only about 50 DoFs for that variable on a processor... which appears to be causing _extremely_ poor preconditioning (even with AMG from Hypre).
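
To put numbers on that, here is a quick sketch of the arithmetic. It assumes exactly 1000 variables and a 50000-DoF target (the round figures above) and one DoF per variable per node; the function name is made up for illustration.

```python
def dofs_per_processor(n_vars, target_dofs_per_proc):
    """Counts under node-based domain decomposition,
    assuming one DoF per variable per node."""
    nodes_per_proc = target_dofs_per_proc // n_vars
    return {
        "nodes_per_proc": nodes_per_proc,
        "dofs_per_proc": nodes_per_proc * n_vars,
        # Each local node contributes one DoF for any given variable.
        "dofs_per_variable_per_proc": nodes_per_proc,
    }

stats = dofs_per_processor(n_vars=1000, target_dofs_per_proc=50_000)
print(stats)
```

So each variable's own sub-problem is only about 50 unknowns wide on any one processor.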

What this case really needs is decomposition _by variable_.  So if you had 1000 processors, each one would take the part of the problem corresponding to one variable.  This would let you form great block-diagonal preconditioners, which is what this problem needs: all of the variables are coupled, but the diagonal blocks (each variable coupled to itself) dominate.
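
The idea can be tried on a toy matrix before touching any real code. The sketch below is only an illustration under made-up assumptions: a 1D Laplacian stencil stands in for each variable's own operator, a small `eps` couples every variable to every other variable at the same node, and `block_jacobi` and `preconditioned_cond` are hypothetical helpers, not from any library. It builds block-Jacobi preconditioners for the same matrix under the two ownership maps (by node and by variable) and compares the condition number of the preconditioned operator.

```python
import numpy as np

n_nodes, n_vars, n_procs = 20, 4, 4
eps = 1e-3  # assumed weak cross-variable coupling strength

# Toy per-variable operator: 1D Laplacian stencil on the nodes.
K = 2.0 * np.eye(n_nodes) - np.eye(n_nodes, k=1) - np.eye(n_nodes, k=-1)
# Every variable is coupled to every other variable at the same node.
C = np.ones((n_vars, n_vars)) - np.eye(n_vars)
# Variable-major numbering: dof = var * n_nodes + node.
A = np.kron(np.eye(n_vars), K) + eps * np.kron(C, np.eye(n_nodes))

dof = np.arange(n_nodes * n_vars)
var_of = dof // n_nodes
node_of = dof % n_nodes

# Two ways to assign DoFs to processors.
owner_by_node = node_of // (n_nodes // n_procs)  # usual domain decomposition
owner_by_var = var_of % n_procs                  # decomposition by variable

def block_jacobi(A, owner):
    """Keep only the entries coupling DoFs owned by the same processor."""
    same = owner[:, None] == owner[None, :]
    return np.where(same, A, 0.0)

def preconditioned_cond(A, M):
    """Condition number of M^{-1} A for SPD A and M, via L^{-1} A L^{-T}."""
    L = np.linalg.cholesky(M)
    X = np.linalg.solve(L, A)
    S = np.linalg.solve(L, X.T).T
    ev = np.linalg.eigvalsh(0.5 * (S + S.T))
    return ev[-1] / ev[0]

cond_node = preconditioned_cond(A, block_jacobi(A, owner_by_node))
cond_var = preconditioned_cond(A, block_jacobi(A, owner_by_var))
print("block-Jacobi by node:    ", cond_node)
print("block-Jacobi by variable:", cond_var)
```

On this toy setup the by-variable blocks should give a condition number close to 1, while the by-node blocks leave it much larger. How far that carries over to the real problem depends on how strongly the diagonal blocks actually dominate.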

I'm pretty sure we're far off from being able to do that... but I thought I would ping you guys to see what you thought about the idea.  Any thoughts on how doable that would be?