#895 Problems with thread related functions

closed-out-of-date
None
5
2014-11-25
2013-03-15
No

Dear dev,

Thanks a lot for the great software; it's been more than crucial for Sage performance.

We're currently trying to update our old ATLAS 3.8.4 to 3.10.1 and are encountering problems on "unusual" systems because some thread-related functions stay undefined.
As we use shared libraries, linking to the produced libraries then fails.
See:
* http://trac.sagemath.org/sage_trac/ticket/10508
for lots of comments on this and progress on our side.

More precisely, ATL_DecAtomicCount and two other functions try by default to use assembly implementations (if I have correctly understood this small part of the ATLAS tuning and build system); they should fall back to C implementations when the assembly counterpart is not available.
On Debian/sparc, for instance, this is the case for ATL_DecAtomicCount.
In addition, threaded libraries do not get built.
The problem is that the non-threaded library produced wants to use some function which calls ATL_DecAtomicCount, so loading it fails.
You can find a log of such a build on Debian sparc:
* http://boxen.math.washington.edu/home/jpflori/atlas-3.10.1.p0-gcc63-sparc64-32.log

Note that rerunning make and then make as suggested here:
* https://sourceforge.net/p/math-atlas/bugs/170/#6a80
seems to correctly build the needed object files on the Debian/sparc.

Also note that passing "-t 0" to configure to disable threading still defines functions that reference these functions (although it won't try to build libpt. libraries), but does not define these functions themselves.
Unfortunately, it seems the former still get called, so this cannot be used as a workaround.
Please note this breaks the "-t 0" build for us not only on Debian/sparc, but also on Debian/ppc, Debian/ia64 and Debian/amd64.
For example, we are able to build numpy on top of ATLAS/LAPACK, but then Scipy wants to load the linalg part of numpy and it fails:
* http://trac.sagemath.org/sage_trac/ticket/10508#comment:356
It seems it was already reported here:
* https://sourceforge.net/p/math-atlas/support-requests/855/#d87a
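
A quick way to reproduce this kind of load failure without going through numpy or Scipy is to try opening the shared library directly. The sketch below uses Python's ctypes for that. The helper name `try_load` and the example path are illustrative assumptions, not part of ATLAS or Sage.

```python
import ctypes
import os


def try_load(path):
    """Try to open a shared library, resolving all symbols immediately.

    Returns (True, None) on success, or (False, message) if loading fails,
    for example because of an undefined symbol.
    """
    # RTLD_NOW asks the dynamic loader to resolve every symbol at load time.
    mode = getattr(os, "RTLD_NOW", 0)
    try:
        ctypes.CDLL(path, mode=mode)
    except OSError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # Hypothetical path; point this at the library that was actually built.
    ok, message = try_load("/path/to/libatlas.so")
    print("loaded" if ok else "failed: " + message)
```

When loading fails, the message from the dynamic loader usually names the first symbol it could not resolve.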

On Solaris we also had to apply a fix similar to what is mentioned on ATLAS errata page and replace "test -e" by "test -f" in two places (not the one mentioned on the errata page).
See:
* http://trac.sagemath.org/sage_trac/ticket/10508#comment:373
Also note that on Solaris sparc the fallback code was correctly picked up.

Finally, for completeness, it seems we still get a segfault on ia64, maybe related to what we reported here:
* https://sourceforge.net/p/math-atlas/support-requests/846/
See:
* http://trac.sagemath.org/sage_trac/ticket/10508#comment:366
for more details.
Please also note that this last problem does not appear on "every" ia64 machine: I did not have problems on gcc60 and gcc66.

Finally, I gave ATLAS 3.11.8 a shot on Debian/amd64 and got a problem with a different undefined symbol, ATL_sammm; see:
* http://boxen.math.washington.edu/home/jpflori/atlas-3.11.8.p0-ubuntu-amd64.log
A quick inspection of the log and of the way ATLAS builds suggests that something segfaulted or at least errored out (look for "make[3]: *** [res/seAMMRES.sum] Erreur 255") and that, as a result, ATL_sammm.o was not built.

I'll be pleased to provide additional info on request.
Thanks in advance for your feedback and thanks again for the great software!

Best,
JP

Discussion

  • R. Clint Whaley

    R. Clint Whaley - 2013-03-25

    OK, deleted the 3.11 missing symbol post, since that is not relevant to this 3.10 problem.

    I have a hard time reading your special "everything dumped in continuous output" logfiles. Any chance I can get the standard ATLAS error files?

    Is this problem specific to dynamic libs, or do you have similar problems with static builds?

    I know you don't use static libs yourself, but if we can reproduce the error there, I have a much better shot of fixing, since I don't really understand the dynamic stuff well.

    Thanks,
    Clint

     
    • Jean-Pierre Flori

      Hi,

      I issued "make error_report" in ATLAS build dir.
      The result tar.bz2 is at http://boxen.math.washington.edu/home/jpflori/error_USIV32.tar.bz2

      I hope this is what you wanted.

      I'll try to link everything to a static ATLAS but it may be more involved.

      Best,
      JP

       
  • Jean-Pierre Flori

    (The tar file I linked above is for a "usual" build on debian/sparc which is only missing a few ...Atomic... functions, I'll post other tarfiles from a debian/sparc and a debian/amd64 when -t 0 is used and the build is still problematic if needed)

     
  • Jean-Pierre Flori

    The static libraries do not seem problematic (I can issue "./sage -python -c "from numpy import linalg"" without problems when numpy is linked with a static libatlas.a).

     
  • Jean-Pierre Flori

    Log of a debian/amd64 build with "-t 0" at http://boxen.math.washington.edu/home/jpflori/error_Corei26AVX-t0.tar.bz2.

    Here is my understanding of the error we encounter.
    Please note:
    - I'm not completely sure of the analysis,
    - it affects shared libraries on systems where threads are enabled but the build system fails to pick the fallback C code,
    - it affects shared libraries on all systems if you disable threads by passing "-t 0",
    - it never affects static libraries if you disable threads by passing "-t 0",
    - it "surely" affects threaded static libraries on systems where threads are enabled but the build system fails to pick the fallback C code.

    More details follow:

    In the debian/amd64 static archive built with "-t 0", whose log I just posted, there are the same undefined symbols that one can find on OS X/ppc and debian/sparc built without "-t 0".

    In the former case, with "-t 0", whatever the platform is, it's "kind of" expected, as I remember you mentioning somewhere that you wanted a simple build system and that it was easier to leave undefined symbols in.
    In the latter case, even without "-t 0", this is definitely unexpected and seems to happen because there is no assembly implementation of some functions and the build system fails to pick up the _mut C implementations.

    Note that for the non-threaded libraries, if you link to static archives, the undefined symbols won't be a problem (I've checked that), as the linker is smart enough not to include the unused part of the archive where the undefined thread-related functions are located.

    But if you link to shared libraries (which usually won't be a problem at link time), then at runtime the presence of undefined symbols will make loading the shared library fail. This holds whether the symbols are really needed (a build without "-t 0" where the fallback code is not picked up to define the functions) or not needed (a build with "-t 0", where the thread functions are not used but their symbols are still present and undefined).
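
To see which symbols a library leaves undefined, one can filter the output of `nm`. The following is a minimal sketch, assuming binutils' `nm` is on the PATH and prints its usual "address type name" layout. The function names and the sample text are made up for illustration.

```python
import subprocess


def undefined_symbols(nm_output, needle=""):
    """Return undefined symbol names from `nm` output, optionally filtered."""
    found = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries carry no address: only the type letter "U" and the name.
        if len(parts) == 2 and parts[0] == "U" and needle in parts[1]:
            found.append(parts[1])
    return found


def undefined_symbols_in(library, needle=""):
    """Run `nm -D` on a shared library and filter its output."""
    result = subprocess.run(
        ["nm", "-D", library], capture_output=True, text=True, check=True
    )
    return undefined_symbols(result.stdout, needle)


# Made-up sample, not taken from a real ATLAS build.
SAMPLE = """\
0000000000001139 T ATL_dgemm
                 U ATL_DecAtomicCount
                 U malloc
"""
print(undefined_symbols(SAMPLE, "Atomic"))  # ['ATL_DecAtomicCount']
```

Not every undefined entry is a problem: symbols such as `malloc` are resolved from other libraries at load time. Only those that no loaded library provides make loading fail.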

    So the way of solving all of this seems twofold:
    - (easy?) make sure the build system picks up the _mut C implementation of the Atomic functions when there is no assembly implementation (sparc/ppc); that will solve the issue of using shared libraries (threaded or not) on all systems, but only when built without "-t 0"
    - (more involved?) do not include the symbols corresponding to thread-related functions in the non-threaded static archives; that would even supersede (in a not so clean way) the use of non-threaded shared libraries without the above fix: one would just have to link to the non-threaded shared libraries, which are built whether "-t 0" is passed or not.
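
To make the first point concrete, here is what a mutex-protected fallback for an atomic counter looks like in general. This is a minimal Python sketch of the technique, not ATLAS's actual _mut C code. The class name and the decrement semantics (return the value seen before decrementing, never go below zero) are assumptions chosen for illustration.

```python
import threading


class MutexCounter:
    """Counter whose updates are serialized by a lock instead of atomic instructions."""

    def __init__(self, count):
        self._count = count
        self._lock = threading.Lock()

    def decrement(self):
        """Return the value seen before decrementing; stay at zero once exhausted."""
        with self._lock:
            previous = self._count
            if previous > 0:
                self._count = previous - 1
            return previous


def drain(counter, claimed):
    # Each worker claims tickets until the counter reports zero.
    while True:
        ticket = counter.decrement()
        if ticket == 0:
            break
        claimed.append(ticket)


counter = MutexCounter(100)
claimed = []
workers = [threading.Thread(target=drain, args=(counter, claimed)) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Every ticket from 1 to 100 is claimed exactly once, whatever the interleaving.
print(sorted(claimed) == list(range(1, 101)))
```

The lock makes the read-and-update step indivisible, which is the property the assembly versions get from hardware atomic instructions. The price is extra locking overhead on every call.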

     
  • Jean-Pierre Flori

    For info, on the debian/sparc system, with ATLAS 3.11.8, I still get the same undefined symbols (in addition to the one I mentioned in the deleted post, but that's another issue).

     
  • Jean-Pierre Flori

    FYI, I get the same problematic ATL_DecAtomicCount undefined symbol issue when using shared libraries on FreeBSD 9.0 x86.

    I'm currently trying FreeBSD 8.3 amd64.

     
  • Jean-Pierre Flori

    FYI, all the stuff I've linked in this ticket has been moved on the same server into an "atlas" subdirectory.

     
  • Jean-Pierre Flori

    Here is a patch mitigating the problem.
    Best,
    JP

     
  • Jean-Pierre Flori

    And another one to correctly detect arch, number and speed of CPUs on various systems (no CPU speed on Linux ARM though, there does not seem to exist a clean solution except relying on dmesg, BogoMIPS or cpufreq).
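
As an illustration of why the CPU speed is the hard part, the sketch below reads architecture, CPU count and clock speed with Python's standard library and a /proc/cpuinfo-style parser. It is not what the patch does; the function names and the sample excerpts are made up, and the parser simply returns None when no "cpu MHz" line is present.

```python
import os
import platform
import re


def cpu_mhz_from_cpuinfo(text):
    """Return the first "cpu MHz" value in /proc/cpuinfo-style text, or None."""
    match = re.search(r"^cpu MHz\s*:\s*([0-9.]+)", text, re.MULTILINE)
    return float(match.group(1)) if match else None


def describe_host(cpuinfo_path="/proc/cpuinfo"):
    """Collect architecture, CPU count and (when available) clock speed."""
    try:
        with open(cpuinfo_path) as handle:
            mhz = cpu_mhz_from_cpuinfo(handle.read())
    except OSError:
        mhz = None  # the file is missing or unreadable on this system
    return {"arch": platform.machine(), "ncpu": os.cpu_count(), "mhz": mhz}


# Made-up excerpts: the first carries a clock speed, the second only BogoMIPS.
X86_SAMPLE = "processor\t: 0\nmodel name\t: Example CPU\ncpu MHz\t\t: 2400.000\n"
ARM_SAMPLE = "processor\t: 0\nBogoMIPS\t: 38.40\n"
print(cpu_mhz_from_cpuinfo(X86_SAMPLE), cpu_mhz_from_cpuinfo(ARM_SAMPLE))  # 2400.0 None
```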

    Best,
    JP

     
  • R. Clint Whaley

    R. Clint Whaley - 2014-07-09
    • status: open --> closed-out-of-date
    • assigned_to: R. Clint Whaley
     
  • R. Clint Whaley

    R. Clint Whaley - 2014-07-09

    I'm closing this report as out of date, as it seems to be a mixture of issues that are also handled on other threads. If you are still interested in these fixes, can you issue new requests for each unique feature not already covered in other ticket items by the Sage guys?

    Many thanks,
    Clint

     
  • Jean-Pierre Flori

    Sure, I'll have a look at ATLAS dev version status and reopen tickets if needed.

    Best,
    JP

     
