#3992 TclpDlopen failing on MacOSX 10.4 and later

Milestone: obsolete: 8.5.2
Status: closed-fixed
Priority: 8
Updated: 2014-07-12
Created: 2008-05-09
Private: No

Hi

A change was made to tclLoadDyld.c to use the dlfcn API (dlopen) on MacOSX 10.4 or later instead of the obsolete/deprecated NSModule API.

However, the call to dlopen is problematic:
dlHandle = dlopen(nativePath, RTLD_NOW | RTLD_LOCAL);

RTLD_LOCAL is *not* the default value for dlopen; the default is RTLD_GLOBAL. RTLD_LOCAL prevents Tcl/Tk from loading any dynamic library/module that depends on (i.e. was linked against) a previously loaded dynamic library/module.

Unless I missed the rationale for picking RTLD_LOCAL, could Tcl/Tk please use the default RTLD_GLOBAL?

From the man page:
http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/dlopen.3.html

RTLD_GLOBAL: Symbols exported from this image (dynamic library or bundle) will be available to any images build with -flat_namespace option to ld(1) or to calls to dlsym() when using a special handle.

RTLD_LOCAL: Symbols exported from this image (dynamic library or bundle) are generally hidden and only available to dlsym() when directly using the handle returned by this call to dlopen().

In the second case, a library/module A will load correctly but will hide its symbols. Tcl/Tk will fail to load a library/module B if B depends on symbols in A (i.e. was dynamically linked against A).
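
A minimal sketch of the choice being discussed, in case it helps. The helper names module_load_flags and open_module are hypothetical and are not taken from tclLoadDyld.c; only the dlopen call and the RTLD_* flags come from the dlfcn API quoted above.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not Tcl code): pick the dlopen() mode.
 * A non-zero share_symbols makes the module's exported symbols
 * visible to modules loaded later (RTLD_GLOBAL). */
static int module_load_flags(int share_symbols)
{
    return RTLD_NOW | (share_symbols ? RTLD_GLOBAL : RTLD_LOCAL);
}

/* Hypothetical helper: open a module and print dlerror() on failure. */
static void *open_module(const char *path, int share_symbols)
{
    void *handle = dlopen(path, module_load_flags(share_symbols));
    if (handle == NULL) {
        fprintf(stderr, "load of %s failed: %s\n", path, dlerror());
    }
    return handle;
}
```

With share_symbols set to 0 this reproduces the RTLD_NOW | RTLD_LOCAL call shown earlier; with 1 it is the variant I am asking for.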

Example:

barre [562] $ /opt/tcltk8.5.0/bin/wish8.5
% load libvtkCommonTCL.dylib
% load libvtkFilteringTCL.dylib
dlopen(libvtkFilteringTCL.dylib, 6): Symbol not found: __Z14vtkTclInDeleteP10Tcl_Interp
Referenced from: /Users/barre/build/VTK-VTK-5-2-tcl85-debug/bin/libvtkFilteringTCL.dylib
Expected in: flat namespace

The __Z14vtkTclInDeleteP10Tcl_Interp symbol *is* actually in libvtkCommonTCL.dylib.

barre [563] $ otool -L libvtkFilteringTCL.dylib
libvtkFilteringTCL.dylib:
libvtkFilteringTCL.5.2.dylib (compatibility version 0.0.0, current version 0.0.0)
libvtkFiltering.5.2.dylib (compatibility version 0.0.0, current version 0.0.0)
libvtkCommonTCL.5.2.dylib (compatibility version 0.0.0, current version 0.0.0)
libvtkCommon.5.2.dylib (compatibility version 0.0.0, current version 0.0.0)
/opt/tcltk8.5.0/lib/libtcl8.5.dylib (compatibility version 8.5.0, current version 8.5.0)
[...]

After manually changing RTLD_LOCAL to RTLD_GLOBAL, I was able to load both libraries without any problem inside Tcl/Tk 8.5.2.

Thank you

Discussion

  • Sébastien BARRE

    • assigned_to: kennykb --> das
     
  • Daniel A. Steffen

    Logged In: YES
    user_id=90580
    Originator: NO

    I'll try to track down where the decision to use RTLD_LOCAL came from, but at first glance I agree that this appears to be a bug. Thanks for the report.

     
  • Daniel A. Steffen

    • priority: 5 --> 8
    • status: open --> open-accepted
     
  • Jan Nijtmans

    Jan Nijtmans - 2008-05-13

    Logged In: YES
    user_id=61031
    Originator: NO

    2008/5/9 SourceForge.net <noreply@sourceforge.net>:
    > However, the call to dlopen is problematic:
    > dlHandle = dlopen(nativePath, RTLD_NOW | RTLD_LOCAL);
    >
    > RTLD_LOCAL is *not* the default value for dlopen; the default is RTLD_GLOBAL. RTLD_LOCAL prevents Tcl/Tk from loading any dynamic library/module that depends on (i.e. was linked against) a previously loaded dynamic library/module.
    >
    > Unless I missed the rationale for picking RTLD_LOCAL, could Tcl/Tk please use the default RTLD_GLOBAL?

    Generally, undefined symbols in libraries are a bad idea, because those
    unresolved symbols must be resolved at run-time. RTLD_LOCAL is therefore
    faster, but has the disadvantage that all libraries must know which
    other libraries they depend on. In my view, RTLD_LOCAL is preferred
    whenever possible.

    > In the second case, a library/module A will load correctly but will hide its symbols. Tcl/Tk will fail to load a library/module B if B depends on symbols in A (i.e. was dynamically linked against A).

    The 'correct' way to solve this is to add '-lA' to the link line when
    linking B; then B will see the symbols. Additional advantage: loading B
    will load A automatically if it is not already loaded.

    > I'll try to track down where the decision to use RTLD_LOCAL came from but
    > at first glance I agree that this appears to be a bug, thanks for the
    > report.

    Because tclLoadDl uses RTLD_GLOBAL as well, there are probably already
    libraries out there that fail to declare all the libraries they link
    against. Therefore, I would recommend changing it to RTLD_GLOBAL
    in tclLoadDyld.c as well. However, I suggest changing it to
    RTLD_LOCAL in Tcl 8.6 and documenting the change clearly.

    Regards,
    Jan Nijtmans

     
  • Sébastien BARRE

    Logged In: YES
    user_id=214100
    Originator: YES

    Jan,

    Thanks for your comment.
    However, I think your statement might be incorrect:

    > The 'correct' way to solve this is to add '-lA' to the link line when
    > linking B; then B will see the symbols. Additional advantage: loading B
    > will load A automatically if it is not already loaded.

    Our libraries were linked that way. If you check my first email, you will see that I ran "otool -L" against libvtkFilteringTCL, and it correctly reports libvtkCommonTCL as a known dependency. I also just double-checked our link line, it is correct. So we suspect the problem is on the dlopen side and that very specific flag.

    But even if it did work with RTLD_LOCAL, there would be a problem with global variables (e.g. static members of classes). They would be duplicated if each library got its own copy of them, and all hell would break loose.

    > Because tclLoadDl uses RTLD_GLOBAL as well, probably there are already
    > libraries out there that fail to indicate all linked libraries.
    > Therefore, I would recommend to change it to RTLD_GLOBAL
    > in tclLoadDyld.c as well.

    Would be great.

    > However, I suggest to change it to RTLD_LOCAL in Tcl 8.6, and
    > document the change clearly.

    I'm afraid I don't follow the rationale here. Not only would the (reasonable) example I describe in my first email fail on MacOSX >= 10.4 (it does fail, I assure you), but it would start failing on all our other Unix platforms. We have nightly regression tests here that show it works fine on our Unix systems and MacOSX < 10.4. It would be unfortunate if that situation were reversed.

    Thank you

     
  • Nobody/Anonymous

    Logged In: NO

    > But even if it did work with RTLD_LOCAL, there would be a problem with
    > global variables (i.e. static members of classes for example). They would
    > be duplicated if each library gets its own version of them, and hell would
    > break loose.
    No, RTLD_LOCAL does not mean that all static members are duplicated. It
    only changes the visibility of the symbols, not the way the library is
    loaded. It will still be loaded only once per application (executable),
    no matter how many dlopen calls are made.

    I have handled similar problems in the past, and I am still convinced that
    there is a problem with run-time resolution of symbols in your libraries.
    It might be that libraries A and B are correct, but the problem is in
    C, which is used by both A and B. It's hard to tell from here. One way
    to find out is to make a dependency graph of all your libraries and try
    to load them separately from bottom to top. Does your linker have an
    option like --no-undefined? Then your linker can find such
    problems at build-time.
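
The bottom-to-top check could be sketched roughly as follows. This is only an illustration: first_failed_load is a hypothetical name, and the caller is assumed to pass the library paths already sorted so that leaf dependencies come first.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical diagnostic: load each library in the given order (leaf
 * dependencies first) and return the index of the first one that fails
 * to load, or -1 if all of them load. RTLD_NOW forces every symbol to
 * be resolved at load time, so a missing symbol shows up immediately.
 * Handles are deliberately left open; this is a throwaway check. */
static int first_failed_load(const char *const paths[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (dlopen(paths[i], RTLD_NOW | RTLD_LOCAL) == NULL) {
            fprintf(stderr, "%s: %s\n", paths[i], dlerror());
            return (int)i;
        }
    }
    return -1;
}
```

The index returned points at the first library whose dependencies are not fully declared or not resolvable under RTLD_LOCAL.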

    I'm not trying to break currently working code. But some platforms,
    like win32, don't support undefined symbols in dll's at all, so
    if your libraries have a symbol resolution problem it will be
    impossible to port your libraries to win32.

    Good luck. If you have more questions, feel free to ask.

     
  • Sébastien BARRE

    Logged In: YES
    user_id=214100
    Originator: YES

    > No, RTLD_LOCAL does not mean that all static members are duplicated.

    OK, good to know.

    > there is a problem with run-time resolution of symbols in your libraries.

    I know Tcl/Tk is solid, but this C++ toolkit (VTK) is more than 10 years old. We have wrapped it using Tcl for pretty much the same amount of time (as well as Python and Java), and have performed "package require" or calls to Tcl's "load" in hundreds of Tcl tests to exercise the toolkit every night, for years, on dozens of Unix platforms and Win32 platforms, using many, many compilers. It has always been divided into different shared libraries. Problems arose only recently when testing on MacOSX >= 10.4 with Tcl/Tk 8.5, since it is now using dlopen (instead of NSModule).

    > It might be that library A and B are correct, but the problem is in
    > C which is used by both A and B. It's hard to tell from here.

    Please check my first message. I'm firing up the shell, then loading A (libvtkCommonTCL), then loading B (libvtkFilteringTCL). The symbol it is clearly complaining about (__Z14vtkTclInDeleteP10Tcl_Interp) *is* in A, not in a library that is a dependency of A.

    > I'm not trying to break currently working code. But some platforms,
    > like win32, don't support undefined symbols in dll's at all, so
    > if your libraries have a symbol resolution problem it will be
    > impossible to port your libraries to win32.

    This toolkit has been cross-platform from its origin. Regression testing shows we have no resolution problems on Win32 platforms, from Win 2000 to Vista, using MsDev6 to VisualStudio8. Therefore, I'm inclined to think there is no symbol resolution issue at the moment, though I might be wrong, but we do stress test VTK *a lot*, every night, and in a continuous manner during the day.

    I think there is probably a good reason why the default is RTLD_GLOBAL and not RTLD_LOCAL on MacOSX, as opposed to some other OS...

    My question is, and I'm not certain you answered it: what would Tcl *break* by using RTLD_GLOBAL consistently?

    I also did some quick Googling:

    http://groups.google.com/group/comp.lang.perl.misc/msg/a2877cf7e0c656fe
    => this Perl user seemed to have the exact same problem while loading a Perl module.

    Another message that seems to indicate that RTLD_LOCAL cannot be used if you load two libs A and B, with B depending on A (whereas RTLD_GLOBAL would allow it to work, and not break other examples, unless I missed something):
    http://gcc.gnu.org/ml/gcc/2002-05/msg02034.html

    Thank you

     
  • Nobody/Anonymous

    Logged In: NO

    > My question is, and I'm not certain you answered it: what would Tcl
    > *break* by using RTLD_GLOBAL consistently?

    Using RTLD_LOCAL allows two versions of the same library to coexist
    without problems. E.g. suppose we have a library libtclcrypt.so that
    depends on libssl.so.1 and another, libtclssl.so, that depends on
    libssl.so.2. Then, even if libssl.so.1 and libssl.so.2 export the same
    symbols, both Tcl extensions can work side by side. With RTLD_GLOBAL
    the outcome is platform-dependent. If there is only one version of
    each library, such a clash is highly unlikely.
    But you can never be sure that two different libraries don't define
    the same symbol (e.g. myalloc()) and both export it by accident.
    Which myalloc() will be used then?
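
To make the myalloc() scenario a bit more concrete, here is a rough sketch of a lookup that goes through one specific dlopen handle instead of the global symbol table. The helper name lookup_in is hypothetical; dlsym with an explicit handle searches that library and its dependencies.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: resolve `name` through one specific handle.
 * With a handle returned by dlopen(path, ...), dlsym() searches that
 * library and its dependencies, so two libraries that both export
 * myalloc() can each be asked for their own copy. */
static void *lookup_in(void *handle, const char *name)
{
    if (handle == NULL) {
        return NULL;            /* nothing was loaded, nothing to search */
    }
    dlerror();                  /* clear any stale error state */
    return dlsym(handle, name);
}
```

A caller holding two handles can then call lookup_in(handle1, "myalloc") and lookup_in(handle2, "myalloc") and get an unambiguous answer from each, which is the property RTLD_LOCAL is meant to preserve.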

    Anyway, I rest my case.

     
  • Daniel A. Steffen

    Logged In: YES
    user_id=90580
    Originator: NO

    Can you confirm that the current implementation via NSModule behaves differently from the implementation via dlfcn in this case?

    i.e. compile with
    -DTCL_DYLD_USE_DLFCN=0 -DTCL_DYLD_USE_NSMODULE=1 -DTCL_DEBUG_LOAD
    and with
    -DTCL_DYLD_USE_DLFCN=1 -DTCL_DYLD_USE_NSMODULE=0 -DTCL_DEBUG_LOAD
    and if there is a difference, paste results of your [load]s above.

    Please make sure to test on both OSX 10.4 and 10.5 if possible, and with binaries linked on 10.5 as well as on 10.4; the dyld and linker implementations in 10.5 are very different from those in 10.4...

    If the behaviour is indeed different between NSModule and dlfcn in 10.5, I'd want to fix it irrespective of the general RTLD_GLOBAL vs RTLD_LOCAL debate.

    BTW, have you considered not linking with -flat_namespace? That is a legacy option that completely changes how symbols are resolved. Using two-level namespaces records at link time which library a given symbol comes from, which may take care of the problem at hand...

     
  • Sébastien BARRE

    Logged In: YES
    user_id=214100
    Originator: YES

    > nobody:
    > Anyway, I rest my case.

    I think your example will happen much less often than the scenario I described initially.
    Anyway.

    > das:
    >i.e. compile with
    > -DTCL_DYLD_USE_DLFCN=0 -DTCL_DYLD_USE_NSMODULE=1 -DTCL_DEBUG_LOAD
    >and with
    > -DTCL_DYLD_USE_DLFCN=1 -DTCL_DYLD_USE_NSMODULE=0 -DTCL_DEBUG_LOAD
    >and if there is a difference, paste results of your [load]s above.

    Yes, I had tried that last week, since I test using different major versions of Tcl/Tk compiled from source; sadly, the difference is that Tcl 8.5 would hang while loading the second library, on MacOSX 10.5. Though I had not tried with TCL_DEBUG_LOAD, I can try that again tomorrow.

    > Please make sure to test on both OSX 10.4 and 10.5 if possible and with
    > binaries linked on 10.5 as well as on 10.4, the dyld and linker

    10.4 and 10.5 failed in the exact same way for me. One of our regression machines runs 10.4, which is where I spotted the problem. I then tried on my own Mac, running 10.5, and this failed as well. Another one of our regression computers is running 10.3, and has no problem (since it's not using dlopen).

    > BTW, have you considered not linking with -flat_namespace? that is a
    > legacy option that completely changes how symbols are resolved, using
    > two-level namespaces will record which library a given symbol comes

    I wasn't aware of that, and will try.
    Thanks

     
  • Jan Nijtmans

    Jan Nijtmans - 2008-05-15

    Logged In: YES
    user_id=61031
    Originator: NO

    This is related to
    [ tktoolkit-Bugs-1958367 ] Tk no longer builds correctly on Tru64
    which is in fact a TEA bug. How can we expect Tcl extensions to
    link with all dependent libraries if TEA and Tk don't do it
    correctly on all platforms? Therefore, I suggest changing
    the flag to RTLD_GLOBAL now, then making sure that
    [tktoolkit-Bugs-1958367] gets fixed and that Sébastien's problem
    gets solved independently of the flag value (did removing
    '-flat_namespace' help?). Only when those steps succeed can we
    even think of changing the flag to RTLD_LOCAL.
    > I think your example will happen much less often than the scenario I
    > described initially.
    Agreed. But if the real bug here can be fixed, then we can have both.

     
  • Nobody/Anonymous

    Logged In: NO

    Removing -flat_namespace did the trick for me on my 10.5 testing machine. X11 didn't fire properly on our regression machines last night, so I'll keep you posted about 10.4.
    Thanks!

     
  • Jan Nijtmans

    Jan Nijtmans - 2008-05-15

    Logged In: YES
    user_id=61031
    Originator: NO

    Wow, I didn't expect your problem to be fixed that easily. Shouldn't we then
    document that Tcl extensions cannot be linked with -flat_namespace on the Mac?

    So, I change my recommendation to just do nothing (except possibly
    modify the documentation), and close this issue. At least on the Mac we
    don't have the problem of possible symbol conflicts. On other platforms we
    still have it (see tktoolkit-Bugs-1958367), but that is a separate issue.

    Regards,
    Jan Nijtmans

     
  • Nobody/Anonymous

    Logged In: NO

    And maybe they can not rebuild their extension. Maybe it's old. Maybe they don't have the sources. etc.
    With or without -flat_namespace, you could use RTLD_GLOBAL so that any extension is supported.

     
  • Nobody/Anonymous

    Logged In: NO

    Tcl does not need to support every combination of broken build ever conceived of. Truly.

     
  • Nobody/Anonymous

    Logged In: NO

    It's funny how those builds used to be very much "not broken" before that change occurred in tclLoadDyld.c...

    If you do not want to support legacy libraries, that's not a problem, you are entitled to, just document it, but don't call other people's work "broken" when it's obviously not.

    Removing -flat_namespace was just a workaround; if you switch to RTLD_LOCAL on all other Unix platforms, and there is no such workaround available, you will probably hear people complaining about their "broken" build again, rightfully so.

     
  • Jan Nijtmans

    Jan Nijtmans - 2008-05-16

    Logged In: YES
    user_id=61031
    Originator: NO

    > but don't call other people's work
    > "broken" when it's obviously not.
    I don't consider your work broken; I consider the Mac
    option -flat_namespace broken when used in combination
    with Tcl extensions. Tcl expects its extensions to be
    linked using the command defined in tclConfig.sh
    (see TCL_SHLIB_LD in that file). Additional flags
    are not Tcl's responsibility, sorry for that.....

    So, what you can do is create a shared library containing
    the XXX_Init function only. All it does is dlopen the remaining
    libraries with the RTLD_GLOBAL flag, and call whatever
    functions it wants. All other libraries can be legacy
    libraries, compiled with or without -flat_namespace,
    whatever you like. Only the library that is loaded by Tcl
    cannot be compiled with -flat_namespace.

    This way, you can have your legacy libraries as you like, while
    the Tcl extension itself, which is only a small wrapper, conforms
    to the Tcl guidelines. So, yes, Tcl supports legacy libraries
    just fine; you only have to wrap them up following the Tcl guidelines.
    If those guidelines conflict with how the legacy libraries are
    built, then that's a (solvable) problem.
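
A rough sketch of the core of such a wrapper, with everything Tcl-specific left out so that it compiles on its own. The name preload_global is hypothetical, and the list of legacy library paths is assumed to be supplied by the wrapper's author.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper for a thin wrapper extension: open each legacy
 * library with RTLD_GLOBAL so that libraries opened later can see its
 * symbols. Returns the number of libraries opened and stops at the
 * first failure. */
static size_t preload_global(const char *const paths[], size_t count)
{
    size_t loaded = 0;
    while (loaded < count) {
        if (dlopen(paths[loaded], RTLD_NOW | RTLD_GLOBAL) == NULL) {
            fprintf(stderr, "%s\n", dlerror());
            break;
        }
        loaded++;
    }
    return loaded;
}

/* In a real extension, the XXX_Init function that Tcl calls would invoke
 * preload_global() and report an error when fewer than `count` libraries
 * were opened. */
```

Only the small wrapper library is then subject to whatever flag Tcl passes to dlopen; the legacy libraries are loaded on the wrapper's own terms.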

    Regards,
    Jan Nijtmans

     
  • Daniel A. Steffen

    committed switch to RTLD_GLOBAL to HEAD and core-8-5-branch

     
  • Daniel A. Steffen

    • status: open-accepted --> closed-fixed
     
  • Andreas Kupries

    Andreas Kupries - 2011-03-24

    The recent change to the core fixing #3216070 reintroduced this bug by switching back to RTLD_LOCAL. If you are getting tripped up by this, notably on Darwin, i.e. OS X, see below:

    If your library is linked using -flat_namespace and fails to load with a message like

    dyld: Symbol not found: ...
    Referenced from: ...
    Expected in: flat namespace

    Trace/BPT trap

    then this option has to be removed, and a possibly present -undefined
    suppress|warning as well, to make the library loadable again.

    This happened for Metakit.

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-11-08
    • assigned_to: das --> nijtmans
     
  • Jan Nijtmans

    Jan Nijtmans - 2012-11-08

    See: <http://www.tcl.tk/cgi-bin/tct/tip/416>

     
