#23 segfault running test suite on py2.3 x86_64

Errors
closed-fixed
7
2005-02-28
2005-02-10
No

After installing the module, I ran `cd testsuite;
python tests.py` and it yielded the following segfault.

If I can help further, please contact me.

===
Testing: BuildClass, FindClass, ClassList, {...} ... ok
Testing: Clear, Reset, Class.PPForm, Class.Description,
Class.Module ... ok
Testing: Class.BuildSubclass ... ok
Testing: Class.WatchSlots, Class.WatchInstances ... ok
Testing: Class.MessageHandlerIndex,
Class.MessageHandlerWatched ... ok
Testing: Class.UnwatchMessageHandler,
Class.WatchMessageHandler ... ok
Testing: Class.MessageHandlerName,
Class.MessageHandlerType ... ok
Testing: Class.NextMessageHandlerIndex,
Class.MessageHandlerDeletable ... ok
Testing: BuildMessageHandler, Class.AddMessageHandler,
{...} ... ok
Testing: Class.BuildInstance, Class.RawInstance ... ok
Testing: Class.MessageHandlerList,
Class.AllMessageHandlerList ... ok
Testing: BuildInstance, Class.Deletable,
Instance.Slots, Instance.PPForm ... ok
Testing: FindInstance, Instance.Class, Instance.Name ... ok
Testing: LoadInstancesFromString ... ok
Testing: Slots.Names, Slots.Exists,
Slots.ExistsDefined, {...} ... ok
Testing: Slots.Cardinality, Slots.AllowedValues ... ok
Testing: Slots.Types, Slots.Sources ... ok
Testing: Slots.IsPublic, Slots.IsInitable, Slots.Range
... ok
Testing: Slots.IsWritable, Slots.HasDirectAccess,
Slots.Facets ... ok
Testing: ClassList, InitialClass, FindClass, Class.Next
... ok
Testing: InitialDeffacts, DeffactsList, Deffacts.Next,
Deffacts.Name ... ok
Testing: FindDeffacts, Deffacts.PPForm,
Deffacts.Deletable, {...} ... ok
Testing: InitialDefinstances, DefinstancesList,
Definstances.Next, {...} ... ok
Testing: Definstances.Name, Definstances.Module,
Definstances.Deletable ... ok
Testing: FactList, InitialFact, Fact.Next, Fact.Index
... ok
Testing: InitialFunction, FunctionList, Function.Name,
Function.Next ... ok
Testing: FindFunction, Function.PPForm, Function.Watch,
{...} ... ok
Testing: BuildGeneric, Generic.Name, Generic.PPForm,
Generic.Watch ... ok
Testing: InitialGeneric, GenericList, FindGeneric,
{...} ... ok
Testing: BuildGlobal, Global.Name, Global.PPForm, {...}
... ok
Testing: InitialGlobal, GlobalList, FindGlobal,
Global.Watch, {...} ... ok
Testing: InstancesChanged, InitialInstance,
FindInstance, Instance.Next ... ok
Testing: Instance.IsValid, Instance.Remove,
Instance.DirectRemove ... ok
Testing: Instance.Class, Instance.Slots, Instance.Send
... ok
Testing: Class.InitialInstance,
Class.InitialSubclassInstance, {...} ... Segmentation
fault (core dumped)

====
(gdb) bt
#0 EnvGetNextInstanceInClassAndSubclasses_PY
(theEnv=0x57ff80,
cptr=0x100000000, iptr=0x1,
iterationInfo=0x7fbfffde40) at inscptch_py.c:98
#1 0x0000002a97f1c11a in
g_getNextInstanceInClassAndSubclasses (
self=0x57ff80, args=0x2a9821a3b0) at clipsmodule.c:5719
#2 0x0000003f0218973f in _PyEval_SliceIndex ()
from /usr/lib64/libpython2.3.so.1.0

Discussion

  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Hi...

    Sorry for noticing it so late: I thought I was monitoring
    this forum and apparently I'm not. Now I'll start working on
    the bug.

    In fact the only 64 bit platform I have for testing is an
    old SPARC, and it did not show any problem using the test
    suite. I'll try to figure out what happened using your
    backtrace...

    Thank you for submitting the bug!

    F.

     
  • Francesco Garosi

    • priority: 5 --> 7
    • assigned_to: nobody --> franzg
     
  • Matthew L Daniel

    Logged In: YES
    user_id=88251

    I believe I am in a position to help you work through this.
    While direct ssh to my machine is not an option, I am
    relatively competent with gdb and C hacking. I am just not
    an expert at Python modules.

    If I can help, please let me know.

    Toward that end, doesn't Sourceforge have an x86_64 in their
    compile farm?

     
  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Well... of course you can be of big, big help, especially as
    you are offering it! The fact is only that I usually do not want
    to bother other people *asking* for help. And I still have not
    investigated whether or not SF offers an x86_64 in their CF
    (but I was going to, I confess).

    I'm not so good with gdb: I'm so bad at it that I use ddd to
    feel more comfortable - I must say that this graphical front
    end is also a great tool.

    Apart from my stupid comments above, now I'm "unfolding"
    the guilty test, and if you don't mind I'll attach to my next
    comment a Python source file containing the tests in a more
    straightforward sequence: then I will be able to really see
    what happens.

    There are some things that make me think, in fact:

    1) the crash happens apparently when the g_* function is
    called for the first time (the instance pointer is NULL)
    because it's line 5719: from the traceback it looks like this
    NULL is passed as 0x1. Strange, isn't it? Also the other
    pointer (cptr), which should be almost ordinary, is a rather
    curious 0x100000000! I think we should look at this.

    2) the "self" parameter passed to a Python-visible function is
    also a pointer, actually to a PyObject, which should have
    nothing to do with CLIPS environments. But then my g_*
    function expands void *CurrentEnvironment() exactly to the
    same pointer as self (=0x57ff80). This is also quite strange,
    looks like I'm misusing something somewhere, but for now I
    can't figure out where or what.

    By the way, I lied before: I wasn't remembering that my
    SPARC Python is compiled in 32 bit mode, and thus all
    extensions are compiled exactly the same way. So I'm still
    unable to give any results about 64 bit environments: your
    help will surely be precious.

    I'll pop up later with these two or three tests, so that at least
    we will see what parameters are passed to g_... before
    the segfault.

    Thank you again,

    F.

     
  • Francesco Garosi

    The "unfolded" test

     
    Attachments
  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Here again...

    If you would run the attached file and report the
    output, I will be able to see when the error occurs. It may
    well be a "boundary condition" (mathematically speaking).

    Til soon,

    F.

     
  • Matthew L Daniel

    Logged In: YES
    user_id=88251

    I built _clips.so with debugging (-g) and ran the g_test.py.
    Please find the output below.

    I also discovered the joy of "bt full" in gdb, which prints
    out the local variables, too. That should help lots.

    #0 EnvIncrementInstanceCount (theEnv=0x520530,
    vptr=0x4554415254535f48)
    at ./clipssrc/insfun.c:112
    No locals.
    #1 0x0000002a9574b15a in
    g_getNextInstanceInClassAndSubclasses (
    self=0x520530, args=0x0) at clipsmodule.c:5725
    p = (clips_InstanceObject *) 0x2a955c8290
    q = (clips_InstanceObject *) 0x2a955c82b0
    c = (clips_DefclassObject *) 0x2a955c62b8
    o = {supplementalInfo = 0x0, type = 4, value =
    0x522cd0, begin = 1,
    end = 4294967295, next = 0x0}
    ptr = (void *) 0x4554415254535f48
    #2 0x0000003f0218973f in _PyEval_SliceIndex ()
    from /usr/lib64/libpython2.3.so.1.0

     
  • Matthew L Daniel

    Logged In: YES
    user_id=88251

    Further, the vptr is actually an ASCII string; I am not a
    savvy enough stack-overflow guru to know which order it is,
    but if either of these two strings looks familiar to you,
    all the better:

    H_STRATE
    ETARTS_H
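
A minimal sketch of how the two candidate strings above can be derived from the pointer value in the backtrace. The helper names `bytes_le` and `bytes_be` are made up for this illustration and are not part of PyCLIPS or CLIPS. On a little-endian machine such as x86_64, memory holds the least significant byte first, so the little-endian reading is the one that matches how the bytes would appear in a string buffer.

```c
#include <stdint.h>

/* Copy the eight bytes of a 64-bit value into out[] in little-endian
 * memory order (least significant byte first), as x86_64 stores them. */
void bytes_le(uint64_t value, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((value >> (8 * i)) & 0xff);
    out[8] = '\0';
}

/* Same bytes in big-endian order (most significant byte first). */
void bytes_be(uint64_t value, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((value >> (8 * (7 - i))) & 0xff);
    out[8] = '\0';
}
```

Calling `bytes_le(0x4554415254535f48ULL, buf)` fills `buf` with "H_STRATE", while `bytes_be` with the same value gives "ETARTS_H".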

     
  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Hmmm... this looks more normal than the one before. But the
    vptr (the actual instance address) seems not to have been
    initialized by the underlying CLIPS engine to anything
    useful: in fact the string "H_STRATE" is a substring of
    either "BREADTH_STRATEGY" or "DEPTH_STRATEGY", which are
    declared as "manifest constants" in the high level module, and
    thus also have a string representation (their name in __dict__).
    This means that ptr was set to something that points to
    outer space. Unfortunately I do not have much control over CLIPS
    internals (it's the GetNextInstanceInClassAndSubclasses_PY
    function that lets the "cursor" advance in the list of
    instances, and that is almost copied from CLIPS source).

    But there is something that gives me some thought in the
    local variables dump you report. I guess that a DATA_OBJECT
    -- it is a struct -- uses the "begin" and "end" members to
    track some array bounds, or something like that: actually
    they are integers or integeroids (printf usually shows
    addresses in hex, and I bet gdb does the same ;-)). Did you
    see? The "end" member is exactly 4294967295. If it was a 32
    bit integer, it would be 0xffffffff which is -1 in decimal
    (a good value to say "hey, there's nothing beyond this" or
    to construct a MULTIFIELD with no elements). But maybe the
    standard "int" type with gcc -m64 is 64 bit long, and in
    this case "end" is just a plain, useless number... Since I
    often see some "signed/unsigned" or "int/size_t" mismatches
    during the CLIPS compilation, I'm a little bit suspicious
    that one of these conversions might really be
    dangerous!

    Unfortunately I also don't see any of the output that
    g_test.py provides: I need just that to see what parameters
    are given to the function when the module crashes. Maybe gdb
    eats up the debuggee's output... Could you please just
    attach the output of a non-gdb session running g_test.py, so
    that I can see the progress? It would be very kind of you.

    Til soon!

    F.
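
The observation about `end = 4294967295` can be checked with a small sketch. The function names are hypothetical and exist only for this illustration. It reads the same 32-bit pattern two ways: as a signed 32-bit integer, and zero-extended into a 64-bit signed integer. For reference, under the LP64 model used by x86_64 Linux, `int` stays 32 bits wide and it is `long` that grows to 64 bits, so a value that wrapped in 32-bit unsigned arithmetic and was then stored in a `long` would show up as this large positive number.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit pattern as a signed 32-bit integer (two's complement). */
int32_t read_as_int32(uint32_t bits)
{
    int32_t out;
    memcpy(&out, &bits, sizeof out);
    return out;
}

/* Store the same pattern in a 64-bit signed integer. Converting an
 * unsigned 32-bit value zero-extends it, so the value stays positive. */
int64_t read_as_int64(uint32_t bits)
{
    return (int64_t)bits;
}
```

With the pattern `0xffffffff`, `read_as_int32` returns -1 and `read_as_int64` returns 4294967295, the value gdb printed for `end`.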

     
  • Matthew L Daniel

    Logged In: YES
    user_id=88251

    Sorry, I don't know why I forgot to post the output:

    Testing PyCLIPS top level module
    Building classes
    Initial/Next Instance 1
    Initial/Next SubclassInstance 1
    Test01
    Test02
    Test03
    Test04
    Initial/Next Instance 2
    Initial/Next SubclassInstance 2
    Test05
    Test06
    Testing ends of lists
    Test07 ok
    Test08 ok
    Test09 ok
    Segmentation fault (core dumped)

     
  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Hi,

    I just checked into the CVS repository a version that (among
    other things) swaps the lines where some checks are
    performed: now the module checks for a valid instance before
    incrementing the CLIPS reference count for it, I don't
    remember why I did the opposite before. It should result in an
    exception where it now segfaults: this is still not the
    behaviour I was expecting, but that requires some more
    step-by-step debugging, and maybe some patching to the CLIPS
    sources. I will try to thoroughly debug the code... but maybe I
    will need to spend part of this weekend in a more
    constructive way than pizzas and social life.

    OTOH, the information you posted is really useful for me to
    see what happens: this kind of bug is usually the result of
    poor design, and probably a good debugging session will
    help me isolate it.

    I hope to pop up as soon as possible with a solution, within
    the next two or three days. Meanwhile I'll stay tuned on SF,
    in case you test the CVS version and find some more clues. I
    still have to sign up for the amd64 compile server on SF;
    I'll probably do it tomorrow.

    Have a nice weekend,

    F.
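
The reordering described above (validate first, then take the reference) might look roughly like the sketch below. Everything here is an assumption made for illustration: `struct instance`, `instance_is_valid` and `acquire_instance` are hypothetical and do not reproduce the PyCLIPS or CLIPS code. Note that a check of this kind only helps with NULL or stale-but-readable records; if the pointer itself is garbage, the validity test has to dereference it and can still crash.

```c
#include <stddef.h>

/* Hypothetical instance record carrying a validity flag and a
 * reference count. */
struct instance {
    int      valid;
    unsigned refs;
};

/* Hypothetical validity test: reject NULL and records marked invalid. */
static int instance_is_valid(const struct instance *ins)
{
    return ins != NULL && ins->valid;
}

/* Take a reference only after the instance has been validated.
 * Returns 0 on success and -1 on failure, so the caller can raise an
 * exception instead of touching a bad record. */
int acquire_instance(struct instance *ins)
{
    if (!instance_is_valid(ins))
        return -1;
    ins->refs++;
    return 0;
}
```

The point of the ordering is that a failed check leaves the reference count untouched, so there is nothing to undo on the error path.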

     
  • Matthew L Daniel

    Logged In: YES
    user_id=88251

    cvs update from 2004-02-25 13:34 EST

    #####

    Testing PyCLIPS top level module
    Building classes
    Initial/Next Instance 1
    Initial/Next SubclassInstance 1
    Test01
    Test02
    Test03
    Test04
    Initial/Next Instance 2
    Initial/Next SubclassInstance 2
    Test05
    Test06
    Testing ends of lists
    Test07 ok
    Test08 ok
    Test09 ok
    Segmentation fault (core dumped)

    #######

    #0 0x0000002a9577f9e2 in EnvValidInstanceAddress
    (theEnv=0x520530, iptr=0x4554415254535f48) at
    ./clipssrc/inscom.c:648
    No locals.
    #1 0x0000002a9574a41a in
    g_getNextInstanceInClassAndSubclasses (self=0x520530,
    args=0x0) at clipsmodule.c:5760
    p = (clips_InstanceObject *) 0x2a955c7290
    q = (clips_InstanceObject *) 0x2a955c72b0
    c = (clips_DefclassObject *) 0x2a955c52b8
    o = {supplementalInfo = 0x0, type = 4, value =
    0x522cd0, begin = 1, end = 4294967295, next = 0x0}
    ptr = (void *) 0x4554415254535f48
    #2 0x0000003f0218973f in _PyEval_SliceIndex () from
    /usr/lib64/libpython2.3.so.1.0
    No symbol table info available.

    #####

    Testing: InstancesChanged, InitialInstance, FindInstance,
    Instance.Next ... ok
    Testing: Instance.IsValid, Instance.Remove,
    Instance.DirectRemove ... ok
    Testing: Instance.Class, Instance.Slots, Instance.Send ... ok
    Testing: Class.InitialInstance,
    Class.InitialSubclassInstance, {...} ... Segmentation fault
    (core dumped)

    #####

    #0 EnvGetNextInstanceInClassAndSubclasses_PY
    (theEnv=0x51dda0, cptr=0x100000000, iptr=0x1,
    iterationInfo=0x7fbfffded0) at inscptch_py.c:98
    nextInstance = (INSTANCE_TYPE *) 0x0
    theClass = (DEFCLASS *) 0x0
    #1 0x0000002a959973da in
    g_getNextInstanceInClassAndSubclasses (self=0x51dda0,
    args=0x2a95c983f8) at clipsmodule.c:5754
    p = (clips_InstanceObject *) 0x2a955c7f30
    q = (clips_InstanceObject *) 0x6031e0
    c = (clips_DefclassObject *) 0x2a95c8fd20
    o = {supplementalInfo = 0x0, type = 4, value =
    0x1648e70, begin = 1, end = 4294967295, next = 0x0}
    ptr = (void *) 0x1643320
    #2 0x0000003f0218973f in _PyEval_SliceIndex () from
    /usr/lib64/libpython2.3.so.1.0
    No symbol table info available.

     
  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Well, the *same* error in different conditions makes my
    opinions about this riddle stronger. In fact I followed the code
    flow, and there is a point where a (long)((register
    unsigned) "could-be-0" - 1) operation is performed. Maybe
    this is the guilty piece of code. If you want to look, I bet it is
    in file clipssrc/classinf.c:614, where a register unsigned is used
    in SetpDOEnd(...). If you look at the SetpDOEnd _macro_, it
    casts the result of a subtraction to long *after* performing the
    subtraction. But "unsigned int register" might not be the
    same size as long... I don't know the details of the x86_64
    platform, but in my opinion they could affect situations like
    this one.

    All of this just to say that I have a small patch for the "[Env]
    Set?DO*" macros (clipssrc/evaluatn.h), which I enclosed in a
    small shell file. The script (it just invokes patch against a
    homemade diff) is attached as "pev_ia64.sh" at the bottom of
    the page: download it to the same directory as setup.py, and
    run it. What it does is add a cast to long to the macros'
    argument before the operation: maybe that's enough. In case it
    is, it should also correct other possible error conditions on
    this platform. I tested that the patched CLIPS source works
    on 32 bit platforms, and asked the folks@SF to have access
    to the compile farm (an AMD64 for x86_64).

    If you want to try to see what happens, collect the patch and
    try. As soon as my CF account is enabled I will do the same.
    For now this is the most that I can do, but there is still some
    time before the weekend ends.

    Have a nice weekend,

    F.
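
A minimal sketch of the cast-placement issue described above, assuming a macro that stores `val - 1` into a `long` field. The struct and macro names are hypothetical stand-ins modeled on the description in this comment, not the actual contents of clipssrc/evaluatn.h or of the attached patch.

```c
/* Hypothetical stand-in for the CLIPS data object; only the field
 * that matters here. */
struct data_object {
    long end;
};

/* Cast applied after the subtraction: the arithmetic is done in the
 * argument's own type, so an unsigned 0 wraps before it is widened. */
#define SET_END_CAST_AFTER(target, val) \
    ((target)->end = (long)((val) - 1))

/* Cast applied to the argument first: the subtraction is done in long. */
#define SET_END_CAST_BEFORE(target, val) \
    ((target)->end = (long)(val) - 1)

/* Small wrappers so the two variants can be compared directly. */
long end_cast_after(unsigned count)
{
    struct data_object o;
    SET_END_CAST_AFTER(&o, count);
    return o.end;
}

long end_cast_before(unsigned count)
{
    struct data_object o;
    SET_END_CAST_BEFORE(&o, count);
    return o.end;
}
```

Where `long` is 64 bits and `unsigned` is 32 bits, `end_cast_after(0)` yields 4294967295 while `end_cast_before(0)` yields -1. Where both types are 32 bits wide the two variants typically agree, which would be consistent with the problem not showing up on 32-bit builds.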

     
  • Francesco Garosi

    Possible patch for IA64 architectures

     
    Attachments
  • Francesco Garosi

    Logged In: YES
    user_id=328337

    Good news!

    I got my compile farm account, and have just successfully
    built and tested PyCLIPS there. I tried the unpatched
    version first, and got exactly your error. After
    patching evaluatn.h I ran the test suite and it completed
    successfully. Too bad the error is not in my source tree, but
    I'll provide the patch in following releases for people lucky
    enough to have access to 64bit platforms. Still nothing is known
    about Win64 - I can't afford it.

    I will just wait for your feedback to close the bug, and go
    back to my memory leak hunting.

    Thank you again for submitting the annoyance; if it still
    persists I'll keep working on it with your help. And if you find
    some other unexpected behaviour, bug notices and
    suggestions are always welcome...

    Hope to hear from you soon,

    F.

     
  • Matthew L Daniel

    • status: open --> closed
     
  • Matthew L Daniel

    Logged In: YES
    user_id=88251

    That patch works great. All test suites I have tried work
    without issue.

    I appreciate your help with this and hope it helps the CLIPS
    folks, too.

    Plus now you have an x86_64 compile farm account. Go you. :-)

     
  • Matthew L Daniel

    • status: closed --> closed-fixed
     
