#26 [0.60.1] Flakey behaviour on Win98SE

Milestone: posadis_0.60.x
Status: open
Owner: meilof
Labels: None
Priority: 7
Updated: 2003-10-01
Created: 2003-09-29
Creator: hardings
Private: No

Sorry about this. The 812280 fault has been fixed, but it
has unmasked a more insidious problem. I have opened a
new report because I have no reason to believe they are
related. My experiences may in fact be different
problems but I begin with one report.

I don't yet have a reproducible trigger apart from time
frame. The fault (if it is only one) has manifested in
three different ways.

1/ Posadis gave an IPF and the PC completely froze.
Could not get details of the IPF. Could not get Task
Manager up. Ctrl+Alt+Del did not respond. The PC was
completely unresponsive to anything except a hard reset.

2/ The PC froze again but this time no IPF message.
Same outcome.

3/ Posadis stopped returning lookups. The Posadis
window was still logging requests, and the log file was
still being updated, but Posadis was not returning any
DNS data to the calling machine. Checked from two
request sources.

The machine is a completely fresh Win98SE install fully
patched. No other applications running. None. It takes
about half an hour of consistent testing activity to
produce a failure. So far I have not run the program for
more than 30 minutes or so without a problem.

I will continue testing to see if I can get some
correlation.

Discussion

(Page 1 of 3)
  • hardings

    hardings - 2003-09-29
    • assigned_to: nobody --> meilof
     
  • meilof

    meilof - 2003-09-29
    • priority: 5 --> 7
     
  • hardings

    hardings - 2003-09-30

    Logged In: YES
    user_id=872022

    Adding another failure manifestation :-

    4/ Posadis stops responding and displays
    "abnormal program termination
    Press any key to continue . . ."

    Pressing "any key" unloads Posadis OK. Restarting Posadis
    without a PC reboot results in Posadis working OK, at least
    initially.

     
  • meilof

    meilof - 2003-09-30

    Logged In: YES
    user_id=186178

    After some testing, I can report that I have at least
    experienced (1) and (4). Trouble for me is my Win98 box is
    very unstable even without Posadis making trouble (for
    example, at this point, I am running without Explorer
    because it just gave an IPF). But either way, I'm going to
    make a debug build of Posadis, and see what happens.

     
  • hardings

    hardings - 2003-09-30

    Logged In: YES
    user_id=872022

    Adding another failure manifestation :-

    5/ On several occasions now the IPF has not locked up the
    PC. Two examples of an IPF text follow :-

    POSADIS caused an invalid page fault in
    module MSAFD.DLL at 017f:7b412d29.
    Registers:
    EAX=004c4c44 CS=017f EIP=7b412d29 EFLGS=00010202
    EBX=00000000 SS=0187 ESP=01a1fb7c EBP=01a1fbc8
    ECX=01a1ff88 DS=0187 ESI=00000000 FS=15f7
    EDX=8163f64c ES=0187 EDI=000003e8 GS=2fbe
    Bytes at CS:EIP:
    80 78 05 00 74 05 bb 14 27 00 00 8b 45 ec 6a 18
    Stack dump:
    bff86b28 01a1fcc4 bff7dfbf 01a1fbb0 01a1fcc4 00000000
    bff741f7 bffc9490 bff86b47 bffc9490 000001c8 81628b4c
    81631490 434f5357 2e32334b 004c4c44

    POSADIS caused an invalid page fault in
    module MSAFD.DLL at 017f:7b412d29.
    Registers:
    EAX=004c4c44 CS=017f EIP=7b412d29 EFLGS=00010202
    EBX=00000000 SS=0187 ESP=01c3fb7c EBP=01c3fbc8
    ECX=01c3ff88 DS=0187 ESI=00000000 FS=394f
    EDX=81641424 ES=0187 EDI=000003e8 GS=2fbe
    Bytes at CS:EIP:
    80 78 05 00 74 05 bb 14 27 00 00 8b 45 ec 6a 18
    Stack dump:
    bff86b28 01c3fcc4 bff7dfbf 01c3fbb0 01c3fcc4 00000000
    bff741f7 bffc9490 bff86b47 bffc9490 000000bc 81640024
    816416a8 434f5357 2e32334b 004c4c44

    MSAFD.DLL is the Winsock handler?

    I don't have a reliable cause sequence yet, but am working on
    it.
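
One way to read the two dumps above: x86 stores 32-bit values little-endian, and the last three stack words in both dumps fall in the printable ASCII range. The minimal sketch below (the helper name `dwords_to_ascii` is made up for illustration) decodes them as text. That the faulting code was handed string data where it expected a pointer is only a guess from this decoding, not something established in this report.

```python
import struct

def dwords_to_ascii(hex_words):
    """Treat 32-bit hex words as little-endian memory and decode them as ASCII."""
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in hex_words)
    # Stop at the first NUL byte, the way a C string would end.
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")

# The last three stack words shared by both dumps, and the EAX value.
tail_words = ["434f5357", "2e32334b", "004c4c44"]
decoded = dwords_to_ascii(tail_words)
eax_text = dwords_to_ascii(["004c4c44"])

print(decoded)   # the stack tail read as text
print(eax_text)  # EAX read as text
```

Decoded this way, the stack tail reads as a DLL file name and EAX holds its last four bytes.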

     
  • meilof

    meilof - 2003-10-01
    • summary: 0.60.1 Flakey behaviour on Win98SE --> [0.60.1] Flakey behaviour on Win98SE
     
  • hardings

    hardings - 2003-10-02

    Logged In: YES
    user_id=872022

    I have performed an extensive series of tests over two days.

    I have not been able to reproduce the failure with any
    consistent sequence but I always get a failure. Failure has
    occurred anywhere between 2 mins and 60 mins into a
    pseudo-consistent series of multiple DNS request loops.

    So far there have only been the 5 failure manifestations
    already notified.

    My conclusion is that the failure does not seem to be lookup
    specific nor activity specific, which I know makes it REALLY
    hard to find. I will continue playing but I am not a strong
    believer in serendipity.

    Let me know if there is anything else I can do to assist.

     
  • meilof

    meilof - 2003-10-03

    Logged In: YES
    user_id=186178

    Well... I am actually more or less able to reproduce the
    problems on my Windows 98 box, but it is way too unstable to
    be able to debug them, so I'm going to reinstall Win98 and
    hope the problems still occur. The problem doesn't seem to
    occur on Windows XP, though I'm not entirely sure about that.

    If there is no reliable way of reproducing the problem,
    which is quite possible, well, there is very little
    you can do without installing a complete compiler toolchain,
    which is something I _really_ wouldn't recommend... On Unix,
    there are core dumps, where an application automagically
    dumps its internal state when it crashes, but I don't think
    Windows has such a feature, so that wouldn't be helpful.
    I'd say stay tuned and I hope I am able to find the cause of
    this problem by myself. Thank you anyway for the research
    you put into this problem.

     
  • meilof

    meilof - 2003-10-04

    Logged In: YES
    user_id=186178

    I just found out the solution for debugging crashes for Win32
    programs has been right under my nose for some time...
    Mingw, the C++ compiler I use, has a great tool called "Dr.
    Mingw" that can provide stack traces for crashed applications
    (http://jrfonseca.dyndns.org/projects/gnu-win32/software/drmingw/).
    I have the executable here:

    http://www.posadis.org/files/drmingw.exe

    I don't know whether Mingw needs to be installed for this to
    work, but I don't think so. Now, when you run "drmingw.exe -i"
    from the command line, this program installs itself as the
    default crash handler (replacing, for example, Dr. Watson,
    which doesn't tell /me/ anything useful). This should
    give a stack trace of where exactly Posadis crashed. Now, for
    this, you will need a Posadis with debugging symbols installed,
    which can be found here:

    http://www.posadis.org/files/posstatic.zip

    (due to its nature, this executable doesn't support loadable
    modules, and it isn't dependent upon posadis.dll, which is
    baked right in).

    While I don't have access to my Windows 98 machine right
    now, I will fiddle around with this when I have time, but if
    you'd care to run Drmingw on this debug build of Posadis, this
    should provide really helpful information for me.

     
  • hardings

    hardings - 2003-10-05
     
  • hardings

    hardings - 2003-10-05

    Logged In: YES
    user_id=872022

    POSSTATIC failed immediately looking up the ntp.cs.mu.oz.au
    address that reliably triggered the 812280 fault. I got
    several drmingw dumps from it. They were all identical. I
    attach one.

    Then, very strangely, after successfully looking up other
    addresses, the ntp.cs.mu.oz.au address no longer causes a
    failure. Very odd indeed.

    I will continue to play.

     
  • hardings

    hardings - 2003-10-05

    Logged In: YES
    user_id=872022

    The results I am getting with POSSTATIC closely resemble
    those for 812280. Can you confirm that buffer problem has
    not found its way back in?

     
  • meilof

    meilof - 2003-10-05

    Logged In: YES
    user_id=186178

    Hmmm... You're right, I'm sorry. I built it from today's CVS,
    but with the old Poslib CVS installed. New exes:

    http://www.posadis.org/files/posstatic.zip

    In related news, the IP number lookup you mentioned as a
    reliable means of reproducing the previous crash did crash
    this posstatic on my Win95 just now. Sadly though this was
    an "abnormal program termination" rather than a crash,
    leaving me without a stacktrace :( Methinks Win98 might
    still have problems with IPv6-enabled code, but I'll have a
    closer look tomorrow.

     
  • hardings

    hardings - 2003-10-08

    Logged In: YES
    user_id=872022

    OK - so I expect you thought I might have given up. Well
    nearly.

    Posstatic is now tending to mirror the non-debug Posadis
    behaviour - which is good. Drmingw is not helping much yet.
    Most failures are lockups or result in Drmingw not revealing
    anything. I upload a file with both standard and Drmingw
    dumps in it.

    Interestingly, I have noticed that posstatic occasionally
    doesn't return any DNS data for a valid request. Often the full
    response is returned following one or more repeated requests.
    However once the data has been returned fully it is always
    returned fully during that session. This is not TLD related.

    On one occasion I have seen posstatic return PARTIAL data
    from an all records request repeatedly. This only changed
    when I requested a subdomain of that domain, which it
    properly returned fully on first request. Thereafter the original
    domain request was returned fully. Quite odd by my
    reckoning, but perhaps a useful clue.

    Still testing ...

     
  • hardings

    hardings - 2003-10-08
     
  • hardings

    hardings - 2003-10-08

    Logged In: YES
    user_id=872022

    An interesting sequence. The following is part of a series of all
    record UDP requests to POSSTATIC relating to the domain
    muq.org. The first two are the last in a series of 15 requests
    for muq.org, the next was a request for btech.muq.org and
    the last for muq.org again. Confirmation with a different DNS
    server showed the first 15 responses to be erroneous. It was
    only after the lookup for the subdomain that the domain
    lookup came good. Strange indeed.

    Answer Section:
    muq.org, NS, LAUREL.ACTLAB.UTEXAS.EDU
    muq.org, NS, HRDY.ACTLAB.UTEXAS.EDU
    ---
    Answer Section:
    muq.org, NS, LAUREL.ACTLAB.UTEXAS.EDU
    muq.org, NS, HRDY.ACTLAB.UTEXAS.EDU
    ---
    Answer Section:
    btech.muq.org, A, 128.83.194.15
    Authority Records Section:
    muq.org, NS, ns2.muq.org
    muq.org, NS, eith.muq.org
    Additional Records Section:
    ns2.muq.org, A, 128.83.194.18
    eith.muq.org, A, 128.83.194.15
    ---
    Answer Section:
    muq.org, NS, ns2.muq.org
    muq.org, NS, eith.muq.org
    Additional Records Section:
    ns2.muq.org, A, 128.83.194.18
    eith.muq.org, A, 128.83.194.15

     
  • meilof

    meilof - 2003-10-08

    Logged In: YES
    user_id=186178

    > OK - so I expect you thought I might have given up. Well
    > nearly.

    I'm relieved :)

    > Drmingw is not helping much yet. Most failures are lockups or
    > result in Drmingw not revealing anything. I upload a file with
    > both standard and Drmingw dumps in it.

    Actually, looking at the drmingw output, it _does_ seem to
    be very helpful: the Drmingw output contains a stack trace
    with line number references for the source code, which lets
    me know exactly where the crash occurred. This can really
    help me with debugging.

    > Interestingly, I have noticed that posstatic occasionally
    > doesn't return any DNS data for a valid request. Often the full
    > response is returned following one or more repeated requests.
    > However once the data has been returned fully it is always
    > returned fully during that session. This is not TLD related.

    Well, I have done my stress testing mostly under Linux, and
    Posadis does seem to be working quite well there. So I will
    do some more Win98 testing to see if I have the same issues.

    > On one occasion I have seen posstatic return PARTIAL data
    > from an all records request repeatedly. This only changed
    > when I requested a subdomain of that domain, which it
    > properly returned fully on first request. Thereafter the original
    > domain request was returned fully. Quite odd by my
    > reckoning, but perhaps a useful clue.

    /This/ is not a bug, it's how the DNS protocol works: it
    says that a DNS server should respond to an ANY query with
    all RRsets it has already cached. Only if it doesn't have
    any data about the domain does it try to get all domain
    information. This is because a server cannot determine from
    the RRs it already has whether that's all the data for the
    domain, making it in effect hard to completely cache an ANY
    answer.
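
The caching behaviour described here can be illustrated with a toy model. This is a minimal sketch, not Posadis code: the class `ToyCache`, the `example.test` name and the record values are all invented for illustration, and a plain dictionary stands in for the authoritative servers.

```python
class ToyCache:
    """Toy caching resolver: ANY is answered from whatever is already cached."""

    def __init__(self, upstream):
        self.upstream = upstream  # name -> {rtype: [values]}, stands in for real servers
        self.cache = {}           # name -> {rtype: [values]}

    def query(self, name, rtype):
        cached = self.cache.setdefault(name, {})
        if rtype == "ANY":
            if cached:
                # Something is cached already: return it, even if incomplete.
                return dict(cached)
            # Nothing cached at all: fetch everything the upstream has.
            cached.update(self.upstream.get(name, {}))
            return dict(cached)
        # Specific type: fetch and cache it on first use.
        if rtype not in cached and rtype in self.upstream.get(name, {}):
            cached[rtype] = list(self.upstream[name][rtype])
        return {rtype: cached[rtype]} if rtype in cached else {}


upstream = {
    "example.test": {
        "A": ["192.0.2.1"],
        "MX": ["mail.example.test"],
        "NS": ["ns.example.test"],
    }
}

resolver = ToyCache(upstream)
resolver.query("example.test", "A")               # caches only the A RRset
partial = resolver.query("example.test", "ANY")   # answered from the partial cache
fresh = ToyCache(upstream).query("example.test", "ANY")  # empty cache, full fetch
```

In this model `partial` contains only the A record while `fresh` contains all three types, which mirrors the "PARTIAL data" observation in the quoted report.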

    > An interesting sequence. The following is part of a series of
    > all record UDP requests to POSSTATIC relating to the domain
    > muq.org. The first two are the last in a series of 15 requests
    > for muq.org, the next was a request for btech.muq.org and
    > the last for muq.org again. Confirmation with a different DNS
    > server showed the first 15 responses to be erroneous. It was
    > only after the lookup for the subdomain that the domain
    > lookup came good. Strange indeed.

    > Answer Section:
    > muq.org, NS, LAUREL.ACTLAB.UTEXAS.EDU
    > muq.org, NS, HRDY.ACTLAB.UTEXAS.EDU
    >
    > Answer Section:
    > muq.org, NS, ns2.muq.org
    > muq.org, NS, eith.muq.org
    > Additional Records Section:
    > ns2.muq.org, A, 128.83.194.18
    > eith.muq.org, A, 128.83.194.15

    Actually, these first responses are _not_ erroneous: they're
    the NS records from the ORG nameservers. Apparently, the
    administrators for the muq.org domain have told the
    administrators of the ORG domain that Laurel and Hardy are
    the nameservers for muq.org, so when Posadis asks the ORG
    nameservers for {muq.org,NS} and it finds this answer, it
    stops looking because it considers this information enough
    (and it should be). Apparently in their /own/ zone data, the
    muq.org people have listed ns2 and eith as their DNS
    servers. So we have two authoritative sources spreading out
    different information. Posadis gets the information from ORG
    first, so it uses that. And when Posadis asks the muq.org
    nameservers about the btech domain, the muq.org nameservers
    will send a list of their nameservers along with the btech
    answers, and Posadis will store these new NS records in its
    cache, causing the next {muq.org,NS} query to return the
    other NS list.

    In conclusion: this is not a Posadis problem, but a problem
    with the muq.org zone data. And FYI, the laurel and hardy
    domains _do_ point to the same IP numbers as the ns2 and
    eith domains.
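
The sequence described above can be sketched as a cache in which NS data learned from the zone's own servers replaces NS data learned from the parent zone. This is a simplified model under stated assumptions: the two-level ranking is loosely based on the data-ranking idea in RFC 2181, and the `NsCache` class and its constants are hypothetical. How Posadis actually decides to replace cached records is not stated in this thread.

```python
# Hypothetical trust levels: data from the zone itself outranks the parent's delegation.
REFERRAL, AUTHORITATIVE = 1, 2


class NsCache:
    """Toy NS cache that keeps the most trustworthy record set seen so far."""

    def __init__(self):
        self.entries = {}  # zone -> (rank, [nameservers])

    def store(self, zone, nameservers, rank):
        current = self.entries.get(zone)
        # Replace only if the new data is at least as trustworthy as the old.
        if current is None or rank >= current[0]:
            self.entries[zone] = (rank, list(nameservers))

    def lookup(self, zone):
        entry = self.entries.get(zone)
        return entry[1] if entry else None


cache = NsCache()

# 1. The ORG servers name the delegation for muq.org.
cache.store("muq.org",
            ["LAUREL.ACTLAB.UTEXAS.EDU", "HRDY.ACTLAB.UTEXAS.EDU"], REFERRAL)
before = cache.lookup("muq.org")

# 2. A btech.muq.org lookup brings back the zone's own NS list in the
#    authority section, which replaces the delegation data.
cache.store("muq.org", ["ns2.muq.org", "eith.muq.org"], AUTHORITATIVE)
after = cache.lookup("muq.org")
```

Here `before` holds the list published by the ORG servers and `after` holds the list from the muq.org zone itself, matching the order of answers in the quoted output.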

     
  • meilof

    meilof - 2003-10-08

    Logged In: YES
    user_id=186178

    Remarkably enough, now that I have taken a short look, the
    Drmingw stack trace you sent is exactly the same as the
    stack trace I found when debugging the IPv6 problem that I
    have now (I hope at least) solved! Are you sure you're
    running the latest Posstatic, that is, the one with the date
    5 Oct 2003, and the latest Poslib and Posserver DLLs,
    labeled 5 Oct 2003 as well? Please try renaming the
    poslib.dll file temporarily and see if posstatic says it
    cannot find the DLL. If this is /not/ the case, you'll know
    there's another poslib.dll file somewhere else Posadis may use.
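
Another way to look for stray copies is to list every directory the loader might search and check each one for the DLL names. The sketch below is an approximation under stated assumptions: the function name `find_copies` is made up, and searching only the current directory plus the `PATH` entries is a stand-in for the real Windows DLL search order, which also includes the application and system directories.

```python
import os


def find_copies(directories, wanted_names):
    """Return paths of files in the given directories whose names match, ignoring case."""
    wanted = {name.lower() for name in wanted_names}
    hits = []
    for directory in directories:
        try:
            entries = os.listdir(directory)
        except OSError:
            continue  # skip missing or unreadable directories
        for entry in entries:
            if entry.lower() in wanted:
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    # Assumption: current directory plus PATH approximates the search locations.
    search_dirs = [os.getcwd()] + os.environ.get("PATH", "").split(os.pathsep)
    for path in find_copies(search_dirs, ["poslib.dll", "posserver.dll"]):
        print(path)
```

More than one path printed for the same DLL name would mean an older copy may be picked up instead of the intended one.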

    If you have old posstatic.exe, poslib.dll and/or
    posserver.dll files (labeled before 5-Oct-2003), you can
    download the very latest at

    http://www.posadis.org/files/posstatic.zip

    As a side note, I'm running this Posadis for an hour or so
    now without problems (only query logging doesn't work but
    I'll look into that later). Let me know about your experiences.

     
  • hardings

    hardings - 2003-10-08

    Logged In: YES
    user_id=872022

    posstatic.exe 05/10/03 04:57
    poslib.dll 05/10/03 04:22
    posserver.dll 05/10/03 04:26

    These are the only instances on the machine.

    I accept what you say about the DNS behaviour, but does it
    explain the nul returns?

    Happy hunting.

     
  • meilof

    meilof - 2003-10-09

    Logged In: YES
    user_id=186178

    > posstatic.exe 05/10/03 04:57
    > poslib.dll 05/10/03 04:22
    > posserver.dll 05/10/03 04:26
    >
    > These are the only instances on the machine.

    Hmmm... That's a bit odd then. Well, I guess I'll need to do
    more debugging myself then.

    > I accept what you say about the DNS behaviour, but does it
    > explain the nul returns?

    You mean this:

    > Interestingly, I have noticed that posstatic occasionally
    > doesn't return any DNS data for a valid request. Often the full
    > response is returned following one or more repeated requests.
    > However once the data has been returned fully it is always
    > returned fully during that session. This is not TLD related.

    Most likely, the lookups for these requests took so long
    that the client gave up waiting before the answer was
    received. Depending on the domain name being looked up, a
    lookup might take a few seconds or more if the DNS servers
    Posadis uses take a long time to answer. Once a lookup
    succeeds, Posadis can directly return the cached information.
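
The timing argument above can be shown with a small simulation. It is only a sketch: the latency and timeout figures, the placeholder address and the names `SlowThenCachedResolver` and `client_lookup` are invented for illustration, and no real network traffic or waiting is involved.

```python
class SlowThenCachedResolver:
    """Toy resolver: the first lookup pays the upstream latency, later ones hit the cache."""

    def __init__(self, upstream_latency):
        self.upstream_latency = upstream_latency  # simulated seconds
        self.cache = {}

    def resolve(self, name):
        """Return (answer, simulated_seconds_taken)."""
        if name in self.cache:
            return self.cache[name], 0.0
        answer = "192.0.2.10"  # placeholder address for the example
        # The answer is cached even if the client has stopped waiting by then.
        self.cache[name] = answer
        return answer, self.upstream_latency


def client_lookup(resolver, name, timeout):
    """Return the answer, or None if it arrived after the client's timeout."""
    answer, elapsed = resolver.resolve(name)
    return answer if elapsed <= timeout else None


resolver = SlowThenCachedResolver(upstream_latency=5.0)
first = client_lookup(resolver, "slow.example.test", timeout=2.0)   # client gives up
second = client_lookup(resolver, "slow.example.test", timeout=2.0)  # served from cache
```

In this model the first request yields nothing and the repeat succeeds at once, which is the pattern this explanation predicts for a client with a short timeout.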

     
  • Nobody/Anonymous

    Logged In: NO

    Yes, but I said nul returns, not no returns. The client I use for
    testing when I see aberrant behaviour is Cyberkit by Luc
    Niejens. It is useful for changing target DNS servers quickly
    and seeing verbose return records. Cyberkit waits indefinitely
    (so far as I can tell) for a response. The nul returns I get
    from Posstatic are quite quick, but they are _empty_ of any
    data.

    I'm sorry I'm not getting you much debug data, but the failures
    take a lot longer on average now, and nearly all of them are
    resulting in a complete machine lockup. I hope to do some
    more consistent long term testing on the weekend.

     
  • meilof

    meilof - 2003-10-10

    Logged In: YES
    user_id=186178

    > Yes, but I said nul returns, not no returns. The client I use
    > for testing when I see aberrant behaviour is Cyberkit by Luc
    > Niejens. It is useful for changing target DNS servers quickly
    > and seeing verbose return records.

    All right, so can you turn on the "Verbose" display style of
    Cyberkit in the options menu, and post the DNS message of a
    nul return? This might help me analyze what's wrong.
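
For anyone capturing the raw reply instead, the 12-byte DNS header already separates "a response with no records" from "no response at all". The sketch below unpacks the header fields defined in RFC 1035. The function names and the sample bytes are made up for illustration, and equating a "nul return" with a reply that has zero records in every section is an assumption about what was observed.

```python
import struct


def summarize_dns_header(message: bytes) -> dict:
    """Unpack the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    if len(message) < 12:
        raise ValueError("a DNS message needs at least a 12-byte header")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", message[:12])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "truncated": bool(flags & 0x0200),    # TC bit
        "rcode": flags & 0x000F,              # 0 = no error, 2 = server failure, 3 = name error
        "questions": qdcount,
        "answers": ancount,
        "authority": nscount,
        "additional": arcount,
    }


def is_empty_reply(message: bytes) -> bool:
    """True for a response that carries no records in any section."""
    header = summarize_dns_header(message)
    return (header["is_response"]
            and header["answers"] == 0
            and header["authority"] == 0
            and header["additional"] == 0)


# Hand-built sample header: a response (flags 0x8180), one question, no records.
sample = struct.pack("!6H", 0x1234, 0x8180, 1, 0, 0, 0)
summary = summarize_dns_header(sample)
empty = is_empty_reply(sample)
```

The `rcode` value is worth recording alongside the counts, since an empty reply with a server-failure code points somewhere different from an empty reply with no error.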

     
  • hardings

    hardings - 2003-10-11
     
  • hardings

    hardings - 2003-10-11

    Logged In: YES
    user_id=872022

    This Drmingw dump actually looks useful. It occurred looking
    up www.aoob.org. Uploaded as 0310111912.txt.
    Other dumps uploaded 0310111113.txt, 0310111905.txt, and
    0310111908.txt. The second (05) is a standard IPF where
    Drmingw didn't capture anything.
    Have not had another nul return yet. Still testing.

     
  • hardings

    hardings - 2003-10-11
     