Menu

#18 get headers doesn't always handle server failure gracefully

closed-fixed
nobody
None
5
2003-09-27
2003-09-18
No

I typically start an "nget" session with "nget -g
usual", where "usual"
is a group alias for a list of other group aliases.
This normally works
fine, unless one of the servers can't be connected to,
in which case it saves the cache and goes on to the
next server... sometimes. Other times, it saves the
cache but doesn't go on the next server, but rather
simply stops. Repeated tries eventually gets headers
for all the groups in "usual", even if the bad server
stays unreachable.

Moreover, "nget -h badserver -g group" does a really
bad thing if a succesful connection isn't made -- it
wipes the cache for *all* servers!

You should be able to reproduce this fairly easily by
adding "biggulp.readfreenews.net" to your server mix,
as it often returns a "400" or "502" code, and needs to
be retried a lot to get through. (It's a widely
publicized free server, and well, you get what you pay
for.)

Discussion

  • Matthew Mueller

    Matthew Mueller - 2003-09-25

    Logged In: YES
    user_id=65253

    I'm unable to reproduce these problems. What version of
    nget are you using? What os/compiler?

    It sounds like perhaps you are having some sort of
    mishandling of exceptions...

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    nget 0.24.1 (CVS), built on NetBSD 1.6.1/i386, which uses
    gcc-2.95.3, and running on NetBSD 1.6.1 or NetBSD 1.6ZC.
    nget 0.23 seemed to do it occasionally, too.

    I saw a slightly different problem today, and there wasn't
    even an error; nget seemed to get the headers for the first
    group from two servers successfully, but then just stopped
    before going to the third, and wouldn't even save the cache.
    It did this three times in a row, until I did a SIGINT to
    force nget to save the cache and quit. The timing was
    different on the next round, so then it worked.

    The bug is very timing dependent. You might be able to
    reproduce it easier by setting short "idletimeout"'s.

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-26

    Logged In: YES
    user_id=65253

    I tried recompiling with 2.95.4 and using idletimeout=1, I
    still couldn't reproduce it.

    If you have python >= 2.2, could you try running the test
    suite? (cd test; ./test_nget.py)

     
  • Frederick Bruckman

    error log

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    I just built python-2.2.3 on NetBSD-1.6ZC, but it's
    evidently inadequate for the test...

    fredb@seduction-> ./test_nget.py
    Traceback (most recent call last):
    File "./test_nget.py", line 24, in ?
    import nntpd, util
    File "./nntpd.py", line 22, in ?
    import threading
    File "/usr/pkg/lib/python2.2/threading.py", line 5, in ?
    import thread
    ImportError: No module named thread

    I also rebuilt nget with the native compiler (now
    gcc-3.3.1), but it did it again (log attached). For this
    run, there are only two servers scored > _level. Notice that
    it inexplicably bails after timing out on nntpworld. On the
    next run, nntpworld stays connected, and it goes on to the
    next group with giganews.

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-26

    Logged In: YES
    user_id=65253

    Ah, yes, it requires thread support too. I'm not sure why
    your build of python didn't include it.. perhaps python
    doesn't support threading on openbsd, or needs specific
    prodding on how to find it... dunno :/

    That is certainly an abnormal exit. Could you run it under
    gdb and try to get a backtrace of where it happens? (Maybe
    try a few times to see if its consistant.)

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    NetSD. I found the *threaded* python package, built and
    installed it, and ran the test suite. "nget" failed eight of
    its tests, including five core dumps. I rebuilt "nget" with
    "-O", same thing (which would tend to rule out a compiler
    bug). The backtrace from the latest core is:

    (gdb) bt
    #0 0x08080b4c in c_file_tcp::isopen() const ()
    #1 0x0805cd71 in c_prot_nntp::nntp_group(c_group_info*
    const&, int, nget_options const&) ()
    #2 0x080513fd in do_args(int, char const**, nget_options,
    int) ()
    #3 0x08053f6c in main ()
    #4 0x0804b9b2 in ___start ()

    I rebuilt again from a pristine nget-0.24.1+uulib tarball,
    and now only python dumps core the first time -- native
    pthreads + multiprocessor aren't very stable on NetBSD yet
    -- the second time, only one test fails, no core.... but it
    just showed the bug again on a real session. :-(

    I'll try running it under the debugger tomorrow.

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    Attached is a trace from a gdb session where nget fails as
    discussed.

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-26

    Logged In: YES
    user_id=65253

    Hm. Well, I dunno what to make of that. No functions from nget are even listed in the backtrace... Running it on an unstripped nget exe might help ("(no debugging symbols found)..." messages..).

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    Hmm, yes. Attached.

     
  • Frederick Bruckman

    Ugly raw "script" of gdb session

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-27

    probable fix

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-27

    Logged In: YES
    user_id=65253

    Ah, now that is more helpful.

    It seems to me that the connection times out while nget is
    retrieving headers from a different server, then when it
    tries to write the quit command to the server, the os kills
    it with a sigpipe rather than just returning an error from
    the send function... I see there is a MSG_NOSIGNAL flag
    that should be used avoid this, though its curious I've
    never encountered this myself.

    Try the attached patch and see if it fixes it.

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    Doesn't compile:

    fredb@seduction-> gmake
    g++ -g -O2 -Wall -DHAVE_CONFIG_H -Iuulib -I/usr/mm/include
    -c -o nget.o nget.cc
    In file included from nget.cc:47:
    sockstuff.h: In function `int sock_write(int, const char*,
    unsigned int)':
    sockstuff.h:61: error: `MSG_NOSIGNAL' undeclared (first use
    this function)
    sockstuff.h:61: error: (Each undeclared identifier is
    reported only once for
    each function it appears in.)
    gmake: *** [nget.o] Error 1

    I'm looking through the kernel code, and there doesn't seem
    to be any way to avoid delivery of the SIGPIPE if writing to
    a disconnecting socket. I suppose you could use signal() to
    just ignore it.

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    Adding a

    signal(SIGPIPE,SIG_IGN);

    to near the other signal()'s in nget.cc:main() seems to do
    the job.
    At least, I just finally got past nntpworld timing out,
    without nget getting killed.

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-27

    Logged In: YES
    user_id=65253

    Cool. Thanks for the help tracking that down. I just
    committed the fix to cvs.

    Now, onto the other part. Even with possibly getting killed
    by sigpipe, an error shouldn't make it wipe the cache... Is
    this reproducable? Does the cache file disappear
    completely, or is it just replaced with an empty one?

     
  • Frederick Bruckman

    Logged In: YES
    user_id=803104

    I've been thinking about that, and I think what must have
    happened is that the old headers were flushed because they
    had expired, but then the downloaded headers were never
    saved because of the SIGPIPE thing. The file of gzipped
    headers was never actually deleted; it just seemed that it
    was downloading too much.

    I believe the unhandled SIGPIPE from writing into a dropped
    connection, or into a connection that happened to be in
    CLOSE_WAIT state, accounts for all the problems I've been
    seeing, and that it would be fine to close this PR now. Thanks!

     
  • Matthew Mueller

    Matthew Mueller - 2003-09-27
    • status: open --> closed-fixed
     
  • Matthew Mueller

    Matthew Mueller - 2003-09-27

    Logged In: YES
    user_id=65253

    Ah, that makes sense. Thanks for your help.

     

Log in to post a comment.