nget / Bugs / #18 get headers doesn't always handle server failure gracefully

Matthew Mueller - 2003-09-25

Logged In: YES
user_id=65253

I'm unable to reproduce these problems. What version of
nget are you using? What os/compiler?

It sounds like perhaps you are having some sort of
mishandling of exceptions...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

Logged In: YES
user_id=803104

nget 0.24.1 (CVS), built on NetBSD 1.6.1/i386, which uses
gcc-2.95.3, and running on NetBSD 1.6.1 or NetBSD 1.6ZC.
nget 0.23 seemed to do it occasionally, too.

I saw a slightly different problem today, and there wasn't
even an error; nget seemed to get the headers for the first
group from two servers successfully, but then just stopped
before going to the third, and wouldn't even save the cache.
It did this three times in a row, until I did a SIGINT to
force nget to save the cache and quit. The timing was
different on the next round, so then it worked.

The bug is very timing dependent. You might be able to
reproduce it easier by setting short "idletimeout"'s.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-26

Logged In: YES
user_id=65253

I tried recompiling with 2.95.4 and using idletimeout=1, I
still couldn't reproduce it.

If you have python >= 2.2, could you try running the test
suite? (cd test; ./test_nget.py)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

error log

nget.log

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

Logged In: YES
user_id=803104

I just built python-2.2.3 on NetBSD-1.6ZC, but it's
evidently inadequate for the test...

fredb@seduction-> ./test_nget.py
Traceback (most recent call last):
File "./test_nget.py", line 24, in ?
import nntpd, util
File "./nntpd.py", line 22, in ?
import threading
File "/usr/pkg/lib/python2.2/threading.py", line 5, in ?
import thread
ImportError: No module named thread

I also rebuilt nget with the native compiler (now
gcc-3.3.1), but it did it again (log attached). For this
run, there are only two servers scored > _level. Notice that
it inexplicably bails after timing out on nntpworld. On the
next run, nntpworld stays connected, and it goes on to the
next group with giganews.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-26

Logged In: YES
user_id=65253

Ah, yes, it requires thread support too. I'm not sure why
your build of python didn't include it.. perhaps python
doesn't support threading on openbsd, or needs specific
prodding on how to find it... dunno :/

That is certainly an abnormal exit. Could you run it under
gdb and try to get a backtrace of where it happens? (Maybe
try a few times to see if its consistant.)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

Logged In: YES
user_id=803104

NetSD. I found the *threaded* python package, built and
installed it, and ran the test suite. "nget" failed eight of
its tests, including five core dumps. I rebuilt "nget" with
"-O", same thing (which would tend to rule out a compiler
bug). The backtrace from the latest core is:

(gdb) bt
#0 0x08080b4c in c_file_tcp::isopen() const ()
#1 0x0805cd71 in c_prot_nntp::nntp_group(c_group_info*
const&, int, nget_options const&) ()
#2 0x080513fd in do_args(int, char const**, nget_options,
int) ()
#3 0x08053f6c in main ()
#4 0x0804b9b2 in ___start ()

I rebuilt again from a pristine nget-0.24.1+uulib tarball,
and now only python dumps core the first time -- native
pthreads + multiprocessor aren't very stable on NetBSD yet
-- the second time, only one test fails, no core.... but it
just showed the bug again on a real session. :-(

I'll try running it under the debugger tomorrow.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

Logged In: YES
user_id=803104

Attached is a trace from a gdb session where nget fails as
discussed.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-26

Logged In: YES
user_id=65253

Hm. Well, I dunno what to make of that. No functions from nget are even listed in the backtrace... Running it on an unstripped nget exe might help ("(no debugging symbols found)..." messages..).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

Logged In: YES
user_id=803104

Hmm, yes. Attached.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-26

Ugly raw "script" of gdb session

gdb.nget.log

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-27

probable fix

nget-sigpipe.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-27

Logged In: YES
user_id=65253

Ah, now that is more helpful.

It seems to me that the connection times out while nget is
retrieving headers from a different server, then when it
tries to write the quit command to the server, the os kills
it with a sigpipe rather than just returning an error from
the send function... I see there is a MSG_NOSIGNAL flag
that should be used avoid this, though its curious I've
never encountered this myself.

Try the attached patch and see if it fixes it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-27

Logged In: YES
user_id=803104

Doesn't compile:

fredb@seduction-> gmake
g++ -g -O2 -Wall -DHAVE_CONFIG_H -Iuulib -I/usr/mm/include
-c -o nget.o nget.cc
In file included from nget.cc:47:
sockstuff.h: In function `int sock_write(int, const char*,
unsigned int)':
sockstuff.h:61: error: `MSG_NOSIGNAL' undeclared (first use
this function)
sockstuff.h:61: error: (Each undeclared identifier is
reported only once for
each function it appears in.)
gmake: *** [nget.o] Error 1

I'm looking through the kernel code, and there doesn't seem
to be any way to avoid delivery of the SIGPIPE if writing to
a disconnecting socket. I suppose you could use signal() to
just ignore it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-27

Logged In: YES
user_id=803104

Adding a

signal(SIGPIPE,SIG_IGN);

to near the other signal()'s in nget.cc:main() seems to do
the job.
At least, I just finally got past nntpworld timing out,
without nget getting killed.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-27

Logged In: YES
user_id=65253

Cool. Thanks for the help tracking that down. I just
committed the fix to cvs.

Now, onto the other part. Even with possibly getting killed
by sigpipe, an error shouldn't make it wipe the cache... Is
this reproducable? Does the cache file disappear
completely, or is it just replaced with an empty one?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Frederick Bruckman - 2003-09-27

Logged In: YES
user_id=803104

I've been thinking about that, and I think what must have
happened is that the old headers were flushed because they
had expired, but then the downloaded headers were never
saved because of the SIGPIPE thing. The file of gzipped
headers was never actually deleted; it just seemed that it
was downloading too much.

I believe the unhandled SIGPIPE from writing into a dropped
connection, or into a connection that happened to be in
CLOSE_WAIT state, accounts for all the problems I've been
seeing, and that it would be fine to close this PR now. Thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-27

status: open --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthew Mueller - 2003-09-27

Logged In: YES
user_id=65253

Ah, that makes sense. Thanks for your help.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

get headers doesn't always handle server failure gracefully

Group

Searches

Help

#18 get headers doesn't always handle server failure gracefully

Discussion