I typically start an "nget" session with "nget -g
usual", where "usual"
is a group alias for a list of other group aliases.
This normally works
fine, unless one of the servers can't be connected to,
in which case it saves the cache and goes on to the
next server... sometimes. Other times, it saves the
cache but doesn't go on to the next server, but rather
simply stops. Repeated tries eventually get headers
for all the groups in "usual", even if the bad server
stays unreachable.
Moreover, "nget -h badserver -g group" does a really
bad thing if a successful connection isn't made -- it
wipes the cache for *all* servers!
You should be able to reproduce this fairly easily by
adding "biggulp.readfreenews.net" to your server mix,
as it often returns a "400" or "502" code, and needs to
be retried a lot to get through. (It's a widely
publicized free server, and well, you get what you pay
for.)
Logged In: YES
user_id=65253
I'm unable to reproduce these problems. What version of
nget are you using? What os/compiler?
It sounds like perhaps you are having some sort of
mishandling of exceptions...
Logged In: YES
user_id=803104
nget 0.24.1 (CVS), built on NetBSD 1.6.1/i386, which uses
gcc-2.95.3, and running on NetBSD 1.6.1 or NetBSD 1.6ZC.
nget 0.23 seemed to do it occasionally, too.
I saw a slightly different problem today, and there wasn't
even an error; nget seemed to get the headers for the first
group from two servers successfully, but then just stopped
before going to the third, and wouldn't even save the cache.
It did this three times in a row, until I did a SIGINT to
force nget to save the cache and quit. The timing was
different on the next round, so then it worked.
The bug is very timing dependent. You might be able to
reproduce it more easily by setting short "idletimeout" values.
Logged In: YES
user_id=65253
I tried recompiling with 2.95.4 and using idletimeout=1, I
still couldn't reproduce it.
If you have python >= 2.2, could you try running the test
suite? (cd test; ./test_nget.py)
error log
Logged In: YES
user_id=803104
I just built python-2.2.3 on NetBSD-1.6ZC, but it's
evidently inadequate for the test...
fredb@seduction-> ./test_nget.py
Traceback (most recent call last):
  File "./test_nget.py", line 24, in ?
    import nntpd, util
  File "./nntpd.py", line 22, in ?
    import threading
  File "/usr/pkg/lib/python2.2/threading.py", line 5, in ?
    import thread
ImportError: No module named thread
I also rebuilt nget with the native compiler (now
gcc-3.3.1), but it did it again (log attached). For this
run, there are only two servers scored > _level. Notice that
it inexplicably bails after timing out on nntpworld. On the
next run, nntpworld stays connected, and it goes on to the
next group with giganews.
Logged In: YES
user_id=65253
Ah, yes, it requires thread support too. I'm not sure why
your build of python didn't include it.. perhaps python
doesn't support threading on openbsd, or needs specific
prodding on how to find it... dunno :/
That is certainly an abnormal exit. Could you run it under
gdb and try to get a backtrace of where it happens? (Maybe
try a few times to see if it's consistent.)
Logged In: YES
user_id=803104
NetBSD. I found the *threaded* python package, built and
installed it, and ran the test suite. "nget" failed eight of
its tests, including five core dumps. I rebuilt "nget" with
"-O", same thing (which would tend to rule out a compiler
bug). The backtrace from the latest core is:
(gdb) bt
#0 0x08080b4c in c_file_tcp::isopen() const ()
#1 0x0805cd71 in c_prot_nntp::nntp_group(c_group_info*
const&, int, nget_options const&) ()
#2 0x080513fd in do_args(int, char const**, nget_options,
int) ()
#3 0x08053f6c in main ()
#4 0x0804b9b2 in ___start ()
I rebuilt again from a pristine nget-0.24.1+uulib tarball,
and now only python dumps core the first time -- native
pthreads + multiprocessor aren't very stable on NetBSD yet
-- the second time, only one test fails, no core.... but it
just showed the bug again on a real session. :-(
I'll try running it under the debugger tomorrow.
Logged In: YES
user_id=803104
Attached is a trace from a gdb session where nget fails as
discussed.
Logged In: YES
user_id=65253
Hm. Well, I dunno what to make of that. No functions from nget are even listed in the backtrace... Running it on an unstripped nget exe might help (note the "(no debugging symbols found)..." messages).
Logged In: YES
user_id=803104
Hmm, yes. Attached.
Ugly raw "script" of gdb session
probable fix
Logged In: YES
user_id=65253
Ah, now that is more helpful.
It seems to me that the connection times out while nget is
retrieving headers from a different server, then when it
tries to write the quit command to the server, the os kills
it with a sigpipe rather than just returning an error from
the send function... I see there is a MSG_NOSIGNAL flag
that should be used to avoid this, though it's curious I've
never encountered this myself.
Try the attached patch and see if it fixes it.
Logged In: YES
user_id=803104
Doesn't compile:
fredb@seduction-> gmake
g++ -g -O2 -Wall -DHAVE_CONFIG_H -Iuulib -I/usr/mm/include -c -o nget.o nget.cc
In file included from nget.cc:47:
sockstuff.h: In function `int sock_write(int, const char*, unsigned int)':
sockstuff.h:61: error: `MSG_NOSIGNAL' undeclared (first use this function)
sockstuff.h:61: error: (Each undeclared identifier is reported only once for each function it appears in.)
gmake: *** [nget.o] Error 1
I'm looking through the kernel code, and there doesn't seem
to be any way to avoid delivery of the SIGPIPE if writing to
a disconnecting socket. I suppose you could use signal() to
just ignore it.
Logged In: YES
user_id=803104
Adding a
signal(SIGPIPE,SIG_IGN);
near the other signal() calls in nget.cc:main() seems to do
the job.
At least, I just finally got past nntpworld timing out,
without nget getting killed.
Logged In: YES
user_id=65253
Cool. Thanks for the help tracking that down. I just
committed the fix to cvs.
Now, onto the other part. Even with possibly getting killed
by sigpipe, an error shouldn't make it wipe the cache... Is
this reproducible? Does the cache file disappear
completely, or is it just replaced with an empty one?
Logged In: YES
user_id=803104
I've been thinking about that, and I think what must have
happened is that the old headers were flushed because they
had expired, but then the downloaded headers were never
saved because of the SIGPIPE thing. The file of gzipped
headers was never actually deleted; it just seemed that it
was downloading too much.
I believe the unhandled SIGPIPE from writing into a dropped
connection, or into a connection that happened to be in
CLOSE_WAIT state, accounts for all the problems I've been
seeing, and that it would be fine to close this PR now. Thanks!
Logged In: YES
user_id=65253
Ah, that makes sense. Thanks for your help.