"Wesley W. Terpstra" wrote:
>
>
> On Fri, Jun 06, 2003 at 12:06:03PM -0400, Kevin Brosius wrote:
> > It seems I'm losing parts of the database sometimes. I was all up and
> > running again, and I was trying to reproduce the problem. It appears to
> > be a lurker-index problem (or a lurker-prune interaction.) I had
> > delivery working, and pages were being served fine. I verified delivery
> > was clean, then enabled lurker prune. I ran it a couple times manually
> > also, to prove that it didn't delete any parts of the database.
>
> Since lurker-prune only opens read-only it shouldn't delete anything.
> It might be a locking interaction though.
>
> > Everything looked clean, and I had already received some of the deferred
> > messages from my external postfix queue. I then tried to flush the
> > queue and pick up all the deferred mail. After some number of messages
> > (6 actually), postfix started reporting:
> >
> > Command died with status 1: "/usr/local/bin/lurker-index -c
> > /srv/www/lurker.conf -l plug -m". Command output: opening database: No
> > such file or directory
>
> Great!!! This is the problem Federico ran into. I would love to fix it.
>
> > The db directory looks like:
> >
> > total 290118
> > drwxr-xr-x 2 wwwrun root 368 2003-06-06 11:47 .
> > drwxr-xr-x 10 root root 304 2003-06-06 11:19 ..
> > -rw-r--r-- 1 wwwrun users 33 2003-06-06 11:47 db
> > -rw------- 1 wwwrun nogroup 8192 2003-06-06 11:47 db.iee
> > -rw-r--r-- 1 wwwrun users 1114112 2003-06-06 10:36 db.ioq
> > -rw-r--r-- 1 wwwrun users 76259328 2003-06-06 10:35 db.ipk
> > -rw-r--r-- 1 wwwrun users 4120576 2003-06-06 10:36 db.iss
> > -rw-r--r-- 1 wwwrun users 0 2003-06-06 09:50 db.writer
> > -rw-r--r-- 1 wwwrun users 7645836 2003-06-06 11:47 linux-usb-devel
> > -rw-r--r-- 1 wwwrun users 45211535 2003-06-06 11:47 lkml
> > -rw-r--r-- 1 wwwrun users 551199 2003-06-06 11:38 lurker-users
> > -rw-r--r-- 1 wwwrun users 76637082 2003-06-06 11:47 plug
> > -rw-r--r-- 1 wwwrun users 73504916 2003-06-06 11:47 xfree86
> > -rw-r--r-- 1 wwwrun users 11724762 2003-06-06 09:54 xfree86-devel
> >
> > A backup of the directory was: (I did a cp -r, so no user info is
> > saved.)
> >
> > total 289966
> > -rw-r--r-- 1 root root 29 2003-06-06 11:19 db
> > -rw-r--r-- 1 root root 180224 2003-06-06 11:19 db.ieg
> > -rw-r--r-- 1 root root 1114112 2003-06-06 11:19 db.ioq
> > -rw-r--r-- 1 root root 76259328 2003-06-06 11:19 db.ipk
> > -rw-r--r-- 1 root root 4120576 2003-06-06 11:19 db.iss
> > -rw-r--r-- 1 root root 0 2003-06-06 11:19 db.writer
> > -rw-r--r-- 1 root root 7562894 2003-06-06 11:19 linux-usb-devel
> > -rw-r--r-- 1 root root 45135014 2003-06-06 11:19 lkml
> > -rw-r--r-- 1 root root 542427 2003-06-06 11:19 lurker-users
> > -rw-r--r-- 1 root root 76534237 2003-06-06 11:19 plug
> > -rw-r--r-- 1 root root 73444116 2003-06-06 11:20 xfree86
> > -rw-r--r-- 1 root root 11724762 2003-06-06 11:19 xfree86-devel
> >
> > Where did db.ieg go? :)
>
> Ok.. So, ipk is the same size, so it's fine. iss is also the same size.
> ioq is the same. ieg is gone and .iee is new.
>
> Since you only imported six messages there is no way you got 180k worth of
> keyword data (the only way ieg would be affected)... (they were normal sized
> messages right?) ... hence only new smaller files should have been generated.
>
Ok, that may have been a little misleading. There were more than six
messages since the backup copy, but only six from the time I did a
postfix flush to finish off the update. I think there were about 39
messages in the postfix queue. I can go back and look at the delivery
logs to tell how many deliveries succeeded before I started the
flush.
> Can you verify that 'cat db' of the new database lists iee, ieg, ioq, iss,
> ipk in that order? Is the backup database consistent? (ie: no lurker-index
> was run during backup?)
>
Hmm, bad dir:
cat db
255 8192 255
iee
iiz
ioq
ipk
iss
backup dir:
cat db_ok_withxfree86/db
255 8192 255
ieg
ioq
ipk
iss
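
In the listings above, the bad dir's db names iiz while no db.iiz shows up in the directory listing, which might fit the "No such file or directory" error. A rough Python sketch of that cross-check follows. It assumes the layout seen in this thread: the first line of db is a header, every later line is a suffix, and each suffix should exist as db.<suffix>. The function name missing_segments is made up for illustration and is not part of lurker.

```python
import os

def missing_segments(db_dir, manifest_name="db"):
    """Return suffixes named in the manifest that have no matching file."""
    manifest_path = os.path.join(db_dir, manifest_name)
    with open(manifest_path) as f:
        lines = [line.strip() for line in f if line.strip()]
    # Assumption: line 1 is a parameter header such as "255 8192 255";
    # the remaining lines are segment suffixes such as "iee".
    suffixes = lines[1:]
    missing = []
    for suffix in suffixes:
        segment = os.path.join(db_dir, manifest_name + "." + suffix)
        if not os.path.exists(segment):
            missing.append(suffix)
    return missing
```

Run against a copy of the bad dir, an empty list would mean every listed segment is present; anything returned is a segment the manifest expects but the directory lacks.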
> If you are brave, uncomment
> # Main dump : dump.cpp ;
> # LinkLibraries dump : libesort ;
> in libesort/Jamfile, and build it by running 'jam' in that dir.
>
> Then copy db.iee and db to a new dir, and strip all the other db.* files from
> db. Run dump on this database and confirm that all the keyword data there
> comes only from the new 6 messages (and that all six messages are in it).
> (PS: pipe dump through less or the high ascii will bite you)
> (lines beginning with 's' are summary data and include subject lines
> which would help determine what's in there)
I'll take a look, and run through the mail log to get an idea of how
many messages were added prior to the last 6 good ones.
>
> Heh, I just noticed a bug using dump; I don't decode quoted-printable before
> indexing it for keywords. Sheeet.
>
> I really wonder what could cause this since it doesn't happen during
> straight import. It must be some form of locking issue... There was no error
> on the last non "does not exist" index run? Maybe log the results of
> lurker-prune -v ?
There doesn't appear to be any other error. The postfix log has several entries like these:
postfix/local[1634]: B5E42115622: to=<xfree86@...]>,
relay=local, delay=0, status=sent ("|/usr/local/bin/lurker-index -c
/srv/www/lurker.conf -l xfree86 -m")
postfix/local[1640]: B323C1155D9: to=<lkml@...]>, relay=local,
delay=0, status=sent ("|/usr/local/bin/lurker-index -c
/srv/www/lurker.conf -l lkml -m")
postfix/local[1644]: C66D3115235: to=<plug@...]>, relay=local,
delay=0, status=sent ("|/usr/local/bin/lurker-index -c
/srv/www/lurker.conf -l plug -m")
(several connects from the postfix system I'm flushing occur here, with
the next batch of messages.)
postfix/local[1634]: E2E91112CBA: to=<plug@...]>, relay=local,
delay=0, status=deferred (SOFT BOUNCE - Command died with status 1:
"/usr/local/bin/lurker-index -c /srv/www/lurker.conf -l plug -m".
Command output: opening database: No such file or directory )
That's a very abbreviated snippet. Three successful lurker-indexes,
postfix queues 4 or so more messages, then starts to attempt local
delivery. That one to 'plug' fails, followed by ones to:
linux-usb-devel
lurker-users
lkml
lkml
Then we queue some more from the remote system. IIRC, the postfix
connection limit is set at 4, so this fits. Queue 4, deliver 4.
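
Tallying the mail log by list and status could be sketched like this in Python. It assumes each postfix entry sits on one line in the real log (the entries above are wrapped by the mailer) and that the list name is whatever follows -l in the lurker-index command. The names LINE_RE and count_deliveries are illustrative only.

```python
import re
from collections import Counter

# Captures the delivery status and the list name given to lurker-index via -l.
LINE_RE = re.compile(r"status=(\w+).*lurker-index\b.*?-l\s+([\w.-]+)")

def count_deliveries(log_lines):
    """Count lurker-index deliveries keyed by (list name, status)."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            status, list_name = match.groups()
            counts[(list_name, status)] += 1
    return counts
```

Feeding it the open log file line by line gives totals such as ("plug", "sent") and ("plug", "deferred"), which is enough to see how many deliveries went through before the failures started.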
>
> Man I really hope you can make this reproducible as it is a bug that has
> been troubling me. Is this the same symptom as you experienced with the
> older version of cvs?
>
Possibly. I think it was aggravated by using a 2.5.69 kernel which
appears to also have some kind of a lockup bug related to heavy load. I
think that's what took the system down. That kernel problem prevented
me from being able to see the database problem right away.
> /me rubs hands together in anticipation of bug squashing.
Okay, I've got a couple ideas :) I'm wondering if I can reproduce it by
running a half dozen instances of lurker-index with single message input
(ignoring postfix) all concurrently. I need a way that doesn't depend
on queued mail, as there's only so much of that. :)
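
One way to drive that test without postfix, sketched in Python: start several copies of the command at once and hand each a single message on stdin, as the pipe delivery does. The run_concurrently helper is hypothetical, and splitting a mailbox into individual messages is left out; messages is assumed to be a list of bytes objects, one per message.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_concurrently(cmd, messages, workers=6):
    """Pipe each message into its own `cmd` process, `workers` at a time.

    Returns (exit status, combined stdout and stderr) per message, in order.
    """
    def deliver(message):
        proc = subprocess.run(cmd, input=message,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT)
        return proc.returncode, proc.stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(deliver, messages))

# Command line taken from the postfix log in this thread:
# cmd = ["/usr/local/bin/lurker-index", "-c", "/srv/www/lurker.conf",
#        "-l", "plug", "-m"]
```

Any result with a nonzero exit status and "opening database: No such file or directory" in its output would reproduce the failure; run it against a scratch copy of the database, not the live one.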
--
Kevin