|
From: Gilles D. <gr...@sc...> - 2003-02-19 15:19:48
|
According to Lachlan Andrew:
> On Friday 14 February 2003 11:16, Neal Richter wrote:
> > Is there something you can tell us about the type of data you are
> > indexing?  Are they big pages with lots of repetitive information..
> > giving htdig many similar keys which hash/sort to the same pages?
>
> Greetings Neal,
>
> I've found one page in the qt documentation which may be causing
> those problems (attached).  I hadn't realised it, but the
> valid_punctuation attribute seems to be treated as an *optional*
> word break.  (The docs say it is *not* a word break, and that seems
> the intention of WordType::WordToken...)

I guess the docs haven't kept up with what the code does.  It used to be
that valid_punctuation didn't cause word breaks at all, i.e. these
punctuation characters were valid inside a word, and got stripped out but
didn't break up the word.  However, for some time now, this functionality
has been extended to also index each word part, so that something like
"post-doctoral" gets indexed as postdoctoral, post and doctoral.  This
greatly enhances searches for compound words, or parts thereof, but it
tends to break down when you're indexing something that's not really
words...

> The page has long strings with many valid_punctuation symbols, and
> gives output like
>
> elliptical 1060 0 1113 34
> elp 1363 0 131 0
> elphick 1516 0 750 0
> elsbs 1372 0 968 4
> elsbsw 1372 0 968 4
> elsbswp 1372 0 968 4
> elsbswpe 1372 0 968 4
> elsbswpew 1372 0 968 4
> elsbswpewg 1372 0 968 4
> elsbswpewgr 1372 0 968 4
> elsbswpewgrr 1372 0 968 4
> elsbswpewgrr1 1372 0 968 4
> elsbswpewgrr1t 1372 0 968 4
> elsbswpewgrr1twa7 1372 0 968 4
> elsbswpewgrr1twa7z 1372 0 968 4
> elsbswpewgrr1twa7z1bea0 1372 0 968 4
> elsbswpewgrr1twa7z1bea0f 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkd 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbke 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbkezb 1372 0 968 4
> else 225 0 1285 0
>
> Might that be the trouble?

Well, I would think that if you're going to feed a bunch of C code into
htdig, especially C code containing many pixmaps, then you should
probably do so with a severely stripped-down setting of
valid_punctuation.  This would speed up the process a lot and get rid of
a lot of the spurious junk that's getting indexed.  However, if the
underlying word database is solid, then it shouldn't fall apart no matter
how much junk you throw at it.  So, this might be the trigger that brings
the trouble to the surface, but the root cause of the trouble seems to be
a bug somewhere in the code.

> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)

Bummer.  Have you tried running with no compression at all, and if so,
does that work reliably?

--
Gilles R. Detillieux                E-mail: <gr...@sc...>
Spinal Cord Research Centre         WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba    Winnipeg, MB  R3E 3J7  (Canada)
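[For anyone trying to follow the splitting behaviour Gilles describes, a
rough sketch is below.  It is NOT htdig's actual WordType::WordToken
code; the function name and interface are invented for illustration.
Characters listed in valid_punctuation are stripped to form the joined
compound, and each punctuation-separated part is indexed as well, so
"post-doctoral" yields "post", "doctoral" and "postdoctoral".  The prefix
entries in the word list above suggest the real code also emits
intermediate joins, which this sketch does not attempt to reproduce.]

#include <string>
#include <vector>

// Illustrative sketch only -- not htdig source.
std::vector<std::string> split_compound(const std::string &token,
                                        const std::string &valid_punct)
{
    std::vector<std::string> words;
    std::string joined, part;
    for (std::string::size_type i = 0; i < token.size(); ++i)
    {
        if (valid_punct.find(token[i]) != std::string::npos)
        {
            if (!part.empty())
                words.push_back(part);   // each part is indexed on its own
            part.erase();
        }
        else
        {
            joined += token[i];          // punctuation is stripped out
            part += token[i];
        }
    }
    if (!part.empty())
        words.push_back(part);
    if (words.size() > 1)
        words.push_back(joined);         // plus the joined compound word
    return words;
}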
|
From: Lachlan A. <lh...@us...> - 2003-02-22 08:50:37
|
On Thursday 20 February 2003 02:19, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I hadn't realised it, but the valid_punctuation attribute seems to
> > be treated as an *optional* word break.  (The docs say it is *not*
> > a word break)
>
> I guess the docs haven't kept up with what the code does.
> This functionality was extended to also index each word part,
> so that something like "post-doctoral" gets indexed as
> postdoctoral, post and doctoral.  This greatly enhances searches for
> compound words, or parts thereof, but it tends to break down when
> you're indexing something that's not really words...

Thanks for that clarification, Gilles.

Would it be better to convert a query for post-doctoral into the phrase
"post doctoral", and to index simply the words post and doctoral in the
database?  As it stands, a search for "the non-smoker" will match
"the smoker", since all the words are given the same position in the
database.  It would also reduce the size of the database (marginally in
most cases, but significantly for pathological documents).  Now that
there is phrase searching, is there any benefit to the current approach?
If not, we could do away with valid_punctuation entirely (after 3.2.0b5).

> if you're going to feed a bunch of C code into htdig, you
> should probably do so with a severely stripped down setting of
> valid_punctuation....  However, if the underlying word database is
> solid, then it shouldn't fall apart no matter how much junk you throw
> at it.  the root cause of the trouble seems to be a bug somewhere in
> the code.

My thoughts exactly.  I'm only using this page for debugging...

Cheers,
Lachlan
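[A rough sketch of the query-side half of that proposal -- hypothetical,
not existing htsearch code: a hyphenated term in the query would be
rewritten as a quoted phrase, so it only matches the parts in adjacent
positions.]

#include <string>

// Hypothetical helper: rewrite "post-doctoral" as the phrase
// "\"post doctoral\"" before the query is parsed.  Terms without a
// hyphen are passed through unchanged.
std::string hyphen_term_to_phrase(const std::string &term)
{
    if (term.find('-') == std::string::npos)
        return term;
    std::string phrase(term);
    for (std::string::size_type i = 0; i < phrase.size(); ++i)
        if (phrase[i] == '-')
            phrase[i] = ' ';
    return "\"" + phrase + "\"";
}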
|
From: Neal R. <ne...@ri...> - 2003-02-19 17:32:21
|
Thanks.  I'll give this page a test.

What page sizes are you seeing the errors on?  Ie what is your
wordlist_page_size set to?

Thanks again.

On Wed, 19 Feb 2003, Lachlan Andrew wrote:
> On Friday 14 February 2003 11:16, Neal Richter wrote:
> > Is there something you can tell us about the type of data you are
> > indexing?  Are they big pages with lots of repetitive information..
> > giving htdig many similar keys which hash/sort to the same pages?
>
> Greetings Neal,
>
> I've found one page in the qt documentation which may be causing
> those problems (attached).  I hadn't realised it, but the
> valid_punctuation attribute seems to be treated as an *optional*
> word break.  (The docs say it is *not* a word break, and that seems
> the intention of WordType::WordToken...)  The page has long strings
> with many valid_punctuation symbols, and gives output like
>
> elliptical 1060 0 1113 34
> elp 1363 0 131 0
> elphick 1516 0 750 0
> elsbs 1372 0 968 4
> elsbsw 1372 0 968 4
> elsbswp 1372 0 968 4
> elsbswpe 1372 0 968 4
> elsbswpew 1372 0 968 4
> elsbswpewg 1372 0 968 4
> elsbswpewgr 1372 0 968 4
> elsbswpewgrr 1372 0 968 4
> elsbswpewgrr1 1372 0 968 4
> elsbswpewgrr1t 1372 0 968 4
> elsbswpewgrr1twa7 1372 0 968 4
> elsbswpewgrr1twa7z 1372 0 968 4
> elsbswpewgrr1twa7z1bea0 1372 0 968 4
> elsbswpewgrr1twa7z1bea0f 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkd 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbke 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbkezb 1372 0 968 4
> else 225 0 1285 0
>
> Might that be the trouble?
>
> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)
>
> Cheers,
> Lachlan

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Lachlan A. <lh...@us...> - 2003-02-23 05:50:52
|
On Wednesday 19 February 2003 23:44, Lachlan Andrew wrote:
> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)

Whoops!  I didn't make clean after installing the new libraries.
Now that I have, I haven't been able to reproduce the problem.  I'll
keep trying, but sorry for leading people on a wild goose chase...

Cheers,
Lachlan
|
From: Lachlan A. <lh...@us...> - 2003-02-23 13:23:48
|
OK, now try this on for size...

If I run the attached rundig script, with -v and the attached .conf
file on the attached directory (51 copies of the attached file hash)
with an empty .../var/htdig-crash1 directory, then all is well.
However, if I run it a *second* time, it gives the attached log file.

This is odd, since the script uses -i which is supposed to ignore the
contents of the directory.  (On another note, should -i also ignore the
db.log file?  It currently doesn't.)

Neal, can you (or anyone else) replicate this behaviour?

Thanks!
Lachlan

On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
> Whoops!  I didn't make clean after installing the new libraries.
> Now that I have, I haven't been able to reproduce the problem.
|
From: Jim C. <li...@yg...> - 2003-02-24 07:32:28
|
On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
> If I run the attached rundig script, with -v and the attached
> .conf script on the attached directory (51 copies of the attached
> file hash) with an empty .../var/htdig-crash1 directory, then all
> is well.  However, if I run it a *second* time, it gives the attached
> log file.
>
> This is odd since the script uses -i which is supposed to ignore the
> contents of the directory.  (On another note, should -i also ignore
> the db.log file?  It currently doesn't.)
>
> Neal, can you (or anyone else) replicate this behaviour?

Hi - I was able to duplicate the problem on a machine running Red Hat
8.0.  The results of the second run were almost identical to what your
log file shows.  I was neither redirecting stderr nor paying attention
when the errors started, so I didn't catch the page number of the first
failure.  I am in the process of repeating the experiment and will
catch the page number this time around.

I tried the same thing on an OS X box, and htdig core dumps (segfault)
early on during the first pass with rundig.  The core file indicates
failure in a vm_allocate call; however, the backtrace shows the call to
be 3000+ CDB___* calls deep.  The entry point from the backtrace is
shown below.  I can provide the full backtrace if anyone wants to see it.

Jim

#3249 0x00054bc4 in CDB___bam_c_put (dbc_orig=0x19e9660, key=0xbfffdd30,
    data=0xbfffdd50, flags=15) at bt_cursor.c:925
#3250 0x0003c7b0 in CDB___db_put (dbp=0x19e9904, txn=0x19e9660,
    key=0xa99640, data=0xbfffdd30, flags=27170944) at db_am.c:508
#3251 0x00025298 in WordList::Put(WordReference const&, int)
    (this=0xbffff1f0, arg=@0xbfffdd30, flags=1) at WordDB.h:126
#3252 0x0001eeac in HtWordList::Flush() (this=0xbffff1f0)
    at ../htword/WordList.h:118
#3253 0x00020c9c in DocumentRef::AddDescription(char const*, HtWordList&)
    (this=0x548ab90, d=0x11bd98 "", words=@0xbffff1f0) at DocumentRef.cc:512
#3254 0x0000bac8 in Retriever::got_href(URL&, char const*, int)
    (this=0xbffff140, url=@0x1ae8d30, description=0x5446880 "QPainter",
    hops=1) at Retriever.cc:1496
#3255 0x00005c84 in HTML::do_tag(Retriever&, String&) (this=0xbba9b0,
    retriever=@0xbffff140, tag=@0xbbaa24) at ../htlib/htString.h:45
#3256 0x00005120 in HTML::parse(Retriever&, URL&) (this=0xbba9b0,
    retriever=@0xbffff140, baseURL=@0x10000) at HTML.cc:414
#3257 0x00009a24 in Retriever::RetrievedDocument(Document&, String const&,
    DocumentRef*) (this=0xbffff140, doc=@0xa99950, url=@0x10000,
    ref=0xbb9ad0) at Retriever.cc:818
#3258 0x000094d0 in Retriever::parse_url(URLRef&) (this=0xbffff140,
    urlRef=@0x1b730c0) at ../htcommon/URL.h:51
#3259 0x00008b28 in Retriever::Start() (this=0xbffff140) at Retriever.cc:432
#3260 0x0000f988 in main (ac=5, av=0xbffff9c4) at htdig.cc:338
#3261 0x0000266c in _start (argc=5, argv=0xbffff9c4, envp=0xbffff9dc)
    at /SourceCache/Csu/Csu-45/crt.c:267
#3262 0x000024ec in start () at /usr/include/gcc/darwin/3.1/g++-v3/streambuf:129
|
From: Jim C. <li...@yg...> - 2003-02-25 04:32:29
|
Hi - I was able to repeat the problem again.  The second time around I
made a point of catching the page numbers.  They were the same as those
listed in your log file.

Jim

On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
> OK, now try this on for size...
>
> If I run the attached rundig script, with -v and the attached
> .conf script on the attached directory (51 copies of the attached
> file hash) with an empty .../var/htdig-crash1 directory, then all
> is well.  However, if I run it a *second* time, it gives the attached
> log file.
>
> This is odd since the script uses -i which is supposed to ignore the
> contents of the directory.  (On another note, should -i also ignore
> the db.log file?  It currently doesn't.)
>
> Neal, can you (or anyone else) replicate this behaviour?
>
> Thanks!
> Lachlan
>
> On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
>> Whoops!  I didn't make clean after installing the new libraries.
>> Now that I have, I haven't been able to reproduce the
>> problem.<rundig><valid_punct.conf><directory><hash><log.first-200-lines>
|
From: Lachlan A. <lh...@us...> - 2003-02-25 12:55:25
|
Thanks for that, Jim!  I am glad it isn't just my system.

The OS X crash is probably a good thing to look at, if it occurs early
in the dig.  (a) Could you please post (or mail me) the complete
backtrace?  (b) Does it still occur without compression?  (c) Are you
using zlib-1.1.4?  (I have had core dumps with earlier versions, but
not since upgrading.)

Thanks again,
Lachlan

On Tuesday 25 February 2003 15:32, Jim Cole wrote:
> Hi - I was able to repeat the problem again.  The second time around
> I made a point of catching the page numbers.  They were the same as
> those listed in your log file.
|
From: Jim C. <li...@yg...> - 2003-02-27 00:16:53
Attachments:
osx_bt.gz
|
On Tuesday, February 25, 2003, at 05:55 AM, Lachlan Andrew wrote:
> The OS X crash is probably a good thing to look at, if it occurs early
> in the dig.  (a) Could you please post (or mail me) the complete
> backtrace?  (b) Does it still occur without compression?  (c) Are you
> using zlib-1.1.4?  (I have had core dumps with earlier versions, but
> not since upgrading.)

The backtrace is attached.

The problem does not occur if I turn off compression; by turning off
compression, I mean that I changed your provided config file so that
wordlist_compress and wordlist_compress_zlib are false and
compression_level is commented out.

OS X is still using zlib 1.1.3; according to Apple, the vulnerability
that drove the move to 1.1.4 did not affect their system.  When I get a
chance, I will try rebuilding everything against the 1.1.4 zlib
available via Fink and see if that makes any difference.

Jim
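[For reference, turning compression off as Jim describes amounts to
something like the following in the test .conf file -- the attribute
names are the ones he mentions above; the exact layout of the file is of
course site-specific.]

wordlist_compress:       false
wordlist_compress_zlib:  false
# compression_level:     (left commented out, as described above)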
|
From: Lachlan A. <lh...@us...> - 2003-02-27 11:41:40
|
On Thursday 27 February 2003 11:16, Jim Cole wrote:
> The backtrace is attached. The problem does not occur if I turn off
> compression.
Thanks.  My guess is that (part of) the reason for the very deep
recursion is that it's trying to allocate a block of len=8247 bytes,
when the page size is only 8192:

#3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30, memreg=0xa99f60,
    mfp=0xc84e98, len=8247, offsetp=0x0, retp=0xbfffd900) at
    mp_alloc.c:88

I used to get the error

    Unable to allocate %lu bytes from mpool shared region

at some stage too, which is generated inside CDB___memp_alloc.  From
memory, that was when I was using 1.1.3.

If that is really the problem, it can be fixed by testing explicitly
whether len > pagesize (if the pagesize is available somewhere...).
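[A minimal sketch of the check being suggested -- illustration only.
The real fix would live inside CDB___memp_alloc in mp_alloc.c, and how
the page size is obtained there is an assumption; the helper name is
invented.]

#include <errno.h>
#include <stddef.h>

/* A buffer request larger than one pool page can never be satisfied by
 * flushing dirty pages, so fail it up front instead of letting the
 * allocator recurse looking for room. */
static int alloc_fits_page(size_t len, size_t pagesize)
{
    if (pagesize != 0 && len > pagesize)
        return ENOMEM;      /* e.g. len=8247 against an 8192-byte page */
    return 0;
}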
|
|
From: Jim C. <li...@yg...> - 2003-02-28 00:23:58
|
Hi - Just a follow-up on the issue of zlib version.  I installed the
1.1.4 version of zlib available via Fink and rebuilt everything.  Using
the newer version of zlib, I encounter the same problem (i.e. a segfault
from htdig with a very deep stack trace).  I did perform a distclean and
verified the use of the 1.1.4 version of libz via otool.

Jim

On Thursday, February 27, 2003, at 04:41 AM, Lachlan Andrew wrote:
> On Thursday 27 February 2003 11:16, Jim Cole wrote:
>
>> The backtrace is attached.  The problem does not occur if I turn off
>> compression.
>
> Thanks.  My guess is that (part of) the reason for the very deep
> recursion is that it's trying to allocate a block of len=8247
> bytes, when the page size is only 8192:
> #3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30, memreg=0xa99f60,
>     mfp=0xc84e98, len=8247, offsetp=0x0, retp=0xbfffd900) at
>     mp_alloc.c:88
>
> I used to get the error
>     Unable to allocate %lu bytes from mpool shared region
> at some stage too, which is generated inside CDB___memp_alloc.  From
> memory, that was when I was using 1.1.3.
>
> If that is really the problem, it can be fixed by testing explicitly
> whether len>pagesize (if the pagesize is available somewhere...).
|
From: Lachlan A. <lh...@us...> - 2003-03-09 09:23:03
Attachments:
mp_alloc.patch
|
Greetings Jim,

Attached is a hack which explicitly stops the recursion in OS X.  Does
it work?  (Neal, would it be better in one of the other functions in
the loop?)

I don't know why a different OS should crash in a different place.
Does OS X support pread?  Try 'man pread'.

Are you having any luck with the other errors in 'make check'?

Thanks!
Lachlan

On Friday 28 February 2003 11:23, Jim Cole wrote:
> Hi - Just a follow up on the issue of zlib version.  I installed
> the 1.1.4 version of zlib available via Fink and rebuilt
> everything.  Using the newer version of zlib, I encounter the same
> problem (i.e. a segfault from htdig with a very deep stack trace).
> I did perform a distclean and verified the use of the 1.1.4 version
> libz via otool.
>
> Jim
>
> On Thursday, February 27, 2003, at 04:41 AM, Lachlan Andrew wrote:
> > On Thursday 27 February 2003 11:16, Jim Cole wrote:
> >> The backtrace is attached.  The problem does not occur if I turn
> >> off compression.
> >
> > Thanks.  My guess is that (part of) the reason for the very deep
> > recursion is that it's trying to allocate a block of len=8247
> > bytes, when the page size is only 8192:
> > #3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30,
> > memreg=0xa99f60, mfp=0xc84e98, len=8247, offsetp=0x0,
> > retp=0xbfffd900) at mp_alloc.c:88
> >
> > I used to get the error
> > Unable to allocate %lu bytes from mpool shared region
> > at some stage too, which is generated inside CDB___memp_alloc.
> > From memory, that was when I was using 1.1.3.
> >
> > If that is really the problem, it can be fixed by testing
> > explicitly whether len>pagesize (if the pagesize is available
> > somewhere...).
|
From: Neal R. <ne...@ri...> - 2003-03-10 06:56:06
|
I tried using "make check" on a RedHat 8.0 machine.. no joy.  There
seems to be a fundamental difference between what
htdig-3.2.0b4-20030302/test/conf/httpd.conf wants and what RedHat 8.0
provides.  I'll try it on a different machine at work tomorrow.

Interesting patch.. I need to read more code around it.

I did however look over snapshots of BDB
(http://www.sleepycat.com/download/patchlogs.shtml).  There are a large
number of changes to mp_alloc.c from 3.0.55 to 3.1.14, but they all seem
to be superficial changes (function names etc).  The diff from 3.0.55 to
3.3.11 is more interesting.. but nothing along the lines of your patch
that I saw...

My feeling at this point is that the bug is caused by a problem in the
mp_cmpr.c code, and since you've found a nice small test case it will
hopefully be easier to fix.. I'll report back tomorrow.

Thanks

On Sun, 9 Mar 2003, Lachlan Andrew wrote:
> Greetings Jim,
>
> Attached is a hack which explicitly stops the recursion in OS X.  Does
> it work?  (Neal, would it be better in one of the other functions in
> the loop?)
>
> I don't know why a different OS should crash in a different place.
> Does OS X support pread?  Type man pread.
>
> Are you having any luck with the other errors in 'make check'?
>
> Thanks!
> Lachlan
>
> On Friday 28 February 2003 11:23, Jim Cole wrote:
> > Hi - Just a follow up on the issue of zlib version.  I installed
> > the 1.1.4 version of zlib available via Fink and rebuilt
> > everything.  Using the newer version of zlib, I encounter the same
> > problem (i.e. a segfault from htdig with a very deep stack trace).
> > I did perform a distclean and verified the use of the 1.1.4 version
> > libz via otool.
> >
> > Jim
> >
> > On Thursday, February 27, 2003, at 04:41 AM, Lachlan Andrew wrote:
> > > On Thursday 27 February 2003 11:16, Jim Cole wrote:
> > >> The backtrace is attached.  The problem does not occur if I turn
> > >> off compression.
> > >
> > > Thanks.  My guess is that (part of) the reason for the very deep
> > > recursion is that it's trying to allocate a block of len=8247
> > > bytes, when the page size is only 8192:
> > > #3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30,
> > > memreg=0xa99f60, mfp=0xc84e98, len=8247, offsetp=0x0,
> > > retp=0xbfffd900) at mp_alloc.c:88
> > >
> > > I used to get the error
> > > Unable to allocate %lu bytes from mpool shared region
> > > at some stage too, which is generated inside CDB___memp_alloc.
> > > From memory, that was when I was using 1.1.3.
> > >
> > > If that is really the problem, it can be fixed by testing
> > > explicitly whether len>pagesize (if the pagesize is available
> > > somewhere...).

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Lachlan A. <lh...@us...> - 2003-03-11 13:26:01
|
Thanks for your email.

I think Jim's infinite recursion problem was fairly separate from the
database corruption I was having.  The recursion is fairly simple:

(1) The allocation routine selects a dirty cache page to flush.
(2) When the page is written, it is "compressed" on the fly, but it is
    actually slightly expanded and needs to be stored in two pages.
(3) When the compression routine tries to allocate a new page, it
    recursively calls the same allocation routine, and the same dirty
    cache page is selected...

Ideally it would be nice to fix mp_cmpr.c so that the "weakcmpr" page
was allocated by a completely different mechanism (on the stack?), but
that has the potential to introduce lots more bugs.

The new releases of BDB don't seem to have mp_cmpr.c at all.  Is that an
add-on from another project?  Mifluz?

Cheers,
Lachlan

On Monday 10 March 2003 17:57, Neal Richter wrote:
> Interesting patch.. I need to read more code around it.
>
> My feeling at this point is that the bug is caused by a problem in
> the mp_cmpr.c code
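[To make that cycle concrete, here is one way it could be broken -- a
reentrancy flag, so the nested allocation made from the compression path
fails cleanly instead of re-selecting the same dirty page.  Purely
illustrative; the names and structure are assumptions, not the real
mp_alloc.c/mp_cmpr.c code.]

#include <errno.h>
#include <stddef.h>

/* Illustrative only: mark that an allocation/flush is already in
 * progress, so the nested request for the overflow ("_weakcmpr") page in
 * step (3) returns an error rather than recursing forever. */
struct pool_alloc_state {
    int alloc_in_progress;
};

static int pool_alloc_page(struct pool_alloc_state *st, size_t len,
                           void **retp)
{
    (void)len;                      /* size handling omitted in this sketch */

    if (st->alloc_in_progress)
        return EBUSY;               /* nested call from the compressor */

    st->alloc_in_progress = 1;
    /* ... pick a dirty cache page and write it out; compressing it may
     * need a second page and re-enter this function (step 3 above) ... */
    st->alloc_in_progress = 0;

    *retp = NULL;                   /* placeholder for the new page */
    return 0;
}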
|
From: Geoff H. <ghu...@ws...> - 2003-03-11 14:34:35
|
On Tuesday, March 11, 2003, at 07:25 AM, Lachlan Andrew wrote:
> The new releases of BDB don't seem to have mp_cmpr.c at all.  Is
> that an add-on from another project?  Mifluz?

While Loic offered lots of the improvements and bug-fixes to the
Sleepycat folks, I don't think they took any.  In particular, they felt
the database compression feature didn't make sense because disk space is
cheap.  In any case, the core Berkeley DB code gets hammered pretty hard
since it's used in lots of places.

-Geoff
|
From: Lachlan A. <lh...@us...> - 2003-03-12 11:36:17
|
Just a thought, but might the problem be related to:

    General Access Method Changes:
    6. Fix a bug in which DB-managed memory returned by a DB->get or
       DB->put call may be corrupted by a later cursor call. [#3576]

?  My suspicion comes since mp_cmpr_alloc calls get(), and Jim's example
shows that it can recursively burrow further than we'd like.

It is still taking me forever to understand all the code/changes...

Cheers,
Lachlan

On Monday 10 March 2003 17:57, Neal Richter wrote:
> I did however look over snapshots of BDB
> (http://www.sleepycat.com/download/patchlogs.shtml)
|
From: Neal R. <ne...@ri...> - 2003-02-25 23:28:45
|
Jim,

Does the error happen when you run htdig -i twice (NOT using rundig)?

Thanks.

On Mon, 24 Feb 2003, Jim Cole wrote:
> Hi - I was able to repeat the problem again.  The second time around I
> made a point of catching the page numbers.  They were the same as those
> listed in your log file.
>
> Jim
>
> On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
>
> > OK, now try this on for size...
> >
> > If I run the attached rundig script, with -v and the attached
> > .conf script on the attached directory (51 copies of the attached
> > file hash) with an empty .../var/htdig-crash1 directory, then all
> > is well.  However, if I run it a *second* time, it gives the attached
> > log file.
> >
> > This is odd since the script uses -i which is supposed to ignore the
> > contents of the directory.  (On another note, should -i also ignore
> > the db.log file?  It currently doesn't.)
> >
> > Neal, can you (or anyone else) replicate this behaviour?
> >
> > Thanks!
> > Lachlan
> >
> > On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
> >> Whoops!  I didn't make clean after installing the new libraries.
> >> Now that I have, I haven't been able to reproduce the
> >> problem.<rundig><valid_punct.conf><directory><hash><log.first-200-
> >> lines>

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Lachlan A. <lh...@us...> - 2003-02-26 12:45:00
|
Greetings all,
Just for the record:
1) The -i option doesn't remove the _weakcmpr file.
Neal, what effect will that have?
2) I've just run htdig on an existing database *without* -i and
it also complained about weakcmpr problems.
(I've forgotten whether I ran htpurge after the first run, so
I'm running it again without it.)
3) There is still a (different) problem with pagesize 32k. The
htdig ran OK, but the second htpurge complained near the end.
Cheers,
Lachlan
On Wednesday 26 February 2003 10:30, Neal Richter wrote:
> Does the error happen when you run htdig -i twice NOT using rundig?
|
|
From: Neal R. <ne...@ri...> - 2003-02-26 17:35:04
|
On Wed, 26 Feb 2003, Lachlan Andrew wrote:
> Greetings all,
>
> Just for the record:
> 1) The -i option doesn't remove the _weakcmpr file.
> Neal, what effect will that have?
> 2) I've just run htdig on an existing database *without* -i and
> it also complained about weakcmpr problems.
> (I've forgotten whether I ran htpurge after the first run, so
> I'm running it again without it.)
> 3) There is still a (different) problem with pagesize 32k. The
> htdig ran OK, but the second htpurge complained near the end.
#1 is easy to fix.
Note that there is no word_db_weakcmp config variable....
Changes near htdig.cc:279
    const String word_filename = config->Find("word_db");

    // Build the name of the companion file ("<word_db>_weakcmpr").
    // Note it must not be const, since we append the suffix to it.
    String word_weakcmp_filename = word_filename;
    word_weakcmp_filename.append("_weakcmpr");

    if (initial)
    {
        // -i (initial): remove both the word database and its
        // _weakcmpr companion before digging.
        unlink(word_filename);
        unlink(word_weakcmp_filename);
    }
#3
What is htpurge being run for????  Isn't it used to remove entries from
the index? I know that htpurge is called immediately after htdig in
rundig... my question is WHY???!!!
How are you guys using it?
What happens when you try and use it to remove URLs from the index,
and try to add more URLs after purging??
An interesting test would be to establish two test datasets that are
exact duplicates of each other at different URLs on your server.
%htdig -i URL1
%htdig -i URL2
This would access, expand and rewrite nearly every page in the WordDB.
If there are problems rewriting/expanding pages, they may show up.
Thanks!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Lachlan A. <lh...@us...> - 2003-02-26 22:21:19
|
On Thursday 27 February 2003 04:36, Neal Richter wrote:
> On Wed, 26 Feb 2003, Lachlan Andrew wrote:
> > 1) The -i option doesn't remove the _weakcmpr file.
> > 2) I've just run htdig on an existing database *without* -i
> >    and it also complained about weakcmpr problems.
> >    (I've forgotten whether I ran htpurge after the first run,
> >    so I'm running it again without it.)
>
> #1 is easy to fix.

Yes.  While we're at it, we should remove db.log (="url_log").  I was
just thinking it might give you/us some insight into the cause of the
problem.

For #2, I have run htdig again without -i and without having purged the
database, but after 'touch'ing each html file.  It complains:

WordDB: CDB___memp_cmpr_read: unable to uncompress page at pgno = 40435
WordDB: PANIC: Input/output error

Whenever this appears, it appears twice.

> #3
> What is htpurge being run for????  Isn't it used to remove
> entries from the index?  I know that htpurge is called immediately
> after htdig in rundig... my question is WHY???!!!

Entries are created for all of the pages referred to during the dig,
even if they don't exist.  Purging gets rid of these useless entries.

> How are you guys using it?

./bin/htpurge -v -c <file>.conf

> An interesting test would be to establish two test datasets that
> are exact duplicates of each other at different URLs on your
> server.
>
> %htdig -i URL1
> %htdig -i URL2
>
> This would access, expand and rewrite nearly every page in the
> WordDB.  If there are problems rewriting/expanding pages, they may
> show up.

If -i works, the database should be erased before being accessed in the
second dig, shouldn't it?

Regards,
Lachlan
|
From: Jim C. <li...@yg...> - 2003-02-27 00:09:32
|
Hi - The problem does not occur when running only htdig with -i.

Jim

On Tuesday, February 25, 2003, at 04:30 PM, Neal Richter wrote:
> Jim,
>   Does the error happen when you run htdig -i twice (NOT using
> rundig)?
>
> Thanks.
>
> On Mon, 24 Feb 2003, Jim Cole wrote:
>
>> Hi - I was able to repeat the problem again.  The second time around I
>> made a point of catching the page numbers.  They were the same as those
>> listed in your log file.
>>
>> Jim
>>
>> On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
>>
>>> OK, now try this on for size...
>>>
>>> If I run the attached rundig script, with -v and the attached
>>> .conf script on the attached directory (51 copies of the attached
>>> file hash) with an empty .../var/htdig-crash1 directory, then all
>>> is well.  However, if I run it a *second* time, it gives the attached
>>> log file.
>>>
>>> This is odd since the script uses -i which is supposed to ignore the
>>> contents of the directory.  (On another note, should -i also ignore
>>> the db.log file?  It currently doesn't.)
>>>
>>> Neal, can you (or anyone else) replicate this behaviour?
>>>
>>> Thanks!
>>> Lachlan
>>>
>>> On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
>>>> Whoops!  I didn't make clean after installing the new libraries.
>>>> Now that I have, I haven't been able to reproduce the
>>>> problem.<rundig><valid_punct.conf><directory><hash><log.first-200-
>>>> lines>
>
> Neal Richter
> Knowledgebase Developer
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485
|
From: Neal R. <ne...@ri...> - 2003-02-14 23:19:55
|
Lachlan,

Question:  What is your wordlist_page_size set to?

The htdig default is zero, and the BDB default of 8K (in most
situations) is then used.  Although the BDB maximum page size is 64K, we
can't use that yet as a result of a multiplication bug in mp_cmpr I
haven't tracked down yet.

I use this as my default:

wordlist_page_size: 32768

Larger pages are usually more efficient, especially since here we pay
the overhead of deflating each page individually before returning the
data.

If your bug is caused by page overflow as I suspect, then this change
will at least push the bug 'away', so that you may have to index several
orders of magnitude more than 50,000 pages to see the bug.  We've got
all kinds of problems if we want to try and index 5 Million+ pages.

I could be wrong, but I'd be interested to see if it makes the problem
go away.

Thanks!

On Fri, 14 Feb 2003, Lachlan Andrew wrote:
> An error occurs during an htdump straight after htdig.  However, I
> haven't yet got it to occur *within* htdig.
>
> Interestingly, the error first reported by htdump is similar to the
> one I last reported,
>
> WordDB: CDB___memp_cmpr_read: unable to uncompress page at pgno = 23
> WordDB: PANIC: Input/output error
> WordDBCursor::Get(17) failed DB_RUNRECOVERY: Fatal error, run
> database recovery
>
> but the one by htpurge (and subsequent htdumps) is
>
> WordDB: CDB___memp_cmpr_read: unexpected compression flag value 0x8
> at pgno = 26613
> WordDB: PANIC: Successful return: 0
> WordDBCursor::Get(17) failed DB_RUNRECOVERY: Fatal error, run
> database recovery
>
> I'll keep looking...
>
> On Friday 14 February 2003 05:05, Neal Richter wrote:
> > Please attempt to reproduce the error using ONLY htdig next.
> >
> > If the error is still present, then the error is in htdig.  If the
> > error is not present then the bug is happening during htpurge.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
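[For anyone following along, the page-size tuning Neal recommends would
go in the htdig .conf roughly like this -- a sketch only; 32768 is
simply the value suggested above, and 0 lets Berkeley DB pick its own
(normally 8K) page size.]

# Word-database page size for the compressed word index.
# wordlist_page_size: 0        (default: let Berkeley DB choose, ~8K)
wordlist_page_size: 32768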