|
From: Lachlan A. <lh...@us...> - 2003-02-19 12:49:20
|
On Friday 14 February 2003 11:16, Neal Richter wrote:
> Is there something you can tell us about the type of data you are
> indexing? Are they big pages with lots of repetitive information..
> giving htdig many similar keys which hash/sort to the same pages?

Greetings Neal,

I've found one page in the qt documentation which may be causing
those problems (attached). I hadn't realised it, but the
valid_punctuation attribute seems to be treated as an *optional*
word break. (The docs say it is *not* a word break, and that seems
the intention of WordType::WordToken...) The page has long strings
with many valid_punctuation symbols, and gives output like

elliptical	1060	0	1113	34
elp	1363	0	131	0
elphick	1516	0	750	0
elsbs	1372	0	968	4
elsbsw	1372	0	968	4
elsbswp	1372	0	968	4
elsbswpe	1372	0	968	4
elsbswpew	1372	0	968	4
elsbswpewg	1372	0	968	4
elsbswpewgr	1372	0	968	4
elsbswpewgrr	1372	0	968	4
elsbswpewgrr1	1372	0	968	4
elsbswpewgrr1t	1372	0	968	4
elsbswpewgrr1twa7	1372	0	968	4
elsbswpewgrr1twa7z	1372	0	968	4
elsbswpewgrr1twa7z1bea0	1372	0	968	4
elsbswpewgrr1twa7z1bea0f	1372	0	968	4
elsbswpewgrr1twa7z1bea0fk	1372	0	968	4
elsbswpewgrr1twa7z1bea0fkd	1372	0	968	4
elsbswpewgrr1twa7z1bea0fkdrbk	1372	0	968	4
elsbswpewgrr1twa7z1bea0fkdrbke	1372	0	968	4
elsbswpewgrr1twa7z1bea0fkdrbkezb	1372	0	968	4
else	225	0	1285	0

Might that be the trouble?

(BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
different data set.)

Cheers,
Lachlan
|
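To illustrate the behaviour described in the message, here is a minimal sketch of what "optional word break" could mean: every time a valid_punctuation character is met, the characters seen so far (with punctuation stripped) are emitted as a word, and scanning then continues. This is a hypothetical model written to reproduce the shape of the word list in the message, not htdig's actual WordType code. The function name `prefix_words`, the punctuation set, the minimum word length of 3, and the sample token are all assumptions made for the example; the real string from the page is not shown in the message.

```python
def prefix_words(token, valid_punctuation=".-_", min_len=3):
    """Model an indexer that treats valid_punctuation as an OPTIONAL break.

    Hypothetical illustration only: the punctuation set and min_len are
    assumed values, not read from any htdig configuration.
    """
    words = []
    joined = ""  # characters seen so far, punctuation removed

    def emit():
        # Record the current prefix unless it is too short or a repeat
        # (repeats happen when two punctuation characters are adjacent).
        if len(joined) >= min_len and (not words or words[-1] != joined):
            words.append(joined)

    for ch in token:
        if ch in valid_punctuation:
            emit()  # optional break: index the prefix, then keep going
        else:
            joined += ch.lower()
    emit()  # the full token with punctuation stripped
    return words


if __name__ == "__main__":
    # Made-up token; each '.' produces one more, longer entry.
    for word in prefix_words("elsbs.w.p.e"):
        print(word)
```

Under this model a string with n punctuation characters can yield up to n + 1 index entries that share a prefix and sort next to each other, which matches the pattern of the run of "elsbs..." words above. If valid_punctuation were never a word break, as the docs describe, only the final entry would be expected.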