|
From: Gilles D. <gr...@sc...> - 2003-02-19 15:19:48
|
According to Lachlan Andrew:
> On Friday 14 February 2003 11:16, Neal Richter wrote:
> > Is there something you can tell us about the type of data you are
> > indexing?  Are they big pages with lots of repetitive information..
> > giving htdig many similar keys which hash/sort to the same pages?
>
> Greetings Neal,
>
> I've found one page in the qt documentation which may be causing
> those problems (attached).  I hadn't realised it, but the
> valid_punctuation attribute seems to be treated as an *optional*
> word break.  (The docs say it is *not* a word break, and that seems
> the intention of WordType::WordToken...)

I guess the docs haven't kept up with what the code does.  It used to be
that valid_punctuation didn't cause word breaks at all, i.e. these
punctuation characters were valid inside a word, and got stripped out but
didn't break up the word.  However, for some time now, this functionality
has been extended to also index each word part, so that something like
"post-doctoral" gets indexed as postdoctoral, post and doctoral.  This
greatly enhances searches for compound words, or parts thereof, but it
tends to break down when you're indexing something that's not really
words...

> The page has long strings with many valid_punctuation symbols, and
> gives output like
>
> elliptical 1060 0 1113 34
> elp 1363 0 131 0
> elphick 1516 0 750 0
> elsbs 1372 0 968 4
> elsbsw 1372 0 968 4
> elsbswp 1372 0 968 4
> elsbswpe 1372 0 968 4
> elsbswpew 1372 0 968 4
> elsbswpewg 1372 0 968 4
> elsbswpewgr 1372 0 968 4
> elsbswpewgrr 1372 0 968 4
> elsbswpewgrr1 1372 0 968 4
> elsbswpewgrr1t 1372 0 968 4
> elsbswpewgrr1twa7 1372 0 968 4
> elsbswpewgrr1twa7z 1372 0 968 4
> elsbswpewgrr1twa7z1bea0 1372 0 968 4
> elsbswpewgrr1twa7z1bea0f 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkd 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbke 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbkezb 1372 0 968 4
> else 225 0 1285 0
>
> Might that be the trouble?

Well, I would think that if you're going to feed a bunch of C code into
htdig, especially C code containing many pixmaps, then you should
probably do so with a severely stripped-down setting of
valid_punctuation.  This would speed up the process a lot and get rid of
a lot of the spurious junk that's getting indexed.  However, if the
underlying word database is solid, then it shouldn't fall apart no matter
how much junk you throw at it.  So, this might be the trigger that brings
the trouble to the surface, but the root cause of the trouble seems to be
a bug somewhere in the code.

> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)

Bummer.  Have you tried running with no compression at all, and if so,
does that work reliably?

--
Gilles R. Detillieux                E-mail: <gr...@sc...>
Spinal Cord Research Centre         WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba    Winnipeg, MB  R3E 3J7  (Canada)
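[For anyone trying to follow the splitting behaviour Gilles describes, a
rough sketch is below.  It is NOT htdig's actual WordType::WordToken
code; the function name and interface are invented for illustration.
Characters listed in valid_punctuation are stripped to form the joined
compound, and each punctuation-separated part is indexed as well, so
"post-doctoral" yields "post", "doctoral" and "postdoctoral".  The prefix
entries in the word list above suggest the real code also emits
intermediate joins, which this sketch does not attempt to reproduce.]

#include <string>
#include <vector>

// Illustrative sketch only -- not htdig source.
std::vector<std::string> split_compound(const std::string &token,
                                        const std::string &valid_punct)
{
    std::vector<std::string> words;
    std::string joined, part;
    for (std::string::size_type i = 0; i < token.size(); ++i)
    {
        if (valid_punct.find(token[i]) != std::string::npos)
        {
            if (!part.empty())
                words.push_back(part);   // each part is indexed on its own
            part.erase();
        }
        else
        {
            joined += token[i];          // punctuation is stripped out
            part += token[i];
        }
    }
    if (!part.empty())
        words.push_back(part);
    if (words.size() > 1)
        words.push_back(joined);         // plus the joined compound word
    return words;
}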
|
From: Lachlan A. <lh...@us...> - 2003-02-22 08:50:37
|
On Thursday 20 February 2003 02:19, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I hadn't realised it, but the valid_punctuation attribute seems to
> > be treated as an *optional* word break.  (The docs say it is *not*
> > a word break)
>
> I guess the docs haven't kept up with what the code does.
> This functionality was extended to also index each word part,
> so that something like "post-doctoral" gets indexed as
> postdoctoral, post and doctoral.  This greatly enhances searches for
> compound words, or parts thereof, but it tends to break down when
> you're indexing something that's not really words...

Thanks for that clarification, Gilles.

Would it be better to convert a query for post-doctoral into the phrase
"post doctoral", and to index simply the words post and doctoral in the
database?  As it stands, a search for "the non-smoker" will match
"the smoker", since all the words are given the same position in the
database.  It would also reduce the size of the database (marginally in
most cases, but significantly for pathological documents).  Now that
there is phrase searching, is there any benefit to the current approach?
If not, we could do away with valid_punctuation entirely (after 3.2.0b5).

> if you're going to feed a bunch of C code into htdig, you
> should probably do so with a severely stripped down setting of
> valid_punctuation....  However, if the underlying word database is
> solid, then it shouldn't fall apart no matter how much junk you throw
> at it.  the root cause of the trouble seems to be a bug somewhere in
> the code.

My thoughts exactly.  I'm only using this page for debugging...

Cheers,
Lachlan
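[A rough sketch of the query-side half of that proposal -- hypothetical,
not existing htsearch code: a hyphenated term in the query would be
rewritten as a quoted phrase, so it only matches the parts in adjacent
positions.]

#include <string>

// Hypothetical helper: rewrite "post-doctoral" as the phrase
// "\"post doctoral\"" before the query is parsed.  Terms without a
// hyphen are passed through unchanged.
std::string hyphen_term_to_phrase(const std::string &term)
{
    if (term.find('-') == std::string::npos)
        return term;
    std::string phrase(term);
    for (std::string::size_type i = 0; i < phrase.size(); ++i)
        if (phrase[i] == '-')
            phrase[i] = ' ';
    return "\"" + phrase + "\"";
}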
|
From: Neal R. <ne...@ri...> - 2003-02-19 17:32:21
|
Thanks.  I'll give this page a test.

What page sizes are you seeing the errors on?  Ie what is your
wordlist_page_size set to?

Thanks again.

On Wed, 19 Feb 2003, Lachlan Andrew wrote:
> On Friday 14 February 2003 11:16, Neal Richter wrote:
> > Is there something you can tell us about the type of data you are
> > indexing?  Are they big pages with lots of repetitive information..
> > giving htdig many similar keys which hash/sort to the same pages?
>
> Greetings Neal,
>
> I've found one page in the qt documentation which may be causing
> those problems (attached).  I hadn't realised it, but the
> valid_punctuation attribute seems to be treated as an *optional*
> word break.  (The docs say it is *not* a word break, and that seems
> the intention of WordType::WordToken...)  The page has long strings
> with many valid_punctuation symbols, and gives output like
>
> elliptical 1060 0 1113 34
> elp 1363 0 131 0
> elphick 1516 0 750 0
> elsbs 1372 0 968 4
> elsbsw 1372 0 968 4
> elsbswp 1372 0 968 4
> elsbswpe 1372 0 968 4
> elsbswpew 1372 0 968 4
> elsbswpewg 1372 0 968 4
> elsbswpewgr 1372 0 968 4
> elsbswpewgrr 1372 0 968 4
> elsbswpewgrr1 1372 0 968 4
> elsbswpewgrr1t 1372 0 968 4
> elsbswpewgrr1twa7 1372 0 968 4
> elsbswpewgrr1twa7z 1372 0 968 4
> elsbswpewgrr1twa7z1bea0 1372 0 968 4
> elsbswpewgrr1twa7z1bea0f 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkd 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbk 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbke 1372 0 968 4
> elsbswpewgrr1twa7z1bea0fkdrbkezb 1372 0 968 4
> else 225 0 1285 0
>
> Might that be the trouble?
>
> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)
>
> Cheers,
> Lachlan

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Lachlan A. <lh...@us...> - 2003-02-23 05:50:52
|
On Wednesday 19 February 2003 23:44, Lachlan Andrew wrote:
> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)

Whoops!  I didn't make clean after installing the new libraries.
Now that I have, I haven't been able to reproduce the problem.  I'll
keep trying, but sorry for leading people on a wild goose chase...

Cheers,
Lachlan
|
From: Lachlan A. <lh...@us...> - 2003-02-23 13:23:48
|
OK, now try this on for size...

If I run the attached rundig script, with -v and the attached .conf
file on the attached directory (51 copies of the attached file hash)
with an empty .../var/htdig-crash1 directory, then all is well.
However, if I run it a *second* time, it gives the attached log file.

This is odd, since the script uses -i which is supposed to ignore the
contents of the directory.  (On another note, should -i also ignore the
db.log file?  It currently doesn't.)

Neal, can you (or anyone else) replicate this behaviour?

Thanks!
Lachlan

On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
> Whoops!  I didn't make clean after installing the new libraries.
> Now that I have, I haven't been able to reproduce the problem.
|
From: Jim C. <li...@yg...> - 2003-02-24 07:32:28
|
On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
> If I run the attached rundig script, with -v and the attached
> .conf script on the attached directory (51 copies of the attached
> file hash) with an empty .../var/htdig-crash1 directory, then all
> is well.  However, if I run it a *second* time, it gives the attached
> log file.
>
> This is odd since the script uses -i which is supposed to ignore the
> contents of the directory.  (On another note, should -i also ignore
> the db.log file?  It currently doesn't.)
>
> Neal, can you (or anyone else) replicate this behaviour?

Hi - I was able to duplicate the problem on a machine running Red Hat
8.0.  The results of the second run were almost identical to what your
log file shows.  I was neither redirecting stderr nor paying attention
when the errors started, so I didn't catch the page number of the first
failure.  I am in the process of repeating the experiment and will
catch the page number this time around.

I tried the same thing on an OS X box, and htdig core dumps (segfault)
early on during the first pass with rundig.  The core file indicates
failure in a vm_allocate call; however, the backtrace shows the call to
be 3000+ CDB___* calls deep.  The entry point from the backtrace is
shown below.  I can provide the full backtrace if anyone wants to see it.

Jim

#3249 0x00054bc4 in CDB___bam_c_put (dbc_orig=0x19e9660, key=0xbfffdd30,
    data=0xbfffdd50, flags=15) at bt_cursor.c:925
#3250 0x0003c7b0 in CDB___db_put (dbp=0x19e9904, txn=0x19e9660,
    key=0xa99640, data=0xbfffdd30, flags=27170944) at db_am.c:508
#3251 0x00025298 in WordList::Put(WordReference const&, int)
    (this=0xbffff1f0, arg=@0xbfffdd30, flags=1) at WordDB.h:126
#3252 0x0001eeac in HtWordList::Flush() (this=0xbffff1f0)
    at ../htword/WordList.h:118
#3253 0x00020c9c in DocumentRef::AddDescription(char const*, HtWordList&)
    (this=0x548ab90, d=0x11bd98 "", words=@0xbffff1f0) at DocumentRef.cc:512
#3254 0x0000bac8 in Retriever::got_href(URL&, char const*, int)
    (this=0xbffff140, url=@0x1ae8d30, description=0x5446880 "QPainter",
    hops=1) at Retriever.cc:1496
#3255 0x00005c84 in HTML::do_tag(Retriever&, String&) (this=0xbba9b0,
    retriever=@0xbffff140, tag=@0xbbaa24) at ../htlib/htString.h:45
#3256 0x00005120 in HTML::parse(Retriever&, URL&) (this=0xbba9b0,
    retriever=@0xbffff140, baseURL=@0x10000) at HTML.cc:414
#3257 0x00009a24 in Retriever::RetrievedDocument(Document&, String const&,
    DocumentRef*) (this=0xbffff140, doc=@0xa99950, url=@0x10000,
    ref=0xbb9ad0) at Retriever.cc:818
#3258 0x000094d0 in Retriever::parse_url(URLRef&) (this=0xbffff140,
    urlRef=@0x1b730c0) at ../htcommon/URL.h:51
#3259 0x00008b28 in Retriever::Start() (this=0xbffff140) at Retriever.cc:432
#3260 0x0000f988 in main (ac=5, av=0xbffff9c4) at htdig.cc:338
#3261 0x0000266c in _start (argc=5, argv=0xbffff9c4, envp=0xbffff9dc)
    at /SourceCache/Csu/Csu-45/crt.c:267
#3262 0x000024ec in start () at /usr/include/gcc/darwin/3.1/g++-v3/streambuf:129
|
From: Jim C. <li...@yg...> - 2003-02-25 04:32:29
|
Hi - I was able to repeat the problem again.  The second time around I
made a point of catching the page numbers.  They were the same as those
listed in your log file.

Jim

On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
> OK, now try this on for size...
>
> If I run the attached rundig script, with -v and the attached
> .conf script on the attached directory (51 copies of the attached
> file hash) with an empty .../var/htdig-crash1 directory, then all
> is well.  However, if I run it a *second* time, it gives the attached
> log file.
>
> This is odd since the script uses -i which is supposed to ignore the
> contents of the directory.  (On another note, should -i also ignore
> the db.log file?  It currently doesn't.)
>
> Neal, can you (or anyone else) replicate this behaviour?
>
> Thanks!
> Lachlan
>
> On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
>> Whoops!  I didn't make clean after installing the new libraries.
>> Now that I have, I haven't been able to reproduce the
>> problem.<rundig><valid_punct.conf><directory><hash><log.first-200-lines>
|
From: Lachlan A. <lh...@us...> - 2003-02-25 12:55:25
|
Thanks for that, Jim!  I am glad it isn't just my system.

The OS X crash is probably a good thing to look at, if it occurs early
in the dig.  (a) Could you please post (or mail me) the complete
backtrace?  (b) Does it still occur without compression?  (c) Are you
using zlib-1.1.4?  (I have had core dumps with earlier versions, but
not since upgrading.)

Thanks again,
Lachlan

On Tuesday 25 February 2003 15:32, Jim Cole wrote:
> Hi - I was able to repeat the problem again.  The second time around
> I made a point of catching the page numbers.  They were the same as
> those listed in your log file.
|
From: Jim C. <li...@yg...> - 2003-02-27 00:16:53
Attachments:
osx_bt.gz
|
On Tuesday, February 25, 2003, at 05:55 AM, Lachlan Andrew wrote:
> The OS X crash is probably a good thing to look at, if it occurs early
> in the dig.  (a) Could you please post (or mail me) the complete
> backtrace?  (b) Does it still occur without compression?  (c) Are you
> using zlib-1.1.4?  (I have had core dumps with earlier versions, but
> not since upgrading.)

The backtrace is attached.

The problem does not occur if I turn off compression; by turning off
compression, I mean that I changed your provided config file so that
wordlist_compress and wordlist_compress_zlib are false and
compression_level is commented out.

OS X is still using zlib 1.1.3; according to Apple, the vulnerability
that drove the move to 1.1.4 did not affect their system.  When I get a
chance, I will try rebuilding everything against the 1.1.4 zlib
available via Fink and see if that makes any difference.

Jim
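[For reference, turning compression off as Jim describes amounts to
something like the following in the test .conf file -- the attribute
names are the ones he mentions above; the exact layout of the file is of
course site-specific.]

wordlist_compress:       false
wordlist_compress_zlib:  false
# compression_level:     (left commented out, as described above)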
|
From: Lachlan A. <lh...@us...> - 2003-02-27 11:41:40
|
On Thursday 27 February 2003 11:16, Jim Cole wrote:
> The backtrace is attached. The problem does not occur if I turn off
> compression.
Thanks.  My guess is that (part of) the reason for the very deep
recursion is that it's trying to allocate a block of len=8247 bytes,
when the page size is only 8192:

#3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30, memreg=0xa99f60,
    mfp=0xc84e98, len=8247, offsetp=0x0, retp=0xbfffd900) at
    mp_alloc.c:88

I used to get the error

    Unable to allocate %lu bytes from mpool shared region

at some stage too, which is generated inside CDB___memp_alloc.  From
memory, that was when I was using 1.1.3.

If that is really the problem, it can be fixed by testing explicitly
whether len > pagesize (if the pagesize is available somewhere...).
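[A minimal sketch of the check being suggested -- illustration only.
The real fix would live inside CDB___memp_alloc in mp_alloc.c, and how
the page size is obtained there is an assumption; the helper name is
invented.]

#include <errno.h>
#include <stddef.h>

/* A buffer request larger than one pool page can never be satisfied by
 * flushing dirty pages, so fail it up front instead of letting the
 * allocator recurse looking for room. */
static int alloc_fits_page(size_t len, size_t pagesize)
{
    if (pagesize != 0 && len > pagesize)
        return ENOMEM;      /* e.g. len=8247 against an 8192-byte page */
    return 0;
}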
|
|
From: Jim C. <li...@yg...> - 2003-02-28 00:23:58
|
Hi - Just a follow-up on the issue of zlib version.  I installed the
1.1.4 version of zlib available via Fink and rebuilt everything.  Using
the newer version of zlib, I encounter the same problem (i.e. a segfault
from htdig with a very deep stack trace).  I did perform a distclean and
verified the use of the 1.1.4 version of libz via otool.

Jim

On Thursday, February 27, 2003, at 04:41 AM, Lachlan Andrew wrote:
> On Thursday 27 February 2003 11:16, Jim Cole wrote:
>
>> The backtrace is attached.  The problem does not occur if I turn off
>> compression.
>
> Thanks.  My guess is that (part of) the reason for the very deep
> recursion is that it's trying to allocate a block of len=8247
> bytes, when the page size is only 8192:
> #3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30, memreg=0xa99f60,
>     mfp=0xc84e98, len=8247, offsetp=0x0, retp=0xbfffd900) at
>     mp_alloc.c:88
>
> I used to get the error
>     Unable to allocate %lu bytes from mpool shared region
> at some stage too, which is generated inside CDB___memp_alloc.  From
> memory, that was when I was using 1.1.3.
>
> If that is really the problem, it can be fixed by testing explicitly
> whether len>pagesize (if the pagesize is available somewhere...).
|
From: Lachlan A. <lh...@us...> - 2003-03-09 09:23:03
Attachments:
mp_alloc.patch
|
Greetings Jim,

Attached is a hack which explicitly stops the recursion in OS X.  Does
it work?  (Neal, would it be better in one of the other functions in
the loop?)

I don't know why a different OS should crash in a different place.
Does OS X support pread?  Try 'man pread'.

Are you having any luck with the other errors in 'make check'?

Thanks!
Lachlan

On Friday 28 February 2003 11:23, Jim Cole wrote:
> Hi - Just a follow up on the issue of zlib version.  I installed
> the 1.1.4 version of zlib available via Fink and rebuilt
> everything.  Using the newer version of zlib, I encounter the same
> problem (i.e. a segfault from htdig with a very deep stack trace).
> I did perform a distclean and verified the use of the 1.1.4 version
> libz via otool.
>
> Jim
>
> On Thursday, February 27, 2003, at 04:41 AM, Lachlan Andrew wrote:
> > On Thursday 27 February 2003 11:16, Jim Cole wrote:
> >> The backtrace is attached.  The problem does not occur if I turn
> >> off compression.
> >
> > Thanks.  My guess is that (part of) the reason for the very deep
> > recursion is that it's trying to allocate a block of len=8247
> > bytes, when the page size is only 8192:
> > #3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30,
> > memreg=0xa99f60, mfp=0xc84e98, len=8247, offsetp=0x0,
> > retp=0xbfffd900) at mp_alloc.c:88
> >
> > I used to get the error
> > Unable to allocate %lu bytes from mpool shared region
> > at some stage too, which is generated inside CDB___memp_alloc.
> > From memory, that was when I was using 1.1.3.
> >
> > If that is really the problem, it can be fixed by testing
> > explicitly whether len>pagesize (if the pagesize is available
> > somewhere...).
|
From: Neal R. <ne...@ri...> - 2003-03-10 06:56:06
|
I tried using "make check" on a RedHat 8.0 machine.. no joy.  There
seems to be a fundamental difference between what
htdig-3.2.0b4-20030302/test/conf/httpd.conf wants and what RedHat 8.0
provides.  I'll try it on a different machine at work tomorrow.

Interesting patch.. I need to read more code around it.

I did however look over snapshots of BDB
(http://www.sleepycat.com/download/patchlogs.shtml).  There are a large
number of changes to mp_alloc.c from 3.0.55 to 3.1.14, but they all seem
to be superficial changes (function names etc).  The diff from 3.0.55 to
3.3.11 is more interesting.. but nothing along the lines of your patch
that I saw...

My feeling at this point is that the bug is caused by a problem in the
mp_cmpr.c code, and since you've found a nice small test case it will
hopefully be easier to fix.. I'll report back tomorrow.

Thanks

On Sun, 9 Mar 2003, Lachlan Andrew wrote:
> Greetings Jim,
>
> Attached is a hack which explicitly stops the recursion in OS X.  Does
> it work?  (Neal, would it be better in one of the other functions in
> the loop?)
>
> I don't know why a different OS should crash in a different place.
> Does OS X support pread?  Type man pread.
>
> Are you having any luck with the other errors in 'make check'?
>
> Thanks!
> Lachlan
>
> On Friday 28 February 2003 11:23, Jim Cole wrote:
> > Hi - Just a follow up on the issue of zlib version.  I installed
> > the 1.1.4 version of zlib available via Fink and rebuilt
> > everything.  Using the newer version of zlib, I encounter the same
> > problem (i.e. a segfault from htdig with a very deep stack trace).
> > I did perform a distclean and verified the use of the 1.1.4 version
> > libz via otool.
> >
> > Jim
> >
> > On Thursday, February 27, 2003, at 04:41 AM, Lachlan Andrew wrote:
> > > On Thursday 27 February 2003 11:16, Jim Cole wrote:
> > >> The backtrace is attached.  The problem does not occur if I turn
> > >> off compression.
> > >
> > > Thanks.  My guess is that (part of) the reason for the very deep
> > > recursion is that it's trying to allocate a block of len=8247
> > > bytes, when the page size is only 8192:
> > > #3244 0x00070958 in CDB___memp_alloc (dbmp=0xa98c30,
> > > memreg=0xa99f60, mfp=0xc84e98, len=8247, offsetp=0x0,
> > > retp=0xbfffd900) at mp_alloc.c:88
> > >
> > > I used to get the error
> > > Unable to allocate %lu bytes from mpool shared region
> > > at some stage too, which is generated inside CDB___memp_alloc.
> > > From memory, that was when I was using 1.1.3.
> > >
> > > If that is really the problem, it can be fixed by testing
> > > explicitly whether len>pagesize (if the pagesize is available
> > > somewhere...).

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Lachlan A. <lh...@us...> - 2003-03-11 13:26:01
|
Thanks for your email.

I think Jim's infinite recursion problem was fairly separate from the
database corruption I was having.  The recursion is fairly simple:

(1) The allocation routine selects a dirty cache page to flush.
(2) When the page is written, it is "compressed" on the fly, but it is
    actually slightly expanded and needs to be stored in two pages.
(3) When the compression routine tries to allocate a new page, it
    recursively calls the same allocation routine, and the same dirty
    cache page is selected...

Ideally it would be nice to fix mp_cmpr.c so that the "weakcmpr" page
was allocated by a completely different mechanism (on the stack?), but
that has the potential to introduce lots more bugs.

The new releases of BDB don't seem to have mp_cmpr.c at all.  Is that an
add-on from another project?  Mifluz?

Cheers,
Lachlan

On Monday 10 March 2003 17:57, Neal Richter wrote:
> Interesting patch.. I need to read more code around it.
>
> My feeling at this point is that the bug is caused by a problem in
> the mp_cmpr.c code
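[To make that cycle concrete, here is one way it could be broken -- a
reentrancy flag, so the nested allocation made from the compression path
fails cleanly instead of re-selecting the same dirty page.  Purely
illustrative; the names and structure are assumptions, not the real
mp_alloc.c/mp_cmpr.c code.]

#include <errno.h>
#include <stddef.h>

/* Illustrative only: mark that an allocation/flush is already in
 * progress, so the nested request for the overflow ("_weakcmpr") page in
 * step (3) returns an error rather than recursing forever. */
struct pool_alloc_state {
    int alloc_in_progress;
};

static int pool_alloc_page(struct pool_alloc_state *st, size_t len,
                           void **retp)
{
    (void)len;                      /* size handling omitted in this sketch */

    if (st->alloc_in_progress)
        return EBUSY;               /* nested call from the compressor */

    st->alloc_in_progress = 1;
    /* ... pick a dirty cache page and write it out; compressing it may
     * need a second page and re-enter this function (step 3 above) ... */
    st->alloc_in_progress = 0;

    *retp = NULL;                   /* placeholder for the new page */
    return 0;
}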
|
From: Geoff H. <ghu...@ws...> - 2003-03-11 14:34:35
|
On Tuesday, March 11, 2003, at 07:25 AM, Lachlan Andrew wrote:
> The new releases of BDB don't seem to have mp_cmpr.c at all.  Is
> that an add-on from another project?  Mifluz?

While Loic offered lots of the improvements and bug-fixes to the
Sleepycat folks, I don't think they took any.  In particular, they felt
the database compression feature didn't make sense because disk space is
cheap.  In any case, the core Berkeley DB code gets hammered pretty hard
since it's used in lots of places.

-Geoff
|
From: Lachlan A. <lh...@us...> - 2003-03-12 11:36:17
|
Just a thought, but might the problem be related to:

    General Access Method Changes:
    6. Fix a bug in which DB-managed memory returned by a DB->get or
       DB->put call may be corrupted by a later cursor call. [#3576]

?  My suspicion comes since mp_cmpr_alloc calls get(), and Jim's example
shows that it can recursively burrow further than we'd like.

It is still taking me forever to understand all the code/changes...

Cheers,
Lachlan

On Monday 10 March 2003 17:57, Neal Richter wrote:
> I did however look over snapshots of BDB
> (http://www.sleepycat.com/download/patchlogs.shtml)
|
From: Neal R. <ne...@ri...> - 2003-02-25 23:28:45
|
Jim,

Does the error happen when you run htdig -i twice (NOT using rundig)?

Thanks.

On Mon, 24 Feb 2003, Jim Cole wrote:
> Hi - I was able to repeat the problem again.  The second time around I
> made a point of catching the page numbers.  They were the same as those
> listed in your log file.
>
> Jim
>
> On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
>
> > OK, now try this on for size...
> >
> > If I run the attached rundig script, with -v and the attached
> > .conf script on the attached directory (51 copies of the attached
> > file hash) with an empty .../var/htdig-crash1 directory, then all
> > is well.  However, if I run it a *second* time, it gives the attached
> > log file.
> >
> > This is odd since the script uses -i which is supposed to ignore the
> > contents of the directory.  (On another note, should -i also ignore
> > the db.log file?  It currently doesn't.)
> >
> > Neal, can you (or anyone else) replicate this behaviour?
> >
> > Thanks!
> > Lachlan
> >
> > On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
> >> Whoops!  I didn't make clean after installing the new libraries.
> >> Now that I have, I haven't been able to reproduce the
> >> problem.<rundig><valid_punct.conf><directory><hash><log.first-200-
> >> lines>

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Lachlan A. <lh...@us...> - 2003-02-26 12:45:00
|
Greetings all,
Just for the record:
1) The -i option doesn't remove the _weakcmpr file.
Neal, what effect will that have?
2) I've just run htdig on an existing database *without* -i and
it also complained about weakcmpr problems.
(I've forgotten whether I ran htpurge after the first run, so
I'm running it again without it.)
3) There is still a (different) problem with pagesize 32k. The
htdig ran OK, but the second htpurge complained near the end.
Cheers,
Lachlan
On Wednesday 26 February 2003 10:30, Neal Richter wrote:
> Does the error happen when you run htdig -i twice NOT using rundig?
|
|
From: Neal R. <ne...@ri...> - 2003-02-26 17:35:04
|
On Wed, 26 Feb 2003, Lachlan Andrew wrote:
> Greetings all,
>
> Just for the record:
> 1) The -i option doesn't remove the _weakcmpr file.
> Neal, what effect will that have?
> 2) I've just run htdig on an existing database *without* -i and
> it also complained about weakcmpr problems.
> (I've forgotten whether I ran htpurge after the first run, so
> I'm running it again without it.)
> 3) There is still a (different) problem with pagesize 32k. The
> htdig ran OK, but the second htpurge complained near the end.
#1 is easy to fix.
Note that there is no word_db_weakcmp config variable....
Changes near htdig.cc:279
    const String word_filename = config->Find("word_db");

    // Build the name of the companion file ("<word_db>_weakcmpr").
    // Note it must not be const, since we append the suffix to it.
    String word_weakcmp_filename = word_filename;
    word_weakcmp_filename.append("_weakcmpr");

    if (initial)
    {
        // -i (initial): remove both the word database and its
        // _weakcmpr companion before digging.
        unlink(word_filename);
        unlink(word_weakcmp_filename);
    }
#3
What is htpurge being run for????  Isn't it used to remove entries from
the index? I know that htpurge is called immediately after htdig in
rundig... my question is WHY???!!!
How are you guys using it?
What happens when you try and use it to remove URLs from the index,
and try to add more URLs after purging??
An interesting test would be to establish two test datasets that are
exact duplicates of each other at different URLs on your server.
%htdig -i URL1
%htdig -i URL2
This would access, expand and rewrite nearly every page in the WordDB.
If there are problems rewriting/expanding pages, they may show up.
Thanks!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Lachlan A. <lh...@us...> - 2003-02-26 22:21:19
|
On Thursday 27 February 2003 04:36, Neal Richter wrote:
> On Wed, 26 Feb 2003, Lachlan Andrew wrote:
> > 1) The -i option doesn't remove the _weakcmpr file.
> > 2) I've just run htdig on an existing database *without* -i
> >    and it also complained about weakcmpr problems.
> >    (I've forgotten whether I ran htpurge after the first run,
> >    so I'm running it again without it.)
>
> #1 is easy to fix.

Yes.  While we're at it, we should remove db.log (="url_log").  I was
just thinking it might give you/us some insight into the cause of the
problem.

For #2, I have run htdig again without -i and without having purged the
database, but after 'touch'ing each html file.  It complains:

WordDB: CDB___memp_cmpr_read: unable to uncompress page at pgno = 40435
WordDB: PANIC: Input/output error

Whenever this appears, it appears twice.

> #3
> What is htpurge being run for????  Isn't it used to remove
> entries from the index?  I know that htpurge is called immediately
> after htdig in rundig... my question is WHY???!!!

Entries are created for all of the pages referred to during the dig,
even if they don't exist.  Purging gets rid of these useless entries.

> How are you guys using it?

./bin/htpurge -v -c <file>.conf

> An interesting test would be to establish two test datasets that
> are exact duplicates of each other at different URLs on your
> server.
>
> %htdig -i URL1
> %htdig -i URL2
>
> This would access, expand and rewrite nearly every page in the
> WordDB.  If there are problems rewriting/expanding pages, they may
> show up.

If -i works, the database should be erased before being accessed in the
second dig, shouldn't it?

Regards,
Lachlan
|
From: Jim C. <li...@yg...> - 2003-02-27 00:09:32
|
Hi - The problem does not occur when running only htdig with -i.

Jim

On Tuesday, February 25, 2003, at 04:30 PM, Neal Richter wrote:
> Jim,
>   Does the error happen when you run htdig -i twice (NOT using
> rundig)?
>
> Thanks.
>
> On Mon, 24 Feb 2003, Jim Cole wrote:
>
>> Hi - I was able to repeat the problem again.  The second time around I
>> made a point of catching the page numbers.  They were the same as those
>> listed in your log file.
>>
>> Jim
>>
>> On Sunday, February 23, 2003, at 06:21 AM, Lachlan Andrew wrote:
>>
>>> OK, now try this on for size...
>>>
>>> If I run the attached rundig script, with -v and the attached
>>> .conf script on the attached directory (51 copies of the attached
>>> file hash) with an empty .../var/htdig-crash1 directory, then all
>>> is well.  However, if I run it a *second* time, it gives the attached
>>> log file.
>>>
>>> This is odd since the script uses -i which is supposed to ignore the
>>> contents of the directory.  (On another note, should -i also ignore
>>> the db.log file?  It currently doesn't.)
>>>
>>> Neal, can you (or anyone else) replicate this behaviour?
>>>
>>> Thanks!
>>> Lachlan
>>>
>>> On Sunday 23 February 2003 16:50, Lachlan Andrew wrote:
>>>> Whoops!  I didn't make clean after installing the new libraries.
>>>> Now that I have, I haven't been able to reproduce the
>>>> problem.<rundig><valid_punct.conf><directory><hash><log.first-200-
>>>> lines>
>
> Neal Richter
> Knowledgebase Developer
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485
|
From: Neal R. <ne...@ri...> - 2003-02-14 23:19:55
|
Lachlan,

Question:  What is your wordlist_page_size set to?

The htdig default is zero, and the BDB default of 8K (in most
situations) is then used.  Although the BDB maximum page size is 64K, we
can't use that yet as a result of a multiplication bug in mp_cmpr I
haven't tracked down yet.

I use this as my default:

wordlist_page_size: 32768

Larger pages are usually more efficient, especially since here we pay
the overhead of deflating each page individually before returning the
data.

If your bug is caused by page overflow as I suspect, then this change
will at least push the bug 'away', so that you may have to index several
orders of magnitude more than 50,000 pages to see the bug.  We've got
all kinds of problems if we want to try and index 5 Million+ pages.

I could be wrong, but I'd be interested to see if it makes the problem
go away.

Thanks!

On Fri, 14 Feb 2003, Lachlan Andrew wrote:
> An error occurs during an htdump straight after htdig.  However, I
> haven't yet got it to occur *within* htdig.
>
> Interestingly, the error first reported by htdump is similar to the
> one I last reported,
>
> WordDB: CDB___memp_cmpr_read: unable to uncompress page at pgno = 23
> WordDB: PANIC: Input/output error
> WordDBCursor::Get(17) failed DB_RUNRECOVERY: Fatal error, run
> database recovery
>
> but the one by htpurge (and subsequent htdumps) is
>
> WordDB: CDB___memp_cmpr_read: unexpected compression flag value 0x8
> at pgno = 26613
> WordDB: PANIC: Successful return: 0
> WordDBCursor::Get(17) failed DB_RUNRECOVERY: Fatal error, run
> database recovery
>
> I'll keep looking...
>
> On Friday 14 February 2003 05:05, Neal Richter wrote:
> > Please attempt to reproduce the error using ONLY htdig next.
> >
> > If the error is still present, then the error is in htdig.  If the
> > error is not present then the bug is happening during htpurge.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
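[For anyone following along, the page-size tuning Neal recommends would
go in the htdig .conf roughly like this -- a sketch only; 32768 is
simply the value suggested above, and 0 lets Berkeley DB pick its own
(normally 8K) page size.]

# Word-database page size for the compressed word index.
# wordlist_page_size: 0        (default: let Berkeley DB choose, ~8K)
wordlist_page_size: 32768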