From: Gilles D. <gr...@sc...> - 2003-10-03 18:13:12
|
According to Neal Richter:
> On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> > I'm not sure that I understand this. If a page 'X' is linked only by
> > a page 'Y' which isn't changed since the previous dig, do we parse
> > the unchanged page 'Y'? If so, why not run htdig -i? If not, how
> > do we know that page 'X' should still be in the database?
>
> X does not change, but Y does.. it no longer has a link to X.
>
> If the website is big enough, htdig -i is wasteful of network bandwidth.
>
> The logical error as I see it is that we revisit the list of documents
> currently in the index, rather than starting from the beginning and
> spidering... then removing all the documents we didn't find links for.

But if we need to re-spider everything, don't we need to re-index all documents, whether they've changed or not? If so, then we need to do htdig -i all the time. If we don't reparse every document, we need some other means to re-validate every document to which an unchanged document has links.

I think you misinterpreted what Lachlan suggested, i.e. the case where Y does NOT change. If Y is the only document with a link to X, and Y does not change, it will still have the link to X, so X is still "valid". However, if Y didn't change, and htdig (without -i) doesn't reindex Y, then how will it find the link to X to validate X's presence in the db?

> > I'd be inclined not to fix this until after we've released the next
> > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...

I'd be inclined to agree. If it comes down to the possibility of losing valid documents in the db vs. keeping invalid ones, I'd prefer the latter behaviour. Until we can find a way to ensure all currently linked documents remain in the db, without having to reparse them all, I think the current behaviour is the best compromise. If you want to reparse everything to ensure a clean db with accurate linkages, that's what -i is for.

A somewhat related problem/limitation in update digs is that the backlink count and link depth from start_url may not get properly updated for documents that aren't reparsed. If these matter to you, periodic full digs may be needed to restore the accuracy of these fields.

--
Gilles R. Detillieux                E-mail: <gr...@sc...>
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba    Winnipeg, MB  R3E 3J7  (Canada)
|
From: Neal R. <ne...@ri...> - 2003-10-03 17:54:00
|
On Fri, 3 Oct 2003, Steve Eidemiller wrote:
> Hi Lachlan,
>
> Thanks for the suggestion :-) I tried it in all the environments I listed earlier, but it didn't appear to work using the default settings for the compression flags. Here's what htdb_dump reports for all attempts:
>
> ================
> C:\htdig\bin>htdb_dump -Wz -p c:/htdig/var/htdig/db.words.db
> htdb_dump: open: c:/htdig/var/htdig/db.words.db: No such file or directory
>
> C:\htdig\bin>htdb_dump -W -p c:/htdig/var/htdig/db.words.db
> htdb_dump: c:/htdig/var/htdig/db.words.db: file size not a multiple of the pagesize

This is strange given the above error. IF this error is accurate, it's a harbinger of bad values in the DB.

I have fixed this and thought I checked it in! Basically I tracked the state of the file pointer, and at some point the system tweaks it to 'text' mode, and this hoses the DB. I'll check CVS to see if I got that fix in. If the fix is in CVS, then it's a new bug!

> On Fri, 3 Oct 2003 02:51, Steve Eidemiller wrote:
> > I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc
> > 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4.
> > db.words.db is always a zero length file
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
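The usual guard against this class of corruption on Win32/Cygwin builds is to force the descriptor behind the Berkeley DB file into binary mode so CR/LF translation can never touch page data. The sketch below only illustrates that idea; the fd handle and where such a hook would sit in htdig's DB layer are assumptions, and this is not the fix Neal refers to.

#if defined(_WIN32)
#include <io.h>
#include <fcntl.h>

// Illustration only: keep a DB file descriptor in binary mode on Win32.
// _setmode()/_O_BINARY are the standard MSVC/MinGW calls; the call site
// inside htdig's DB layer is an assumption.
static void force_binary_mode(int fd)
{
    // In text mode, 0x0A bytes are rewritten as 0x0D 0x0A on write, which can
    // leave the file size no longer a multiple of the page size -- the same
    // symptom htdb_dump reports above.
    _setmode(fd, _O_BINARY);
}
#endif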
|
From: Neal R. <ne...@ri...> - 2003-10-03 17:46:45
|
On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> Greetings Neal,
>
> I'm not sure that I understand this. If a page 'X' is linked only by
> a page 'Y' which isn't changed since the previous dig, do we parse
> the unchanged page 'Y'? If so, why not run htdig -i? If not, how
> do we know that page 'X' should still be in the database?

X does not change, but Y does.. it no longer has a link to X.

If the website is big enough, htdig -i is wasteful of network bandwidth.

The logical error as I see it is that we revisit the list of documents
currently in the index, rather than starting from the beginning and
spidering... then removing all the documents we didn't find links for.

> I'd be inclined not to fix this until after we've released the next
> "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
> Cheers,
> Lachlan
>
> On Fri, 3 Oct 2003 08:56, Neal Richter wrote:
> > The workaround is to use 'htdig -i'. This is a disadvantage as we
> > will revisit and index pages even if they haven't changed since the
> > last run of htdig.
> >
> > Here's the Fix:
> >
> > 1) At the start of Htdig, after we've opened the DBs we 'walk' the
> > docDB and mark EVERY document as Reference_obsolete. I wrote code
> > to do this.. very short.
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Steve E. <Ste...@ch...> - 2003-10-03 15:37:01
|
Hi Lachlan,
Thanks for the suggestion :-) I tried it in all the environments I listed earlier, but it didn't appear to work using the default settings for the compression flags. Here's what htdb_dump reports for all attempts:
================
C:\htdig\bin>htdb_dump -Wz -p c:/htdig/var/htdig/db.words.db
htdb_dump: open: c:/htdig/var/htdig/db.words.db: No such file or directory
C:\htdig\bin>htdb_dump -W -p c:/htdig/var/htdig/db.words.db
htdb_dump: c:/htdig/var/htdig/db.words.db: file size not a multiple of the pagesize
htdb_dump: open: c:/htdig/var/htdig/db.words.db: Invalid argument
C:\htdig\bin>
================
The db.words.db.work_weakcmpr file gets created now, and words.db has size to it, but it still seems like it's corrupt or something since I can't dump it. Perhaps I've used the wrong command? htsearch doesn't seem to like words.db either:
================
C:\htdig\bin>htsearch
Enter value for words: patients
WordDB: DB->cursor: method meaningless before open
Content-type: text/html
================
I'm working on the Win32 native build as Neal suggested.
Thanx
>>> Lachlan Andrew <lh...@us...> 10/03/03 08:18AM >>>
Greetings Steve,
Thanks for the very clear bug report. Someone else has the same
problem. It's bug #814268...
This may be my fault. What happens if you replace the NULL in line
806 of db/mp_cmpr.c by dbenv ? That is, make it
if(CDB_db_create(&dbp, dbenv, 0) != 0
That was changed to avoid the possibility of infinite loops, but is a
bit of a kludge. If making the change described above works, then
I'll try to fix it properly.
Cheers,
Lachlan
On Fri, 3 Oct 2003 02:51, Steve Eidemiller wrote:
> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc
> 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4.
> db.words.db is always a zero length file
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Gabriele B. <bar...@in...> - 2003-10-03 14:42:43
|
Hi guys,
well ... I really like your idea Neal (I got a similar one for
ht://Check, but I have never had the time to realise that!).
However, I agree with Lachlan. I'd prefer to wait until we release this
*blessed* 3.2.0b5 version, hopefully soon.
Any other opinions?
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Lachlan A. <lh...@us...> - 2003-10-03 13:21:00
|
Greetings Steve,
Thanks for the very clear bug report. Someone else has the same
problem. It's bug #814268...
This may be my fault. What happens if you replace the NULL in line
806 of db/mp_cmpr.c by dbenv ? That is, make it
if(CDB_db_create(&dbp, dbenv, 0) != 0
That was changed to avoid the possibility of infinite loops, but is a
bit of a kludge. If making the change described above works, then
I'll try to fix it properly.
Cheers,
Lachlan
On Fri, 3 Oct 2003 02:51, Steve Eidemiller wrote:
> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc
> 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4.
> db.words.db is always a zero length file
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Lachlan A. <lh...@us...> - 2003-10-03 13:10:03
|
Greetings Neal,

I'm not sure that I understand this. If a page 'X' is linked only by
a page 'Y' which isn't changed since the previous dig, do we parse
the unchanged page 'Y'? If so, why not run htdig -i? If not, how
do we know that page 'X' should still be in the database?

I'd be inclined not to fix this until after we've released the next
"archive point", whether that be 3.2.0b5 or 3.2.0rc1...

Cheers,
Lachlan

On Fri, 3 Oct 2003 08:56, Neal Richter wrote:
> The workaround is to use 'htdig -i'. This is a disadvantage as we
> will revisit and index pages even if they haven't changed since the
> last run of htdig.
>
> Here's the Fix:
>
> 1) At the start of Htdig, after we've opened the DBs we 'walk' the
> docDB and mark EVERY document as Reference_obsolete. I wrote code
> to do this.. very short.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
From: Jessica B. <jes...@ya...> - 2003-10-03 00:49:17
|
--- Neal Richter <ne...@ri...> wrote:
> Hey all,
> I've got a question for all of you about how the htdig 'indexer'
> should function.
> I've tested this fix and it works.
>
> Eh?

I felt like I was sharing a beer with you at the pub, and you just got done "schematicizing" the problem and fix on a napkin-coaster and ended it with, "Eh?"

Sounds like a good fix to a problem that I think (subconsciously) I knew existed.

How about this one -- does your patch help with the check_unique_md5 problem? Even when I use a "-i" option (or without), if the start_url's MD5 hash-sig matches the one from my previous index, it just says that it detected an MD5 duplicate and exits. Deleting db.md5hash.db seems to do the trick. But would that be sacrilege, removing the db.md5hash.db before a refresh?

-Jes
|
From: Neal R. <ne...@ri...> - 2003-10-02 22:58:03
|
Hey all,
I've got a question for all of you about how the htdig 'indexer'
should function.
htdig.cc
337 List *list = docs.URLs();
338 retriever.Initial(*list);
339 delete list;
340
341 // Add start_url to the initial list of the retriever.
342 // Don't check a URL twice!
343 // Beware order is important, if this bugs you could change
344 // previous line retriever.Initial(*list, 0) to Initial(*list,1)
345 retriever.Initial(config->Find("start_url"), 1);
Note lines 337-339. This code loads the entire list of documents
currently in the index and feeds this to the retriever object for
retrieval and processing.
The effect of this is that we potentially are visiting and keeping
webpages that we aren't about to find via a link, and we will keep
revisiting a website even if we remove it from the 'start_url' in
htdig.conf.
The workaround is to use 'htdig -i'. This is a disadvantage as we will
revisit and index pages even if they haven't changed since the last run of
htdig.
Here's the Fix:
1) At the start of Htdig, after we've opened the DBs we 'walk' the docDB
and mark EVERY document as Reference_obsolete. I wrote code to do this..
very short.
2) Comment out htdig.cc 337-339
3) When the indexer fires up and spiders a site, documents that are in
the tree and marked as Reference_obsolete are remarked as
Reference_normal.
4) When htpurge is run, the obsoleted docs are flushed.
Documents that aren't revisited (since a link isn't found) are flushed.
This fix addresses two flaws:
1) Changing 'start_url' and removing a starting URL: the documents are
still in the index after the next run of htdig (unless you use -i).
2) Pages that still exist on a webserver at a given URL, but are no longer
linked to by any other pages on the site.
I've tested this fix and it works.
Eh?
Thanks.
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
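The step-1 walk described above can be pictured roughly as follows. The sketch reuses docs.URLs() and Reference_obsolete from the quoted code, but the record lookup, the DocState() setter and the write-back call are assumed names for illustration; this is not the actual patch.

#include "DocumentDB.h"
#include "DocumentRef.h"
#include "List.h"
#include "htString.h"

// Sketch: flag every record in the docDB as obsolete before the dig starts.
// The spider then flips re-found documents back to Reference_normal, and
// htpurge drops whatever is still marked obsolete.
static void MarkAllObsolete(DocumentDB &docs)
{
    List *urls = docs.URLs();               // every URL currently in the index
    String *url;
    urls->Start_Get();
    while ((url = (String *) urls->Get_Next()) != 0)
    {
        DocumentRef *ref = docs[*url];      // assumed: fetch the record for this URL
        if (ref == 0)
            continue;
        ref->DocState(Reference_obsolete);  // assumed setter overload of DocState()
        docs.Add(*ref);                     // assumed: write the updated record back
        delete ref;
    }
    delete urls;
}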
|
|
From: Neal R. <ne...@ri...> - 2003-10-02 21:43:30
|
Hey,

I have produced a set of makefiles for native Windows binaries. You do need Cygwin to run 'make' (the makefiles are for GNU make). The makefiles use the Microsoft compiler.

Could you get a copy of the latest snapshot and try to do the build? I'll work with you to get it fixed if it's still broken. We've tested older snapshots of HtDig compiled Win32 native and run nearly a million documents through it....

If this doesn't satisfy your needs, I'd be willing to put in some time looking at the Cygwin build.

Neal Richter

On Thu, 2 Oct 2003, Steve Eidemiller wrote:
> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation is not a problem. But db.words.db is always a zero length file after running htdig with the compression flags at their default values. After some profiling I also noticed that it wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file *is* created during the dig and db.words.db has size to it afterwards. However, I am not able to htdb_dump that file or use htsearch against it. It's corrupt or something. The other db files seem to get created fine under both sets of binaries, although I didn't try to dump them. And the same version related behavior occurs under both XP and 2000 OS's.
>
> After reading all the SF posts about compression and db issues, I decided to disable compression and see what happens:
>
> wordlist_compress: false
> wordlist_compress_zlib: false
> compression_level: 0
>
> With those settings, everything appears to work fine for both sets of binaries: I can dig pages and run htsearch. I haven't modified any of the code to try and address the problem yet, but it looks like others are having similar issues on other platforms? Is anybody else having trouble with db compression on Windows? I have tried different settings for compression_level with no success.
>
> Also, my initial attempts at changing the compression flag values failed with error messages from htdig while trying to read the configuration file. It seems that the htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are obvious choices for editing this file on Windows, but those don't work because both insert CRLF pairs to terminate lines in the file (e.g. DOS format). And then the parser apparently won't see flags at the bottom of the CRLF file. The solution was a simple JavaScript program to modify htdig.conf by removing all CR characters *before* running htdig. Is anybody else seeing this on Cygwin builds?
>
> Sorry for the long post :)
>
> PS - I'm running 3.1.6 in production on Windows at http://www.childrenshc.org/Search/ and it rocks!!
>
> Thanx
> -Steve

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Steve E. <Ste...@ch...> - 2003-10-02 16:52:20
|
I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation is not a problem. But db.words.db is always a zero length file after running htdig with the compression flags at their default values. After some profiling I also noticed that it wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file *is* created during the dig and db.words.db has size to it afterwards. However, I am not able to htdb_dump that file or use htsearch against it. It's corrupt or something. The other db files seem to get created fine under both sets of binaries, although I didn't try to dump them. And the same version related behavior occurs under both XP and 2000 OS's.

After reading all the SF posts about compression and db issues, I decided to disable compression and see what happens:

wordlist_compress: false
wordlist_compress_zlib: false
compression_level: 0

With those settings, everything appears to work fine for both sets of binaries: I can dig pages and run htsearch. I haven't modified any of the code to try and address the problem yet, but it looks like others are having similar issues on other platforms? Is anybody else having trouble with db compression on Windows? I have tried different settings for compression_level with no success.

Also, my initial attempts at changing the compression flag values failed with error messages from htdig while trying to read the configuration file. It seems that the htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are obvious choices for editing this file on Windows, but those don't work because both insert CRLF pairs to terminate lines in the file (e.g. DOS format). And then the parser apparently won't see flags at the bottom of the CRLF file. The solution was a simple JavaScript program to modify htdig.conf by removing all CR characters *before* running htdig. Is anybody else seeing this on Cygwin builds?

Sorry for the long post :)

PS - I'm running 3.1.6 in production on Windows at http://www.childrenshc.org/Search/ and it rocks!!

Thanx
-Steve
|
From: Mirrors A. <mi...@so...> - 2003-10-02 09:37:02
|
Hi there,

We have just set up a new Ht://Dig mirror in the UK (London), which is updated nightly and is on a 2Mbit dedicated line.

Ht://Dig Web Site: http://www.sourcekeg.co.uk/htdig/
Ht://Dig Files Web Site: http://www.sourcekeg.co.uk/htdig/files/
Ht://Dig Patch Web Site: http://www.sourcekeg.co.uk/htdig/htdig-patches/
Ht://Dig Developer Web/FTP Site: http://www.sourcekeg.co.uk/htdig/dev/

Please add this to the Ht://Dig official mirror page accordingly and use "Onino" in the Organisation field, pointing to http://www.onino.co.uk

If you need a contact email address, you can use mi...@so...

Best Regards,
Jonathan Menmuir
mi...@so...
|
From: Geoff H. <ghu...@us...> - 2003-09-29 20:55:00
|
STATUS of ht://Dig branch 3-2-x

CHECKLIST FOR 3.2.0b5:
* Check bugs listed in bug-tracker...
* Polish release docs (Geoff)
* Must be able to (a) make check and (b) index www.htdig.org using "robotstxt_name: master-htdig" on all systems listed as "supported".

Systems tested so far:
- RH AdvancedServer 2.1 ItaniumII, gcc 2.96.x (David Bannon, 21 Sep)

Very out-of-date tests:
- Mandrake 8.2, gcc 3.2 (lha, 21 May)
- FreeBSD 4.6, gcc 2.95.3 (lha, 23 May)
- Debian, Linux kernel 2.2.19, gcc 2.95.4 (lha, 23 May)
- SunOS 5.8 = Solaris 2.8, gcc 3.1 (lha, 25 May)
- SunOS 5.8 = Solaris 2.8, Sun cc with g++ 3.1 (lha, 29 May)
- OS X (Jim, 30 May)

Partly tested:
- RedHat 8 (Jim, 1 June. make check requires tweaking for apache)
- SunOS 5.8 = Solaris 2.8, gcc 2.95.2 (lha. Makes check minus apache, Digs small htdig.org. 27 May)
- SunOS 5.8 = Solaris 2.8, Sun cc with g++ 2.95.2 (lha. Makes check minus apache, Digs small htdig.org. 2 June)
- RedHat 7.3 (lha. Makes check minus apache. Digs small htdig.org. 25 May)
- Alpha Debian (lha. Makes check minus apache. Digs small htdig.org. 25 May)

To be tested:
- HP-UX 10.20, gcc 2.8.1 (Jesse)
- RedHat, other versions anyone?

Known to have problems:
- SGI/Irix 6.5.3 using SGI compilers <http://www.geocrawler.com/mail/msg.php3?msg_id=8025827&list=8825>

RELEASES:
3.2.0b5: Next release, July 2003
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.

(Please note that everything added here should have a tracker PR# so we can be sure they're fixed. Geoff is currently trying to add PR#s for what's currently here.)

SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295) -- Does Neal's new zlib patch solve this for now?

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress set but work fine without wordlist_compress. (the date is definitely stored correctly, even with compression on so this must be some sort of weird htsearch bug) PR#618737.
* META descriptions are somehow added to the database as FLAG_TITLE, not FLAG_DESCRIPTION. (PR#618738)
  Can anyone reproduce this? I can't! -- Lachlan
  Me either. Let's remove the PR. -Geoff

PENDING PATCHES (available but need work):
* Additional support for Win32. (Neal)
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.

NEEDED FEATURES:
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump and compare)
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen, argument handling for parser/converter, allowing binary output from an external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Document all of htsearch's mappings of input parameters to config attributes to template variables. (Relates to PR#405278.) Should we make sure these config attributes are all documented in defaults.cc, even if they're only set by input parameters and never in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and defaults.cc.
* require.html is not updated to list new features and disk space requirements of 3.2.x (e.g. regex matching, database compression.) PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and completions.
  I've tried. Someone "official" please check and remove this -- Lachlan
* Htfuzzy could use more documentation on what each fuzzy algorithm does. PR#405714.
* Document the list of all installed files and default locations. PR#405715.

OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
From: Jim C. <li...@yg...> - 2003-09-27 06:51:07
|
On Friday, September 26, 2003, at 11:40 AM, Jessica Biola wrote:
> In any case, I compiled gcc-3.3.1 to not be the main
> native compiler, but rather, into the prefix:
> /usr/gcc-3.3.1.
>
> Can someone help me with the correct CPP CXX
> LIBRARY_PATH compiler environment settings that I
> should be setting?

I would start by setting CC and CXX to the paths of the alternate gcc and g++ compilers, respectively. Often that is all that is needed.

If you run into problems with deprecated headers, see http://www.htdig.org/FAQ.html#q3.8. However, it is my understanding that this should no longer be an issue if you are using a current snapshot.

Jim
|
From: Jessica B. <jes...@ya...> - 2003-09-26 17:41:14
|
I just downloaded the latest snapshot and am trying to compile with gcc-3.3.1. I was using 2.95.3 but had some thoughts regarding better compiler optimizations after reading some posts by Neal and the responses (re: the String.cc changes he made).

In any case, I compiled gcc-3.3.1 to not be the main native compiler, but rather, into the prefix: /usr/gcc-3.3.1.

Can someone help me with the correct CPP CXX LIBRARY_PATH compiler environment settings that I should be setting?

Thanks,
-Jes
|
From: Frank L. <Fr...@am...> - 2003-09-26 17:09:10
|
Gentlemen;

I was wondering if your HTdig search engine can be modified just to search a page of my web site rather than the whole site. If this is possible, which I feel it should be, could you email me instructions on how to do it?

<<ht--Dig WWW Search.htm>>

Frank Leff
Office Manager
AMW Ougheltree & Associates
197 Cedar Lane
Teaneck NJ 07666
www.amwocorp.com
tel: 201-836-6257 x100
fax: 201-836-6258

Please note: If your fax machine has a flash key or subaddress key, you can send a fax directly into my email. Dial 201-836-8544; then press the flash key; then dial 100 and start.
|
From: Lachlan A. <lh...@us...> - 2003-09-26 13:56:05
|
Greetings Jesse,
Most frustrating... Out of interest, what happens if you type
cd test
cp /bin/true testnet
make check
? That should cause the failure of all tests which require testnet,
but at least it may let you run the other tests, or uncover other
bugs.
The reason I asked about shared libraries was that Gabriele's recent
upgrade of configure has fixed them on the Mac. Have you tried
since the upgrade?
Cheers,
Lachlan
On Tue, 23 Sep 2003 23:22, Jesse op den Brouw wrote:
> Lachlan Andrew wrote:
> >Out of interest, can you compile it using shared libraries?
>
> Still doesn't work.. Same error.
> Shared libs won't work on UX. Not for ages......
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Lachlan A. <lh...@us...> - 2003-09-26 11:39:58
|
That sounds like a good strategy. However, I'd vote for keeping that
until after the release of 3.2.0 (or at least 3.2.0b5!)

Should we perhaps start a new branch in CVS so that development can
continue? I have a couple of patches that I have been sitting on for
ages, because of the pending release.

Any news on when the release is likely to be, or what I can do to
expedite it? I plan to test and commit the patch which Jesse says
works on HP-UX this weekend, unless we're already in code freeze.

Cheers,
Lachlan

On Fri, 26 Sep 2003 04:40, Neal Richter wrote:
> it would take a
> walk of the docdb to accomplish a fix! We would need to tag every
> document as 'obsolete' and let the spiderer set the values back to
> 'normal' as they see the pages. After it's finished and htpurge is
> run, the 'lost' pages are killed.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
From: Gabriele B. <bar...@in...> - 2003-09-26 01:23:05
|
Hi!

At 20.59 24/09/2003 +0530, Sunil Raskar wrote:
> We are looking for reusable search code which has the features below.
> The search code must be configurable to search a set of static web pages
> located on a number of web sites which are hosted on a number of servers.
> It should have the capability to allow the user to search only certain subsets
> of the sites, for example to search only one of the counties served by the site.
> The search must be easily configurable so they can change the scope of the
> search as the network of sites served grows.
> Please let us know whether this can be implemented using your
> code (ht://Dig system) or not?

Yes ... ht://Dig answers all of your questions so far.

> If yes, can you provide us a demo version of it?

Well .... ht://Dig is probably the most used free search engine on the planet. Just take a look at the 'uses' section of the website and at some of the real uses of the system. Just click here: http://www.htdig.org/uses.html

> How much will the ht://Dig system cost us?

Absolutely nothing; of course, your time for getting to know it, and that's all. ht://Dig is open-source.

Ciao ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno
|
From: Neal R. <ne...@ri...> - 2003-09-25 18:41:47
|
On Tue, 23 Sep 2003, Lachlan Andrew wrote:
> On Sun, 21 Sep 2003 07:55, Neal Richter wrote:
> > I've got a fix for it.. a couple lines of code in the section that
> > builds the linked list of search results...
>
> That sounds great. If it checks the search results, I take it that it
> doesn't purge the pages from the database itself. What is the patch?
Oops, I misspoke... I don't have a fix for that.. it would take a walk
of the docdb to accomplish a fix! We would need to tag every document as
'obsolete' and let the spiderer set the values back to 'normal' as they
see the pages. After it's finished and htpurge is run, the 'lost' pages
are killed.
I do have a short fix for another related issue:
Parser::parse(...)
There is a bug in this for loop:
    for (int i = 0; i < elements->Count(); i++)
    {
      dm = (DocMatch *) (*elements)[i];
      dm->collection = collection; // back reference
      if (dm->orMatches > 1)
        dm->score *= multimatch_factor;
      resultMatches.add(dm);
    }
If the query returned any Documents with a DocState of !=
Reference_normal, they are included in the linked-list of results. They
are filtered out on display... this is a bit inefficient. It also screws
up libhtdig results since I don't use display.
Here's the fix; it excludes any document that is not Reference_normal from
the results list.
    for (int i = 0; i < elements->Count(); i++)
    {
      dm = (DocMatch *) (*elements)[i];
      ref = collection->getDocumentRef(dm->GetId());
      if (ref->DocState() == Reference_normal)
      {
        dm->collection = collection; // back reference
        if (dm->orMatches > 1)
          dm->score *= multimatch_factor;
        resultMatches.add(dm);
      }
    }
Thanks
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Jim C. <li...@yg...> - 2003-09-25 05:48:23
|
The problem is not unique to your configuration. I see the same thing even if I delete my copy of the code entirely and start over from scratch.

Jim

On Sunday, September 21, 2003, at 10:20 AM, Ted Stresen-Reuter wrote:
> Should the permissions thing be logged as a bug or is this a problem
> with my personal configuration of CVS?
>
> Ted Stresen-Reuter
>
> On Sunday, September 21, 2003, at 03:24 AM, Jim Cole wrote:
>
>> On Saturday, September 20, 2003, at 01:04 PM, Ted Stresen-Reuter
>> wrote:
>>
>>> Following the same instructions on Mac OS X produced the following
>>> output. Not sure if this is an error or not, though, because I don't
>>> know what all the tests are doing...
>>>
>>> creating url
>>> make MAKE="make" check-TESTS
>>> PASS: t_wordkey
>>> PASS: t_wordlist
>>> PASS: t_wordskip
>>> PASS: t_wordbitstream
>>> PASS: t_search
>>> PASS: t_htdb
>>> PASS: t_rdonly
>>> PASS: t_trunc
>>> ../test/test_prepare: /Users/tedsr/htdig/test/./t_url: Permission denied
>>> ../test/test_prepare: exec: /Users/tedsr/htdig/test/./t_url: cannot execute: Undefined error: 0
>>> FAIL: t_url
>>
>> This is due to the fact that the execute permissions on the t_url
>> script in the test directory are not being maintained. If you change
>> the permissions on that file (e.g. chmod 754 t_url) all tests
>> currently pass under OS X. I think something needs to be tweaked in
>> CVS to correct this problem.
>>
>> Jim
|
From: Sunil R. <su...@ma...> - 2003-09-24 05:45:01
|
Hi,

We are looking for reusable search code which has the features below.

The search code must be configurable to search a set of static web pages located on a number of web sites which are hosted on a number of servers. It should have the capability to allow the user to search only certain subsets of the sites, for example to search only one of the counties served by the site. The search must be easily configurable so they can change the scope of the search as the network of sites served grows.

Please let us know whether this can be implemented using your code (ht://Dig system) or not? If yes, can you provide us a demo version of it? How much will the ht://Dig system cost us?

Please let us know as soon as possible.

Thank you,
Regards,
Sunil Raskar
Project Leader
Nathan Ark Software Pvt. Ltd.
|
From: Jim C. <li...@yg...> - 2003-09-23 16:19:03
|
I ran into the same problem under OS X at one point. If I recall correctly, I was able to work around the problem by rearranging the ordering of the libraries. I don't recall the ordering that worked for me; it has been some time since this was an issue with my system.

Jim

On Tuesday, September 23, 2003, at 07:27 AM, Lachlan Andrew wrote:
> Thanks for that, Jesse. It looks like WordType::instance is
> definitely in there. Any luck with
>
> make check
> cd test
> g++ -g -O2 -Wall -fno-rtti -fno-exceptions -o testnet testnet.o
> -L/opt/htdig/lib/zlib/lib ../htnet/.libs/libhtnet.a
> ../htcommon/.libs/libcommon.a ../htword/.libs/libhtword.a
> ../db/.libs/libhtdb.a ../htlib/.libs/libht.a
> ../htword/.libs/libhtword.a -lz
> make check
>
> ?
>
> Cheers,
> Lachlan
>
> On Tue, 23 Sep 2003 23:18, Jesse op den Brouw wrote:
>> [msql@chaos htdig-3.2.0b4-20030914]$ nm */.libs/*.a | grep
>> WordType\$instance _8WordType$instance | |undef |data |
>> _8WordType$instance | |undef |data |
>> _8WordType$instance |1073741828|extern|data |$DATA$
>> [msql@chaos htdig-3.2.0b4-20030914]$
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)
|
From: Lachlan A. <lh...@us...> - 2003-09-23 13:30:15
|
Thanks for that, Jesse. It looks like WordType::instance is
definitely in there. Any luck with
make check
cd test
g++ -g -O2 -Wall -fno-rtti -fno-exceptions -o testnet testnet.o
-L/opt/htdig/lib/zlib/lib ../htnet/.libs/libhtnet.a
../htcommon/.libs/libcommon.a ../htword/.libs/libhtword.a
../db/.libs/libhtdb.a ../htlib/.libs/libht.a
../htword/.libs/libhtword.a -lz
make check
?
Cheers,
Lachlan
On Tue, 23 Sep 2003 23:18, Jesse op den Brouw wrote:
> [msql@chaos htdig-3.2.0b4-20030914]$ nm */.libs/*.a | grep
> WordType\$instance _8WordType$instance | |undef |data |
> _8WordType$instance | |undef |data |
> _8WordType$instance |1073741828|extern|data |$DATA$
> [msql@chaos htdig-3.2.0b4-20030914]$
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Lachlan A. <lh...@us...> - 2003-09-23 13:25:07
|
On Sun, 21 Sep 2003 07:55, Neal Richter wrote:
> I've got a fix for it.. a couple lines of code in the section that
> builds the linked list of search results...

That sounds great. If it checks the search results, I take it that it
doesn't purge the pages from the database itself. What is the patch?

Cheers,
Lachlan

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)