From: Geoff H. <ghu...@us...> - 2002-11-03 08:16:02
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b5: Next release, tentatively 1 Dec 2002.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
-- Does Neal's new zlib patch solve this for now?
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
|
From: Gilles D. <gr...@sc...> - 2002-11-03 02:29:20
|
According to Neal Richter:
> On Tue, 22 Oct 2002, Geoff Hutchison wrote:
> > On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote:
> >
> > > It looks to me like the db.words.db is using only a 'key' value, and has
> > > a blank 'value' for each and every key!
> >
> > Nope. Remember that "value" as it currently stands is the anchor--if
> > any. So if your documents don't have anchors defined, the value will
> > never be entered. The key needs to be unique, so it has a structure:
> >
> > word // DocID // flags // location -> anchor
>
> Could you give an example where the anchor won't be empty? I tried a
> couple things and couldn't get one.

"anchor" won't be empty if "word" occurs after a <a name="foo..."> tag.
The db.docdb record for DocID has a list of all anchor names encountered,
so "anchor" in the db.words.db record should be an index into that list.

> > > A more efficient solution to make the index smaller would be this:
> > > key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'
> >
> > OK, remember what I said a while ago about the data structure here.
> > (a) Yes, it's "brain dead" for a traditional inverted index--but a
> > traditional inverted index is done in 2 steps: index, then sort/invert.
> > You have different problems if you're creating a "live" database in one
> > step.
> > (b) B-Tree databases get pretty convoluted if you start growing the size
> > of the value record as you index: i.e. I add a word and then start
> > tacking on new DocIDs... This happens because the database essentially
> > becomes fragmented in the same way that files can get fragmented over a
> > hard drive.
>
> Now I understand what you meant ;-). I was confused since my conceptual
> model was using the value for word-num and flags.
>
> Would you explain your previous idea in more detail?

How much of this database fragmentation would be due to the fact that
there are records of different lengths, and how much would be due to
updating a given record from one length to a larger length? E.g., if
instead of having a whole bunch of entries like this...

    word // DocID // flags // location -> anchor

what if we had entries like this...

    word // DocID -> flags/location/anchor flags/location/anchor ...

but instead of making database updates each time another word is parsed
(as is done now in 3.2, if I'm not mistaken), how about if htdig stored
this information in memory as it did in 3.1, and then just dumped out all
the records like above after the whole document is parsed. That way, none
of the records ever have to be updated and lengthened. They're just
written once.

In an update dig, the DocID for a given URL changes anyway if the document
is reparsed, so we'd end up deleting all the ones with the old DocID and
adding some with the new one. Would this lead to too much fragmentation
due to the different record lengths? Am I oversimplifying things, or would
something like this help with htdig's performance while indexing?

> > So we take the above and think and performance test...
> > * While the Berkeley DB allows you to have duplicate keys, performance
> > suffers some. IIRC this happens b/c you get non-uniform trees.
> > * You want to store as much as possible in the value field, *but* need
> > to have a unique key.
>
> I think we can avoid duplicate keys with this scenario:
>
> ['people' is in a document 9 times]
>
> [Row 1]
> key = people\02\00\00\00\00\c9\00  value == \00\c9\00\cb\00\ce\00\d1
>
> [Row 2]
> key = people\02\00\00\00\00\d4\00  value == \00\d4\00\df\00\ee\00\00
>
> I'm not sure where it gets implemented, but a row isn't written until
> there are X references to it or until we change documents. This would
> involve a cache of rows in memory, which is probably in the classes
> somewhere.
>
> This avoids updating previously written rows and keeps the value size to
> a known quantity.

I think even optimizations like this become easier if we don't dump out
any of the db.words.db records until a document is fully parsed, and then
dump them all out at once. Am I wrong? I know that 3.2 is supposed to
allow indexing a live database on the fly, and still have it be
searchable, but that doesn't mean the DB needs to be updated a word at a
time. Doing it a document at a time should make sense, just as db.docdb
is updated.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
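A minimal sketch of the document-at-a-time idea above; the class and
method names are invented for illustration and this is not htdig's actual
WordList API:

    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical per-document cache: postings are collected in memory
    // while one document is parsed, then flushed as one record per
    // (word, DocID), so no existing record ever has to be grown in place.
    struct Posting { unsigned int flags; unsigned int location; };

    class DocPostingBuffer {
    public:
        void Add(const std::string &word, unsigned int flags, unsigned int location) {
            postings[word].push_back(Posting{flags, location});
        }
        // Called once when the document is fully parsed: each word produces
        // a single key "word // DocID" whose value is the whole posting list.
        template <typename PutFn>
        void Flush(unsigned int docID, PutFn put) {
            for (const auto &entry : postings)
                put(entry.first, docID, entry.second);   // one db->put per word
            postings.clear();
        }
    private:
        std::map<std::string, std::vector<Posting> > postings;
    };

The only point the sketch makes is that nothing is written for a word
until the document is closed, which is the "written once, never
lengthened" property discussed above.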
|
From: Neal R. <ne...@ri...> - 2002-11-02 01:11:13
|
Hey,

Here's a question motivated by my quest to make the WordDB faster:
word_db_cmp (which uses WordKey::Compare) is the function passed into BDB
as the compare function for the binary database tree used in our
db.words.db.

Do we REALLY need this function to go to the complexity of calling
'Unpack' and comparing the keys? Why not treat the entire key bitstream
after the word-string as a binary compare and return? Can you think of a
situation where this kind of high-speed comparison would screw something
up?

Basically it comes down to a question of: does the BDB internal binary
tree really need a 100% logically correct key comparison? Given that most
of the bits after the word/id section are zeros until the latter part,
which is the word-location... we REALLY don't need to be calling unpack.
Eh?

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
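A minimal sketch of what the raw comparison might look like, independent
of the exact comparator hook the bundled Berkeley DB expects; the function
name is invented, and it assumes the packed tail after the word sorts the
same way bytewise as it does numerically, which is exactly the property
that would need checking:

    #include <algorithm>
    #include <cstddef>
    #include <cstring>

    // Raw bytewise comparison of two packed WordDB keys, with no Unpack
    // step: the word text sorts first, then the packed tail (DocID, flags,
    // location) is compared as opaque bytes.
    static int raw_wordkey_compare(const unsigned char *k1, std::size_t l1,
                                   const unsigned char *k2, std::size_t l2)
    {
        std::size_t n = std::min(l1, l2);
        int r = std::memcmp(k1, k2, n);
        if (r != 0)
            return r;
        if (l1 == l2)
            return 0;
        return l1 < l2 ? -1 : 1;      // identical prefix: shorter key first
    }

Whether this can safely replace WordKey::Compare depends on HtPack writing
the numeric fields in a sort-preserving (most-significant-byte-first)
order; a dig/search round trip with and without it would be the quick way
to find out.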
|
From: Neal R. <ne...@ri...> - 2002-10-31 21:54:28
|
Hey all,

I hacked up a db_stat.c from BDB 3.0.55 to link against libhtdb.a and
support zlib WordDB page compression. It does NOT support the Mifluz page
compression, nor does it pay attention to your htdig.conf.

Here's some documentation from Sleepycat:
http://www.sleepycat.com/docs/utility/db_stat.html

This utility allows you to look at the stats for your WordDB database.

Get it here: http://ai.rightnow.com/htdig/

Compile:
gcc -o db_stat db_stat.c .libs/libhtdb.a -lz

Example Output:
[xxx]$ ./db_stat -z -m -d db.words.db
53162    Btree magic number.
7        Btree version number.
         Flags:
2        Minimum keys per-page.
32768    Underlying database page size.
3        Number of levels in the tree.
7733646  Number of keys in the tree.
17       Number of tree internal pages.
133424   Number of bytes free in tree internal pages (76% ff).
12491    Number of tree leaf pages.
183M     Number of bytes free in tree leaf pages (55% ff).
0        Number of tree duplicate pages.
0        Number of bytes free in tree duplicate pages (0% ff).
0        Number of tree overflow pages.
0        Number of bytes free in tree overflow pages (0% ff).
0        Number of pages on the free list.

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Gilles D. <gr...@sc...> - 2002-10-30 20:57:07
|
According to Jim Cole:
> On Friday, October 18, 2002, at 02:12 PM, Gilles Detillieux wrote:
> > Well, since you're offering, I wouldn't mind a hand in going through
> > the 3.1.6 ChangeLog to see what other changes need to go in 3.2 still.
> > I'd like to add to the list above to make it a complete list of 3.1.x
> > things still needed for 3.2.
> >
> > If that's not too unappealing a task to start with, that would be a
> > good
>
> I am not too picky :) I will help where needed. Did you have a
> particular approach in mind? Is it safe to compare ChangeLog files, or
> should entries in the 3.1.6 ChangeLog be checked against the 3.2
> source? Have you already checked part of the ChangeLog? How far back do
> we need to go? Other suggestions/advice?

Sorry for taking so long to reply. I lost track of your message in the
midst of the deluge.

Comparing ChangeLog entries between 3.1.6 and the 3.2 CVS would be the
first step, and would find most of the missing stuff. Note that there may
be differences in wording between the two, especially if someone other
than me made the entry in the 3.2 ChangeLog. If you can't find anything
close in the 3.2 ChangeLog to a given 3.1.6 entry, then comparing the
specific changes in the source against the 3.2 source would be the next
step. I'd be glad to answer any questions you have at that stage,
including punting a few ChangeLog entries my way for my verification.

Potentially all entries since 3.1.5 was released would need to be
checked. That may seem like a lot, given that 3.1.5 was released almost 2
full years before 3.1.6, but the CVS tree for 3.1.x was dormant for a
long time after 3.1.5.

Thanks.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Budd, S. <s....@ic...> - 2002-10-28 14:44:25
|
Hello,
I am working with gcc 2.95.2 on Solaris 8.
In identical environments the 0929 snapshot configures fine,
whereas the 1027 configure fails. First signs are about cache and /dev/null.
Chopped-up configure output follows.
Having some problems configuring the 1027 and 1020 snapshots, with problems
re cache to /dev/null. See last few lines below:
htdig-3.2.0b4-20021027 ....
near top of config output:
checking for gcc option to accept ANSI C... none needed
checking for ld used by GCC... /usr/local/sparc-sun-solaris2.8/bin/ld
checking if the linker (/usr/local/sparc-sun-solaris2.8/bin/ld) is GNU ld...
yes
checking for BSD-compatible nm... /usr/local/bin/nm -B
checking whether ln -s works... yes
loading cache /dev/null within ltconfig
./ltconfig: .: /dev/null: not a regular file
......
which later leads to the disasters .....
checking whether make sets ${MAKE}... (cached) yes
checking host system type... Invalid configuration `CC=gcc': machine
`CC=gcc' not recognized
checking build system type... Invalid configuration `CC=gcc': machine
`CC=gcc' not recognized
.....
checking for BSD-compatible nm... /usr/local/bin/nm -B
checking whether ln -s works... yes
updating cache /dev/null
loading cache /dev/null within ltconfig
./../ltconfig: .: /dev/null: not a regular file
ltconfig: you must specify a host type if you use `--no-verify'
Try `ltconfig --help' for more information.
configure: error: libtool configure failed
configure: error: /bin/bash './configure' failed for db
END OF CONFIGURE
With htdig-3.2.0b4-20020929 the configure process gives proper results:
checking whether gcc accepts -g... (cached) yes
checking for ld used by GCC... (cached)
/usr/local/sparc-sun-solaris2.8/bin/ld
checking if the linker (/usr/local/sparc-sun-solaris2.8/bin/ld) is GNU ld...
(cached) yes
checking for BSD-compatible nm... (cached) /usr/local/bin/nm -B
checking whether ln -s works... (cached) yes
updating cache ./config.cache
loading cache ./config.cache within ltconfig
|
|
From: Martin <mar...@un...> - 2002-10-27 15:59:20
|
On Sat, Oct 26, 2002 at 10:06:12PM -0500, Geoff Hutchison wrote:
> Instead, I'd suggest using the SourceForge bug tracker for ht://Dig
> http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse

OK, I've tried to avoid it ;-) If this doesn't get resolved soon I will
submit it.

> > I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> > well but after upgrading to 3.2.0b4-072201 it broke. The cached
> > pages are under "/search/index" directory and "/index" is
> > disallowed. You can see that 3.2.0b rejects "/search/index" in
> > debug output:
>
> Yes. I can't see anything in particular that would have solved this
> in the meantime (which surprises me since I seem to remember this
> before). For my own benefit, could you confirm that it fails for you
> on the current snapshot?

Hm, I've got sucking slow and expensive dialup here (Czech Republic,
monopolistic phone operator ... you know) so I would like to avoid
downloading an extra 2MB ...

Back to topic - I've got the ht://Dig 3.2.0b4-072201 source code here and
I tried to fix it after some short time of looking at the code. See the
attachment and review it, because I'm not too familiar with htdig code
internals; this is just a quick-try-hack, but it seems to be working
here, though not heavily tested...

By the way, I think that using regular expressions here is way too big a
hammer for this simple task (i.e. just for testing if one string is equal
to or just an extension of another). Robots.txt is not defined to contain
regular expressions, but htdig handles disallow lines as if they are
regexps. Are you sure that won't cause any problems if somebody puts some
"weird" characters in it?

Thanks for your reply and have a nice day

--
Martin Mačok                  http://underground.cz/
mar...@un...                  http://Xtrmntr.org/ORBman/
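For reference, the comparison the robots.txt spec asks for is just a
prefix test, roughly like the following sketch (invented function name;
this is not the patch attached above):

    #include <string>
    #include <vector>

    // A URL path is excluded only if it *starts with* one of the Disallow
    // values, per http://www.robotstxt.org/wc/norobots.html.
    static bool forbidden_by_robots(const std::string &path,
                                    const std::vector<std::string> &disallow)
    {
        for (const auto &prefix : disallow) {
            if (prefix.empty())
                continue;                  // "Disallow:" with no value allows everything
            if (path.compare(0, prefix.size(), prefix) == 0)
                return true;               // "/index" blocks "/index/foo", not "/search/index"
        }
        return false;
    }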
|
From: Geoff H. <ghu...@us...> - 2002-10-27 15:52:38
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b5: Next release, tentatively 1 Dec 2002.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
-- Does Neal's new zlib patch solve this for now?
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
|
From: Jim C. <gre...@yg...> - 2002-10-27 11:39:54
|
On Friday, October 18, 2002, at 02:12 PM, Gilles Detillieux wrote:
> Well, since you're offering, I wouldn't mind a hand in going through
> the 3.1.6 ChangeLog to see what other changes need to go in 3.2 still.
> I'd like to add to the list above to make it a complete list of 3.1.x
> things still needed for 3.2.
>
> If that's not too unappealing a task to start with, that would be a
> good

I am not too picky :) I will help where needed. Did you have a particular
approach in mind? Is it safe to compare ChangeLog files, or should
entries in the 3.1.6 ChangeLog be checked against the 3.2 source? Have
you already checked part of the ChangeLog? How far back do we need to go?
Other suggestions/advice?

Jim
|
From: Geoff H. <ghu...@ws...> - 2002-10-27 03:06:35
|
> Since it was bounced and after resend I got no replies for 10 days,
> I'm trying to post it to dev@ list...

As is noted on the website, this is not the appropriate way to submit bug
reports anymore. Unfortunately, we don't have our own server, so there's
no way to post a bug report by e-mail. Instead, I'd suggest using the
SourceForge bug tracker for ht://Dig:
http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse

> I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> well but after upgrading to 3.2.0b4-072201 it broke. The cached pages
> are under "/search/index" directory and "/index" is disallowed. You
> can see that 3.2.0b rejects "/search/index" in debug output:

Yes. I can't see anything in particular that would have solved this in
the meantime (which surprises me since I seem to remember this before).
For my own benefit, could you confirm that it fails for you on the
current snapshot? <http://www.htdig.org/files/snapshots/>

I believe the problem is that the robots.txt handling (incorrectly)
switched to using regex patterns in the 3.2 code, and of course a pattern
of "/index" as a regex would match both.

Thanks very much,
-Geoff
|
From: Geoff H. <ghu...@ws...> - 2002-10-27 03:02:07
|
> it? This patch also contains the patches I've submitted
> earlier, since I can't find a snapshot which incorporates
> them. (I realise that you are very busy...)

I apologize--I think somehow I missed them. I did inspect them tonight
and am integrating them. I *think* they should make this next snapshot.

> Yes, I think that Gilles' suggestion is excellent.

Ditto. 3.2.0b5 it is. As will be noted in the STATUS file, I think we'll
stick to snapshots called "3.2.0b4" until we actually get to a
pre-release.

-Geoff
|
From: Lachlan A. <lh...@ee...> - 2002-10-26 08:26:26
|
Greetings Geoff,

Here is a patch, relative to 20020922, which turns on URL testing as part
of the test suite. Could you please apply it? This patch also contains
the patches I've submitted earlier, since I can't find a snapshot which
incorporates them. (I realise that you are very busy...)

<http://www.ee.mu.oz.au/staff/lha/patch.URL>
<http://www.ee.mu.oz.au/staff/lha/changelog.URL>

In doing this, I found and fixed a couple of bugs in URL.cc. These are
noted in the changelog.URL file. It also contains the two-line fix for
PR#405294 that I mentioned earlier. (I know it isn't 'good practice' to
put too many patches together, but this one is harmless...)

> ... 18 months
> worth of 3.2.0b4 snapshots out there, incorporated into
> RPMs and such. When a bug report mentions 3.2.0b4, will
> we be able to trust that it's actually the official
> 3.2.0b4 release? Would it be helpful to skip b4 and jump
> to b5?

Yes, I think that Gilles' suggestion is excellent.

Cheers,
Lachlan

--
Lachlan Andrew    Phone: +613 8344-3816    Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA    00116K
|
From: Scott G. <sgi...@su...> - 2002-10-24 04:17:45
|
"Joe R. Jah" <jj...@cl...> writes:

[...]

> There are no noindex_* lines in my configuration,

Then why are you using the patch?...

----ScottG.
|
From: Neal R. <ne...@ri...> - 2002-10-23 21:36:49
|
Foot in Mouth: I didn't mean anything with the "brain dead" remark...
poor choice of words. Sorry!

On Tue, 22 Oct 2002, Geoff Hutchison wrote:
> On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote:
>
> > It looks to me like the db.words.db is using only a 'key' value, and has
> > a blank 'value' for each and every key!
>
> Nope. Remember that "value" as it currently stands is the anchor--if
> any. So if your documents don't have anchors defined, the value will
> never be entered. The key needs to be unique, so it has a structure:
>
> word // DocID // flags // location -> anchor

Could you give an example where the anchor won't be empty? I tried a
couple things and couldn't get one.

> > A more efficient solution to make the index smaller would be this:
> > key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'
>
> OK, remember what I said a while ago about the data structure here.
> (a) Yes, it's "brain dead" for a traditional inverted index--but a
> traditional inverted index is done in 2 steps: index, then sort/invert.
> You have different problems if you're creating a "live" database in one
> step.
> (b) B-Tree databases get pretty convoluted if you start growing the size
> of the value record as you index: i.e. I add a word and then start
> tacking on new DocIDs... This happens because the database essentially
> becomes fragmented in the same way that files can get fragmented over a
> hard drive.

Now I understand what you meant ;-). I was confused since my conceptual
model was using the value for word-num and flags.

Would you explain your previous idea in more detail?

> So we take the above and think and performance test...
> * While the Berkeley DB allows you to have duplicate keys, performance
> suffers some. IIRC this happens b/c you get non-uniform trees.
> * You want to store as much as possible in the value field, *but* need
> to have a unique key.

I think we can avoid duplicate keys with this scenario:

['people' is in a document 9 times]

[Row 1]
key = people\02\00\00\00\00\c9\00  value == \00\c9\00\cb\00\ce\00\d1

[Row 2]
key = people\02\00\00\00\00\d4\00  value == \00\d4\00\df\00\ee\00\00

I'm not sure where it gets implemented, but a row isn't written until
there are X references to it or until we change documents. This would
involve a cache of rows in memory, which is probably in the classes
somewhere.

This avoids updating previously written rows and keeps the value size to
a known quantity.

> I'd be glad to go into more depth into why the current database
> structure has been set up this way. There is probably a better
> structure, like I outlined a few weeks ago. But you simply can't build
> an on-line inverted word index in one fell swoop and use a traditional
> design. (Google doesn't do it, the current 3.2 code has never attempted
> to do it...)
> If you'd like, I can probably pull up some of the original message
> threads.

I need to read this stuff... is it in the archive? Just point me in the
right direction.

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Joe R. J. <jj...@cl...> - 2002-10-23 04:30:06
|
On 19 Oct 2002, Scott Gifford wrote:
> Date: 19 Oct 2002 20:02:47 -0400
> From: Scott Gifford <sgi...@su...>
> To: Joe R. Jah <jj...@cl...>
> Cc: htd...@li...
> Subject: Re: [htdig-dev] Updated Patch for supporting multiple
noindex_start/noindex_end
> > > You can get the new version at:
> > >
> > > http://www.suspectclass.com/~sgifford/htdig/htdig-3.1.6-multiple-noindex.patch
> >
> > FYI, I applied the patch, but:
>
> That looks like the sort of crash the newer version of this patch
> should fix. Look in the code and make sure that on the lines where
> skip_start and skip_end are being set, the number are all typed in
> correctly (subscripts to skip_start should go consecutively from to
> 10, and ditto for skip_end). Otherwise, you've got the old version of
> the patch; get a fresh copy of htdig and re-apply the latest version
> of the patch.
I applied the latest version, from the above URL.
> If it still crashes, show me what lines you've got for noindex_* in
> your configuration, and at the crash look at what i is when htdig
> crashes, and what skip_start[i] and skip_end[i] are.
There are no noindex_* lines in my configuration, and at the crash htdig
-vvvv does not have any indication of what i is, or what skip_start[i] and
skip_end[i] are. These are all it reports after picking the server and
starting on the first local document:
---------------8<---------------
Read 3131 from document
Read a total of 3131 bytes
Segmentation fault - core dumped
---------------8<---------------
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-23 04:11:57
|
On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote:
> It looks to me like the db.words.db is using only a 'key' value, and has
> a blank 'value' for each and every key!

Nope. Remember that "value" as it currently stands is the anchor--if any.
So if your documents don't have anchors defined, the value will never be
entered. The key needs to be unique, so it has a structure:

word // DocID // flags // location -> anchor

> A more efficient solution to make the index smaller would be this:
> key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'

OK, remember what I said a while ago about the data structure here.
(a) Yes, it's "brain dead" for a traditional inverted index--but a
traditional inverted index is done in 2 steps: index, then sort/invert.
You have different problems if you're creating a "live" database in one
step.
(b) B-Tree databases get pretty convoluted if you start growing the size
of the value record as you index: i.e. I add a word and then start
tacking on new DocIDs... This happens because the database essentially
becomes fragmented in the same way that files can get fragmented over a
hard drive.

So we take the above and think and performance test...
* While the Berkeley DB allows you to have duplicate keys, performance
suffers some. IIRC this happens b/c you get non-uniform trees.
* You want to store as much as possible in the value field, *but* need
to have a unique key.

OK, you could probably shove "flags" into the value, like so--but I'm not
sure if that'll make a huge difference:

word // DocID // location -> flags // anchor

I'd be glad to go into more depth into why the current database structure
has been set up this way. There is probably a better structure, like I
outlined a few weeks ago. But you simply can't build an on-line inverted
word index in one fell swoop and use a traditional design. (Google
doesn't do it, the current 3.2 code has never attempted to do it...)

If you'd like, I can probably pull up some of the original message
threads.

-Geoff
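A schematic of the two layouts being weighed here, with invented field
widths and separator byte; the real HtPack encoding differs, this is only
to make the key-uniqueness point concrete:

    #include <cstdint>
    #include <string>

    static const char kSep = '\002';   // stand-in separator; the real format may differ

    // big-endian so that bytewise key order matches numeric order
    static void append_be32(std::string &s, uint32_t v) {
        for (int shift = 24; shift >= 0; shift -= 8)
            s.push_back(static_cast<char>((v >> shift) & 0xff));
    }

    // Current layout:   key = word // DocID // flags // location, value = anchor
    static std::string make_key_current(const std::string &word, uint32_t docID,
                                        uint32_t flags, uint32_t location) {
        std::string k = word; k.push_back(kSep);
        append_be32(k, docID); append_be32(k, flags); append_be32(k, location);
        return k;            // unique per occurrence; the value holds only the anchor
    }

    // Variant discussed: key = word // DocID // location, value = flags // anchor
    static std::string make_key_flags_in_value(const std::string &word, uint32_t docID,
                                               uint32_t location) {
        std::string k = word; k.push_back(kSep);
        append_be32(k, docID); append_be32(k, location);
        return k;            // still unique; flags move into the value with the anchor
    }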
|
From: Neal R. <ne...@ri...> - 2002-10-22 19:51:40
|
Hey,

So here's a very weird observation. It looks to me like the db.words.db
is using only a 'key' value, and has a blank 'value' for each and every
key!

How did I find this?
1. I indexed a single web-page consisting of the 'Gettysburg Address',
   250+ words.
2. I added printfs to htPack/htUnpack & WordDB:Get & WordDB:Put

This is what I see during an htdig:

HtPack [] db->put key=[people] data=[] flags=[0]
HtPack [] db->put key=[people] data=[] flags=[0]
HtPack [] db->put key=[people] data=[] flags=[0]
HtPack [] db->put key=[perish] data=[] flags=[0]
HtPack [] db->put key=[earth] data=[] flags=[0]
HtPack [] db->put key=[abraham] data=[] flags=[0]
HtPack [] db->put key=[lincoln] data=[] flags=[0]

Nothing in the data-value! This seems to contradict (in spirit) what's in
the db.worddump produced by htdump!

I also downloaded and built the 3.0.55 BDB and used the db_dump utility
to dump the db.words.db. This is what I get (the first line is a key, the
following line is the value):

%db_dump_3055 -pk db.words.db
people\02\00\00\00\00\c9\00
\00
people\02\00\00\00\00\cb\00
\00
people\02\00\00\00\00\ce\00
\00
perish\02\00\00\00\00\d1\00
\00
poor\02\00\00\00\00j\00
\00
portion\02\00\00\00\009\00
\00
power\02\00\00\00\00k\00
\00

Note the Zeros in the VALUE!!

Here's the relevant entries in db.worddump:

people 2 0 201 0
people 2 0 203 0
people 2 0 206 0
perish 2 0 209 0
poor 2 0 106 0

c9 = 201
cb = 203
ce = 206

This is brain dead for an inverted index! It should at least be

key = 'people\02', value = '00\00\00\00\c9\00'

A more efficient solution to make the index smaller would be this:

key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'

Eh?

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Martin <mar...@un...> - 2002-10-22 16:08:24
|
Since it was bounced and after resend I got no replies for 10 days, I'm
trying to post it to dev@ list...

----- Forwarded message from Martin Mačok <mar...@un...> -----

Date: Thu, 10 Oct 2002 09:27:13 +0200
From: Martin Mačok <mar...@un...>
To: htd...@ht...
Subject: robots.txt URL matching (OK in 3.1.x, bad in 3.2.0b)

Hi,

I've (probably) found a bug (with a little help from wwwoffle author
"Andrew M. Bishop" <amb(at)gedanken.demon.co.uk>) in ht://Dig
3.2.0b4-072201 (from Mandrake package) in robots.txt URL matching.

When you disallow "/foo", htdig then rejects "/bar/foo" but according to
http://www.robotstxt.org/wc/norobots.html it should reject only URLs
_starting_ with (not just containing) disallowed string.

I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked well
but after upgrading to 3.2.0b4-072201 it broke. The cached pages are
under "/search/index" directory and "/index" is disallowed. You can see
that 3.2.0b rejects "/search/index" in debug output:

-------------------
Robots.txt line: Disallow: /index
Found 'disallow' line: /index
Pattern: /control|/configuration|/refresh|/monitor|/index
[...]
pushing http://localhost:8080/search/start3.html
+href: http://localhost:8080/search/index/ (The WWWOFFLE searchable index of all cached web pages)
Rejected: forbidden by server robots.txt!
-------------------

I'm sorry for not sending a patch, I'm offline now and don't have the
sources on my hdd (and dialup is expensive here through the day) but I
think that it should be trivial to fix.

Thanks a lot and have a nice day

--
Martin Mačok                  http://underground.cz/
mar...@un...                  http://Xtrmntr.org/ORBman/
Reclaim your rights! - http://www.digitalspeech.org/

----- End forwarded message -----

--
Martin Mačok
|
From: Brian W. <bw...@st...> - 2002-10-22 07:38:54
|
Ok. I have the first cut of the defaults.xml patch.
It is all in the attached file - it is patched against
htdig-3.2.0b4-20021013.
The remainder of this email is the README file from the
tar file.
Regs
Brian White
===============================================================
Documentation attached to initial version of defaults.xml patch
===============================================================
1. Overview of what it does
====================================
* Adds defaults.xml and defaults.dtd
plus manage_attributes.pl for managing access
to them
* Addition of make_defaults_cc.pl for creating
defaults.cc
* Complete rewrite of the cf_generate.pl that
creates attrs.html, cf_byprog.html and
cf_byname.html
* Reducing the size of the ConfigDefaults
structure to just have "name" and "value"
The version of defaults.xml as it exists in
this patch is valid against defaults.dtd.
The patch is done against htdig-3.2.0b4-20021013
2. Affected Files
====================================
New Files:
* htcommon/defaults.dtd
* htcommon/manage_attributes.pl
* htcommon/make_defaults_cc.pl
Replaced Files:
* htcommon/defaults.xml
* htdoc/cf_generate.pl
(Note that most of the patches are only 1 or 2 lines
- the biggest is probably about 10 )
Patched Files:
* htcommon/Makefile.am.patch
* htdoc/Makefile.am.patch
* htdoc/attrs_head.html.patch
* htdoc/attrs_tail.html.patch
* htdoc/cf_byname_head.html.patch
* htdoc/cf_byprog_head.html.patch
* htlib/Configuration.h.patch
* htdb/htdb_dump.cc.patch
* htdb/htdb_load.cc.patch
* htdb/htdb_stat.cc.patch
Files to removed from CVS
* defaults.cc
3. Creating Descriptions
====================================
The description is essentially a html
snippet, with the following differences
* It is limited to p,br,ol,ul,table,em,
strong,code and a elements, with
two additions:
1) <!ELEMENT codeblock (%paratext;)* >
This is used to provide block code or html
snippets. An example of this would be
<codeblock>
<SELECT NAME="search_algorithm">
<OPTION VALUE="exact:1 prefix:0.6 synonyms:0.5 endings:0.1"
SELECTED>fuzzy
<OPTION VALUE="exact:1">exact
</SELECT>
</codeblock>
2) <!ELEMENT ref (#PCDATA) >
<!ATTLIST ref type (program|attr|faq) #REQUIRED >
This is used to link to programs, faqs and other attributes.
Some examples are:
<ref type="attr">build_select_lists</ref>
<ref type="program">htdig</ref>
<ref type="faq">4.1</ref>
The purpose of doing this is to allow the info
to be reused, and remove the dependency
on html files in a particular place.
* The only allowed attributes in the description
are:
table : border, width
td,th : align, valign, rowspan, colspan
dl : compact="true"
4. A Discussion of XML Validation
====================================
Ideally the code should validate the XML
against the DTD, and should check for
well formedness. Unfortunately that requires
an XML parser, and I did not want to add
an extra dependency at this stage!
What I did as a compromise was to create
an API that is used to load and then
query the XML data - this API is
documented and implemented in
htcommon/manage_attributes.pl.
At the moment the internal data structures
are populated using standard perl pattern
matching - it *assumes* that defaults.xml
is valid against defaults.dtd, but is does
not test it.
The aim is that when an XML parser is
readily available, that can be used
to populate the internal data structures -
and everything else should just work!
5. The Current state of defaults.xml
====================================
The version of defaults.xml that is
presented in this patch is valid
against defaults.dtd, but is
desperately in need of a cleanup.
However:
a) I don't have time to clean it up at the moment
b) It is currently completely generated from the old
defaults.cc, which makes it easier to adjust
until it's form becomes stable
What I would like to do is get the patch
in place and once it has stablized
embark on a bit of a cleanup
6. Possible Issues
====================================
* Examples are just entered as values
- the "name : " before are now
automatically generated. This may seem
limiting, but it is exactly what is in
the docs at the moment.
* There are a few remaining links to
other parts of the documentation that
I have left a <a> elements, simply
because I couldn't see an obvious way
to include them
|
|
From: Geoff H. <ghu...@us...> - 2002-10-20 07:14:13
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
(mifluz merge essentially finished, contact Geoff for patch to test)
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
* The new mifluz merged code is slow.
(no PR, Geoff hasn't added mifluz-merge to CVS yet.)
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#405294. (The documentation for 3.2.0b1 was updated, but can
we fix this?)
(More importantly, do we ever want exact to /not/ be specified?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch. PR#405765.
* URL.cc tries to parse malformed URLs (which can cause further problems)
(It should probably just set everything to empty).
|
|
From: Scott G. <sgi...@su...> - 2002-10-20 00:03:55
|
"Joe R. Jah" <jj...@cl...> writes:

> On 8 Jul 2002, Scott Gifford wrote:
>
> > Date: 08 Jul 2002 04:08:00 -0500
> > From: Scott Gifford <sgi...@su...>
> > To: htd...@li..., htd...@li...
> > Subject: [htdig-dev] Updated Patch for supporting multiple
> noindex_start/noindex_end
> >
> > My multiple noindex_start/noindex_end has been updated. The old one
> > had some typos that I didn't notice, which caused crashes on some
> > systems.
> >
> > You can get the new version at:
> >
> > http://www.suspectclass.com/~sgifford/htdig/htdig-3.1.6-multiple-noindex.patch
>
> FYI, I applied the patch, but:

That looks like the sort of crash the newer version of this patch should
fix. Look in the code and make sure that on the lines where skip_start
and skip_end are being set, the numbers are all typed in correctly
(subscripts to skip_start should go consecutively from to 10, and ditto
for skip_end). Otherwise, you've got the old version of the patch; get a
fresh copy of htdig and re-apply the latest version of the patch.

If it still crashes, show me what lines you've got for noindex_* in your
configuration, and at the crash look at what i is when htdig crashes, and
what skip_start[i] and skip_end[i] are.

Good luck,

---ScottG.

> ---------------------------8<---------------------------
> gdb htdig htdig.core
> GNU gdb
> This GDB was configured as "i386-unknown-bsdi4.3"...
> Core was generated by `htdig'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib/libz.so...done.
> Reading symbols from /usr/lib/libstdc++.so.1...done.
> Reading symbols from /shlib/libgcc.so.1...done.
> Reading symbols from /shlib/libc.so.2...done.
> Reading symbols from /shlib/ld-bsdi.so...done.
> #0  0x80585ce in HTML::parse (this=0x86a2200, retriever=@0x8047a80, baseURL=@0x8224680)
>     at HTML.cc:186
> 186         skip_end_len[i] = strlen(skip_end[i]);
> (gdb) bt
> #0  0x80585ce in HTML::parse (this=0x86a2200, retriever=@0x8047a80, baseURL=@0x8224680)
>     at HTML.cc:186
> #1  0x805cfa4 in Retriever::RetrievedDocument (this=0x8047a80, doc=@0x8224500, ref=0x8692000)
>     at Retriever.cc:577
> #2  0x805cb29 in Retriever::parse_url (this=0x8047a80, urlRef=@0x81c8780) at Retriever.cc:473
> #3  0x805c3c5 in Retriever::Start (this=0x8047a80) at Retriever.cc:292
> #4  0x80629e6 in main (ac=6, av=0x8047c7c) at htdig.cc:338
> #5  0x8055853 in __start ()
> (gdb) q
> ---------------------------8<---------------------------
>
> Regards,
>
> Joe
> --
>  _/       _/_/_/    _/        ____________   __o
> _/       _/    _/  _/        ______________  _-\<,_
> _/      _/_/_/    _/    _/            ......(_)/ (_)
> _/_/ oe _/     _/. _/_/ ah  jj...@cl...
|
From: Joe R. J. <jj...@cl...> - 2002-10-19 06:19:59
|
On 8 Jul 2002, Scott Gifford wrote:
> Date: 08 Jul 2002 04:08:00 -0500
> From: Scott Gifford <sgi...@su...>
> To: htd...@li..., htd...@li...
> Subject: [htdig-dev] Updated Patch for supporting multiple
noindex_start/noindex_end
>
> My multiple noindex_start/noindex_end has been updated. The old one
> had some typos that I didn't notice, which caused crashes on some
> systems.
>
> You can get the new version at:
>
> http://www.suspectclass.com/~sgifford/htdig/htdig-3.1.6-multiple-noindex.patch
FYI, I applied the patch, but:
---------------------------8<---------------------------
gdb htdig htdig.core
GNU gdb
This GDB was configured as "i386-unknown-bsdi4.3"...
Core was generated by `htdig'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libz.so...done.
Reading symbols from /usr/lib/libstdc++.so.1...done.
Reading symbols from /shlib/libgcc.so.1...done.
Reading symbols from /shlib/libc.so.2...done.
Reading symbols from /shlib/ld-bsdi.so...done.
#0 0x80585ce in HTML::parse (this=0x86a2200, retriever=@0x8047a80, baseURL=@0x8224680)
at HTML.cc:186
186 skip_end_len[i] = strlen(skip_end[i]);
(gdb) bt
#0 0x80585ce in HTML::parse (this=0x86a2200, retriever=@0x8047a80, baseURL=@0x8224680)
at HTML.cc:186
#1 0x805cfa4 in Retriever::RetrievedDocument (this=0x8047a80, doc=@0x8224500, ref=0x8692000)
at Retriever.cc:577
#2 0x805cb29 in Retriever::parse_url (this=0x8047a80, urlRef=@0x81c8780) at Retriever.cc:473
#3 0x805c3c5 in Retriever::Start (this=0x8047a80) at Retriever.cc:292
#4 0x80629e6 in main (ac=6, av=0x8047c7c) at htdig.cc:338
#5 0x8055853 in __start ()
(gdb) q
---------------------------8<---------------------------
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
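A rough sketch of the "handle noindex_start & noindex_end as string
lists" item on the 3.2 TODO, which would avoid the fixed-size
skip_start[]/skip_end[] arrays and hand-numbered subscripts involved in
this crash; the class and names are invented for illustration, and the
parsing of the config values is not htdig's actual string-list classes:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical pairing of noindex_start / noindex_end values: both
    // attributes are split into lists and start[i] is matched with end[i],
    // so there is nothing to mistype and nothing to overflow.
    struct SkipMarkers {
        std::vector<std::string> start;
        std::vector<std::string> end;

        // Returns true if some start marker begins at offset `pos` of `text`
        // (pos must be <= text.size()); `which` receives its index so the
        // caller knows which end marker to scan for.
        bool starts_skip_region(const std::string &text, std::size_t pos,
                                std::size_t &which) const {
            std::size_t n = std::min(start.size(), end.size()); // ignore unpaired extras
            for (std::size_t i = 0; i < n; ++i) {
                if (!start[i].empty() &&
                    text.compare(pos, start[i].size(), start[i]) == 0) {
                    which = i;
                    return true;
                }
            }
            return false;
        }
    };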
|
|
From: Neal R. <ne...@ri...> - 2002-10-18 20:35:48
|
Hey,
Go to http://ai.rightnow.com/htdig/ for a link to a patch to
htdig-3.2.0b4-20021013 that adds zlib-based WordDB compression.
This patch is a workaround for the WordDB compression errors we are
seeing in current snapshots.
It adds a new config option 'wordlist_compress_zlib' that it true by
default. Not also that this feature uses 'compression_level' as a
parameter for zlib compression, which is used to compress the excerpts.
Thanks!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
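For anyone unfamiliar with the zlib side of this, the calls such a page
hook presumably builds on look roughly like the following sketch; the
function names and the fallback behaviour are invented, and this is not
the patch itself:

    #include <cstddef>
    #include <vector>
    #include <zlib.h>

    // Compress one page-sized buffer with the configured compression_level.
    static std::vector<unsigned char> deflate_page(const std::vector<unsigned char> &page,
                                                   int compression_level)
    {
        // worst-case bound documented for zlib's compress(): source + 0.1% + 12
        uLongf out_len = page.size() + page.size() / 1000 + 64;
        std::vector<unsigned char> out(out_len);
        if (compress2(out.data(), &out_len, page.data(), page.size(),
                      compression_level) != Z_OK)
            return page;                 // fall back to storing the page as-is
        out.resize(out_len);
        return out;
    }

    // Expand a compressed blob back to a page; the original page size must
    // be known (e.g. stored alongside the blob or fixed by the DB config).
    static std::vector<unsigned char> inflate_page(const std::vector<unsigned char> &blob,
                                                   std::size_t page_size)
    {
        uLongf out_len = page_size;
        std::vector<unsigned char> out(out_len);
        if (uncompress(out.data(), &out_len, blob.data(), blob.size()) != Z_OK)
            out.clear();                 // caller treats an empty result as an error
        else
            out.resize(out_len);
        return out;
    }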
|
|
From: Gilles D. <gr...@sc...> - 2002-10-18 20:12:19
|
Earlier today, I wrote...
> According to Geoff Hutchison:
> > On Thu, 17 Oct 2002, Gilles Detillieux wrote:
> > > 3) My own lack of time in being able to get the 3.1.6 fixes/updates
> > > forward ported to 3.2.
> >
> > If you have a list of particular things, it would help significantly. I'll
> > check through the mailing list, but if you have a list somewhere it'd save
> > some time.
>
> I haven't updated the list since I sent it to Jessica Biola back in August.
> Here it is again...
...
> - multi-excerpt patch (max_excerpts attribute) for htsearch/Display.cc
> - better handling of htdig -m option
> - add startyear et al. to defaults.cc
> - make startyear et al. handle relative date ranges in Display.cc
> - fuzzy endings patch and updated english.0 file
> - get updated external parser scripts into contrib directory
>   (fix eof handling bug in .pl scripts)
> - list-all feature in htsearch for a query of * or prefix_match_character
> - ignore_dead_servers attribute
> - description_meta_tag_names attribute
> - ignore_alt_text attribute
> - translate_latin1 attribute, with hooks into SGMLCodec class
> - search_rewrite_rules attribute
> - anchor_target attribute
> - search_results_contenttype attribute
> - boolean_keywords attribute
> - boolean_syntax_errors attribute
> - multimatch_method attribute (still VERY buggy in 3.1.6 though)
...
> Note that some items in the list may already be in the 3.2 cvs. I just
> haven't checked yet. Also, a close look at the 3.1.6 ChangeLog may reveal
> bug fixes I've missed in both the list above and the 3.2 cvs.

OK, scratch the first sentence in the paragraph above. I just looked over
the list, and I'm fairly certain that none of these are in 3.2 cvs yet.
Lachlan's patch to defaults.cc will add startyear et al. to defaults.cc,
but it doesn't include the full description that 3.1.6's attrs.html file
has for these attributes.

And in a separate message, Jim Cole wrote...
> Gilles - Please let me know what I can do to help you with all
> the 3.1.x and 3.1.x->3.2 issues that seem to fall into your
> court. I am a proficient C/C++ programmer. My knowledge of the
> auto* tools is minimal, but I have a book :) I can find my way
> around CVS.
>
> I can't promise a lot as my free time is nearly non-existent at
> the moment, not to imply that the same is not true for others
> around here. However if you have a couple tasks to toss my way, I
> will see what I can do. I am certainly not going to make any
> significant contributions by forever sticking to the "not enough
> time" excuse ;)

Well, since you're offering, I wouldn't mind a hand in going through the
3.1.6 ChangeLog to see what other changes need to go in 3.2 still. I'd
like to add to the list above to make it a complete list of 3.1.x things
still needed for 3.2.

If that's not too unappealing a task to start with, that would be a good
starting point. Apart from that, probably all the htsearch changes and
htsearch-related attributes in the list above would be the easier ones to
go into 3.2, and probably the most pressing, because during 3.1.6
development I deliberately held back parallel changes to 3.2's htsearch
because of other changes that were to take place there, but never fully
materialized. The htfuzzy, endings and contrib changes should all be
pretty straightforward too.

How's that for starters? If you want help or clarification on any of
these, please let me know. And regardless of what you do end up working
on, thank you for offering!

I'll look after these two...
> - better handling of htdig -m option
> - translate_latin1 attribute, with hooks into SGMLCodec class
because I have a pretty clear idea in mind of what I want to do with
those.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Gilles D. <gr...@sc...> - 2002-10-18 17:39:13
|
According to Geoff Hutchison:
> I talked to Neal off-list, so I'd like to clarify as well. I think the
> three of us are thinking basically the same thing, but it doesn't help
> when we talk about "3.3" or "4.0." So let's talk about "how to get 3.2.0b4
> out soon."

Agreed. We can hammer out the details of later versions later. For now,
though, we need a reliable 3.2.0b4 out there.

My only immediate concern, which just occurred to me this morning, is the
confusion caused by 18 months worth of 3.2.0b4 snapshots out there,
incorporated into RPMs and such. When a bug report mentions 3.2.0b4, will
we be able to trust that it's actually the official 3.2.0b4 release?
Would it be helpful to skip b4 and jump to b5?

> On Thu, 17 Oct 2002, Gilles Detillieux wrote:
> > 3) My own lack of time in being able to get the 3.1.6 fixes/updates
> > forward ported to 3.2.
>
> If you have a list of particular things, it would help significantly. I'll
> check through the mailing list, but if you have a list somewhere it'd save
> some time.

I haven't updated the list since I sent it to Jessica Biola back in
August. Here it is again...

--- ---

From: Gilles Detillieux <gr...@sc...>
Subject: Re: [htdig-dev] Features in 3.1.6 and not in 3.2.0b4?
To: jes...@ya... (Jessica Biola)
Cc: htd...@li...
Date: Fri, 16 Aug 2002 17:19:43 -0500 (CDT)

According to Jessica Biola:
> Are there any features that are in 3.1.6 and not in
> 3.2.0b4? If so, could someone kindly provide a list
> of the features? (i.e. ignore_dead_servers)

I haven't yet compiled an exhaustive list of these. The sketchy list I
have so far is...

- multi-excerpt patch (max_excerpts attribute) for htsearch/Display.cc
- better handling of htdig -m option
- add startyear et al. to defaults.cc
- make startyear et al. handle relative date ranges in Display.cc
- fuzzy endings patch and updated english.0 file
- get updated external parser scripts into contrib directory
  (fix eof handling bug in .pl scripts)
- list-all feature in htsearch for a query of * or prefix_match_character
- ignore_dead_servers attribute
- description_meta_tag_names attribute
- ignore_alt_text attribute
- translate_latin1 attribute, with hooks into SGMLCodec class
- search_rewrite_rules attribute
- anchor_target attribute
- search_results_contenttype attribute
- boolean_keywords attribute
- boolean_syntax_errors attribute
- multimatch_method attribute (still VERY buggy in 3.1.6 though)

The only way to get a really complete list is to go through the release
notes and ChangeLog for 3.1.6, and make sure that each of these things
(or something equivalent) is in the 3.2 CVS tree already.

--- ---

Note that some items in the list may already be in the 3.2 cvs. I just
haven't checked yet. Also, a close look at the 3.1.6 ChangeLog may reveal
bug fixes I've missed in both the list above and the 3.2 cvs.

> > library (iconv). I think Neal's idea of the zlib-WordDB-compression
> > retrofit has merit, if only to get an interim beta 4 out the door soon.
> > I see it as a quicker solution to the reliability issue.
>
> I think we're all on the same page here, though I'd like to see the patch
> first, obviously. I've been working on the mifluz merge because I think it
> needs to be done and b/c I can't see how we can ship a 3.2.0b4 with these
> database bugs. If there's a smaller bug-fix, that's great. :-)

Sounds like a plan!

> > The only other thing I see as essential for 3.2.0b4 is getting the
> > 3.1.6 changes in there. Otherwise, there'll be too much confusion
>
> I think there are a few remaining minor bugs which we should probably
> stomp along the way.

Yes, we should comb through the bug database for anything that's
tackleable and/or urgent enough to warrant working on for b4.

As for Gabriele's question about the Content-Encoding header handling in
HTTP/1.1, I'd say that depends. Is Content-Encoding header handling
optional in an HTTP/1.1 client, or is it fully up to the server's
discretion whether it is used? If HTTP/1.1 clients are required, by the
standard, to recognize Content-Encoding, then I'd call it a bug that
htdig doesn't. If the standard makes it optional, then I'd think there
should be a way htdig can tell the server "I don't grok this."

> > Other side projects like defaults.xml are great, but this seems to be
> > shaping up to be a much bigger task that originally envisioned, what
>
> No offense to Gabriele, but I'd rather consider translations to the
> documentation _after_ we switch to an XML documentation setup.
>
> Personally, I'd consider switching to defaults.xml for 3.2.0b4 if I can
> see a patch in the near future. I'm willing to handle the documentation
> fixes by hand if I need to do it.

Again, sounds like a plan. My concern was that the whole translation
issue was going to affect the design of the XML DTD and coding for
defaults.xml, and it would take a while to nail that down too. If the
basic framework and the English version of the file can be readied in
time for b4, let's go with that, and fill in other languages later.

Another question to consider for this is, do we want all languages in the
same file, or do we want separate default.xml files for each one? What
about encodings? We'd need to handle different encodings for different
languages, and pass this encoding specification into the generated HTML
files. (Or is it all Unicode?) I know we don't need to nail all this down
now, but I thought I'd ask just in case these issues affect the basic
design we need to get in place for 3.2.0b4.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|