You can subscribe to this list here.
Message counts by month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | | | | | | | | | | 47 | 74 | 66 |
| 2002 | 95 | 102 | 83 | 64 | 55 | 39 | 23 | 77 | 88 | 84 | 66 | 46 |
| 2003 | 56 | 129 | 37 | 63 | 59 | 104 | 48 | 37 | 49 | 157 | 119 | 54 |
| 2004 | 51 | 66 | 39 | 113 | 34 | 136 | 67 | 20 | 7 | 10 | 14 | 3 |
| 2005 | 40 | 21 | 26 | 13 | 6 | 4 | 23 | 3 | 1 | 13 | 1 | 6 |
| 2006 | 2 | 4 | 4 | 1 | 11 | 1 | 4 | 4 | | 4 | | 1 |
| 2007 | 2 | 8 | 1 | 1 | 1 | | 2 | | 1 | | | |
| 2008 | 1 | | 1 | 2 | | | 1 | | 1 | | | |
| 2009 | | | 2 | | 1 | | | | | | | |
| 2010 | | | | | | | | | | | | 1 |
| 2011 | | | 1 | | 1 | | | | | 1 | | |
| 2012 | | | | | | | 1 | | | | | |
| 2013 | | | | 1 | | | | | | | | |
| 2016 | 1 | | | | | | | | | | | |
| 2017 | | | | | | | | | | | 1 | |
From: Dan C. <dan...@so...> - 2002-01-23 08:28:35
|
Hi all, This is my first post and I've only been using ht://Dig for 2 days, so please go easy on me. :) First, a little about my system - I'm running ht://Dig 3.1.5 on a Red Hat 7.2 Linux PC. The installed version of the standard sort command is 2.0.14. I was playing around with 3 databases which all worked fine individually. However, I wished to merge them into a single database so that I could search across all 3 at once. Looking through the archive I found a post from Gilles explaining how to do this with htmerge and the -m switch. I tried this and all seemed to work beautifully; I was able to search across all databases.

However, I noticed that some of the documents that SHOULD have shown up in a particular search did not. They showed up in their individual databases, and running the htmerge program with full verbosity showed they were being merged into the conglomerate database. Looking through the db.wordlist file showed that the word I was searching for, 'support', was not correctly sorted, as some instances of 'supported' were mixed in amongst them. I couldn't work out how this could happen until I tried playing with the sort command for a while. As one of the posts from the archive says, sort is assumed to sort across the whole line, but this simply does not produce the intended effect, at least on my (very common) system.
Here's a part of the db.wordlist I get from htmerge as it stands:

    supply i:29 l:796 w:204
    support i:20 l:111 w:2585 c:4
    support i:69 l:35 w:201890 c:4
    support i:70 l:11 w:4650 c:6
    supported i:57 l:710 w:290
    supported i:38 l:797 w:203
    support i:59 l:240 w:760
    support i:18 l:797 w:203
    support i:29 l:799 w:201
    support i:73 l:869 w:131
    sure i:20 l:656 w:344
    surname i:30 l:115 w:1607 c:2
    surname i:31 l:115 w:1607 c:2

Here's what you get when you use the straight /bin/sort on it:

    supply i:29 l:796 w:204
    supported i:38 l:797 w:203
    supported i:57 l:710 w:290
    support i:18 l:797 w:203
    support i:20 l:111 w:2585 c:4
    support i:29 l:799 w:201
    support i:59 l:240 w:760
    support i:69 l:35 w:201890 c:4
    support i:70 l:11 w:4650 c:6
    support i:73 l:869 w:131
    sure i:20 l:656 w:344
    surname i:30 l:115 w:1607 c:2
    surname i:31 l:115 w:1607 c:2

And here's what's actually needed to get correct results from htsearch:

    supply i:29 l:796 w:204
    support i:18 l:797 w:203
    support i:20 l:111 w:2585 c:4
    support i:29 l:799 w:201
    support i:59 l:240 w:760
    support i:69 l:35 w:201890 c:4
    support i:70 l:11 w:4650 c:6
    support i:73 l:869 w:131
    supported i:38 l:797 w:203
    supported i:57 l:710 w:290
    sure i:20 l:656 w:344
    surname i:30 l:115 w:1607 c:2
    surname i:31 l:115 w:1607 c:2

The reason 'supported' appears above 'support' is that the whole line is being used as the key, and the 'e' in 'supported' comes before the 'i' in the next field. Below is a patch for htmerge/words.cc that appends a '--key=1,1' parameter to the sort command in htmerge. This seems to fix the problem. Gilles mentions elsewhere that the intended behaviour is to sort by the first field, then second, etc., so you may wish to include those parameters also. No idea if this would work on any systems apart from my own (or even if the problem exists in different versions, etc). Obviously it would be better if this parameter were in the Makefile or something and was configured as necessary by make, but I don't know enough about it.
59a60,72
>
>
> // START PATCH
> // patch added by Dan Cutting (da...@wo...) 23/01/2002 to make htmerge
> // use first field of wordlist to sort instead of entire line which could lead
> // to incorrectly sorted database. this in turn leads to missing results
> // from searches. NB: this has not been tested on any systems (apart from my own
> // Linux box) and is designed purely for sort version 2.0.14. Other versions
> // will probably also work, but no guarantees! Try it from the command line first.
> command << " --key=1,1";
> /// END PATCH
>
>

Regards,

Dan Cutting
dan...@so...
|
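[Editor's note] Dan's diagnosis can be reproduced with GNU sort directly. This is a minimal sketch, not part of the original thread: the sample lines are taken from his message, GNU coreutils sort is assumed, and the exact whole-line ordering depends on your locale's collation (dictionary-style locales skip blanks, which is what pushes "supported" ahead of "support i:...").

```shell
# Build a tiny wordlist like the db.wordlist excerpts in Dan's message.
cat > wordlist.txt <<'EOF'
support i:20 l:111 w:2585 c:4
supported i:57 l:710 w:290
supply i:29 l:796 w:204
support i:69 l:35 w:201890 c:4
EOF

# Whole-line sort: the key runs past the word into the i:/l:/w: fields,
# so ordering depends on locale collation of the trailing fields.
echo "--- whole-line key ---"
sort wordlist.txt

# First-field-only key, as in the htmerge patch: all "support" entries
# are grouped together regardless of what follows the word.
echo "--- sort --key=1,1 ---"
sort --key=1,1 wordlist.txt
```

With `--key=1,1` the words group correctly ('supply', then every 'support' line, then 'supported'), which is the ordering htsearch expects.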
From: Gilles D. <gr...@sc...> - 2002-01-22 19:30:29
|
According to Neal Richter:
> I'm not getting any success merging databases.
>
> Here's the process:
>
> 1. htdig -c htdig.conf.red
>
>    Searching works fine
>
> 2. htdig -c htdig.conf.blue
>    [htdig.conf.blue has a different database_dir]
>
> 3. htmerge -m htdig.conf.blue
>
> htmerge finishes after some time.

You don't use -c on the 3rd command, so it will default to htdig.conf. What database does that select? That's the one that the "blue" db will be merged into.

> 4. A comparison of the old RED-only db files + BLUE db files to the new
>    RED&BLUE db files looks OK. The new files increase nearly to the sum of
>    the separate sizes of RED and BLUE files.
>
> 5. Searching works fine for words in RED, but words in BLUE returns
>    nothing.
>
> 6. If I make the default database_dir point to BLUE, searching works
>    fine (of course only for words in BLUE)
>
> Ideas?
>
> I am using a DEBUG build.

You don't mention which version of htdig you're using. Whether it's 3.1.x or 3.2.0bx, you should be running the latest snapshot from http://www.htdig.org/files/snapshots/. 3.1.5 and earlier have some bugs in the merging code that can cause a number of problems. Also, for 3.1.x (even the latest 3.1.6 snapshot), you must run htmerge on the individual databases before merging them together, as Jim Cole suggested.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
|
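[Editor's note] Putting Gilles' and Jim's advice together, the working sequence looks like the following sketch. The config file names are the ones from Neal's message; the requirement to run htmerge on each database individually first applies to 3.1.x per Jim's note, and the -c target on the final command is Gilles' point about where -m merges into.

```
# 1. Index and htmerge each database on its own first (required on 3.1.x):
htdig -c htdig.conf.red
htmerge -c htdig.conf.red
htdig -c htdig.conf.blue
htmerge -c htdig.conf.blue

# 2. Merge BLUE into RED, naming the target config explicitly with -c;
#    without -c, htmerge falls back to the default htdig.conf database.
htmerge -c htdig.conf.red -m htdig.conf.blue
```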
From: Jim C. <gre...@yg...> - 2002-01-22 02:55:08
|
Neal Richter's bits of Mon, 21 Jan 2002 translated to:
>Hey,
> I'm not getting any success merging databases.
>
>Here's the process:
>
>1. htdig -c htdig.conf.red
>
>   Searching works fine
>
>2. htdig -c htdig.conf.blue
>   [htdig.conf.blue has a different database_dir]
>
>3. htmerge -m htdig.conf.blue

Hi - You need to run htmerge on the individual databases before you merge the two together (before you run with -m).

Jim
|
From: Neal R. <ne...@ri...> - 2002-01-22 02:28:31
|
Hey, I'm not getting any success merging databases. Here's the process:

1. htdig -c htdig.conf.red

   Searching works fine

2. htdig -c htdig.conf.blue
   [htdig.conf.blue has a different database_dir]

3. htmerge -m htdig.conf.blue

   htmerge finishes after some time.

4. A comparison of the old RED-only db files + BLUE db files to the new RED&BLUE db files looks OK. The new files increase nearly to the sum of the separate sizes of RED and BLUE files.

5. Searching works fine for words in RED, but words in BLUE returns nothing.

6. If I make the default database_dir point to BLUE, searching works fine (of course only for words in BLUE)

Ideas?

I am using a DEBUG build.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
|
From: Gilles D. <gr...@sc...> - 2002-01-18 21:50:08
|
According to Geoff Hutchison:
> On Fri, 18 Jan 2002, Gilles Detillieux wrote:
> > +    else if (strcmp(word.get(), prefix_suffix) == 0)
> > +    {
> > +        tempWords.Add(new WeightWord(prefix_suffix, 1.0));
> > +    }
>
> This won't work in the case where people set prefix_suffix to '' to run
> prefix matching on everything. We probably want a test something like:
>
> ((strnlen(prefix_suffix) != 0 && strcmp(word.get(), prefix_suffix) == 0) ||
>  (strnlen(prefix_suffix) == 0 && strcmp(word.get(), "*") == 0))
>
> Maybe that can be simplified.
>
> > +    if (strcmp(temp.get(), prefix_suffix) == 0) {
>
> Same problem here.

There was also a little wrinkle in the word separation code if you force a new wildcard character when prefix_match_character is empty. Here's the new patch...

--- htsearch/htsearch.cc.orig	Thu Nov  1 14:45:07 2001
+++ htsearch/htsearch.cc	Fri Jan 18 15:43:24 2002
@@ -444,6 +444,9 @@ setupWords(char *allWords, List &searchW
     String word;  // Why use a char type if String is the new char type!!!
 
     char *prefix_suffix = config["prefix_match_character"];
+    char *wildcard = prefix_suffix;
+    if (*wildcard == '\0')
+        wildcard = "*";
     while (*pos)
     {
         while (1)
@@ -461,12 +464,12 @@ setupWords(char *allWords, List &searchW
                 tempWords.Add(new WeightWord(s, -1.0));
                 break;
             }
-            else if (HtIsWordChar(t) || t == ':' ||
+            else if (HtIsWordChar(t) || t == ':' || t == *wildcard ||
                      (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255))
             {
                 word = 0;
-                while (t && (HtIsWordChar(t) ||
-                       t == ':' || (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255)))
+                while (t && (HtIsWordChar(t) || t == ':' || t == *wildcard ||
+                       (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255)))
                 {
                     word << (char) t;
                     t = *pos++;
@@ -485,6 +488,10 @@ setupWords(char *allWords, List &searchW
             else if (boolean && mystrcasecmp(word.get(), boolean_keywords[2]) == 0)
             {
                 tempWords.Add(new WeightWord("!", -1.0));
+            }
+            else if (strcmp(word.get(), wildcard) == 0)
+            {
+                tempWords.Add(new WeightWord(wildcard, 1.0));
             }
             else
             {
--- htsearch/parser.cc.orig	Thu Jul 26 15:08:52 2001
+++ htsearch/parser.cc	Fri Jan 18 15:24:18 2002
@@ -240,6 +240,24 @@ Parser::perform_push()
         list->isIgnore = 1;
         return;
     }
+    static char *wildcard = config["prefix_match_character"];
+    if (*wildcard == '\0')
+        wildcard = "*";
+    if (strcmp(temp.get(), wildcard) == 0) {
+        Database *docIndex = Database::getDatabaseInstance();
+        docIndex->OpenRead(config["doc_index"]);
+        docIndex->Start_Get();
+        while ((p = docIndex->Get_Next()))
+        {
+            dm = new DocMatch;
+            dm->score = current->weight;
+            dm->id = atoi(p);
+            dm->anchor = 0;
+            list->add(dm);
+        }
+        delete docIndex;
+        return;
+    }
     temp.lowercase();
     p = temp.get();
     if (temp.length() > maximum_word_length)
|
From: Geoff H. <ghu...@ws...> - 2002-01-18 20:54:53
|
On Fri, 18 Jan 2002, Gilles Detillieux wrote:

No, I didn't think it was complicated to code--just to figure out the cleanest way to do it. ;-)

> something that originally seemed very complicated), but I'd appreciate a
> few extra eyeballs on this code, or some other testers. The weight is
> +    else if (strcmp(word.get(), prefix_suffix) == 0)
> +    {
> +        tempWords.Add(new WeightWord(prefix_suffix, 1.0));
> +    }

This won't work in the case where people set prefix_suffix to '' to run prefix matching on everything. We probably want a test something like:

    ((strnlen(prefix_suffix) != 0 && strcmp(word.get(), prefix_suffix) == 0) ||
     (strnlen(prefix_suffix) == 0 && strcmp(word.get(), "*") == 0))

Maybe that can be simplified.

> +    if (strcmp(temp.get(), prefix_suffix) == 0) {

Same problem here.

-Geoff
|
From: Gilles D. <gr...@sc...> - 2002-01-18 20:35:12
|
According to Geoff Hutchison:
> Doing a quick Database lookup inside of parser.cc to get the DocIDs out of
> the index is probably the best solution.

OK, that was almost too easy. The patch is below. I'll commit it, because it works for me (in fact it worked the first try, which is kinda scary for something that originally seemed very complicated), but I'd appreciate a few extra eyeballs on this code, or some other testers. The weight is essentially hard-coded as 1, because I didn't see the point in dragging text_factor or some other attribute into this.

--- htsearch/htsearch.cc.orig	Thu Nov  1 14:45:07 2001
+++ htsearch/htsearch.cc	Fri Jan 18 13:37:35 2002
@@ -486,6 +486,10 @@ setupWords(char *allWords, List &searchW
             {
                 tempWords.Add(new WeightWord("!", -1.0));
             }
+            else if (strcmp(word.get(), prefix_suffix) == 0)
+            {
+                tempWords.Add(new WeightWord(prefix_suffix, 1.0));
+            }
             else
             {
                 // Add word to excerpt matching list
--- htsearch/parser.cc.orig	Thu Jul 26 15:08:52 2001
+++ htsearch/parser.cc	Fri Jan 18 13:53:05 2002
@@ -224,6 +224,7 @@ void
 Parser::perform_push()
 {
     static int maximum_word_length = config.Value("maximum_word_length", 12);
+    static char *prefix_suffix = config["prefix_match_character"];
     String temp = current->word.get();
     String data;
     char *p;
@@ -238,6 +239,21 @@ Parser::perform_push()
         // This word needs to be ignored. Make it so.
         //
         list->isIgnore = 1;
+        return;
+    }
+    if (strcmp(temp.get(), prefix_suffix) == 0) {
+        Database *docIndex = Database::getDatabaseInstance();
+        docIndex->OpenRead(config["doc_index"]);
+        docIndex->Start_Get();
+        while ((p = docIndex->Get_Next()))
+        {
+            dm = new DocMatch;
+            dm->score = current->weight;
+            dm->id = atoi(p);
+            dm->anchor = 0;
+            list->add(dm);
+        }
+        delete docIndex;
         return;
     }
     temp.lowercase();
|
From: Geoff H. <ghu...@ws...> - 2002-01-18 19:21:44
|
On Fri, 18 Jan 2002, Gilles Detillieux wrote:
> docIDs. Wouldn't it be much quicker to get them from the db.docs.index,
> which is keyed by docID in 3.1? It's a smaller database, and you'd just
> need to traverse the "cursor" part of it to get the list of keys.

Sure--this was my plan. However, the DocumentDB "holds" both databases. So either way, you're writing a new method to DocumentDB.

> it, the parser already does some "config" lookups, and I was going to
> add a lookup for prefix_match_character anyway, so why not just lookup
> doc_index too and forget adding extra stuff to pass.

Ah, I see what you're saying. That works. I was imagining passing around a List of all the DocIDs to/from DocumentDB and potentially via htsearch/main.cc to get it through the parser. (Which is why I thought it was a bad idea.) Doing a quick Database lookup inside of parser.cc to get the DocIDs out of the index is probably the best solution.

-Geoff
|
From: Gilles D. <gr...@sc...> - 2002-01-18 19:15:39
|
According to Geoff Hutchison:
> On Fri, 18 Jan 2002, Gilles Detillieux wrote:
> > dummy ResultList with all valid document IDs. All it needs is a method
> > to call to get that list of docIDs - that's the part I need help with.
>
> The reason I suggested in htsearch/main.cpp is that you could get this
> from the DocumentDB class if you're not in the parser. I guess htsearch
> could grab this for the parser as a callback, but this was why I was
> looking to skip the parser entirely.

OK, I've given this some more thought. The db.docdb in 3.1.x is keyed by encoded URLs, so to get a list of docIDs from it, you'd essentially need to read in and decode every record in that database just to get at the docIDs. Wouldn't it be much quicker to get them from the db.docs.index, which is keyed by docID in 3.1? It's a smaller database, and you'd just need to traverse the "cursor" part of it to get the list of keys.

> > dummy record into the db.words.db with a list of cooked-up WordRecords.
> > That would work, but it's not as clean as I'd like.
>
> It could also be a pretty big record.

Yup, which is a big part of the reason this isn't as clean as I'd like.

> > combining * with and or or doesn't really get you anything, but it might
> > be nice to be able to do "* not foo".
>
> True. But we should make sure to remember the balance--how much code will
> we add versus the utility of the feature. I see the utility of "return all
> matches, then sort, restrict, etc." I also see the utility of "* not
> foo," but I'm not sure it's as bulletproof. Should we pass the DocumentDB
> to the parser too?

Well, I'd say either we pass the doc_index filename to the parser, which would only create a database instance and open the database if it needs it, or we do the opening part in main() regardless and pass the Database pointer to the parser. I prefer the former.

Come to think of it, the parser already does some "config" lookups, and I was going to add a lookup for prefix_match_character anyway, so why not just lookup doc_index too and forget adding extra stuff to pass. You know, I just may be able to code this thing after all.
|
From: Gilles D. <gr...@sc...> - 2002-01-18 18:49:29
|
According to Geoff Hutchison:
> On Fri, 18 Jan 2002, Gilles Detillieux wrote:
> > As an aside, we've always operated under the assumption that the word
> > location affected the word score somehow, but I can't find any code in
> > htsearch that does this.
>
> It's not in htsearch. Remember that before 3.2, htsearch did no
> scoring. So check htcommon/WordList.cc:
>     wordRef->WordCount++;
>     wordRef->Weight += int((1000 - location) * weight_factor);
>     if (location < wordRef->Location)
>         wordRef->Location = location;

Sorry, brain fart. Of course the score is calculated in htdig in 3.1. However, it does seem pointless to store locations in db.wordlist and db.words.db if only htdig uses them. Even in an update dig, htdig only adds to db.wordlist for reparsed documents, and schedules the old word records for deletion, so the locations are only used internally by htdig.

> > As far as I can tell, when the info is transferred
> > from WordRecords to DocMatches, the location field is completely ignored.
>
> You might argue that htsearch should use a class that's slimmer than
> WordRecord, but htdig certainly uses location. I had entertained thoughts
> of using the location flag as well for speeding up highlighting in excerpts
> in htsearch, but since it's a 1/1000 location and not a character or word
> location, I scrapped those plans.

Well, yeah, I guess htmerge and htsearch would save memory by using a slimmer class for word records, but you'd also cut down the size of the databases. Hmm. A little too radical a change for 3.1.6, though, I think.
|
From: Geoff H. <ghu...@ws...> - 2002-01-18 18:18:50
|
On Fri, 18 Jan 2002, Gilles Detillieux wrote:
> dummy ResultList with all valid document IDs. All it needs is a method
> to call to get that list of docIDs - that's the part I need help with.

The reason I suggested in htsearch/main.cpp is that you could get this from the DocumentDB class if you're not in the parser. I guess htsearch could grab this for the parser as a callback, but this was why I was looking to skip the parser entirely.

> dummy record into the db.words.db with a list of cooked-up WordRecords.
> That would work, but it's not as clean as I'd like.

It could also be a pretty big record.

> combining * with and or or doesn't really get you anything, but it might
> be nice to be able to do "* not foo".

True. But we should make sure to remember the balance--how much code will we add versus the utility of the feature. I see the utility of "return all matches, then sort, restrict, etc." I also see the utility of "* not foo," but I'm not sure it's as bulletproof. Should we pass the DocumentDB to the parser too?

-Geoff
|
From: Geoff H. <ghu...@ws...> - 2002-01-18 18:08:59
|
On Fri, 18 Jan 2002, Gilles Detillieux wrote:
> As an aside, we've always operated under the assumption that the word
> location affected the word score somehow, but I can't find any code in
> htsearch that does this.

It's not in htsearch. Remember that before 3.2, htsearch did no scoring. So check htcommon/WordList.cc:

    wordRef->WordCount++;
    wordRef->Weight += int((1000 - location) * weight_factor);
    if (location < wordRef->Location)
        wordRef->Location = location;

> As far as I can tell, when the info is transferred
> from WordRecords to DocMatches, the location field is completely ignored.

You might argue that htsearch should use a class that's slimmer than WordRecord, but htdig certainly uses location. I had entertained thoughts of using the location flag as well for speeding up highlighting in excerpts in htsearch, but since it's a 1/1000 location and not a character or word location, I scrapped those plans.

-Geoff
|
From: Gilles D. <gr...@sc...> - 2002-01-18 17:48:46
|
According to Geoff Hutchison:
> At 10:20 AM +0100 1/17/02, J. op den Brouw wrote:
> > <word1> and * not <word2> == <word1> not <word2>
> >
> > Can that be parsed easily?
>
> Not in the current parser.cc. Perhaps I should rephrase that... I'm
> not going to write code to deal with that case and certainly not for
> a "production release."
>
> If someone else thinks they can tackle that type of query in a
> reasonable time-frame and can convince us that it doesn't introduce
> add'l bugs in the query parser, then great. I'm of the opinion that
> the sooner we can go to a new parser, the better.

Well, what I had envisioned is actually a pretty trivial hook in the parser.cc code. It wouldn't optimize the and and or operations as Jesse suggested, but it would just treat the * (or actually whatever prefix_match_character is set to) as a word. All it would take is a simple test in Parser::perform_push() to test if the word in "temp" matches the string in prefix_match_character, and if so, it builds a dummy ResultList with all valid document IDs. All it needs is a method to call to get that list of docIDs - that's the part I need help with. Apart from that, all it would take is a little hook in setupWords() to allow a bare prefix_match_character as a word even though it's shorter than minimum_word_length.

The only simple technique I can think of to get the list of valid docIDs into the parser without actually modifying the parser, is to put a hook in htmerge/words.cc to keep track of all the docIDs and then put in a dummy record into the db.words.db with a list of cooked-up WordRecords. That would work, but it's not as clean as I'd like. A better solution would be the parser hook I mentioned, but then getting that list of docIDs would require looking into one of the other two databases, as you're not going to find it in the word database.

In any case, the eventual outcome must be a ResultList with all valid docIDs, so I don't think that's any more complicated to patch into the parser than right into htsearch's main() function. As Jesse pointed out, combining * with and or or doesn't really get you anything, but it might be nice to be able to do "* not foo". As for the score field in the DocMatch objects in the ResultList, they could be assigned something arbitrarily low, like 1, or text_factor, or text_factor * current->weight.

As an aside, we've always operated under the assumption that the word location affected the word score somehow, but I can't find any code in htsearch that does this. As far as I can tell, when the info is transferred from WordRecords to DocMatches, the location field is completely ignored. Indeed, when I grep for "location" in all the copies of htsearch source I have (right back to 3.0.8b2), the only reference I find to that word is in 3.0.8b2's obsolete htsearch/display.cc module, which isn't used at all, and which contains a section of disabled code that calculates scores based on word location in DocHead. I think it's pretty safe to say the location codes that htdig & htmerge so carefully calculate and manage are entirely useless. Am I missing something here?

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
|
From: Williams, D. A. (DAWilliams) <DAW...@aa...> - 2002-01-18 17:22:16
|
Thanks for the reply. I understand completely about not including it if there is no interest in it. I had trouble getting an earlier 3.2 beta to compile and run (before I tried to monkey with the code) and got pulled off the project for other things. I do hope to be able to work on a 3.2 patch at some point and I'll pass it along, "just in case" the feature seems interesting to anyone else. We had given the wrapper idea some consideration, but it seemed cleaner (and actually pretty easy) to include it in ht://Dig -- there may have been some problems that I don't remember with the wrapper approach. At any rate, thanks for keeping ht://Dig moving forward.

-David
==========================
The views, opinions, and judgments expressed are solely my own. Message contents have not been reviewed or approved by AARP.
|
From: Geoff H. <ghu...@ws...> - 2002-01-18 17:14:27
|
On Fri, 18 Jan 2002, Williams, David A. (DAWilliams) wrote:
> I'm sorry to be late to the party with this. If there is any chance
> that my forced results patch might be included in 3.1.6, I need to point out
> the one bone-headed bug I've run across in it. This section (for htMerge):

I haven't heard many requests for this sort of feature. And while it's interesting, my inclination is to aim for the smallest necessary changes in the production branch. So I'd hesitate to include an optional database like this. (It's not that your code looks bad on a read-through, but there may be unforeseen bugs.) So I would point it towards the 3.2 branch as a possibility. I'm also not sure if this could be better done with a wrapper around htsearch and a separate CGI to retrieve your "suggested links" (even something in Perl would seem to work well).

-Geoff
|
From: Williams, D. A. (DAWilliams) <DAW...@aa...> - 2002-01-18 16:16:48
|
I'm sorry to be late to the party with this. If there is any chance that my forced results patch might be included in 3.1.6, I need to point out the one bone-headed bug I've run across in it. This section (for htMerge):

+
+    ref = db[value];
+    force_data << ref->DocID();
+    if(ref)
+    {

should really be:

+
+    ref = db[value];
+    if(ref)
+    {
+        force_data << ref->DocID();

I've updated the patch file and tar ball of changed files at: http://www.kayakero.net/xfer/htdig/

-David
|
From: Geoff H. <ghu...@ws...> - 2002-01-17 14:00:35
|
At 10:20 AM +0100 1/17/02, J. op den Brouw wrote:
>Maybe with (yet another new option) max_retries which will count the
>retries; if it fails (max_retries is full), then the server is presumed dead.
>But you all know what I mean anyway.

Yes, but each TCP connection is already retried a few times. So you should try a URL over if the server seems dead? How do you do that when there's only one server--wait some length of time and try it again? Maybe. For the sake of 3.1.6 (and getting it out the door), I'd rather go with Gilles' approach--you can turn off the practice of marking a server as dead. Those users who want to keep trying a server will have that ability. The people who complained bitterly that htdig would keep trying a dead server will have the ability to ignore it.

><word1> and * not <word2> == <word1> not <word2>
>
>Can that be parsed easily?

Not in the current parser.cc. Perhaps I should rephrase that... I'm not going to write code to deal with that case and certainly not for a "production release." If someone else thinks they can tackle that type of query in a reasonable time-frame and can convince us that it doesn't introduce add'l bugs in the query parser, then great. I'm of the opinion that the sooner we can go to a new parser, the better.

-Geoff
|
From: J. op d. B. <MSQ...@st...> - 2002-01-17 09:20:37
|
On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> > > 2. way to override "no server" problem
> >
> > I'm not quite sure what you mean by this.
>
> Since 3.1.5, if htdig fails to connect to a server, it sets the "dead
> server" flag and won't try again to contact that server, giving instead a
> bunch of "no server" errors. In most cases, that's the best thing to do,
> but a number of users expressed a preference for the old way, or something
> more fail-safe like waiting a while and trying that server again later.

Maybe with (yet another new option) max_retries which will count the retries; if it fails (max_retries is full), then the server is presumed dead. But you all know what I mean anyway.

> > > 4. a "match all documents" mechanism in htsearch
> >
> > This IMHO, is no small feat unless you hack htsearch to totally bypass the
> > parser and htfuzzy phases for some specific query. (As in, if the query is
> > just '*' and nothing else, it will return all documents and then
> > restrict, exclude, etc. But something like 'foo and * not bar" is subject
> > to the normal parsing.)
> >
> > Again, if this seems like a reasonable workaround, I can write this.

    <word> and * == <word>
    <word> or * == *
    <word1> and * not <word2> == <word1> not <word2>

Can that be parsed easily?

--jesse
--------------------------------------------------------------------
J. op den Brouw                          Johanna Westerdijkplein 75
Haagse Hogeschool                        2521 EN DEN HAAG
Faculty of Engineering                   Netherlands
Electrical Engineering                   +31 70 4458936
-------------------- J.E...@st... --------------------
Linux - because reboots are for hardware changes
|
From: Gilles D. <gr...@sc...> - 2002-01-16 18:05:04
|
According to Joe R. Jah:
> On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> > According to Geoff Hutchison:
> > > On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> > > > 1. better configure test for regex problems on BSDi
> > >
> > > I don't know if this will ever become "automatic," but it looks like we
> > > can relatively easily have a --with-rx flag to the configure script which
> > > will bypass the included regex code (and use the rx code instead).
> > >
> > > Does this seem like a reasonable workaround?
> >
> > That's reasonable to me. But back in October, you had suggested an
> > automatic test for the machine triplet "*-*-bsdi*". Have you given up on
> > that idea? It seems it's only been BSDI systems that have had problems
> > with the bundled GNU regex code. I think manual overrides with --with-...
> > options are a good idea, but an automatic test for setting the default
> > would be handy and may cut down the number of questions on the list.
>
> I believe --with-rx is the better solution. Administrators of systems
> that have not manifested any problem with the bundled GNU regex code
> would have an option to compare htdig performance with and without it.
> I am not sure if the automatic test for the machine triplet "*-*-bsdi*"
> would work; however, I wouldn't mind having both solutions ;^)

Having both solutions is exactly what I was proposing. The automatic tests
are nice when they work, because they can save a lot of users a lot of
fiddling to get things right, but it's nice to be able to override some of
these tests with a command option should the test fail on a particular
system.

--
Gilles R. Detillieux            E-mail: <gr...@sc...>
Spinal Cord Research Centre     WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax:   (204)789-3930
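[Editor's note] The "automatic test" under discussion is just a glob match on the canonical host triplet that configure computes. In configure itself this would be a shell `case` statement; the logic can be sketched in Python for clarity:

```python
from fnmatch import fnmatchcase

def default_to_rx(host_triplet):
    """Return True if the bundled GNU regex should be bypassed by default.

    Only BSDI hosts were reported to have trouble with the bundled regex
    code, so they are detected with the "*-*-bsdi*" triplet pattern
    mentioned in the thread above.
    """
    return fnmatchcase(host_triplet, "*-*-bsdi*")

print(default_to_rx("i386-pc-bsdi4.2"))    # -> True  (BSDI host)
print(default_to_rx("i686-pc-linux-gnu"))  # -> False (unaffected host)
```

A `--with-rx` configure flag would then simply override whatever default this test picks, which is the "both solutions" arrangement Gilles describes.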
From: Joe R. J. <jj...@cl...> - 2002-01-15 23:57:08
|
On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> Date: Tue, 15 Jan 2002 17:05:37 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: Geoff Hutchison <ghu...@ws...>
> Cc: htd...@li...
> Subject: Re: [htdig-dev] Re: Progress towards 3.1.6
>
> According to Geoff Hutchison:
> > On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> > > 1. better configure test for regex problems on BSDi
> >
> > I don't know if this will ever become "automatic," but it looks like we
> > can relatively easily have a --with-rx flag to the configure script which
> > will bypass the included regex code (and use the rx code instead).
> >
> > Does this seem like a reasonable workaround?
>
> That's reasonable to me. But back in October, you had suggested an
> automatic test for the machine triplet "*-*-bsdi*". Have you given up on
> that idea? It seems it's only been BSDI systems that have had problems
> with the bundled GNU regex code. I think manual overrides with --with-...
> options are a good idea, but an automatic test for setting the default
> would be handy and may cut down the number of questions on the list.

I believe --with-rx is the better solution. Administrators of systems
that have not manifested any problem with the bundled GNU regex code
would have an option to compare htdig performance with and without it.
I am not sure if the automatic test for the machine triplet "*-*-bsdi*"
would work; however, I wouldn't mind having both solutions ;^)

Regards,

Joe
--
_/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl...

> > > 2. way to override "no server" problem
> >
> > I'm not quite sure what you mean by this.
>
> Since 3.1.5, if htdig fails to connect to a server, it sets the "dead
> server" flag and won't try again to contact that server, giving instead a
> bunch of "no server" errors. In most cases, that's the best thing to do,
> but a number of users expressed a preference for the old way, or something
> more fail-safe like waiting a while and trying that server again later.
>
> > > 4. a "match all documents" mechanism in htsearch
> >
> > This, IMHO, is no small feat unless you hack htsearch to totally bypass the
> > parser and htfuzzy phases for some specific query. (As in, if the query is
> > just '*' and nothing else, it will return all documents and then
> > restrict, exclude, etc. But something like 'foo and * not bar' is subject
> > to the normal parsing.)
> >
> > Again, if this seems like a reasonable workaround, I can write this.
>
> It's certainly reasonable for the purpose in mind (the what's new facility).
> Thanks!
>
> --
> Gilles R. Detillieux            E-mail: <gr...@sc...>
> Spinal Cord Research Centre     WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada)     Fax:   (204)789-3930
>
> _______________________________________________
> htdig-dev mailing list
> htd...@li...
> https://lists.sourceforge.net/lists/listinfo/htdig-dev
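[Editor's note] The "wait a while and try that server again later" idea can be sketched as per-server bookkeeping: count consecutive failures, back off between attempts, and only presume the server dead past a limit. All names below are hypothetical, not real htdig internals:

```python
import time

class ServerState:
    """Hypothetical per-server state for the retry idea discussed above.

    Instead of the 3.1.5 one-strike "dead server" flag, a server is only
    presumed dead after max_retries consecutive failures, with a pause
    between attempts.
    """
    def __init__(self, max_retries=3, wait_seconds=30):
        self.max_retries = max_retries
        self.wait_seconds = wait_seconds
        self.failures = 0

    def record_failure(self):
        self.failures += 1

    @property
    def dead(self):
        return self.failures >= self.max_retries

def fetch_with_retries(state, connect):
    """Call connect() until it succeeds or the server is presumed dead."""
    while not state.dead:
        try:
            return connect()
        except OSError:
            state.record_failure()
            if not state.dead:
                time.sleep(state.wait_seconds)
    return None  # server presumed dead; caller logs "no server"
```

This keeps the fail-safe behaviour some users asked for while still giving up eventually, rather than hammering a dead host for the whole dig.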
From: Gilles D. <gr...@sc...> - 2002-01-15 23:05:49
|
According to Geoff Hutchison:
> On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> > 1. better configure test for regex problems on BSDi
>
> I don't know if this will ever become "automatic," but it looks like we
> can relatively easily have a --with-rx flag to the configure script which
> will bypass the included regex code (and use the rx code instead).
>
> Does this seem like a reasonable workaround?

That's reasonable to me. But back in October, you had suggested an
automatic test for the machine triplet "*-*-bsdi*". Have you given up on
that idea? It seems it's only been BSDI systems that have had problems
with the bundled GNU regex code. I think manual overrides with --with-...
options are a good idea, but an automatic test for setting the default
would be handy and may cut down the number of questions on the list.

> > 2. way to override "no server" problem
>
> I'm not quite sure what you mean by this.

Since 3.1.5, if htdig fails to connect to a server, it sets the "dead
server" flag and won't try again to contact that server, giving instead a
bunch of "no server" errors. In most cases, that's the best thing to do,
but a number of users expressed a preference for the old way, or something
more fail-safe like waiting a while and trying that server again later.

> > 4. a "match all documents" mechanism in htsearch
>
> This, IMHO, is no small feat unless you hack htsearch to totally bypass the
> parser and htfuzzy phases for some specific query. (As in, if the query is
> just '*' and nothing else, it will return all documents and then
> restrict, exclude, etc. But something like 'foo and * not bar' is subject
> to the normal parsing.)
>
> Again, if this seems like a reasonable workaround, I can write this.

It's certainly reasonable for the purpose in mind (the what's new facility).
Thanks!

--
Gilles R. Detillieux            E-mail: <gr...@sc...>
Spinal Cord Research Centre     WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax:   (204)789-3930
From: Geoff H. <ghu...@ws...> - 2002-01-15 22:49:22
|
On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> 1. better configure test for regex problems on BSDi

I don't know if this will ever become "automatic," but it looks like we
can relatively easily have a --with-rx flag to the configure script which
will bypass the included regex code (and use the rx code instead).

Does this seem like a reasonable workaround?

> 2. way to override "no server" problem

I'm not quite sure what you mean by this.

> 4. a "match all documents" mechanism in htsearch

This, IMHO, is no small feat unless you hack htsearch to totally bypass the
parser and htfuzzy phases for some specific query. (As in, if the query is
just '*' and nothing else, it will return all documents and then
restrict, exclude, etc. But something like 'foo and * not bar' is subject
to the normal parsing.)

Again, if this seems like a reasonable workaround, I can write this.

-Geoff
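[Editor's note] The workaround Geoff describes -- short-circuit before the parser and htfuzzy phases when the query is exactly '*' -- is a one-line special case. A sketch with hypothetical names (htsearch itself is C++):

```python
def run_search(query, all_document_ids, parse_and_match):
    """Sketch of the "match all documents" special case discussed above.

    If the query is exactly '*' (and nothing else), skip parsing and the
    fuzzy phases entirely and return every document; restrict/exclude
    filtering would then apply to this full list. Any other query,
    including ones that merely contain '*', goes through normal parsing
    via parse_and_match().
    """
    if query.strip() == "*":
        return list(all_document_ids)
    return parse_and_match(query)
```

Note how `'foo and * not bar'` never hits the special case and still reaches the normal parser, which is exactly the limitation Geoff points out.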
From: Geoff H. <ghu...@ws...> - 2002-01-15 22:42:36
|
On Tue, 15 Jan 2002, Gilles Detillieux wrote:
> > Would it be possible to remove duplicate pages from the search results
> > before they are output to the html page? This is obviously something that
> > htmerge would do if the databases were to be combined into one.
>
> Good point. This would appear to be a bug in the collections support
> (not the only one!). This should go on our to-do list for 3.2, until
> we can get back to actively developing it.

Yes, as Greg discovered, the collections code just loops over the
databases--it makes no attempt to check that URLs aren't duplicates.
While it will take some work for this, culling the duplicates obviously
speeds up the results (if you do it at the right point, you won't need
to score, etc.).

-Geoff
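[Editor's note] Culling duplicates across collections could be as simple as keeping, per URL, only the best-scoring hit before the results are sorted and displayed. A sketch of that idea (not the actual collections code):

```python
def cull_duplicates(results):
    """Keep one result per URL across multiple collection databases.

    `results` is an iterable of (url, score) pairs as merged from all
    databases; the highest-scoring hit for each URL wins. As noted above,
    doing this before final scoring/sorting avoids wasted work on hits
    that would be discarded anyway.
    """
    best = {}
    for url, score in results:
        if url not in best or score > best[url]:
            best[url] = score
    # Return a display-ready list, best matches first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

For example, a page indexed in three of the pipe-separated databases would show up three times in the raw merged list but only once after culling.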
From: Gilles D. <gr...@sc...> - 2002-01-15 21:35:47
|
According to Greg Lepore:
> I've been experimenting with searching multiple databases from one search
> form via the pipe in 3.2 (11/23 build):
>   <input type="checkbox" name="config"
>    value="aom1|aom2|aom2a|aom3|aom4|aom5|esr">
> (htmerge is not working at the moment). The searches appear to be working
> correctly. One unusual result that I have noticed is that pages which are
> included in multiple databases will appear multiple times in the output. i.e. ...
> Would it be possible to remove duplicate pages from the search results
> before they are output to the html page? This is obviously something that
> htmerge would do if the databases were to be combined into one.

Good point. This would appear to be a bug in the collections support
(not the only one!). This should go on our to-do list for 3.2, until
we can get back to actively developing it.

--
Gilles R. Detillieux            E-mail: <gr...@sc...>
Spinal Cord Research Centre     WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax:   (204)789-3930
From: Gilles D. <gr...@sc...> - 2002-01-15 19:42:58
|
According to Geoff Hutchison:
> I've been in the middle of a few projects and haven't turned towards
> ht://Dig much. But I took a look at the checklist I had for 3.1.6 and I
> wasn't sure what needed to be done besides:
>
> * Release notes
> * Maindocs merges
> * Regex/RX issues
>
> And of course the usual sort of pre-release testing and beating... Is this
> all that's left?

Here's my own checklist:

1. better configure test for regex problems on BSDi
2. way to override "no server" problem
3. handle noindex_start & noindex_end as string lists in HTML parser
4. a "match all documents" mechanism in htsearch
5. a way of specifying relative date ranges in htsearch
6. release notes
7. merge maindocs updates into htdoc and vice versa

I was going to tackle 5 today or maybe tomorrow. I'll see if I can come up
with something for 3 after that. I'd really appreciate it if you did 1, 4,
6 & 7. I'm not sure about 2, as there isn't an easy fix to make it work
well, so the best that could be done quickly may be just an attribute to
make it revert to the pre-3.1.5 behaviour.

Some of the testing that's needed most is trying out the last few
attributes I've added, as all I've done so far is make sure the code
compiles and runs with my current set of attributes on my site. So, I
haven't really tested description_meta_tag_names and translate_latin1 at
all, nor the new code for use_doc_date. I was also hoping to hear back
from Alexander Lebedev about completing the audit of questionable
3-letter word expansions in english.0 (from L to Z).

--
Gilles R. Detillieux            E-mail: <gr...@sc...>
Spinal Cord Research Centre     WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax:   (204)789-3930
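[Editor's note] For item 5, a relative date range ("everything from the last N days") reduces to computing a cutoff timestamp and comparing document dates against it. A sketch of the idea -- the eventual htsearch input-parameter names were not decided at this point, so these function names are hypothetical:

```python
from datetime import datetime, timedelta

def cutoff_for_relative_range(days_back, now=None):
    """Return the earliest acceptable document date for "last N days"."""
    now = now or datetime.now()
    return now - timedelta(days=days_back)

def within_range(doc_date, days_back, now=None):
    """True if doc_date falls inside the relative date range."""
    return doc_date >= cutoff_for_relative_range(days_back, now)
```

The appeal of relative ranges over absolute start/end dates is that a "what's new this week" search form never needs its hidden date fields updated.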